Philosophy

Where Things Went Wrong

The existing approaches to programming and hardware design have been very productive for a number of decades. However, both made a similar mistake: incorporating an optimization into the design description. For programming this was the use of shared memory (see below), and for hardware design it was incorporating the clock (for RTL synthesis). The common solution is a move to asynchronous, message-passing descriptions of design intent that compiler tools can optimize.

RIP SMP

A Von Neumann programming style uses shared memory explicitly, which usually limits its application to single-CPU or SMP hardware. If the SMP hardware has a large number of processors, there is a high complexity and power cost, particularly if memory coherency is required. Future architectures will bind processing (CPUs) more closely to individual regions of memory and avoid the need for coherency and central management; processors may be of different types depending on the machine architecture. Current PC and games machine architectures have been trending in this direction to improve graphics performance.

Applications that span platforms (e.g. mobile to cloud) cannot use an SMP programming paradigm, and bleeding-edge processors are moving fairly rapidly to non-cache-coherent NUMA (non-uniform memory access) architectures. Attempts to keep the paradigm going with approaches like STM (software transactional memory) add more complexity that is hard to analyze and optimize.

The shift away from Von Neumann programming requires a new programming language that supports paradigms that work well on new hardware architectures and in distributed computing environments.
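As a rough illustration of the message-passing style, here is a minimal sketch in Python, used only as a stand-in since the text does not specify the new language. The names `counter_process` and `run`, and the use of `None` as an end-of-stream marker, are assumptions made for this example. The point is that the running total is owned by one process and is only reachable through messages, so no coherent shared memory is needed.

```python
import queue
import threading


def counter_process(inbox: queue.Queue, outbox: queue.Queue) -> None:
    """A process that owns its state; nothing else can read or write it."""
    total = 0  # private state, local to this process
    while True:
        msg = inbox.get()
        if msg is None:        # hypothetical end-of-stream marker
            outbox.put(total)  # the result leaves only as a message
            return
        total += msg


def run(values):
    """Send each value to the counter process and wait for its reply."""
    inbox, outbox = queue.Queue(), queue.Queue()
    worker = threading.Thread(target=counter_process, args=(inbox, outbox))
    worker.start()
    for v in values:
        inbox.put(v)
    inbox.put(None)
    result = outbox.get()
    worker.join()
    return result
```

Because the two sides interact only through `inbox` and `outbox`, the same description could in principle be mapped onto separate cores, separate machines, or a single thread without changing its meaning.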

Sequential Coding is so "Last Century"

Large tracts of sequential C or C++ are very hard to analyze and refactor for new parallel hardware. The best description to start with is the one with the most threads/processes, so that compilers can optimize inter-process communication and collapse threads back into sequential code when it makes sense. A mostly static description of the code/process structure makes the job much easier.
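The idea of starting from the most parallel description and collapsing it can be sketched as follows. This is an illustrative Python example, not the tooling the text has in mind: `stage`, `run_parallel`, `run_collapsed`, `STOP`, and `PIPELINE` are all hypothetical names, and the "collapse" is written by hand here, where the text envisions a compiler doing it.

```python
import queue
import threading

STOP = object()  # hypothetical end-of-stream marker


def stage(fn, inbox, outbox):
    """One process: apply fn to each message and forward the result."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)
            return
        outbox.put(fn(item))


def run_parallel(fns, values):
    """Run every function as its own thread, connected by queues."""
    chans = [queue.Queue() for _ in range(len(fns) + 1)]
    threads = [
        threading.Thread(target=stage, args=(fn, chans[i], chans[i + 1]))
        for i, fn in enumerate(fns)
    ]
    for t in threads:
        t.start()
    for v in values:
        chans[0].put(v)
    chans[0].put(STOP)
    out = []
    while True:
        item = chans[-1].get()
        if item is STOP:
            break
        out.append(item)
    for t in threads:
        t.join()
    return out


def run_collapsed(fns, values):
    """The same pipeline fused into one sequential loop, with no queues."""
    out = []
    for v in values:
        for fn in fns:
            v = fn(v)
        out.append(v)
    return out


# A fixed, statically known chain of three stages.
PIPELINE = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
```

Both functions produce the same output for the same input, which is what makes the collapse a legal optimization. Going the other way, from the fused loop back to three independent stages, is the hard analysis problem the paragraph above describes.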

APIs Don't Help Much

APIs (e.g. OpenCL) create an artificial boundary between computing resources and virtually guarantee a sub-optimal implementation of your algorithm on the hardware available.

Similarly, Open MPI is useful as an implementation layer, but it is not easily analyzable (i.e. the communication structure is not obvious from static analysis of the code).
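To show what an analyzable communication structure might look like, here is a small sketch in Python. It assumes, purely for illustration, that the channels between processes are declared up front as data; the process names in `CHANNELS` and the function `fusable_pairs` are hypothetical. With the topology known statically, a tool can find neighbouring processes that are candidates for merging into sequential code.

```python
from collections import defaultdict

# Hypothetical static description: one (sender, receiver) pair per channel.
CHANNELS = [
    ("read", "parse"),
    ("parse", "filter"),
    ("filter", "write"),
    ("filter", "log"),
]


def fusable_pairs(channels):
    """Return channels whose sender has exactly one output and whose
    receiver has exactly one input; such neighbours could be merged."""
    outs = defaultdict(list)
    ins = defaultdict(list)
    for sender, receiver in channels:
        outs[sender].append(receiver)
        ins[receiver].append(sender)
    return [
        (s, r)
        for s, r in channels
        if len(outs[s]) == 1 and len(ins[r]) == 1
    ]
```

For the graph above, `fusable_pairs(CHANNELS)` returns the `read`/`parse` and `parse`/`filter` channels, since `filter` fans out to two receivers. When destinations are instead computed at run time, as is common in message-passing library code, this kind of check is not available from the source alone.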