Philosophy
Where Things Went Wrong
The existing approaches to programming and hardware design have been
very productive for decades, but both made a similar mistake:
building an optimization into the design description. For
programming this was the use of shared memory (see below), and for
hardware design it was the clock (for RTL synthesis). The common
solution is a move to asynchronous, message-passing descriptions of
design intent that compiler tools can optimize.
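As a rough illustration of what a message-passing description looks like, the sketch below has two processes that share no variables and communicate only through queues. It uses Python's standard `threading` and `queue` modules purely as a stand-in; the names `producer`, `summer` and `run` are hypothetical, and this is not the notation proposed here.

```python
import queue
import threading


def producer(out_q, n):
    """Send the integers 0..n-1, then a sentinel marking end of stream."""
    for i in range(n):
        out_q.put(i)
    out_q.put(None)


def summer(in_q, result_q):
    """Receive values until the sentinel arrives, then report the total."""
    total = 0
    while True:
        item = in_q.get()
        if item is None:
            break
        total += item
    result_q.put(total)


def run(n):
    """Wire the two processes together; the queues are the only coupling."""
    values = queue.Queue()
    result = queue.Queue()
    workers = [
        threading.Thread(target=producer, args=(values, n)),
        threading.Thread(target=summer, args=(values, result)),
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return result.get()
```

Because neither process touches the other's state, the same description could in principle be mapped onto one core, several cores, or separate machines without changing the code of either process.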
RIP SMP
A Von Neumann programming style uses shared memory explicitly, which
usually limits its application to single-CPU or SMP hardware. If the
SMP hardware has a large number of processors, there is a high
complexity and power cost, particularly if memory coherency is
required. Future architectures will bind processing (CPUs) more
closely to individual regions of memory and avoid the need for
coherency and central management; processors may be of different
types depending on the machine architecture. Current PC and games
machine architectures have been trending in this direction to improve
graphics performance.
Applications that span platforms (e.g. mobile to cloud) cannot use an
SMP programming paradigm, and bleeding-edge processors are moving
fairly rapidly to non-cache-coherent NUMA (non-uniform memory access)
architectures. Attempts to keep the paradigm going with approaches
like STM (software transactional memory) add complexity that is hard
to analyze and optimize.
The shift away from Von Neumann programming requires a new
programming language that supports paradigms that work well on new
hardware architectures and in distributed computing environments.
Sequential Coding is so "Last Century"
Large tracts of sequential C or C++ are very hard to analyze and
refactor for new parallel hardware. The best description to start with
is the one with the most threads/processes, so that compilers can
optimize inter-process communication and collapse threads back into
sequential code when it makes sense. A mostly static description of the
code/process structure makes the job much easier.
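The idea of collapsing threads back into sequential code can be sketched as follows. The first function describes a pipeline with one process per stage; the second is the fused, sequential form that a compiler could derive from it when the stage structure is known statically. This is a hand-written illustration in Python, and the names `stage`, `run_pipeline` and `run_collapsed` are hypothetical; no tool is doing the transformation here.

```python
import queue
import threading

STOP = object()  # sentinel marking end of stream


def stage(fn, in_q, out_q):
    """One process: apply fn to every message and forward the result."""
    while True:
        item = in_q.get()
        if item is STOP:
            out_q.put(STOP)
            return
        out_q.put(fn(item))


def run_pipeline(fns, inputs):
    """Parallel form: one thread per stage, with a queue between stages."""
    queues = [queue.Queue() for _ in range(len(fns) + 1)]
    threads = [
        threading.Thread(target=stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(fns)
    ]
    for t in threads:
        t.start()
    for x in inputs:
        queues[0].put(x)
    queues[0].put(STOP)
    results = []
    while True:
        item = queues[-1].get()
        if item is STOP:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results


def run_collapsed(fns, inputs):
    """Sequential form: the same stages fused into a single loop."""
    results = []
    for x in inputs:
        for fn in fns:
            x = fn(x)
        results.append(x)
    return results


def double(x):
    return 2 * x


def increment(x):
    return x + 1
```

Both forms produce the same results for the same stage list, which is what makes the collapse safe. Going the other way, from `run_collapsed` to `run_pipeline`, is the hard direction when the starting point is a large body of sequential code.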
APIs Don't Help Much
APIs (e.g. OpenCL) create an artificial boundary between computing
resources and virtually guarantee a sub-optimal implementation of
your algorithm on the hardware available.
Similarly, OpenMPI is useful as an implementation layer, but is not
easily analyzable (i.e. the communication structure is not obvious
from static analysis of the code).