22C:122/55:132 Notes, Lecture 38, Spring 2001

Douglas W. Jones
University of Iowa Department of Computer Science

  1. The Challenge of Scale

    Cache memory plus bus arbitration logic is good for constructing small-scale multiprocessors, but the parallelism that is possible is extremely limited! We can quantify this limitation! The bus saturates when the rate at which the CPUs demand main memory cycles equals the rate at which the bus can deliver them:

        N (1 - h) / Tcpu  =  1 / Tmain

    Here, N is the number of CPUs, h is the cache hit ratio, Tcpu is the CPU cycle time, and Tmain is the main memory cycle time. If the equation holds, the bus will be fully saturated. If the left side is smaller, adding more CPUs may be possible. If the right side is smaller, adding CPUs will not improve the performance, but increasing the hit ratio, for example by using a larger cache, will improve the performance.

    Consider a system with a 70 nanosecond main memory cycle time, and a deeply pipelined CPU that runs at 100 MHz (a 10 nanosecond cycle time). If our L1 cache has a hit rate of 90%, the CPU will make a main memory reference every 100 nanoseconds, so this system will perform well with one CPU. With 2 CPUs, however, they demand a memory cycle every 50 nanoseconds, so they cannot operate at full speed with 70 nanosecond main memory. We might achieve reasonable performance, however, by using an L2 cache with a hit rate of 50% to bring the apparent main memory cycle time down to something like 35 nanoseconds.
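
    As a quick check on these figures, here is a minimal sketch in C that evaluates the demand side of the equation above and the apparent memory cycle time behind an L2 cache. The numbers are the example's assumptions (and the zero-cost L2 hit is assumed only to reproduce the 35 nanosecond estimate); they are not measurements.

        /* Bus-saturation arithmetic for the example above; all figures
           are the example's assumptions, not measurements. */
        #include <stdio.h>

        int main(void) {
            double t_cpu  = 10.0;   /* CPU cycle time, ns          */
            double t_main = 70.0;   /* main memory cycle time, ns  */
            double l1_hit = 0.90;   /* L1 hit ratio                */
            double l2_hit = 0.50;   /* assumed L2 hit ratio        */
            double t_l2   = 0.0;    /* assume L2 hits cost nothing */

            for (int n = 1; n <= 4; n++) {
                /* each CPU misses in L1 once every t_cpu/(1-l1_hit) ns */
                double interval = (t_cpu / (1.0 - l1_hit)) / n;
                printf("%d CPU(s): one main memory request every %3.0f ns %s\n",
                       n, interval,
                       interval >= t_main ? "(memory keeps up)" : "(memory saturated)");
            }

            /* apparent memory cycle time as seen through the L2 cache */
            double t_eff = l2_hit * t_l2 + (1.0 - l2_hit) * t_main;
            printf("apparent cycle time behind a 50%% L2: about %.0f ns\n", t_eff);
            return 0;
        }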

    But! There is no way to extend this scheme arbitrarily! There comes a point where the cache is large enough to hold all of the instructions and variables that the processors use, so that the only traffic on the main bus is traffic involved in interprocess communications. Once we enlarge the system to the point that this traffic saturates the bus, we reach a dead end!

    Up to the point of failure, bus-based systems can be expanded to accommodate N CPUs at a total cost of O(N), that is, a constant cost per CPU. Crossbar switches can be expanded without limit at a cost of O(N²), but the quadratic price becomes prohibitive at about the same point that the bus-based systems fail. Both have commonly reached N=16; above this, bus-based systems fail to perform well and crossbar switches grow too expensive.

    This problem was extremely clear in the early 1970's, as the 16 by 16 crossbar-based C.mmp processor was being completed. The next system built at Carnegie Mellon University, Cm*, was quite different. This system was a NUMA machine, one with a non-uniform memory access time. The top-level structure of the machine was arranged as follows:

    	                           _____________
    	Cluster Bus               |   cluster   | | |
    	------o----------------o--|  controller | | |
    	 ___  |           ___  |  |_____________|-o |
    	|CPU| |  up to   |CPU| |                  | |
    	|___| |    10    |___| |                  | |
    	 _|_  | computer  _|_  |                  | |
    	|MMU|-   modules |MMU|-           inter   | |
    	|___|            |___|           cluster  | |
    	 _|_              _|_             busses  | |
    	|RAM|     ...    |RAM|                    | |
    	|___|            |___|                    | |
                                                      | |
    	                           _____________  | |
    	                          |   cluster   |-o |
    	--------------------------|  controller | | |
    	                          |_____________|-|-o
    	                           _____________  | |
    	                          |   cluster   |-|-o
    	--------------------------|  controller | | |
    	                          |_____________| | |
    	
    In this system, each computer module had a CPU and memory, connected by a memory management unit. Each segment of the address space could be mapped to local memory, in which case access was fast, or to remote memory, in which case access was at least 3 times slower. A nonlocal reference could be to other memory within the same cluster, or to memory in a computer module of some other cluster. In the latter case, the overhead of intercluster communication slowed the memory access to about 1/9 of the local memory speed.
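
    To see why placement matters so much on such a machine, the following sketch computes the average access time for a hypothetical reference mix, using the 1x / 3x / 9x cost ratios described above. The mix itself is invented for illustration; it is not a measured Cm* workload.

        /* Average access time on a Cm*-like NUMA machine, using the
           local : intra-cluster : inter-cluster ratios from the text.
           The reference mix is an illustrative assumption. */
        #include <stdio.h>

        int main(void) {
            double t_local   = 1.0;             /* local access, arbitrary units */
            double t_cluster = 3.0 * t_local;   /* same cluster, other module    */
            double t_remote  = 9.0 * t_local;   /* some other cluster            */

            double f_local = 0.95, f_cluster = 0.04, f_remote = 0.01;

            double t_avg = f_local   * t_local
                         + f_cluster * t_cluster
                         + f_remote  * t_remote;

            printf("average access time: %.2f times local\n", t_avg);  /* 1.16 */
            return 0;
        }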

    Each cluster controller could access multiple intercluster busses; as built, the system had 5 clusters of 10 computer modules, with 3 intercluster busses. In theory, the system could be expanded without limit, but the operating system problems that arise on such a machine pose severe difficulties, and this design never went farther.

  2. Butterfly or Banyan Tree Interconnection

    Bolt Beranek and Newman Corporation developed an alternative model for a multiprocessor based on commodity microprocessor chips, known as the BBN Butterfly. BBN (now BBN Technologies, a division of Verizon Corporation) had pioneered the development of priority interrupt structures, timesharing systems, modems and computer networks in the 1960's. As with Cm*, this was a NUMA machine based on computer modules and an interconnect system, but instead of a hierarchy of clusters with an almost random interconnect structure between the clusters, it had a very systematic approach to interconnection, the butterfly switching network. A Butterfly with 128 processors was demonstrated in 1984, and unlike Cm*, the Butterfly was sold commercially.

    The basic butterfly network topology had been developed simultaneously by several different groups, starting from several different sets of initial assumptions about the nature of the switches that would be used to interconnect systems. One starting point was the cube connected network:

             ___
            |___|--o--------o----o-
             ___   |        |    | 
            |___|--|-o------|-o--o-
             ___   | |      | |    
            |___|--|-|-o----o-|--o-
             ___   | | |      |  | 
            |___|--|-|-|-o----o--o-
             ___   | | | |         
            |___|--o-|-|-|--o----o-
             ___     | | |  |    | 
            |___|----o-|-|--|-o--o-
             ___       | |  | |    
            |___|------o-|--o-|--o-
             ___         |    |  | 
            |___|--------o----o--o-
    	
    The diagram shows a 3-cube, but this model can be extended to any number of dimensions. If each box is a computer system and each vertical line is a network interconnection, we have a cube connected multicomputer, an idea that dates back to the 1960's. The CalTech Cosmic Cube, from 1983, consisted of 64 microcomputers interconnected by a 6-dimensional hypercube; this spawned the iPSC family of hypercube systems (the Intel Personal Supercomputer family).
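
    In an n-dimensional cube, each of the 2^n machines carries an n-bit number, and its immediate neighbors are the machines whose numbers differ from its own in exactly one bit. The sketch below enumerates those neighbors; the numbering convention is the usual one, not anything specific to the Cosmic Cube.

        /* Neighbors of a node in an n-dimensional cube-connected network:
           node i is linked to node i XOR (1 << k) for each dimension k. */
        #include <stdio.h>

        void print_neighbors(unsigned node, unsigned n) {
            for (unsigned k = 0; k < n; k++)
                printf("node %u <-> node %u  (dimension %u)\n",
                       node, node ^ (1u << k), k);
        }

        int main(void) {
            print_neighbors(5, 3);   /* node 101 in a 3-cube like the one above */
            return 0;
        }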

    If each interconnection in the above figure represents a crossbar switch on the path from CPU to memory, we get the switching network that has been described as a banyan or butterfly network.

    Banyan trees are tropical trees similar to mangroves; like the mangrove, they have multiple roots and trunks supporting a spreading crown of branches that can extend over a very large area. The network is called a butterfly because one of the ways of drawing it has a vague resemblance to a butterfly.

    Such an interconnection scheme allows 2^n processors to address any of 2^n memory modules through n levels of switching; the routing rule is sketched after the figure. The path from one CPU to one memory module in an 8-processor butterfly network is shown here:
                             --           --
             ___               |            |
            |   |    |  |      |            |  |  |
            |CPU|----x--x--    |  |  |       --x--x--
            |___|    |  |       --x--x--       |  |
                   --x--x--  |    |  |       --x--x--
                  |  |  |    |  --x--x--    |  |  |     ___
                  |  |   ----  |  |  |      |  |       |   |
                 -   |         |  |   ------    -------|RAM|
                      ---------   |                    |___|
                                   -----
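
    The routing rule that makes the path selection work is simple: at each stage along the way, one bit of the destination memory module number selects one of the two output ports of the switch. The sketch below prints the port choices for a path; the port numbering and the order in which destination bits are consumed are assumptions, since they depend on how the network is drawn.

        /* Destination-tag routing through an n-stage butterfly:
           at stage k, one bit of the destination selects the output port.
           Bit order and port numbering here are illustrative. */
        #include <stdio.h>

        void route(unsigned cpu, unsigned mem, unsigned stages) {
            printf("CPU %u -> memory %u:", cpu, mem);
            for (unsigned k = 0; k < stages; k++)
                printf("  stage %u, port %u", k, (mem >> k) & 1u);
            printf("\n");
        }

        int main(void) {
            route(0, 5, 3);   /* one path through the 8-way network above */
            return 0;
        }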
    	
    If all memory accesses had been delayed by passage through such a switching network, the BBN Butterfly would have performed unacceptably, but the machine was built with two paths to memory: a direct path from each processor to its own part of the global memory, and a longer path, through the butterfly switch, allowing each processor access to all of memory. Each computer module of the BBN Butterfly therefore had the following structure:
                     | 
             ___   --x----< from butterfly switch
            |   |    |  
            |CPU|----x----> to butterfly switch
            |___|    | 
                    _|_
                   |   |
                   |RAM|
                   |___|
    	
    In writing programs for a NUMA computer such as the Butterfly or the Cm* machine, the code, private variables and stack of each process should be located in the local memory of the computer module that runs that process, so that the only traffic through the switching network is traffic to shared variables.

    Static placement of code and data in a large scale multiprocessor is difficult! Many algorithms dynamically create new processes, so there has been considerable development in two directions: one centered on operating systems that can dynamically move data to the local memory of the processor where it is needed, and the other centered on replacing or supplementing local memory with cache.

  3. Crossbar Chips

    If each bus has an A-bit address and D bits of data, plus a request and an acknowledge control line, a read/write line, and a strobe, we require A+D+4 lines per connection to a crossbar switch module, as a bare minimum! An NxN crossbar would therefore require 2N(A+D+4) input/output connections.

    Consider the common case where A=D=32 and N=2. In this case, a VLSI crossbar switch chip would require a minimum of 4x68 = 272 pins, ignoring the pins required for power and ground! Chips, and even circuit boards, with this number of pins are extremely awkward! Integrated circuit realizations of a single crossbar intersection (N=1) need only two bus connections plus at least two control lines for the arbitration logic. For the example, this would be 2x68+2 = 138 pins, a number that is far easier to deal with.
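
    The pin arithmetic is easy to mechanize; the sketch below just evaluates the 2N(A+D+4) formula from above, plus the two arbitration lines for the single-intersection case.

        /* Pin counts for crossbar switch packaging, ignoring power
           and ground pins as in the text. */
        #include <stdio.h>

        int crossbar_pins(int n, int a, int d) {
            return 2 * n * (a + d + 4);   /* N bus connections on each side */
        }

        int main(void) {
            int a = 32, d = 32;
            printf("2x2 crossbar:        %d pins\n", crossbar_pins(2, a, d));
            printf("single intersection: %d pins\n", crossbar_pins(1, a, d) + 2);
            return 0;
        }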

  4. Active Crossbar Logic

    A group at the Courant Institute at New York University observed that the most serious performance problems with the Butterfly architecture were caused by what they referred to as hot spots in the memory address space. These were shared variables that were heavily used by large numbers of processors. For example, consider a multiprocessor where each of the CPUs occasionally checks the value of some shared variable to see if it is time to terminate the computation. From the point of view of any one CPU, this variable is only accessed occasionally, but from the point of view of the memory module holding the variable, the picture is entirely different:

    	                       ___ 
    	              |     | |   | |     |
    	              x--x--x | M | x--x--x
    	              |  |  | |___| |  |  |
    	                 |      |      |
    	                 x------x------x
    	                 |             |
    	              |  |  |       |  |  |
    	              x--x--x       x--x--x
    	              |     |       |     |
    	
    The memory sees a binary tree of switching paths leading away toward the myriad processors, each of which makes occasional requests. As those requests converge on the memory, the occasional references become a torrent, and collisions in the switching nodes become very common.

    This led to the notion of the "combining switch" element and to the NYU Ultracomputer Project. A combining switch behaves like a small crossbar switch when it sees requests for different memory locations, but it combines requests when they are for the same location. If the only memory operations allowed are read and write, we have 4 cases:

        read + read      combine into a single read; when the value comes
                         back, it is delivered to both requesters.
        read + write     the write is forwarded, and the read is answered
                         immediately with the value being written, as if
                         it had been serialized just after the write.
        write + read     symmetric with the previous case.
        write + write    one write is forwarded and both are acknowledged,
                         as if the discarded write had been immediately
                         overwritten by the one that survives.

    The NYU team did not stop with this! They noted that many memory cycles are read-modify-write cycles, for example, I=I+1, and they realized that if the memory supports active semantics, that is, if the memory address bus carries not a one-bit opcode indicating whether the operation is a read or a write, but rather a small opcode requesting an ALU operation to be performed by the memory unit, it would be possible to combine operations in a far more interesting way!

    To support this, the NYU Ultracomputer used the following model of memory:

    	Bus from memory client        ------
                                       __|__    |
                                      |     |   |
            Address        -->--------| RAM |   |
                                      |_____|   |
                                         |      |
            Data to memory -->---------  |      |
                                      _|_|_     |
                                     |     |    |
            Operation      -->-------| ALU |    |
                                     |_____|    |
                                        |       |
            Data to client --<----------o------- 
    	
    With this, a CPU can add to memory, subtract from memory, read or write in a single memory cycle, depending on the ALU operation selected, and combining switches that include a local ALU of their own can combine these operations, as sketched below.
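
    The canonical example is fetch-and-add: the memory module applies the addition and returns the old value, and a combining switch that sees two fetch-and-adds aimed at the same location forwards a single fetch-and-add carrying the sum of the two increments, then uses the one returned value to answer both requesters. The sketch below models these semantics in software; it is an illustration of the idea, not of the NYU or IBM hardware.

        /* Fetch-and-add semantics, and how a combining switch merges two
           fetch-and-add requests for the same address into one memory
           cycle.  A software model of the idea, not the real hardware. */
        #include <stdio.h>

        static int memory[16];     /* a tiny "active" memory module */

        /* what the memory module's ALU does in one cycle */
        int fetch_and_add(int addr, int increment) {
            int old = memory[addr];
            memory[addr] = old + increment;
            return old;
        }

        /* what a combining switch does with two requests for one address:
           forward a single combined request, then split the reply */
        void combine(int addr, int inc1, int inc2, int *reply1, int *reply2) {
            int old = fetch_and_add(addr, inc1 + inc2);  /* one memory cycle */
            *reply1 = old;          /* as if request 1 was serialized first  */
            *reply2 = old + inc1;   /* request 2 sees request 1's effect     */
        }

        int main(void) {
            memory[3] = 100;
            int r1, r2;
            combine(3, 1, 5, &r1, &r2);
            printf("replies %d and %d, memory now %d\n", r1, r2, memory[3]);
            /* prints: replies 100 and 101, memory now 106 */
            return 0;
        }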

    A large part of the work of the Ultracomputer group at NYU was devoted to developing parallel algorithms that took maximal advantage of the active memory semantics, but, with support from IBM, they also built VLSI combining switches and developed a close working relationship with the group at IBM that built a series of multiprocessors based on Butterfly-like topologies.

    I am uncertain of the details, but it appears that the group at IBM that inherited the Sequent Symmetry product line also inherited large parts of the work that had been done in conjunction with NYU. I am uncertain of the extent to which ideas from the Ultracomputer project are incorporated into any current experimental or production parallel machines from IBM.

  5. Tricks with Snooping Caches

    Encore's snooping cache model can also be exploited to build large scale parallel machines by using a multilevel cache scheme. Encore announced a large scale multiprocessor based on this idea when the company was founded, but I am unaware of the outcome of that venture. It appears that the final Cray multiprocessor, based on DEC Alpha chips, may have been intended to be such a machine. The idea is as follows:

    	                 |           |
    	                 o-----o-----x-
    	                 |   __|__   |
    	                    |cache|--o   
    	                    |_____|  |   even address
    	                 |           |   memory cluster
    	                 o-----o-----x-
    	 _____           |   __|__   |
    	| CPU |----o-----x- |cache|--o   _____
    	|_____|  __|__   |  |_____|  |  |     |
    	        |cache|--o           o--| RAM |
    	        |_____|--|--o--------o  |_____|
    	 _____           |  |        |  |     |
    	| CPU |----o-----x- |        o--| RAM |
    	|_____|  __|__   |  |        |  |_____|
    	        |cache|--o  | 
    	        |_____|--|--         |
    	                 o-----o-----x-
    	Bus based cluster|   __|__   |   Odd addresses
    	                    |cache|--o
    	                    |_____|  |
    	
    As shown, the system has 2 cache levels, and at each level, each bus carries 2 cached clients and 2-way interleaved connections toward memory. Thus, the example shown allows 4 memory modules in two groups of two. One group of 2 modules (shown in full above) handles only even memory addresses, while the other (abbreviated) handles only odd addresses. With two cached clients on each bus, the system allows 4 CPUs.

    If we assume a 50% hit ratio in each cache, two clients together generate exactly as many bus cycles per second as a single client consumes, so the traffic on each bus is the same at every level; the system balances, and in theory it will scale to any depth! The difficulty is that the top-level caches must be able to snoop on several busses simultaneously, but this is quite feasible, although it increases the cost modestly.
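
    The balance condition is just arithmetic: if, as in the argument above, each bus can carry about one transaction per client cycle, then a bus keeps up as long as the number of clients times their miss ratio does not exceed one. The cases in the sketch below are chosen for illustration.

        /* Bandwidth balance in a multilevel snooping-cache hierarchy:
           a bus keeps up while  clients * (1 - hit_ratio) <= 1, assuming
           the bus carries one transaction per client cycle. */
        #include <stdio.h>

        int main(void) {
            struct { int clients; double hit; } cases[] = {
                { 2, 0.50 },   /* the configuration drawn above        */
                { 4, 0.50 },   /* too many clients for a 50% hit ratio */
                { 8, 0.90 },   /* a higher hit ratio allows more       */
            };

            for (int i = 0; i < 3; i++) {
                double load = cases[i].clients * (1.0 - cases[i].hit);
                printf("%d clients at %2.0f%% hit: bus load %.2f %s\n",
                       cases[i].clients, 100.0 * cases[i].hit, load,
                       load <= 1.0 ? "(balances)" : "(saturates)");
            }
            return 0;
        }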

    If we assume that we can put clients on each cluster bus, and we assume a 90% hit rate in the client caches with 4-way interleaved memory connections, then building this system with depth 2 would allow us to have 64 clients.

    If we build this system with a fan-in and fan-out of two at each level, so there are two cached clients connected to two memory connections on each bus, we end up with a topology identical to the butterfly or banyan interconnect structure!