Hardware Software Tradeoffs

There are many ways to reduce hardware cost. One is to integrate the communication assist and network less tightly into the processing node, at the cost of higher communication latency and occupancy.

Another method is to provide automatic replication and coherence in software rather than hardware. This approach provides replication and coherence in the main memory and can operate at a variety of granularities. It allows the use of off-the-shelf commodity parts for the nodes and interconnect, minimizing hardware cost, but it puts pressure on the programmer to achieve good performance.

Relaxed Memory Consistency Models

The memory consistency model for a shared address space defines the constraints on the order in which memory operations, to the same or different locations, appear to execute with respect to one another. Any system layer that supports a shared address space naming model must have a memory consistency model; this includes the programmer’s interface, the user-system interface, and the hardware-software interface. Software that interacts with a layer must be aware of that layer's memory consistency model.

System Specifications

The system specification of an architecture states which orderings of memory operations are preserved and which reorderings are allowed, and therefore how much performance can actually be gained from reordering.

The following are a few specification models that relax program order −

Relaxing the Write-to-Read Program Order − This class of models allows the hardware to hide the latency of write operations that miss in the first-level cache. While the write miss is in the write buffer and not yet visible to other processors, the processor can complete reads that hit in its cache, or even a single read that misses in its cache.

Relaxing the Write-to-Read and Write-to-Write Program Orders − Allowing writes to bypass previous outstanding writes to different locations lets multiple writes be merged in the write buffer before the main memory is updated. Multiple write misses can thus be overlapped and can become visible out of order. The motivation is to further reduce the impact of write latency on processor stall time, and to improve communication efficiency among the processors by making new data values visible to other processors sooner.

Relaxing All Program Orders − No program orders are assured by default except data and control dependences within a process. The benefit is that multiple read requests can be outstanding at the same time, can be bypassed by writes that come later in program order, and can themselves complete out of order, which allows read latency to be hidden. Models of this type are particularly useful for dynamically scheduled processors, which can continue past read misses to other memory references. They also allow many of the reorderings, and even the elimination of accesses, that compiler optimizations perform.
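The effect of relaxing the write-to-read order can be illustrated with a small enumeration. The sketch below is a toy model and not a description of any particular machine: two hypothetical processors each write one location and then read the other's, and the names `PROGRAM`, `run`, and `outcomes`, as well as the simple per-processor write buffer, are assumptions made for illustration. It enumerates every global order of events that respects the allowed program orders and collects the values the two reads can return.

```python
from itertools import permutations

# Program under test (all locations start at 0):
#   P0: x = 1; r0 = y        P1: y = 1; r1 = x
PROGRAM = {0: ("x", "y"), 1: ("y", "x")}  # proc -> (address written, address read)

def run(order, buffered):
    """Execute one global order of events and return (r0, r1)."""
    memory = {"x": 0, "y": 0}
    buffers = {0: {}, 1: {}}          # per-processor write buffer: addr -> value
    regs = {}
    for proc, kind, addr in order:
        if kind == "write" and buffered:
            buffers[proc][addr] = 1   # held in the buffer, invisible to others
        elif kind == "write":
            memory[addr] = 1          # no buffer: write is visible at once
        elif kind == "drain":
            memory[addr] = buffers[proc].pop(addr)
        else:  # "read": own buffered value if present, else memory
            regs[proc] = buffers[proc].get(addr, memory[addr])
    return regs[0], regs[1]

def outcomes(buffered):
    """Collect every (r0, r1) reachable under the chosen ordering rules."""
    events, constraints = [], []
    for proc, (w, r) in PROGRAM.items():
        write, read = (proc, "write", w), (proc, "read", r)
        events += [write, read]
        constraints.append((write, read))        # write is issued before the read
        if buffered:
            drain = (proc, "drain", w)
            events.append(drain)
            constraints.append((write, drain))   # a write drains after it is issued
    results = set()
    for order in permutations(events):
        pos = {e: i for i, e in enumerate(order)}
        if all(pos[a] < pos[b] for a, b in constraints):
            results.add(run(order, buffered))
    return results

print(sorted(outcomes(buffered=False)))  # program order kept for every operation
print(sorted(outcomes(buffered=True)))   # reads may pass buffered writes
```

Without the buffer, the outcome in which both reads return 0 never appears. With the buffer it does, because each read can complete while the other processor's write is still waiting to drain.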

The Programming Interface

The programming interfaces assume that program orders do not have to be maintained at all between synchronization operations. They require that all synchronization operations be explicitly labeled or identified as such. The runtime library or the compiler translates these synchronization operations into the suitable order-preserving operations called for by the system specification.

The system then assures sequentially consistent executions even though it may reorder operations between synchronization operations in any way it desires, as long as it does not disrupt dependences to a location within a process. This gives the compiler sufficient flexibility between synchronization points for the reorderings it desires, and also allows the processor to perform as many reorderings as its memory model permits. At the programmer’s interface, the consistency model should be at least as weak as that of the hardware interface, but need not be the same.

Translation Mechanisms

In most microprocessors, translating labels to order-maintaining mechanisms amounts to inserting a suitable memory barrier instruction before and/or after each operation labeled as a synchronization. Having individual loads and stores indicate which orderings to enforce would avoid these extra instructions. However, since synchronization operations are usually infrequent, this is not the path that most microprocessors have taken so far.
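As a rough illustration of this translation step, the sketch below walks over a list of operations and places a barrier around each one that is labeled as a synchronization. The function name `insert_barriers`, the operation strings, and the `"MB"` mnemonic are placeholders chosen for this example; real instruction sets provide their own barrier instructions with different names and strengths.

```python
def insert_barriers(ops, before=True, after=True):
    """Return a new instruction list with a barrier around each labeled op.

    ops    -- list of (operation, is_sync) pairs in program order
    before -- place a barrier ahead of each synchronization operation
    after  -- place a barrier behind each synchronization operation
    """
    out = []
    for op, is_sync in ops:
        if is_sync and before:
            out.append("MB")   # placeholder barrier mnemonic (assumption)
        out.append(op)
        if is_sync and after:
            out.append("MB")
    return out

# Hypothetical program: only the flag store is labeled as synchronization.
program = [
    ("store data", False),
    ("store flag", True),
    ("load other", False),
]

print(insert_barriers(program, before=True, after=False))
print(insert_barriers(program))
```

Unlabeled operations are passed through untouched, so the hardware stays free to reorder them between the barriers.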

Overcoming Capacity Limitations

We have discussed systems that provide automatic replication and coherence in hardware only in the processor cache memory. In these systems the processor cache replicates remotely allocated data directly upon reference, without the data first being replicated in the local main memory.

A problem with these systems is that the scope for local replication is limited to the hardware cache. If a block is replaced from the cache memory, it has to be fetched from remote memory when it is needed again. The main purpose of the systems discussed in this section is to solve the replication capacity problem while still providing coherence in hardware and at the fine granularity of cache blocks for efficiency.

Tertiary Caches

To solve the replication capacity problem, one method is to use a large but slower remote access cache. Such a cache is needed for functionality when the nodes of the machine are themselves small-scale multiprocessors, and it can simply be made larger for performance. It also holds replicated remote blocks that have been replaced from the local processor cache memory.
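The benefit of the extra level can be sketched with a small trace-driven model. Everything here is an assumption made for illustration: the `LRU` class, the `remote_fetches` function, least-recently-used replacement, and the choice to install a fetched block in both the processor cache and the remote access cache. The model only counts how often a reference has to leave the node.

```python
from collections import OrderedDict

class LRU:
    """A tiny fully associative cache with least-recently-used replacement."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()

    def lookup(self, block):
        if block in self.blocks:
            self.blocks.move_to_end(block)   # mark as most recently used
            return True
        return False

    def insert(self, block):
        if self.capacity == 0:
            return                           # models a node without this cache
        self.blocks[block] = True
        self.blocks.move_to_end(block)
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict the least recently used

def remote_fetches(trace, cache_size, rac_size):
    """Count references to remote blocks that must go to remote memory."""
    cache, rac = LRU(cache_size), LRU(rac_size)
    fetches = 0
    for block in trace:
        if cache.lookup(block):
            continue                         # hit in the processor cache
        if not rac.lookup(block):
            fetches += 1                     # miss in both: go to remote memory
            rac.insert(block)
        cache.insert(block)                  # refill the processor cache
    return fetches

# Four remote blocks touched in a loop, with room for only two in the cache.
trace = [0, 1, 2, 3] * 3
print(remote_fetches(trace, cache_size=2, rac_size=0))
print(remote_fetches(trace, cache_size=2, rac_size=8))
```

In this trace the small cache alone misses on every reference, while the larger remote access cache satisfies every reference after the first pass locally.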

Cache-only Memory Architectures (COMA)

In COMA machines, every memory block in the entire main memory has a hardware tag linked with it. There is no fixed node where space is always guaranteed to be allocated for a memory block. Data dynamically migrates to, or is replicated in, the main memories of the nodes that access (attract) it, which are therefore called attraction memories. When a remote block is accessed, it is replicated in the attraction memory and brought into the cache, and the hardware keeps it consistent in both places. A data block may reside in any attraction memory and may move easily from one to another.
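The migration and replication behavior can be sketched in a few lines. This is a deliberately simplified model, not a COMA protocol: the class name `ComaMachine` and its methods are hypothetical, capacity limits and replacement of the last copy are ignored, and a write simply invalidates all other copies so that the block ends up living at the writer.

```python
class ComaMachine:
    """Toy model of attraction memories with no fixed home per block."""

    def __init__(self, num_nodes):
        # One attraction memory per node: block id -> value.
        # The dictionary key plays the role of the per-block tag.
        self.am = [dict() for _ in range(num_nodes)]

    def holders(self, block):
        """Return the nodes whose attraction memory currently holds the block."""
        return [n for n, mem in enumerate(self.am) if block in mem]

    def read(self, node, block):
        if block not in self.am[node]:
            current = self.holders(block)
            if not current:
                raise KeyError(f"block {block!r} has never been written")
            # Replicate the block into the local attraction memory.
            self.am[node][block] = self.am[current[0]][block]
        return self.am[node][block]

    def write(self, node, block, value):
        for n in self.holders(block):
            if n != node:
                del self.am[n][block]   # invalidate the other copies
        self.am[node][block] = value    # the block now lives at the writer

m = ComaMachine(3)
m.write(0, "A", 1)        # block A is created at node 0
m.read(2, "A")            # node 2 attracts a replica
print(m.holders("A"))     # copies at nodes 0 and 2
m.write(1, "A", 5)        # node 1 writes: the block migrates there
print(m.holders("A"))     # only node 1 holds it now
```

Note that no node is ever consulted as the block's home; the lookup in `holders` stands in for whatever mechanism a real machine would use to locate a copy.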

Reducing Hardware Cost

Reducing cost means moving some functionality from specialized hardware to software running on the existing hardware. It is much easier for software to manage replication and coherence in the main memory than in the hardware cache, so the low-cost methods tend to provide replication and coherence in the main memory. For coherence to be controlled efficiently, each of the other functional components of the assist can benefit from hardware specialization and integration.

Research efforts aim to lower the cost with different approaches. One approach performs access control in specialized hardware but assigns other activities to software and commodity hardware. Another performs access control in software and is designed to provide a coherent shared address space abstraction on commodity nodes and networks with no specialized hardware support.
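To make the software access control idea concrete, the sketch below checks a per-block state table before every load and store and calls a software handler when the check fails. It is a minimal illustration under stated assumptions: the `Node` and `System` classes, the three states, the invalidate-on-write policy, and the initial value of 0 are all invented for this example and do not describe any specific system.

```python
INVALID, READ_ONLY, READ_WRITE = "I", "RO", "RW"

class Node:
    def __init__(self, system):
        self.system = system
        self.state = {}    # block -> access state, checked in software
        self.data = {}     # block -> value replicated in local main memory
        self.misses = 0    # how many times the software handler ran

    def load(self, block):
        if self.state.get(block, INVALID) == INVALID:      # access check
            self.misses += 1
            self.system.fetch(self, block, READ_ONLY)
        return self.data[block]

    def store(self, block, value):
        if self.state.get(block, INVALID) != READ_WRITE:   # access check
            self.misses += 1
            self.system.fetch(self, block, READ_WRITE)
        self.data[block] = value

class System:
    """Stands in for the software protocol running over a commodity network."""
    def __init__(self, num_nodes):
        self.nodes = [Node(self) for _ in range(num_nodes)]

    def fetch(self, requester, block, mode):
        had_copy = requester.state.get(block, INVALID) != INVALID
        value = requester.data[block] if had_copy else 0   # 0: assumed initial value
        for node in self.nodes:
            if node is requester or node.state.get(block, INVALID) == INVALID:
                continue
            if not had_copy:
                value = node.data[block]                   # copy from a valid holder
            # A writer needs exclusive access; a reader only downgrades writers.
            node.state[block] = INVALID if mode == READ_WRITE else READ_ONLY
        requester.data[block] = value
        requester.state[block] = mode

system = System(2)
a, b = system.nodes
a.store(7, 42)
print(b.load(7))          # b's check fails, the handler copies the block from a
b.store(7, 43)            # b upgrades to read-write, a's copy is invalidated
print(a.load(7), a.misses, b.misses)
```

The cost of this approach shows up in the check at the top of `load` and `store`: it runs on every shared access, whether or not the block is already present.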

Implications for Parallel Software

Relaxed memory consistency models require that parallel programs label the desired conflicting accesses as synchronization points. A programming language can provide support for labeling some variables as synchronization variables, which the compiler then translates into the suitable order-preserving instructions. To restrict its own reordering of accesses to shared memory, the compiler can use the same labels itself.
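From the programmer's side, labeling usually means routing the conflicting accesses through an object the language or library treats as a synchronization. The sketch below uses Python's standard `threading.Event` as that labeled object; the function name `run_once` and the shared dictionary are assumptions for this example, and the ordering shown relies on the Event's internal locking rather than on any compiler translation step.

```python
import threading

def run_once():
    """Pass one value from a producer thread to a consumer thread."""
    data = {}                    # ordinary shared data, not labeled
    ready = threading.Event()    # the labeled synchronization object
    received = []

    def producer():
        data["value"] = 42       # ordinary write
        ready.set()              # synchronization: publish the write

    def consumer():
        ready.wait()             # synchronization: wait for the publication
        received.append(data["value"])   # ordinary read, ordered after the wait

    threads = [threading.Thread(target=consumer),
               threading.Thread(target=producer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received

print(run_once())
```

Only the accesses to `ready` carry ordering obligations; the accesses to `data` are ordinary and may be reordered freely between the synchronization points.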