Abstracts
Chris Gniady, Babak Falsafi, and T. N. Vijaykumar
Sequential consistency (SC) is the simplest programming interface for
shared-memory systems but imposes program order among all memory operations,
possibly precluding high performance implementations. Release consistency (RC),
however, enables the highest performance implementations but puts the burden on
the programmer to specify which memory operations need to be atomic and in
program order. This paper shows, for the first time, that SC implementations can
perform as well as RC implementations if the hardware provides enough support
for speculation. Both SC and RC implementations rely on reordering and
overlapping memory operations for high performance. To enforce order when
necessary, an RC implementation uses software guarantees, whereas an SC
implementation relies on hardware speculation. Our SC implementation, called
SC++, closes the performance gap because: (1) the hardware allows not only loads,
as some current SC implementations do, but also stores to bypass each other
speculatively to hide remote latencies, (2) the hardware provides large
speculative state not just for the processor, as previously proposed, but also
for memory to allow out-of-order memory operations, (3) the support for hardware
speculation does not add excessive overheads to processor pipeline critical
paths, and (4) well-behaved applications incur infrequent rollbacks of
speculative execution. Using simulation, we show that SC++ achieves an RC
implementation's performance in all six of the applications we studied.
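The speculate-then-rollback mechanism the abstract describes can be illustrated with a minimal sketch. All class and method names here are hypothetical, not from the paper; the sketch only shows the flow of buffering speculative stores, detecting a remote invalidation to a speculatively accessed address, and rolling back to a checkpoint:

```python
# Minimal sketch of SC++-style speculative reordering with rollback.
# All names are illustrative assumptions, not the paper's hardware design.

class SpeculativeCore:
    def __init__(self, memory):
        self.memory = memory        # shared memory: addr -> value
        self.spec_reads = set()     # addresses read speculatively
        self.spec_writes = {}       # speculative store buffer: addr -> value
        self.checkpoint = None

    def begin_speculation(self):
        # Checkpoint state before reordering memory operations past program order.
        self.checkpoint = dict(self.memory)
        self.spec_reads.clear()
        self.spec_writes.clear()

    def load(self, addr):
        self.spec_reads.add(addr)
        # Loads see the youngest speculative store to the same address first.
        return self.spec_writes.get(addr, self.memory.get(addr, 0))

    def store(self, addr, value):
        # Stores retire into a speculative buffer, not into memory.
        self.spec_writes[addr] = value

    def external_invalidate(self, addr):
        # A remote write to a speculatively accessed address violates SC order.
        return addr in self.spec_reads or addr in self.spec_writes

    def commit(self):
        # No violation observed: drain speculative stores into memory.
        self.memory.update(self.spec_writes)
        self.spec_writes.clear()
        self.spec_reads.clear()

    def rollback(self):
        # Violation detected: discard speculative state, restore the checkpoint.
        self.memory = dict(self.checkpoint)
        self.spec_writes.clear()
        self.spec_reads.clear()
```

Point (4) of the abstract corresponds to the assumption that `rollback` is rarely invoked in well-behaved applications, so the common case runs at the speed of the reordered execution.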
T.N. Vijaykumar and Gurindar S. Sohi
The Multiscalar architecture advocates a distributed processor organization
and task-level speculation to exploit high degrees of instruction level
parallelism (ILP) in sequential programs without impeding improvements in
clock speeds. The main goal of this paper is to understand the key
implications of the architectural features of distributed processor
organization and task-level speculation for compiler task selection from the
point of view of performance. We identify the fundamental performance issues
to be: control flow speculation, data communication, data dependence
speculation, load imbalance, and task overhead. We show that these issues are
intimately related to a few key characteristics of tasks: task size,
inter-task control flow, and inter-task data dependence. We describe compiler
heuristics to select tasks with favorable characteristics. We report
experimental results to show that the heuristics are successful in boosting
overall performance by establishing larger ILP windows.
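The relationship between task characteristics and performance issues can be sketched as a cost function. The fields, weights, and thresholds below are illustrative assumptions, not the paper's actual heuristics; the sketch only shows the shape of scoring candidate task partitions by size balance, inter-task control flow, and inter-task data dependence:

```python
# Illustrative sketch of a Multiscalar-style task-selection heuristic.
# Task fields, weights, and thresholds are assumptions, not the paper's.

from dataclasses import dataclass

@dataclass
class Task:
    name: str
    size: int          # instruction count
    exit_targets: int  # distinct successor tasks (inter-task control flow)
    live_ins: int      # register values produced by earlier tasks

def task_cost(t, ideal_size=32):
    # Penalize load imbalance (deviation from an ideal task size),
    # hard-to-predict inter-task control flow, and data communication.
    imbalance = abs(t.size - ideal_size) / ideal_size
    control = max(0, t.exit_targets - 2)  # many targets strain prediction
    comm = t.live_ins
    return imbalance + 2 * control + comm

def select_best(candidates):
    # Choose the candidate partition with the lowest total cost.
    return min(candidates, key=lambda tasks: sum(task_cost(t) for t in tasks))
```

A compiler would apply such a cost function while walking the control flow graph, preferring task boundaries that keep sizes balanced and cut few data dependences.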
Sridhar Gopal, T.N. Vijaykumar, J. E. Smith and G. S. Sohi
Dependences among loads and stores whose addresses are unknown hinder the
extraction of instruction level parallelism during the execution of a
sequential program. Such ambiguous memory dependences can be overcome by memory
dependence speculation which enables a load or store to be speculatively
executed before the addresses of all preceding loads and stores are known.
Furthermore, multiple speculative stores to a memory location create multiple
speculative versions of the location. Program order among the speculative
versions must be tracked to maintain sequential semantics. A previously
proposed approach, the Address Resolution Buffer (ARB), uses a centralized
buffer to support speculative versions. Our proposal, called the Speculative
Versioning Cache (SVC), uses distributed caches to eliminate the latency and
bandwidth problems of the ARB. The SVC conceptually unifies cache coherence and
speculative versioning by using an organization similar to snooping bus-based
coherent caches. A preliminary evaluation for the Multiscalar architecture
shows that hit latency is an important factor affecting performance, and
private cache solutions trade off hit rate for hit latency.
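The core idea of speculative versioning, multiple in-flight versions of one location ordered by task sequence, can be sketched abstractly. The class below is a hypothetical illustration of the versioning semantics only, not the SVC's distributed snooping organization:

```python
# Illustrative sketch of speculative versioning semantics: several tasks
# hold in-flight versions of one memory location, ordered by task sequence
# number. Names are assumptions; the SVC itself is a distributed cache design.

from bisect import insort

class VersionedLocation:
    def __init__(self, committed=0):
        self.committed = committed
        self.versions = []  # sorted list of (task_seq, value)

    def store(self, task_seq, value):
        # Each speculative task creates (or overwrites) its own version.
        self.versions = [(s, v) for s, v in self.versions if s != task_seq]
        insort(self.versions, (task_seq, value))

    def load(self, task_seq):
        # A load sees the closest version from the same or an earlier task.
        for s, v in reversed(self.versions):
            if s <= task_seq:
                return v
        return self.committed

    def commit(self, task_seq):
        # The oldest task retires: its version becomes architectural state.
        for s, v in self.versions:
            if s == task_seq:
                self.committed = v
        self.versions = [(s, v) for s, v in self.versions if s > task_seq]

    def squash(self, task_seq):
        # Mis-speculation: discard this task's version and all later ones.
        self.versions = [(s, v) for s, v in self.versions if s < task_seq]
```

The SVC's contribution is implementing exactly this program-order tracking with distributed private caches and a snooping-style protocol, rather than the ARB's single centralized buffer.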
Andreas I. Moshovos, Scott E. Breach, T.N. Vijaykumar, and Gurindar S. Sohi
Data dependence speculation is used in instruction-level parallel processors
to allow the early execution of an instruction before a logically preceding
instruction on which it may be data dependent. If the instruction is
independent, data dependence speculation succeeds; if not, it fails,
and the two instructions must be synchronized. Modern dynamically
scheduled processors that use data dependence speculation do so blindly
(i.e., every load instruction with unresolved dependences is speculated). In
this paper, we demonstrate that as dynamic instruction windows get larger,
significant performance benefits can result when intelligent decisions about
data dependence speculation are made. We propose dynamic data dependence
speculation techniques: (i) to predict if the execution of an instruction
is likely to result in data dependence mis-speculation, and (ii) to provide
the synchronization needed to avoid a mis-speculation. Experimental results
evaluating the effectiveness of the proposed techniques are presented in
the context of a Multiscalar processor.
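The contrast between blind and intelligent speculation can be sketched with a small predictor that remembers load PCs that mis-speculated and forces them to synchronize next time. The table structure and names are illustrative assumptions, not the paper's mechanism:

```python
# Illustrative sketch of selective (rather than blind) data dependence
# speculation: remember loads that mis-speculated and synchronize them
# with the offending store next time. Structures are assumptions.

class DependencePredictor:
    def __init__(self):
        self.sync_table = {}  # load PC -> store PC it must wait for

    def should_speculate(self, load_pc):
        # Blind speculation would always return True; here we consult history.
        return load_pc not in self.sync_table

    def record_violation(self, load_pc, store_pc):
        # A mis-speculation occurred: remember the store/load pair.
        self.sync_table[load_pc] = store_pc

    def wait_for(self, load_pc):
        # The store this load must synchronize with, if any.
        return self.sync_table.get(load_pc)
```

The payoff the abstract describes is that, in a large instruction window, avoiding repeated mis-speculations on the same static load saves the cost of squashing and re-executing long stretches of work.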
G. S. Sohi, S. Breach, and T. N. Vijaykumar
Multiscalar processors use a new, aggressive implementation paradigm
for extracting large quantities of instruction level parallelism from
ordinary high level language programs. A single program is divided
into a collection of tasks by a combination of software and hardware.
The tasks are distributed to a number of parallel processing units
which reside within a processor complex. Each of these units fetches
and executes instructions belonging to its assigned task. The
appearance of a single logical register file is maintained with a copy
in each parallel processing unit. Register results are dynamically
routed among the many parallel processing units with the help of
compiler-generated masks. Memory accesses may occur speculatively
without knowledge of preceding loads or stores. Addresses are
disambiguated dynamically, many in parallel, and processing waits only
for true data dependences.
This paper presents the philosophy of the multiscalar paradigm, the
structure of multiscalar programs, and the hardware architecture of a
multiscalar processor. The paper also discusses performance issues in
the multiscalar model, and compares the multiscalar paradigm with
other paradigms. Experimental results evaluating the performance of a
sample of multiscalar organizations are also presented.
S. Breach, T. N. Vijaykumar, and G. S. Sohi
This paper presents the operation of the register file in the
Multiscalar architecture. The register file provides the appearance
of a logically centralized register file, yet is implemented as
physically decentralized register files, queues, and control logic in
a Multiscalar processor. We address the key issues of storage,
communication, and synchronization required for a successful design
and discuss the complications that arise in the face of speculation.
In particular, the hardware required to implement the register file is
detailed, and software support to streamline the operation of the
register file is described. Illustrative examples detailing important
aspects of the operation of the register file and an evaluation of its
effectiveness are provided.
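The decentralized-but-logically-centralized register file can be sketched at a very high level. The sketch below is a simplification under stated assumptions: it forwards every write to a masked register immediately, whereas a real design forwards only a task's final update, and all names are hypothetical:

```python
# Illustrative sketch of Multiscalar-style register forwarding between
# processing units using a compiler-generated "create mask" (the set of
# registers a task may write). Names and structure are assumptions; a real
# design forwards only a task's last write to each masked register.

class ProcessingUnit:
    def __init__(self, create_mask, successor=None):
        self.regs = {}                  # local copy of the register file
        self.create_mask = create_mask  # registers this task may produce
        self.successor = successor      # next unit in program order

    def receive(self, reg, value):
        # A value arriving from a predecessor unit. If this task does not
        # itself write the register, the value passes straight through.
        self.regs[reg] = value
        if reg not in self.create_mask and self.successor:
            self.successor.receive(reg, value)

    def write(self, reg, value):
        # A local write to a masked register is forwarded downstream.
        assert reg in self.create_mask
        self.regs[reg] = value
        if self.successor:
            self.successor.receive(reg, value)
```

Chaining units this way gives each one a consistent view of the registers it needs, while the create masks let values that a task will never produce bypass it entirely.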