How can I instrument a program using hardware counters?

The UltraSPARC and Pentium microprocessor families contain hardware performance counters that allow the measurement of many different hardware events related to CPU behavior, including instruction and data cache misses as well as various internal states of the processor. More recent processors allow a variety of events to be captured. The counters can be configured to count user events or system events, or both. The two processor families currently share the restriction that only two event types can be measured simultaneously.

The CPU Performance Counters library (cpc) is available on Solaris 8 and above. The instrumentation for polaris is implemented on top of this library. The source file is interval_cpc.c; you may also link directly with the library libinstrcpc.a. Both are provided in paramnt/tools/instrumentation. This library satisfies the polaris instrumentation interface.

Usage

Step 1: Using polaris to do the instrumentation.

Switch "instrument" on. If you are interested in all the processors, also switch "instr_pcl" on.

Step 2: Compile the instrumentation code, if interval_cpc.c is used directly.

To compile this code, use the Sun 6.2 C compiler:
    cc -fast -xarch=v8plusa -xopenmp interval_cpc.c
To instrument the code exclusively, define _EXCLUSIVE, e.g.:
    cc -fast -xarch=v8plusa -xopenmp -D_EXCLUSIVE interval_cpc.c
For multiple processors, define _OPENMP and _MT:
    cc -fast -xarch=v8plusa -xopenmp -D_OPENMP -D_MT interval_cpc.c
You may also use my precompiled object file or library.
For inclusive instrumentation: interval_cpc.o and libinstrcpc
For exclusive instrumentation: interval_cpc_e.o and libinstrcpce
n.b. Because each hardware counter is only 4 bytes wide, it will overflow if the interval is too long. Therefore, a timer is added that resets the hardware counter before it can overflow. The timer period must be short enough to avoid overflow, but otherwise as long as possible to reduce instrumentation overhead. You may set the period by defining the macro _OVRFLSEC, e.g. "-D_OVRFLSEC=8". The default is 8, chosen for 500 MHz machines.
n.b. If you want to trace the latency of each iteration, you may define _TRACE.
n.b. To avoid thread migration, I bind all the threads to CPUs. Otherwise, migration would cause serious problems for the instrumentation. The bad news is that binding may have some side effects on performance.
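
The choice of _OVRFLSEC in the overflow note above can be checked with a little arithmetic. The sketch below is not part of interval_cpc.c; the function names are hypothetical. It assumes that _OVRFLSEC is a period in seconds (as the macro name suggests) and that the fastest-growing event increments at most once per clock cycle, as Cycle_cnt does. An event that can fire several times per cycle would wrap sooner.

```c
/* Seconds until a 32-bit counter wraps when it increments once per
 * clock cycle (assumed worst case, e.g. Cycle_cnt). */
double overflow_seconds(double clock_hz)
{
    return 4294967296.0 / clock_hz;   /* 2^32 counts / (counts per second) */
}

/* Largest whole number of seconds that does not exceed the wrap time;
 * a candidate value for -D_OVRFLSEC (hypothetical helper). */
unsigned int max_ovrflsec(double clock_hz)
{
    return (unsigned int)overflow_seconds(clock_hz);  /* truncates toward zero */
}
```

For a 500 MHz clock, overflow_seconds(500e6) is about 8.59, so max_ovrflsec(500e6) gives 8, which is consistent with the default. On a faster machine the value would have to be lowered accordingly.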

Step 3: Compile with the instrumented code.

To instrument parallel code, you must pass the following flags to your Fortran compiler: -fast -xarch=v8plusa -openmp -mt
Make sure you link the cpc library (-lcpc).

If you want to use the instrcpc/instrcpce library, do not forget to link it.

Step 4: Set the environment.

CPC counts events as selected by the Performance Control Register (PCR). The INSTRCPC library uses the environment variable PERFEVENTS to specify which events to count.
By default, it is set to "Cycle_cnt,Instr_cnt" to get the number of cycles and the number of instructions, from which CPI (cycles per instruction) can be computed.
E.g. to get the cache hit ratio, we may do
    setenv PERFEVENTS "pic0=EC_ref,pic1=EC_hit"
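
Once the two counts have been collected, the derived metrics are simple ratios. The following is a minimal sketch with hypothetical helper names; it is not taken from the INSTRCPC library, and it assumes the counts have already been read into 64-bit integers.

```c
/* Cycles per instruction from a Cycle_cnt / Instr_cnt measurement.
 * Returns 0.0 when no instructions were counted. */
double cpi(unsigned long long cycles, unsigned long long instructions)
{
    return instructions ? (double)cycles / (double)instructions : 0.0;
}

/* Hit ratio from a reference count and a hit count,
 * e.g. the EC_ref / EC_hit pair. Returns 0.0 when there were no references. */
double hit_ratio(unsigned long long refs, unsigned long long hits)
{
    return refs ? (double)hits / (double)refs : 0.0;
}
```

For example, 2000 cycles and 1000 instructions give a CPI of 2.0, and 1000 E-Cache references with 750 hits give a hit ratio of 0.75.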
The syntax for setting counter options is
    pic0=<eventspec>, pic1=<eventspec> [,sys] [,nouser]
This syntax, which reflects the simplicity of the options available using the %pcr register, forces both counter events to be selected. By default only user events are counted; however, the sys keyword allows system (kernel) events to be counted as well. User event counting can be disabled by specifying the nouser keyword. The keywords pic0 and pic1 may be omitted; they can be used to resolve ambiguities if they exist.
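
To make the syntax concrete, here is a small illustrative parser for such a setting. It is only a sketch: struct perf_spec, set_event and parse_perfevents are hypothetical names, and this is not how the cpc or INSTRCPC libraries parse the string. In particular, it assumes that events without a pic0=/pic1= prefix simply fill the free counter slots in order, whereas a real implementation would have to check which counter supports each event.

```c
#include <string.h>

#define EVENT_NAME_MAX 32

/* Hypothetical container for a parsed PERFEVENTS setting. */
struct perf_spec {
    char pic0[EVENT_NAME_MAX];   /* event name for counter 0 */
    char pic1[EVENT_NAME_MAX];   /* event name for counter 1 */
    int  user;                   /* 1 = count user events (the default) */
    int  sys;                    /* 1 = count system (kernel) events */
};

/* Store an event name in a counter slot; fail if the slot is already
 * taken or the name is empty or too long. */
static int set_event(char *slot, const char *name)
{
    if (slot[0] != '\0' || name[0] == '\0' || strlen(name) >= EVENT_NAME_MAX)
        return -1;
    strcpy(slot, name);
    return 0;
}

/* Parse "pic0=<ev>, pic1=<ev> [,sys] [,nouser]".
 * Returns 0 on success, -1 on a malformed setting. */
int parse_perfevents(const char *text, struct perf_spec *out)
{
    char buf[128];
    char *tok;

    if (strlen(text) >= sizeof buf)
        return -1;
    strcpy(buf, text);               /* strtok modifies its input */

    memset(out, 0, sizeof *out);
    out->user = 1;                   /* user events are counted by default */

    for (tok = strtok(buf, ", \t"); tok != NULL; tok = strtok(NULL, ", \t")) {
        int rc;

        if (strcmp(tok, "sys") == 0) {
            out->sys = 1;
            continue;
        }
        if (strcmp(tok, "nouser") == 0) {
            out->user = 0;
            continue;
        }
        if (strncmp(tok, "pic0=", 5) == 0)
            rc = set_event(out->pic0, tok + 5);
        else if (strncmp(tok, "pic1=", 5) == 0)
            rc = set_event(out->pic1, tok + 5);
        else if (out->pic0[0] == '\0')   /* no prefix: fill slots in order (assumption) */
            rc = set_event(out->pic0, tok);
        else
            rc = set_event(out->pic1, tok);
        if (rc != 0)
            return -1;
    }
    /* The syntax requires both counter events to be selected. */
    return (out->pic0[0] != '\0' && out->pic1[0] != '\0') ? 0 : -1;
}
```

With this sketch, "pic0=EC_ref,pic1=EC_hit" and "Cycle_cnt,Instr_cnt" are both accepted, while a setting that names only one event, or names pic0 twice, is rejected.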

LAST STEP: Run the program!

Appendix: Performance Instrumentation Counter Events

(From Sun Microelectronics UltraSPARC I&II User's Manual, January 1997, STP1031)

1 Instruction Execution Rates

Cycle_cnt [PIC0,PIC1]

Accumulated cycles. This is similar to the SPARC-V9 TICK register, except that cycle counting is controlled by the PCR.UT and PCR.ST fields.

Instr_cnt [PIC0,PIC1]

The number of instructions completed. Annulled, mispredicted or trapped instructions are not counted.

Using the two counters to measure instruction completion and cycles allows calculation of the average number of instructions completed per cycle.

2 Grouping (G) Stage Stall Counts

These are the major cause of pipeline stalls (bubbles) from the G Stage of the pipeline. Stalls are counted for each clock that the associated condition is true.

Dispatch0_IC_miss [PIC0]

I-buffer is empty from I-Cache miss. This includes E-Cache miss processing if an E-Cache miss also occurs.

Dispatch0_mispred [PIC1]

I-buffer is empty from Branch misprediction. Branch misprediction kills instructions after the dispatch point, so the total number of pipeline bubbles is approximately twice as big as measured from this count.

Dispatch0_storeBuf [PIC0]

Store buffer can not hold additional stores, and a store instruction is the first instruction in the group.

Dispatch0_FP_use [PIC1]

First instruction in the group depends on an earlier floating point result that is not yet available, but only while the earlier instruction is not stalled for a Load_use (see Section 3). Thus, Dispatch0_FP_use and Load_use are mutually exclusive counts.

Some less common stalls are not counted by any performance counter, including:

One cycle stalls for an FGA/FGM instruction entering the G stage following an FDIV or FSQRT.

3 Load Use Stall Counts

Stalls are counted for each clock that the associated condition is true.

Load_use [PIC0]

An instruction in the execute stage depends on an earlier load result that is not yet available. This stalls all instructions in the execute and grouping stages.
Load_use also counts cycles when no instructions are dispatched due to a one cycle load-load dependency on the first instruction presented to the grouping logic.
There are also overcounts due to, for example, mispredicted CTIs and dispatched instructions that are invalidated by traps.

Load_use_RAW [PIC1]

There is a load use in the execute stage and there is a read-after-write hazard on the oldest outstanding load. This indicates that load data is being delayed by completion of an earlier store.

Some less common stalls are not counted by any performance counter, including:

Stalls associated with WRPR/RDPR and internal ASI loads.
MEMBAR stalls.
One cycle stalls due to bad prediction around a change to the Current Window Pointer (CWP).

4 Cache Access Statistics

I-, D-, and E-Cache access statistics can be collected. Counts are updated by each cache access, regardless of whether the access will be used.

IC_ref [PIC0]

I-Cache references. I-Cache references are fetches of up to four instructions from an aligned block of eight instructions. I-Cache references are generally prefetches and do not correspond exactly to the instructions executed.

IC_hit [PIC1]

I-Cache hits.

DC_rd [PIC0]

D-Cache read references (including accesses that subsequently trap). Non-D-Cacheable accesses are not counted. Atomic, block load, "internal" and "external" bad ASIs, quad precision LDD, and MEMBARs also fall into this class.

DC_rd_hit [PIC1]

D-Cache read hits are counted in one of two places:

When they access the D-Cache tags and do not enter the load buffer (because it is already empty)

When they exit the load buffer (due to a D-Cache miss or a nonempty load buffer).

Loads that hit the D-Cache may be placed in the load buffer for a number of reasons; for example, the load buffer was not empty. Such loads may be turned into misses if a snoop occurs during their stay in the load buffer (due to an external request or to an E-Cache miss). In this case they do not count as D-Cache read hits.

DC_wr [PIC0]

D-Cache write references (including accesses that subsequently trap). Non-D-Cacheable accesses are not counted.

DC_wr_hit [PIC1]

D-Cache write hits.

EC_ref [PIC0]

Total E-Cache references. Non-cacheable accesses are not counted.

EC_hit [PIC1]

Total E-Cache hits.

EC_write_hit_RDO [PIC0]

E-Cache hits that do a read for ownership UPA transaction.

EC_wb [PIC1]

E-Cache misses that do writebacks.

EC_snoop_inv [PIC0]

E-Cache invalidates from the following UPA transactions: S_INV_REQ, S_CPI_REQ.

EC_snoop_cb [PIC1]

E-Cache snoop copy-backs from the following UPA transactions: S_CPB_REQ, S_CPI_REQ, S_CPD_REQ, S_CPB_MSI_REQ.

EC_rd_hit [PIC0]

E-Cache read hits from D-Cache misses.

EC_ic_hit [PIC1]

E-Cache read hits from I-Cache misses.

The E-Cache write hit count is determined by subtracting the read hit and the instruction hit count from the total E-Cache hit count. The E-Cache write reference count is determined by subtracting the D-Cache read miss (D-Cache read references minus D-Cache read hits) and I-Cache misses (I-Cache references minus I-Cache hits) from the total E-Cache references. Because of store buffer compression, this is not the same as D-Cache write misses.
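
The two derived E-Cache counts described above can be written out as follows. This is a minimal sketch with hypothetical names (struct ecache_counts, ec_write_hits, ec_write_refs). It assumes the eight raw counts were gathered over several runs of the same workload, since only two events can be measured at a time; signed arithmetic is used so that inconsistent runs show up as negative values instead of wrapping around.

```c
/* Raw counts, named after the events in this appendix. */
struct ecache_counts {
    long long ic_ref, ic_hit;        /* IC_ref, IC_hit       */
    long long dc_rd, dc_rd_hit;      /* DC_rd, DC_rd_hit     */
    long long ec_ref, ec_hit;        /* EC_ref, EC_hit       */
    long long ec_rd_hit, ec_ic_hit;  /* EC_rd_hit, EC_ic_hit */
};

/* E-Cache write hits = total hits - read hits - instruction hits. */
long long ec_write_hits(const struct ecache_counts *c)
{
    return c->ec_hit - c->ec_rd_hit - c->ec_ic_hit;
}

/* E-Cache write references = total references
 *                            - D-Cache read misses - I-Cache misses. */
long long ec_write_refs(const struct ecache_counts *c)
{
    long long dc_rd_miss = c->dc_rd - c->dc_rd_hit;
    long long ic_miss    = c->ic_ref - c->ic_hit;

    return c->ec_ref - dc_rd_miss - ic_miss;
}
```

As an illustration with made-up numbers: with EC_hit = 400, EC_rd_hit = 150 and EC_ic_hit = 80 the write hit count is 170; with EC_ref = 500, 200 D-Cache read misses and 100 I-Cache misses the write reference count is 200.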