Only members of the ParaMount group, i.e. the unix group paramnt, have proper permissions to directly modify the FAQ. The FAQ is contained in the directory: /home/yara/re/paramnt/WWW/FAQ/
This answer will tell you how to add a question to an existing topic. If you want to add a new topic, see "How do I add a new topic to the FAQ?"
The FAQ directory contains subdirectories that correspond to topics, e.g. SEC_Running_OpenMP_Programs. To add a question/answer to an existing topic, you can simply add a file to the corresponding directory. The file should be in html format and be named *.html, where the * cannot contain the string "HEADER". I typically just name things Qnnn.html where nnn is a number.
The question is automatically identified by the script that updates the FAQ. The question must appear on a line by itself. The line preceding the question must be <h3> and the line after the question must be </h3>. No other level 3 headings can be used in the file. Other than this requirement, any html elements may be used throughout the file. Multiple questions can be linked to the same file by listing multiple questions in the h3 block. An example question/answer is shown below:
<h3>
Can I Get My Own Copy of Polaris?
Can I Download Polaris From Somewhere?
</h3>
If you are outside of Purdue see <a href="http://polaris.cs.uiuc.edu/polaris/README"> http://polaris.cs.uiuc.edu/polaris/README</a>.
If you are within Purdue send mail to polaris@ecn.purdue.edu (we can give you our modified version, i.e. with the OpenMP directives etc...)
If you simply want to use Polaris, there are easier ways than to install it yourself: (1) if you're at Purdue we already have a public copy and (2) regardless of where you are, you can use the <a href="http://punch.ecn.purdue.edu/ParHub/"> Parallel Programming Hub</a>.
The example will cause two questions: "Can I Get My Own Copy of Polaris?" and "Can I Download Polaris From Somewhere?" to be included in the FAQ and be linked to the same answer.
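The update_faq script itself is not shown in this FAQ. As a rough illustration of the rule described above, the Python sketch below collects the questions that sit between a line holding only <h3> and a line holding only </h3>. The function name extract_questions is hypothetical, and the sketch assumes one question per line inside the block; the real script may behave differently.

```python
def extract_questions(html_text):
    """Return the questions listed inside the <h3> ... </h3> block.

    Assumption: <h3> and </h3> each sit on a line of their own,
    with one question per line in between.
    """
    questions = []
    inside = False
    for line in html_text.splitlines():
        stripped = line.strip()
        if stripped == "<h3>":
            inside = True
        elif stripped == "</h3>":
            inside = False
        elif inside and stripped:
            questions.append(stripped)
    return questions


sample = """<h3>
Can I Get My Own Copy of Polaris?
Can I Download Polaris From Somewhere?
</h3>
If you are outside of Purdue see the README.
"""
print(extract_questions(sample))
```

Running a check like this on a new file before calling update_faq is a cheap way to confirm that the heading lines are laid out as required.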
After adding a question, you must update the FAQ by running the update_faq script from within the /home/yara/re/paramnt/WWW/FAQ directory.
Only members of the ParaMount group, i.e. the unix group paramnt, have proper permissions to directly modify the FAQ. The FAQ is contained in the directory: /home/yara/re/paramnt/WWW/FAQ/
The FAQ directory contains subdirectories that correspond to topics, e.g. SEC_Running_OpenMP_Programs. To add a new topic, first create a new directory. The name of the directory is not really important, but by convention I have started them all with SEC_
In the new directory, you must add a HEADER.html file to provide the name of the topic. The file should look like the one below:
<hr>
<center>
<h3>
The Topic Name Goes Here
</h3>
</center>
It is probably safest to simply copy an existing HEADER.html and just modify the topic name.
You can automatically generate a new FAQ webpage by executing the update_faq script from within the /home/yara/re/paramnt/WWW/FAQ directory.
This EPS file can be included in LaTeX documents.
(If you know of an easier way, please let us know.)
Here is another easy way to make an EPS from any kind of document: You need Adobe Acrobat and Distiller 4.0 or newer on a Windows machine.
KAI, the makers of the KAP/Pro toolset, maintain their own FAQ.
You can check whether your program is running in parallel, i.e. whether it is using more than 1 thread, by typing the following command:
ps -Lu username -o pid,gid,lwp,psr,s,comm
where username is your login.
This will give a table of your currently active processes and their lightweight threads. If your program is running in parallel then multiple instances of it should exist that have O (the letter O in the S column) as their status, i.e. they are running on a processor. For example, below the serial version of swim has only 1 thread. The parallel version always has multiple threads; however, when running on p processors only p threads have an O status. The PSR column should generally be -; however, if the LWPs are bound to a processor, that processor will appear here.
peta.ecn.purdue.edu 140: ps -Lu mjvoss -o pid,gid,lwp,psr,s,comm
  PID   GID  LWP PSR S COMMAND
 1135     1    1   - S -tcsh
29884     1    1   - S -tcsh
 2242     1    1   - O swim_serial
29611     1    1   - S emacs
(a) A Serial Version of Swim

peta.ecn.purdue.edu 141: ps -Lu mjvoss -o pid,gid,lwp,psr,s,comm
  PID   GID  LWP PSR S COMMAND
 1135     1    1   - S -tcsh
29884     1    1   - S -tcsh
 2250     1    1   - O swim_parallel_on_1
 2250     1    2   - S swim_parallel_on_1
 2250     1    3   - S swim_parallel_on_1
 2250     1    4   - S swim_parallel_on_1
 2250     1    5   - S swim_parallel_on_1
29611     1    1   - S emacs
(b) A Parallel Version of swim running on 1 processor

peta.ecn.purdue.edu 146: ps -Lu mjvoss -o pid,gid,lwp,psr,s,comm
  PID   GID  LWP PSR S COMMAND
 1135     1    1   - S -tcsh
29884     1    1   - S -tcsh
 2257     1    1   - O swim_parallel_on_2
 2257     1    2   - S swim_parallel_on_2
 2257     1    3   - S swim_parallel_on_2
 2257     1    4   - S swim_parallel_on_2
 2257     1    5   - S swim_parallel_on_2
 2257     1    6   - O swim_parallel_on_2
 2257     1    7   - S swim_parallel_on_2
29611     1    1   - S emacs
(c) A Parallel Version of Swim running on 2 processors
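If you do not want to count the rows by eye, the check can be scripted. The Python sketch below is only an illustration: it assumes the six-column layout produced by the ps command shown above (PID GID LWP PSR S COMMAND) and counts, per command, the LWPs whose S column is O. The function name count_running_lwps and the shortened sample listing are made up for this example.

```python
from collections import Counter


def count_running_lwps(ps_output):
    """Count LWPs per command whose state column (S) is 'O'."""
    running = Counter()
    for line in ps_output.splitlines():
        fields = line.split()
        # Expected columns: PID GID LWP PSR S COMMAND.
        # Skip the header, prompts, captions and any short line.
        if len(fields) < 6 or not fields[0].isdigit():
            continue
        state, command = fields[4], fields[5]
        if state == "O":
            running[command] += 1
    return dict(running)


sample = """\
  PID   GID  LWP PSR S COMMAND
 1135     1    1   - S -tcsh
 2257     1    1   - O swim_parallel_on_2
 2257     1    2   - S swim_parallel_on_2
 2257     1    6   - O swim_parallel_on_2
29611     1    1   - S emacs
"""
print(count_running_lwps(sample))
```

A count greater than 1 for your program suggests that more than one of its threads was on a processor at the moment ps took its snapshot; since it is a snapshot, repeat the check a few times before drawing conclusions.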
/usr/bin/mpstat [ interval [ count ] ]
For example, using "mpstat 1 5" will return the activity of the processors 5 times, with 1 second between each sample. Generally, the first sample or two are not accurate. (The first sample shows the average since the system startup time.) An example is shown below for a 6-processor system. The last column shows the percentage of each cpu that is currently idle.
peta.ecn.purdue.edu 167: mpstat 1 5
CPU minf mjf xcal intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0   17   0   83   13    0  218  175   18  140    0   297   40   3   2  55
  1   14   0  301   13    1  237  160   18  164    0    12   23   3   3  71
  4   13   0   55   13    0  194  166   18  153    0   379   33   2   3  62
  5   12   0  383   38   26  247  161   18  162    0     8   34   2   3  61
  8   13   0  379  219    9  183  130   19  156    0   323   28   2   3  67
 12   13   0   28   26   14  255  159   19  171    0   177   37   3   2  58
CPU minf mjf xcal intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    3   0   10    0    0   24    0    0    0    0    25    0   0   0 100
  1    0   0    0    2    2   39    0    0    1    0    24    0   0   0 100
  4    0   0    0    6    1   12    5    3    0    0     0   83   0   0  17
  5    0   0    0   10    8   16    2    5    0    0     1   17   0   0  83
  8    0   0   20  222   22   30    0    1    2    0    20    0   0   0 100
 12    0   0    0    7    0    7    7    0    0    0     0  100   0   0   0
CPU minf mjf xcal intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0    5    0   11    5    1    0    0    15   83   0   0  17
  1    0   0    0    3    3   53    0    3    0    0    38    0   0   0 100
  4    0   0    0    2    0    3    2    1    0    0     3   17   0   0  83
  5    0   0    0    2    2    8    0    0    0    0     0    0   0   0 100
  8    0   0   20  219   19   34    0    0    1    0    20    0   0   0 100
 12    0   0    0    7    0    7    7    0    0    0     0  100   0   0   0
CPU minf mjf xcal intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0    2    0    4    2    0    0    0     0   17   0   0  83
  1    0   0    0    2    2   58    0    0    0    0    44    0   0   0 100
  4    0   0    0    0    0    4    0    2    0    0    11    3   0   0  97
  5    0   0    0    9    4    6    5    1    0    0     0   83   0   0  17
  8    0   0   20  219   19   25    0    1    0    0    25    0   0   0 100
 12    0   0    0    7    0    7    7    0    0    0     0   97   0   0   3
CPU minf mjf xcal intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0    5    0    6    5    1    0    0     0   83   0   0  17
  1    0   0 8590    3    3   40    0    0    1    0    31    0   5   0  95
  4    0   0    0    7    0    7    7    0    0    0     0  100   0   0   0
  5    0   0    0    6    4    8    2    3    0    0     0   17   0   0  83
  8    0   0   20  219   19   54    0    1    2    0    43    0   0   0 100
 12    0   0    0    0    0    0    0    0    0    0     0    0   0   0 100
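Reading the idl column across several samples gets tedious on larger machines. As a sketch only, the Python function below groups mpstat output into samples (a new sample starts at each CPU header line), drops the first sample because it reports the average since startup, and lists the CPUs whose idle percentage falls below a threshold. The name busy_cpus, the 50% threshold and the trimmed sample (only the last four columns are kept) are assumptions made for the example.

```python
def busy_cpus(mpstat_output, idle_threshold=50, skip_samples=1):
    """Return, per sample, the ids of CPUs whose idle % is below the threshold."""
    samples = []
    for line in mpstat_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "CPU":
            samples.append([])          # a header line starts a new sample
        elif samples and fields[0].isdigit():
            cpu_id, idle = int(fields[0]), int(fields[-1])  # idl is the last column
            if idle < idle_threshold:
                samples[-1].append(cpu_id)
    return samples[skip_samples:]


sample = """\
CPU usr sys wt idl
  0  40   3  2  55
  4  33   2  3  62
 12  37   3  2  58
CPU usr sys wt idl
  0   0   0  0 100
  4  83   0  0  17
 12 100   0  0   0
"""
print(busy_cpus(sample))
```

For a program running on p processors you would expect roughly p CPUs to show up as busy in each remaining sample, although other users' jobs will also appear.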
Starting from the source code:
      program trivial
      REAL a(10),b(10)
      DO 100 k=1,5
        DO 10 i=1,10
          a(i)=i
 10     ENDDO
        DO 20 j=1,10
          b(j)=a(j)
 20     ENDDO
        print *,b
 100  CONTINUE
      END
I enclose each loop with a "start-timer/stop-timer" pair. In addition, there is an "init" and "finalize" call at the beginning and end of the program, resp. The init call initializes the library, the finalize call writes statistics about the collected times to a file.
The instrumented program looks like this:
      PROGRAM trivial
      REAL a(10), b(10)
      CALL instrument()
      DO 100 k=1,5
        CALL start_interval(1)
C LOOPLABEL 'TRIVIAL_do10'
        DO 10 i = 1, 10
          a(i) = i
 10     ENDDO
        CALL end_interval(1)
        CALL start_interval(2)
C LOOPLABEL 'TRIVIAL_do20'
        DO 20 j = 1, 10
          b(j) = a(j)
 20     ENDDO
        CALL end_interval(2)
        PRINT *, b
 100  CONTINUE
      CALL exit_intervals('TRIVIAL.sum')
      END

      SUBROUTINE instrument
c This subroutine maps loop names to loop numbers. This info is used by
c exit_intervals() to generate the printable summary.
      CALL init_intervals('')
      CALL enter_interval(1, 'TRIVIAL_do10')
      CALL enter_interval(2, 'TRIVIAL_do20')
      END
This kind of instrumentation can also be generated automatically by the Polaris compiler. Polaris can be run through the web at http://punch.ecn.purdue.edu/ParHub/
Now, you need to compile the program together with the instrumentation library functions, such as:
f77 trivial.f interval.f -o trivial
Then, when you run the program "trivial" it generates a file TRIVIAL.sum with the following content:
 TRIVIAL_do10 5 AVE: 0.000026 MIN: 0.000022 MAX: 0.000041 TOT: 0.000132
 TRIVIAL_do20 5 AVE: 0.000017 MIN: 0.000017 MAX: 0.000017 TOT: 0.000087

 OVERALL time - 0.022050 - - - - - -
This gives you, for each instrumented program section, the number of invocations and the average, minimum, maximum, and total execution time.
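If you want to post-process the summary file, for example to sort loops by total time, a small parser is enough. The Python sketch below is not part of the instrumentation library; the function name parse_summary and the regular expression are assumptions based only on the sample output shown here. It also illustrates that AVE is TOT divided by the invocation count, up to the six printed decimals.

```python
import re

# One summary line: name, count, then AVE/MIN/MAX/TOT values in seconds.
LINE = re.compile(
    r"^\s*(\S+)\s+(\d+)\s+AVE:\s*(\S+)\s+MIN:\s*(\S+)\s+MAX:\s*(\S+)\s+TOT:\s*(\S+)"
)


def parse_summary(text):
    """Map each interval name to its count and timing fields."""
    result = {}
    for line in text.splitlines():
        match = LINE.match(line)
        if match:  # the OVERALL line and blank lines do not match
            name, count = match.group(1), int(match.group(2))
            ave, low, high, tot = (float(x) for x in match.groups()[2:])
            result[name] = {"count": count, "ave": ave,
                            "min": low, "max": high, "tot": tot}
    return result


sample = """\
 TRIVIAL_do10 5 AVE: 0.000026 MIN: 0.000022 MAX: 0.000041 TOT: 0.000132
 TRIVIAL_do20 5 AVE: 0.000017 MIN: 0.000017 MAX: 0.000017 TOT: 0.000087

 OVERALL time - 0.022050 - - - - - -
"""
for name, row in parse_summary(sample).items():
    print(name, row["count"], row["tot"] / row["count"])
```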
The source code of the library (the file interval.f, above) is this:
      subroutine init_intervals(filename)
      character*(*) filename
      common /intvldata/ start(1000),count(1000),
     *                   total(1000), overall_start,
     *                   min(1000), max(1000),
     *                   nintervals, intvlname(1000)
      character*30 intvlname
      real start, total, overall_start, min, max, tt(2)
      integer*4 count, nintervals
      integer int_number
      character*30 int_name
      if (filename .eq. ' ') then
        nintervals = 0
        return
      endif
      open(file=filename,status='old',unit=83)
      nintervals = 0
 100  read(83,*,end=200) int_number, int_name
      nintervals = nintervals + 1
      if (nintervals .ne. int_number) then
        print *, 'Warning: Interval number .ne. record number: ',
     *           int_name
      endif
      intvlname(int_number)(:) = int_name(:)
      count(int_number) = 0
      total(int_number) = 0
      min(int_number) = 1e31
      max(int_number) = 0
      goto 100
 200  overall_start = etime(tt)
      close(unit=83)
      return
      end
c--------------------------------
      subroutine enter_interval ( number, name )
      character*(*) name
      integer number
      common /intvldata/ start(1000),count(1000),
     *                   total(1000), overall_start,
     *                   min(1000), max(1000),
     *                   nintervals, intvlname(1000)
      character*30 intvlname
      real start, total, overall_start, min, max, tt(2)
      integer*4 count, nintervals
      nintervals = nintervals + 1
      intvlname(number) = name
      count(number) = 0
      total(number) = 0
      min(number) = 1e31
      max(number) = 0
      end
c--------------------------------
      subroutine start_interval ( interval )
      integer interval
      common /intvldata/ start(1000),count(1000),
     *                   total(1000), overall_start,
     *                   min(1000), max(1000),
     *                   nintervals, intvlname(1000)
      character*30 intvlname
      real start, total, overall_start, min, max, tt(2)
      integer*4 count, nintervals
      start(interval) = etime(tt)
      return
      end
c--------------------------------
      subroutine end_interval ( interval )
      integer interval
      common /intvldata/ start(1000),count(1000),
     *                   total(1000), overall_start,
     *                   min(1000), max(1000),
     *                   nintervals, intvlname(1000)
      character*30 intvlname
      real start, total, overall_start, min, max, tt(2)
      integer*4 count, nintervals
      real period
      period = etime(tt) - start(interval)
      total(interval) = total(interval) + period
      count(interval) = count(interval) + 1
      if (period.lt.min(interval)) min(interval)=period
      if (period.gt.max(interval)) max(interval)=period
      return
      end
c--------------------------------
      subroutine exit_intervals(filename)
      character*(*) filename
      common /intvldata/ start(1000),count(1000),
     *                   total(1000), overall_start,
     *                   min(1000), max(1000),
     *                   nintervals, intvlname(1000)
      character*30 intvlname
      parameter(overhead_etime=0.71E-6)
      real start, total, overall_start, min, max, tt(2)
      real overhead_etime
      integer*4 count, nintervals
      real overall_end
      character*200 buffer, output_line
      overall_end = etime(tt)
      open(file=filename, unit=83, status='unknown')
      do i=1,nintervals
        if (count(i) .ne. 0) then
          buffer(:) = ' '
          write(buffer,10) intvlname(i)(:),
     *      count(i),
     *      (total(i)-overhead_etime*count(i))/count(i),
     *      min(i)-overhead_etime,
     *      max(i)-overhead_etime,
     *      total(i)-overhead_etime*count(i)
 10       format(1x,a30,1x,i7,' AVE: ',f12.6,' MIN: ', f12.6,
     *      ' MAX: ', f12.6,' TOT: ',f12.6)
C         call xqueexe(buffer, output_line, length)
C         write(83,15) output_line(1:length)
          write(83,15) buffer
 15       format(1x,a)
        endif
      end do
      write(83,20)
 20   format(1x)
      write(83,30) overall_end-overall_start-overhead_etime
 30   format(1x,'OVERALL time - ', f11.6,' - - - - - - ')
      return
      end

      subroutine xqueexe(string_in, string_out, length)
      character*(*) string_in
      character*(*) string_out
      integer length
      integer length_in, out, in
      logical inblanks
      length_in = len(string_in)
      inblanks = .false.
      out = 0
      do in=1,length_in
        if (string_in(in:in) .ne. ' ') then
          out = out + 1
          string_out(out:out) = string_in(in:in)
          inblanks = .false.
        elseif (.not. inblanks) then
          out = out + 1
          string_out(out:out) = string_in(in:in)
          inblanks = .true.
        endif
      end do
      length = out
      return
      end
The CPU Performance Counters Library (cpc) is available on Solaris 8 and above. Using this library, we implement the instrumentation for Polaris. The source file is interval_cpc.c, and you may also link with the library libinstrcpc.a directly. Both of them are provided in the
Please switch "instrument" on. You should also switch "instr_pcl" on if you are interested in all the processors.
Step 2: Compile the instrumentation code, if interval_cpc.c is used directly.
To compile this code, please use the Sun 6.2 C compiler:
cc -fast -xarch=v8plusa -xopenmp interval_cpc.c
To instrument the code exclusively, please define _EXCLUSIVE, e.g.
cc -fast -xarch=v8plusa -xopenmp -D_EXCLUSIVE interval_cpc.c
For multiple processors, please define _OPENMP and _MT:
cc -fast -xarch=v8plusa -xopenmp -D_OPENMP -D_MT interval_cpc.c
You may also use my precompiled object file or library.
For inclusive instrumentation: interval_cpc.o and libinstrcpc
For exclusive instrumentation: interval_cpc_e.o and libinstrcpce
n.b. Because the hardware counter is only 4 bytes wide, it will overflow if the interval is too long. Therefore, a timer is added that resets the hardware counter in order to avoid overflow. The timer period must be short enough to avoid overflow, but should otherwise be as long as possible to reduce instrumentation overhead. You may set the period by defining the macro _OVRFLSEC, e.g. "-D_OVRFLSEC=8". By default, it is 8 for 500 MHz machines.
n.b. If you want to trace the latency of each iteration, you may define _TRACE.
n.b. To avoid thread migration, I bind all the threads to the CPUs. Otherwise, migration causes big problems for the instrumentation. The bad news is that binding may have some side effects on performance.
Step 3: Compile with the instrumented code.
To instrument the parallel code, you must include the following flags for your Fortran compiler:
-fast -xarch=v8plusa -openmp -mt
Make sure you link the cpc library (-lcpc).
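If your machine does not run at 500 MHz, you may want to pick a different _OVRFLSEC value. The source does not say how the default of 8 was derived; one plausible reading is that a 4-byte counter that advances at most once per clock cycle (as Cycle_cnt does) wraps after 2^32 cycles, which is about 8.59 seconds at 500 MHz. The Python sketch below does that arithmetic. The function name is hypothetical, the unit of _OVRFLSEC is assumed to be seconds, and events that can occur more than once per cycle would need a shorter period.

```python
def max_overflow_seconds(clock_hz, counter_bits=32):
    """Largest whole number of seconds before the counter can wrap.

    Assumption: the counted event occurs at most once per clock cycle.
    """
    return (2 ** counter_bits) // clock_hz


for mhz in (250, 400, 500):
    print(mhz, "MHz ->", max_overflow_seconds(mhz * 1_000_000), "s")
```

Under these assumptions a 500 MHz clock gives 8, which matches the stated default; slower clocks allow a longer period and therefore fewer counter resets.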
Step 4: Set the environment.
CPC does counting based on the Performance Control Register (PCR). The INSTRCPC library uses the environment variable PERFEVENTS to specify the type of counters.
By default, it is set to "Cycle_cnt,Instr_cnt" to get the number of cycles and the number of instructions, and thus to compute CPI (cycles per instruction).
E.g. to get the cache hit ratio, we may do
setenv PERFEVENTS "pic0=EC_ref,pic1=EC_hit"
The syntax of setting counter options is
pic0=<eventspec>, pic1=<eventspec> [,sys] [,nouser]
This syntax, which reflects the simplicity of the options available using the %pcr register, forces both counter events to be selected. By default only user events are counted; however, the sys keyword allows system (kernel) events to be counted as well. User event counting can be disabled by specifying the nouser keyword. The keywords pic0 and pic1 may be omitted; they can be used to resolve ambiguities if they exist.
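To make the option syntax concrete, here is a Python sketch that splits a PERFEVENTS-style string into its two events and its flags, and computes a hit ratio from two counts. It is an illustration only and does not reproduce how the library parses the variable. The helper names are hypothetical, and the rule that events written without pic0=/pic1= fill the free counters in order is an assumption of this sketch.

```python
def parse_perfevents(spec):
    """Split a PERFEVENTS-style string into ({counter: event}, {flags})."""
    events, flags, positional = {}, set(), []
    for token in (t.strip() for t in spec.split(",")):
        if not token:
            continue
        if token in ("sys", "nouser"):
            flags.add(token)
        elif "=" in token:
            key, value = token.split("=", 1)
            events[key.strip()] = value.strip()
        else:
            positional.append(token)
    # Assumption: unnamed events take the remaining counters in order.
    for counter in ("pic0", "pic1"):
        if counter not in events and positional:
            events[counter] = positional.pop(0)
    return events, flags


def hit_ratio(references, hits):
    """Fraction of references that hit, e.g. EC_hit / EC_ref."""
    return hits / references


print(parse_perfevents("pic0=EC_ref,pic1=EC_hit"))
print(parse_perfevents("Cycle_cnt,Instr_cnt,sys"))
print(hit_ratio(200, 150))
```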
LAST STEP: Run the program!
1 Instruction Execution Rates
Cycle_cnt [PIC0,PIC1]
Accumulated cycles. This is similar to the SPARC-V9 TICK register, except that cycle counting is controlled by the PCR.UT and PCR.ST fields.
Instr_cnt [PIC0,PIC1]
The number of instructions completed. Annulled, mispredicted or trapped instructions are not counted.
Using the two counters to measure instruction completion and cycles allows calculation of the average number of instructions completed per cycle.
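As a worked example of that calculation, the Python sketch below turns a Cycle_cnt/Instr_cnt pair into instructions per cycle (IPC) and its inverse, cycles per instruction (CPI). The function name and the sample counts are made up for illustration.

```python
def ipc_and_cpi(cycle_cnt, instr_cnt):
    """Return (instructions per cycle, cycles per instruction)."""
    if cycle_cnt <= 0 or instr_cnt <= 0:
        raise ValueError("both counts must be positive")
    return instr_cnt / cycle_cnt, cycle_cnt / instr_cnt


# Made-up counts: 2000 cycles, 1000 completed instructions.
ipc, cpi = ipc_and_cpi(2000, 1000)
print(ipc, cpi)
```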
2 Grouping (G) Stage Stall Counts
These are the major cause of pipeline stalls (bubbles) from the G Stage of the pipeline. Stalls are counted for each clock that the associated condition is true.
Dispatch0_IC_miss [PIC0]
I-buffer is empty from I-Cache miss. This includes E-Cache miss processing if an E-Cache miss also occurs.
Dispatch0_mispred [PIC1]
I-buffer is empty from Branch misprediction. Branch misprediction kills instructions after the dispatch point, so the total number of pipeline bubbles is approximately twice as big as measured from this count.
Dispatch0_storeBuf [PIC0]
Store buffer can not hold additional stores, and a store instruction is the first instruction in the group.
Dispatch0_FP_use [PIC1]
First instruction in the group depends on an earlier floating point result that is not yet available, but only while the earlier instruction is not stalled for a Load_use (see 3). Thus, Dispatch0_FP_use and Load_use are mutually exclusive counts.
Some less common stalls are not counted by any performance counter, including:
- One cycle stalls for an FGA/FGM instruction entering the G stage following an FDIV or FSQRT.
3 Load Use Stall Counts
Stalls are counted for each clock that the associated condition is true.
Load_use [PIC0]
An instruction in the execute stage depends on an earlier load result that is not yet available. This stalls all instructions in the execute and grouping stages.
Load_use also counts cycles when no instructions are dispatched due to a one cycle load-load dependency on the first instruction presented to the grouping logic.
There are also overcounts due to, for example, mispredicted CTIs and dispatched instructions that are invalidated by traps.
Load_use_RAW [PIC1]
There is a load use in the execute stage and there is a read-after-write hazard on the oldest outstanding load. This indicates that load data is being delayed by completion of an earlier store.
Some less common stalls are not counted by any performance counter, including:
4 Cache Access Statistics
I-, D-, and E-Cache access statistics can be collected. Counts are updated by each cache access, regardless of whether the access will be used.
IC_ref [PIC0]
I-Cache references. I-Cache references are fetches of up to four instructions from an aligned block of eight instructions. I-Cache references are generally prefetches and do not correspond exactly to the instructions executed.
IC_hit [PIC1]
I-Cache hits.
DC_rd [PIC0]
D-Cache read references (including accesses that subsequently trap). Non-D-Cacheable accesses are not counted. Atomic instructions, block loads, "internal" and "external" bad ASIs, quad precision LDD, and MEMBARs also fall into this class.
DC_rd_hit [PIC1]
D-Cache read hits are counted in one of two places:
- When they access the D-Cache tags and do not enter the load buffer (because it is already empty)
- When they exit the load buffer (due to a D-Cache miss or a nonempty load buffer).
Loads that hit the D-Cache may be placed in the load buffer for a number of reasons; for example, the load buffer was not empty. Such loads may be turned into misses if a snoop occurs during their stay in the load buffer (due to an external request or to an E-Cache miss). In this case they do not count as D-Cache read hits.
DC_wr [PIC0]
D-Cache write references (including accesses that subsequently trap). Non-D-Cacheable accesses are not counted.
DC_wr_hit [PIC1]
D-Cache write hits.
EC_ref [PIC0]
Total E-Cache references. Non-cacheable accesses are not counted.
EC_hit [PIC1]
Total E-Cache hits.
EC_write_hit_RDO [PIC0]
E-Cache hits that do a read for ownership UPA transaction.
EC_wb [PIC1]
E-Cache misses that do writebacks.
EC_snoop_inv [PIC0]
E-Cache invalidates from the following UPA transactions: S_INV_REQ, S_CPI_REQ.
EC_snoop_cb [PIC1]
E-Cache snoop copy-backs from the following UPA transactions: S_CPB_REQ, S_CPI_REQ, S_CPD_REQ, S_CPB_MSI_REQ.
EC_rd_hit [PIC0]
E-Cache read hits from D-Cache misses.
EC_ic_hit [PIC1]
E-Cache read hits from I-Cache misses.
The E-Cache write hit count is determined by subtracting the read hit and the instruction hit count from the total E-Cache hit count. The E-Cache write reference count is determined by subtracting the D-Cache read miss (D-Cache read references minus D-Cache read hits) and I-Cache misses (I-Cache references minus I-Cache hits) from the total E-Cache references. Because of store buffer compression, this is not the same as D-Cache write misses.
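The two subtractions described above can be written out as a short Python sketch. The function name and the counts are made up for illustration; since only two events are recorded at a time (pic0 and pic1), the inputs would in practice have to come from several runs of the same program, which this sketch simply assumes are comparable.

```python
def derived_ecache_counts(c):
    """Derive E-Cache write hits and write references from raw counts."""
    write_hits = c["EC_hit"] - c["EC_rd_hit"] - c["EC_ic_hit"]
    dc_read_misses = c["DC_rd"] - c["DC_rd_hit"]   # D-Cache read misses
    ic_misses = c["IC_ref"] - c["IC_hit"]          # I-Cache misses
    write_refs = c["EC_ref"] - dc_read_misses - ic_misses
    return {"EC_write_hits": write_hits, "EC_write_refs": write_refs}


# Made-up counts, for illustration only.
counts = {
    "IC_ref": 1000, "IC_hit": 950,
    "DC_rd": 500, "DC_rd_hit": 450,
    "EC_ref": 150, "EC_hit": 120,
    "EC_rd_hit": 40, "EC_ic_hit": 45,
}
print(derived_ecache_counts(counts))
```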