Only members of the ParaMount group, i.e. the Unix group paramnt, have the proper permissions to directly modify the FAQ. The FAQ is contained in the directory: /home/yara/re/paramnt/WWW/FAQ/
This answer will tell you how to add a question to an existing topic. If you want to add a new topic, see "How do I add a new topic to the FAQ?"
The FAQ directory contains subdirectories that correspond to topics, e.g. SEC_Running_OpenMP_Programs. To add a question/answer to an existing topic, you can simply add a file to the corresponding directory. The file should be in HTML format and be named *.html, where * cannot contain the string "HEADER". I typically just name things Qnnn.html where nnn is a number.
The question is automatically identified by a script that updates the FAQ. The question must appear on a line by itself. The line preceding the question must be <h3> and the line after the question must be </h3>. No other level 3 headings can be used in the file. Other than this requirement, any HTML elements may be used throughout the file. Multiple questions can be linked to the same file by listing them within the same h3 block. An example question/answer is shown below:
<h3> Can I Get My Own Copy of Polaris? Can I Download Polaris From Somewhere? </h3>

If you are outside of Purdue see <a href="http://polaris.cs.uiuc.edu/polaris/README"> http://polaris.cs.uiuc.edu/polaris/README</a>. If you are within Purdue send mail to polaris@ecn.purdue.edu (we can give you our modified version, i.e. with the OpenMP directives etc...). If you simply want to use Polaris, there are easier ways than installing it yourself: (1) if you're at Purdue we already have a public copy, and (2) regardless of where you are, you can use the <a href="http://punch.ecn.purdue.edu/ParHub/"> Parallel Programming Hub</a>.
The example will cause two questions: "Can I Get My Own Copy of Polaris?" and "Can I Download Polaris From Somewhere?" to be included in the FAQ and be linked to the same answer.
After adding a question, you must update the FAQ by running the update_faq script from within the /home/yara/re/paramnt/WWW/FAQ directory.
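For example, assuming the script is invoked directly from that directory:

>> cd /home/yara/re/paramnt/WWW/FAQ
>> ./update_faq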
Only members of the ParaMount group, i.e. the unix group paramnt, have proper permissions to directly modify the FAQ. The FAQ is contained in the directory: /home/yara/re/paramnt/WWW/FAQ/
The FAQ directory contains subdirectories that correspond to topics, e.g. SEC_Running_OpenMP_Programs. To add a new topic, first create a new directory. The name of the directory is not really important, but by convention I have started them all with SEC_
In the new directory, you must add a HEADER.html file to provide the name of the topic. The file should look like the one below:
<hr> <center> <h3> The Topic Name Goes Here </h3> </center>

It is probably safest to simply copy an existing HEADER.html and just modify the topic name.
You can automatically generate a new FAQ webpage by executing the update_faq script from within the /home/yara/re/paramnt/WWW/FAQ directory.
This EPS file can be included in LaTeX documents.
(If you know of an easier way, please let us know.)
Here is another easy way to make an EPS from any kind of document: You need Adobe Acrobat and Distiller 4.0 or newer on a Windows machine.
KAI, the makers of the KAP/Pro toolset, maintain their own FAQ here.
You can check whether your program is running in parallel, i.e. whether it is using more than 1 thread, by typing the following command:
ps -Lu username -o pid,gid,lwp,psr,s,comm
where username is your login.
This will give a table of your currently active processes and their lightweight threads (LWPs). If your program is running in parallel then multiple instances of it should exist, and those actually running on a processor show O in the status (S) column. For example, below the serial version of swim has only 1 thread. The parallel version always has multiple threads; however, when running on p processors only p threads have an O status. The PSR column should generally be -; however, if the LWPs are bound to a processor, the processor number will appear there.
peta.ecn.purdue.edu 140: ps -Lu mjvoss -o pid,gid,lwp,psr,s,comm
  PID   GID    LWP PSR S COMMAND
 1135     1      1   - S -tcsh
29884     1      1   - S -tcsh
 2242     1      1   - O swim_serial
29611     1      1   - S emacs

(a) A Serial Version of Swim

peta.ecn.purdue.edu 141: ps -Lu mjvoss -o pid,gid,lwp,psr,s,comm
  PID   GID    LWP PSR S COMMAND
 1135     1      1   - S -tcsh
29884     1      1   - S -tcsh
 2250     1      1   - O swim_parallel_on_1
 2250     1      2   - S swim_parallel_on_1
 2250     1      3   - S swim_parallel_on_1
 2250     1      4   - S swim_parallel_on_1
 2250     1      5   - S swim_parallel_on_1
29611     1      1   - S emacs

(b) A Parallel Version of Swim running on 1 processor

peta.ecn.purdue.edu 146: ps -Lu mjvoss -o pid,gid,lwp,psr,s,comm
  PID   GID    LWP PSR S COMMAND
 1135     1      1   - S -tcsh
29884     1      1   - S -tcsh
 2257     1      1   - O swim_parallel_on_2
 2257     1      2   - S swim_parallel_on_2
 2257     1      3   - S swim_parallel_on_2
 2257     1      4   - S swim_parallel_on_2
 2257     1      5   - S swim_parallel_on_2
 2257     1      6   - O swim_parallel_on_2
 2257     1      7   - S swim_parallel_on_2
29611     1      1   - S emacs

(c) A Parallel Version of Swim running on 2 processors
/usr/bin/mpstat [ interval [ count ] ]

For example, "mpstat 1 5" will report the activity of the processors 5 times, with 1 second between each sample. Generally, the first sample or two are not accurate. (The first sample shows the average since system startup.) An example is shown below for a 6-processor system. The last column shows the percentage of each CPU that is currently idle.
peta.ecn.purdue.edu 167: mpstat 1 5
CPU minf mjf xcal intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0   17   0   83   13    0  218  175   18  140    0   297   40   3   2  55
  1   14   0  301   13    1  237  160   18  164    0    12   23   3   3  71
  4   13   0   55   13    0  194  166   18  153    0   379   33   2   3  62
  5   12   0  383   38   26  247  161   18  162    0     8   34   2   3  61
  8   13   0  379  219    9  183  130   19  156    0   323   28   2   3  67
 12   13   0   28   26   14  255  159   19  171    0   177   37   3   2  58
CPU minf mjf xcal intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    3   0   10    0    0   24    0    0    0    0    25    0   0   0 100
  1    0   0    0    2    2   39    0    0    1    0    24    0   0   0 100
  4    0   0    0    6    1   12    5    3    0    0     0   83   0   0  17
  5    0   0    0   10    8   16    2    5    0    0     1   17   0   0  83
  8    0   0   20  222   22   30    0    1    2    0    20    0   0   0 100
 12    0   0    0    7    0    7    7    0    0    0     0  100   0   0   0
CPU minf mjf xcal intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0    5    0   11    5    1    0    0    15   83   0   0  17
  1    0   0    0    3    3   53    0    3    0    0    38    0   0   0 100
  4    0   0    0    2    0    3    2    1    0    0     3   17   0   0  83
  5    0   0    0    2    2    8    0    0    0    0     0    0   0   0 100
  8    0   0   20  219   19   34    0    0    1    0    20    0   0   0 100
 12    0   0    0    7    0    7    7    0    0    0     0  100   0   0   0
CPU minf mjf xcal intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0    2    0    4    2    0    0    0     0   17   0   0  83
  1    0   0    0    2    2   58    0    0    0    0    44    0   0   0 100
  4    0   0    0    0    0    4    0    2    0    0    11    3   0   0  97
  5    0   0    0    9    4    6    5    1    0    0     0   83   0   0  17
  8    0   0   20  219   19   25    0    1    0    0    25    0   0   0 100
 12    0   0    0    7    0    7    7    0    0    0     0   97   0   0   3
CPU minf mjf xcal intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0    5    0    6    5    1    0    0     0   83   0   0  17
  1    0   0 8590    3    3   40    0    0    1    0    31    0   5   0  95
  4    0   0    0    7    0    7    7    0    0    0     0  100   0   0   0
  5    0   0    0    6    4    8    2    3    0    0     0   17   0   0  83
  8    0   0   20  219   19   54    0    1    2    0    43    0   0   0 100
 12    0   0    0    0    0    0    0    0    0    0     0    0   0   0 100
Starting from the source code:
      program trivial
      REAL a(10),b(10)
      DO 100 k=1,5
         DO 10 i=1,10
            a(i)=i
 10      ENDDO
         DO 20 j=1,10
            b(j)=a(j)
 20      ENDDO
         print *,b
 100  CONTINUE
      END

I enclose each loop with a "start-timer/stop-timer" pair. In addition, there is an "init" and a "finalize" call at the beginning and end of the program, respectively. The init call initializes the library; the finalize call writes statistics about the collected times to a file.
The instrumented program looks like this:
      PROGRAM trivial
      REAL a(10), b(10)
      CALL instrument()
      DO 100 k=1,5
         CALL start_interval(1)
C LOOPLABEL 'TRIVIAL_do10'
         DO 10 i = 1, 10
            a(i) = i
 10      ENDDO
         CALL end_interval(1)
         CALL start_interval(2)
C LOOPLABEL 'TRIVIAL_do20'
         DO 20 j = 1, 10
            b(j) = a(j)
 20      ENDDO
         CALL end_interval(2)
         PRINT *, b
 100  CONTINUE
      CALL exit_intervals('TRIVIAL.sum')
      END

      SUBROUTINE instrument
c This subroutine maps loop names to loop numbers. This info is used by
c exit_intervals() to generate the printable summary.
      CALL init_intervals('')
      CALL enter_interval(1, 'TRIVIAL_do10')
      CALL enter_interval(2, 'TRIVIAL_do20')
      END

This kind of instrumentation can also be generated automatically by the Polaris compiler. Polaris can be run through the web at http://punch.ecn.purdue.edu/ParHub/
Now, you need to compile the program together with the instrumentation library functions, such as:
f77 trivial.f interval.f -o trivial

Then, when you run the program "trivial" it generates a file TRIVIAL.sum with the following content:
TRIVIAL_do10    5 AVE: 0.000026 MIN: 0.000022 MAX: 0.000041 TOT: 0.000132
TRIVIAL_do20    5 AVE: 0.000017 MIN: 0.000017 MAX: 0.000017 TOT: 0.000087

OVERALL time - 0.022050 - - - - - -

This gives you, for each instrumented program section, the number of invocations and the average, minimum, maximum, and total execution time.
The source code of the library (the file interval.f, above) is this:
      subroutine init_intervals(filename)
      character*(*) filename
      common /intvldata/ start(1000),count(1000),
     *                   total(1000), overall_start,
     *                   min(1000), max(1000),
     *                   nintervals, intvlname(1000)
      character*30 intvlname
      real start, total, overall_start, min, max, tt(2)
      integer*4 count, nintervals
      integer int_number
      character*30 int_name
      if (filename .eq. ' ') then
         nintervals = 0
         return
      endif
      open(file=filename,status='old',unit=83)
      nintervals = 0
 100  read(83,*,end=200) int_number, int_name
      nintervals = nintervals + 1
      if (nintervals .ne. int_number) then
         print *, 'Warning: Interval number .ne. record number: ',
     *        int_name
      endif
      intvlname(int_number)(:) = int_name(:)
      count(int_number) = 0
      total(int_number) = 0
      min(int_number) = 1e31
      max(int_number) = 0
      goto 100
 200  overall_start = etime(tt)
      close(unit=83)
      return
      end
c--------------------------------
      subroutine enter_interval ( number, name )
      character*(*) name
      integer number
      common /intvldata/ start(1000),count(1000),
     *                   total(1000), overall_start,
     *                   min(1000), max(1000),
     *                   nintervals, intvlname(1000)
      character*30 intvlname
      real start, total, overall_start, min, max, tt(2)
      integer*4 count, nintervals
      nintervals = nintervals + 1
      intvlname(number) = name
      count(number) = 0
      total(number) = 0
      min(number) = 1e31
      max(number) = 0
      end
c--------------------------------
      subroutine start_interval ( interval )
      integer interval
      common /intvldata/ start(1000),count(1000),
     *                   total(1000), overall_start,
     *                   min(1000), max(1000),
     *                   nintervals, intvlname(1000)
      character*30 intvlname
      real start, total, overall_start, min, max, tt(2)
      integer*4 count, nintervals
      start(interval) = etime(tt)
      return
      end
c--------------------------------
      subroutine end_interval ( interval )
      integer interval
      common /intvldata/ start(1000),count(1000),
     *                   total(1000), overall_start,
     *                   min(1000), max(1000),
     *                   nintervals, intvlname(1000)
      character*30 intvlname
      real start, total, overall_start, min, max, tt(2)
      integer*4 count, nintervals
      real period
      period = etime(tt) - start(interval)
      total(interval) = total(interval) + period
      count(interval) = count(interval) + 1
      if (period.lt.min(interval)) min(interval)=period
      if (period.gt.max(interval)) max(interval)=period
      return
      end
c--------------------------------
      subroutine exit_intervals(filename)
      character*(*) filename
      common /intvldata/ start(1000),count(1000),
     *                   total(1000), overall_start,
     *                   min(1000), max(1000),
     *                   nintervals, intvlname(1000)
      character*30 intvlname
      parameter(overhead_etime=0.71E-6)
      real start, total, overall_start, min, max, tt(2)
      real overhead_etime
      integer*4 count, nintervals
      real overall_end
      character*200 buffer, output_line
      overall_end = etime(tt)
      open(file=filename, unit=83, status='unknown')
      do i=1,nintervals
         if (count(i) .ne. 0) then
            buffer(:) = ' '
            write(buffer,10) intvlname(i)(:),
     *           count(i),
     *           (total(i)-overhead_etime*count(i))/count(i),
     *           min(i)-overhead_etime,
     *           max(i)-overhead_etime,
     *           total(i)-overhead_etime*count(i)
 10         format(1x,a30,1x,i7,' AVE: ',f12.6,' MIN: ', f12.6,
     *             ' MAX: ', f12.6,' TOT: ',f12.6)
C           call xqueexe(buffer, output_line, length)
C           write(83,15) output_line(1:length)
            write(83,15) buffer
 15         format(1x,a)
         endif
      end do
      write(83,20)
 20   format(1x)
      write(83,30) overall_end-overall_start-overhead_etime
 30   format(1x,'OVERALL time - ', f11.6,' - - - - - - ')
      return
      end
c--------------------------------
      subroutine xqueexe(string_in, string_out, length)
c     Squeeze runs of blanks out of string_in; the squeezed string and
c     its length are returned in string_out and length.
      character*(*) string_in
      character*(*) string_out
      integer length
      integer length_in, out, in
      logical inblanks
      length_in = len(string_in)
      inblanks = .false.
      out = 0
      do in=1,length_in
         if (string_in(in:in) .ne. ' ') then
            out = out + 1
            string_out(out:out) = string_in(in:in)
            inblanks = .false.
         elseif (.not. inblanks) then
            out = out + 1
            string_out(out:out) = string_in(in:in)
            inblanks = .true.
         endif
      end do
      length = out
      return
      end
The CPU Performance Counters library (cpc) is available on Solaris 8 and above. Using this library we implemented the instrumentation for Polaris. The source file is interval_cpc.c, and you may also link with the library libinstrcpc.a directly. Both of them are provided in the
Turn the "instrument" switch on. You should also turn "instr_pcl" on if you are interested in all the processors.
Step 2: Compile the instrumentation code, if interval_cpc.c is used directly.
To compile this code, please use the Sun 6.2 C compiler:

cc -fast -xarch=v8plusa -xopenmp interval_cpc.c

To instrument the code exclusively, please define _EXCLUSIVE, e.g.:

cc -fast -xarch=v8plusa -xopenmp -D_EXCLUSIVE interval_cpc.c

For multiple processors, please define _OPENMP and _MT:

cc -fast -xarch=v8plusa -xopenmp -D_OPENMP -D_MT interval_cpc.c

You may also use my precompiled object file or library.
For inclusive instrumentation: interval_cpc.o and libinstrcpc.
For exclusive instrumentation: interval_cpc_e.o and libinstrcpce.

N.b. Because the hardware counter has only 4 bytes, it will overflow if an interval is too long. Therefore, a timer is added that resets the hardware counter in order to avoid overflow. The timer period should be short enough to avoid overflow, yet as long as possible to reduce instrumentation overhead. You may define the period by setting the macro _OVRFLSEC, e.g. "-D_OVRFLSEC=8". By default, it is 8 for 500MHz machines.
n.b. If you want to trace the latency of each iteration, you may define _TRACE.
N.b. To avoid thread migration, I bind all the threads to the CPUs. Otherwise, migration would cause big problems for the instrumentation. The bad news is that this may have some side effects on performance.
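(For reference, and not part of the instrumentation itself: Solaris also lets you bind a running process to a processor by hand with the pbind(1M) command. The PID below is purely illustrative.)

>> pbind -b 0 2257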
Step 3: Compile with the instrumented code.

To instrument the parallel code, you must include the following flags on your Fortran compile line: -fast -xarch=v8plusa -openmp -mt
Make sure you link the cpc library (-lcpc).
Step 4: Set the environment.
CPC does counting based on the Performance Control Register (PCR). The INSTRCPC library uses the PERFEVENTS environment variable to specify the type of counters.
By default, it is set to "Cycle_cnt,Instr_cnt" to get the number of cycles and the number of instructions, and thus to compute CPI (cycles per instruction, i.e. Cycle_cnt/Instr_cnt).
E.g., to get the cache hit ratio, we may do:

setenv PERFEVENTS "pic0=EC_ref,pic1=EC_hit"

The syntax for setting counter options is:
pic0=<eventspec>,pic1=<eventspec>[,sys][,nouser]

This syntax, which reflects the simplicity of the options available using the %pcr register, forces both counter events to be selected. By default only user events are counted; however, the sys keyword allows system (kernel) events to be counted as well. User event counting can be disabled by specifying the nouser keyword. The keywords pic0 and pic1 may be omitted; they can be used to resolve ambiguities if they exist.
LAST STEP: Run the program!
1 Instruction Execution Rates
Cycle_cnt [PIC0,PIC1]
Accumulated cycles. This is similar to the SPARC-V9 TICK register, except that cycle counting is controlled by the PCR.UT and PCR.ST fields.

Instr_cnt [PIC0,PIC1]

The number of instructions completed. Annulled, mispredicted, or trapped instructions are not counted.

Using the two counters to measure instruction completion and cycles allows calculation of the average number of instructions completed per cycle.
2 Grouping (G) Stage Stall Counts
These are the major cause of pipeline stalls (bubbles) from the G Stage of the pipeline. Stalls are counted for each clock that the associated condition is true.
Dispatch0_IC_miss [PIC0]

I-buffer is empty from I-Cache miss. This includes E-Cache miss processing if an E-Cache miss also occurs.

Dispatch0_mispred [PIC1]

I-buffer is empty from Branch misprediction. Branch misprediction kills instructions after the dispatch point, so the total number of pipeline bubbles is approximately twice as big as measured from this count.

Dispatch0_storeBuf [PIC0]

Store buffer can not hold additional stores, and a store instruction is the first instruction in the group.

Dispatch0_FP_use [PIC1]

First instruction in the group depends on an earlier floating point result that is not yet available, but only while the earlier instruction is not stalled for a Load_use (see 3). Thus, Dispatch0_FP_use and Load_use are mutually exclusive counts.

Some less common stalls are not counted by any performance counter, including one cycle stalls for an FGA/FGM instruction entering the G stage following an FDIV or FSQRT.

3 Load Use Stall Counts

Stalls are counted for each clock that the associated condition is true.

Load_use [PIC0]

An instruction in the execute stage depends on an earlier load result that is not yet available. This stalls all instructions in the execute and grouping stages. Load_use also counts cycles when no instructions are dispatched due to a one cycle load-load dependency on the first instruction presented to the grouping logic.

Load_use_RAW [PIC1]

There is a load use in the execute stage and there is a read-after-write hazard on the oldest outstanding load. This indicates that load data is being delayed by completion of an earlier store.

Some less common stalls are not counted by any performance counter. There are also overcounts due to, for example, mispredicted CTIs and dispatched instructions that are invalidated by traps.
4 Cache Access Statistics
I-, D-, and E-Cache access statistics can be collected. Counts are updated by each cache access, regardless of whether the access will be used.
IC_ref [PIC0]
I-Cache references. I-Cache references are fetches of up to four instructions from an aligned block of eight instructions. I-Cache references are generally prefetches and do not correspond exactly to the instructions executed.

IC_hit [PIC1]

I-Cache hits.

DC_rd [PIC0]

D-Cache read references (including accesses that subsequently trap). Non-D-Cacheable accesses are not counted. Atomic instructions, block loads, "internal" and "external" bad ASIs, quad precision LDD, and MEMBARs also fall into this class.

DC_rd_hit [PIC1]

D-Cache read hits are counted in one of two places:
- When they access the D-Cache tags and do not enter the load buffer (because it is already empty)
- When they exit the load buffer (due to a D-Cache miss or a nonempty load buffer).

Loads that hit the D-Cache may be placed in the load buffer for a number of reasons; for example, the load buffer was not empty. Such loads may be turned into misses if a snoop occurs during their stay in the load buffer (due to an external request or to an E-Cache miss). In this case they do not count as D-Cache read hits.

DC_wr [PIC0]

D-Cache write references (including accesses that subsequently trap). Non-D-Cacheable accesses are not counted.

DC_wr_hit [PIC1]

D-Cache write hits.

EC_ref [PIC0]

Total E-Cache references. Non-cacheable accesses are not counted.

EC_hit [PIC1]

Total E-Cache hits.

EC_write_hit_RDO [PIC0]

E-Cache hits that do a read-for-ownership UPA transaction.

EC_wb [PIC1]

E-Cache misses that do writebacks.

EC_snoop_inv [PIC0]

E-Cache invalidates from the following UPA transactions: S_INV_REQ, S_CPI_REQ.

EC_snoop_cb [PIC1]

E-Cache snoop copy-backs from the following UPA transactions: S_CPB_REQ, S_CPI_REQ, S_CPD_REQ, S_CPB_MSI_REQ.

EC_rd_hit [PIC0]

E-Cache read hits from D-Cache misses.

EC_ic_hit [PIC1]

E-Cache read hits from I-Cache misses.

The E-Cache write hit count is determined by subtracting the read hit and the instruction hit count from the total E-Cache hit count. The E-Cache write reference count is determined by subtracting the D-Cache read misses (D-Cache read references minus D-Cache read hits) and I-Cache misses (I-Cache references minus I-Cache hits) from the total E-Cache references. Because of store buffer compression, this is not the same as D-Cache write misses.
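Since the subtraction rules in the last paragraph are easy to get wrong, here is a small Fortran sketch of the derived write statistics; the variables are illustrative holders for the raw counter values, not an actual counter API:

      SUBROUTINE ecwrite(ec_ref, ec_hit, ec_rd_hit, ec_ic_hit,
     *                   dc_rd, dc_rd_hit, ic_ref, ic_hit)
c     Illustrative only: derive E-Cache write statistics from
c     raw counter values passed in by the caller.
      INTEGER*8 ec_ref, ec_hit, ec_rd_hit, ec_ic_hit
      INTEGER*8 dc_rd, dc_rd_hit, ic_ref, ic_hit
      INTEGER*8 ec_wr_hit, ec_wr_ref
c     Write hits = total hits - read hits - instruction-fetch hits.
      ec_wr_hit = ec_hit - ec_rd_hit - ec_ic_hit
c     Write refs = total refs - D-Cache read misses - I-Cache misses.
      ec_wr_ref = ec_ref - (dc_rd - dc_rd_hit) - (ic_ref - ic_hit)
      PRINT *, 'E-Cache write hits:', ec_wr_hit
      PRINT *, 'E-Cache write refs:', ec_wr_ref
      END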
There is a command called "perfex" which can be used to access the hardware counters. Using its library, libperfex, we implemented the instrumentation for Polaris as well. The source file is interval_perfex.c, provided in
Turn the "instrument" switch on. You should also turn "instr_pcl" on if you are interested in all the processors.
Step 2: Compile the instrumentation code.
Step 4: Set the environment. You can use two different counters at once; the two corresponding environment variables are T5_EVENT0 and T5_EVENT1. To find the values to set, please see the man page of perfex.
LAST STEP: Run the program!
The OMP_NUM_THREADS environment variable determines the number of threads (and thus processors) that will be used during execution. If you are using csh, you can set this variable to, for example, 4 by typing:
setenv OMP_NUM_THREADS 4
You should also set the environment variable PARALLEL to 1. This variable must be set or else any timers used by the program will return incorrect timings (see the etime man page for more details).
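For example, in csh:

setenv PARALLEL 1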
In order for your application to find the Guide libraries, you must set your LD_LIBRARY_PATH environment variable to ~paramnt/tools/guide38/lib/32; see the FAQ on how to use the Purdue-installed version of the KAPPro tools under KAPPro_Toolset.
You can find more details about other OpenMP variables at the OpenMP webpage.
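If you want to double-check that the setting took effect, a minimal test program (our sketch, not part of the FAQ's tools) can print the thread count from inside a parallel region; compile it with a thread-safe OpenMP compiler, e.g. guidef77 -mt:

      PROGRAM thrchk
      INTEGER omp_get_num_threads
!$OMP PARALLEL
!$OMP MASTER
c     Should print the value you gave OMP_NUM_THREADS.
      PRINT *, 'running with', omp_get_num_threads(), 'threads'
!$OMP END MASTER
!$OMP END PARALLEL
      END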
Of course there are many reasons that a program may crash, but a common reason is that the stack size is too small. If your program crashes as soon as you begin to execute it (I mean instantly!), this may be the problem. Try increasing the stacksize by typing:
>> limit stacksize n
where n is the size you'd like (in kbytes). You can see what the current size is by typing "limit" by itself:
peta.ecn.purdue.edu 52: limit
cputime         unlimited
filesize        unlimited
datasize        2097148 kbytes
stacksize       8192 kbytes
coredumpsize    unlimited
vmemoryuse      unlimited
descriptors     64
Sometimes the backend compiler will warn you if it thinks that the stacksize is very large, other times it won't. It usually is a good idea to try increasing the stacksize before spending hours trying to find the "bug" in your program.
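For example, to raise the limit to 64 Mbytes (the value here is just an illustration):

>> limit stacksize 65536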
There are several reasons that this may happen: (1) you aren't really running the program in parallel (believe me that this is not that uncommon for a new user! Or sometimes even an experienced user), (2) Polaris did not do a good job of parallelizing your application (this isn't that uncommon either, especially for very large applications) or (3) your system is heavily loaded and using extra processors actually slows the application down. I'll briefly discuss each of these.
1) Your program isn't really running in parallel: see the question "How can I tell if my program is running in parallel?" to check this. If you are new to using Polaris or the environment at Purdue, check this first; it may keep you and your application from wasting a lot of time!
2) Polaris did not do a good job of parallelizing your application: Studies have shown that automatic parallelization does well in only about 1 in 2 programs, and these studies were done with benchmark programs, not really big applications. To see if this is the problem, you'll probably need to characterize your application and look into tuning it by hand. You can look at http://min.ecn.purdue.edu/~ipark/UMinor/meth_index.html for help on this subject.
3) Your system is heavily loaded: Parallel programs use more than 1 processor. If your multiprocessor is heavily loaded then running an application in parallel may only increase memory and bus contention. If you are trying to tune and time a new program, try to do it on a quiet machine or in a single-user environment if possible. Also see "How can I tell if the machine I'm using is heavily loaded?".
Purdue is an associate member of the SPEC High-Performance group. The ParaMount group participates actively in SPEC activities. We have access to most SPEC benchmark suites, including the High-Performance benchmarks (SPEChpc96), the CPU benchmarks (SPECfp2000 and SPECint2000, also the older SPEC95 benchmarks), the SPEC Web benchmarks (SPECweb99), and several SPEC graphics benchmarks.
All Purdue members are allowed to use these benchmarks. The benchmarks come with certain rules. For example, you are not allowed to distribute the benchmarks outside Purdue; you are not allowed to use certain SPEC metrics for quoting performance results (unless you adhere strictly to the SPEC runrules). However it is allowed, and common, to use the benchmarks in research experiments and report these results (not using SPEC metrics). SPEC has some recommendations for using its benchmarks in research papers.
To get a copy of SPEC benchmarks, you can borrow our SPEC CD. Send mail to eigenman@purdue.edu.
You can find more information about SPEC and its benchmarks at www.spec.org.
Both versions of Stanford's SUIF compiler (versions 1 and 2) are installed.
This page is under development.
This driver executes a series of passes to parallelize a code.
(See the man pages provided with the SUIF distribution for more. Much of this data was gathered from the man pages.)
However, some jobs on the ECN-supported Suns cannot be stopped or delayed. See the FAQ for more on what may be influencing your timings on our Suns.
ECN-related tasks are not always predictable. That is, the backups may occur in the early morning (around 2 or 3am on peta), but could also occur later if there is some delay in the backups. The times to generally watch out for are late evening (8pm to 10pm) and 2 to 3:30 in the morning.

What is more, you cannot easily see if a backup is occurring by using "ps" to look at the processes, because they occur from remote machines.

peta is most affected by backups and remote accesses since it is home for a software RAID array. Every access to a disk causes a CPU load, even when the accesses are from a remote machine.
Condor and PUNCH must have certain daemons running at all times or the Hub will not work properly. You can see these jobs by doing 'ps' on the system (see the FAQ entry "How can I tell if the machine that I'm using is heavily loaded?").
Our systems are connected via NFS, meaning that the disks can be mounted by other systems. When a remote machine accesses the local disks, the disk access latency increases. On peta, there is a CPU load for every access to the 32 GB of disk space because it uses a software RAID array.
SYNOPSIS
You can supply your executable with parameters, just as if you executed a command line.
-m machine | single-user machine to run your job on (currently runs on lagavulin) |
-d run-directory | directory to run your executable in |
-omp | number of OpenMP threads, sets OMP_NUM_THREADS |
-np | number of processors, sets NTASK used by mpi |
-t | maximum time to allow your code to execute, default is 5 minutes |
-csh -sh -tcsh -ksh | You may specify a subshell to execute your executable in, such as '/bin/sh myshellscript' |
-p | copy path; by default, all environment variables other than path and LD_LIBRARY_PATH, host, and pwd variables are copied. If you also want the two path variables to be copied then include the -p option. |
executable & parameters
After the options you must include an executable, which can be any executable file. It can be a script that sets certain environment variables and then executes an executable, a script that executes several executables, or just a plain executable file.
If parameters are necessary for the executable then just place them after the executable, as you would on the command line; see the example below.
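For example, a plausible invocation using only the options documented above (the run directory, program name, and argument are made up for illustration):

>> qsub -m lagavulin -d /tmp/mytest -omp 4 swim_parallel input.dat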
qsub will create the specified run directory if it does not already exist.
Your job will be killed using at -s KILL after the MAXTIME specified (or 5 minutes by default) from the time your script begins to execute.
NOTE: you must have your account set up to allow sut to log in as you. sut will do the following to run the script that qsub creates:
/home/peak/a/sut/suQueue_completed where all the completed jobs get placed.
/home/peak/a/sut/suQ_sh/qsub
/home/peak/a/sut/suQ_sh/suQ-monitor.sh
This answer is stolen from the polaris web page at UIUC:
"The Polaris compiler takes a Fortran77 program as input, transforms this program so that it runs efficiently on a parallel computer, and outputs this program version in one of several possible parallel Fortran dialects. The input language includes several directives which allow the user of Polaris to specify parallelism explicitly in the source program. The output language of Polaris is typically in the form of Fortran77 plus parallel directives as well. For example, a generic parallel directive set includes the directives "CSRD$ PARALLEL" and "CSRD$ PRIVATE a,b", specifying that the iterations of the subsequent loop shall be executed concurrently and that the variables a and b shall be declared "private to the current loop", respectively. Another output language that Polaris can generate is the Fortran plus the directive language available on the SGI Challenge machine. (The Purdue version also supports the OpenMP API). Polaris performs its transformations in several "compilation passes". In addition to many commonly known passes, Polaris includes advanced capabilities performing the following tasks: array privatization, data dependence testing, induction variable recognition, interprocedural analysis, and symbolic program analysis. An extensive set of options allow the user and the developer of Polaris to experiment with the tool in a flexible way. An overview of the Polaris transformations is given in the Publication Automatic Detection of Parallelism: A Grand Challenge for High-Performance Computing. The implementation of Polaris consists of 170,000 lines of C++ code. A basic infrastructure provides a hierarchy of C++ classes that the developers of the individual compilation passes can use for manipulating and analyzing the input program. This infrastructure is described in the Publication "The Polaris Internal Representation".
The most up to date version of the polaris executable is kept in the /home/peak/a/paramnt/tools/bin directory. You should add this to your path if it is not already there. Next, in order to access the polaris man pages, you should add the /home/peak/a/paramnt/tools/man directory to your man path. To ensure that polaris uses the most up to date switch file, set the POLARIS_ROOT environment variable to /home/peak/a/paramnt/tools/polaris.
An example, using the csh is given below. You may want to add these commands to your .cshrc so that you don't have to retype them each time you log in:
>> setenv PATH /home/peak/a/paramnt/tools/bin:$PATH
>> setenv MANPATH /home/peak/a/paramnt/tools/man:$MANPATH
>> setenv POLARIS_ROOT /home/peak/a/paramnt/tools/polaris
The command line interface is explained in detail in the polaris man page (read by typing: ``man polaris'').
In this section I will give a step by step tutorial on how to generate a parallel program using polaris. First, enter the sample program shown below into a file named example.f. If you're not familiar with Fortran 77, you need to know that spaces are important!! The first 6 columns may only be used for statement labels (columns 1-5) or to signify a continuation line (column 6); executable statements must begin in column 7 or later. Therefore all of the lines in the figure start with at least 6 spaces.
      PROGRAM EXAMPLE
      REAL A(100),B(100)
      INTEGER I
      DO I = 1,100
         A(I) = I
         B(I) = 100 - I
      ENDDO
      DO I = 2,100
         A(I) = A(I-1)
      ENDDO
      DO I = 1,100
         WRITE(6,*) A(I), B(I)
      ENDDO
      END

Figure: Example Program
You can run polaris on example.f by simply typing:
>> polaris -version new example.f
The polaris command found in /home/peak/a/paramnt/tools/bin is actually a script that allows you to select from several versions of polaris. The -version flag controls which version is used, and using "new" will cause the newest version to be called. Typing the command as given above should generate many lines of output, including the resulting program shown in the figure below (the code may differ slightly depending on what the current default setup of polaris is).
      PROGRAM example
      INTEGER*4 i, numthreads, omp_get_max_threads
      REAL*4 a, b
      DIMENSION a(100), b(100)
      COMMON /polaris/ numthreads
      numthreads = omp_get_max_threads()
!$OMP PARALLEL
!$OMP+DEFAULT(SHARED)
!$OMP+PRIVATE(I)
!$OMP DO
CSRD$ LOOPLABEL 'EXAMPLE_do#1'
      DO i = 1, 100, 1
         a(i) = i
         b(i) = 100+(-i)
      ENDDO
!$OMP END DO NOWAIT
!$OMP END PARALLEL
CSRD$ LOOPLABEL 'EXAMPLE_do#2'
      DO i = 2, 100, 1
         a(i) = a((-1)+i)
      ENDDO
CSRD$ LOOPLABEL 'EXAMPLE_do#3'
      DO i = 1, 100, 1
         WRITE (6, *) a(i), b(i)
      ENDDO
      STOP
      END

Figure: Example Output

You can see in the example that Polaris finds the first loop, EXAMPLE_do#1, to be parallel, while it finds the remaining two loops to be serial. The code part of the output is also stored in a file called example_P.f.
Polaris allows the user to have greater control over the parallelization process through the use of command line switches. Switches appear on the command line after -s. The various switches are described in detail in the polaris man page. As an example, we could parallelize a program using the SGI directives, instead of the default OpenMP directives by calling polaris as follows:
>> polaris -version new -s output_lang=6 example.f
The default settings of all of the switches are kept in a file called the switches file. This file is located in /home/peak/a/paramnt/tools/polaris. Since the script allows you to run different versions of polaris, there are several versions of the "switches" file; the script will automatically select the correct version. If you want to modify this file you can copy it to your local directory and then set the environment variable POLARIS_ROOT to the directory in which it resides. However, it is strongly suggested that you do not do this, since changes in the default switch file will not propagate into your version.
You can use the polaris executable directly; it is found in /home/yara/re/mv/Master/bin/sparc-sun-solaris2.6. Make sure that you then change your POLARIS_ROOT to /home/yara/re/mv/Master and update your PATH variable so that the correct executable is chosen.
You need not install polaris to use it. You can use polaris across the internet using the Parallel Programming Hub: http://punch.ecn.purdue.edu/ParHub. You can get a free account and try it out if you'd like. There are more details on the Hub itself.
Polaris by default places the output into a file called *_P.f, where * is the source file name. Polaris is a source-to-source restructurer, meaning that it accepts Fortran as input and generates Fortran as output. The output Fortran must then be compiled with a backend compiler. We use the OpenMP parallel directives by default, and programs using these directives must be compiled using guidef77. (This only applies to users at ECN; other users will have to find their own OpenMP compiler!) The guidef77 executable is also in /home/peak/a/paramnt/tools/bin so you probably do not need to change your path. The guidef77 manual can be found in /home/yara/re/mv/tools/guide38/docs/GuideF_Reference.pdf or also as a postscript file in the same directory. It takes almost the same options as f77 (man f77).
We only have a license for guidef77 on one server: peta.ecn.purdue.edu, so you must be logged into peta in order to compile the output. You do need to have some environment variables set (or add the following to them if you already have them set):
>> setenv GUIDE_HOME /home/yara/re/mv/tools/guide38
>> setenv LD_LIBRARY_PATH /home/yara/re/mv/tools/guide38/lib
To compile a program stored in a file called example_P.f you can type:
>> guidef77 -fast -stackvar -mt -o example example_P.f
I'll briefly explain the flags I used, although a better description can be found in the f77 man page. First, -fast is short for a collection of switches that should be used to generate a well-optimized executable (it does machine-specific things though, so if you compile using -fast you should not run the result on a different machine). -stackvar forces all local variables to be allocated on the stack (this is recommended when running in parallel). And finally, -mt links in the thread-safe libraries. A typical compilation proceeds as shown below:
peta.ecn.purdue.edu 4: guidef77 -fast -stackvar -mt -o example example_P.f
WARNING: guidef77 does not support switch, passed through to Fortran compiler: -fast
Guide 3.6 k310744 19990126  19-Aug-1999 16:54:45
Guide 3.6 k310744 19990126: 0 errors in file example_P.f
G_output.f:
 MAIN example:
pkexample_:
peta.ecn.purdue.edu 5:
In this directory there is a script, runtests. For each file in poltests that ends with .f, it will run polaris, then try to compile the file with guidef77 and f95, and run each compiled version on 4 processors. The output of the programs placed in this directory should go to standard out and contain 'PASSED' if they validate. The tests should be self contained and not require other files to be linked.
There are at least 3 examples in the directory: end.f, noname.f, and TFS.f. noname.f is an example that fails in Polaris. TFS.f compiles but fails at runtime when compiled with f95.
The script will identify if Polaris fails, f95 fails, guidef77 fails, or the run does not validate, and send the list of files that fail to the appropriate developers of Polaris.
So the output for the current examples would be:
Thu May 24 18:30:18 EST 2001
------------------------
TFS.f_failed_at_runtime_with_f95
TFS.f_failed_at_runtime_with_f95.out
noname.f_failed_in_polaris
This output is also stored in a file, which is in this case: regressions-2001-May-24
If you find bugs, or just have good tests for Polaris, feel free to populate this directory. Also, when you find a bug, try to generate a small test case that can work in this test directory. We can then track bugs as they appear and disappear.
To run a single test you can use the runtest script. To test just noname.f, you'd type:
> runtest noname.f
No match
Abort (core dumped)
This would cause generation of noname.f_failed_in_polaris that holds the output from polaris:
POLARIS: Assertion failed: file String.cc, line 320 :
Assign NULL value through String::operator =
Polaris is set up with CVS (Concurrent Versions System). On Purdue ECN machines, you can check out a version of Polaris by doing the following:
To begin compiling Polaris, you must set one variable (either by the environment or by adding it to Rules.make in the cvdl directory).
In Rules.make:
POLARIS_DIR =
You will also need to make the omega library.
Then, to make polaris:
In order to commit changes to polaris, we have to wait until we have regression tests in place.
For simplicity, assume you would like to add a directive of the form:

CSRD$ VARRANGE var,lb,ub

With this directive, you intend for users to be able to indicate the range a variable can have, i.e., lb <= var <= ub.
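To picture the intended use, a hypothetical program fragment with the proposed directive might look like this (the program itself is made up for illustration):

      PROGRAM rngdemo
      REAL a(100)
      INTEGER i, n
      n = 50
c     The directive asserts 1 <= n <= 100; Polaris could exploit
c     this bound when analyzing the loop that follows.
CSRD$ VARRANGE n,1,100
      DO i = 1, n
         a(i) = 0.0
      ENDDO
      PRINT *, a(1)
      END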
Files you will have to create:
Files you will have to modify:
parser_csrd_range[ParserContext &context] "VARRANGE directive" :
    << Put any declarations needed for this rule here. >>
    "VARRANGE"
    "@" << s = &context.s_next; >>
    ;

Using the grammar rules, you can parse what follows the "CSRD$ VARRANGE". s is the statement the assertion will be attached to; you can also attach it to a do-loop statement (i.e., require that the directive directly precede a do-loop statement).
enum AssertionType { ... AS_VARRANGE, ... };

Add the assertion to AssertionType.
... #pragma implementation "AssertRange.h" ...
... #include "AssertRange.h" ...

Add the pragma and include the appropriate header file, just as with the existing assertions. Also implement the constructors, destructor, print, listable_clone, operator=, and clone methods (see other assertions).
void generate_csrd_range_directive( Statement &s, Assertion &ap ); // generate a CSRD$ VARRANGE directive
...
void Directive::generate_csrd_directive( Statement &s )
{
    ...
    case AS_VARRANGE:
        generate_csrd_range_directive( s, iter.current() );
        break;
    ...
}

Add the case to the "generate_csrd_directive" method.
See base/Expression/intrin.h and scanner/intrin.h. NOTE: This answer is not yet complete.