The ParaMount Group FAQ

Here you will find frequently asked questions and answers for ParaMount group members, Purdue staff, and non-Purdue visitors. If you see references to internal machine names, filenames, or tools, they may not be accessible outside the ParaMount group or Purdue. If you are interested in such items, feel free to send us mail at paramnt@ecn.purdue.edu
Last Update: Wed Jun 4 16:11:56 EST 2003



FAQs about using the FAQ

How do I add a question/answer to the FAQ?
How do I add a new topic to the FAQ?
I've added a question, how do I update the FAQ webpage?


Graphing and Displaying Data

How Do I Create an EPS File From an Excel Chart?
How do I create an EPS file on Linux?


The KAP/Pro Toolset

Where is KAI's KAP/Pro FAQ?
How do I compile with the KAPPro Guide tools on the ParaMount machines?


Monitoring a Parallel Machine

How can I tell if my program is running in parallel?
How can I tell if the machine that I'm using is heavily loaded?


Performance Tuning Tricks

How can I profile a program?
How can I instrument a program using hardware counters?
How do I instrument a program using hardware counters on SGI Origin machines?
My application is 32-bit but runs out of heap space after only 2GB. What is the problem?


Running OpenMP Programs

How should I set up my environment to run an OpenMP program?
My program crashes as soon as I run it, what's up?
I know that I did everything right but I get no speedup, why?


Questions about SPEC Benchmarks

How can I obtain SPEC Benchmarks?


The SUIF Compiler

Which version of SUIF is installed in the ParaMount directories?
How do I run SUIF from ParaMount's installation?
How do I parallelize a Fortran code with SUIF 1?
What other passes/utilities are provided with SUIF and installed?


Single-User Access

How can I get single-user access to ParaMount SUN machines?
Why are my timings in single-user time longer than expected?
How do I submit to the single-user batch queue? (qsub)


The Polaris Compiler

What is Polaris?
Can I Get My Own Copy of Polaris?
I have an ECN Account, how do I set up my environment so that I can run polaris?
How do I run polaris (as a command-line tool)?
How do I change how polaris behaves?
What is a switch?
What is a switches file?
Can I run the polaris executable directly, i.e. without using the script?
How can I run polaris on the Web?
What is PUNCH?
What/where is the Parallel Programming Hub?
How do I compile the output of Polaris (the _P.f) file?
Where are the Polaris regression tests?
How are the Polaris regression tests run?
How do I add a test to Polaris' regression tests?
How can I obtain a development copy of Polaris?
How can I add a user directive to Polaris' vocabulary?
How can I control Polaris passes at a fine level?
How can I make Polaris aware of an intrinsic function?

FAQs about using the FAQ

How do I add a question/answer to the FAQ?

Only members of the ParaMount group, i.e. the unix group paramnt, have proper permissions to directly modify the FAQ. The FAQ is contained in the directory: /home/yara/re/paramnt/WWW/FAQ/

This answer will tell you how to add a question to an existing topic. If you want to add a new topic, see "How do I add a new topic to the FAQ?"

The FAQ directory contains subdirectories that correspond to topics, e.g. SEC_Running_OpenMP_Programs. To add a question/answer to an existing topic, you can simply add a file to the corresponding directory. The file should be in html format and be named *.html, where the * cannot contain the string "HEADER". I typically just name things Qnnn.html where nnn is a number.

The question is automatically identified by a script that updates the FAQ. The question must appear on a line by itself. The line preceding the question must be <h3> and the line after the question must be </h3>. No other level 3 headings can be used in the file. Other than this requirement, any html elements may be used throughout the file. Multiple questions can be linked to the same file by listing them all in the h3 block. An example question/answer is shown below:

<h3>
Can I Get My Own Copy of Polaris?
Can I Download Polaris From Somewhere?
</h3>

If you are outside of Purdue see 
<a href="http://polaris.cs.uiuc.edu/polaris/README">
http://polaris.cs.uiuc.edu/polaris/README</a>.
If you are within Purdue send mail to polaris@ecn.purdue.edu
(we can give you our modified version, i.e. with the 
OpenMP directives etc...)  If you simply want to use
Polaris, there are easier ways than to install it
yourself: (1) if you're at Purdue we already have
a public copy and (2) regardless of where you are,
you can use the <a href="http://punch.ecn.purdue.edu/ParHub/">
Parallel Programming Hub</a>. 

The example will cause two questions: "Can I Get My Own Copy of Polaris?" and "Can I Download Polaris From Somewhere?" to be included in the FAQ and be linked to the same answer.

After adding a question, you must update the FAQ by running the update_faq script from within the /home/yara/re/paramnt/WWW/FAQ directory.
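For example, the whole round trip might look like this in a shell (the file name Q042.html is an arbitrary choice following the naming convention above):

cd /home/yara/re/paramnt/WWW/FAQ/SEC_Running_OpenMP_Programs
vi Q042.html     # write the <h3>question</h3> heading and the answer body
cd /home/yara/re/paramnt/WWW/FAQ
update_faq       # regenerate the FAQ webpage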

How do I add a new topic to the FAQ?

Only members of the ParaMount group, i.e. the unix group paramnt, have proper permissions to directly modify the FAQ. The FAQ is contained in the directory: /home/yara/re/paramnt/WWW/FAQ/

The FAQ directory contains subdirectories that correspond to topics, e.g. SEC_Running_OpenMP_Programs. To add a new topic, first create a new directory. The name of the directory is not really important, but by convention I have started them all with SEC_

In the new directory, you must add a HEADER.html file to provide the name of the topic. The file should look like the one below:

<hr>
<center>
<h3>
The Topic Name Goes Here
</h3>
</center>
It is probably safest to simply copy an existing HEADER.html and just modify the topic name.

I've added a question, how do I update the FAQ webpage?

You can automatically generate a new FAQ webpage by executing the update_faq script from within the /home/yara/re/paramnt/WWW/FAQ directory.


Graphing and Displaying Data

How Do I Create an EPS File From an Excel Chart?

Using Microsoft Office 2000 on a Windows NT machine, this is what you can do:
  1. In the Page Setup menu remove all margins and any header/footer.
  2. Also in Page Setup make the printing orientation Portrait.
  3. In the Print menu choose a postscript printer, such as puccps or mathg109chp.
  4. Print to file, give it a name with a ".ps" suffix.
  5. The printer file from the previous step will be located in the directory that Excel is pointing to (i.e., the directory you see when you choose Open from the File menu).
  6. Open the printer file in GSview.
  7. Choose the PStoEPS option in the File menu.
  8. Automatically calculate the bounding box (this gets rid of the white space).
  9. Open the new EPS file in a text editor and delete all the lines in between.
Voila!

This EPS file can be included in LaTeX documents.

(If you know of an easier way, please let us know.)

Here is another easy way to make an EPS from any kind of document: You need Adobe Acrobat and Distiller 4.0 or newer on a Windows machine.

  1. Make whatever document you want.
  2. Print that document to a pdf file using the printer named "Acrobat Distiller".
  3. Open the pdf file using Adobe Acrobat.
  4. Use Crop tool to capture the part you need.
  5. Export as an EPS file. (Save as an EPS file for Acrobat 5.0 or newer.)
That's it. You may change the resolution and other options in the printer's Properties dialog.

How do I create an EPS file on Linux?

  1. Open the document you want to use to create the EPS file. This can be an Excel chart, a web page, or anything else.
  2. Go to another desktop and start the GIMP.
  3. Select Acquire from the file menu. This will bring up several options.
  4. Select capture window without decorations and set the delay to 4 seconds. Click OK.
  5. Go back to the first desktop, and when the cursor changes to a crosshairs, click on the window containing your document.
  6. After a few seconds, a GIMP window will appear with the image. Select only the part of the image you need, then right click on it and select Copy from the Edit menu.
  7. Right click again, and this time choose Paste as New from the Edit menu.
  8. Another image window will appear. Close the first image window and say OK when it asks for confirmation.
  9. Right click on the new image and select Flatten from the Layers menu.
  10. Right click on the new image and select Save As from the File menu.
  11. Set the file format to PostScript and type in a filename that ends with .eps, then click Save.
  12. Check Encapsulated PostScript and click OK. You're done.

The KAP/Pro Toolset

Where is KAI's KAP/Pro FAQ?

KAI, the makers of the KAP/Pro toolset, maintain their own FAQ here.

How do I compile with the KAPPro Guide tools on the ParaMount machines?

Currently, ParaMount Research has KAPPro Guide (and Assure) licensed for lagavulin and peta only.
  1. Set your path to include ~paramnt/tools/bin.
  2. Compile with -lguide or -lguide_stats.
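For example, in csh (a sketch; the input file example_P.f is a hypothetical Polaris output file, and guidef77 is the Guide Fortran driver described elsewhere in this FAQ):

set path = ( ~paramnt/tools/bin $path )
guidef77 example_P.f -o example -lguide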

Monitoring a Parallel Machine

How can I tell if my program is running in parallel?

You can check whether your program is running in parallel, i.e. whether it is using more than one thread, by typing the following command:

ps -Lu username -o pid,gid,lwp,psr,s,comm

where username is your login.

This will give a table of your currently active processes and their lightweight processes (LWPs). If your program is running in parallel, then multiple instances of it should exist, and the running ones have O as their status, i.e. they are on a processor. For example, below the serial version of swim has only 1 thread. The parallel version always has multiple threads; however, when running on p processors, only p threads have the O status. The PSR column should generally be -, but if the LWPs are bound to a processor, the processor number will appear there.

peta.ecn.purdue.edu 140: ps -Lu mjvoss -o pid,gid,lwp,psr,s,comm
  PID   GID    LWP PSR S COMMAND
 1135     1      1   - S -tcsh
29884     1      1   - S -tcsh
 2242     1      1   - O swim_serial
29611     1      1   - S emacs

(a) A Serial Version of Swim

peta.ecn.purdue.edu 141: ps -Lu mjvoss -o pid,gid,lwp,psr,s,comm
  PID   GID    LWP PSR S COMMAND
 1135     1      1   - S -tcsh
29884     1      1   - S -tcsh
 2250     1      1   - O swim_parallel_on_1
 2250     1      2   - S swim_parallel_on_1
 2250     1      3   - S swim_parallel_on_1
 2250     1      4   - S swim_parallel_on_1
 2250     1      5   - S swim_parallel_on_1
29611     1      1   - S emacs

(b) A Parallel Version of swim running on 1 processor

peta.ecn.purdue.edu 146: ps -Lu mjvoss -o pid,gid,lwp,psr,s,comm
  PID   GID    LWP PSR S COMMAND
 1135     1      1   - S -tcsh
29884     1      1   - S -tcsh
 2257     1      1   - O swim_parallel_on_2
 2257     1      2   - S swim_parallel_on_2
 2257     1      3   - S swim_parallel_on_2
 2257     1      4   - S swim_parallel_on_2
 2257     1      5   - S swim_parallel_on_2
 2257     1      6   - O swim_parallel_on_2
 2257     1      7   - S swim_parallel_on_2
29611     1      1   - S emacs

(c) A Parallel Version of Swim running on 2 processors

How can I tell if the machine that I'm using is heavily loaded?

You can use the mpstat command (on multi-processor SUNs) to see how busy the processors are:
/usr/bin/mpstat [ interval [ count ] ]
For example, using "mpstat 1 5" will return the activity of the processors 5 times, with 1 second between each sample. Generally, the first sample or two are not accurate. (The first sample shows the average since the system startup time.) An example is shown below for a 6-processor system. The last column shows the percentage of each cpu that is currently idle.
peta.ecn.purdue.edu 167: mpstat 1 5
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0   17   0   83    13    0  218  175   18  140    0   297   40   3   2  55
  1   14   0  301    13    1  237  160   18  164    0    12   23   3   3  71
  4   13   0   55    13    0  194  166   18  153    0   379   33   2   3  62
  5   12   0  383    38   26  247  161   18  162    0     8   34   2   3  61
  8   13   0  379   219    9  183  130   19  156    0   323   28   2   3  67
 12   13   0   28    26   14  255  159   19  171    0   177   37   3   2  58
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    3   0   10     0    0   24    0    0    0    0    25    0   0   0 100
  1    0   0    0     2    2   39    0    0    1    0    24    0   0   0 100
  4    0   0    0     6    1   12    5    3    0    0     0   83   0   0  17
  5    0   0    0    10    8   16    2    5    0    0     1   17   0   0  83
  8    0   0   20   222   22   30    0    1    2    0    20    0   0   0 100
 12    0   0    0     7    0    7    7    0    0    0     0  100   0   0   0
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0     5    0   11    5    1    0    0    15   83   0   0  17
  1    0   0    0     3    3   53    0    3    0    0    38    0   0   0 100
  4    0   0    0     2    0    3    2    1    0    0     3   17   0   0  83
  5    0   0    0     2    2    8    0    0    0    0     0    0   0   0 100
  8    0   0   20   219   19   34    0    0    1    0    20    0   0   0 100
 12    0   0    0     7    0    7    7    0    0    0     0  100   0   0   0
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0     2    0    4    2    0    0    0     0   17   0   0  83
  1    0   0    0     2    2   58    0    0    0    0    44    0   0   0 100
  4    0   0    0     0    0    4    0    2    0    0    11    3   0   0  97
  5    0   0    0     9    4    6    5    1    0    0     0   83   0   0  17
  8    0   0   20   219   19   25    0    1    0    0    25    0   0   0 100
 12    0   0    0     7    0    7    7    0    0    0     0   97   0   0   3
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0     5    0    6    5    1    0    0     0   83   0   0  17
  1    0   0 8590     3    3   40    0    0    1    0    31    0   5   0  95
  4    0   0    0     7    0    7    7    0    0    0     0  100   0   0   0
  5    0   0    0     6    4    8    2    3    0    0     0   17   0   0  83
  8    0   0   20   219   19   54    0    1    2    0    43    0   0   0 100
 12    0   0    0     0    0    0    0    0    0    0     0    0   0   0 100

Performance Tuning Tricks

How can I profile a program?

Here is an example of program instrumentation and time profiling as I often do it. The example makes use of a simple instrumentation library, the source code of which I include directly in this answer.

Starting from the source code:

      program trivial

      REAL a(10),b(10)

      DO 100 k=1,5

      DO 10 i=1,10
         a(i)=i
 10   ENDDO

      DO 20 j=1,10
         b(j)=a(j)
 20   ENDDO

      print *,b

100   CONTINUE
      END
I enclose each loop with a "start-timer/stop-timer" pair. In addition, there are "init" and "finalize" calls at the beginning and end of the program, respectively. The init call initializes the library; the finalize call writes statistics about the collected times to a file.

The instrumented program looks like this:

      PROGRAM trivial
      REAL a(10), b(10)
      CALL instrument()

      DO 100 k=1,5

      CALL start_interval(1)
C  LOOPLABEL 'TRIVIAL_do10'
      DO 10 i = 1, 10
        a(i) = i
10    ENDDO
      CALL end_interval(1)
      CALL start_interval(2)
C  LOOPLABEL 'TRIVIAL_do20'
      DO 20 j = 1, 10
        b(j) = a(j)
20    ENDDO
      CALL end_interval(2)
      PRINT *, b

100   CONTINUE

      CALL exit_intervals('TRIVIAL.sum')
      END

      SUBROUTINE instrument
c This subroutine maps loop names to loop numbers. This info is used by 
c exit_intervals() to generate the printable summary.
      CALL init_intervals('')
      CALL enter_interval(1, 'TRIVIAL_do10')
      CALL enter_interval(2, 'TRIVIAL_do20')
      END
This kind of instrumentation can also be generated automatically by the Polaris compiler. Polaris can be run through the web at http://punch.ecn.purdue.edu/ParHub/

Now, you need to compile the program together with the instrumentation library functions, such as:

f77 trivial.f interval.f -o trivial
Then, when you run the program "trivial" it generates a file TRIVIAL.sum with the following content:
  TRIVIAL_do10 5 AVE: 0.000026 MIN: 0.000022 MAX: 0.000041 TOT: 0.000132 
  TRIVIAL_do20 5 AVE: 0.000017 MIN: 0.000017 MAX: 0.000017 TOT: 0.000087 

 OVERALL time -    0.022050 - - - - - - 
This gives you, for each instrumented program section, the number of invocations and the average, minimum, maximum, and total execution time.
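The full cycle, then, as a shell sketch (file names as in the example above):

f77 trivial.f interval.f -o trivial
./trivial
cat TRIVIAL.sum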

The source code of the library (the file interval.f, above) is this:

	subroutine init_intervals(filename)
	character*(*) filename

	common /intvldata/ start(1000),count(1000),
     *                     total(1000), overall_start,
     *                     min(1000), max(1000),
     *                     nintervals, intvlname(1000)
	character*30 intvlname
	real start, total, overall_start, min, max, tt(2)
	integer*4 count, nintervals

	integer int_number
	character*30 int_name

	if (filename .eq. ' ') then
	  nintervals = 0
	  return
	endif

	open(file=filename,status='old',unit=83)

	nintervals = 0
100	read(83,*,end=200) int_number, int_name

	nintervals = nintervals + 1

	if (nintervals .ne. int_number) then
		print *, 'Warning: Interval number .ne.  record number: ', 
     *			  int_name
	endif	
	intvlname(int_number)(:) = int_name(:)
	count(int_number) = 0
	total(int_number) = 0
	min(int_number) = 1e31
	max(int_number) = 0
	goto 100

200	overall_start = etime(tt)

	close(unit=83)
	return
	end


c--------------------------------
       subroutine enter_interval ( number, name )
       character*(*) name
       integer number

       common /intvldata/ start(1000),count(1000),
     *                     total(1000), overall_start,
     *                     min(1000), max(1000),
     *                     nintervals, intvlname(1000)
       character*30 intvlname
       real start, total, overall_start, min, max, tt(2)
       integer*4 count, nintervals
       nintervals = nintervals + 1
       intvlname(number) = name
       count(number) = 0
       total(number) = 0
       min(number) = 1e31
       max(number) = 0
       end

c--------------------------------
	subroutine start_interval ( interval )

	integer interval

	common /intvldata/ start(1000),count(1000),
     *                     total(1000), overall_start,
     *                     min(1000), max(1000),
     *                     nintervals, intvlname(1000)
	character*30 intvlname
	real start, total, overall_start, min, max, tt(2)
	integer*4 count, nintervals

	start(interval) = etime(tt)

	return
	end

c--------------------------------
	subroutine end_interval ( interval )

	integer interval

	common /intvldata/ start(1000),count(1000),
     *                     total(1000), overall_start,
     *                     min(1000), max(1000),
     *                     nintervals, intvlname(1000)
	character*30 intvlname
	real start, total, overall_start, min, max, tt(2)
	integer*4 count, nintervals

	real period

        period = etime(tt) - start(interval)
        total(interval) = total(interval) + period
        count(interval) = count(interval) + 1
	if (period.lt.min(interval)) min(interval)=period
	if (period.gt.max(interval)) max(interval)=period
	return
	end


c--------------------------------
	subroutine exit_intervals(filename)

	character*(*) filename

	common /intvldata/ start(1000),count(1000),
     *                     total(1000), overall_start,
     *                     min(1000), max(1000),
     *                     nintervals, intvlname(1000)
	character*30 intvlname
	real start, total, overall_start, min, max, tt(2)
	real overhead_etime
	parameter(overhead_etime=0.71E-6)
	integer*4 count, nintervals

	real overall_end
	character*200 buffer, output_line


	overall_end = etime(tt)

	open(file=filename, unit=83, status='unknown')

       do i=1,nintervals
           if (count(i) .ne. 0) then
                buffer(:) = ' '
                write(buffer,10) intvlname(i)(:),
     *                  count(i),
     *                  (total(i)-overhead_etime*count(i))/count(i),
     *                  min(i)-overhead_etime,
     *                  max(i)-overhead_etime,
     *                  total(i)-overhead_etime*count(i)
10              format(1x,a30,1x,i7,' AVE: ',f12.6,' MIN: ', f12.6,
     *                 ' MAX: ', f12.6,' TOT: ',f12.6)
C               call xqueexe(buffer, output_line, length)
C               write(83,15) output_line(1:length)
                write(83,15) buffer
15              format(1x,a)
           endif
        end do

	write(83,20)
20	format(1x)
	write(83,30) overall_end-overall_start-overhead_etime
30	format(1x,'OVERALL time - ', f11.6,' - - - - - - ')
	return
	end

	subroutine xqueexe(string_in, string_out, length)
	character*(*) string_in
	character*(*) string_out
	integer length

	integer length_in, out, in
	logical inblanks

	length_in = len(string_in)

	inblanks = .false.
	out = 0
	do in=1,length_in
		if (string_in(in:in) .ne. ' ') then
			out = out + 1
			string_out(out:out) = string_in(in:in)
			inblanks = .false.
		elseif (.not. inblanks) then
			out = out + 1
			string_out(out:out) = string_in(in:in)
			inblanks = .true.
		endif
	end do

	length = out
	return
	end

How can I instrument a program using hardware counters?

The UltraSPARC and Pentium microprocessor families contain hardware performance counters that allow the measurement of many different hardware events related to CPU behavior, including instruction and data cache misses as well as various internal states of the processor. More recent processors allow a variety of events to be captured. The counters can be configured to count user events or system events, or both. The two processor families currently share the restriction that only two event types can be measured simultaneously.

The CPU Performance Counters library (cpc) is available on Solaris 8 and above. Using this library, we implemented the instrumentation for Polaris. The source file is interval_cpc.c, and you may also link with the library libinstrcpc.a directly. Both of them are provided in

paramnt/tools/instrumentation.
This library satisfies the instrumentation interface of Polaris.

Usage

Step 1: Use polaris to do the instrumentation.
Please switch "instrument" on. You should also switch "instr_pcl" on if you are interested in all the processors.


Step 2: Compile the instrumentation code, if interval_cpc.c is used directly.

To compile this code, please use the Sun 6.2 C compiler:
cc -fast -xarch=v8plusa -xopenmp interval_cpc.c
To instrument the code exclusively, please define _EXCLUSIVE, e.g.
cc -fast -xarch=v8plusa -xopenmp -D_EXCLUSIVE interval_cpc.c
For multiple processors, please define _OPENMP and _MT:
cc -fast -xarch=v8plusa -xopenmp -D_OPENMP -D_MT interval_cpc.c

You may also use my precompiled object files or libraries.
For inclusive instrumentation:  interval_cpc.o and libinstrcpc
For exclusive instrumentation:  interval_cpc_e.o and libinstrcpce

n.b. Because each hardware counter is only 4 bytes wide, it will overflow if the measured interval is too long. Therefore, a timer is added that periodically resets the hardware counter to avoid overflow. The timer interval must be short enough to avoid overflow, yet as long as possible to reduce instrumentation overhead. You may set the interval by defining the macro _OVRFLSEC, e.g. "-D_OVRFLSEC=8". By default, it is 8, which suits 500MHz machines.

n.b. If you want to trace the latency of each iteration, you may define _TRACE.

n.b. To avoid thread migration, the library binds all threads to CPUs; otherwise migration would cause big problems for the instrumentation. The bad news is that binding may have some side effects on performance.

Step 3: Compile the instrumented code.
To compile the instrumented parallel code, you must pass the following flags to your Fortran compiler:
-fast -xarch=v8plusa -openmp -mt
Make sure you link the cpc library (-lcpc).
If you want to use the instrcpc/instrcpce library, do not forget to link it as well.

Step 4: Set the environment.

CPC does its counting based on the Performance Control Register (PCR). The INSTRCPC library uses the environment variable PERFEVENTS to specify the type of counters.
By default, it is set to "Cycle_cnt,Instr_cnt" to get the number of cycles and instructions, and thus to compute CPI.
E.g., to get the cache hit ratio, we may do:
setenv PERFEVENTS "pic0=EC_ref,pic1=EC_hit"

The syntax of setting counter options is

pic0=<eventspec>, pic1=<eventspec> [,sys] [,nouser] .
This syntax, which reflects the simplicity of the options available using the %pcr register, forces both counter events to be selected. By default only user events are counted; however, the sys keyword allows system (kernel) events to be counted as well. User event counting can be disabled by specifying the nouser keyword. The keywords pic0 and pic1 may be omitted; they can be used to resolve ambiguities if they exist.
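For instance, to count kernel (system) events in addition to user events, append the sys keyword (the event names below are the defaults described above):

setenv PERFEVENTS "pic0=Cycle_cnt,pic1=Instr_cnt,sys"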


LAST STEP: Run the program!
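A complete measurement run might then look like this in csh (a sketch, assuming your instrumented executable is named a.out):

setenv PERFEVENTS "pic0=EC_ref,pic1=EC_hit"
setenv OMP_NUM_THREADS 4
./a.out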

Appendix: Performance Instrumentation Counter Events

(From Sun Microelectronics UltraSPARC I&II User's Manual,  January 1997, STP1031)

1 Instruction Execution Rates

Cycle_cnt [PIC0,PIC1]

Accumulated cycles. This is similar to the SPARC-V9 TICK register, except that cycle counting is controlled by the PCR.UT and PCR.ST fields.
Instr_cnt [PIC0,PIC1]
The number of instructions completed. Annulled, mispredicted or trapped instructions are not counted.
Using the two counters to measure instruction completion and cycles allows calculation of the average number of instructions completed per cycle.

2 Grouping (G) Stage Stall Counts

These are the major cause of pipeline stalls (bubbles) from the G Stage of the pipeline. Stalls are counted for each clock that the associated condition is true.

Dispatch0_IC_miss [PIC0]

I-buffer is empty from I-Cache miss. This includes E-Cache miss processing if an E-Cache miss also occurs.
Dispatch0_mispred [PIC1]
I-buffer is empty from Branch misprediction. Branch misprediction kills instructions after the dispatch point, so the total number of pipeline bubbles is approximately twice as big as measured from this count.
Dispatch0_storeBuf [PIC0]
Store buffer can not hold additional stores, and a store instruction is the first instruction in the group.
Dispatch0_FP_use [PIC1]
First instruction in the group depends on an earlier floating point result that is not yet available, but only while the earlier instruction is not stalled for a Load_use (see 3 ). Thus, Dispatch0_FP_use and Load_use are mutually exclusive counts.
Some less common stalls are not counted by any performance counter, including:
  • One cycle stalls for an FGA/FGM instruction entering the G stage following an FDIV or FSQRT.

3 Load Use Stall Counts

Stalls are counted for each clock that the associated condition is true.

Load_use [PIC0]

An instruction in the execute stage depends on an earlier load result that is not yet available. This stalls all instructions in the execute and grouping stages.

Load_use also counts cycles when no instructions are dispatched due to a one cycle load-load dependency on the first instruction presented to the grouping logic.

There are also overcounts due to, for example, mispredicted CTIs and dispatched instructions that are invalidated by traps.

Load_use_RAW [PIC1]
There is a load use in the execute stage and there is a read-after-write hazard on the oldest outstanding load. This indicates that load data is being delayed by completion of an earlier store.
Some less common stalls are not counted by any performance counter, including:


4 Cache Access Statistics

I-, D-, and E-Cache access statistics can be collected. Counts are updated by each cache access, regardless of whether the access will be used.

IC_ref [PIC0]

I-Cache references. I-Cache references are fetches of up to four instructions from an aligned block of eight instructions. I-Cache references are generally prefetches and do not correspond exactly to the instructions executed.
IC_hit [PIC1]
I-Cache hits.
DC_rd [PIC0]
D-Cache read references (including accesses that subsequently trap). Non-D-Cacheable accesses are not counted. Atomic instructions, block loads, "internal" and "external" bad ASIs, quad precision LDD, and MEMBARs also fall into this class.
DC_rd_hit [PIC1]
D-Cache read hits are counted in one of two places:
1. When they access the D-Cache tags and do not enter the load buffer (because it is already empty)
2. When they exit the load buffer (due to a D-Cache miss or a nonempty load buffer).
Loads that hit the D-Cache may be placed in the load buffer for a number of reasons; for example, the load buffer was not empty. Such loads may be turned into misses if a snoop occurs during their stay in the load buffer (due to an external request or to an E-Cache miss). In this case they do not count as D-Cache read hits.
DC_wr [PIC0]
D-Cache write references (including accesses that subsequently trap). Non-D-Cacheable accesses are not counted.
DC_wr_hit [PIC1]
D-Cache write hits.
EC_ref [PIC0]
Total E-Cache references. Non-cacheable accesses are not counted.
EC_hit [PIC1]
Total E-Cache hits.
EC_write_hit_RDO [PIC0]
E-Cache hits that do a read-for-ownership UPA transaction.
EC_wb [PIC1]
E-Cache misses that do writebacks.
EC_snoop_inv [PIC0]
E-Cache invalidates from the following UPA transactions: S_INV_REQ, S_CPI_REQ.
EC_snoop_cb [PIC1]
E-Cache snoop copy-backs from the following UPA transactions: S_CPB_REQ, S_CPI_REQ, S_CPD_REQ, S_CPB_MSI_REQ.
EC_rd_hit [PIC0]
E-Cache read hits from D-Cache misses.
EC_ic_hit [PIC1]
E-Cache read hits from I-Cache misses.

The E-Cache write hit count is determined by subtracting the read hit and the instruction hit count from the total E-Cache hit count. The E-Cache write reference count is determined by subtracting the D-Cache read miss (D-Cache read references minus D-Cache read hits) and I-Cache misses (I-Cache references minus I-Cache hits) from the total E-Cache references. Because of store buffer compression, this is not the same as D-Cache write misses.

    How do I instrument a program using hardware counters on SGI Origin machines?

    There is a command called "perfex" that can be used to access the hardware counters. Using its library, libperfex, we implemented the instrumentation for Polaris as well. The source file is interval_perfex.c, provided in

    paramnt/tools/instrumentation.
    This library satisfies the instrumentation interface of Polaris.

    Usage

    Step 1: Use polaris to do the instrumentation.
    Please switch "instrument" on. You should also switch "instr_pcl" on if you are interested in all the processors.


    Step 2: Compile the instrumentation code.

    cc -O2 -mp interval_perfex.c
    For multiple processors, please define _OPENMP and _MT:
    cc -O2 -mp -D_OPENMP -D_MT interval_perfex.c
    Step 3: Compile the instrumented code.
    Make sure you link the perfex library (-lperfex).

    Step 4: Set the environment. You can use two different counters at once; there are two corresponding environment variables, T5_EVENT0 and T5_EVENT1. To find out which values to set, please see the man page of perfex.


    LAST STEP: Run the program!
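    For example, in csh (a sketch; the two event numbers below are placeholders, so look up the events you need in "man perfex", and a.out stands for your instrumented executable):

    setenv T5_EVENT0 0
    setenv T5_EVENT1 21
    ./a.out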

    My application is 32-bit but runs out of heap space after only 2GB. What is the problem?

    On our SUN systems, you can set limits on the stack space and heap space using the csh limit command, and you can also "unlimit" either of them. Since SUN Solaris 2.8 is 64-bit, you can have a swap file greater than 2 GB, and can therefore have a heap space of more than 2 GB. However, if you unlimit the stack space, it is set to 2 GB, which leaves only 2 GB of the 32-bit address space for the heap. To give more space to your heap in a 32-bit application, limit your stack. With a stack size of 1 GB, my heap space could grow to 3 GB. This allowed me to use the 32-bit compilation of Polaris without attempting to compile with the 64-bit gcc compilers.
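    A minimal csh sketch of the commands described above (sizes are given in kbytes, as limit reports them; 1 GB = 1048576 kbytes):

    >> limit stacksize 1048576
    >> unlimit datasize
    >> limit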

    Running OpenMP Programs

    How should I set up my environment to run an OpenMP program?

    The OMP_NUM_THREADS environment variable determines the number of threads (and thus the number of processors) that will be used during execution. If you are using csh, you can set this variable to, for example, 4 by typing:

    setenv OMP_NUM_THREADS 4

    You should also set the environment variable PARALLEL to 1. This variable must be set or else any timers used by the program will return incorrect timings (see the etime man page for more details).

    In order for your application to find the Guide libraries, you must set your LD_LIBRARY_PATH environment variable to ~paramnt/tools/guide38/lib/32; see the FAQ on how to use the Purdue-installed version of the KAPPro tools under KAPPro_Toolset.
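    Putting it all together, a typical csh setup might look like this (paths as given above; a sketch, assuming LD_LIBRARY_PATH is not already set):

    >> setenv OMP_NUM_THREADS 4
    >> setenv PARALLEL 1
    >> setenv LD_LIBRARY_PATH ~paramnt/tools/guide38/lib/32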

    You can find more details about other OpenMP variables at the OpenMP webpage.

    My program crashes as soon as I run it, what's up?

    Of course there are many reasons that a program may crash, but a common reason is that the stack size is too small. If your program crashes as soon as you begin to execute it (I mean instantly!), this may be the problem. Try increasing the stacksize by typing:

    >> limit stacksize n
    

    where n is the size you'd like (in kbytes). You can see what the current size is by typing "limit" by itself:

    peta.ecn.purdue.edu 52: limit
    cputime unlimited
    filesize unlimited
    datasize 2097148 kbytes
    stacksize 8192 kbytes
    coredumpsize unlimited
    vmemoryuse unlimited
    descriptors 64
    

    Sometimes the backend compiler will warn you if it thinks that the required stacksize is very large; other times it won't. It is usually a good idea to try increasing the stacksize before spending hours trying to find the "bug" in your program.

    I know that I did everything right but I get no speedup, why?

    There are several reasons why this may happen: (1) you aren't really running the program in parallel (believe me, this is not that uncommon for a new user! Or sometimes even an experienced user), (2) Polaris did not do a good job of parallelizing your application (this isn't that uncommon either, especially for very large applications), or (3) your system is heavily loaded and using extra processors actually slows the application down. I'll briefly discuss each of these.

    1) Your program isn't really running in parallel: see the question "How can I tell if my program is running in parallel?" to check this. If you are new to using polaris or the environment at Purdue, check this first; it may keep you and your application from wasting a lot of time!

    2) Polaris did not do a good job of parallelizing your application: Studies have shown that automatic parallelization does well in only about 1 in 2 programs, and these studies were done with benchmark programs, not really big applications. To see if this is the problem, you'll probably need to characterize your application and look into tuning it by hand. You can look at http://min.ecn.purdue.edu/~ipark/UMinor/meth_index.html  for help on this subject.

    3) Your system is heavily loaded: Parallel programs use more than 1 processor. If your multiprocessor is heavily loaded then running an application in parallel may only increase memory and bus contention. If you are trying to tune and time a new program, try to do it on a quiet machine or in a single-user environment if possible. Also see "How can I tell if the machine I'm using is heavily loaded?".


    Questions about SPEC Benchmarks

    How can I obtain SPEC Benchmarks?

    Purdue is an associate member of the SPEC High-Performance group. The ParaMount group participates actively in SPEC activities. We have access to most SPEC benchmark suites, including the High-Performance benchmarks (SPEChpc96), the CPU benchmarks (SPECfp2000 and SPECint2000, also the older SPEC95 benchmarks), the SPEC Web benchmarks (SPECweb99), and several SPEC graphics benchmarks.

    All Purdue members are allowed to use these benchmarks. The benchmarks come with certain rules. For example, you are not allowed to distribute the benchmarks outside Purdue, and you are not allowed to use certain SPEC metrics when quoting performance results (unless you adhere strictly to the SPEC run rules). However, it is allowed, and common, to use the benchmarks in research experiments and report those results (not using SPEC metrics). SPEC has some recommendations for using its benchmarks in research papers.

    To get a copy of SPEC benchmarks, you can borrow our SPEC CD. Send mail to eigenman@purdue.edu.

    You can find more information about SPEC and its benchmarks at www.spec.org.


    The SUIF Compiler

    Which version of SUIF is installed in the ParaMount directories?

    Both versions of Stanford's SUIF compiler (versions 1 and 2) are installed.

    SUIF 1 in suif1

    SUIF 2 in suif2

    How do I run SUIF from ParaMount's installation?

    SUIF 1

    SUIF 2

    This page is under development.

    How do I parallelize a Fortran code with SUIF 1?

    You can use the provided parallelization driver:
    pscc -parallel program.f

    This driver executes a series of passes to parallelize a code.

    1. SF2C: first pass of the SUIF Fortran 77 front end.
    2. CPP: preprocess.
    3. SNOOT: translate pre-processed C to SUIF.
    4. FIXFORTRAN: final processing for the SUIF Fortran 77 front end.
    5. ENTRY_TYPES: propagate parameter type info for Fortran entry points.
    6. PORKY_PRE_DEFAULTS, PORKY_DEFAULTS: this does the default options to be used right after the front end, to turn some non-standard SUIF that the front end produces into standard SUIF. It also does some things, like constant folding and removing empty symbol tables, to make the code as simple as possible without losing information.
    7. PORKY_LINE_FIX: removes all mark instructions that contain nothing but line information and are followed immediately by another line information mark.
    8. LINKSUIF: combine SUIF files into a file set.
    9. PORKY_UCF_OPT: do simple optimizations on unstructured control flow (branches and labels).
    10. PORKY_DEAD_CODE1: simple dead-code elimination.
    11. STRUCTURE: structure control flow.
    12. PORKY_FOLD1: folds constants wherever possible.
    13. PORKY_FIND_FORS: builds tree_for nodes out of tree_loop nodes for which a suitable index variable and bounds can be found.
    14. PORKY_CONST1: simple constant propagation.
    15. PORKY_FOLD2
    16. PORKY_COPY_PROP: does copy propagation, which is the same as forward propagation limited to expressions that are simple local variables (i.e. if there is a simple copy from one local variable into another, uses of the source variable will replace the destination variable where the copy is live).
    17. PORKY_DEAD_CODE2
    18. PORKY_UNUSED1: removes types and symbols that are never referenced and have no external linkage, or that have external linkage but are not defined in this file (i.e. no procedure body or var_def).
    19. PORKY_EMPTY_TABLE1: dismantles all TREE_BLOCKs that have empty symbol tables.
    20. PORKY_EMPTY_FOR1: dismantles TREE_FORs with empty bodies.
    21. PORKY_CONTROL_SIMP1: simplifies TREE_IFs for which this pass can tell that one branch or the other always executes, leaving only the instructions from the branch that executes and any parts of the test section that might have side effects.
    22. PORKY_FORWARD_PROP1: move as much computation as possible into the bound expressions of each loop.
    23. PORKY_FOLD3
    24. PORKY_DEAD_CODE3
    25. PORKY_LOOP_COND: move all loop-invariant conditionals that are inside a TREE_LOOP or TREE_FOR outside the outermost loop.
    26. PORKY_FORWARD_PROP2
    27. PORKY_FOLD4
    28. PORKY_DEAD_CODE4
    29. PORKY_UNUSED2
    30. PORKY_EMPTY_TABLE2
    31. PORKY_EMPTY_FOR2
    32. PORKY_CONTROL_SIMP2
    33. PORKY_LOOP_INVARIANTS1: moves the calculation of loop-invariant expressions outside loop bodies.
    34. PORKY_FORWARD_PROP3
    35. PORKY_CSE1: does simple common sub-expression elimination.
    36. PORKY_CONST2: does simple constant propagation.
    37. PORKY_SCALARIZE: turns local array variables into collections of element variables when all uses of the array are loads or stores of known elements. It will partly scalarize multi-dimensional arrays if they can be scalarized in some but not all dimensions.
    38. PORKY_FORWARD_PROP4
    39. PORKY_CONST3
    40. NORMALIZE
    41. PORKY_IVAR1: does simple induction variable detection.
    42. PORKY_IVAR2
    43. PORKY_IVAR3
    44. PORKY_KNOW_BOUNDS: replaces comparisons of upper and lower bounds of a loop inside the loop body with the known result of that comparison. This is particularly useful after multi-level induction variables have been replaced.
    45. PORKY_CONST4
    46. PORKY_FOLD5
    47. SCE0
    48. REDUCTIONS
    49. PORKY_EMPTY_FOR3
    50. PREDEP
    51. PORKY_DEAD_CODE5
    52. SKWEEL
    53. PORKY_UNCBR: replace call-by-reference scalar variables with copy-in, copy-out. This is useful when a later pass, such as a back-end compiler after s2c, will not have access to call-by-ref form.
    54. PGEN
    55. PORKY_FOLD6
    56. PORKY_FORWARD_PROP5
    57. PORKY_DEAD_CODE6
    58. PORKY_LOOP_INVARIANTS2
    59. PORKY_FORWARD_PROP6
    60. PORKY_CSE2
    61. PORKY_DEAD_CODE7
    62. CFORM: translate inside SUIF from Fortran to C form.
    63. PORKY_UNUSED3
    64. S2C: convert a SUIF file to C.
    65. BACKEND_CC: use cc or gcc.
    66. LD: ld.

    (See the man pages provided with the SUIF distribution for more. Much of this data was gathered from the man pages.)

    What other passes/utilities are provided with SUIF and installed?


    Single-User Access

    How can I get single-user access to ParaMount SUN machines?

    We currently have single-user times set up on peta.ecn, lagavulin.ecn, bernina.ecn, and longmorn.ecn. Anyone with group paramnt can set up a single-user time as described below. Single-user time is set up to allow your shell or script to have access to the machine while all other users are restricted from logging in and all currently running jobs are suspended.

    However, some jobs on the ECN-supported SUNs cannot be stopped or delayed. See the FAQ for more on what may be influencing your timings on our SUNs.

    SINGLE-USER TIMES

    Please follow the guidelines below carefully when entering a single-user time in the /home/peak/a/sut/single-user-times/SUTimes/MACHINE file. (MACHINE is "peta", "lagavulin", ....) (The scripts affect all users of the machine.)
      FORMAT OF THIS FILE
      This is a comma-delimited file with three fields:
      begin-time, end-time, single-user(s)
    1. VALID TIMES:
      begin-time can be any time at least 3 minutes after the end time of a previous entry (on the same day). end-time needs to be at least 3 minutes after the begin time.
      The single-user scripts are currently run every odd hour (1am, 3am, ..., 11pm), so begin-time must be later than the next time the script is run.
      The times are used by "at" (see spawnSUTs.awk) and can take any explicit time format that "at" accepts, such as 3:25pm or 15:25. You can also specify a date, if it is later than the next time the single-user setup script gets run. This date can be a day of the week or an explicit month and day. For example:
        3:25am thu --> means 3:25am on the next Thursday morning.
        3:25am mar 5 --> means 3:25am on March 5th.
        NOTE THAT "thu" (NOT "thr") IS Thursday.
      single-user(s): This field is optional. The single-user can be one or more logins (separated by spaces). The designated single-users will not have their jobs stopped during this single-user time.
      If you do not specify a single-user (i.e., if only a begin and end time are specified), then there will be no conversion of processes from time-shared to real-time mode. Also, all jobs (including any of yours that are running at begin-time) will be stopped at the begin time.
      If an asterisk ("*") is the first character in this field, then the "sut" account will mail everyone on the mailing list about your single-user time. (See below about notifying other users.)
        eg.,
        4:00am , 8:00am, barmstro
        8:05am , 8:00pm, *barmstro
        9:00pm tue, 9:45pm tue, aslot barmstro seon
    2. RECURRING SINGLE-USER TIMES
      These take place every day (they never get commented out of the SUTimes file). They are indicated by an "*" in front of the begin time. (Don't overlap your requested times with these.)
        eg.,
        * 3:00am , 7:00am, mjvoss
    3. SINGLE-USER TIME USING THE SINGLE-USER QUEUE
      If there is no specific end-time to the single-user time slot, then prefix the second field with an "*", such as: 5:55pm friday, * noon saturday. This will cause the single-user time slot to end when no more jobs are present in the s-queue, with a hard limit of noon.
      Single-users can submit jobs to the single-user queue to run any time before the single-user time ends. These jobs will be executed between 5:55pm Friday and noon on Saturday in the order of their submit time. By noon on Saturday, or earlier if the single-user queue is drained sooner, multi-user time will begin. (This method of single-user time has not been tested with all the newer features yet.)
    4. VALID SINGLE-USER TIMES CANNOT OVERLAP
      Just make sure that your new time does not overlap with any of the others.
      You can use "sudo atq" to check which jobs are queued by sut. There should be two or three jobs in the "at" queue for the begin time and two or three for the end time. (You can also see anyone else's jobs waiting to be executed in this list.)
      Previously submitted single-user times are commented out with "#SUBMITTED" in this file. Previous requests are also logged in SUTimes.log .
      Also, you should begin your single-user time a little bit after a previous one ends. Exactly at the end time, sudoStartMUT (which resumes multi-user time) is executed, and it may take a couple of minutes for these scripts to run. So, give a little buffer (like 5 minutes) between single-user times.
    5. NOTIFY USERS IF THIS IS NOT A NORMAL SINGLE-USER TIME
      In /home/peak/a/sut is a file .mailrc which contains a mail alias with all current users of the system. You can use this alias (just add it to the .mailrc in your home directory and then send mail to multi_peta, multi_lagavulin, etc., depending on which machine you are on) to notify everyone on the mailing list.
      Alternatively, if the first character of the 'single-users' field is an asterisk ("*"), the sut account will automatically mail a notification to all on the list.
        eg.,
        3:00pm,9:00pm,*mjvoss barmstro
        (This will set up a single-user time for 3 - 9 pm and mail the users on the list that Mike and Brian are using the machine.)
    6. ALLOCATE SINGLE-USER TIMES IN ADVANCE
      Please kindly notify all users in advance, preferably a day in advance, if you want time outside of the normal time allocated to single-users (which is 3:25am till 8:25am every weekday morning).
      For extra time on the weekends, it is good to notify everyone before the end of the workday on Friday.
    7. NORMAL SINGLE-USER TIME
      Every day 3:25am till 8:30am is single-user time. You can take any amount of this time without notifying all users of the machine. At other times you will need to notify all users in advance.
    8. INTERACTIVE SINGLE-USER TIME
      You can allocate an interactive single-user time for yourself by including your login in the list of single-users and starting an xterm on the machine some time before your single-user time begins. When logins are restricted, your xterm will remain active.
    9. REALTIME
      The SUN systems managed by ECN are set up for time-shared multi-user use, so jobs with a higher priority do not necessarily execute before jobs with a lower priority. The scripts for single-user time make sure that your processes are set to real-time and are given a high priority. Then, your jobs will gain access to resources and execute before almost any other process.
      Switching your processes to real-time only occurs if you include your login in the list of single-users and your jobs are running (or the script that executes your jobs is running) at begin-time.

    DETAILS:

    Here are the actions that take place for each line in this file:
    1. sut runs submitSUTimes.sh:
      this shell script executes an awk script (spawnSUTs.awk) to create tmp/temp.sh
    2. sut then immediately runs tmp/temp.sh:
      which uses "at" to submit jobs and then comments out all single-user times just submitted in this file.
    3. The "at" jobs submitted for each single-user time are:
      1. a stdin job that sets tmp/SUT_users and tmp/MUT_users, and which creates a login restriction message (tmp/nologin).
      2. startSUT.sh -- stops all jobs owned by users in tmp/MUT_users and restricts logins.
      3. realTime.sh -- converts all jobs owned by users in tmp/SUT_users to real-time with high-priority.
      4. startMUT.sh -- continues all jobs stopped by startSUT.sh and removes the login restriction.
      5. timeShared.sh -- converts all jobs owned by users in tmp/SUT_users to time-shared.
      6. a stdin job that removes the tmp/nologin file, making the scripts believe that it is now multi-user time.

    Why are my timings in single-user time longer than expected?

    Many factors can possibly affect timings on the SUN systems supported by ECN, which are set up as multi-user servers. There are backups, remote accesses to the disks (through NFS), security checks, unrestricted accounts (condor and punch daemons and jobs), and clumsy OS handling of resources.

    ECN-related tasks are not always predictable. That is, the backups may occur in the early morning (around 2 or 3am on peta), but could also occur later if there is some delay in the backups.
    The times to generally watch out for are late evening (8pm to 10pm) and 2:00-3:30am.
    What is more, you cannot easily see whether a backup is occurring by using "ps" to look at the processes, because they are initiated from remote machines.
    peta is most affected by backups and remote accesses since it is home to a software RAID array. Every access to a disk causes a CPU load, even when the access comes from a remote machine.

    condor and PUNCH must have certain daemons running at all times, or the Hub will not work properly. You can see these jobs by doing 'ps' on the system (see the FAQ on "How can I tell if the machine that I'm using is heavily loaded?").

    Our systems are connected via NFS, meaning that the disks can be mounted by other systems. When a remote machine accesses the local disks, the disk access latency increases. On peta, there is a CPU load for every access to the 32 GB of disk space because it uses a software RAID array.

    How do I submit to the single-user batch queue? (qsub)

    NAME

    SYNOPSIS

    DESCRIPTION OPTIONS DETAILS OUTPUT LOCATION OF QUEUES

    The Polaris Compiler

    Last Update: Mon Sep 30 12:49:27 EST 2002

    What is Polaris?

    This answer is stolen from the polaris web page at UIUC:

    "The Polaris compiler takes a Fortran77 program as input, transforms this program so that it runs efficiently on a parallel computer, and outputs this program version in one of several possible parallel Fortran dialects. The input language includes several directives which allow the user of Polaris to specify parallelism explicitly in the source program. The output language of Polaris is typically in the form of Fortran77 plus parallel directives as well. For example, a generic parallel directive set includes the directives "CSRD$ PARALLEL" and "CSRD$ PRIVATE a,b", specifying that the iterations of the subsequent loop shall be executed concurrently and that the variables a and b shall be declared "private to the current loop", respectively. Another output language that Polaris can generate is the Fortran plus the directive language available on the SGI Challenge machine.  (The Purdue version also supports the OpenMP API). Polaris performs its transformations in several "compilation passes". In addition to many commonly known passes, Polaris includes advanced capabilities performing the following tasks: array privatization, data dependence testing, induction variable recognition, interprocedural analysis, and symbolic program analysis. An extensive set of options allow the user and the developer of Polaris to experiment with the tool in a flexible way. An overview of the Polaris transformations is given in the Publication Automatic Detection of Parallelism: A Grand Challenge for High-Performance Computing. The implementation of Polaris consists of 170,000 lines of C++ code. A basic infrastructure provides a hierarchy of C++ classes that the developers of the individual compilation passes can use for manipulating and analyzing the input program. This infrastructure is described in the Publication "The Polaris Internal Representation".

    Can I Get My Own Copy of Polaris?

    If you are outside of Purdue see http://polaris.cs.uiuc.edu/polaris/README. If you are within Purdue, send mail to polaris@ecn.purdue.edu (we can give you our modified version, i.e. with the OpenMP directives etc...). If you simply want to use Polaris, there are easier ways than to install it yourself: (1) if you're at Purdue we already have a public copy, and (2) regardless of where you are, you can use the Parallel Programming Hub.

    I have an ECN Account, how do I set up my environment so that I can run polaris?

    The most up-to-date version of the polaris executable is kept in the /home/peak/a/paramnt/tools/bin directory. You should add this to your path if it is not already there. Next, in order to access the polaris man pages, you should add the /home/peak/a/paramnt/tools/man directory to your man path. To ensure that polaris uses the most up-to-date switches file, set the POLARIS_ROOT environment variable to /home/peak/a/paramnt/tools/polaris.

    An example, using the csh is given below. You may want to add these commands to your .cshrc so that you don't have to retype them each time you log in:

    >> setenv PATH /home/peak/a/paramnt/tools/bin:$PATH
    >> setenv MANPATH /home/peak/a/paramnt/tools/man:$MANPATH
    >> setenv POLARIS_ROOT /home/peak/a/paramnt/tools/polaris
    

    How do I run polaris (as a command-line tool)?

    The command line interface is explained in detail in the polaris man page (read by typing: ``man polaris'').

    In this section I will give a step-by-step tutorial on how to generate a parallel program using polaris. First, enter the sample program shown below into a file named example.f. If you're not familiar with Fortran 77, you need to know that spaces are important! The first 6 columns may only be used for statement labels or to signify that a line is a continuation, so all of the lines in the figure start with at least 6 spaces.

          PROGRAM EXAMPLE
          REAL A(100),B(100)
          INTEGER I
          DO I = 1,100
           A(I) = I
           B(I) = 100 - I
          ENDDO
          DO I = 2,100
           A(I) = A(I-1)
          ENDDO
          DO I = 1,100
           WRITE(6,*) A(I), B(I)
          ENDDO
          END
    
    Figure: Example Program
    

    You can run polaris on example.f by simply typing:

    >> polaris -version new example.f

    The polaris command found in /home/peak/a/paramnt/tools/bin is actually a script that allows you to select from several versions of polaris.  The -version flag controls which version is used, and using "new" will cause the newest version to be called.  Typing the command as given above should generate many lines of output, including the resulting program shown in the figure below (the code may differ slightly depending on what the current default setup of polaris is).

          PROGRAM example
          INTEGER*4 i, numthreads, omp_get_max_threads
          REAL*4 a, b
          DIMENSION a(100), b(100)
          COMMON /polaris/ numthreads
          numthreads = omp_get_max_threads()
    !$OMP PARALLEL
    !$OMP+DEFAULT(SHARED)
    !$OMP+PRIVATE(I)
    !$OMP DO
    CSRD$ LOOPLABEL 'EXAMPLE_do#1'
          DO i = 1, 100, 1
            a(i) = i
            b(i) = 100+(-i)
          ENDDO
    !$OMP END DO NOWAIT
    !$OMP END PARALLEL
    CSRD$ LOOPLABEL 'EXAMPLE_do#2'
          DO i = 2, 100, 1
            a(i) = a((-1)+i)
          ENDDO
    CSRD$ LOOPLABEL 'EXAMPLE_do#3'
          DO i = 1, 100, 1
           WRITE (6, *) a(i), b(i)
          ENDDO
          STOP
          END
    
    Figure: Example Output
    
    You can see in the example that Polaris finds the first loop, EXAMPLE_do#1, to be parallel, while the remaining two loops stay serial: the second loop carries a dependence across iterations (each iteration reads a(i-1), which is written by the previous iteration), and the third loop performs I/O. The code part of the output is also stored in a file called example_P.f.

    How do I change how polaris behaves? What is a switch? What is a switches file?

    Polaris allows the user greater control over the parallelization process through the use of command-line switches. Switches appear on the command line after -s. The various switches are described in detail in the polaris man page. As an example, we could parallelize a program using the SGI directives instead of the default OpenMP directives by calling polaris as follows:

    >> polaris -version new -s output_lang=6 example.f

    The default settings of all of the switches are kept in a file called the switches file. This file is located in /home/peak/a/paramnt/tools/polaris. Since the script allows you to run different versions of polaris, there are several versions of the switches file, and the script will automatically select the correct one. If you want to modify this file, you can copy it to your local directory and then set the environment variable POLARIS_ROOT to the directory in which it resides. However, it is strongly suggested that you do not do this, since changes to the default switches file will not propagate into your copy.
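
    For example, assuming the switches file in that directory is literally named "switches" (the actual file name may differ between polaris versions), a local override in csh could look like:

    >> mkdir ~/myswitches
    >> cp /home/peak/a/paramnt/tools/polaris/switches ~/myswitches
    >> setenv POLARIS_ROOT ~/myswitches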

    Can I run the polaris executable directly, i.e. without using the script?

    You can run the polaris executable directly by using the executable found in /home/yara/re/mv/Master/bin/sparc-sun-solaris2.6. Make sure that you then change your POLARIS_ROOT to /home/yara/re/mv/Master and update your PATH variable so that the correct executable is chosen.
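
    In csh, for example:

    >> setenv POLARIS_ROOT /home/yara/re/mv/Master
    >> setenv PATH /home/yara/re/mv/Master/bin/sparc-sun-solaris2.6:$PATH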

    How can I run polaris on the Web? What is PUNCH? What/where is the Parallel Programming Hub?

    You need not install polaris to use it. You can use polaris across the Internet via the Parallel Programming Hub: http://punch.ecn.purdue.edu/ParHub. The Hub runs on PUNCH, the Purdue University Network Computing Hubs, an infrastructure that lets you run tools through a standard web browser; the Parallel Programming Hub is the PUNCH site hosting our parallel programming tools. You can get a free account and try it out if you'd like. More details are available on the Hub's own pages.

    How do I compile the output of Polaris (the _P.f) file?

    Polaris by default places the output into a file called *_P.f, where * is the source file name. Polaris is a source-to-source restructurer, meaning that it accepts Fortran as input and generates Fortran as output. The output Fortran must then be compiled with a backend compiler. We use the OpenMP parallel directives by default, and programs using these directives must be compiled using guidef77. (This only applies to users at ECN; other users will have to find their own OpenMP compiler!) The guidef77 executable is also in /home/peak/a/paramnt/tools/bin, so you probably do not need to change your path. The guidef77 manual can be found in /home/yara/re/mv/tools/guide38/docs/GuideF_Reference.pdf, or as a postscript file in the same directory. guidef77 takes almost the same options as f77 (man f77).

    We only have a license for guidef77 on one server, peta.ecn.purdue.edu, so you must be logged into peta in order to compile the output. You also need to have the following environment variables set (or add these values to them if they are already set):

    >> setenv GUIDE_HOME /home/yara/re/mv/tools/guide38
    >> setenv LD_LIBRARY_PATH /home/yara/re/mv/tools/guide38/lib
    

    To compile a program stored in a file called example_P.f you can type:

    >> guidef77 -fast -stackvar -mt -o example example_P.f
    

    I'll briefly explain the flags used here, although a better description can be found in the f77 man page. First, -fast is shorthand for a collection of switches that produce a well-optimized executable (it enables machine-specific optimizations, so an executable compiled with -fast should not be run on a different machine). -stackvar forces all local variables to be allocated on the stack (recommended when running in parallel). Finally, -mt links in the thread-safe libraries. A typical compilation proceeds as shown below:

    peta.ecn.purdue.edu 4: guidef77 -fast -stackvar -mt -o example example_P.f
    
    WARNING: guidef77 does not support switch, passed through to
    Fortran compiler:  -fast
    Guide 3.6 k310744 19990126
    19-Aug-1999 16:54:45
    Guide 3.6 k310744 19990126: 0 errors
    in file example_P.f
    G_output.f:
    MAIN example:
    pkexample_:
    
    peta.ecn.purdue.edu 5:
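
    Once the program is compiled, you can set the number of threads through the standard OpenMP environment variable and run the binary; a sketch in csh, assuming a 4-processor machine:

    >> setenv OMP_NUM_THREADS 4
    >> ./example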
    

    Where are the Polaris regression tests? How are the Polaris regression tests run? How do I add a test to Polaris' regression tests?

    Where are the Polaris regression tests and how are they run?

    The directory of regression tests is: /home/vanleer/a/polaris/poltests

    In this directory there is a script, runtests. For each file in poltests that ends in .f, it will run polaris, then try to compile the file with guidef77 and f95, and run each compiled version on 4 processors. The programs placed in this directory should write their output to standard out and print 'PASSED' if they validate. The tests should be self-contained and not require other files to be linked.
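
    As a sketch, a minimal self-validating test could look like the following (the program and variable names are illustrative, not one of the actual tests in poltests):

    C     Print PASSED on stdout so runtests can validate this test.
          PROGRAM TSTADD
          REAL A(100)
          INTEGER I
          LOGICAL OK
          DO I = 1,100
           A(I) = I
          ENDDO
          OK = .TRUE.
          DO I = 1,100
           IF (A(I) .NE. REAL(I)) OK = .FALSE.
          ENDDO
          IF (OK) WRITE(6,*) 'PASSED'
          END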

    There are at least 3 examples in the directory: end.f, noname.f, and TFS.f. noname.f is an example that fails in Polaris; TFS.f compiles but fails at runtime when compiled with f95; end.f passes.

    The script will identify whether Polaris fails, f95 fails, guidef77 fails, or the run does not validate, and it sends the list of failing files to the appropriate developers of Polaris.

    So the output for the current examples would be:

    Thu May 24 18:30:18 EST 2001
    ------------------------

    TFS.f_failed_at_runtime_with_f95
    TFS.f_failed_at_runtime_with_f95.out
    noname.f_failed_in_polaris

    This output is also stored in a file, in this case regressions-2001-May-24.

    How do I add another test to the regression tests?

    If you find bugs, or simply have good tests for Polaris, feel free to populate this directory. Also, when you find a bug, try to generate a small test case that works in this test directory. We can then track bugs as they appear and disappear.

    To run a single test you can use the runtest script. To test just noname.f, you'd type:

    > runtest noname.f
    No match
    Abort (core dumped)

    This causes a file noname.f_failed_in_polaris to be generated, which holds the output from polaris:

    POLARIS: Assertion failed: file String.cc, line 320 :
    Assign NULL value through String::operator =

    How can I obtain a development copy of Polaris?

    Polaris is set up with CVS (the Concurrent Versions System). On Purdue ECN machines, you can check out a version of Polaris by doing the following:

    1. Create a directory for your copy of polaris.
      mkdir polaris
    2. Enter that directory.
      cd polaris
    3. There should be a bzipped tar file of all the helper utilities that Polaris uses (such as the PCCTS scanner, OmegaTest, etc.). Copy the helper routines over to your private copy of polaris.
      bzcat /home/yara/re/paramnt/tools/polaris1.7.0/polaris_helpers.tar.bz2 | tar -xvf -
    4. Check out the Polaris source files from the CVS repository: /home/yara/re/paramnt/tools/polaris1.7.0/cvdl-cvsroot. Check them out into a directory named "cvdl".
      /usr/local/bin/cvs -d /home/yara/re/paramnt/tools/polaris1.7.0/cvdl-cvsroot checkout -d cvdl cvdl-1.7.0
    You should now have a directory tree rooted at your polaris directory, containing the cvdl source directory you just checked out alongside the helper directories (such as omega) unpacked from the tar file.

    To begin compiling Polaris, you must set one variable, either in the environment or by adding it to Rules.make in the cvdl directory.
    In Rules.make:
    POLARIS_DIR = /polaris
    (Here /polaris stands for the absolute path of the polaris directory you created in step 1.)
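
    Or, equivalently, in csh (using the same example path):

    >> setenv POLARIS_DIR /polaris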

    You will also need to make the omega library.

    1. cd polaris/omega
    2. make libomega.a
    3. make POLARIS_DIR=/polaris install

    Then, to make polaris:

    1. cd polaris/cvdl
    2. make polaris

    In order to commit changes to polaris, we have to wait until we have regression tests in place.

    How can I add a user directive to Polaris' vocabulary?

    NOTE: This question is intended for developers of Polaris.

    For simplicity, assume you would like to add a directive of the form:

        CSRD$ VARRANGE var,lb,ub
    
    With this directive, you intend for users to be able to indicate the range a variable can have, i.e., lb <= var <= ub.
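
    A user of the new directive could then write, for example (the names n, i, and a are illustrative; n would be an INTEGER in the enclosing program):

        CSRD$ VARRANGE n,1,100
              DO i = 1, n
                a(i) = 0.0
              ENDDO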

    Files you will have to create:

    Files you will have to modify:

    How can I control Polaris passes at a fine level?

    Polaris internal directives:

    There are quite a few directives that you can manually insert into the code to control what Polaris does with a statement or a block of statements. These directives were not all created with an end user in mind; most are simply internal directives used to pass information from one compiler pass to another. However, a user attempting to parallelize a code, or a developer of Polaris, may find them useful at times. (A short usage sketch follows the list.)
    CSRD$ PRIVATE var1, var2, ...
    Must be within a do-loop or a block. Takes a list of assignable variables.
    CSRD$ ASSERT expr1, expr2, ...
    Takes a list of expressions.
    CSRD$ SAFE CONDITION
    Takes a list of expressions. The AssertSafeCondition class is meant to provide a specific class for representing a Boolean condition under which the Polaris analysis can be considered "safe". That is, the Polaris analysis ran into some condition which it could not prove to be true, but which had to be true for the analysis to be correct. Such conditions are noted on the AssertSafeCondition assertion. It is intended that such assertions be turned into code guarded by an IF (condition) THEN version1 ELSE version2 ENDIF construct in the Polaris-generated source.
    CSRD$ SERIAL [ ( var1 ) ( var2 ) ... ]
    Appears immediately before a loop. Prevents the ddtest pass from parallelizing the loop.
    CSRD$ CRITICAL [ ( var1 ) , ( var2 ) , ... ]
    CSRD$ BEGIN BLOCK
    CSRD$ BEGIN PARALLEL ( var1 )
    Appears within a do-loop.
    CSRD$ END
    CSRD$ LAST VALUE var1, var2, ...
    Immediately follows a do-loop. Takes a list of assignable variables.
    CSRD$ FIRST VALUE var1, var2, ...
    Immediately precedes a do-loop. Takes a list of assignable variables. The AssertFirstValue class is meant to provide a specific class for assertions about the first-value calculations for variables. These are variables which are made private, with the stipulation that at least part of each be copied into the loop prior to loop execution. The portion to be copied in is specified on the AssertFirstValue assertion.
    CSRD$ FORWARD expr1, expr2, ...
    The AssertForward class is meant to provide a specific class for assertions about forwarding.
    CSRD$ DYN LAST VALUE var1, var2, ...
    Appears within a do-loop. Takes a list of assignable variables. The AssertDynLastValue class is meant to provide a specific class for assertions about the dynamic last value calculations for variables. This is meant for situations in which it cannot be proven that the last value for a variable is assigned in the last iteration of a loop.
    CSRD$ VARRANGE var, lb, ub
    This directive is translated into an assertion that indicates that the variable, var, takes on values between lb and ub.
    CSRD$ RANGE WRITTEN var1, var2, ...
    Immediately follows a do-loop. Takes a list of assignable variables. The AssertRangeWritten class is meant to provide a specific class for assertions about the range written to an array during an entire loop execution.
    CSRD$ PARALLEL [ ( var1 ) ( var2 ) ... ]
    CSRD$ REDUCTION var1, var2, ...
    Appears within a do-loop. Takes a list of assignable variables.
    CSRD$ LOOPLABEL 'name'
    This directive gives a label to the following loop (e.g. CSRD$ LOOPLABEL 'EXAMPLE_do#1' in the example output above). The label will be used by the instrumentation pass to determine which loops to instrument and how to output the timings. The label also appears in the compiler output if you specify that Polaris should output loop scheduling data.
    CSRD$ SCHEDULE string
    Immediately precedes a do-loop.
    The AssertSchedule class is meant to provide a specific class for specifying the execution schedule for a loop. The possible values are:
    CSRD$ SHARED var1, var2, ...
    Appears within a do-loop or a block. Takes a list of assignable variables. The AssertShared class is meant to provide a specific class for assertions about variables that are meant to be shared between processors.
    CSRD$ INDUCTION expr1, expr2, ...
    CSRD$ PARALLEL CONDITION expression
    The AssertParallelCondition class is meant to provide a specific class holding an expression, which if evaluated as .TRUE., could allow the loop to be executed in parallel.
    CSRD$ NOMOD var1, var2, ...
    Takes a list of assignable variables.
    CSRD$ MAYMOD var1, var2, ...
    Takes a list of assignable variables. The AssertMayMod class is meant to provide a specific class for assertions about the expressions which may be modified in the course of calling a given routine (the invocation of the routine occurs in the statement to which this assertion is attached).
    CSRD$ PRIVATE_REFS ( expr1 ) , ( expr2 ) , ...
    Appears within a do-loop or a block.
    CSRD$ READ_ONLY_REFS ( expr1 ) , ( expr2 ), ...
    Appears within a do-loop or a block.
    CSRD$ SHARED_REFS ( expr1 ) , ( expr2 ), ...
    Appears within a do-loop or a block.
    CSRD$ PREAMBLE
    CSRD$ POSTAMBLE
    CSRD$ INSTRUMENT
    CSRD$ RT SHADOW var1, var2, ...
    Takes a list of assignable variables. This forces the runtime pass to shadow the specified array.
    CSRD$ INLINE
    Force inlining of the following subroutine call.
    CSRD$ RECURSIVE INLINE
    The AssertRecursiveInline class is meant to provide a means to instruct the inliner to inline all the routine calls (subroutine and/or function calls) recursively in the following statement or loop.
    CSRD$ NO PUTGET
    CSRD$ NO INLINE
    Appears immediately before a call to a subroutine. Prevents the inlining pass from inlining the subroutine call.
    CSRD$ RUN_TIME_TEST
    The AssertRunTimeTest class is meant to instruct the run-time pass to insert code into the loop to which this assertion is attached.
    CSRD$ NO DEPENDENCE ( var1, var2, ... )
    Appears within a do-loop or block. The AssertNoDependence class is meant to provide a specific class for declaring that there is no dependence between certain variables.
    CSRD$ SIDE_EFFECT_FREE
    The AssertSideEffectFree class is meant to provide a specific class for asserting that certain routines may be called in parallel.
    CSRD$ DEP_IO
    The AssertDepIO class is meant to provide a means to record the presence of a dependence solely due to at least one I/O statement in a loop.
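
    As a sketch of how a couple of these directives are placed in user code (the loop body, array, and subroutine names below are illustrative; placement follows the descriptions in the list above):

        CSRD$ SERIAL
              DO i = 2, n
                a(i) = a(i-1)
              ENDDO
        CSRD$ NO INLINE
              CALL update(a, n)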

    How can I make Polaris aware of an intrinsic function?

    NOTE: This question is intended for developers of Polaris.

    See base/Expression/intrin.h and scanner/intrin.h. NOTE: This answer is not yet complete.