SLIDE 1) TROUBLESHOOTING


This talk is meant to help you help yourself out of trouble when using
the Pittsburgh Supercomputing Center's computers and software.

This talk focuses on problems that can occur with programs written
for the Cray YMP; some of the information is useful for other situations
as well.

We will talk about 

  * When errors occur

  * What you can do to understand the error report

  * Things you can do without changing your program at all

  * Things that require you to alter your program

  * Techniques to try to isolate the error

  * Debugging programs that help you see the error as it occurs


SLIDE 2) Errors can happen anytime:


  NQS:
    Wrong username or password in remote JOB file;
    Error in file transfer: wrong name, or binary file;
    Time limit exceeded;
    Memory limit exceeded;

  System commands:
    Maybe your shell isn't right?
      "set -x" in the Bourne shell is "set echo" in the C shell.
    Maybe your path isn't right?
      echo $PATH
        maybe /usr/local/bin, or other directories, aren't there.
      try using full name of command, for instance:
        /usr/local/bin/ftnchek
    Interactive work on Cray has 4MW memory limit.

  File system:
    Exceed quota in home directory;
    Use file names that are too long;
    Overwrite existing data;
    Use wrong name for file, or capitalized name;

  I/O:
    That file doesn't exist;
    Print data in wrong format;

  Programming:
    Illegal arithmetic;
    Uninitialized data;
    Arrays out of bounds;
    Incorrect optimization;
    Subroutine interface:
      Data that is supposed to retain its value between calls;
      Data of wrong type;
      Duplicate names;
      Undeclared arrays;
      Constants treated as variables;

  
SLIDE 3) Make sure you understand the error message!  


Try the EXPLAIN command.  Most error messages from Cray programs are now 
printed in a standard format:

  program-number     one line message

For instance:

  cft77-521 The trip count exceeds the compiler limit

If you type

  explain cft77-521

then the Cray will print out several paragraphs, explaining that you have
written an implied DO loop for a DATA statement, for which the number of
iterations is too high.  It even tells you that "too high" means greater
than 16,777,215.  
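
For reference, an implied DO loop in a DATA statement looks like the
following minimal sketch.  The program name DATADO, the array V and the
size N are made up for illustration.  The trip count here is only 100,
far below the limit that EXPLAIN quotes.

```fortran
      PROGRAM DATADO
      INTEGER I, N
      PARAMETER (N = 100)
      REAL V(N)
C     An implied DO loop in a DATA statement.  Its trip count is N.
      DATA (V(I), I = 1, N) / N * 0.0 /
      WRITE (*,*) 'V(1), V(N) = ', V(1), V(N)
      END
```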

You should use EXPLAIN if you don't think you understand the short
version of the error message.  Many times, EXPLAIN will be helpful enough
that you can figure out and correct the error.

There are a few exceptions to this system.  The following errors are not
handled by EXPLAIN:

  "Operand range error"
  "Floating point exception"
  "Program range error"
  "Error exit"


SLIDE 4) A bad traceback


The hard errors to fix are going to occur inside of programs you write.
And most of your errors will make themselves obvious by causing the program
to abruptly stop executing.  You should pay careful attention to the traceback
that is produced, and learn how to get information from it.

On the Cray, the traceback is not always produced, especially if the 
problem was caused by a severe memory overwrite.  But supposing we see it,
we can hope that it tells us the exact line where the error occurred:

Floating point exception
TB001 - BEGINNING OF TRACEBACK
      - $TRBK    WAS CALLED BY f_sig    AT  174044b (LINE NUMBER      102)
      - f_sig    WAS CALLED BY __handlr AT  104765b 
      - __handlr WAS CALLED BY TAXPY    AT    1125d 
      - TAXPY    WAS CALLED BY MULJIK   AT     734a (LINE NUMBER       76)
      - MULJIK   WAS CALLED BY SAMPLE   AT     425d (LINE NUMBER       22)
      - SAMPLE   WAS CALLED BY $START$  AT     303a 
TB002 - END OF TRACEBACK
Floating exception (core dumped)

Now the sequence of events is roughly:  SAMPLE called MULJIK called TAXPY,
where something bad happened.  However, the line numbers only tell us
about where the calls were made.  When we read the really important line,

      - __handlr WAS CALLED BY TAXPY    AT    1125d 

we don't see an associated source code line number, and so we really
can't point to a line of TAXPY as causing the problem.  Whether or not
you get this most important line number simply depends on the type of
error that has occurred.  Now we are stuck examining an entire subroutine!


SLIDE 5) A worse traceback


Here is an example of how little information you can get from a traceback
on the Cray:

Floating point exception
TB001 - BEGINNING OF TRACEBACK
      - $TRBK    WAS CALLED BY f_sig    AT  174044b (LINE NUMBER      102)
      - f_sig    WAS CALLED BY __handlr AT  104765b 
Operand range error (core dumped)

Notice that there is NO information about what routine of yours caused
the trouble.  You have NOTHING to go on, except that your program died.


SLIDE 6) The same problem, on a VAX/VMS


Compare the results when we run this same problem on the VAX/VMS front end:

%SYSTEM-F-ACCVIO, access violation, reason mask=01, virtual address=0BB60436, 
PC=00300ADF, PSL=03C00020

%TRACE-F-TRACEBACK, symbolic stack dump follows

module name     routine name                     line       rel PC    abs PC

TAXPY           TAXPY                               9      0000001F  00300ADF
MULJIK          MULJIK                             11      000000C1  00300AB5
SAMPLE          SAMPLE                             22      0000004E  0030084E

Now, what actually has happened is that TAXPY is storing values in a vector
C, which is a dummy argument from MULJIK, which is a dummy argument from SAMPLE.
But SAMPLE never sets aside the storage for the vector.  So TAXPY is writing
numbers in some arbitrary place.  The VAX is much better at catching this
error and pointing to the offending line, which is where TAXPY tries to store
numbers in the vector C.
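
The source of the failing program is not shown in this talk, so the
following is only a hypothetical reconstruction of the corrected version.
Only the routine names SAMPLE, MULJIK and TAXPY and the name C come from
the traceback; the arithmetic, the sizes and the argument lists are
assumptions.  The point is the declaration of C in SAMPLE, which is what
sets aside the storage that TAXPY later writes into.

```fortran
      PROGRAM SAMPLE
C     Hypothetical reconstruction.  Only the routine names and the
C     name C are taken from the talk.
      INTEGER N
      PARAMETER (N = 3)
      INTEGER I, J
      REAL A(N,N), B(N,N)
C     The faulty version would lack this declaration, so that no
C     storage is set aside for C in the calling program.
      REAL C(N,N)
      DO 20 J = 1, N
        DO 10 I = 1, N
          A(I,J) = REAL(I+J)
          B(I,J) = 0.0
          C(I,J) = 0.0
   10   CONTINUE
        B(J,J) = 1.0
   20 CONTINUE
      CALL MULJIK(N, A, B, C)
      WRITE (*,*) 'C(N,N) = ', C(N,N), ' expected ', A(N,N)
      END

      SUBROUTINE MULJIK(N, A, B, C)
C     Adds the product A*B to C, one column of C at a time.
      INTEGER N, J, K
      REAL A(N,N), B(N,N), C(N,N)
      DO 20 J = 1, N
        DO 10 K = 1, N
          CALL TAXPY(N, B(K,J), A(1,K), C(1,J))
   10   CONTINUE
   20 CONTINUE
      END

      SUBROUTINE TAXPY(N, T, X, Y)
C     Y = Y + T*X.  The store into Y is the kind of statement the
C     VAX traceback pointed to.
      INTEGER N, I
      REAL T, X(N), Y(N)
      DO 10 I = 1, N
        Y(I) = Y(I) + T*X(I)
   10 CONTINUE
      END
```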

One simple moral from this exercise: 

  Shop around for good error messages.  If you can't get help from one
  computer, try running the same program on another one, and see if you
  get better diagnostics.


SLIDE 7) Ask the compiler for advice


Normally, the compiler only prints out errors and warnings, where it is
sure that something is wrong, or undesirable.  You can ask that the compiler
also print out lower priority messages, which sometimes help you to find
a problem.

For instance, with the default settings, the compiler will tell you if
it sees a variable which is definitely used before having a value assigned to 
it, but it will not tell you if it sees a variable which MAY be used before
having a value assigned to it.

However, you can issue a command like

  cf77 -Wf"-m0" myprog.f

to get ALL messages, in which case you might get information like:


   2371     99.           WORK(J,K,2) = WORK(J,K,2) - (UU(J,K-1)+SN*CC(J,K-1))
 cft77-8128 cf77: CAUTION STEPFY, Line = 99, File = srs.f, Line = 2371
   Variable "SN" may be used before it is assigned.
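
A typical way to earn a "may be used before it is assigned" message is to
assign a variable on only one branch of an IF.  The following small
program is an illustrative sketch, not taken from srs.f; the names MAYUSE
and X are made up, and SN is borrowed from the message above.  It shows
the repaired form, in which every path assigns SN before it is used.

```fortran
      PROGRAM MAYUSE
      REAL X, SN
      X = -2.0
C     Without the ELSE branch, SN would be assigned only when X is
C     positive, and the WRITE below might use an undefined value.
      IF (X .GT. 0.0) THEN
        SN = 1.0
      ELSE
        SN = -1.0
      END IF
      WRITE (*,*) 'SN = ', SN
      END
```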


SLIDE 8) Run your program with static memory


The Cray FORTRAN compiler has an unusual memory feature called "stack" memory
allocation.  Using this method, variables local to a subroutine are not
saved in between calls to that subroutine.  This feature was added to the 
compiler so that programs created by it would be more suitable for
parallel processing.  

By contrast, the static memory allocation scheme permanently sets aside a
space in memory for each variable, whether "global" or "local".  It is then
possible to assign a value during one call to a subroutine, and use that
value on a later call.

In most cases, users should notice no difference between programs compiled with
stack and static memory allocation.

However, when users do have problems that can't be explained immediately
from the error message, stack memory allocation frequently turns out to be
involved: the program relies on a local variable keeping its value between
calls, and with stack allocation it does not.

You can check this out very easily.  Try compiling with the command

  cf77 -Wf"-e v" myprog.f

If nothing gets better, you'll have to try other fixes.  But if the program
is cured, then you have three choices:

  Always use the -Wf"-e v" option for static memory;

  Find the offending subroutine, and insert a SAVE statement in it, which
  will save the values of all local variables;

  Find the offending variable in the offending subroutine, (say "X"), and 
  either insert a "SAVE X" statement, or rewrite the subroutine so that it 
  does not assume that X is preserved between calls.
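
As a minimal sketch of the SAVE fix, here is a subroutine that counts how
many times it has been called.  The names KEEP, TALLY and NCALL are made
up for illustration.  Without the SAVE statement, the value of NCALL is
not guaranteed to survive from one call to the next.

```fortran
      PROGRAM KEEP
      INTEGER I
      DO 10 I = 1, 3
        CALL TALLY
   10 CONTINUE
      END

      SUBROUTINE TALLY
      INTEGER NCALL
C     SAVE makes NCALL retain its value between calls, and DATA
C     gives it its starting value.
      SAVE NCALL
      DATA NCALL / 0 /
      NCALL = NCALL + 1
      WRITE (*,*) 'TALLY has been called ', NCALL, ' times'
      END
```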


SLIDE 9) Run with vectorization off


In order to run your program efficiently on the Cray, the compiler analyzes
your DO loops, and makes educated guesses as to what operations on vectors
can be executed in a pipelined fashion.

The compiler is usually very careful about this.  But occasionally, it
will make a mistake, causing you to get the wrong results, or causing the
program to fail catastrophically.

You can tell the compiler to do no vectorization at all.  You should only
do this when you are debugging, because your program will run extremely
slowly.  You should be interested in seeing whether the problem goes
away if vectorization is turned off.  

If the problem is caused by vectorization, it will be harder to fix
than most, since your program is probably perfectly legal, and would execute
correctly, except for some flaw in the vectorization portion of the compiler.

To try turning off vectorization, use the command

  cf77 -Wf"-o novector" myprog.f
  

SLIDE 10) Run with scalar optimization off


Scalar optimization refers to everything the compiler does to speed up
your program, except for vectorization and parallelization.
 
Most of the methods of scalar optimization are standard, and are used with all 
compilers.  These methods include 

  discarding computations whose results are never used;
  moving assignment of constants outside of DO loops; 
  replacing certain exponentiations by simpler forms (X**0.5 becomes SQRT(X)). 

It is highly unlikely that these changes would cause you problems.  However, 
to turn off scalar optimization you can type

  cf77 -Wf"-o noscalar" myprog.f

Turning off scalar optimization should not slow you down nearly as much
as turning off vectorization.
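
To make two of these transformations concrete, here is a hand-written
illustration.  It is only a sketch of the idea, not the output of any
compiler, and the names SCALAR, X, Y and C are made up.  The comments show
the loop body as a programmer might write it; the executable statements
show an equivalent form of the kind an optimizer may produce.

```fortran
      PROGRAM SCALAR
      INTEGER I, N
      PARAMETER (N = 4)
      REAL X(N), Y(N), C
C     As written by the programmer, the loop body might read:
C         C = 2.0
C         Y(I) = C * X(I)**0.5
C     An equivalent form assigns the constant once, outside the
C     loop, and replaces the exponentiation by SQRT.
      C = 2.0
      DO 10 I = 1, N
        X(I) = REAL(I*I)
        Y(I) = C * SQRT(X(I))
   10 CONTINUE
      WRITE (*,*) (Y(I), I = 1, N)
      END
```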


SLIDE 11) Run the program with run time checking


The compiler can check that arrays don't go out of bounds, that subroutines
are called with the right number and type of arguments, and that arrays
are used properly in FORTRAN 90 expressions.  To do all of these checks,
compile with a command like this:

  cf77 -Wf"-Rabc" myprog.f

Then, when you run the program, you can expect to see diagnostics like this:
 
  lib-1950 a.out: WARNING 
    At line 14 in Fortran routine "TEST01", subscript value 11
    is out of bounds for array 'A'.

In order to do bounds checking, however, you must be sure that arrays
are declared with their true extent.  In a subroutine, it is common for
dummy argument arrays to be declared with size * or 1.  This will make
it impossible for the bounds checker to work properly.
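
A minimal sketch of a declaration that the bounds checker can work with
follows.  The names BNDS and FILL are made up for illustration.  The array
size is passed as an argument, and the dummy array is declared with that
size rather than with * or 1.

```fortran
      PROGRAM BNDS
      INTEGER N
      PARAMETER (N = 10)
      REAL A(N)
      CALL FILL(N, A)
      WRITE (*,*) 'A(N) = ', A(N)
      END

      SUBROUTINE FILL(N, A)
      INTEGER N, I
C     Declaring A(N) rather than A(*) or A(1) gives the bounds
C     checker the true extent to test subscripts against.
      REAL A(N)
      DO 10 I = 1, N
        A(I) = REAL(I)
   10 CONTINUE
      END
```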
  

SLIDE 12) Run your program through a FORTRAN standards checker


  * cf77 -Wf"-en" myprog.f
      Reports all non-ANSI usages in program.

  * FTNCHEK
      Public domain package, can be installed anywhere.
      Available at PSC on UNICOS, VMS and ULTRIX.

      ftnchek -f77 myprog.f
        will report all non-ANSI usages in program.

  * TOOLPACK 
      Distributed through Numerical Algorithms Group (NAG).
      Available at PSC on VMS only.  
      Use the ISTPF tool to do a PFORT FORTRAN 77 check.  
        See the example "PFORT.COM".


SLIDE 13) Make your program "orthodox"


  * Don't use "reserved" names for variables: DATA, INDEX, READ, OPEN...

  * Don't use out of date FORTRAN features:
    Assigned GO TO, Arithmetic IF, DECODE/ENCODE, Hollerith data.

  * Declare the type of every variable.

  * Declare the exact extent of every vector and array, particularly
    in subroutines.

  * Use the ERR= keyword on I/O statements involving files (OPEN, READ, 
    WRITE, CLOSE), and print a clear error message.  

  * Replace Hollerith constants by quoted strings.  
    Store alphanumeric data in CHARACTER variables, not INTEGERS.

  * End every DO loop with a CONTINUE statement.  
    Nested DO loops should NOT share CONTINUE statements.
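
As a sketch of the ERR= advice, the following program tries to open an
existing file and prints a clear message if it cannot.  The program name
OPENCK, the unit number 10, the statement label 900 and the file name are
arbitrary choices for illustration.  The IOSTAT= keyword is an addition
not mentioned above; it is standard FORTRAN 77 and lets the message
include the error code.

```fortran
      PROGRAM OPENCK
      INTEGER IOS
C     ERR= sends control to statement 900 if the OPEN fails.
C     IOSTAT= stores the error code in IOS for the message.
      OPEN (UNIT=10, FILE='results.dat', STATUS='OLD',
     &      IOSTAT=IOS, ERR=900)
      WRITE (*,*) 'results.dat opened successfully'
      CLOSE (UNIT=10)
      STOP
  900 CONTINUE
      WRITE (*,*) 'ERROR: cannot open results.dat, IOSTAT = ', IOS
      END
```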

The TOKENS program and the TOOLPACK program ISTCN can rename any variable in 
a FORTRAN program.

The TOOLPACK tool ISTPL can set up a standardized declaration section
in each of your routines, replace Hollerith data by CHARACTER strings,
indent DO loops and IF statements, label statements in increments of 10,
and put CONTINUE statements at the end of every DO loop.


SLIDE 14) Convert DOUBLE PRECISION to REAL.


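On the Cray, a REAL variable already occupies a 64-bit word, which is the
size of a DOUBLE PRECISION variable on most other machines.  Programs 
brought over from such machines usually do not need DOUBLE PRECISION here.

Useful programs: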

  * CF77
    You can usually get the Cray FORTRAN compiler to convert DOUBLE
    PRECISION to REAL "on the fly" using the compile time switch "-dp":

      cf77 -Wf"-dp" myprog.f

  * Any editor
      You can TRY to replace all DOUBLE PRECISION usages with REAL ones.
      But you must also change function names ("DCOS" becomes "COS") and 
      constants ("1.0D-3" becomes "1.0E-3").

  * D2S
      converts all DOUBLE PRECISION usages to REAL ones.  
      (S2D does the reverse)
      d2s myprog.f > newprog.f

  * TOOLPACK
      The tool ISTPT converts from one precision to another.
      See the example "D2S.COM"


SLIDE 15) Compare the online example with your problem


If you're using a library or piece of software for which the PSC has installed
an example, then perhaps you should compare your work with the example.


Is anything obviously different?  Sometimes just looking at an example
will make your mistake obvious, or suggest a different way of solving
your problem.  Or you can at least make some guesses as to why your
program might be failing (the example has a square matrix, you are using a 
rectangular one...)


Can you run the example?  Then at least you know that the software is not
completely ruined.  Perhaps the library is working, and it really is your
fault.

Does the example fail?  Then you can blame us!  It's probably not your fault.
But check to see whether the error that occurs when the example fails is
similar to what you are seeing.  And then report the problem to us.


Try to alter the online example to solve your problem.  If you do this, one
step at a time, you may find the crucial variable setting or misunderstanding
or illegal step that is causing your problem.


SLIDE 16) Can you make a smaller test case to compile?


If you have a large program with many subroutines, it can be a real pain
to run an editor, find various lines to change, and recompile.  If you
want a compiled listing, or if you increase the amount of warning messages
the compiler prints out, you can get swamped.

You may also suspect beforehand that your errors are occurring in one or two
routines, which you want to try to compile in a special way, while leaving
the rest of the program alone.

In such a case, it's wise to try to isolate the suspected troublemaking
routines, and handle them separately.  One way to do this involves using
the FSPLIT program, which splits a FORTRAN program up; each routine becomes
a separate file.  

The command

  fsplit myprog.f

will create, for instance, the files "main.f", "taxpy.f", "root.f", "random.f",
and so on.  

You may need to put most of these files back into a single file.  For instance,

  mv myprog.f myprog.for
  fsplit myprog.for
  rm myprog.for

  mv taxpy.f part1.for
  cat *.f > part2.for
  rm *.f
 
  mv part1.for part1.f
  mv part2.for part2.f

Now "part1.f" contains the routine TAXPY, and "part2.f" contains everything 
else.  Then you could do things like

  cf77 -c -Wf"-m 0" part1.f

which would compile just part1.f with all messages printed out.


SLIDE 17) Can you make a smaller test case to run?


If your error occurs during a job with a long time limit, or big memory
requirements, you have just wasted a lot of system resources once.  Trying
to debug the program could cause you to waste a lot more resources, over
and over.

It's very helpful to try to cut down the time and memory requirements of
your test case.  That way, you can try out experimental fixes fast and
cheaply.

But if the error disappears in the smaller program, you need to think
about what you've just changed.  It may be necessary to repeat your downsizing
effort one step at a time, to see when the error disappears, so you
can try to understand WHY it disappears.


SLIDE 18) Try a debugger: DEBUG


If all your simple tactics fail, you might want to try using a debugger.
There are several debuggers on the Cray, but we will talk about just two,
a simple one named DEBUG, and a fancy one called CDBX.

To use DEBUG, you need to use the "-g" switch on the compiler in order 
that the names of variables be retained.  (WARNING: "-g" turns off 
optimization!  This slows down your program a lot!)  Then, when the program 
"crashes", it will create a "core" file containing a record of the values of 
all the program variables (and their names).

The DEBUG program will then print out the values of the variables that
were declared in the subroutine that crashed.

Naturally, if the subroutine had hundreds of variables, the output
can be unmanageable.  

Luckily, the program has a default limit on the number of entries of vectors 
that it will print out.  Otherwise the output would be truly unmanageable.

One way to use DEBUG is as follows:

  cf77 -g myprog.f
  a.out || debug

The double vertical bars "||" mean that a.out will be run, and DEBUG will be run only if 
a.out terminates with an error condition.


SLIDE 19) Sample DEBUG output


+ cf77 -g dbugprb.f
+ a.out
Operand range error
TB001 - BEGINNING OF TRACEBACK
      - $TRBK    WAS CALLED BY f_sig    AT  173644b (LINE NUMBER      102)
      - f_sig    WAS CALLED BY __handlr AT  105425b 
      - __handlr WAS CALLED BY DBUGPRB  AT     404a 
      - DBUGPRB  WAS CALLED BY $START$  AT     303a 
TB002 - END OF TRACEBACK
Operand range error (core dumped)
+ debug
***** START OF SYMBOLIC DUMP *****

  Displaying variables for routine DBUGPRB

    A                         Array ( 10,10 ) of REAL
      (1,1):           1.                      (2,1):           0.
      (3,1):           0.                      (4,1):           0.
      (5,1):           **********************  (6,1):           0.
      (7,1):           0.                      (8,1):           0.
      (9,1):           0.                      (10,1):          0.
      (1,2):           2.                      (2,2):           0.
      (3,2):           0.                      (4,2):           0.
      (5,2):           0.                      (6,2):           0.
    I                1                          
    J                649                        
    K                3474872583381988662       
    LIMIT            3143                      
    Y                .57150173335061
    Z                1.5715017333506

***** END OF SYMBOLIC DUMP *******

Here, it turns out that A is a 10 by 10 array, and that we are indexing entries
A(I,J), where J has gone out of bounds.  Strangely, the program didn't fail 
until J reached the value 649!  

DEBUG is very suitable for debugging programs that fail in a batch job.


SLIDE 20) Try a debugger: CDBX


Programmers with a UNIX background will be interested in the Cray version of
the DBX program, called CDBX.  CDBX is intended as an interactive debugger,
and is best suited for X Windows output, although a command line option
is also available.

You can type

  cdbx -L

to run the command line interface, or to run the X window version, type

  cdbx -display DISPLAY_NAME

CDBX will look for a "core" file and an "a.out" file in the current directory,
and assume that these are from the program to be debugged.

Once you begin running CDBX, you type commands, or select them with the mouse,
including:

  run       - Begins execution of the program, proceeds up to error.
  cont      - Resume execution, starting with current statement.
  next      - Executes up to the next statement, stepping over any calls.
  return    - Continues program execution up to a return from a subroutine.
  step      - Executes the next statement, stepping into any routine called.
  stop at # - stop (wait for user command) at line number #.
 
  trace     - prints out each line as it is executed.  (Can be slow!)
  status    - prints out list of current "events".
  delete #  - removes event number #.  (How you can stop TRACE)

  up        - moves up the calling tree.
  down      - moves down the calling tree.
  func SUB  - moves to the subroutine or function named SUB.

  &VAR,&VAR+I /f 
            - Print out memory locations corresponding to variable VAR,
              and the next I entries, using floating point format. 

  list I,J  - Prints lines I through J of current file.
  show vars - Prints names of all local variables.
  dump      - Prints values of all local variables.

  assign VAR=VALUE   
            - Changes the value of a variable.

  quit      - exits the CDBX session.


SLIDE 21) Prepare your report to User Services


If you are running the program interactively, you should try to turn it
into a batch job if possible.  This will help you by recording clearly
the sequence of commands that were used, and every message that was
printed out.  

By the way, you should include the line 

  set echo

in your job file, if you use the C shell, or if you use the Bourne shell,

  set -x

This will cause the commands to be printed, with a "+" sign marking them.

Since the consultant will be unfamiliar with what you are doing, it is
helpful to comment your job, explaining what is going on, what files
are created, where the error occurs and so on.  Comments in a JOB
file begin with a "#" in column 1:

  # 
  # The program ROOT is supposed to create the file "results.dat"
  #
  root
  #
  # But when I take a directory listing, it doesn't show up!
  #
  ls results.dat


SLIDE 22) Documents:


The PSC provides online documents and brief MAN pages and examples for many of 
the topics discussed here.  In particular, you might like to look at some
of the following documents:

  BOUNDS.DOC   - Information about checking out of bounds arrays.
  CDBX.DOC     - Cray interactive debugger.
  CF77.DOC     - Cray FORTRAN compiler.
  D2S.DOC      - Double to single precision converter.
  DEBUG.DOC    - Cray debugger.
  FORTRAN.DOC  - Discusses some topics in FORTRAN.
  FSPLIT.DOC   - Splits a FORTRAN program up into individual modules.
  FTNCHEK.DOC  - FORTRAN program checker.
  REMARKS.DOC  - About the PSC mail address REMARKS, for error reporting.
  TOKENS.DOC   - Program for finding or renaming a variable.
  TOOLPACK.DOC - Powerful set of tools to maintain FORTRAN programs.


SLIDE 23) Exercises:


There are four sample programs available for you to practice on.

The examples are available in the FORTRAN examples area, as 
bad1.f, bad2.f, bad3.f and bad4.f.

Each of these programs has a bug which causes it to run incorrectly, or
to abort, on the Cray.  Can you find and fix the problems?