SLIDE 1) TROUBLESHOOTING

This talk is meant to help you help yourself out of trouble when using the Pittsburgh Supercomputing Center's computers and software. It focuses on problems that can occur with programs written for the Cray YMP; some of the information is useful for other situations as well.

We will talk about

* When errors occur
* What you can do to understand the error report
* Things you can do without changing your program at all
* Things that require you to alter your program
* Techniques to try to isolate the error
* Debugging programs that help you see the error as it occurs

SLIDE 2) Errors can happen anytime:

NQS:
    Wrong username or password in remote JOB file;
    Error in file transfer: wrong name, or binary file;
    Time limit exceeded;
    Memory limit exceeded;

System commands:
    Maybe your shell isn't right?
        "set -x" in the Bourne shell is "set echo" in the C shell.
    Maybe your path isn't right?
        echo $PATH
        Maybe /usr/local/bin, or other directories, aren't there.
        Try using the full name of the command, for instance:
        /usr/local/bin/ftnchek
    Interactive work on the Cray has a 4MW memory limit.

File system:
    Exceed quota in home directory;
    Use file names that are too long;
    Overwrite existing data;
    Use wrong name for file, or capitalized name;

I/O:
    That file doesn't exist;
    Print data in wrong format;

Programming:
    Illegal arithmetic;
    Uninitialized data;
    Arrays out of bounds;
    Incorrect optimization;

Subroutine interface:
    Data that is supposed to retain its value between calls;
    Data of wrong type;
    Duplicate names;
    Undeclared arrays;
    Constants treated as variables;

SLIDE 3) Make sure you understand the error message!

Try the EXPLAIN command.
Most error messages from Cray programs are now printed in a standard format:

    program-number one line message

For instance:

    cft77-521 The trip count exceeds the compiler limit

If you type

    explain cft77-521

then the Cray will print out several paragraphs, explaining that you have written an implied DO loop for a DATA statement, for which the number of iterations is too high. It even tells you that "too high" means greater than 16,777,215.

You should use EXPLAIN whenever you are not sure you understand the short version of the error message. Many times, EXPLAIN will be helpful enough that you can figure out and correct the error.

There are a few exceptions to this system. The following errors are not handled by EXPLAIN:

    "Operand range error"
    "Floating point exception"
    "Program range error"
    "Error exit"

SLIDE 4) A bad traceback

The hard errors to fix are going to occur inside programs you write. Most of these errors will make themselves obvious by causing the program to stop executing abruptly. You should pay careful attention to the traceback that is produced, and learn how to get information from it.

On the Cray, the traceback is not always produced, especially if the problem was caused by a severe memory overwrite. But supposing we see it, we can hope that it tells us the exact line where the error occurred:

    Floating point exception
    TB001 - BEGINNING OF TRACEBACK
    - $TRBK WAS CALLED BY f_sig AT 174044b (LINE NUMBER 102)
    - f_sig WAS CALLED BY __handlr AT 104765b
    - __handlr WAS CALLED BY TAXPY AT 1125d
    - TAXPY WAS CALLED BY MULJIK AT 734a (LINE NUMBER 76)
    - MULJIK WAS CALLED BY SAMPLE AT 425d (LINE NUMBER 22)
    - SAMPLE WAS CALLED BY $START$ AT 303a
    TB002 - END OF TRACEBACK
    Floating exception (core dumped)

Now the sequence of events is roughly: SAMPLE called MULJIK, which called TAXPY, where something bad happened. However, the line numbers only tell us where the calls were made.
When we read the really important line,

    - __handlr WAS CALLED BY TAXPY AT 1125d

we don't see an associated source code line number, and so we really can't point to a line of TAXPY as causing the problem. Whether or not you get this most important line number simply depends on the type of error that has occurred. Now we are stuck examining an entire subroutine!

SLIDE 5) A worse traceback

Here is an example of how little information you can get from a traceback on the Cray:

    Floating point exception
    TB001 - BEGINNING OF TRACEBACK
    - $TRBK WAS CALLED BY f_sig AT 174044b (LINE NUMBER 102)
    - f_sig WAS CALLED BY __handlr AT 104765b
    Operand range error (core dumped)

Notice that there is NO information about what routine of yours caused the trouble. You have NOTHING to go on, except that your program died.

SLIDE 6) The same problem, on a VAX/VMS

Compare the results when we run this same problem on the VAX/VMS front end:

    %SYSTEM-F-ACCVIO, access violation, reason mask=01,
    virtual address=0BB60436, PC=00300ADF, PSL=03C00020
    %TRACE-F-TRACEBACK, symbolic stack dump follows
    module name     routine name     line     rel PC      abs PC
    TAXPY           TAXPY               9     0000001F    00300ADF
    MULJIK          MULJIK             11     000000C1    00300AB5
    SAMPLE          SAMPLE             22     0000004E    0030084E

Now, what actually has happened is that TAXPY is storing values in a vector C, which is a dummy argument from MULJIK, which is a dummy argument from SAMPLE. But SAMPLE never sets aside the storage for the vector. So TAXPY is writing numbers in some arbitrary place. The VAX is much better at catching this error and pointing to the offending line, which is where TAXPY tries to store numbers in the vector C.

One simple moral from this exercise: Shop around for good error messages. If you can't get help from one computer, try running the same program on another one, and see if you get better diagnostics.

SLIDE 7) Ask the compiler for advice

Normally, the compiler only prints out errors and warnings, where it is sure that something is wrong, or undesirable.
You can ask that the compiler also print out lower priority messages, which sometimes help you to find a problem. For instance, with the default settings, the compiler will tell you if it sees a variable which is definitely used before having a value assigned to it, but it will not tell you if it sees a variable which MAY be used before having a value assigned to it. However, you can issue a command like

    cf77 -Wf"-m0" myprog.f

to get ALL messages, in which case you might get information like:

     2371   99.       WORK(J,K,2) = WORK(J,K,2) - (UU(J,K-1)+SN*CC(J,K-1))
    cft77-8128 cf77: CAUTION STEPFY, Line = 99, File = srs.f, Line = 2371
      Variable "SN" may be used before it is assigned.

SLIDE 8) Run your program with static memory

The Cray FORTRAN compiler has an unusual memory feature called "stack" memory allocation. Using this method, variables local to a subroutine are not saved in between calls to that subroutine. This feature was added to the compiler so that programs created by it would be more suitable for parallel processing.

By contrast, the static memory allocation scheme permanently sets aside a space in memory for each variable, whether "global" or "local". With static allocation, it is possible to assign a value during one call to a subroutine, and use that value on a later call.

In most cases, users should notice no difference between programs compiled with stack and static memory allocation. However, when users do have problems that can't be explained immediately from the error message, stack memory allocation is frequently the cause.

You can check this out very easily. Try compiling with the statement

    cf77 -Wf"-e v" myprog.f

If nothing gets better, you'll have to try other fixes.
But if the program is cured, then you have three choices:

* Always use the -Wf"-e v" option for static memory;
* Find the offending subroutine, and insert a SAVE statement in it, which will save the values of all local variables;
* Find the offending variable in the offending subroutine, (say "X"), and either insert a "SAVE X" statement, or rewrite the subroutine so that it does not assume that X is preserved between calls.

SLIDE 9) Run with vectorization off

In order to run your program efficiently on the Cray, the compiler analyzes your DO loops, and makes educated guesses as to what operations on vectors can be executed in a pipelined fashion. The compiler is usually very careful about this. But occasionally, it will make a mistake, causing you to get the wrong results, or causing the program to fail catastrophically.

You can tell the compiler to do no vectorization at all. You should only do this when you are debugging, because your program will run extremely slowly. You should be interested in seeing whether the problem goes away if vectorization is turned off. If the problem is caused by vectorization, it will be harder to fix than most, since your program is probably perfectly legal, and would execute correctly, except for some flaw in the vectorization portion of the compiler.

To try turning off vectorization, use the command

    cf77 -Wf"-o novector" myprog.f

SLIDE 10) Run with scalar optimization off

Scalar optimization refers to everything the compiler does to speed up your program, except for vectorization and parallelization. Most of the methods of scalar optimization are standard, and are used with all compilers. These methods include

* discarding computations whose results are never used;
* moving assignment of constants outside of DO loops;
* replacing certain exponentiations by simpler forms (X**0.5 becomes SQRT(X)).

It is highly unlikely that these changes would cause you problems.
However, to turn off scalar optimization you can type

    cf77 -Wf"-o noscalar" myprog.f

Turning off scalar optimization should not slow you down nearly as much as turning off vectorization.

SLIDE 11) Run the program with run time checking

The compiler can check that arrays don't go out of bounds, that subroutines are called with the right number and type of arguments, and that arrays are used properly in FORTRAN 90 expressions. To do all of these checks, compile with a statement like this:

    cf77 -Wf"-Rabc" myprog.f

Then, when you run the program, you can expect to see diagnostics like this:

    lib-1950 a.out: WARNING
    At line 14 in Fortran routine "TEST01", subscript value 11 is out of bounds for array 'A'.

In order to do bounds checking, however, you must be sure that arrays are declared with their true extent. In a subroutine, it is common for dummy argument arrays to be declared with size * or 1. This will make it impossible for the bounds checker to work properly.

SLIDE 12) Run your program through a FORTRAN standards checker

* cf77 -Wf"-en" myprog.f
  Reports all non-ANSI usages in the program.

* FTNCHEK
  Public domain package, can be installed anywhere. Available at PSC on UNICOS, VMS and ULTRIX.
      ftnchek -f77 myprog.f
  will report all non-ANSI usages in the program.

* TOOLPACK
  Distributed through Numerical Algorithms Group (NAG). Available at PSC on VMS only. Use the ISTPF tool to do a PFORT FORTRAN 77 check. See the example "PFORT.COM".

SLIDE 13) Make your program "orthodox"

* Don't use "reserved" names for variables: DATA, INDEX, READ, OPEN...
* Don't use out of date FORTRAN features: Assigned GO TO, Arithmetic IF, DECODE/ENCODE, Hollerith data.
* Declare the type of every variable.
* Declare the exact extent of every vector and array, particularly in subroutines.
* Use the ERR= keywords on I/O statements involving files (OPEN, READ, WRITE, CLOSE), and print a clear error message.
* Replace Hollerith constants by quoted strings.
  Store alphanumeric data in CHARACTER variables, not INTEGERS.
* End every DO loop with a CONTINUE statement. Nested DO loops should NOT share CONTINUE statements.

The TOKENS program and the TOOLPACK program ISTCN can rename any variable in a FORTRAN program.

The TOOLPACK tool ISTPL can set up a standardized declaration section in each of your routines, replace Hollerith data by CHARACTER strings, indent DO loops and IF statements, label statements in increments of 10, and put CONTINUE statements at the end of every DO loop.

SLIDE 14) Convert DOUBLE PRECISION to REAL.

Useful programs:

* CF77
  You can usually get the Cray FORTRAN compiler to convert DOUBLE PRECISION to REAL "on the fly" using the compile time switch "-dp":
      cf77 -Wf"-dp" myprog.f

* Any editor
  You can TRY to replace all DOUBLE PRECISION usages with REAL ones. But that includes function names ("DCOS") and constant formats ("1.0D-3").

* D2S
  Converts all DOUBLE PRECISION usages to REAL ones. (S2D does the reverse.)
      d2s myprog.f > newprog.f

* TOOLPACK
  The tool ISTPT converts from one precision to another. See the example "D2S.COM".

SLIDE 15) Compare the online example with your problem

If you're using a library or piece of software for which the PSC has installed an example, then perhaps you should compare your work with the example.

Is anything obviously different? Sometimes just looking at an example will make your mistake obvious, or suggest a different way of solving your problem. Or you can at least make some guesses as to why your program might be failing (the example has a square matrix, you are using a rectangular one...)

Can you run the example? Then at least you know that the software is not completely ruined. Perhaps the library is working, and it really is your fault.

Does the example fail? Then you can blame us! It's probably not your fault. But check to see whether the error that occurs when the example fails is similar to what you are seeing. And then report the problem to us.
Try to alter the online example to solve your problem. If you do this, one step at a time, you may find the crucial variable setting or misunderstanding or illegal step that is causing your problem.

SLIDE 16) Can you make a smaller test case to compile?

If you have a large program with many subroutines, it can be a real pain to run an editor, find various lines to change, and recompile. If you want a compiled listing, or if you increase the number of warning messages the compiler prints out, you can get swamped. You may also suspect beforehand that your errors are occurring in one or two routines, which you want to try to compile in a special way, while leaving the rest of the program alone.

In such a case, it's wise to try to isolate the suspected troublemaking routines, and handle them separately. One way to do this involves using the FSPLIT program, which splits a FORTRAN program up; each routine becomes a separate file. The command

    fsplit myprog.f

will create, for instance, the files "main.f", "taxpy.f", "root.f", "random.f", and so on.

You may need to put most of these files back into a single file. For instance,

    mv myprog.f myprog.for
    fsplit myprog.for
    rm myprog.for
    mv taxpy.f part1.for
    cat *.f > part2.for
    rm *.f
    mv part1.for part1.f
    mv part2.for part2.f

Now "part1.f" contains "taxpy.f", and "part2.f" contains everything else. Then I could do things like

    cf77 -c -Wf"-m 0" part1.f

which would compile just part1.f with all messages printed out.

SLIDE 17) Can you make a smaller test case to run?

If your error occurs during a job with a long time limit, or big memory requirements, you have just wasted a lot of system resources once. Trying to debug the program could cause you to waste a lot more resources, over and over.

It's very helpful to try to cut down the time and memory requirements of your test case. That way, you can try out experimental fixes fast and cheaply.
But if the error disappears in the smaller program, you need to think about what you've just changed. It may be necessary to repeat your downsizing effort one step at a time, to see when the error disappears, so you can try to understand WHY it disappears.

SLIDE 18) Try a debugger: DEBUG

If all your simple tactics fail, you might want to try using a debugger. There are several debuggers on the Cray, but we will talk about just two, a simple one named DEBUG, and a fancy one called CDBX.

To use DEBUG, you need to use the "-g" switch on the compiler in order that the names of variables be retained. (WARNING: "-g" turns off optimization! This slows down your program a lot!) Then, when the program "crashes", it will create a "core" file containing a record of the values of all the program variables (and their names).

The DEBUG program will then print out the values of the variables that were declared in the subroutine that crashed. Naturally, if the subroutine had hundreds of variables, the output can be unmanageable. Luckily, the program has a default limit on the number of entries of vectors that it will print out. Otherwise the output would be truly unmanageable.

One way to use DEBUG is as follows:

    cf77 -g myprog.f
    a.out || debug

The parallel bars mean that a.out will be run, and DEBUG will be run only if a.out terminates with an error condition.

SLIDE 19) Sample DEBUG output

    + cf77 -g dbugprb.f
    + a.out
    Operand range error
    TB001 - BEGINNING OF TRACEBACK
    - $TRBK WAS CALLED BY f_sig AT 173644b (LINE NUMBER 102)
    - f_sig WAS CALLED BY __handlr AT 105425b
    - __handlr WAS CALLED BY DBUGPRB AT 404a
    - DBUGPRB WAS CALLED BY $START$ AT 303a
    TB002 - END OF TRACEBACK
    Operand range error (core dumped)
    + debug
    ***** START OF SYMBOLIC DUMP *****
    Displaying variables for routine DBUGPRB
    A Array ( 10,10 ) of REAL
    (1,1): 1.  (2,1): 0.  (3,1): 0.  (4,1): 0.  (5,1): **********************
    (6,1): 0.  (7,1): 0.  (8,1): 0.  (9,1): 0.  (10,1): 0.
    (1,2): 2.  (2,2): 0.  (3,2): 0.  (4,2): 0.  (5,2): 0.
    (6,2): 0.
    I 1
    J 649
    K 3474872583381988662
    LIMIT 3143
    Y .57150173335061
    Z 1.5715017333506
    ***** END OF SYMBOLIC DUMP *******

Here, it turns out that A is a 10 by 10 array, and that we are indexing entries A(I,J), where J has gone out of bounds. Strangely, the program didn't fail until J reached the value 649!

DEBUG is very suitable for debugging programs that fail in a batch job.

SLIDE 20) Try a debugger: CDBX

Programmers with a UNIX background will be interested in the Cray version of the DBX program, called CDBX. CDBX is intended as an interactive debugger, and is best suited for X Windows output, although a command line option is also available. You can type

    cdbx -L

to run the command line interface, or to run the X window version, type

    cdbx -display DISPLAY_NAME

CDBX will look for a CORE file and an A.OUT file in the current directory, and assumes that these are from the program to be debugged.

Once you begin running CDBX, you type commands, or select them with the mouse, including:

    run              - Begins execution of the program, proceeds up to error.
    cont             - Resumes execution, starting with current statement.
    next             - Executes up to the next statement.
    return           - Continues program execution up to a return from a subroutine.
    step             - Executes the next statement.
    stop at #        - Stops (waits for user command) at line number #.
    trace            - Prints out each line as it is executed. (Can be slow!)
    status           - Prints out list of current "events".
    delete #         - Removes event number #. (How you can stop TRACE)
    up               - Moves up the calling tree.
    down             - Moves down the calling tree.
    func SUB         - Moves to the subroutine or function named SUB.
    &VAR,&VAR+I /f   - Prints out memory locations corresponding to variable VAR, and the next I entries, using floating point format.
    list I,J         - Prints lines I through J of current file.
    show vars        - Prints names of all local variables.
    dump             - Prints values of all local variables.
    assign VAR=VALUE - Changes the value of a variable.
    quit             - Exits the CDBX session.
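
Both debuggers are normally brought in only after a run has failed, and the "a.out || debug" idiom from the DEBUG slide depends on how the shell treats exit status. The following is a minimal sketch of that behavior in a Bourne-style shell. The functions run_program and run_debugger are hypothetical stand-ins for a.out and debug, which are not assumed to be available here.

```shell
#!/bin/sh
# Hypothetical stand-ins: on the Cray these would be a.out and debug.
run_program() { return "$1"; }               # finishes with the status passed in
run_debugger() { echo "debugger invoked"; }  # pretends to be the debugger

# Successful run: status is 0, so the right-hand side of || is skipped.
run_program 0 || run_debugger

# Failed run: status is nonzero, so the debugger stand-in is started.
run_program 1 || run_debugger
```

Only the second line of the pair prints anything. A program that is killed by a signal (for example, one that dumps core) is also reported to the shell with a nonzero status, which is why the idiom catches crashes as well as ordinary error exits.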
SLIDE 21) Prepare your report to User Services

If you are running the program interactively, you should try to turn it into a batch job if possible. This will help you by recording clearly the sequence of commands that were used, and every message that was printed out. By the way, you should include the line

    set echo

in your job file, if you use the C shell, or if you use the Bourne shell,

    set -x

This will cause the commands to be printed, with a "+" sign marking them.

Since the consultant will be unfamiliar with what you are doing, it is helpful to comment your job, explaining what is going on, what files are created, where the error occurs and so on. Comments in a JOB file begin with a "#" in column 1:

    #
    # The program ROOT is supposed to create the file "results.dat"
    #
    root
    #
    # But when I take a directory listing, it doesn't show up!
    #
    ls results.dat

SLIDE 22) Documents:

The PSC provides online documents and brief MAN pages and examples for many of the topics discussed here. In particular, you might like to look at some of the following documents:

    BOUNDS.DOC   - Information about checking out of bounds arrays.
    CDBX.DOC     - Cray interactive debugger.
    CF77.DOC     - Cray FORTRAN compiler.
    D2S.DOC      - Double to single precision converter.
    DEBUG.DOC    - Cray debugger.
    FORTRAN.DOC  - Discusses some topics in FORTRAN.
    FSPLIT.DOC   - Splits a FORTRAN program up into individual modules.
    FTNCHEK.DOC  - FORTRAN program checker.
    REMARKS.DOC  - About the PSC mail address REMARKS, for error reporting.
    TOKENS.DOC   - Program for finding or renaming a variable.
    TOOLPACK.DOC - Powerful set of tools to maintain FORTRAN programs.

SLIDE 23) Exercises:

There are four sample programs available for you to practice on. The examples are available in the FORTRAN examples area, as bad1.f, bad2.f, bad3.f and bad4.f. Each of these programs has a bug which causes it to run incorrectly, or to abort, on the Cray. Can you find and fix the problems?
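
When you write up one of these exercises, or a real problem, the "set -x" advice from the slide on preparing a report is worth trying first on a harmless command. The following is a minimal sketch, assuming a Bourne-style shell is available as "sh"; the echoed text is only a placeholder for a real program such as ROOT. In such shells the trace is written to standard error, so the sketch merges the two streams to keep commands and output together in one log.

```shell
#!/bin/sh
# Run a one-command "job" in a child Bourne shell with tracing on.
# set -x writes each command to standard error, prefixed with "+",
# so 2>&1 merges the trace with the normal output into one log.
log=$(sh -c 'set -x; echo "running root"' 2>&1)

# Show the combined log: a "+" line for the command, then its output.
printf '%s\n' "$log"
```

Whether the traced command is shown with or without quotes around its argument varies from shell to shell, so treat the exact form of the "+" line as system dependent.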