CMSSL.TALK                                         16 September 1991

Introduction to the Connection Machine Scientific Software Library

 1) Why was CMSSL created?
 2) How was CMSSL designed?
 3) The contents of the library
 4) CMSSL routine naming conventions
 5) Setup/deallocate and save/restore features
 6) Other features of the library
 7) Multiple Instances
 8) Finding a parallel algorithm
 9) How to write a program that uses CMSSL
10) LINSOL: A sample program calling CMSSL
11) How to compile and run LINSOL with CMSSL
12) Timings
13) Documentation
14) Exercises
      Login
      Get help and documentation on CMSSL
      Run the EXAMPLES program
      Run an example of CMSSL usage
      Run the timing example
      Write your own application

1) Why was CMSSL created?

CMSSL was created to make the Connection Machine easier to use, by
providing correct and efficient code.

User programs are not easy to port to the CM2; code generally must be
written from scratch.  Having a library of common routines means
there's less to create.

It is easy to use the CM2 incorrectly, so that a program that could
run in parallel instead runs sequentially, and hence slowly and
inefficiently.  Library routines can be written so that they are
guaranteed to run in parallel.

TMC hopes to set a standard for parallel processing libraries.  They
have been developing a linear algebra kernel, and publishing a great
many papers about their algorithms, in the hope of influencing the
design of other systems.  If you are familiar with the benefits of
the work done on the BLAS and LINPACK libraries, you will see why
this is an important goal.

2) How was CMSSL designed?

The designers of the library emphasized the following points:

The library must be callable from all CM2 languages:
    Standard languages with the Paris interface:
        C/Paris, FORTRAN/Paris, LISP/Paris
    CM languages:
        C* (not available!), CM Fortran, *LISP

The library must be usable on all CM hardware:
    CM1, CM2, future CM3; 32 bit or 64 bit; with or without Weitek
    floating point processors; fieldwise or slicewise.

The library contains optimal parallel algorithms for solving problems.

3) The contents of the library

CMSSL has routines for the following tasks:

    Broadcast of data from CM arrays to all processors;
    Fast Fourier transforms (powers of 2 only);
    Histogramming (data counting) routines;
    Linear equation solution, square matrix A, A*x=b:
        for a dense matrix A, using the Gauss-Jordan or QR algorithm;
        for a triangular matrix A;
        for a tridiagonal matrix A, using cyclic reduction;
    Matrix inversion;
    Matrix-matrix multiplication A*B;
    Matrix-vector multiplication, A*x or x*A;
    Multiple shifts, using NEWS communication where possible;
    Outer product of two vectors, A=x*transpose(y);
    QR factorization and solution of rectangular linear systems;
    Random number generation (uniform);
    Sparse matrix block operations;
    Sparse matrix gather/scatter;
    Sparse matrix vector product.

This comes to roughly 50 routines.  By contrast, the Cray SCILIB has
something like 500 routines, including the standard BLAS, LINPACK,
EISPACK and LAPACK libraries.

Routines and topics that have been mentioned for future inclusion:

    Basic Linear Algebra Subprograms (BLAS);
    Eigenvalue routine (using the Jacobi method);
    Singular Value Decomposition;
    Banded linear equation solver;
    Iterative linear equation solver;
    Lanczos eigenvalue routine;
    General FFT routine.
Useful topics that haven't been considered at all include:

    Finite elements;
    Interpolation (fitting curves to data);
    Nonlinear algebraic equations;
    Ordinary differential equations;
    Quadrature (approximate integration);
    Special functions.

4) CMSSL routine naming conventions

CMSSL routines have long names.  Where matrices occur, the LAPACK
naming convention is used:

    GEN  for a general matrix
    HEP  for a hermitian packed matrix
    HER  for a hermitian matrix
    SYM  for a symmetric matrix
    SYP  for a symmetric packed matrix

Where matrix operations occur, the following words may occur as part
of the name of the routine:

    FACTOR  if the routine factors the matrix (LU or QR);
    INVERT  if the routine computes the inverse matrix;
    MULT    if the routine multiplies the matrix by a vector or matrix;
    SOLVE   if the routine solves a related linear system.

Sample CMSSL routine names include:

    GEN_GJ_INVERT        Invert a general matrix, Gauss-Jordan algorithm
    GEN_TRIANG_SOLVE     Solve a general triangular linear system
    INITIALIZE_FAST_RNG  Initialize the fast random number generator

The CMSSL FORTRAN interface allows the user to call the same routine
name for all data types that the routine can handle.  However, a user
may also use the CMSSL Paris interface, from a Paris program, a C
program, or even from FORTRAN.  In that case, slightly different
versions of the same routine are offered, depending on the data type.
In particular, a FORTRAN program might use this call:

    CALL GEN_MATRIX_VECTOR_MULT(...)

for REAL, COMPLEX or DOUBLE PRECISION data, because there is a single
FORTRAN interface to the following Paris routines:

    CMSSL_f_gen_matrix_mult_a_1L(...)   REAL or DOUBLE PRECISION data
    CMSSL_c_gen_matrix_mult_a_1L(...)   COMPLEX data

In languages other than FORTRAN, users would call the appropriate
Paris version directly.

5) Setup/deallocate and save/restore features

When you call a CMSSL routine, a lot of things happen "behind your
back".  Some routines, for instance, have to allocate memory; create,
lay out, and initialize data; choose communications paths to the
individual processors; and so on.  In many cases, the use of a
subroutine requires that a "setup" happen once, after which the
actual subroutine can be called repeatedly, and after which the
memory allocated for the subroutine may be safely released.  In such
cases the CMSSL library typically provides a SETUP routine and a
DEALLOCATE SETUP routine.

For instance, the ALL_TO_ALL routine is used to make the data in an
array "hop" around through every possible position.  Before calling
ALL_TO_ALL, you must call ALL_TO_ALL_SETUP.  Among other things, the
SETUP routine figures out how many hops will be needed, and how they
should be arranged to avoid data collisions.  After setup, you may
call ALL_TO_ALL repeatedly, with each call causing another hop.  When
you're done making the data hop, you can call
DEALLOCATE_ALL_TO_ALL_SETUP to discard the memory and variables
associated with the operation.
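In outline, the calling pattern looks like this.  The argument lists
are left as "(...)" because they depend on the array involved (see
CMSSL.DOC for the exact interface), and the hop count NHOPS is a
hypothetical name used here only for illustration:

C     Plan the hops once: allocate memory and choose communication
C     paths that avoid data collisions.
      CALL ALL_TO_ALL_SETUP(...)
C     Each call moves the data one more hop.
      DO 10 IHOP=1,NHOPS
        CALL ALL_TO_ALL(...)
   10 CONTINUE
C     Release the memory and variables associated with the setup.
      CALL DEALLOCATE_ALL_TO_ALL_SETUP(...)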
In other cases, there is an internal state associated with a
subroutine.  In normal use, the internal state is hidden from the
user, but available to any routines that need to see it.  However, in
some cases, it may be important to "save" that internal state and
restore it later.  One obvious example is the random number
generators, where the internal state is the "seeds" residing at each
processor.  Another example is the QR factor/solve routines.
Ordinarily, the solve routine is called immediately after the factor
routine.  The CMSSL routines GEN_QR_FACTOR and GEN_QR_SOLVE
communicate "behind your back", through variables invisible to you.
However, this clever scheme breaks down if you call GEN_QR_FACTOR
twice, for two different matrices (of different shapes or types), and
only then call GEN_QR_SOLVE.  The information that the first
GEN_QR_FACTOR left for the corresponding SOLVE routine is destroyed
by the second GEN_QR_FACTOR call.  In order to preserve the
information that the factor routine wants to pass to the solve
routine, you need to use the SAVE_GEN_QR_TEMPS routine to save the
"invisible" information, and the RESTORE_GEN_QR_TEMPS routine to
restore it.  For instance:

    CALL GEN_QR_FACTOR(for matrix 1)
    CALL SAVE_GEN_QR_TEMPS(for matrix 1)
    CALL GEN_QR_FACTOR(for matrix 2)
    CALL SAVE_GEN_QR_TEMPS(for matrix 2)

    CALL RESTORE_GEN_QR_TEMPS(for matrix 1)
    CALL GEN_QR_SOLVE(for matrix 1)
    CALL RESTORE_GEN_QR_TEMPS(for matrix 2)
    CALL GEN_QR_SOLVE(for matrix 2)

6) Other features of the library

There are some ways in which the compiler and CMSSL interact to save
you trouble.  In particular, the following helpful things are done:

    * Automatic detection of array type (REAL*4 or REAL*8, INTEGER*4
      or INTEGER*8, COMPLEX).
    * Automatic detection of array shape.
    * Data layout and communication issues are usually hidden from
      the user.  (These problems are like the Cray's "memory bank
      conflicts", only more involved and more difficult to avoid.)

This means that you don't usually have to worry about the data type
of the array you are passing.  You usually don't have to worry about
telling CMSSL the actual physical size of your arrays, or the number
of dimensions.  And you don't have to worry about configuring your
arrays using LAYOUT commands.  Of course, the more you rely on these
features, the more difficult it is to write code that is portable, or
even partially portable, between the Connection Machine and other
computers.

However, there are some burdens that may go along with using the
CMSSL library:

    * Use of "include" files may be necessary.
    * The user must be sure the compiler knows which arrays are CM
      arrays.
    * For best performance, it may be necessary to specify the data
      layout.

In many of the examples of CMSSL use, you will see an INCLUDE
statement, which makes sure that certain functions are declared with
the correct type, and that certain symbolic constants are defined if
needed.  You may have to do this in your routines that call CMSSL;
if so, there will be a note in the documentation to remind you.

Moreover, the documentation for CMSSL will state that certain
arguments are expected to be CM arrays.  It is the user's
responsibility to make sure that such data items are recognized by
the compiler as CM arrays.  The easiest way to do this is to use some
FORTRAN 90 statement with the array, such as

    A=0.0
or
    FORALL (I=1:N) A(I,I)=1.0

If you want peak speed from the library, you are going to have to
worry about the LAYOUT command and declarations of NEWS and SERIAL
indices to guarantee good performance.
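For example, a minimal fragment that takes care of both the include
file and the CM array requirement might look like the following
sketch.  The include path is the one used by the LINSOL sample
program in section 10; the array name and dimensions are arbitrary:

      INCLUDE '/usr/include/cm/cmssl-cmf.h'
      REAL A(64,64)
C     An array assignment like this one is enough to make the
C     compiler treat A as a CM array...
      A=0.0
C     ...and so is a FORALL, such as this one setting the diagonal.
      FORALL (I=1:64) A(I,I)=1.0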
7) Multiple Instances

The Connection Machine can run faster because it can do many
operations at the same time.  However, it can only do this when it
knows that the operations being carried out are independent.  For
certain problems, such as the LU factoring (with pivoting) of a
matrix, even the best algorithms cannot arrange the computation to be
completely parallel.  Sometimes, however, maximum parallel execution
can be achieved because there are thousands of sets of data, all of
which have to have the same operations carried out on them.  Even if
the algorithm isn't parallel, the data is, in some sense.  An
algorithm that clearly benefits from this approach is any Monte Carlo
calculation, where the same calculations are applied to hundreds or
thousands of cases that differ only in their random starting
positions.

An example we can work through is matrix-vector multiplication.  The
standard problem is to compute y=A*x, where we would declare the
arrays

    DIMENSION Y(M), A(M,N), X(N)

and if we programmed in FORTRAN, we would write something like

      DO 10 I=1,M
        Y(I)=0.0
   10 CONTINUE
      DO 30 I=1,M
        DO 20 J=1,N
          Y(I)=Y(I)+A(I,J)*X(J)
   20   CONTINUE
   30 CONTINUE

However, suppose we had 200 sets of data to calculate.  This would be
equivalent to computing y(k)=A(k)*x(k), where we have an extra index
K running through all the copies of the matrix.  We could actually
declare the arrays many different ways, although one simple way would
be to add the extra index to the end of all of the original arrays:

    DIMENSION Y(M,L), A(M,N,L), X(N,L)

and our FORTRAN programming would be:

      DO 20 K=1,L
        DO 10 I=1,M
          Y(I,K)=0.0
   10   CONTINUE
   20 CONTINUE
      DO 50 K=1,L
        DO 40 I=1,M
          DO 30 J=1,N
            Y(I,K)=Y(I,K)+A(I,J,K)*X(J,K)
   30     CONTINUE
   40   CONTINUE
   50 CONTINUE

We are now looking at an example of "multiple instances" of
matrix-vector multiplication.  The fact that we have multiple
instances gives us an extra index to be concerned with, but it also
gives us an extra way to organize the calculation for maximum
efficiency.  If we were running this problem on the Cray, we would
want the innermost loop to be the one with the highest iteration
count.  Thus, our choice about how to solve this problem would depend
on the NUMBER of matrices, as well as the number of rows and columns.

In a similar way, if you are solving multiple instances of a problem
on the Connection Machine, it's important to let the algorithm you
are using know this, because the extra index gives it extra
opportunities to achieve good parallel performance.  For instance, if
our matrices were small (10 rows by 10 columns) but we had 1,000 of
them, we would certainly prefer that the code execute in parallel
over the index that counts the matrices.

Matrix multiplication is fully parallel.  If our algorithm weren't
fully parallel, however, it should also be clear that having many
items to operate on would give us a strong chance to run in parallel.
Think, for instance, of an ordinary differential equation.  Given the
solution at T=0, we are required to compute the value at the points
T=1, T=2 and so on.  But there is no way to begin the computation at
T=2 until the solution at T=1 is known.  So parallel processing
cannot be used in solving the ODE, even if we have hundreds of output
points.  But if we have many ODE's to solve at once, then we can in
fact work in parallel over the components.  Each processor is devoted
to a particular solution component: on the first step we compute all
the components at T=1, and so on.

With this in mind, many of the CMSSL library routines now have a more
elaborate set of input arguments.  Primarily, the new arguments
specify the "axes" to be used to define the data objects.  For
instance, in the matrix multiplication example, the array A(M,N,L)
has three indices.  Two of them, numbers 1 and 2, define the rows and
columns of a matrix, while the third defines WHICH matrix we are
interested in.  In general, a CMSSL routine will want to know which
indices define the matrix, or vector, or other data object.  The
routine will then assume that the other indices are all counters of
multiple instances of the data object, and will parallelize the
problem that many times.
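To see what the instance index buys us, note that in CM Fortran the
whole set of products can be written with array syntax, so that only
the loop over the matrix column index J remains serial; each pass
through it updates every row of every instance at once.  This is just
a sketch of the idea in plain CM Fortran, not the CMSSL interface
itself:

C     All L instances of y = A*x at once, using the declarations
C     DIMENSION Y(M,L), A(M,N,L), X(N,L) from above.  SPREAD copies
C     the row X(J,1:L) across the M rows so the product conforms
C     with Y.
      Y = 0.0
      DO 10 J=1,N
        Y = Y + A(1:M,J,1:L)*SPREAD(X(J,1:L),1,M)
   10 CONTINUE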
Routines with a built-in facility for handling multiple instances
include:

    FFT
    Matrix-matrix multiply
    Matrix-vector multiply
    Multiple shift
    Outer product
    QR factorization
    Random number generator
    Tridiagonal solver

Unfortunately, this new ability means that the routines have more
input arguments, and the documentation is more difficult to read.

8) Finding a parallel algorithm

It does no good to build parallel hardware if there is no parallel
software.  If important problems don't have a solution algorithm that
works in parallel, they won't get solved on the Connection Machine.
For this reason, many algorithms have been redesigned or
rediscovered, in order to take advantage of the parallel computing
capabilities available.

Here is one example of such an algorithm: the one employed in the
tridiagonal matrix solver, the CMSSL routine GEN_TRIDIAG_SOLVE.  The
algorithm is known as cyclic reduction.

Cyclic reduction is a method which can be applied to a tridiagonal
system of linear equations A*x=b.  Its primary benefit is that if the
processor can solve more than one equation at a time, the system can
be solved in LOG(N) steps rather than in N steps.  Bear in mind that
this means that 65,536 equations could be solved in 16 steps rather
than in 65,536.  In particular, speedups can be achieved on both
vector machines and parallel processing machines.

A tridiagonal system has some very special properties that allow us
to carry this operation out.  Let's suppose we start out with 7
equations, where the first three are:

    b11*x1 + c12*x2                    = y1
    a21*x1 + b22*x2 + c23*x3           = y2
             a32*x2 + b33*x3 + c34*x4  = y3

You should see how we could use the first equation to eliminate the
coefficient a21 in the second equation.  We could also use the third
equation to eliminate the coefficient c23 in the second equation.
This knocks the variables x1 and x3 out of the second equation, but
adds x4 into it.  By the same method, we can eliminate x3 and x5 from
the equation for x4, and so on.  If we do this, eliminating the odd
variables from the even equations, we end up with another tridiagonal
system with half the number of variables.  If we carry out the same
operation repeatedly, we eventually end up with a single equation in
a single variable.  In our example, which started with 7 equations,
this variable would be x4.  This we can solve!  We can plug x4 into
the two previous equations to get x2 and x6.  Then we plug those
values into the four previous equations for x1, x3, x5 and x7, and
we're done.

The only trick here is that by looking at the equations in the right
way, we can arrange them into LOG(N) groups, each of which we can
solve simultaneously.  If we really can solve them simultaneously,
this is much faster than, say, the sequential backsolution step in
Gauss elimination, where we shuffle the equations until we know what
x(n) is, and then figure out what x(n-1) is, and so on, one at a
time.
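Here is a minimal serial sketch of one level of the reduction,
assuming equation I is stored as A(I)*x(I-1) + B(I)*x(I) +
C(I)*x(I+1) = Y(I), with A(1) and C(N) zero.  Each iteration writes
only its own even-numbered entries, and otherwise reads only
odd-numbered ones, so the iterations are independent, which is
exactly what lets the CM carry them all out at once.  (The worked
example below scales its equations differently, but the result is
equivalent.)

C     One level of cyclic reduction: each even equation absorbs
C     its odd-numbered neighbors.
      DO 10 I=2,N-1,2
        T1=-A(I)/B(I-1)
        T2=-C(I)/B(I+1)
        B(I)=B(I)+T1*C(I-1)+T2*A(I+1)
        Y(I)=Y(I)+T1*Y(I-1)+T2*Y(I+1)
        A(I)=T1*A(I-1)
        C(I)=T2*C(I+1)
   10 CONTINUE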
For an example, consider the following system:

     4 -1  0  0  0  0  0     x1      2
    -1  4 -1  0  0  0  0     x2      4
     0 -1  4 -1  0  0  0     x3      6
     0  0 -1  4 -1  0  0     x4  =   8
     0  0  0 -1  4 -1  0     x5     10
     0  0  0  0 -1  4 -1     x6     12
     0  0  0  0  0 -1  4     x7     22

Let's look closely at the first three equations, to get started:

     4 x1 -   x2           = 2
    -  x1 + 4 x2 -   x3    = 4
         -    x2 + 4 x3 - x4 = 6

If we multiply equation 2 by 4 for convenience, it becomes

    - 4 x1 + 16 x2 - 4 x3 = 16

Now eliminate x1 by adding in equation 1, to get

    15 x2 - 4 x3 = 18

Now eliminate x3 by adding in equation 3, to get

    14 x2 - x4 = 24

Similar operations create the new system

    14 -1  0     x2     24
    -1 14 -1     x4  =  48
     0 -1 14     x6     80

and if we eliminate x2 and x6 from the equation for x4 (multiply the
middle equation by 14 and add the other two), we get

    194 x4 = 776

We solve for x4 = 4.  This allows us to solve the previous system to
get x2 = 2 and x6 = 6.  Finally, we solve for x1=1, x3=3, x5=5 and
x7=7.

If the number of equations is not of the form (2**m)-1 for some m,
then the simplest thing to do is increase the number of equations to
the next such value, adding dummy equations of the form x(k)=0.

For this method to be guaranteed to work, the matrix should be
diagonally dominant.  (The size of the diagonal coefficient should be
at least as large as the sum of the sizes of the off-diagonal
coefficients in each row.)

Note that this method can also be used to solve a first order
recurrence

    x(i) = a(i) + b(i)*x(i-1)

where the a(i) and b(i) are known, x(0) is given, and x(1) through
x(n) are to be computed.  The method of cyclic reduction provides
such a powerful speedup for this problem that even the Cray now
provides such a routine, the SCILIB routine TRID.

9) How to write a program that uses CMSSL

Let's look at the steps required to use a routine from the CMSSL
library to solve a simple problem.  Say we want to solve the
following set of linear equations:

    2 x + 3 y + 2 z = 14        ( 2 3 2 )   (x)   (14)
    4 x +   y + 3 z = 15   or   ( 4 1 3 ) * (y) = (15)   or   A*x = b
    5 x + 2 y +   z = 12        ( 5 2 1 )   (z)   (12)

Here the matrix A and right hand side vector b are known, and the
solution vector x is to be found.  The algorithm generally employed
here is Gauss elimination.  The CMSSL library uses a variant of this
algorithm, known as Gauss-Jordan elimination.  The routine is
GEN_GJ_SOLVE.

Here is a skeleton outline of the program we will need:

    INCLUDE something-or-other
    Declare arrays
    Signal the compiler which arrays are CM arrays
    Initialize the arrays
    Call GEN_GJ_SOLVE
    Print results

10) LINSOL: A sample program calling CMSSL

      PARAMETER (N=3)
      PARAMETER (NRHS=1)
      INCLUDE '/usr/include/cm/cmssl-cmf.h'
      REAL A(N,N)
      REAL B(N,NRHS)
      DOUBLE PRECISION DPIV
C
      A=0.0
      B=0.0
C
      A(1,1)=2.0
      A(1,2)=3.0
      A(1,3)=2.0
C
      A(2,1)=4.0
      A(2,2)=1.0
      A(2,3)=3.0
C
      A(3,1)=5.0
      A(3,2)=2.0
      A(3,3)=1.0
C
      B(1,1)=14.0
      B(2,1)=15.0
      B(3,1)=12.0
C
      IPIV=0
      DPIV=GEN_GJ_SOLVE(A,B,N,NRHS,IPIV,IERROR)
      IF(IERROR.NE.0)THEN
        WRITE(*,*)'GEN_GJ_SOLVE returned IERROR=',IERROR
        STOP
      ENDIF
C
      WRITE(*,*)' '
      WRITE(*,*)'The solution vector is:'
      WRITE(*,*)' '
      WRITE(*,'(1X,3G14.6)')(B(I,1),I=1,N)
      STOP
      END

11) How to compile and run LINSOL with CMSSL

A file very much like this one is available online in the CMSSL
examples directory, under the filename "linsol.fcm".  To get a copy
of this example, or other CMSSL examples, log in to the ULTRIX front
ends, type "examples", and then specify that the machine you're
interested in is a "cm2" and that you want the examples for "cmssl".

Here's how we would use this file with CMSSL.

Login to CMVAX.
Compile the program, linking in CMSSL, to create an executable:

    cmf linsol.fcm -noslicewise -lcmssl
or
    cmf linsol.fcm -slicewise -lcmssl-s

Run the program:

    cmattach -w a.out

12) Timings

In order to measure the efficiency of CMSSL routines, it is necessary
to do timings.  On the Cray, this is done by calling the SECOND
routine, which returns the current total elapsed CPU time.  On the
CM, there is a set of routines that must be employed; for information
on these routines, refer to the document "cm_timer.doc".  I haven't
had the time to pursue this topic in detail.  However, I've used
these timing routines to make a couple of simple tests, just to give
you a feel for the power of CMSSL.  In all cases, one sequencer of
8,192 processors was used.

For these tests, I computed the product of two square matrices.  I
used both 32 and 64 bit arithmetic and two different sizes, and ran
the program on the Cray as well.  Here are the results:

    Machine   Arithmetic   Size   CPU Time   Routine used

    CM2       32 bit        512      1       GEN_MATRIX_MULT
    CM2       64 bit        512      2       GEN_MATRIX_MULT
    YMP       64 bit        512      0.9     MXM

    CM2       32 bit       2048     22       GEN_MATRIX_MULT
    CM2       64 bit       2048     37       GEN_MATRIX_MULT
    YMP       64 bit       2048     55       MXM

For the 512 by 512 problem, the CM2 and Cray times are comparable.
For the 2048 by 2048 problem, however, the CM2 is doing better.
Notice, moreover, the growth pattern.  Matrix multiplication of two N
by N matrices requires (roughly) N**2 space and N**3 operations.
This means that if we make N four times larger, we will need 16 times
as much space, and there will be 64 times as much work.  The time for
the Cray did indeed go up by a factor of about 64, but the CM2 time
went up by more like a factor of 16.  This suggests that the
algorithm is indeed executing in parallel, and that we will continue
to see increasingly better results on the CM2 up to the point where
the machine is saturated.  (On top of that, the YMP job for the 2048
by 2048 case had to go into a special queue because it was asking for
about 12 Megawords of memory, and it took a while for the job to
run.)

13) Documentation

* There is a brief "man" page available for CMSSL, provided by the
PSC.  It describes CMSSL, and lists where the document, examples, and
other items are kept, as well as how to include the library at
compile time.  You can see this item by typing "man cmssl" on any PSC
machine.

* There is a document, CMSSL.DOC, available online.  This includes
the manual page, followed by detailed descriptions of every routine
in the library.  For simple applications, this may be enough
information to write a program that uses the library.  This document
describes only the FORTRAN interface, however.  The document is
available in the file /usr/local/doc/cmssl.doc on UNIX systems, and
in PSCDOC:CMSSL.DOC on VMS.

* There are SIX sets of examples online.  The "CMSSL" examples have
been prepared at the PSC.  They are rewritten versions of TMC
examples: comments were inserted, PRINT statements added, and in some
cases more interesting problems were used.  The other directories of
examples were prepared by TMC.  They include examples of the various
languages, of multiple instances, and of more exotic LAYOUT commands.
On the other hand, you may find some of these examples terse,
confusing, and difficult to use as the starting point for your own
program.  The example directories available include:

    cmssl.6.c-paris
    cmssl.6.fortran
    cmssl.6.fortran-paris
    cmssl.6.lisp-paris
    cmssl.6.starlisp

* Thinking Machines Corporation sponsors the CMSSL communications
newsletter.
The first issue has come out, describing the library, listing new
technical reports available, and taking the tridiagonal solver as a
case study for timing data.  Get a free subscription by writing to:

    Laura Ellis
    Thinking Machines Corporation
    245 First Street
    Cambridge, Massachusetts 02142

* Thinking Machines Corporation has published three manuals about the
library: "CMSSL for FORTRAN", "CMSSL for LISP" and "CMSSL for Paris".

14) Exercises

14.1) Login

This is always an exercise in itself!  If possible, login to the UNIX
front ends (pscuxa, pscuxb or pscuxc).  If all else fails, login to
cpwsca or cpwscb, our VMS front ends.  Be sure to have your account
and password information available!  If you change your password,
don't use more than 8 alphanumeric characters!

14.2) Get help and documentation on CMSSL

Use the "man cmssl" command.  Locate the CMSSL online document, which
is in /usr/local/doc/cmssl.doc on UNIX systems, and in
PSCDOC:CMSSL.DOC on VMS.  Copy it to your own directory, and
determine the name of the routine that carries out sparse
matrix-vector multiplication.

14.3) Run the EXAMPLES program

For this exercise, you should be on the UNIX front ends.  The PSC has
provided a command called EXAMPLES.  This command helps you examine
and copy any of the examples of program usage for our machines and
software that we have cataloged.  To run the program, simply type
"examples", preferably on the ULTRIX front ends.  The program will
display a menu.  Specify that the machine you are interested in is
the "cm2", then that the package you are interested in is "cmssl".
Finally, tell it that you want to type or copy some of the files in
that directory to your current directory.  Try to copy all the files
of the form "invert.*".  You should find that two files have been
copied.

14.4) Run an example of CMSSL usage

This exercise requires that you be logged in on CMVAX itself.  You
will also need the file "invert.fcm" from the CMSSL examples
directory, as copied in the previous exercise.  Type the following
commands:

    cmf invert.fcm -lcmssl
    cmattach -w a.out > invert.out

These commands should compile your program, including the CMSSL
library, then wait for a free sequencer on the CM, attach to it, run
the executable, and store the output in the file "invert.out".  Use
the command

    more invert.out

to browse through the output, and make sure it looks OK.

14.5) Run the timing example

Get the file "timing.fcm" from the "cm_timer" example directory.  Run
it the way you ran the program in the previous exercise:

    cmf timing.fcm -lcmssl
    cmattach -w a.out > timing.out

The reason you still get output on the screen is that the ">" sign
redirects only "standard output"; you also have to redirect the
"standard error" output to the file.  Repeat the step using the
revised command:

    cmattach -w a.out >& timing.out

(You may have to delete or rename the old copy of "timing.out".)  Now
all the output should go to the file.

This example multiplies two 512 by 512 matrices, once using 32 bit
and once using 64 bit arithmetic.  To increase the problem size, you
simply have to change the two lines that read

    PARAMETER (N=512)

to another value.  Try setting N to 1024, doubling the order of the
matrices, and see how the execution time is affected.  Make a guess
as to the result of using N=2048, and see if you're right.

14.6) Write your own application

See if you understand how to create and use a program that calls
CMSSL.  Write a simple program that computes a matrix*vector product.
The problem to solve is

    ( 1 2 3 )   ( 1 )   ( 14 )
    ( 4 5 6 ) * ( 2 ) = ( 32 )
    ( 7 8 9 )   ( 3 )   ( 50 )

Figure out the CMSSL routine you will need by looking through
CMSSL.DOC.  Write the calling program.  (DON'T copy the example file.
Try to do this one yourself!)  The formula for A(I,J) is
A(I,J) = J + (I-1)*3, which means you can show off and use the CM
Fortran statement

    FORALL (I=1:3, J=1:3) A(I,J)=J+(I-1)*3

to initialize the matrix!  Be sure to print the result vector as
confirmation when you're done!  When you're done, or if you find
yourself getting stuck, refer to the CMSSL example "matvec.fcm".