CMSSL.TALK                                         16 September 1991

Introduction to the Connection Machine Scientific Software Library

 1) Why was CMSSL created?
 2) How was CMSSL designed?
 3) The contents of the library
 4) CMSSL routine naming conventions
 5) Setup/deallocate and save/restore features
 6) Other features of the library
 7) Multiple Instances
 8) Finding a parallel algorithm
 9) How to write a program that uses CMSSL
10) LINSOL: A sample program calling CMSSL
11) How to compile and run LINSOL with CMSSL
12) Timings
13) Documentation
14) Exercises
      Login
      Get help and documentation on CMSSL
      Run the EXAMPLES program
      Run an example of CMSSL usage
      Run the timing example
      Write your own application

1) Why was CMSSL created?

CMSSL was created to make the Connection Machine easier to use, by
providing correct and efficient code.

User programs are not easy to port to the CM2; code generally must be
written from scratch.  Having a library of common routines means
there's less to create.

It is easy to use the CM2 incorrectly, so that a program that could
run in parallel instead runs sequentially, and hence slowly and
inefficiently.  Library routines can be written so that they are
guaranteed to run in parallel.

TMC hopes to set a standard for parallel processing libraries.  They
have been developing a linear algebra kernel, and publishing a great
many papers about their algorithms, in the hope of influencing the
design of other systems.  If you are familiar with the benefits of
the work done on the BLAS and LINPACK libraries, you will see why
this is an important goal.

2) How was CMSSL designed?

The designers of the library emphasized the following points:

The library must be callable from all CM2 languages:
    Standard languages with the Paris interface:
        C/Paris, FORTRAN/Paris, LISP/Paris
    CM languages:
        C* (not available!), CM Fortran, *LISP

The library must be usable on all CM hardware:
    CM1, CM2, future CM3; 32 bit or 64 bit; with or without Weitek
    floating point processors; fieldwise or slicewise.

The library contains optimal parallel algorithms for solving problems.

3) The contents of the library

CMSSL has routines for the following tasks:

    Broadcast of data from CM arrays to all processors;
    Fast Fourier transforms (powers of 2 only);
    Histogramming (data counting) routines;
    Linear equation solution, square matrix A, A*x=b:
        for a dense matrix A, using the Gauss-Jordan or QR algorithm;
        for a triangular matrix A;
        for a tridiagonal matrix A, using cyclic reduction;
    Matrix inversion;
    Matrix-matrix multiplication A*B;
    Matrix-vector multiplication, A*x or x*A;
    Multiple shifts, using NEWS communication where possible;
    Outer product of two vectors, A=x*transpose(y);
    QR factorization and solution of rectangular linear systems;
    Random number generation (uniform);
    Sparse matrix block operations;
    Sparse matrix gather/scatter;
    Sparse matrix vector product.

This comes to roughly 50 routines.  By contrast, the Cray SCILIB has
something like 500 routines, including the standard BLAS, LINPACK,
EISPACK and LAPACK libraries.

Routines and topics that have been mentioned for future inclusion:

    Basic Linear Algebra Subprograms (BLAS);
    Eigenvalue routine (using the Jacobi method);
    Singular Value Decomposition;
    Banded linear equation solver;
    Iterative linear equation solver;
    Lanczos eigenvalue routine;
    General FFT routine.
Useful topics that haven't been considered at all include:

    Finite elements;
    Interpolation (fitting curves to data);
    Nonlinear algebraic equations;
    Ordinary differential equations;
    Quadrature (approximate integration);
    Special functions.

4) CMSSL routine naming conventions

CMSSL routines have long names.  Where matrices occur, the LAPACK
naming convention is used:

    GEN  for a general matrix
    HEP  for a hermitian packed matrix
    HER  for a hermitian matrix
    SYM  for a symmetric matrix
    SYP  for a symmetric packed matrix

Where matrix operations occur, the following words may occur as part
of the name of the routine:

    FACTOR  if the routine factors the matrix (LU or QR);
    INVERT  if the routine computes the inverse matrix;
    MULT    if the routine multiplies the matrix by a vector or matrix;
    SOLVE   if the routine solves a related linear system.

Sample CMSSL routine names include:

    GEN_GJ_INVERT        Invert a general matrix, Gauss-Jordan algorithm
    GEN_TRIANG_SOLVE     Solve a general triangular linear system
    INITIALIZE_FAST_RNG  Initialize the fast random number generator

The CMSSL FORTRAN interface allows the user to call the same routine
name for all data types that the routine can handle.  However, a user
may also use the CMSSL Paris interface, from a Paris program, a C
program, or even from FORTRAN.  In that case, slightly different
versions of the same routine are offered, depending on the data type.
In particular, a FORTRAN program might use this call:

    CALL GEN_MATRIX_VECTOR_MULT(...)

for REAL, COMPLEX or DOUBLE PRECISION data, because there is a single
FORTRAN interface to the following Paris routines:

    CMSSL_f_gen_matrix_mult_a_1L(...)   REAL or DOUBLE PRECISION data
    CMSSL_c_gen_matrix_mult_a_1L(...)   COMPLEX data

In languages other than FORTRAN, users would call the appropriate
Paris version directly.

5) Setup/deallocate and save/restore features

When you call a CMSSL routine, a lot of things happen "behind your
back".  Some routines, for instance, have to allocate memory; create,
lay out, and initialize data; choose communications paths to the
individual processors; and so on.  In many cases, the use of a
subroutine requires that a "setup" happen once, after which the
actual subroutine can be called repeatedly, and after which the
memory allocated for the subroutine may be safely released.  In such
cases the CMSSL library typically provides a SETUP routine and a
DEALLOCATE SETUP routine.

For instance, the ALL_TO_ALL routine is used to make the data in an
array "hop" around through every possible position.  Before calling
ALL_TO_ALL, you must call ALL_TO_ALL_SETUP.  Among other things, the
SETUP routine figures out how many hops will be needed, and how they
should be arranged to avoid data collisions.  After setup, you may
call ALL_TO_ALL repeatedly, with each call causing another hop.  When
you're done making the data hop, you can call
DEALLOCATE_ALL_TO_ALL_SETUP to discard the memory and variables
associated with the operation.
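In outline, the calling pattern looks like this.  The argument lists
are left as "(...)" because they depend on the array involved (see
CMSSL.DOC for the exact interface), and the hop count NHOPS is a
hypothetical name used here only for illustration:

C     Plan the hops once: allocate memory and choose communication
C     paths that avoid data collisions.
      CALL ALL_TO_ALL_SETUP(...)
C     Each call moves the data one more hop.
      DO 10 IHOP=1,NHOPS
        CALL ALL_TO_ALL(...)
   10 CONTINUE
C     Release the memory and variables associated with the setup.
      CALL DEALLOCATE_ALL_TO_ALL_SETUP(...)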
In other cases, there is an internal state associated with a
subroutine.  In normal use, the internal state is hidden from the
user, but available to any routines that need to see it.  However, in
some cases, it may be important to "save" that internal state and
restore it later.  One obvious example is the random number
generators, where the internal state is the "seeds" residing at each
processor.  Another example is the QR factor/solve routines.
Ordinarily, the solve routine is called immediately after the factor
routine.  The CMSSL routines GEN_QR_FACTOR and GEN_QR_SOLVE
communicate "behind your back", through variables invisible to you.
However, this clever scheme breaks down if you call GEN_QR_FACTOR
twice, for two different matrices (of different shapes or types), and
only then call GEN_QR_SOLVE.  The information that the first
GEN_QR_FACTOR left for the corresponding SOLVE routine is destroyed
by the second GEN_QR_FACTOR call.  In order to preserve the
information that the factor routine wants to pass to the solve
routine, you need to use the SAVE_GEN_QR_TEMPS routine to save the
"invisible" information, and the RESTORE_GEN_QR_TEMPS routine to
restore it.  For instance:

    CALL GEN_QR_FACTOR(for matrix 1)
    CALL SAVE_GEN_QR_TEMPS(for matrix 1)
    CALL GEN_QR_FACTOR(for matrix 2)
    CALL SAVE_GEN_QR_TEMPS(for matrix 2)

    CALL RESTORE_GEN_QR_TEMPS(for matrix 1)
    CALL GEN_QR_SOLVE(for matrix 1)
    CALL RESTORE_GEN_QR_TEMPS(for matrix 2)
    CALL GEN_QR_SOLVE(for matrix 2)

6) Other features of the library

There are some ways in which the compiler and CMSSL interact to save
you trouble.  In particular, the following helpful things are done:

    * Automatic detection of array type (REAL*4 or REAL*8, INTEGER*4
      or INTEGER*8, COMPLEX).
    * Automatic detection of array shape.
    * Data layout and communication issues are usually hidden from
      the user.  (These problems are like the Cray's "memory bank
      conflicts", only more involved and more difficult to avoid.)

This means that you don't usually have to worry about the data type
of the array you are passing.  You usually don't have to worry about
telling CMSSL the actual physical size of your arrays, or the number
of dimensions.  And you don't have to worry about configuring your
arrays using LAYOUT commands.  Of course, the more you rely on these
features, the more difficult it is to write code that is portable, or
even partially portable, between the Connection Machine and other
computers.

However, there are some burdens that may go along with using the
CMSSL library:

    * Use of "include" files may be necessary.
    * The user must be sure the compiler knows which arrays are CM
      arrays.
    * For best performance, it may be necessary to specify the data
      layout.

In many of the examples of CMSSL use, you will see an INCLUDE
statement, which makes sure that certain functions are declared with
the correct type, and that certain symbolic constants are defined if
needed.  You may have to do this in your routines that call CMSSL;
if so, there will be a note in the documentation to remind you.

Moreover, the documentation for CMSSL will state that certain
arguments are expected to be CM arrays.  It is the user's
responsibility to make sure that such data items are recognized by
the compiler as CM arrays.  The easiest way to do this is to use some
FORTRAN 90 statement with the array, such as

    A=0.0
or
    FORALL (I=1:N) A(I,I)=1.0

If you want peak speed from the library, you are going to have to
worry about the LAYOUT command and declarations of NEWS and SERIAL
indices to guarantee good performance.
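For example, a minimal fragment that takes care of both the include
file and the CM array requirement might look like the following
sketch.  The include path is the one used by the LINSOL sample
program in section 10; the array name and dimensions are arbitrary:

      INCLUDE '/usr/include/cm/cmssl-cmf.h'
      REAL A(64,64)
C     An array assignment like this one is enough to make the
C     compiler treat A as a CM array...
      A=0.0
C     ...and so is a FORALL, such as this one setting the diagonal.
      FORALL (I=1:64) A(I,I)=1.0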
7) Multiple Instances

The Connection Machine can run faster because it can do many
operations at the same time.  However, it can only do this when it
knows that the operations being carried out are independent.  For
certain problems, such as the LU factoring (with pivoting) of a
matrix, even the best algorithms cannot arrange the computation to be
completely parallel.  Sometimes, however, maximum parallel execution
can be achieved because there are thousands of sets of data, all of
which have to have the same operations carried out on them.  Even if
the algorithm isn't parallel, the data is, in some sense.  An
algorithm that clearly benefits from this approach is any Monte Carlo
calculation, where the same calculations are applied to hundreds or
thousands of cases that differ only in their random starting
positions.

An example we can work through is matrix-vector multiplication.  The
standard problem is to compute y=A*x, where we would declare the
arrays

    DIMENSION Y(M), A(M,N), X(N)

and if we programmed in FORTRAN, we would write something like

      DO 10 I=1,M
        Y(I)=0.0
   10 CONTINUE
      DO 30 I=1,M
        DO 20 J=1,N
          Y(I)=Y(I)+A(I,J)*X(J)
   20   CONTINUE
   30 CONTINUE

However, suppose we had 200 sets of data to calculate.  This would be
equivalent to computing y(k)=A(k)*x(k), where we have an extra index
K running through all the copies of the matrix.  We could actually
declare the arrays many different ways, although one simple way would
be to add the extra index to the end of all of the original arrays:

    DIMENSION Y(M,L), A(M,N,L), X(N,L)

and our FORTRAN programming would be:

      DO 20 K=1,L
        DO 10 I=1,M
          Y(I,K)=0.0
   10   CONTINUE
   20 CONTINUE
      DO 50 K=1,L
        DO 40 I=1,M
          DO 30 J=1,N
            Y(I,K)=Y(I,K)+A(I,J,K)*X(J,K)
   30     CONTINUE
   40   CONTINUE
   50 CONTINUE

We are now looking at an example of "multiple instances" of
matrix-vector multiplication.  The fact that we have multiple
instances gives us an extra index to be concerned with, but it also
gives us an extra way to organize the calculation for maximum
efficiency.  If we were running this problem on the Cray, we would
want the innermost loop to be the one with the highest iteration
count.  Thus, our choice about how to solve this problem would depend
on the NUMBER of matrices, as well as the number of rows and columns.

In a similar way, if you are solving multiple instances of a problem
on the Connection Machine, it's important to let the algorithm you
are using know this, because the extra index gives it extra
opportunities to achieve good parallel performance.  For instance, if
our matrices were small (10 rows by 10 columns) but we had 1,000 of
them, we would certainly prefer that the code execute in parallel
over the index that counts the matrices.

Matrix multiplication is fully parallel.  If our algorithm weren't
fully parallel, however, it should also be clear that having many
items to operate on would give us a strong chance to run in parallel.
Think, for instance, of an ordinary differential equation.  Given the
solution at T=0, we are required to compute the value at the points
T=1, T=2 and so on.  But there is no way to begin the computation at
T=2 until the solution at T=1 is known.  So parallel processing
cannot be used in solving the ODE, even if we have hundreds of output
points.  But if we have many ODE's to solve at once, then we can in
fact work in parallel over the components.  Each processor is devoted
to a particular solution component: on the first step we compute all
the components at T=1, and so on.

With this in mind, many of the CMSSL library routines now have a more
elaborate set of input arguments.  Primarily, the new arguments
specify the "axes" to be used to define the data objects.  For
instance, in the matrix multiplication example, the array A(M,N,L)
has three indices.  Two of them, numbers 1 and 2, define the rows and
columns of a matrix, while the third defines WHICH matrix we are
interested in.  In general, a CMSSL routine will want to know which
indices define the matrix, or vector, or other data object.  The
routine will then assume that the other indices are all counters of
multiple instances of the data object, and will parallelize the
problem that many times.
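To see what the instance index buys us, note that in CM Fortran the
whole set of products can be written with array syntax, so that only
the loop over the matrix column index J remains serial; each pass
through it updates every row of every instance at once.  This is just
a sketch of the idea in plain CM Fortran, not the CMSSL interface
itself:

C     All L instances of y = A*x at once, using the declarations
C     DIMENSION Y(M,L), A(M,N,L), X(N,L) from above.  SPREAD copies
C     the row X(J,1:L) across the M rows so the product conforms
C     with Y.
      Y = 0.0
      DO 10 J=1,N
        Y = Y + A(1:M,J,1:L)*SPREAD(X(J,1:L),1,M)
   10 CONTINUE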
Routines with a built-in facility for handling multiple instances
include:

    FFT
    Matrix-matrix multiply
    Matrix-vector multiply
    Multiple shift
    Outer product
    QR factorization
    Random number generator
    Tridiagonal solver

Unfortunately, this new ability means that the routines have more
input arguments, and the documentation is more difficult to read.

8) Finding a parallel algorithm

It does no good to build parallel hardware if there is no parallel
software.  If important problems don't have a solution algorithm that
works in parallel, they won't get solved on the Connection Machine.
For this reason, many algorithms have been redesigned or
rediscovered, in order to take advantage of the parallel computing
capabilities available.

Here is one example of such an algorithm: the one employed in the
tridiagonal matrix solver, the CMSSL routine GEN_TRIDIAG_SOLVE.  The
algorithm is known as cyclic reduction.

Cyclic reduction is a method which can be applied to a tridiagonal
system of linear equations A*x=b.  Its primary benefit is that if the
processor can solve more than one equation at a time, the system can
be solved in LOG(N) steps rather than in N steps.  Bear in mind that
this means that 65,536 equations could be solved in 16 steps rather
than in 65,536.  In particular, speedups can be achieved on both
vector machines and parallel processing machines.

A tridiagonal system has some very special properties that allow us
to carry this operation out.  Let's suppose we start out with 7
equations, where the first three are:

    b11*x1 + c12*x2                    = y1
    a21*x1 + b22*x2 + c23*x3           = y2
             a32*x2 + b33*x3 + c34*x4  = y3

You should see how we could use the first equation to eliminate the
coefficient a21 in the second equation.  We could also use the third
equation to eliminate the coefficient c23 in the second equation.
This knocks the variables x1 and x3 out of the second equation, but
adds x4 into it.  By the same method, we can eliminate x3 and x5 from
the equation for x4, and so on.  If we do this, eliminating the odd
variables from the even equations, we end up with another tridiagonal
system with half the number of variables.  If we carry out the same
operation repeatedly, we eventually end up with a single equation in
a single variable.  In our example, which started with 7 equations,
this variable would be x4.  This we can solve!  We can plug x4 into
the two previous equations to get x2 and x6.  Then we plug those
values into the four previous equations for x1, x3, x5 and x7, and
we're done.

The only trick here is that by looking at the equations in the right
way, we can arrange them into LOG(N) groups, each of which we can
solve simultaneously.  If we really can solve them simultaneously,
this is much faster than, say, the sequential backsolution step in
Gauss elimination, where we shuffle the equations until we know what
x(n) is, and then figure out what x(n-1) is, and so on, one at a
time.
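Here is a minimal serial sketch of one level of the reduction,
assuming equation I is stored as A(I)*x(I-1) + B(I)*x(I) +
C(I)*x(I+1) = Y(I), with A(1) and C(N) zero.  Each iteration writes
only its own even-numbered entries, and otherwise reads only
odd-numbered ones, so the iterations are independent, which is
exactly what lets the CM carry them all out at once.  (The worked
example below scales its equations differently, but the result is
equivalent.)

C     One level of cyclic reduction: each even equation absorbs
C     its odd-numbered neighbors.
      DO 10 I=2,N-1,2
        T1=-A(I)/B(I-1)
        T2=-C(I)/B(I+1)
        B(I)=B(I)+T1*C(I-1)+T2*A(I+1)
        Y(I)=Y(I)+T1*Y(I-1)+T2*Y(I+1)
        A(I)=T1*A(I-1)
        C(I)=T2*C(I+1)
   10 CONTINUE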
For an example, consider the following system:

     4 -1  0  0  0  0  0     x1      2
    -1  4 -1  0  0  0  0     x2      4
     0 -1  4 -1  0  0  0     x3      6
     0  0 -1  4 -1  0  0     x4  =   8
     0  0  0 -1  4 -1  0     x5     10
     0  0  0  0 -1  4 -1     x6     12
     0  0  0  0  0 -1  4     x7     22

Let's look closely at the first three equations, to get started:

     4 x1 -   x2           = 2
    -  x1 + 4 x2 -   x3    = 4
         -    x2 + 4 x3 - x4 = 6

If we multiply equation 2 by 4 for convenience, it becomes

    - 4 x1 + 16 x2 - 4 x3 = 16

Now eliminate x1 by adding in equation 1, to get

    15 x2 - 4 x3 = 18

Now eliminate x3 by adding in equation 3, to get

    14 x2 - x4 = 24

Similar operations create the new system

    14 -1  0     x2     24
    -1 14 -1     x4  =  48
     0 -1 14     x6     80

and if we eliminate x2 and x6 from the equation for x4 (multiply the
middle equation by 14 and add the other two), we get

    194 x4 = 776

We solve for x4 = 4.  This allows us to solve the previous system to
get x2 = 2 and x6 = 6.  Finally, we solve for x1=1, x3=3, x5=5 and
x7=7.

If the number of equations is not of the form (2**m)-1 for some m,
then the simplest thing to do is increase the number of equations to
the next such value, adding dummy equations of the form x(k)=0.

For this method to be guaranteed to work, the matrix should be
diagonally dominant.  (The size of the diagonal coefficient should be
at least as large as the sum of the sizes of the off-diagonal
coefficients in each row.)

Note that this method can also be used to solve a first order
recurrence

    x(i) = a(i) + b(i)*x(i-1)

where the a(i) and b(i) are known, x(0) is given, and x(1) through
x(n) are to be computed.  The method of cyclic reduction provides
such a powerful speedup for this problem that even the Cray now
provides such a routine, the SCILIB routine TRID.

9) How to write a program that uses CMSSL

Let's look at the steps required to use a routine from the CMSSL
library to solve a simple problem.  Say we want to solve the
following set of linear equations:

    2 x + 3 y + 2 z = 14        ( 2 3 2 )   (x)   (14)
    4 x +   y + 3 z = 15   or   ( 4 1 3 ) * (y) = (15)   or   A*x = b
    5 x + 2 y +   z = 12        ( 5 2 1 )   (z)   (12)

Here the matrix A and right hand side vector b are known, and the
solution vector x is to be found.  The algorithm generally employed
here is Gauss elimination.  The CMSSL library uses a variant of this
algorithm, known as Gauss-Jordan elimination.  The routine is
GEN_GJ_SOLVE.

Here is a skeleton outline of the program we will need:

    INCLUDE something-or-other
    Declare arrays
    Signal the compiler which arrays are CM arrays
    Initialize the arrays
    Call GEN_GJ_SOLVE
    Print results

10) LINSOL: A sample program calling CMSSL

      PARAMETER (N=3)
      PARAMETER (NRHS=1)
      INCLUDE '/usr/include/cm/cmssl-cmf.h'
      REAL A(N,N)
      REAL B(N,NRHS)
      DOUBLE PRECISION DPIV
C
      A=0.0
      B=0.0
C
      A(1,1)=2.0
      A(1,2)=3.0
      A(1,3)=2.0
C
      A(2,1)=4.0
      A(2,2)=1.0
      A(2,3)=3.0
C
      A(3,1)=5.0
      A(3,2)=2.0
      A(3,3)=1.0
C
      B(1,1)=14.0
      B(2,1)=15.0
      B(3,1)=12.0
C
      IPIV=0
      DPIV=GEN_GJ_SOLVE(A,B,N,NRHS,IPIV,IERROR)
      IF(IERROR.NE.0)THEN
        WRITE(*,*)'GEN_GJ_SOLVE returned IERROR=',IERROR
        STOP
      ENDIF
C
      WRITE(*,*)' '
      WRITE(*,*)'The solution vector is:'
      WRITE(*,*)' '
      WRITE(*,'(1X,3G14.6)')(B(I,1),I=1,N)
      STOP
      END

11) How to compile and run LINSOL with CMSSL

A file very much like this one is available online in the CMSSL
examples directory, under the filename "linsol.fcm".  To get a copy
of this example, or other CMSSL examples, log in to the ULTRIX front
ends, type "examples", and then specify that the machine you're
interested in is a "cm2" and that you want the examples for "cmssl".

Here's how we would use this file with CMSSL.

Login to CMVAX.
Compile the program, linking in CMSSL, to create an executable:

    cmf linsol.fcm -noslicewise -lcmssl
or
    cmf linsol.fcm -slicewise -lcmssl-s

Run the program:

    cmattach -w a.out

12) Timings

In order to measure the efficiency of CMSSL routines, it is necessary
to do timings.  On the Cray, this is done by calling the SECOND
routine, which returns the current total elapsed CPU time.  On the
CM, there is a set of routines that must be employed; for information
on these routines, refer to the document "cm_timer.doc".  I haven't
had the time to pursue this topic in detail.  However, I've used
these timing routines to make a couple of simple tests, just to give
you a feel for the power of CMSSL.  In all cases, one sequencer of
8,192 processors was used.

For these tests, I computed the product of two square matrices.  I
used both 32 and 64 bit arithmetic and two different sizes, and ran
the program on the Cray as well.  Here are the results:

    Machine   Arithmetic   Size   CPU Time   Routine used

    CM2       32 bit        512      1       GEN_MATRIX_MULT
    CM2       64 bit        512      2       GEN_MATRIX_MULT
    YMP       64 bit        512      0.9     MXM

    CM2       32 bit       2048     22       GEN_MATRIX_MULT
    CM2       64 bit       2048     37       GEN_MATRIX_MULT
    YMP       64 bit       2048     55       MXM

For the 512 by 512 problem, the CM2 and Cray times are comparable.
For the 2048 by 2048 problem, however, the CM2 is doing better.
Notice, moreover, the growth pattern.  Matrix multiplication of two N
by N matrices requires (roughly) N**2 space and N**3 operations.
This means that if we make N four times larger, we will need 16 times
as much space, and there will be 64 times as much work.  The time for
the Cray did indeed go up by a factor of about 64, but the CM2 time
went up by more like a factor of 16.  This suggests that the
algorithm is indeed executing in parallel, and that we will continue
to see increasingly better results on the CM2 up to the point where
the machine is saturated.  (On top of that, the YMP job for the 2048
by 2048 case had to go into a special queue because it was asking for
about 12 Megawords of memory, and it took a while for the job to
run.)

13) Documentation

* There is a brief "man" page available for CMSSL, provided by the
PSC.  It describes CMSSL, and lists where the document, examples, and
other items are kept, as well as how to include the library at
compile time.  You can see this item by typing "man cmssl" on any PSC
machine.

* There is a document, CMSSL.DOC, available online.  This includes
the manual page, followed by detailed descriptions of every routine
in the library.  For simple applications, this may be enough
information to write a program that uses the library.  This document
describes only the FORTRAN interface, however.  The document is
available in the file /usr/local/doc/cmssl.doc on UNIX systems, and
in PSCDOC:CMSSL.DOC on VMS.

* There are SIX sets of examples online.  The "CMSSL" examples have
been prepared at the PSC.  They are rewritten versions of TMC
examples: comments were inserted, PRINT statements added, and in some
cases more interesting problems were used.  The other directories of
examples were prepared by TMC.  They include examples of the various
languages, of multiple instances, and of more exotic LAYOUT commands.
On the other hand, you may find some of these examples terse,
confusing, and difficult to use as the starting point for your own
program.  The example directories available include:

    cmssl.6.c-paris
    cmssl.6.fortran
    cmssl.6.fortran-paris
    cmssl.6.lisp-paris
    cmssl.6.starlisp

* Thinking Machines Corporation sponsors the CMSSL communications
newsletter.
The first issue has come out, describing the library, listing new
technical reports available, and taking the tridiagonal solver as a
case study for timing data.  Get a free subscription by writing to:

    Laura Ellis
    Thinking Machines Corporation
    245 First Street
    Cambridge, Massachusetts 02142

* Thinking Machines Corporation has published three manuals about the
library: "CMSSL for FORTRAN", "CMSSL for LISP" and "CMSSL for Paris".

14) Exercises

14.1) Login

This is always an exercise in itself!  If possible, login to the UNIX
front ends (pscuxa, pscuxb or pscuxc).  If all else fails, login to
cpwsca or cpwscb, our VMS front ends.  Be sure to have your account
and password information available!  If you change your password,
don't use more than 8 alphanumeric characters!

14.2) Get help and documentation on CMSSL

Use the "man cmssl" command.  Locate the CMSSL online document, which
is in /usr/local/doc/cmssl.doc on UNIX systems, and in
PSCDOC:CMSSL.DOC on VMS.  Copy it to your own directory, and
determine the name of the routine that carries out sparse
matrix-vector multiplication.

14.3) Run the EXAMPLES program

For this exercise, you should be on the UNIX front ends.  The PSC has
provided a command called EXAMPLES.  This command helps you examine
and copy any of the examples of program usage for our machines and
software that we have cataloged.  To run the program, simply type
"examples", preferably on the ULTRIX front ends.  The program will
display a menu.  Specify that the machine you are interested in is
the "cm2", then that the package you are interested in is "cmssl".
Finally, tell it that you want to type or copy some of the files in
that directory to your current directory.  Try to copy all the files
of the form "invert.*".  You should find that two files have been
copied.

14.4) Run an example of CMSSL usage

This exercise requires that you be logged in on CMVAX itself.  You
will also need the file "invert.fcm" from the CMSSL examples
directory, as copied in the previous exercise.  Type the following
commands:

    cmf invert.fcm -lcmssl
    cmattach -w a.out > invert.out

These commands should compile your program, including the CMSSL
library, then wait for a free sequencer on the CM, attach to it, run
the executable, and store the output in the file "invert.out".  Use
the command

    more invert.out

to browse through the output, and make sure it looks OK.

14.5) Run the timing example

Get the file "timing.fcm" from the "cm_timer" example directory.  Run
it the way you ran the program in the previous exercise:

    cmf timing.fcm -lcmssl
    cmattach -w a.out > timing.out

The reason you still get output on the screen is that the ">" sign
redirects only "standard output"; you also have to redirect the
"standard error" output to the file.  Repeat the step using the
revised command:

    cmattach -w a.out >& timing.out

(You may have to delete or rename the old copy of "timing.out".)  Now
all the output should go to the file.

This example multiplies two 512 by 512 matrices, once using 32 bit
and once using 64 bit arithmetic.  To increase the problem size, you
simply have to change the two lines that read

    PARAMETER (N=512)

to another value.  Try setting N to 1024, doubling the order of the
matrices, and see how the execution time is affected.  Make a guess
as to the result of using N=2048, and see if you're right.

14.6) Write your own application

See if you understand how to create and use a program that calls
CMSSL.  Write a simple program that computes a matrix*vector product.
The problem to solve is

    ( 1 2 3 )   ( 1 )   ( 14 )
    ( 4 5 6 ) * ( 2 ) = ( 32 )
    ( 7 8 9 )   ( 3 )   ( 50 )

Figure out the CMSSL routine you will need by looking through
CMSSL.DOC.  Write the calling program.  (DON'T copy the example file.
Try to do this one yourself!)  The formula for A(I,J) is
A(I,J) = J + (I-1)*3, which means you can show off and use the CM
Fortran statement

    FORALL (I=1:3, J=1:3) A(I,J)=J+(I-1)*3

to initialize the matrix!  Be sure to print the result vector as
confirmation when you're done!  When you're done, or if you find
yourself getting stuck, refer to the CMSSL example "matvec.fcm".