Using the Distributed-Memory Parallel Version of ARPS
Originally written by Adwait Sathye
Center for Analysis and Prediction of Storms
University of Oklahoma
January, 1996
Updated September, 1997

1. Introduction

Moving a sequential code to a parallel version invariably leads to a significant number of changes to the original structure and content. We have tried to automate the process and shield the code changes from the end user. This document details the changes made to the ARPS code, how to `make' the model, the extra tools needed to split and combine the history dumps and other data files, and simple queue scripts for submitting the code on the Cray T3D and Cray T3E [at the Pittsburgh Supercomputing Center (PSC)] and the IBM SP-2 [at the Maui High Performance Computing Center (MHPCC)]. Note: When working with the T3D, code compilation and job submission are done on the front-end machine (mario, in the case of the PSC C90). The T3E machine at PSC is jaromir.psc.edu.
 

2. Changes to the ARPS code

All the code changes in this released version are handled by a translator that converts the sequential ARPS code into parallel form. The translator relies on the cMP comments embedded in the code, so do not delete or modify these lines in any form. The translator converts the original files and adds a PlatformMachine extension, where Platform can be pvm or mpi (for now) and Machine can be t3d, t3e, or sp2 (e.g., the translated version of tinteg3d.f is called tinteg3d_pvmt3e for PVM execution on the T3E; similarly, the ARPS executable on the T3E is called arps_pvmt3e). For cluster execution, the Machine part is omitted, i.e., there is only a Platform extension.

Note: If you are going to make changes to the code, make them in the original ARPS source, or they will be lost in the next translation. A couple of `rules of thumb':

The rest of this section details the changes made to the code to make it parallel. These will remain comment statements unless you create the parallel version. Note: All the changes are shown along with the original file name.

3. Setting up the model

This part covers the external changes to the ARPS model. Perhaps the most significant, at least the one with the most effect on ARPS'ers, is the change to dims.inc. The values of NX and NY in dims.inc now reflect the size of the grid per processor. The values NPROCX and NPROCY indicate the number of processors in the X and Y directions, and the total number of processors used in a run is NPROCX * NPROCY. The size of your entire grid follows from the per-processor NX and NY together with NPROCX and NPROCY (see the sketch below); the NZ value is not affected, since the decomposition is two-dimensional, along the X and Y directions only.
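As a rough worked example (not taken from the original document), the shell fragment below computes the total processor count and an approximate global grid size from example dims.inc values. It assumes the simple case where the global dimensions are just NPROCX*NX and NPROCY*NY; if your version of dims.inc accounts for overlapping boundary points between neighboring subdomains, the comments in that file are the authoritative reference.

# Example values only -- take NX, NY, NPROCX, NPROCY from your dims.inc.
nx=35 ; ny=35               # per-processor grid size (NX, NY)
nprocx=4 ; nprocy=4         # processor layout (NPROCX, NPROCY)
echo "total processors:    $((nprocx * nprocy))"
echo "approx. global grid: $((nprocx * nx)) x $((nprocy * ny)) x nz"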

It might just be easier to select your final grid size first and then work backwards to the per-processor values. You can set nprocx and nprocy depending on the size of your model and the number of jobs in the MPP queue. As a rule of thumb, the not-yet-optimized-for-MPP code performs at 11 MFLOPS per processor on the T3D, i.e., 32 processors are roughly equivalent to one dedicated C90 processor. It might be a good idea to use 16 processors for smaller models and 64 for the larger cases; it is usually harder to get 128 or more processors. Also, the total number of processors has to be a power of 2 on the T3D and T3E (though not on the SP-2). ARPS has already been tested on a 1024-processor Cray T3E.

A new file, par.inc, has been added to set up the sizes of the communication buffers. As far as possible, try to keep the buffer dimensions in par.inc at least as large as nx and ny; it will not hurt if those values are larger than nx and ny. We have added a subroutine that checks these values at run time, and the program will stop if nx and/or ny are set to be larger than the values in this file.

Also, if you will be using external boundary data to initialize or nudge the model, do not forget to set the values of nxebc and nyebc (in exbc.inc) to be the same as nx and ny.
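A trivial way to double-check these settings before building is to list the parameter statements in the include files; a minimal sketch, assuming you are in the directory that holds the ARPS include files:

grep -i parameter dims.inc par.inc exbc.inc    # review nx, ny, nprocx, nprocy, buffer and EXBC sizes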

4. Compiling the model

The compile script for the parallel model has been merged into makearps. To create a parallel version for the machine you are working on, type makearps arps_pvm. This creates the parallel version and the translator. To create the utility that splits data files (splitfiles), use the makearps splitfiles command, and to create the utility that joins the output history dumps (joinfiles), use the makearps joinfiles command. On the Cray T3D, Cray T3E, and IBM SP-2, a Machine extension is added (i.e., for the T3D the names become arps_pvmt3d, joinfiles_t3d, and splitfiles_t3d). Since the header files and libraries are platform dependent and their locations depend on where the system administrator placed them, we have added two new options to makearps: mpp_inc and mpp_lib. The mpp_inc option specifies the location of the include files (e.g., fpvm3.h, mpif.h), while mpp_lib specifies the location of the library files (e.g., libmpi.a, libfpvm3.a).
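Put together, a typical build session might look like the sketch below. It is illustrative only: the target names follow the usage described above, the header/library paths are placeholders for wherever your site keeps the PVM or MPI files, and the exact syntax of the mpp_inc/mpp_lib options may differ in your copy of makearps.

makearps arps_pvm       # build the parallel model (the executable becomes arps_pvmt3e on the T3E)
makearps splitfiles     # build the tool that splits input data files
makearps joinfiles      # build the tool that joins output history dumps
makearps mpp_inc /usr/local/include mpp_lib /usr/local/lib arps_pvm
                        # same build, pointing makearps at the PVM/MPI headers and libraries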

5. Running ARPS on parallel machines

5.1. Queue Scripts

The Cray uses the Network Queueing System (NQS) to schedule jobs, while the IBM SP-2 uses LoadLeveler. Most supercomputers have an interactive job limit, and queueing systems allow you to run larger and longer jobs non-interactively. The scripts are similar to shell scripts, with extra keywords (like QSUB in the case of NQS) that define your job limits and execution environment. A typical NQS header for the Cray T3D will look like

#QSUB -lM 7Mw -lT 15000
#QSUB -q mpp
#QSUB -l mpp_p=16
#QSUB -l mpp_t=9000
#QSUB -s /usr/local/bin/tcsh
#QSUB -eo
#QSUB -o $AFS/arps.log

The mpp_p option sets the number of processors, while mpp_t sets the computation time limit for each individual processor. The -q mpp option will work only on the PSC NQS.

A typical NQS header for Cray T3E will look like

#QSUB -lT 15000           # mandatory: command time limit
#QSUB -l mpp_p=2          # mandatory: number of application PEs
#QSUB -l mpp_t=9000       # strongly recommended: application PE time limit
#QSUB -o t3e.output -eo   # optional: output file name; -eo merges stderr into stdout
 
A sample LoadLeveler script looks like

#!/bin/csh
#
# initialdir = /u/asathye/ebc
# input = input.name
# output = arps.output
# error = arps.error
# class = Medium
# job_type = pvm3 (or parallel for MPI jobs)
# min_processors = 16
# max_processors = 16
# requirements = (Adapter == "hps_user")
# parallel_path = /u/asathye/pvm3/bin/RS6K/
#
# notify_user = your_email@your_address.edu
# queue

Alternatively, a program called 'xloadl' can help you construct the script through an X interface (select the File option). You can find more information on LoadLeveler at http://www.mhpcc.edu/training/workshop/html/loadleveler/LoadLeveler.html
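Once the header and the commands that actually run the model are in place, the script is handed to the batch system. The script file names below are placeholders, but the submission and status commands themselves are the standard NQS and LoadLeveler ones.

qsub arps.t3e.job       # submit an NQS batch job on the Cray T3D/T3E
qstat -a                # check the status of your NQS jobs
llsubmit arps.sp2.cmd   # submit a LoadLeveler job on the SP-2
llq                     # check the status of your LoadLeveler jobs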
 

 

5.2. Splitting the data files

If you are planning to use any external data files, you will have to split them before you execute the parallel version. The splitfiles tool is provided for this purpose: it scans the input file for files that need to be split and splits them (once again, splitfiles is called splitfiles_t3d, splitfiles_t3e, and splitfiles_sp2 on the T3D, T3E, and SP-2, respectively). Note: Some systems have a limit on the number of open files. Please check/set your environment limits, via the limit command, before you run the model.
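For example, a typical pre-run sequence on the T3E might look like the following (a sketch assuming csh and the input.name indirection described in section 5.3; check the tool's own usage message for the exact way it reads its input):

limit descriptors 256          # raise the open-file limit, if your site permits it
splitfiles_t3e < input.name    # split every data file named in the ARPS input file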

5.3. Running the model

To execute, say, a 16-processor job on the Cray T3D or Cray T3E, launch the parallel executable from within your queue script with the MPP launch command used at your site. Note that we had to change from feeding the input file directly to the program to feeding a file that contains the name of the input file: in the examples below, the file input.name should contain the name of the actual input file, for instance arps.input.
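The commands below are a hedged sketch rather than the verbatim commands from the original text: mpprun is the usual application launcher under UNICOS/mk on the T3E, and on the T3D the number of PEs is commonly supplied through the -npes loader option, but your site's documentation is the authority on the exact invocation.

mpprun -n 16 arps_pvmt3e < input.name    # Cray T3E: run on 16 PEs
arps_pvmt3d -npes 16 < input.name        # Cray T3D (assumed form)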

On the SP-2, we follow the typical PVM style. First copy the executable to the /pvm3/bin/RS6K/ directory. Then start the master process.
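For example (a sketch, assuming PVM's default binary directory for the RS6K architecture under your home directory, and that the PVM daemon is already running on the allocated nodes):

cp arps_pvm ~/pvm3/bin/RS6K/              # PVM's default binary directory for RS6K
~/pvm3/bin/RS6K/arps_pvm < input.name     # start the master task; it spawns the rest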

You can execute the MPI version with the standard MPI startup command on your system (POE on the SP-2).
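A sketch of the POE invocation; the executable name arps_mpisp2 simply follows the naming convention of section 2, and the processor count can equally be set through LoadLeveler or the MP_PROCS environment variable:

poe arps_mpisp2 -procs 16 < input.name   # run the MPI version on 16 processors via POE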

5.4. Joining the data files

The new tool joinfiles combines all the history dumps at the end of the run. Similar to splitfiles, it scans the input file for the runname and the frequency of history dumps, and then combines the history dumps. Note: Some systems have a limit on the number of open files. Please check/set your environment limits, via the limit command, before you run joinfiles.
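As with splitfiles, the invocation below is a sketch assuming the T3E name and the input.name indirection; adjust it to whatever your build produced.

joinfiles_t3e < input.name     # recombine the per-processor history dumps into single files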

A. A few technical details about the port
B. Architectural differences between the C90, the T3D, and the T3E
C. Location of the code
D. Job submission and status information

Note: The PSC T3E has two types of PEs, and you can specify which type you want. The command "/bin/setlabel -l S450 executable" tells the system that you would prefer to run the executable on 450 MHz PEs, but it does not guarantee that you will get them. To guarantee a specific type of PE, use "/bin/setlabel -l H450 executable"; this command modifies the header of the executable. To undo the change, use "/bin/setlabel -c executable".