xCAT HPC Production Readiness and Benchmark HOWTO (WIP)
(Formerly known as the xCAT Benchmark HOWTO)
It is generally a good idea to verify that the cluster you just built can actually do work. This can be accomplished by running a few industry-accepted benchmarks. The purpose of benchmarking is not simply to get the best results, but to get consistent, repeatable, accurate results that are also the best results.
With a goal of "consistent, repeatable, accurate results" it is best to start with as few variables as possible. I recommend starting with single-node benchmarks, e.g. STREAM. If all machines have similar STREAM results, then memory can be ruled out as a factor in other benchmark anomalies. Next work your way up to processor and disk benchmarks, then multinode (parallel) benchmarks, and finally HPL. After each more complex benchmark run, check for "consistent, repeatable, accurate results" before continuing.
Outlined below is a recommended path.
Single Node (serial) Benchmarks:
Parallel Benchmarks:
Prerequisites:
Feel free to follow your own path, but please start with the Sanity Checks and Torque IANS.
When analyzing performance anomalies only compare apples-to-apples. Every system setting (BIOS) must be identical. Every hardware setting, configuration, OS, kernel, location of DIMMs, etc... must be identical. Exact clones.
First check that each node per class of node has the same amount of RAM,
e.g.:
psh noderange free | grep Mem: | awk '{print $3}' | sort | uniq
It is OK to get multiple values as long as they are very close together.
Next check that each node per class of node has the same number of processors,
e.g.:
psh noderange 'cat /proc/cpuinfo | grep processor | wc -l' | awk '{print $2}' | sort | uniq
Next check that each node per class of node has the same BIOS version, e.g.:
psh -s noderange $XCATROOT'/$(uname -m)/sbin/dmidecode;echo END' | \
perl -pi -e '$/ = "END\n";' -e 's/.*BIOS Information\n[^\n]*\n([^\n]*\n).*END/\1/sg' | \
grep Version | awk -F: '{print $3}' | sort | uniq
OR
psh compute $XCATROOT'/$(uname -m)/sbin/dmidecode | grep Version | head -1' | \
awk -F: '{print $3}' | sort | uniq
To find the offending nodes:
psh -s noderange $XCATROOT'/$(uname -m)/sbin/dmidecode;echo END' | \
perl -pi -e '$/ = "END\n";' -e 's/.*BIOS Information\n[^\n]*\n([^\n]*\n).*END/\1/sg' | \
grep Version | sort +2
OR
psh compute $XCATROOT'/$(uname -m)/sbin/dmidecode | grep Version | head -1' | sort +2
Network Check
No network, no cluster. The stability of the network is critical. As root type:
ppping noderange
Where noderange is a list of all nodes in the queuing system, plus any nodes that users will be submitting jobs from. ppping (parallel parallel ping) is an xCAT utility that will tell each node in the noderange to ping all the nodes in the noderange. No news is good news, i.e. no output is good, only errors will be displayed.
If you have Myrinet, ssh to a Myrinet node and type:
ppping -i myri0 noderange
Where noderange is a list of all Myrinet nodes in the cluster.
Both tests are critical and must succeed; if they do not, you will never launch a job.
Disable Hyperthreading
Hyperthreading on Intel P4 and x86_64 processors does not aid the performance of processor- and memory-intensive HPC applications--it actually hurts performance. This is illustrated by the HPL benchmark: xhpl (HT disabled) continues to grow as the problem size increases, while xhpl-ht (HT enabled) flattens out as the problem size increases.
All attempts to leverage HT in HPC applications have failed, e.g. multithreaded libraries with OpenMP.
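To verify where HT is currently enabled, the following is a hedged sketch that reports the "siblings" count from /proc/cpuinfo on every node; on the single-core P4/Xeon parts of this era a siblings value of 2 generally means HT is on (the field name and its meaning are assumptions about your kernel's /proc/cpuinfo format):
psh noderange 'grep siblings /proc/cpuinfo | uniq' | awk '{print $NF}' | sort | uniq -c
Any node reporting a different value than its peers should have its BIOS setting corrected before benchmarking.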
Never run benchmarks as root. Use the addclusteruser command or equivalent to create a user on your primary user/login node. If you are using NIS you are ready to go. If you are using password synchronization then type:
pushuser noderange username
Where noderange is a list of all nodes in the queuing system, plus any nodes that users will be submitting jobs from.
Verify with psh that all nodes have mounted the correct /home directory and that your added user is visible and that the directory permissions are correct:
psh -s noderange 'ls -l /home | grep username'
Where noderange is a list of all nodes in the queuing system, plus any nodes that users will be submitting jobs from and username is the user you are posing as.
Using pbstop and showq verify that all of your nodes are visible and ready.
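As a quick cross-check, the following asks Torque directly which nodes it considers down or offline; pbsnodes is a standard Torque client command, and no output means every node is available:
pbsnodes -l
Fix (or deliberately offline) any listed nodes before starting benchmark runs.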
Login as your sample user (e.g. bob) and test Torque/Maui.
bob@head01:~> qsub -l nodes=2,walltime=1:00:00 -I
qsub: waiting for job 0.head01.foobar.org to start
qsub: job 0.head01.foobar.org ready
----------------------------------------
Begin PBS Prologue Thu Dec 19 14:17:53 MST 2002
Job ID: 0.head01.foobar.org
Username: bob
Group: users
Nodes: node10 node9
End PBS Prologue Thu Dec 19 14:17:54 MST 2002
----------------------------------------
Note the Nodes: line.
Try to ssh from node to node and back
to the user node that started qsub:
bob@node10:~> ssh node9
bob@node9:~> exit
logout
Connection to node9 closed.
bob@node10:~> ssh head01
bob@head01:~> exit
logout
Connection to head01 closed.
bob@node10:~> exit
logout
qsub: job 0.head01.foobar.org completed
Now try to ssh back to the
nodes that were assigned, you should be denied:
bob@head01:~> ssh node9
14653: Connection closed by 199.88.179.209
It is worth noting that x86_64 and ppc64 architectures can generate and execute 32-bit and 64-bit code. Below is a chart of the options. The default behavior is in bold.
                 | GNU                                     | Intel                                                 | PGI 5.2              | IBM
x86_64           | -m32, -m64                              | 32bit: use the i686 compiler; 64bit: use the x86_64   | -tp k8-32, -tp k8-64 | n/a
                 |                                         | compiler                                              |                      |
ppc64 not SLES8  | -m32, -m64                              | n/a                                                   | n/a                  | -q32, -q64
ppc64 SLES8      | 32bit: g??; 64bit: powerpc64-linux-g??  | n/a                                                   | n/a                  | -q32, -q64
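If you are unsure which mode a compiler produced, a simple hedged check on x86_64 is to build a trivial test program both ways and inspect the result with file(1); hello.c is any small C program you supply, and the 32-bit build assumes the 32-bit runtime libraries are installed:
# build the same source 32-bit and 64-bit, then verify the ELF class
gcc -m32 -o hello32 hello.c
gcc -m64 -o hello64 hello.c
file hello32 hello64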
The following is the required base software. Benchmark software installation will be covered in the Benchmark section of this document.
For best results use Intel Compilers for IA64 and EM64T, PGI Compilers for x86_64, IBM and GNU for PPC64, and for IA32 use Intel Compilers or PGI. I have also had good results using PGI for just Fortran and GCC 3.3 for C on x86_64. Before building or downloading GCC 3.3 verify that it is not already installed:
rpm -qa | grep gcc
You may need to use the full path to gcc33, e.g.:
/opt/gcc33/bin/gcc
Intel 8.0 Compilers (IA32/IA64)
Intel 8.1 Compilers (EM64T)
PGI 5.2 Compilers (Opteron/IA32)
IBM Compilers (PPC64)
GNU Compilers
Intel Math Kernel Libraries 7.0
(optional)
AMD Core Math Libraries (optional)
ESSL/PESSL (PPC64, optional) WIP
Goto Libraries (required for HPL)
ATLAS Libraries (optional)
MPICH
MPICH-GM
LAM-MPI (optional)
Intel Math Kernel Libraries 7.0
Optimally building ATLAS is beyond the scope of this document; however, current binary distributions are available. It is recommended that you try both.
MPICH is a freely available MPI implementation that runs over IP. There are many ways to build MPICH. Feel free to build it your way or the xCAT way:
export MPICHROOT=/usr/local/mpich
Usage: mpimaker: [mpich tarball] [up|smp] [compiler] [rsh|ssh|{full path to other}]

Supported Compilers/Architectures:
  gnu:                  i686(32bit), ia64(64bit), x86_64(64bit), ppc64(32bit)
  gnu32:                x86_64(32bit), ppc64(32bit)
  gnu64:                x86_64(64bit), ppc64(64bit)
  intel:                i686(32bit), ia64(64bit), x86_64(64bit) 8.x compiler support only
  pgi:                  i686(32bit)
  pgi32:                x86_64(32bit)
  pgi64:                x86_64(64bit)
  pgi64gcc:             x86_64(64bit) PGI Fortran, GNU gcc
  pgi64gcc33:           x86_64(64bit) PGI Fortran, GNU gcc33 (SLES8)
  pgi64gcc33nogmalloc:  x86_64(64bit) PGI Fortran, GNU gcc33 (SLES8) GM only --with-device=ch_gm:-disable-registration 5.2 compiler support only
  absoft64gcc33:        x86_64(64bit) Absoft Fortran, GNU gcc33 (SLES8)
  nag64gcc33:           x86_64(64bit) NAG Fortran, GNU gcc33 (SLES8)
  ibmcmp32:             ppc64(32bit) IBM VCCPP/XLF
  ibmcmp64:             ppc64(64bit) IBM VCCPP/XLF

Hint(s):
  /opt/xcat/build/mpi/mpimaker mpich-1.2.6..13.tar.gz smp pgi64 ssh
  /opt/xcat/build/mpi/mpimaker mpich-1.2.6.tar.gz smp pgi64 ssh
MPICH-GM is a freely available MPI implementation that runs over Myrinet. There are many ways to build MPICH-GM. Feel free to build it your way or the xCAT way:
export MPICHROOT=/usr/local/mpich
Usage: mpimaker: [mpich tarball] [up|smp] [compiler] [rsh|ssh|{full path to other}]

Supported Compilers/Architectures:
  gnu:                  i686(32bit), ia64(64bit), x86_64(64bit), ppc64(32bit)
  gnu32:                x86_64(32bit), ppc64(32bit)
  gnu64:                x86_64(64bit), ppc64(64bit)
  intel:                i686(32bit), ia64(64bit), x86_64(64bit) 8.x compiler support only
  pgi:                  i686(32bit)
  pgi32:                x86_64(32bit)
  pgi64:                x86_64(64bit)
  pgi64gcc:             x86_64(64bit) PGI Fortran, GNU gcc
  pgi64gcc33:           x86_64(64bit) PGI Fortran, GNU gcc33 (SLES8)
  pgi64gcc33nogmalloc:  x86_64(64bit) PGI Fortran, GNU gcc33 (SLES8) GM only --with-device=ch_gm:-disable-registration 5.2 compiler support only
  absoft64gcc33:        x86_64(64bit) Absoft Fortran, GNU gcc33 (SLES8)
  nag64gcc33:           x86_64(64bit) NAG Fortran, GNU gcc33 (SLES8)
  ibmcmp32:             ppc64(32bit) IBM VCCPP/XLF
  ibmcmp64:             ppc64(64bit) IBM VCCPP/XLF

Hint(s):
  /opt/xcat/build/mpi/mpimaker mpich-1.2.6..13.tar.gz smp pgi64 ssh
  /opt/xcat/build/mpi/mpimaker mpich-1.2.6.tar.gz smp pgi64 ssh
LAM-MPI is a freely available MPI implementation that runs over IP. There are many ways to build LAM-MPI. Feel free to build it your way or the xCAT way:
WIP
If you are already familiar with starting and monitoring jobs through Torque/Maui you may still want to read this section to understand how Torque/Maui is setup by xCAT.
Attributes
xCAT and Torque share the same attributes. xCAT attributes are stored in $XCATROOT/etc/nodelist.tab and Torque attributes are stored in /var/spool/pbs/server_priv/nodes. The Torque nodes file is generated by xCAT's makepbsnodefile command.
In this document we will assume that you assigned an attribute of ia64 to all IA64 nodes and an attribute of x86 to all ia32/i686 nodes. Examples using the attribute of compute refer to any compute node (ia64 or x86).
It will be necessary to edit supplied Torque scripts with the correct node assignment attributes for your cluster.
Submitting a Job
qsub is the Torque command to submit a job. Job submission is not allowed for root. At a minimum you must specify how many nodes and for how long.
To request 16 IA64 nodes with 2 processors/node for 10 minutes interactively, assuming that Torque has ia64 assigned as an attribute to all the IA64 nodes, type:
$ qsub -l nodes=16:ia64:ppn=2,walltime=10:00 -I
To request 10 nodes of any attribute with 2 processors/node for 24 hours to run your Torque script foobar.pbs, type:
$ qsub -l nodes=10:ppn=2,walltime=24:00:00 foobar.pbs
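If you do not already have a Torque script, the following is a minimal hedged sketch of what foobar.pbs might look like; the mpirun invocation and ./a.out are placeholders for your own MPICH build and executable:
#!/bin/bash
#PBS -l nodes=10:ppn=2,walltime=24:00:00
#PBS -j oe
# run from the directory the job was submitted from
cd $PBS_O_WORKDIR
# $PBS_NODEFILE lists the assigned nodes, one line per requested processor
NPROCS=`wc -l < $PBS_NODEFILE`
mpirun -np $NPROCS -machinefile $PBS_NODEFILE ./a.out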
Monitoring and Job Output
Use the tools pbstop, qstat, and showq to monitor Torque/Maui.
If Torque has the appropriate patches and your home directory contains a .pbs_spool subdirectory, then you can monitor your job's stdout and stderr in real time with tail -f. If you do not have this patch applied, your output will be placed in the directory the job was submitted from in the format jobname.ojobnumber for stdout and jobname.ejobnumber for stderr, and you will have to wait for the job to complete before the files are present.
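For example, with the spool patch in place something like the following should let you watch a running job; the job ID and the .OU/.ER suffixes are assumptions, so check the actual file names in your own ~/.pbs_spool directory:
ls ~/.pbs_spool
tail -f ~/.pbs_spool/123.head01.foobar.org.OU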
Torque Help
RTFM.
E.g.
$ man qsub
Advice
For each benchmark, first run a small test interactively to shake out all the scripts, then a small test through Torque/Maui, before submitting larger tests.
E.g. interactively:
$ cd ~/bench/benchmark
$ qsub -l nodes=2,walltime=1:00:00 -I
qsub: waiting for job 0.head01.foobar.org to start
qsub: job 0.head01.foobar.org ready
----------------------------------------
Begin PBS Prologue Thu Dec 19 14:17:53 MST 2002
Job ID: 0.head01.foobar.org
Username: bob
Group: users
Nodes: node10 node9
End PBS Prologue Thu Dec 19 14:17:54 MST 2002
----------------------------------------
node10:~> cd $PBS_O_WORKDIR
node10:~/bench/benchmark> ./benchmark.pbs
node10:~/bench/benchmark> exit
STREAM
"The STREAM benchmark is a simple synthetic benchmark program that measures
sustainable memory bandwidth (in MB/s) and the corresponding computation rate
for simple vector kernels." --
http://www.cs.virginia.edu/stream/ref.html
IANS, STREAM just bangs on memory. TRIAD results are usually what to look at. STREAM is not a general-purpose memory exerciser and utilizes little memory. However, if you do have memory anomalies, STREAM can be affected.
STREAM results can be affected by:
When analyzing STREAM output only compare apples-to-apples. Every system setting (BIOS) must be identical. Every hardware setting, configuration, OS, kernel, location of DIMMs, etc... must be identical. Exact clones.
Building and running STREAM:
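If you are not using an xCAT-supplied STREAM kit, the following hedged sketch shows one way to fetch, build, and run the double-precision C version on a single node; the compiler, flags, and output name are assumptions, not the xCAT build:
mkdir -p ~/bench/stream && cd ~/bench/stream
# fetch the reference C source and build it with modest optimization
wget http://www.cs.virginia.edu/stream/FTP/Code/stream.c
gcc -O2 -o stream_d stream.c
./stream_d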
stream_d results
benchmark     jobs  low      high     %     mean     median   std dev
c_ia64        16    3393.26  3488.88  2.81  3434.70  3441.13  28.74
c_omp_ia64    16    3811.37  3847.14  0.93  3831.52  3836.47  12.51
f_ia64        16    3394.98  3488.65  2.75  3433.98  3440.29  28.07
f_tuned_ia64  16    3390.86  3480.46  2.64  3432.07  3430.85  24.51
The number of jobs should equal the number of nodes even if you specified a ppn > 1. % (max variation from low to high) should be < 5%. Usually for STREAM it is 1-2%. Compiler options can have a large impact on variability and performance. E.g. after changing the optimization level from -O3 to -O2 stability increased, but performance dropped:
stream_d results
benchmark     jobs  low      high     %     mean     median   std dev
c_ia64        16    1038.20  1044.71  0.62  1041.85  1041.93  1.55
c_omp_ia64    16    1980.83  2005.46  1.24  1992.89  1992.90  6.73
f_ia64        16    1038.35  1046.26  0.76  1042.57  1042.57  1.93
f_tuned_ia64  16    1038.84  1044.78  0.57  1041.77  1041.76  1.47
Edit any of the makefiles, rebuild, and retest until you get the desired performance and stability. If you cannot reduce variability to < 5% you may have non-identical hardware configurations or hardware problems. Analyze each output file in ~/bench/stream/output to find the irregularities.
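A hedged one-liner such as the following can help: it pulls the Triad rate out of every STREAM output file and reports the low, high, and percent spread (it assumes the standard "Triad:" output line with the rate in the second column):
cd ~/bench/stream/output
grep -h Triad: * | awk '{print $2}' | sort -n | \
awk 'NR==1 {low=$1} {high=$1} END {printf "low=%s high=%s %%=%.2f\n", low, high, (high-low)/low*100}'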
NOTE: NUMA Systems.
Like the STREAM benchmark each node should be identical (cloned) when checking for variability. Consistent STREAM results should be achieved first.
NPB Serial
benchmark  jobs  low      high     %     mean     median   std dev
bt.A       16    1132.79  1142.69  0.87  1139.64  1140.16  2.41
cg.A       16    238.53   241.72   1.33  240.42   240.32   0.74
ep.A       16    21.96    21.99    0.13  21.97    21.99    0.01
ft.A       16    619.41   625.72   1.01  623.12   623.10   1.76
is.A       16    49.17    50.01    1.70  49.82    49.87    0.21
lu.A       16    830.55   859.95   3.53  839.37   836.44   9.15
mg.A       16    1056.95  1069.20  1.15  1062.30  1062.17  3.67
sp.A       16    713.53   719.78   0.87  716.31   716.83   2.01
The number of jobs should equal the n you passed to ./runit n. % (max variation from low to high) should be < 5%. Compiler options can have a large impact on variability and performance.
NPB OpenMP
"The NAS Parallel Benchmarks (NPB)
are a set of 8 programs designed to help evaluate the performance of parallel
supercomputers. The benchmarks, which are derived from computational fluid
dynamics (CFD) applications, consist of five kernels and three
pseudo-applications. The NPB come in several flavors. NAS solicits performance
results for each from all sources." --
http://www.nas.nasa.gov/NAS/NPB
The NPB OpenMP benchmarks are the same as the NAS Parallel Benchmarks except that MPI calls have been replaced with OpenMP directives to use multiple processors on a shared-memory system.
Like the STREAM benchmark each node should be identical (cloned) when checking for variability. Consistent STREAM and NPB Serial results should be achieved first.
NPB OpenMP
benchmark  jobs  low      high     %      mean     median   std dev
bt.A       16    2044.48  2064.61  0.98   2054.67  2054.68  4.60
cg.A       16    418.72   441.48   5.43   429.96   430.39   4.72
ep.A       16    44.43    44.46    0.06   44.45    44.45    0.00
ft.A       16    1042.91  1060.65  1.70   1053.08  1053.53  4.66
lu.A       16    1739.11  1963.40  12.89  1856.81  1873.96  67.51
mg.A       16    1467.50  1506.04  2.62   1486.37  1486.54  11.85
sp.A       16    1087.26  1104.33  1.57   1096.15  1095.74  5.68
The number of jobs should equal the n you passed to ./runit n. % (max variation from low to high) should be < 5%; however, multiprocessor tests can have higher variability. You may want to analyze the output, track down the nodes that are increasing the variability, then rerun. If the variability moves around, you do not have a hardware problem. E.g., lu.A has a high degree of variability:
$ cd ~/bench/NPB3.0/NPB3.0-OMP/output
$ grep total *lu* | sort +4 -n
lu.A.node012.88.master: Mop/s total = 1739.11
lu.A.node009.91.master: Mop/s total = 1748.24
lu.A.node008.92.master: Mop/s total = 1766.47
lu.A.node005.95.master: Mop/s total = 1800.81
lu.A.node002.98.master: Mop/s total = 1803.29
lu.A.node016.84.master: Mop/s total = 1819.56
lu.A.node007.93.master: Mop/s total = 1850.71
lu.A.node010.90.master: Mop/s total = 1866.03
lu.A.node011.89.master: Mop/s total = 1873.96
lu.A.node004.96.master: Mop/s total = 1899.82
lu.A.node001.99.master: Mop/s total = 1903.54
lu.A.node013.87.master: Mop/s total = 1903.75
lu.A.node006.94.master: Mop/s total = 1914.18
lu.A.node015.85.master: Mop/s total = 1916.82
lu.A.node014.86.master: Mop/s total = 1939.39
lu.A.node003.97.master: Mop/s total = 1963.40
Note that the output is evenly dispersed. This usually indicates software, but not necessarily a problem. Rerun the benchmark.
$ cd ~/bench/NPB3.0/NPB3.0-OMP/bin
Remove the benchmarks that you do not need to retest.
$ rm bt.A cg.A ...
$ cd ..
$ mv output output1
$ ./runit large n (you want multiple runs/node)
$ cd output
$ grep total *lu* | sort +4 -n | grep node013
lu.A.node013.172.master: Mop/s total = 1683.24
lu.A.node013.161.master: Mop/s total = 1834.06
lu.A.node013.151.master: Mop/s total = 1867.71
lu.A.node013.135.master: Mop/s total = 1943.91
Checking a random node, the output varies. Higher variation for this benchmark may be normal. Compiler options can also have a large impact on variability and performance.
"IOzone is a filesystem benchmark tool. The benchmark generates and measures a variety of file operations. Iozone is useful for performing a broad filesystem analysis of a vendor’s computer platform." -- http://www.iozone.org
Like the STREAM benchmark each node should be identical (cloned) when checking for variability. Consistent STREAM and NPB Serial results should be achieved first.
Iozone: Performance Test of File I/O
        Version $Revision: 3.217 $
        Compiled for 64 bit mode.
        Build: linux-ia64

Contributors: William Norcott, Don Capps, Isom Crawford, Kirby Collins,
              Al Slater, Scott Rhine, Mike Wisner, Ken Goss,
              Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
              Randy Dunlap, Mark Montague, Dan Million, Jean-Marc Zucconi,
              Jeff Blomberg, Erik Habbinga, Kris Strecker.

Run began: Tue Feb 17 20:10:42 2004

File size set to 33554432 KB
Record Size 16384 KB
Command line used: /home/ibm/bench/iozone/iozone.ia64 -j 1 -t 1 -l 1 -u 1 -L 128 -S 3072 -s 32G -i 0 -i 1 -r 16M
Output is in Kbytes/sec
Time Resolution = 0.000001 seconds.
Processor cache size set to 3072 Kbytes.
Processor cache line size set to 128 bytes.
File stride size set to 1 * record size.
Min process = 1
Max process = 1
Throughput test with 1 process
Each process writes a 33554432 Kbyte file in 16384 Kbyte records

Children see throughput for 1 initial writers = 59938.48 KB/sec
Parent sees throughput for 1 initial writers  = 45289.51 KB/sec
Min throughput per process                    = 59938.48 KB/sec
Max throughput per process                    = 59938.48 KB/sec
Avg throughput per process                    = 59938.48 KB/sec
Min xfer                                      = 33554432.00 KB

Children see throughput for 1 rewriters       = 69368.87 KB/sec
Parent sees throughput for 1 rewriters        = 48407.56 KB/sec
Min throughput per process                    = 69368.87 KB/sec
Max throughput per process                    = 69368.87 KB/sec
Avg throughput per process                    = 69368.87 KB/sec
Min xfer                                      = 33554432.00 KB

Children see throughput for 1 readers         = 59893.57 KB/sec
Parent sees throughput for 1 readers          = 59893.49 KB/sec
Min throughput per process                    = 59893.57 KB/sec
Max throughput per process                    = 59893.57 KB/sec
Avg throughput per process                    = 59893.57 KB/sec
Min xfer                                      = 33554432.00 KB

Children see throughput for 1 re-readers      = 59999.96 KB/sec
Parent sees throughput for 1 re-readers       = 59999.85 KB/sec
Min throughput per process                    = 59999.96 KB/sec
Max throughput per process                    = 59999.96 KB/sec
Avg throughput per process                    = 59999.96 KB/sec
Min xfer                                      = 33554432.00 KB
Do this for each file system you plan to test. Record the total time.
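For example, a hedged manual run against one file system might look like the following; the mount point is a placeholder and the options mirror the command line shown above, so adjust the file size and record size to your configuration:
cd /scr/PBS
# sequential write/rewrite (-i 0) and read/re-read (-i 1), 32 GB file, 16 MB records
time ~/bench/iozone/iozone.ia64 -s 32G -r 16M -i 0 -i 1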
iozone results
benchmark    nodes  low    high   %       mean   median  std dev
/tmp(w)      100    16.34  49.14  200.73  44.00  44.80   3.85
/tmp(r)      100    36.90  41.65  12.87   38.63  38.82   1.44
/scr/PBS(w)  100    55.90  70.70  26.47   64.35  64.33   1.13
/scr/PBS(r)  100    57.62  65.13  13.03   59.07  59.05   0.76
This analyze script only reports sequential block read and write performance in K/s. The number of jobs should equal the number of nodes even if you specified a ppn > 1. % (variation from low to high) varies from file system to file system.
The number of operations per second is much smaller than for a processor or memory benchmark, which may increase variability. The I/O operations are also more complex, which may increase variability further.
In the example above /tmp is on the system disk with swap. iozone uses all memory and will increase swap usage; it is possible that this may create anomalies when benchmarking a system disk. The /scr file system, however, has reduced variability. The mean and the median are very similar and close to the center between low and high, indicating a possible even distribution.
There are a few anomalies that should be checked. View the raw data in ~/bench/iozone/output. Rerun for that node only if desired.
Ping-Pong is a simple benchmark that measures latency and bandwidth for different message sizes. There are many ping-pong benchmarks available. The origin of the pingpong.c included with xCAT is unknown.
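Before submitting through Torque it can be useful to run the test by hand between exactly two nodes; this hedged sketch assumes an MPICH-style mpirun on your PATH and a compiled ./pingpong binary, with node9 and node10 only as example node names:
# two nodes, one process each
cat > machines <<EOF
node9
node10
EOF
mpirun -np 2 -machinefile machines ./pingpong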
pingpong
benchmark   jobs  low     high    %     mean    median  std dev
latency     8     11.74   11.79   0.42  11.76   11.77   0.02
4000.bw     8     64.90   65.00   0.15  64.97   65.00   0.04
8000.bw     8     85.40   85.50   0.11  85.46   85.50   0.05
16000.bw    8     100.50  100.60  0.09  100.52  100.50  0.04
32000.bw    8     163.40  163.50  0.06  163.43  163.40  0.05
64000.bw    8     191.70  191.90  0.10  191.81  191.80  0.06
128000.bw   8     209.80  209.90  0.04  209.85  209.90  0.05
256000.bw   8     219.20  219.40  0.09  219.30  219.30  0.05
512000.bw   8     224.40  224.60  0.08  224.46  224.50  0.07
1024000.bw  8     227.30  227.50  0.08  227.40  227.40  0.05
2048000.bw  8     229.70  229.80  0.04  229.75  229.80  0.05
The number of jobs should equal 1/2 of the nodes= value in pingpong.pbs even if you specified a ppn > 1. % (variation from low to high) should be < 5%. If you cannot reduce variability to < 5% you may have non-identical hardware configurations, hardware, or software problems. Analyze each output file in ~/bench/pingpong/output to find the irregularities. Rerun tests manually to validate software or hardware irregularities.
Pallas MPI Benchmark (PMB) provides a concise set of benchmarks targeted at measuring the most important MPI functions.
Like the STREAM benchmark each node should be identical (cloned) when checking for variability. Consistent STREAM, NPB Serial, and Ping-Pong results should be achieved first.
Download PMB2.2.1 from
http://www.pallas.com/e/products/pmb and save to
/tmp, then extract in
~/bench/PMB.
$ cd ~/bench/PMB
$ tar zxvf /tmp/PMB2.2.1.tar.gz
Build:
$ cd ~/bench/PMB2.2.1/SRC_PMB
$ cp -f Makefile.xcat Makefile
$ make clean
$ make
$ cp PMB-MPI1 ../..
Edit pmb.pbs and correct the settings for the Torque command options (usually just the attributes). nodes should equal as many nodes as you want to test, and ppn should equal the total number of processors in the node (usually 2). The walltime should be very large for a large cluster; 24 hours is good. Edit the MPICH environment variable for the location of MPICH.
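For example, the relevant lines in pmb.pbs might end up looking like the following; the attribute, counts, walltime, and MPICH path are examples only, and the exact variable layout depends on the pmb.pbs shipped with your xCAT build:
#PBS -l nodes=16:compute:ppn=2,walltime=24:00:00
# point MPICH at the mpimaker build you want to test
MPICH=/usr/local/mpich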
Manual test:
$ cd ~/bench/PMB
$ qsub -l nodes=2:compute:ppn=2,walltime=1:00:00 -I
$ cd $PBS_O_WORKDIR
$ ./pmb.pbs
Check pbm.2x2.out for errors.
Submit:
$ cd ~/bench/PMB
$ qsub pmb.pbs
Analyze. Check the pbm*.out file for errors. The purpose of this test is to validate that it runs and that there are no problems with MPI.
"The NAS Parallel Benchmarks (NPB) are a set of 8 programs designed to help evaluate the performance of parallel supercomputers. The benchmarks, which are derived from computational fluid dynamics (CFD) applications, consist of five kernels and three pseudo-applications. The NPB come in several flavors. NAS solicits performance results for each from all sources." -- http://www.nas.nasa.gov/NAS/NPB
Like the STREAM benchmark each node should be identical (cloned) when checking for variability. Consistent STREAM and NPB Serial results should be achieved first and PMB should pass.
NPB MPI
benchmark  jobs  low       high      %     mean      median    std dev
bt.C.16    8     4962.76   5019.85   1.15  5001.53   5009.26   19.22
cg.C.16    8     2099.15   2206.02   5.09  2168.93   2193.77   38.09
ep.C.16    8     351.20    352.30    0.31  352.01    352.29    0.47
ft.C.16    8     3875.62   4082.68   5.34  3994.40   4047.88   79.27
is.C.16    8     132.17    144.62    9.41  139.09    141.56    4.95
lu.C.16    8     12285.40  12461.64  1.43  12382.44  12388.76  50.90
mg.C.16    8     9305.53   9711.40   4.36  9449.80   9462.39   131.44
sp.C.16    8     4373.73   4457.04   1.90  4434.64   4448.91   28.32
The number of jobs should equal the n you passed to ./runit n. % (variation from low to high) should be < 5%; however, multiprocessor and multinode tests can have higher variability. You may want to analyze the output, track down the jobs that are increasing the variability, then rerun. If the variability moves around, you do not have a hardware problem. Higher variation for this benchmark may be normal. Compiler options can also have a large impact on variability and performance.
In this example is.C.16 has high variability; it also ran in < 10 seconds on IA64. This benchmark may be too small to get consistency, and unfortunately the is benchmark does not have a larger (class D) problem size.
"HPL is a software package that solves a (random) dense linear system in double precision (64 bits) arithmetic on distributed-memory computers. It can thus be regarded as a portable as well as freely available implementation of the High Performance Computing Linpack Benchmark." -- http://www.netlib.org/benchmark/hpl
Like the STREAM benchmark each node should be identical (cloned) when checking for variability. Consistent STREAM and NPB Serial and Parallel results should be achieved first and PMB should pass.
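As with the other benchmarks, start small. The following is a hedged sketch of a first manual HPL test on two nodes; it assumes an already-built xhpl binary and a tuned HPL.dat in ~/bench/hpl and an MPICH-style mpirun on your PATH:
$ cd ~/bench/hpl
$ qsub -l nodes=2:compute:ppn=2,walltime=1:00:00 -I
$ cd $PBS_O_WORKDIR
# 2 nodes x 2 processors = 4 MPI processes; xhpl reads its input from HPL.dat
$ mpirun -np 4 -machinefile $PBS_NODEFILE ./xhpl
Check the reported Gflops for each problem size for consistency across repeated runs before scaling up to the full cluster.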
Support
Egan Ford
egan@us.ibm.com
August 2004