xCAT HPC Production Readiness and Benchmark HOWTO (WIP)
(Formerly known as the xCAT Benchmark HOWTO)

It is generally a good idea to verify that the cluster you just built can actually do work.  This can be accomplished by running a few industry-accepted benchmarks.  The purpose of benchmarking is not merely to get the best results, but to get consistent, repeatable, accurate results that are also the best results.

With a goal of "consistent repeatable accurate results" it is best to start with as few variables as possible.  I recommend starting with single node benchmarks, e.g. STREAM.  If all machines have similar STREAM results, then memory can be ruled out as a factor in other benchmark anomalies.  Next work your way up to processor and disk benchmarks, then multinode (parallel) benchmarks, and finally HPL.  After each successively more complicated benchmark run, check for "consistent repeatable accurate results" before continuing.

Outlined below is a recommended path.

Single Node (serial) Benchmarks:

STREAM
NPB Serial
NPB OpenMP
IOzone

Parallel Benchmarks:

Ping-Pong
Pallas MPI Benchmark
NAS Parallel
High Performance Linpack

Prerequisites:

Feel free to follow your own path, but please start with the Sanity Checks and Torque/Maui IANS.

Sanity Checks

Torque/Maui IANS

Benchmarks

Sanity Checks

Hardware Check

When analyzing performance anomalies only compare apples-to-apples.  Every system setting (BIOS) must be identical.  Every hardware setting, configuration, OS, kernel, location of DIMMs, etc... must be identical.  Exact clones.

First check that each node in a class of nodes has the same amount of RAM, e.g.:

psh noderange free | grep Mem: | awk '{print $3}' | sort | uniq

It is OK to get multiple values as long as they are very close together.

Next check that each node in a class of nodes has the same number of processors, e.g.:

psh noderange 'cat /proc/cpuinfo | grep processor | wc -l' | awk '{print $2}' | sort | uniq

Next check that each node in a class of nodes has the same BIOS version, e.g.:

psh -s noderange $XCATROOT'/$(uname -m)/sbin/dmidecode;echo END' | \
perl -pe 'BEGIN { $/ = "END\n" } s/.*BIOS Information\n[^\n]*\n([^\n]*\n).*END/\1/sg' | \
grep Version | awk -F: '{print $3}' | sort | uniq


OR

psh compute $XCATROOT'/$(uname -m)/sbin/dmidecode | grep Version | head -1' | \
awk -F: '{print $3}' | sort | uniq


To find the offending nodes:

psh -s noderange $XCATROOT'/$(uname -m)/sbin/dmidecode;echo END' | \
perl -pe 'BEGIN { $/ = "END\n" } s/.*BIOS Information\n[^\n]*\n([^\n]*\n).*END/\1/sg' | \
grep Version | sort -k3

OR

psh compute $XCATROOT'/$(uname -m)/sbin/dmidecode | grep Version | head -1' | sort -k3


Network Check

No network, no cluster.  The stability of the network is critical.  As root type:

ppping noderange

Where noderange is a list of all nodes in the queuing system, plus any nodes that users will be submitting jobs from.  ppping (parallel parallel ping) is an xCAT utility that tells each node in the noderange to ping all the other nodes in the noderange.  No news is good news: no output is good; only errors are displayed.

If you have Myrinet, ssh to a Myrinet node and type:

ppping -i myri0 noderange

Where noderange is a list of all Myrinet nodes in the cluster.

Both tests are critical and must succeed; if they do not, you will never launch a job.

Disable Hyperthreading

Hyperthreading on Intel P4 and x86_64 (EM64T) processors does not aid the performance of processor- and memory-intensive HPC applications--it actually hurts performance.

Consider the HPL benchmark: xhpl (HT disabled) continues to improve as the problem size increases, while xhpl-ht (HT enabled) flattens out as the problem size increases.

All attempts to leverage HT in HPC applications, e.g. multithreaded libraries with OpenMP, have failed.
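Hyperthreading is disabled in the BIOS/firmware.  A quick sanity check is to count logical processors per node and compare against the known physical processor count, e.g.:

psh noderange 'grep -c ^processor /proc/cpuinfo' | awk '{print $2}' | sort | uniq -c

A 2-way node reporting 4 processors still has HT enabled.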

Setup User Environment

Never run benchmarks as root.  Use the addclusteruser command or equivalent to create a user on your primary user/login node.  If you are using NIS you are ready to go.  If you are using password synchronization then type:

pushuser noderange username

Where noderange is a list of all nodes in the queuing system, plus any nodes that users will be submitting jobs from.

Verify with psh that all nodes have mounted the correct /home directory, that your added user is visible, and that the directory permissions are correct:

psh -s noderange 'ls -l /home | grep username'

Where noderange is a list of all nodes in the queuing system, plus any nodes that users will be submitting jobs from, and username is the user you are posing as.
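It is also worth checking that the user resolves to the same UID and GID on every node (bob is used as an example username here):

psh noderange id bob | awk '{print $2, $3}' | sort | uniq

Multiple lines of output indicate a UID/GID mismatch that will cause permission problems on /home.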

Torque/Maui Check

Using pbstop and showq verify that all of your nodes are visible and ready.

Log in as your sample user (e.g. bob) and test Torque/Maui.

bob@head01:~> qsub -l nodes=2,walltime=1:00:00 -I
qsub: waiting for job 0.head01.foobar.org to start
qsub: job 0.head01.foobar.org ready

----------------------------------------
Begin PBS Prologue Thu Dec 19 14:17:53 MST 2002
Job ID: 0.head01.foobar.org
Username: bob
Group: users
Nodes: node10 node9
End PBS Prologue Thu Dec 19 14:17:54 MST 2002
----------------------------------------


Note the Nodes: line.  Try to ssh from node to node and back to the node from which you started qsub:

bob@node10:~> ssh node9
bob@node9:~> exit
logout
Connection to node9 closed.
bob@node10:~> ssh head01
bob@head01:~> exit
logout
Connection to head01 closed.
bob@node10:~> exit
logout

qsub: job 0.head01.foobar.org completed

Now try to ssh back to the nodes that were assigned; you should be denied:

bob@head01:~> ssh node9
14653: Connection closed by 199.88.179.209

32/64 Architectures

It is worth noting that the x86_64 and ppc64 architectures can generate and execute both 32-bit and 64-bit code.  Below are the relevant compiler options for each architecture.

x86_64:
  GNU:      -m32, -m64
  Intel:    32-bit: use the i686 compiler; 64-bit: use the x86_64 (EM64T) compiler
  PGI 5.2:  -tp k8-32, -tp k8-64
  IBM:      n/a

ppc64 (not SLES8):
  GNU:      -m32, -m64
  Intel:    n/a
  PGI 5.2:  n/a
  IBM:      -q32, -q64

ppc64 (SLES8):
  GNU:      32-bit: g??; 64-bit: powerpc64-linux-g??
  Intel:    n/a
  PGI 5.2:  n/a
  IBM:      -q32, -q64
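A quick way to verify which mode was used is to compile a trivial test program both ways and inspect the result with file (GNU flags shown; substitute the flags from the chart above for other compilers, and assume a trivial hello.c):

gcc -m32 -o hello32 hello.c
gcc -m64 -o hello64 hello.c
file hello32 hello64

file will report ELF 32-bit or ELF 64-bit accordingly.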

Software Install/Check

The following is the required base software.  Benchmark software installation will be covered in the Benchmark section of this document.

For best results use Intel Compilers for IA64 and EM64T, PGI Compilers for x86_64, IBM and GNU for PPC64, and Intel Compilers or PGI for IA32.  I have also had good results using PGI for just Fortran and GCC 3.3 for C on x86_64.  Before building or downloading GCC 3.3, verify that it is not already installed:

rpm -qa | grep gcc

You may need to use the full path to gcc33, e.g.:

/opt/gcc33/bin/gcc


Intel 8.0 Compilers (IA32/IA64)
Intel 8.1 Compilers (EM64T)

PGI 5.2 Compilers (Opteron/IA32)
IBM Compilers (PPC64)

GNU Compilers
Intel Math Kernel Libraries 7.0 (optional)
AMD Core Math Libraries (optional)
ESSL/PESSL (PPC64, optional) WIP

Goto Libraries (required for HPL)
ATLAS Libraries (optional)
MPICH
MPICH-GM

LAM-MPI (optional)

Intel 8.0/8.1 Compilers

PGI 5.2 Compilers

IBM Compilers

GNU Compilers

Intel Math Kernel Libraries 7.0

AMD Core Math Libraries

Goto Libraries

ATLAS Libraries

Optimally building ATLAS is beyond the scope of this document; however, current binary distributions are available.  It is recommended that you try both.

MPICH

MPICH is a freely available MPI implementation that runs over IP.  There are many ways to build MPICH.  Feel free to build it your way or the xCAT way:

MPICH-GM

MPICH-GM is a freely available MPI implementation that runs over Myrinet.  There are many ways to build MPICH-GM.  Feel free to build it your way or the xCAT way:

LAM-MPI

LAM-MPI is a freely available MPI implementation that runs over IP.  There are many ways to build LAM-MPI.  Feel free to build it your way or the xCAT way:

WIP

 

Torque/Maui IANS

If you are already familiar with starting and monitoring jobs through Torque/Maui you may still want to read this section to understand how Torque/Maui is setup by xCAT.

Attributes

xCAT and Torque share the same attributes.  xCAT attributes are stored in $XCATROOT/etc/nodelist.tab and Torque attributes are stored in /var/spool/pbs/server_priv/nodes.  The Torque nodes file is generated by xCAT's makepbsnodefile command.
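E.g., after changing node attributes, regenerate the Torque nodes file and restart pbs_server so that it rereads the file (a sketch; verify how makepbsnodefile writes the nodes file on your xCAT version):

makepbsnodefile
service pbs_server restart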

This document assumes that you have assigned an attribute of ia64 to all IA64 nodes and an attribute of x86 to all ia32/i686 nodes.  Examples using the attribute compute refer to any compute node (ia64 or x86).

It will be necessary to edit supplied Torque scripts with the correct node assignment attributes for your cluster.

Submitting a Job

qsub is the Torque command to submit a job.  Job submission is not allowed as root.  At a minimum you must specify how many nodes and for how long.

To request 16 IA64 nodes with 2 processors/node for 10 minutes interactively, assuming that Torque has ia64 assigned as an attribute to all the IA64 nodes, type:

$ qsub -l nodes=16:ia64:ppn=2,walltime=10:00 -I

To request 10 nodes of any attribute with 2 processors/node for 24 hours to run your Torque script foobar.pbs, type:

$ qsub -l nodes=10:ppn=2,walltime=24:00:00 foobar.pbs
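A minimal foobar.pbs might look something like the following (a sketch; the mpirun path and options depend on which MPI you built):

#!/bin/sh
#PBS -N foobar
#PBS -l nodes=10:ppn=2,walltime=24:00:00
# run from the directory the job was submitted from
cd $PBS_O_WORKDIR
NP=$(wc -l < $PBS_NODEFILE)
mpirun -np $NP -machinefile $PBS_NODEFILE ./foobar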

Monitoring and Job Output

Use the tools pbstop, qstat, and showq to monitor Torque/Maui.

If Torque has the appropriate patches and your home directory contains a .pbs_spool subdirectory, then you can monitor your job's stdout and stderr in real time with tail -f.  If you do not have this patch applied, your output will be written to the directory the job was submitted from, in the format jobname.ojobnumber for stdout and jobname.ejobnumber for stderr, and you will have to wait for the job to complete before the files are present.
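E.g., with the .pbs_spool patch in place, the spool files are typically named <jobid>.OU and <jobid>.ER (naming assumed here; check your Torque version):

tail -f ~/.pbs_spool/123.head01.foobar.org.OU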

Torque Help

RTFM.

E.g.

$ man qsub

Advice

For each benchmark, first run a small test interactively to shake out the scripts, then run a small test through Torque/Maui before submitting larger runs.

E.g. interactively:

$ cd ~/bench/benchmark
$ qsub -l nodes=2,walltime=1:00:00 -I
qsub: waiting for job 0.head01.foobar.org to start
qsub: job 0.head01.foobar.org ready

----------------------------------------
Begin PBS Prologue Thu Dec 19 14:17:53 MST 2002
Job ID: 0.head01.foobar.org
Username: bob
Group: users
Nodes: node10 node9
End PBS Prologue Thu Dec 19 14:17:54 MST 2002
----------------------------------------

node10:~> cd $PBS_O_WORKDIR
node10:~/bench/benchmark> ./benchmark.pbs
node10:~/bench/benchmark> exit

 

Benchmarks

STREAM

"The STREAM benchmark is a simple synthetic benchmark program that measures sustainable memory bandwidth (in MB/s) and the corresponding computation rate for simple vector kernels." -- http://www.cs.virginia.edu/stream/ref.html

IANS, STREAM just bangs on memory.  The TRIAD result is usually the one to look at.  STREAM is not a general-purpose memory exerciser and utilizes little memory; however, if you do have memory anomalies, STREAM can be affected.

STREAM results can be affected by:

When analyzing STREAM output only compare apples-to-apples.  Every system setting (BIOS) must be identical.  Every hardware setting, configuration, OS, kernel, location of DIMMs, etc... must be identical.  Exact clones.

Building and running STREAM:
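A minimal sketch using the C version (stream.c) from the URL above.  Edit the array size in stream.c (the macro is N in older versions, STREAM_ARRAY_SIZE in newer ones) so the arrays are much larger than cache, then:

gcc -O2 -o stream stream.c
./stream

Then run it on every node from a shared directory (path shown is only an example) and compare the Triad numbers:

psh noderange /home/bob/stream | grep Triad | sort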

NPB Serial

"
The NAS Parallel Benchmarks (NPB) are a set of 8 programs designed to help evaluate the performance of parallel supercomputers. The benchmarks, which are derived from computational fluid dynamics (CFD) applications, consist of five kernels and three pseudo-applications. The NPB come in several flavors. NAS solicits performance results for each from all sources." -- http://www.nas.nasa.gov/NAS/NPB

The NAS Serial Benchmarks are the same as the NAS Parallel Benchmarks except that MPI calls have been taken out and they run on one processor.

Like the STREAM benchmark each node should be identical (cloned) when checking for variability.  Consistent STREAM results should be achieved first.
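A build/run sketch (NPB 3.x serial layout assumed; directory and binary names vary slightly between NPB versions):

cd NPB3.0-SER
cp config/make.def.template config/make.def    # set your compilers here
make bt CLASS=A
./bin/bt.A.x

Run the same class on every node and compare the reported Mop/s totals.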

NPB OpenMP

"The NAS Parallel Benchmarks (NPB) are a set of 8 programs designed to help evaluate the performance of parallel supercomputers. The benchmarks, which are derived from computational fluid dynamics (CFD) applications, consist of five kernels and three pseudo-applications. The NPB come in several flavors. NAS solicits performance results for each from all sources." -- http://www.nas.nasa.gov/NAS/NPB

The NAS OpenMP Benchmarks are the same as the NAS Parallel Benchmarks except that MPI calls have been replaced with OpenMP directives to use multiple processors on a shared-memory system.

Like the STREAM benchmark each node should be identical (cloned) when checking for variability.  Consistent STREAM and NPB Serial results should be achieved first.
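When running the OpenMP flavor, set the thread count to the number of physical processors (a 2-way node is assumed here; binary name as built above, which may vary by NPB version):

export OMP_NUM_THREADS=2
./bin/bt.A.x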

IOzone

"IOzone is a filesystem benchmark tool. The benchmark generates and measures a variety of file operations.  Iozone is useful for performing a broad filesystem analysis of a vendor’s computer platform." -- http://www.iozone.org

Like the STREAM benchmark each node should be identical (cloned) when checking for variability.  Consistent STREAM and NPB Serial results should be achieved first.
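E.g., a quick automatic survey, then a targeted sequential write/read test with a file size large enough to defeat memory caching (adjust the size to exceed node RAM; the file path is only an example):

./iozone -a

./iozone -s 4g -r 64k -i 0 -i 1 -f /tmp/iozone.tmp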

Ping-Pong

Ping-Pong is a simple benchmark that measures latency and bandwidth for different message sizes.  There are many ping-pong benchmarks available.  The origin of the pingpong.c included with xCAT is unknown.
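A build/run sketch using MPICH-style wrappers (it is assumed here that pingpong.c takes no arguments):

mpicc -O -o pingpong pingpong.c
qsub -l nodes=2,walltime=10:00 -I
mpirun -np 2 -machinefile $PBS_NODEFILE ./pingpong

For Myrinet, build and launch with the MPICH-GM equivalents instead.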

Pallas MPI Benchmark

Pallas MPI Benchmark (PMB) provides a concise set of benchmarks targeted at measuring the most important MPI functions.

Like the STREAM benchmark each node should be identical (cloned) when checking for variability.  Consistent STREAM, NPB Serial, and Ping-Pong results should be achieved first.
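A run sketch (the MPI-1 portion of PMB conventionally builds a binary named PMB-MPI1, assumed here; by default it cycles through its full benchmark list):

qsub -l nodes=16:ppn=2,walltime=1:00:00 -I
mpirun -np 32 -machinefile $PBS_NODEFILE ./PMB-MPI1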

NAS Parallel

"The NAS Parallel Benchmarks (NPB) are a set of 8 programs designed to help evaluate the performance of parallel supercomputers. The benchmarks, which are derived from computational fluid dynamics (CFD) applications, consist of five kernels and three pseudo-applications. The NPB come in several flavors. NAS solicits performance results for each from all sources." -- http://www.nas.nasa.gov/NAS/NPB

Like the STREAM benchmark each node should be identical (cloned) when checking for variability.  Consistent STREAM and NPB Serial results should be achieved first and PMB should pass.
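A build/run sketch for the MPI flavor (NPB 2.x/3.x layout assumed; the binary name encodes the class and process count):

make lu CLASS=B NPROCS=16
qsub -l nodes=8:ppn=2,walltime=1:00:00 -I
mpirun -np 16 -machinefile $PBS_NODEFILE ./bin/lu.B.16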

High Performance Linpack

"HPL is a software package that solves a (random) dense linear system in double precision (64 bits) arithmetic on distributed-memory computers. It can thus be regarded as a portable as well as freely available implementation of the High Performance Computing Linpack Benchmark." -- http://www.netlib.org/benchmark/hpl

Like the STREAM benchmark, each node should be identical (cloned) when checking for variability.  Consistent STREAM, NPB Serial, and NPB Parallel results should be achieved first, and PMB should pass.
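Most of the tuning effort goes into HPL.dat.  A common starting point is to size N so that the matrix fills roughly 80% of total cluster memory; for example, 16 nodes with 4 GB each gives N = sqrt(0.8 * 16 * 4 * 1024^3 / 8), roughly 83000 (node count and memory size here are only an example).  With 32 processors, the corresponding HPL.dat lines might look like:

83000        Ns
128          NBs    (NB depends on your BLAS; values between roughly 100 and 250 are common)
4            Ps
8            Qs

mpirun -np 32 -machinefile $PBS_NODEFILE ./xhpl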

Support

http://xcat.org


Egan Ford
egan@us.ibm.com
August 2004