xCAT HOWTO
This document was most recently modified:
09/27/2002
The most recent version of this document is available at:
http://xcat.org/docs/xcat-HOWTO.html
Author:
Matt Bohnsack
bohnsack@bohnsack.com  http://bohnsack.com/
and others

Make certain you're using the most recent version of this document before beginning a cluster implementation, or even before you begin reading. Be aware that software versions referenced in this document may be out of date. Always check for newer software versions and be aware of the stability of these newer versions.

Send additions and corrections to the author, so this document can continue to be improved.


Table of Contents
  1. Introduction
  2. Understanding What xCAT does - Feature/Functionality Hierarchy
  3. Getting the xCAT Software Distribution
  4. Getting Other Required Software
  5. Reading Related Documentation
  6. Getting Help
  7. Understanding Cluster Components and the Example Cluster's Architecture
  8. Installing the OS on the Management Node
  9. Upgrading RedHat Software
  10. Installing Custom Kernel
  11. Installing xCAT
  12. Configuring xCAT
  13. Configuring the Ethernet Switch
  14. Configuring Networking on the Management Node
  15. Doing the Compute Node Preinstall (stage1)
  16. Configuring the Terminal Servers
  17. Completing Management Node Setup
  18. Collecting MAC Addresses (stage2)
  19. Configuring MPN/ASMA/RSA (stage3)
  20. Configuring APC Master Switches
  21. Installing Compute Nodes
  22. Installing / Configuring Myrinet Software
  23. Installing Portland Group Compilers
  24. Installing MPICH MPI Libraries
  25. Installing LAM MPI Libraries
  26. Installing PBS Resource Manager
  27. Installing Maui Scheduler
  28. Deploying PBS on the Cluster
  29. Adding Users and Setting Up User Environment
  30. Verifying Cluster Operation With Pseudo Jobs and PBS
  31. Running a Simple MPI Job Interactively via PBS
  32. Running a Simple MPI Job in Batch Mode via PBS
  33. Performing Advanced Tasks
  34. Contributing to xCAT
  35. ChangeLog
  36. TODO
  37. Thanks

1. Introduction
xCAT is a collection of mostly script based tools to build, configure, administer, and maintain Linux clusters.

xCAT is for use by IBM and IBM Linux cluster customers only. xCAT is copyright © 2000, 2001, 2002 IBM corporation. All rights reserved. Use and modify all you like, but do not redistribute. No warranty is expressed or implied. IBM assumes no liability or responsibility.

This document describes how to implement a Linux cluster on IBM xSeries hardware using xCAT and other third party software. It covers the latest version of xCAT - v1.1RC9.9.5 with RedHat 6.2, 7.0, 7.1, 7.2, or 7.3 as an OS base. All of the examples cover installation on ia32 machines. xCAT does work with ia64 machines, but specific ia64 configuration is not detailed. Specific configuration examples from a somewhat common 32 node PIII cluster configuration are included.

You will need to adjust the configuration examples shown in this document to suit your particular cluster and architecture, but the examples should give a good general idea of what needs to be done. Please don't use this document verbatim as an implementation guide. You should rather use it as an inspiration to your own implementation. Use the man pages, source and other documentation that is available to figure out why certain design/configuration choices are made and how you can make different choices.

This document covers only a little of the hardware connectivity, cabling, etc. that is required to implement a cluster. Additional documentation including hardware installation and configuration is available as a RedBook at http://publib-b.boulder.ibm.com/Redbooks.nsf/9445fa5b416f6e32852569ae006bb65f/7b1ce6b3913cafb386256bdb007595e8?OpenDocument&Highlight=0,SG24-6623-00. If you're serious about implementing a cluster and learning how things work, you should read the RedBook in addition to this document.

Back to TOC

2. Understanding What xCAT Does - Feature/Functionality Hierarchy
This section explains what you can do with xCAT, why xCAT is designed the way it is, and presents a feature/functionality hierarchy.

2.1 Understanding What Drives xCAT's Design and Architecture      
xCAT's architecture and feature set have two major drivers:
  1. Real world requirements - The features in xCAT are a result of the requirements met in hundreds of real cluster implementations. When users have had needs that xCAT or other cluster management solutions couldn't meet, xCAT has risen to the challenge. Over the last few years, this process has been repeatedly applied, resulting in a modular toolkit that represents best practices in cluster management and a flexibility that enables it to change rapidly in response to new requirements and work with many cluster topologies and architectures.
  2. Unmatched Linux clustering experience - The people involved with xCAT's development have used xCAT to implement many of the world's largest Linux clusters and a huge variety of different cluster types. The challenges faced during this work have resulted in features that enable xCAT to power all types of Linux clusters, from the very small to the largest ever built.
2.2 Understanding What Types of Clusters xCAT is Good For      
xCAT works well with the following cluster types:

HPC - High Performance Computing: Physics, Seismic, CFD, FEA, Weather, and other simulations; Bioinformatic work
HS - Horizontal Scaling: Web farms, etc.
Administrative: Not a traditional cluster, but a very convenient platform to install and administer a number of Linux machines
Windows or other OSes: With xCAT's cloning and imaging support, it can be used to rapidly deploy and conveniently manage clusters with compute nodes that run Windows or any other OS
Other: xCAT's modular toolkit approach makes it easy to adjust for building any type of cluster.


2.3 Understanding xCAT's Features      
A list of xCAT's current features, and features planned for the near future, follows:

Status Feature Type Feature
current OS/Distribution support RedHat 6.2, 7.0, 7.1, 7.2, 7.3, and 8.0-beta on management and compute / storage / interactive | head nodes
current OS/Distribution support RedHat for ia64 on management and compute / storage / interactive | head nodes
current OS/Distribution support Any OS on compute nodes via OS agnostic imaging support
current Hardware Control Remote Power control (on/off/state) via IBM Management Processor Network and/or APC Master Switch
current Hardware Control Remote software reset (Ctrl+Alt+Del)
current Hardware Control Remote Network BIOS/firmware update and configuration on a lot of IBM hardware
current Hardware Control Remote OS console via pluggable support for a number of different terminal servers
current Hardware Control Remote POST/BIOS console via IBM Management Processor Network and via terminal servers in upcoming IBM BIOS releases.
current Boot Control Ability to remotely change boot type (network or local disk) with syslinux
current Automated installation Parallel install via scripted RedHat kickstart on ia32 and ia64
current Automated installation Parallel install via imaging with other Linux distributions, Windows, or other OSes
current Automated installation Network installation with supported PXE NICs or via etherboot on supported NICs without PXE
current Monitoring Hardware alerts and email notification with IBM's Management Processor Network and SNMP alerts
current Monitoring Remote vitals (fan speed/temp/etc...) with IBM's Management Processor Network
future Monitoring Configurable and extensible monitoring support via mon
future Monitoring Graphical look at cluster status via ganglia.
current Monitoring Remote hardware event logs with IBM's Management Processor Network
current Administration Utilities Parallel remote shell, ping, rsync, and copy
current Administration Utilities Remote hardware inventory with IBM's Management Processor Network
current Software Stack PBS and Maui schedulers - Build scripts, documentation, automated setup, extra related utilities, and deep integration
current Software Stack Myrinet - automated setup and installation
current Software Stack MPI - Build scripts, documentation, automated setup for MPICH, MPICH-GM, and LAM
future Software Stack Sun Grid Engine scheduler support
current Usability Command line utilities for all cluster management functions
current Usability Single operations can be applied in parallel to multiple nodes with a very flexible and customizable group/range functionality
current Flexibility Support for various user defined node types

Back to TOC

3. Getting the xCAT Software Distribution
This section explains where and how you can get the xCAT software distribution.

Use the form available at http://xcat.org/download/ to request a username/password, if you don't already have one. Make certain that you fill out all the fields and provide a valid IBM sales contact. If you don't provide a valid IBM sales contact, your request may be ignored.

Back to TOC

4. Getting Other Required Software
This section will list all of the CDs, floppies, and software you might need to install a cluster, and where to get them (but it's not complete yet).

4.1 Firmware and Hardware Configuration Software      
4.2 Other Software      
For now, see the original sources cited throughout this document.

Back to TOC

5. Reading Related Documentation
There's quite a bit of related documentation available in various stages of completion. You should read it. It's all accessible at http://xcat.org/docs/.

Back to TOC

6. Getting Help
If you need assistance with building, maintaining, or administering your xCAT cluster, or you have an xCAT feature request, try the xCAT-user mailing list or contact your IBM sales rep or other IBM point of contact.

Back to TOC

7. Understanding Cluster Components and the Example Cluster's Architecture
This document bases most of its examples on a 32 node cluster that uses serial terminal servers for out-of-band console access, an APC Master Switch and IBM's Service Processor Network for remote hardware management, ethernet, and Myrinet. The following diagrams describe some of the detail of this example cluster:

7.1 Components / Rack Layout      
Here you see how the hardware is positioned in the rack. Starting from the bottom and moving towards the top, we have:
  • The Myrinet switch: Used for high-speed, low-latency inter-node communication. Your cluster may not have Myrinet, if you aren't running parallel jobs that do heavy message passing, or if it doesn't fit in your budget.

  • Nodes 1-16: The first 16 compute nodes. Note that every 8th node has an MPA (Management Processor Adapter) installed. You may have RSA adapters or ASMA adapters. These cards enable the SPN (Service Processor Network) to function and remote hardware management to be performed.

  • Monitor/Keyboard: You know what this is.

  • Terminal servers: The terminal servers enable serial consoles from all of the compute nodes to be accessible from the management node. You will find this feature very useful during system setup and for administration after setup.

  • APC master switch: This enables remote power control of devices that are not part of the Service Processor Network... terminal servers, Myrinet switch, ASMA adapters, etc.

  • The management node: The management node is where we install the rest of the nodes from, manage the cluster, etc.

  • Nodes 17-32: The rest of the compute nodes... again with Management Processor cards every 8th node.

  • Ethernet switch: Finally, at the top, we have the ethernet switch.
 
Ethernet Switch
node32
... nodes 27 - 31
node26
node25 MPA
node24
... nodes 19 - 23
node18
node17 MPA
Management Node
apc1 APC Master Switch
ts2 Terminal Servers
ts1
Monitor / Keyboard    
node16
... nodes 11 - 15
node10
node09 MPA
node08
... nodes 03 - 07
node02
node01 MPA has MPA card
Myrinet Switch

7.2 Networks      
Here you see the networks that are used in this document's examples. Note the listing of attached devices to the right. Important things to note are:
  • The external network is the organization's main network. In this example, only the management node has connectivity to the external network.

  • The ethernet switch hosts both the cluster and management network on separate VLANs.

  • The cluster network connects the management node to the compute nodes. We use a private class B network that has no connectivity to the external network. This is often the easiest way to do things and a good thing to do if you think your cluster might grow to more than 254 nodes. You may have a requirement to place the compute nodes on a network that is part of your external network.

  • The management network is a separate network used to connect all devices associated with cluster management... terminal servers, ASMA cards, etc. to the management node.

  • Parallel jobs use the message passing network for interprocess communication. Our example uses a separate private class B network over Myrinet. If you are not using Myrinet, this network could be the same as the cluster network. i.e. You could do any required message passing over the cluster network.
 

Cluster Network <- eth0 on management node (1Gb/s)
172.16.0.0/16 <- eth0 on compute nodes (100Mb/s)
  <- Myrinet switch's management port
 
Management Network <- eth1 on management node (100Mb/s)
172.17.0.0/16 <- terminal server's ethernet interfaces
  <- ASMA adapter's ethernet interfaces
  <- APC MasterSwitch's ethernet interface
   
 
External Network <- eth2 on management node (100Mb/s)
10.0.0.0/8
 
Message Passing Network <- myri0 on compute nodes (2Gb/s)
172.18.0.0/16

7.3 Connections      
(architecture connections diagram)

7.4 Another Connections Diagram      

7.5 Other Architecture Notes      
Other notes about this architecture (and areas where yours may differ and you may need to make adjustments to this document's examples):
  • The compute nodes have no access to the external network.
  • The compute nodes get DNS, DHCP, and NIS services from the management node.
  • NIS is used to distribute username/passwd information to the compute nodes and the management node is the NIS master.
  • The management node is the only node with access to the management network.
  • PBS and Maui are used to schedule/run jobs on the cluster.
  • Users can only access compute nodes when the scheduler has allocated nodes to them and then only with ssh.
  • Jobs will use MPICH or LAM for message passing.

Back to TOC

8. Installing the OS on the Management Node
The first step in building an xCAT cluster is installing Linux on the management node. This is, roughly, how to do just that:

8.1 Create and Configure RAID Devices if Necessary      
If you are using a ServeRAID card in the management node, use the ServeRAID flash/config CD to update the ServeRAID firmware to v4.84 and define your RAID volumes. If you have other nodes with hardware RAID, you might as well update and configure them now as well. You can get this CD from http://www.pc.ibm.com/qtechinfo/MIGR-495PES.html.

8.2 NIS Notes      
If you plan on interacting with an external NIS server, check if it supports MD5 passwords and shadow passwords. If it doesn't support these modern features, don't turn them on during the install of the management node. I'm not absolutely certain on this point, but it's bitten me hard in the past, so be careful.
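A quick sanity check (a sketch; it assumes the management node is already bound to the external NIS domain and that the passwd map is exported under its usual name):

> ypwhich                 # confirm which NIS server is answering
> ypcat passwd | head -1  # MD5 hashes start with $1$; 13-character hashes are old-style crypt

If the password fields are 13-character DES crypt strings, leave MD5 passwords turned off on the management node.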

8.3 Partition Notes      
A good minimum drive partitioning scheme for the management node follows. YMMV:
/boot (200 MB)
/install (4 GB)
/usr/local/ (2.5 GB)
/opt/ (1 GB)
/var (1 GB per every 128 nodes)
/ (the rest of the disk)
SWAP (1 GB)

8.4 User Notes      
It's a good idea to create a normal user other than root during the install. I usually make an 'ibm' user.
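If you'd rather add the user after the install, a minimal sketch ('ibm' is just the username used in this document):

> /usr/sbin/useradd ibm
> passwd ibm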

8.5 DMA Notes      
Some IDE CDROM drives in x340s have a DMA problem that can cause large data errors to crop up in your install and later CD copying. If you have a CDROM drive that has this error, or if you don't want to risk having the frustrating experience of finding out if you do have a bad drive, you need to use this workaround:

Pass ide=nodma to the installer (i.e., at the boot prompt type: text ide=nodma), and then after the install is complete, add append="ide=nodma" to /etc/lilo.conf, run /sbin/lilo, and reboot to a system with IDE DMA disabled.

If you're using grub, you'll have to add ide=nodma to the end of the kernel line in the stanza you want to boot.
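For example, the kernel stanza in /boot/grub/grub.conf might end up looking something like this (a sketch; the kernel version and root device shown are placeholders for whatever your system actually uses):

title Red Hat Linux (2.4.18-3smp)
        root (hd0,0)
        kernel /vmlinuz-2.4.18-3smp ro root=/dev/sda9 ide=nodma
        initrd /initrd-2.4.18-3smp.img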

I highly recommend that you do the above... I've seen this problem as recently as 2002-06-06.

8.6 Install RedHat      
  • RedHat 7.0, 7.1, 7.2, or 7.3
    Select custom installation. When asked for packages to install choose everything. As an added component, under Kernel options choose to additionally install the SMP kernel.

  • RedHat 6.2
    Install RedHat 6.2 with ServeRAID support using:
    • A RedHat 6.2 with updated ServeRAID CD (modified version of the RedHat 6.2 installation CD). You can get a copy of this CD at ftp://cartman.sl.dfw.ibm.com/OS/linux/ks62sr4.iso (IBM internal site).

      or

    • The standard issue RedHat 6.2 CD without the latest IBM ServeRAID support. You must create a support diskette with the software found at the following URL: http://www.pc.ibm.com/qtechinfo/MIGR-495PES.html. At the boot: prompt, type expert. When asked for disk support, insert your floppy and select the second ServeRAID choice. If you do not see two ServeRAID options in the scroll list, the device driver has not been loaded and you will need to restart your configuration.

    Select custom installation. When asked for packages to install choose everything. As an added component, under Kernel options choose to additionally install the SMP kernel.
8.7 Bring Up the Newly Installed System      
Reboot and login as root.

8.8 Turn Off Services We Don't Want (General)      
You probably want to turn off some of the network services that are turned on by default during installation for security and other reasons...

To view installed services:

> /sbin/chkconfig --list | grep ':on'

To turn off a service:

> /sbin/chkconfig <service> off

With RedHat 6.2, you'll also have to comment out the services you don't want run in /etc/inetd.conf and then restart inetd.

8.9 Turn Off Services We Don't Want (Specific)      
The following are examples of exactly what services to turn off for a system that works with our example architecture and will have nothing running that isn't necessary:
  • RedHat 6.2:
    /sbin/chkconfig --level 0123456 lpd off
    /sbin/chkconfig --level 0123456 linuxconf off
    /sbin/chkconfig --level 0123456 kudzu off
    /sbin/chkconfig --level 0123456 pcmcia off
    /sbin/chkconfig --level 0123456 isdn off
    /sbin/chkconfig --level 0123456 apmd off
    /sbin/chkconfig --level 0123456 autofs off
    /sbin/chkconfig --level 0123456 httpd off
    /sbin/chkconfig --level 0123456 reconfig off

    Also edit /etc/inetd.conf commenting out the following lines:

    #ftp stream tcp nowait root /usr/sbin/tcpd in.ftpd -l -a
    #telnet stream tcp nowait root /usr/sbin/tcpd in.telnet
    #shell stream tcp nowait root /usr/sbin/tcpd in.rshd
    #login stream tcp nowait root /usr/sbin/tcpd in.rlogind
    #talk dgram udp wait nobody.tty /usr/sbin/tcpd in.talkd
    #ntalk dgram udp wait nobody.tty /usr/sbin/tcpd in.ntalkd
    #finger stream tcp nowait nobody /usr/sbin/tcpd in.fingerd
    #linuxconf stream tcp wait root /bin/linuxconf linuxconf --http

    Then restart inetd:

    > /sbin/service inet restart

  • RedHat 7.1:
    /sbin/chkconfig --level 0123456 autofs off
    /sbin/chkconfig --level 0123456 linuxconf off
    /sbin/chkconfig --level 0123456 reconfig off
    /sbin/chkconfig --level 0123456 isdn off
    /sbin/chkconfig --level 0123456 pppoe off
    /sbin/chkconfig --level 0123456 iptables off
    /sbin/chkconfig --level 0123456 ipchains off
    /sbin/chkconfig --level 0123456 apmd off
    /sbin/chkconfig --level 0123456 pcmcia off
    /sbin/chkconfig --level 0123456 rawdevices off
    /sbin/chkconfig --level 0123456 lpd off
    /sbin/chkconfig --level 0123456 kudzu off
    /sbin/chkconfig --level 0123456 pxe off

  • RedHat 7.0, 7.2, 7.3:
    You get the point. Use your judgement.
8.10 Erase LAM Package      
You probably want to remove the RedHat LAM package. It can easily get in the way of the MPI software we install later on, because it's an old version and installs itself in /usr/bin:

> rpm --erase lam

Back to TOC

9. Upgrading RedHat Software
RedHat, like all software, has bugs. You should upgrade RedHat with all the available fixes to have the most stable and secure system possible (with the obvious caution that some of the updates might have undesired behaviours). Unless you want your new mega-buck cluster to get rooted by script kiddies, do this!

9.1 Create a Place To Put the Updates and Go There      
We'll use this directory (/install/post/updates/rhxx) later on during the installation of the compute nodes as well.

Substitute rh62, rh70, rh71, rh72, or rh73 for rhxx depending on what version of RedHat you are using.

> mkdir -p /install/post/updates/rhxx
> cd /install/post/updates/rhxx

9.2 Get the Updates      
Go to http://www.redhat.com/download/mirror.html and select a mirror site that has the "Updates" section of RedHat's FTP site.

Download all the rpms from updates/x.x/en/os/i386/ and updates/x.x/en/os/noarch/, to the directory you created above, where x.x is the RedHat release you are using. You will also want to download any glibc packages that are available in updates/x.x/en/os/i686/, so you have an optimized C library. If you're using the RedHat kernel, grab it from updates/x.x/en/os/i686/ as well.
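If the mirror you picked offers FTP or HTTP access, something like the following pulls everything down in one pass (a sketch; ftp.example-mirror.com is a placeholder for the mirror you selected, x.x is your RedHat release, and rhxx is the directory created above):

> cd /install/post/updates/rhxx
> wget -r -nd -l1 -A '*.rpm' ftp://ftp.example-mirror.com/pub/redhat/updates/x.x/en/os/i386/
> wget -r -nd -l1 -A '*.rpm' ftp://ftp.example-mirror.com/pub/redhat/updates/x.x/en/os/noarch/
> wget -r -nd -l1 -A 'glibc*.rpm' ftp://ftp.example-mirror.com/pub/redhat/updates/x.x/en/os/i686/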

9.3 Install the Updates      
Note the $(ls | egrep -v '^(kernel-)') stuff. We don't install RedHat's kernel updates, unless there's a very good reason.
  • Redhat 7.0 - 7.3:
    > cd /install/post/updates/rhxx
    > rpm -Fvh $(ls | egrep -v '^(kernel-)')

  • Redhat 6.2:
    Only if you're using RedHat 6.2...
    > cd /install/post/updates/rh62
    > rpm -Uivh --force --nodeps db3*
    > rpm -Fvh rpm*
    > rpm --rebuilddb
    > rpm -Fvh $(ls | egrep -v '^(kernel-)')

Back to TOC

10. Installing Custom Kernel
These instructions mention the use of pre-built xCAT custom kernels. In the past, you probably wanted to use these. Today, you should probably use a RedHat kernel update RPM or roll your own.

10.1 Download Custom Kernel      
Available at the link provided on the download page: (http://xcat.org/download/)

10.2 Install Kernel      
RedHat 7.2 or 7.3
Upgrade to the latest RedHat kernel RPM or roll your own from kernel.org.

RedHat 7.1
You may want to use the latest RedHat kernel or roll your own, but... then again, you may want to:
> cd / ; tar xzvf kernel-2.4.5-2hpc.tgz

RedHat 6.2, 7.0?
> cd / ; tar xzvf kernel-2.2.19-4hpc.tgz


10.3 Edit /etc/lilo.conf      
# BEGIN example of /etc/lilo.conf for Redhat 6.2 after editing
boot=/dev/sda
map=/boot/map
install=/boot/boot.b
prompt
timeout=50
linear
default=2.2.19-4hpc

image=/boot/vmlinuz-2.2.19-4hpc
        label=2.2.19-4hpc
        read-only
        root=/dev/sda9

image=/boot/vmlinuz-2.2.14-5.0smp
        label=linux
        initrd=/boot/initrd-2.2.14-5.0smp.img
        read-only
        root=/dev/sda9

image=/boot/vmlinuz-2.2.14-5.0
        label=linux-up
        initrd=/boot/initrd-2.2.14-5.0.img
        read-only
        root=/dev/sda9
# END example of /etc/lilo.conf
10.4 Run lilo to Install and Reboot      
> /sbin/lilo; reboot

Do the right thing if you're using grub instead of lilo.
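With grub there is no separate install step; just make sure the new kernel has its own stanza in /boot/grub/grub.conf and that 'default' points at it. A rough sketch, assuming the tarball dropped a /boot/vmlinuz-2.4.5-2hpc kernel and that your root filesystem is on /dev/sda9 as in the lilo.conf example above:

# BEGIN example grub.conf fragment
default=0
timeout=5

title 2.4.5-2hpc
        root (hd0,0)
        kernel /vmlinuz-2.4.5-2hpc ro root=/dev/sda9
# END example grub.conf fragment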

10.5 RH 7.2, 7.3 notes      
You will definitely want to use a RedHat kernel or roll your own if you're using RH 7.2 or 7.3.

10.6 e1000 notes      
You will want to download the latest e1000 driver from Intel's website and build it into your kernel. The e1000 driver supplied with RedHat performs very poorly and in some cases will not work in a VLANed environment. Do this.
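Roughly, the Intel driver source builds like this (a sketch; the version number is a placeholder and the exact procedure is described in the README that ships with the driver):

> tar xzvf e1000-x.y.z.tar.gz
> cd e1000-x.y.z/src
> make install   # builds and installs the e1000 module for the currently running kernel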

Back to TOC

11. Installing xCAT
Installing xCAT on the management node is very straightforward:

11.1 Download the Latest Version of xCAT      
http://xcat.org/download/
The latest version of xCAT is 1.1RC9.9.5.

11.2 Unpack xCAT Into /opt/      
> cd /opt
> tar xzvf /where/you/put/it/xcat-dist-1.1RC9.9.5.tgz

11.3 Install xCAT      
The following sets up some environment and other stuff. It will be a more all-inclusive script in the future.

> cd /opt/xcat/sbin
> ./setupxcat

Note: setupxcat must actually be run after xCAT .tab files are setup later on.

11.4 Understanding what setupxcat does      
TODO

11.5 Add xCAT Man Pages to $MANPATH and test out xCAT man pages      
Add the following line to /etc/man.config:

MANPATH /opt/xcat/man

Test out the man pages:

> man site.tab

Back to TOC

12. Configuring xCAT
This section describes some of the xCAT configuration necessary for the 32 node example cluster. If your cluster differs from this example, you'll have to make changes. xCAT configuration files are located in /opt/xcat/etc. You must setup these configuration files before proceeding.

12.1 Copy the Sample Config Files to Their Required Location      
> mkdir /opt/xcat/etc
> cp /opt/xcat/samples/etc/* /opt/xcat/etc

12.2 Create Your Own Custom Configuration      
Edit /opt/xcat/etc/* to suit your cluster. Please read the man pages 'man site.tab', etc., to learn more about the format of these configuration files. There is a bit more detail on some of these files in some of the later sections. The following are examples that will work with our example 32 node cluster...

12.3 site.tab      
/opt/xcat/etc/site.tab
# site.tab controls most of xCAT's global settings.
# man site.tab for information on what each field means.
# this example uses 'c' as a subdomain private to the cluster and
# 10.0.0.1 as the corp DNS server (forwarder).
rsh			/usr/bin/ssh
rcp			/usr/bin/scp
gkhfile			/opt/xcat/etc/gkh
tftpdir			/tftpboot
tftpxcatroot		xcat
domain			c.mydomain.com
dnssearch		c.mydomain.com,mydomain.com
nameservers		172.16.100.1
forwarders		10.0.0.1
nets			172.16.0.0:255.255.0.0,172.17.0.0:255.255.0.0,172.18.0.0:255.255.0.0
dnsdir			/var/named
dnsallowq		172.16.0.0:255.255.0.0,172.17.0.0:255.255.0.0,172.18.0.0:255.255.0.0
domainaliasip		172.16.100.1
mxhosts			c.mydomain.com,man-c.c.mydomain.com
mailhosts		man-c
master			man-c
homefs			man-c:/home
localfs			man-c:/usr/local
pbshome			/var/spool/pbs
pbsprefix		/usr/local/pbs
pbsserver		man-c
scheduler		maui
xcatprefix		/opt/xcat
keyboard		us
timezone		US/Central
offutc			-6
mapperhost		NA
serialmac		0
serialbps		9600
snmpc			public
snmpd			172.17.100.1
poweralerts		Y
timeservers		man-c
logdays			7
installdir		/install
clustername		Clever-cluster-name
dhcpver			2
dhcpconf		/etc/dhcpd.conf
dynamicr		eth0,ia32,172.30.0.1,255.255.0.0,172.30.1.1,172.30.254.254
usernodes		man-c
usermaster		man-c
nisdomain		NA
nismaster		NA
nisslaves		NA
homelinks		NA
chagemin		0
chagemax		60
chagewarn		10
chageinactive		0
mpcliroot		/opt/xcat/lib/mpcli
12.4 nodelist.tab      
/opt/xcat/etc/nodelist.tab
# nodelist.tab contains a list of nodes and defines groups that
# can be used in commands.  man nodelist.tab for more information.
node01	all,rack1,compute,myri,mpn1
node02	all,rack1,compute,myri,mpn1
node03	all,rack1,compute,myri,mpn1
node04	all,rack1,compute,myri,mpn1
node05	all,rack1,compute,myri,mpn1
node06	all,rack1,compute,myri,mpn1
node07	all,rack1,compute,myri,mpn1
node08	all,rack1,compute,myri,mpn1
node09	all,rack1,compute,myri,mpn2
node10	all,rack1,compute,myri,mpn2
node11	all,rack1,compute,myri,mpn2
node12	all,rack1,compute,myri,mpn2
node13	all,rack1,compute,myri,mpn2
node14	all,rack1,compute,myri,mpn2
node15	all,rack1,compute,myri,mpn2
node16	all,rack1,compute,myri,mpn2
node17	all,rack1,compute,myri,mpn3
node18	all,rack1,compute,myri,mpn3
node19	all,rack1,compute,myri,mpn3
node20	all,rack1,compute,myri,mpn3
node21	all,rack1,compute,myri,mpn3
node22	all,rack1,compute,myri,mpn3
node23	all,rack1,compute,myri,mpn3
node24	all,rack1,compute,myri,mpn3
node25	all,rack1,compute,myri,mpn4
node26	all,rack1,compute,myri,mpn4
node27	all,rack1,compute,myri,mpn4
node28	all,rack1,compute,myri,mpn4
node29	all,rack1,compute,myri,mpn4
node30	all,rack1,compute,myri,mpn4
node31	all,rack1,compute,myri,mpn4
node32	all,rack1,compute,myri,mpn4
rsa01	nan,mpa
rsa02	nan,mpa
rsa03	nan,mpa
rsa04	nan,mpa
ts01	nan,ts
ts02	nan,ts
myri01  nan
12.5 mpa.tab      
/opt/xcat/etc/mpa.tab
#service processor adapter management
#
#type      = asma,rsa
#name      = internal name (must be unique)
#            internal name should = node name
#            if rsa/asma is primary management
#            processor
#number    = internal number (must be unique and > 10000)
#command   = telnet,mpcli
#reset     = http(ASMA only),mpcli,NA
#dhcp      = Y/N(RSA only)
#gateway   = default gateway or NA (for DHCP assigned)
#
rsa01	rsa,rsa01,10001,mpcli,mpcli,NA,N,NA
rsa02	rsa,rsa02,10002,mpcli,mpcli,NA,N,NA
rsa03	rsa,rsa03,10003,mpcli,mpcli,NA,N,NA
rsa04	rsa,rsa04,10004,mpcli,mpcli,NA,N,NA
12.6 mp.tab      
/opt/xcat/etc/mp.tab
# mp.tab defines how the Service processor network is setup.
# node07 is accessed via the name 'node07' on the RSA 'rsa01', etc.
# man asma.tab for more information until the man page for mp.tab is ready
node01	rsa01,node01
node02	rsa01,node02
node03	rsa01,node03
node04	rsa01,node04
node05	rsa01,node05
node06	rsa01,node06
node07	rsa01,node07
node08	rsa01,node08
node09	rsa02,node09
node10	rsa02,node10
node11	rsa02,node11
node12	rsa02,node12
node13	rsa02,node13
node14	rsa02,node14
node15	rsa02,node15
node16	rsa02,node16
node17	rsa03,node17
node18	rsa03,node18
node19	rsa03,node19
node20	rsa03,node20
node21	rsa03,node21
node22	rsa03,node22
node23	rsa03,node23
node24	rsa03,node24
node25	rsa04,node25
node26	rsa04,node26
node27	rsa04,node27
node28	rsa04,node28
node29	rsa04,node29
node30	rsa04,node30
node31	rsa04,node31
node32	rsa04,node32
12.7 apc.tab      
/opt/xcat/etc/apc.tab
# apc.tab  defines  the  relationship  between nodes and APC
# MasterSwitches and their assigned outlets.  In our example,
# the power for rsa01 is plugged into the 1st outlet of the
# APC MasterSwitch, etc.
rsa01	apc1,1
rsa02	apc1,2
rsa03	apc1,3
rsa04	apc1,4
ts01	apc1,5
ts02	apc1,6
myri01	apc1,7
12.8 conserver.cf      
/opt/xcat/etc/conserver.cf
# conserver.cf defines how serial consoles are accessed.  Our example
# uses the ELS terminal servers and node01 is connected to port 1 
# on ts01, node02 is connected to port 2 on ts01, node17 is connected to
# port 1 on ts02, etc.
# man conserver.cf for more information
#
# The character '&' in logfile names are substituted with the console
# name.  Any logfile name that doesn't begin with a '/' has LOGDIR
# prepended to it.  So, most consoles will just have a '&' as the logfile
# name which causes /var/log/consoles/ to be used.
#
LOGDIR=/var/log/consoles
#
# list of consoles we serve
#    name : tty[@host] : baud[parity] : logfile : mark-interval[m|h|d]
#    name : !host : port : logfile : mark-interval[m|h|d]
#    name : |command : : logfile : mark-interval[m|h|d]
#
node01:!ts01:3001:&:
node02:!ts01:3002:&:
node03:!ts01:3003:&:
node04:!ts01:3004:&:
node05:!ts01:3005:&:
node06:!ts01:3006:&:
node07:!ts01:3007:&:
node08:!ts01:3008:&:
node09:!ts01:3009:&:
node10:!ts01:3010:&:
node11:!ts01:3011:&:
node12:!ts01:3012:&:
node13:!ts01:3013:&:
node14:!ts01:3014:&:
node15:!ts01:3015:&:
node16:!ts01:3016:&:
node17:!ts02:3001:&:
node18:!ts02:3002:&:
node19:!ts02:3003:&:
node20:!ts02:3004:&:
node21:!ts02:3005:&:
node22:!ts02:3006:&:
node23:!ts02:3007:&:
node24:!ts02:3008:&:
node25:!ts02:3009:&:
node26:!ts02:3010:&:
node27:!ts02:3011:&:
node28:!ts02:3012:&:
node29:!ts02:3013:&:
node30:!ts02:3014:&:
node31:!ts02:3015:&:
node32:!ts02:3016:&:
%%   
#
# list of clients we allow
# {trusted|allowed|rejected} : machines
#
trusted: 127.0.0.1
12.9 conserver.tab      
/opt/xcat/etc/conserver.tab
# conserver.tab  defines  the relationship between nodes and
# conserver servers.  Our example uses only one conserver on
# the localhost.  man conserver.tab for more information.
node01	localhost,node01
node02	localhost,node02
node03	localhost,node03
node04	localhost,node04
node05	localhost,node05
node06	localhost,node06
node07	localhost,node07
node08	localhost,node08
node09	localhost,node09
node10	localhost,node10
node11	localhost,node11
node12	localhost,node12
node13	localhost,node13
node14	localhost,node14
node15	localhost,node15
node16	localhost,node16
node17	localhost,node17
node18	localhost,node18
node19	localhost,node19
node20	localhost,node20
node21	localhost,node21
node22	localhost,node22
node23	localhost,node23
node24	localhost,node24
node25	localhost,node25
node26	localhost,node26
node27	localhost,node27
node28	localhost,node28
node29	localhost,node29
node30	localhost,node30
node31	localhost,node31
node32	localhost,node32
12.10 nodehm.tab      
/opt/xcat/etc/nodehm.tab
#
#node hardware management
#
#power     = mp,baytech,emp,apc,apcp,NA
#reset     = mp,apc,apcp,NA
#cad       = mp,NA
#vitals    = mp,NA
#inv       = mp,NA
#cons      = conserver,tty,rtel,NA
#bioscons  = rcons,mp,NA
#eventlogs = mp,NA
#getmacs   = rcons,cisco3500
#netboot   = pxe,eb,ks62,elilo,file:,NA
#eth0      = eepro100,pcnet32,e100,bcm5700
#gcons     = vnc,NA
#serialbios   = Y,N,NA
#
#node	power,reset,cad,vitals,inv,cons,bioscons,eventlogs,getmacs,netboot,eth0,gcons,serialbios
#
node01  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node02  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node03  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node04  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node05  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node06  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node07  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node08  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node09  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node10  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node11  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node12  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node13  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node14  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node15  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node16  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node17  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node18  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node19  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node20  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node21  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node22  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node23  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node24  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node25  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node26  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node27  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node28  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node29  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node30  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node31  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node32  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
rsa01   apc,apc,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,N
rsa02   apc,apc,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,N
rsa03   apc,apc,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,N
rsa04   apc,apc,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,N
ts01    apc,apc,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,N
ts02    apc,apc,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,N
myri01  apc,apc,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,N
12.11 noderes.tab      
/opt/xcat/etc/noderes.tab
#
#TFTP         = Where is my TFTP server? 
#               Used by makedhcp to setup /etc/dhcpd.conf
#               Used by mkks to setup update flag location
#NFS_INSTALL  = Where do I get my files?
#INSTALL_DIR  = From what directory?
#SERIAL       = Serial console port (0, 1, or NA)
#USENIS       = Use NIS to authenticate (Y or N)
#INSTALL_ROLL = Am I also an installation server? (Y or N)
#ACCT         = Turn on BSD accounting
#GM           = Load GM module (Y or N)
#PBS          = Enable PBS (Y or N)
#ACCESS       = access.conf support
#GPFS         = Install GPFS
#INSTALL NIC  = eth0, eth1, ... or NA
#
#node/group	TFTP,NFS_INSTALL,INSTALL_DIR,SERIAL,USENIS,INSTALL_ROLL,ACCT,GM,PBS,ACCESS,GPFS,INSTALL_NIC
#
compute man-c,man-c,/install,0,N,N,N,Y,Y,Y,N,eth0
nan     man-c,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA

12.12 nodetype.tab      
/opt/xcat/etc/nodetype.tab
# nodetype.tab maps nodes to types of installs.
# Our example uses only one type, but you might have a few
# different types.. a subset of nodes with GigE, storage nodes,
# etc.  man nodetype.tab for more information.
########### !!!!!!!!!!!! this file can not contain comments !!!!
########### !!!!!!!!!!!! this file can not contain comments !!!!
########### !!!!!!!!!!!! this file can not contain comments !!!!
node01	compute73
node02	compute73
node03	compute73
node04	compute73
node05	compute73
node06	compute73
node07	compute73
node08	compute73
node09	compute73
node10	compute73
node11	compute73
node12	compute73
node13	compute73
node14	compute73
node15	compute73
node16	compute73
node17	compute73
node18	compute73
node19	compute73
node20	compute73
node21	compute73
node22	compute73
node23	compute73
node24	compute73
node25	compute73
node26	compute73
node27	compute73
node28	compute73
node29	compute73
node30	compute73
node31	compute73
node32	compute73
12.13 passwd.tab      
/opt/xcat/etc/passwd.tab
# passwd.tab defines some passwords that will be used in the cluster
# man passwd.tab for more information.
cisco		cisco
rootpw		netfinity
asmauser	USERID
asmapass	PASSW0RD

Back to TOC

13. Configuring the Ethernet Switch
This section describes configuring the ethernet switch. Some of it is a bit rough. I'd appreciate some more accurate content.

13.1 Connecting to the Switch and Setting IP Address and Password      
  • Cisco:
    Connect the management node's COM1 to the switch's console port and...
    > cu -l /dev/ttyS0 -s 9600
    Log in and enable.

    Assign an IP to the default VLAN:
    cisco> conf t
    cisco> int vlan1
    cisco> ip address 172.17.5.1 255.255.0.0
    cisco> exit

    Allow telnet access and set telnet password to 'cisco':
    cisco> conf t
    cisco> line tty 0 15
    cisco> login password cisco
    cisco> exit

    Set enable and console passwords:
    Set enable password:
    cisco> conf t
    cisco> enable password cisco
    cisco> exit

    Set console password:
    cisco> conf t
    cisco> line vty 0 4
    cisco> password cisco
    cisco> exit

  • Extreme Networks:
    > config vlan Default ipaddress 172.16.5.1 255.255.0.0

13.2 Changing Port Settings      
  • Cisco:
    Setup 'spanning-tree portfast':
    Without this option, DHCP may fail because it takes too long for a port to come online after a machine powers up. Do not set spanning-tree portfast on ports that will connect to other switches. Do the following on each port on your switch:
    cisco> conf t
    cisco> int Fa0/1
    cisco> spanning-tree portfast
    cisco> exit
    cisco> int Fa0/2
    cisco> spanning-tree portfast
    cisco> exit
    etc.

  • Extreme Networks:
    None needed that I am aware of.

13.3 Setting up Remote Logging      
Here we have the switch send all its logging information to the management node's management interface. (We'll enable remote syslogging on the management node later.)
  • Cisco:
    cisco> conf t
    cisco> logging 172.17.5.1
    cisco> exit

  • Extreme Networks:
    > config syslog 172.17.5.1

13.4 Setting up VLANs      
  • Cisco:
    # for each interface you want to put in a VLAN Fa0/1 .. Fa0/32, and gig ports { # clearly pseudo-code
      cisco> interface Fa0/1
      cisco> switchport mode access
      cisco> switchport access vlan 2
      cisco> exit
    # }

  • Extreme Networks:
    > create vlan man
    > config vlan man tag 2
    > config vlan man ipaddress 172.17.5.1 255.255.0.0
    > config Default delete port 1,2,3,4 # the ports you want in the management VLAN
    > config man add port 1,2,3,4
    > show vlan

13.5 Notes on VLANs with Multiple Switches      
  • Cisco:
     cisco> configure terminal
     cisco> interface Gi0/1
     cisco> switchport mode trunk
     cisco> switchport trunk encapsulation isl # you should probably use the standard encapsulation instead
     cisco> exit

  • Extreme Networks:
     unconfigure switch
     config Default delete port ports,you,don't,want,in,management,VLAN
    
     create vlan cluster
     config vlan cluster tag 2
     config cluster add port ports,you,want,in,cluster,VLAN
    
     show vlan
    
     save
13.6 Saving Your Changes      
You want to make certain your switch configuration is saved in case the switch is rebooted.
  • Cisco:
    cisco> write mem

  • Extreme Networks:
    extreme> save


Back to TOC

14. Configuring Networking on the Management Node
This section describes network setup on the management node:

14.1 /etc/hosts      
Create your /etc/hosts file. Make sure all devices are entered... terminal servers, switches, hardware management devices, etc.
The following is a sample of the /etc/hosts file for the example cluster:
#  Localhost
127.0.0.1		localhost.localdomain localhost
# 
########## Management Node ###################
#
# cluster interface (eth0) GigE
172.16.100.1  man-c.c.mydomain.com        man-c
#
# management interface (eth1)
172.17.100.1  man-m.c.mydomain.com        man-m
#
# external interface (eth2)
10.0.0.1      man.c.mydomain.com          man
#
########## Management Equipment ##############
#
# RSA adapters. You might have ASMA cards instead
172.17.1.1    rsa01.c.mydomain.com        rsa01
172.17.1.2    rsa02.c.mydomain.com        rsa02
172.17.1.3    rsa03.c.mydomain.com        rsa03
172.17.1.4    rsa04.c.mydomain.com        rsa04
#
# Terminal Servers
172.17.2.1    ts01.c.mydomain.com         ts01
172.17.2.2    ts02.c.mydomain.com         ts02
#
# APC Master Switch
172.17.3.1    apc1.c.mydomain.com         apc01
#
# Myrinet Switch's ethernet management port
172.17.4.1    myri01.c.mydomain.com       myri01
#
# Ethernet Switch
172.17.5.1    ethernet01c.c.mydomain.com  ethernet01c
172.16.5.1    ethernet01.c.mydomain.com   ethernet01
#
########## Compute Nodes #####################
#
#
172.16.1.1    node01.c.mydomain.com       node01
172.18.1.1    node01-myri0.c.mydomain.com node01-myri0
172.16.1.2    node02.c.mydomain.com       node02
172.18.1.2    node02-myri0.c.mydomain.com node02-myri0
172.16.1.3    node03.c.mydomain.com       node03
172.18.1.3    node03-myri0.c.mydomain.com node03-myri0
172.16.1.4    node04.c.mydomain.com       node04
172.18.1.4    node04-myri0.c.mydomain.com node04-myri0
172.16.1.5    node05.c.mydomain.com       node05
172.18.1.5    node05-myri0.c.mydomain.com node05-myri0
172.16.1.6    node06.c.mydomain.com       node06
172.18.1.6    node06-myri0.c.mydomain.com node06-myri0
172.16.1.7    node07.c.mydomain.com       node07
172.18.1.7    node07-myri0.c.mydomain.com node07-myri0
172.16.1.8    node08.c.mydomain.com       node08
172.18.1.8    node08-myri0.c.mydomain.com node08-myri0
172.16.1.9    node09.c.mydomain.com       node09
172.18.1.9    node09-myri0.c.mydomain.com node09-myri0
172.16.1.10   node10.c.mydomain.com       node10
172.18.1.10   node10-myri0.c.mydomain.com node10-myri0
172.16.1.11   node11.c.mydomain.com       node11
172.18.1.11   node11-myri0.c.mydomain.com node11-myri0
172.16.1.12   node12.c.mydomain.com       node12
172.18.1.12   node12-myri0.c.mydomain.com node12-myri0
172.16.1.13   node13.c.mydomain.com       node13
172.18.1.13   node13-myri0.c.mydomain.com node13-myri0
172.16.1.14   node14.c.mydomain.com       node14
172.18.1.14   node14-myri0.c.mydomain.com node14-myri0
172.16.1.15   node15.c.mydomain.com       node15
172.18.1.15   node15-myri0.c.mydomain.com node15-myri0
172.16.1.16   node16.c.mydomain.com       node16
172.18.1.16   node16-myri0.c.mydomain.com node16-myri0
172.16.1.17   node17.c.mydomain.com       node17
172.18.1.17   node17-myri0.c.mydomain.com node17-myri0
172.16.1.18   node18.c.mydomain.com       node18
172.18.1.18   node18-myri0.c.mydomain.com node18-myri0
172.16.1.19   node19.c.mydomain.com       node19
172.18.1.19   node19-myri0.c.mydomain.com node19-myri0
172.16.1.20   node20.c.mydomain.com       node20
172.18.1.20   node20-myri0.c.mydomain.com node20-myri0
172.16.1.21   node21.c.mydomain.com       node21
172.18.1.21   node21-myri0.c.mydomain.com node21-myri0
172.16.1.22   node22.c.mydomain.com       node22
172.18.1.22   node22-myri0.c.mydomain.com node22-myri0
172.16.1.23   node23.c.mydomain.com       node23
172.18.1.23   node23-myri0.c.mydomain.com node23-myri0
172.16.1.24   node24.c.mydomain.com       node24
172.18.1.24   node24-myri0.c.mydomain.com node24-myri0
172.16.1.25   node25.c.mydomain.com       node25
172.18.1.25   node25-myri0.c.mydomain.com node25-myri0
172.16.1.26   node26.c.mydomain.com       node26
172.18.1.26   node26-myri0.c.mydomain.com node26-myri0
172.16.1.27   node27.c.mydomain.com       node27
172.18.1.27   node27-myri0.c.mydomain.com node27-myri0
172.16.1.28   node28.c.mydomain.com       node28
172.18.1.28   node28-myri0.c.mydomain.com node28-myri0
172.16.1.29   node29.c.mydomain.com       node29
172.18.1.29   node29-myri0.c.mydomain.com node29-myri0
172.16.1.30   node30.c.mydomain.com       node30
172.18.1.30   node30-myri0.c.mydomain.com node30-myri0
172.16.1.31   node31.c.mydomain.com       node31
172.18.1.31   node31-myri0.c.mydomain.com node31-myri0
172.16.1.32   node32.c.mydomain.com       node32
172.18.1.32   node32-myri0.c.mydomain.com node32-myri0
14.2 Configure Network Devices      
Edit /etc/modules.conf, /etc/sysconfig/network-scripts/*, and /etc/sysconfig/network, to create a network configuration that reflects the cluster's design. The following samples work with the example cluster:
/etc/modules.conf
alias eth0 e1000
alias eth1 pcnet32
alias eth2 e100
/etc/sysconfig/network
NETWORKING=yes
HOSTNAME="man-c"
GATEWAY="10.0.0.254"
GATEWAYDEV="eth2"
FORWARD_IPV4="no"
/etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE="eth0"
BOOTPROTO="none"
IPADDR="172.16.100.1"
NETMASK="255.255.0.0"
ONBOOT="yes"
/etc/sysconfig/network-scripts/ifcfg-eth1
DEVICE="eth1"
BOOTPROTO="none"
IPADDR="172.17.100.1"
NETMASK="255.255.0.0"
ONBOOT="yes"
/etc/sysconfig/network-scripts/ifcfg-eth2
DEVICE="eth2"
BOOTPROTO="none"
IPADDR="10.0.0.1"
NETMASK="255.0.0.0"
ONBOOT="yes"
14.3 Setup DNS      
Setup the resolver:

> makedns master

If the management node's IP address is listed first in site.tab's nameservers field (as it is in our example), this will generate zone files from the data in /etc/hosts and start a DNS server as a master.

If the management node's IP address is listed in the nameservers field, but not first, then it will become a slave DNS server and will do a zone transfer from the master (the first IP address listed). If the management node's IP address is not in site.tab, then it's possible for the first IP address not to be in the cluster at all. Some installations use their own DNS server external to the cluster and set up all the cluster names there. In that case, put the IP address of the external DNS server as the #1 address in the nameservers field and the IP of the management node as #2; #2 then becomes a slave and will do a zone transfer. You'll also need to update either the .kstmp files or /install/post/sync (this is what I do) to copy a resolv.conf to the clients that tells them to use only the slave (#2) for DNS.
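For example, the resolv.conf you push to the compute nodes from /install/post/sync in the external-master case might look something like this (a sketch using the example cluster's names; the management node 172.16.100.1 is the slave the clients should query):

search c.mydomain.com mydomain.com
nameserver 172.16.100.1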

14.4 Verify Name Resolution      
Verify that forward and reverse name resolution work like in the following examples:

> host node01
> host 172.16.1.1

Do not continue until forward and reverse name resolution are working.

14.5 Setup VLANs and Configure Ethernet Switches      
If you have separate subnets for the management and compute networks, like in our example, you need to setup VLANs on the ethernet switches. Experience has shown that many strange problems are solved with the introduction of VLANs. Use VLANs to separate the ports associated with the management and cluster subnets. A set of somewhat random notes for configuring VLANs on different switches and setting up "spanning-tree portfast" on Ciscos is available in the Configuring the Ethernet Switch section above.

Note: Tie this into the configuring the ethernet switch section and remove redundant information. There are other subsections in this section that need similar treatment.

14.6 Verify Management Node Network Setup      
You might have to wait until some of the management equipment is setup in the steps that follow, to do full verification. You need to check that:
  • You can ping all of the network interfaces
  • You can ping other devices on all of the subnets (cluster, management, external, etc.)
  • You can ping and route through your gateway
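A quick sketch of how you might check a handful of these from the management node (the host names are from the example cluster; substitute your own):

> for h in man-c man-m man ts01 ts02 apc01 rsa01 ethernet01; do ping -c 1 $h > /dev/null && echo "$h ok" || echo "$h FAILED"; done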

Back to TOC

15. Doing the Compute Node Preinstall (stage1)
'stage1' is an automated procedure for updating and configuring system BIOSes. This section describes an easy way to upgrade the x330 compute nodes' firmware and set their BIOS settings. It is x330 specific. You're on your own with non-x330s... just make certain that you enable PXE boot by setting the boot order to be CDROM, Floppy, Network, HD and that you disable virus detection or any other interactive BIOS features.

Note: Tonko's new unified stage1 stuff will replace this section very soon.

15.1 Create a CD/Floppy to Configure the BIOS and SPN in Each x330      
If you have a lot of compute nodes, it's a good idea to make multiple sets. The flash/configuration process can take up to 10 minutes per machine, so you don't want to wait around for 10 minutes for each node.

For 1 GHz machines (model 8654):
Create a CD from the ISO image located at:
/opt/xcat/stage/stage1/x330.iso

Create a floppy from the floppy image located at:
/opt/xcat/stage/stage1/x330.dd
This can be done under Linux by placing a blank DOS formatted floppy into the floppy drive and:
> dd if=x330.dd of=/dev/fd0 bs=1024 conv=sync; sync

For machines > 1 GHz (model 8674):
Create a floppy from the floppy image located at:
/opt/xcat/stage/stage1/x330-6874-1.02.dd
This can be done under Linux by placing a blank DOS formatted floppy into the floppy drive and:
> dd if=x330-6874-1.02.dd of=/dev/fd0 bs=1024 conv=sync; sync

You don't need a CD.

15.2 Paranoia      
Things just seem to work better when you do this... Remove AC Power from all compute nodes. Disconnect all the SPN serial networking.
After waiting a minute or so, restore AC Power to compute nodes and apply power to ASMA cards.

15.3 Flash the Compute Nodes      
Insert the floppy (and the CD if you're flashing a model 8654) into each compute node, power on the node and let the disks do their magic. There is no need for any manual intervention during this process... ignore any BIOS errors etc, they'll go away in 15 seconds or so. When the flash/config is finished, the CD will eject and the display should have an obvious "I'm done" message.

Reconnect the SPN serial network after all the machines have been flashed.
You shouldn't make any compute node BIOS modifications after this procedure.

15.4 Hardware Configurations That Are Different Than x330*.dd      
If your nodes have hardware configurations that are slightly different from the hardware configurations that x330.dd / x330-6874-1.02.dd were built for (different PCI cards or a different memory size), you'll probably get a POST error message like "Memory Size has Changed" or similar.

You will want to create a custom x330.dd floppy image. Here's how:
  • Flash the nodes as detailed above
  • Reboot the nodes and fix any BIOS issues
  • Create a xcmosutil floppy
    > dd if=/opt/xcat/stage/stage1/xcmosutil.dd of=/dev/fd0 bs=1024 conv=sync; sync
  • Boot the xcmosutil floppy and replace the 'compute' settings:
    DOS> del compute
    DOS> cmosutil /s compute
  • Put the xcmosutil floppy back into the management node and copy the compute file:
    > mcopy a:/compute /tmp/compute
  • Make a copy of the x330.dd that is appropriate for your machine type:
    > cp /opt/xcat/stage/stage1/x330-6874-1.02.dd /opt/xcat/stage/stage1/my_x330.dd
  • Mount the .dd image loopback and replace 'compute':
    > mount -o loop /opt/xcat/stage/stage1/my_x330.dd /mnt
    > rm /mnt/compute
    > cp /tmp/compute /mnt/compute
    > umount /mnt
  • Make a floppy of your new image and flash away:
    > dd if=/opt/xcat/stage/stage1/my_x330.dd of=/dev/fd0 bs=1024 conv=sync; sync
15.5 Serial Port Notes      
In the past, it was necessary to make a serial port hardware change inside of the x330. This is no longer necessary. All the bugs are fixed in the BIOS, so please use ttyS0/COM1/COMa.

Back to TOC

16. Configuring the Terminal Servers
This section describes setting up ELS and ESP terminal servers and conserver. Your cluster will probably have either ELSes or ESPs so you can skip the instructions for the terminal server type that is not a part of your cluster. Terminal servers enable out-of-band administration and access to the compute nodes... e.g. watching a compute node's console remotely before the compute node can be assigned an IP address or after the network config gets messed up, etc.

16.1 Learn About Conserver      
Check out conserver's website.

16.2 Shutdown Conserver      
Before setting up the terminal servers, make sure that the conserver service is stopped:

> /sbin/service conserver stop

16.3 Setup ELS Terminal Servers      
This section describes how to configure the Equinox ELS terminal server. If you're using the ESP terminal servers instead of the ELSes, you'll want to skip this section and skip ahead to 16.4 and follow the ESP instructions.
    16.3.1 conserver.cf Setup      
    Modify /opt/xcat/etc/conserver.cf
    This has already been covered in the configuring xCAT section, but this explains it...

    Each node gets a line like:

    nodeXXX:!tsx:yyyy:&:

    where x = ELS Unit number and yyyy = ELS port + 3000 e.g. node1:!ts1:3001:&: means access node1 via telnet to ts1 on port 3001. 'node1' should be connected to ts1's first serial port.
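    If you don't want to type 32 of these lines by hand, a one-liner like the following generates them for the example cluster (a sketch; adjust the node count and the node-to-terminal-server split to match your wiring, then paste the output into /opt/xcat/etc/conserver.cf):

    > for i in $(seq 1 32); do ts=$(( (i-1)/16 + 1 )); port=$(( (i-1)%16 + 3001 )); printf 'node%02d:!ts%02d:%d:&:\n' $i $ts $port; done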

    16.3.2 Set ELS's IP Address      
    For each ELS unit in your cluster...

    Reset the ELS to factory defaults. You usually have to push the reset button. If the button is green, just push it. If the button is white, you need to hold it down until the link light stops blinking. All the new units have green buttons.

    Connect the DB-9 adapter (Equinox part #210062) to the management node's first serial port (COM1) and connect a serial cable from the ELS to the DB-9 adapter. You can test that the serial connection is good with:

    > cu -l /dev/ttyS0 -s 9600

    Hit Return to connect and you should see:

    Username>

    Unplug the serial cable to have cu hangup and then reconnect it for the next step:
    > setupelsip <ELS_HOSTNAME>

    Test for success:
    > ping <ELS_HOSTNAME>

    16.3.3 Final ELS Setup      
    After assigning the ELS' IP address over the serial link, use

    > setupels <ELS_HOSTNAME>

    to finish the setup for each ELS in your cluster. This sets up the terminal server's serial settings. After the serial settings are set, you can not use setupelsip again, because the serial ports have been set for reverse use. A reset of the unit will have to be performed again, if you need to change the IP address.
16.4 Setup ESP Terminal Servers      
This section describes how to configure the Equinox ESP terminal server. If you're using ELS terminal servers, as most of the examples in this document do, you should skip this section and use the ELS section instead.

    16.4.1 conserver.cf Setup      
    Modify /opt/xcat/etc/conserver.cf

    Each node gets a line like:

    nodeXXX:/dev/ttyQxxyy:9600p:&:

    where xx = ESP Unit number and yy = ESP port (in hex) e.g. ttyQ01e0

    16.4.2 Build ESP Driver      
    Install the RPM (must be 3.03 or later!)

    16.4.3 ESP Startup Configuration      
    Type /opt/xcat/sbin/updaterclocal (you can run this multiple times without creating problems). You need to run this because the ESP RPM puts evil code in the rc.local file that forces the ESP driver to load very last, so any other service that needs the ESP at startup (e.g. conserver) will fail.

    > cp /opt/xcat/rc.d/espx /etc/rc.d/init.d/
    > chkconfig espx on

    16.4.4 ESP Driver Configuration      
    Note the MAC address of each ESP and manually create the /etc/eqnx/esp.conf file. All the esp utility does is create this file; you can do it yourself and save a lot of time. There is no need to set up DHCP for the ESPs this way.

    > service espx stop
    > rmmod espx
    > service espx start
16.5 Start Conserver      
> service conserver start

16.6 Understanding How To Tell if Conserver and Terminal Servers are Working      
To be written. Also bits about cntrl-e-c, etc.

Back to TOC

17. Completing Management Node Setup
Here, we setup the final services necessary for a functioning management node.

17.1 Copy xCAT init Files      
This will enable some services to start at boot time and change the behavior of some existing services.

> cd /opt/xcat/rc.d
> cp atftpd portmap snmptrapd syslog /etc/rc.d/init.d/

There are other init files in /opt/xcat/rc.d that you may wish to use, depending on your installation.

Note: Why portmap?

17.2 Copy the 'post' Files for RedHat 6.2 and RedHat 7.x      
Copy some install files from the xCAT distribution to the post directory that is used during unattended installs:

> cd /opt/xcat/post
> find . | cpio -dump /install/post

17.3 Setup syslog      
Here we enable remote logging...

> cp /opt/xcat/samples/syslog.conf /etc
> touch /var/log/pipemessages
> service syslog restart

On RH7.x based installs, you might want to edit /etc/sysconfig/syslog, adding the -r switch to SYSLOGD_OPTIONS, instead of copying the modified rc.d/syslogd. See the note here (but ignore the watchlogd stuff).
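
For example, a minimal sketch (the exact default options in your /etc/sysconfig/syslog may differ); restart syslog as above after the change:

SYSLOGD_OPTIONS="-r -m 0"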

17.4 Setup snmptrapd      
snmptrapd receives messages from the SPN.

> chkconfig snmptrapd on
> service snmptrapd start

17.5 Setup watchlogd      
watchlogd monitors syslog and sends email alerts to a user-defined list of admins in the event of a hardware error. This functionality (hardware error alerts) requires an SPN.

> chkconfig watchlogd on
> service watchlogd start

You must also set up an alias called 'alerts' in /etc/aliases. This alias is a comma-delimited list of the admins who should receive these hardware alerts as email.
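
For example (hypothetical addresses), add a line like the following to /etc/aliases and then rebuild the alias database:

alerts: admin1@example.com, admin2@example.com

> newaliases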

The above is no longer supported. See this note.

17.6 Install SSH RPMs for RedHat 6.2      
This step isn't required for RedHat 7.x

> cd /opt/xcat/post/rpm62
> rpm -ivh --force --nodeps openssh*.rpm

17.7 Setup NFS and NFS Exports      
Make /etc/exports look something like the following:

/install node*(ro,no_root_squash)
/tftpboot node*(ro,no_root_squash)
/usr/local node*(ro,no_root_squash)
/opt/xcat node*(ro,no_root_squash)
/home node*(rw,no_root_squash)

Turn on NFS:

> chkconfig nfs on
> service nfs start
> exportfs -ar # (to apply the exports)
> exportfs # (to verify)

17.8 Setup NTP      
RedHat 7.x:
> chkconfig ntpd on
> service ntpd start

RedHat 6.2:
> chkconfig xntpd on
> service xntpd start

Note: setupxcat does this now, but you might want to edit the ntpd config file (/etc/ntp.conf) to point to an external time server.
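
For example, a minimal /etc/ntp.conf sketch (ntp.example.com is a placeholder for your site's time server; the driftfile path is the usual RedHat default). Restart ntpd (or xntpd on 6.2) after editing:

server ntp.example.com
driftfile /etc/ntp/drift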

17.9 Setup TFTP      
The following works with RH6.2 and 7.x:

> rpm -ivh /opt/xcat/post/rpmxx/atftp-0.6-1.i386.rpm

> chkconfig atftpd on
> service atftpd start

Test that tftp is working by monitoring /var/log/messages and:

> tftp localhost
> get bogus_file
> quit

You should see tftp try to service the 'get' request in the log output.

Note: The tftpd that comes by default with RedHat 7.2/7.3 might work better than atftpd. atftpd scales well, but it has been known to die mysteriously.

17.10 Initial DHCP Setup      
    17.10.1 Collect the MAC Addresses of Cluster Equipment      
    Place the MAC addresses of cluster equipment that needs to DHCP for an IP address into /opt/xcat/etc/<MANAGEMENT_NET>.tab. See the man page for macnet.tab.

    If you have APC master switches, put their MAC addresses into this file.

    17.10.2 Make the Initial dhcpd.conf Config File      
    > makedhcp --new

    17.10.3 Edit dhcpd.conf      
    Check for anything out of the ordinary
    > vi /etc/dhcpd.conf

    17.10.4 Important DHCP Note      
    You probably don't want DHCP running on the network interface that is connected to the rest of the network. Except in special circumstances, you'll want to remove the network section from dhcpd.conf that corresponds to the external network and then explicitly list the interfaces you want dhcpd to listen on in /etc/rc.d/init.d/dhcpd (leaving out the external interface).

    # BEGIN portion of /etc/rc.d/init.d/dhcpd
    daemon /usr/sbin/dhcpd eth1 eth2
    # END portion of /etc/rc.d/init.d/dhcpd

    On RedHat 7.2, you edit /etc/sysconfig/dhcpd instead of modifying /etc/rc.d/init.d/dhcpd, with something like:

    DHCPDARGS="eth1 eth2"
17.11 Setup NIS      
If you're using the management node as a NIS server for your cluster:
    17.11.1 Verify That xCAT Is Configured for NIS      
    Check stuff in site.tab.

    17.11.2 Setup Management Node as an NIS Server      
    > gennis

17.12 Copy the Custom Kernel to a Place Where Installs Can Access It      
Copy the custom kernel tarball you installed in step 10 to /install/post/kernel/.
Edit the KERNELVER variable where appropriate in /opt/xcat/ksxx/*.kstmp.
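
For example (the tarball name and kernel version here are hypothetical; use whatever you built in step 10):

> cp /tmp/kernel-2.4.9-31custom.tgz /install/post/kernel/
> grep KERNELVER /opt/xcat/ksxx/*.kstmp

Then set KERNELVER in each template to match, e.g. KERNELVER=2.4.9-31custom.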

17.13 Copy the RedHat Install CD(s)      
  • RedHat 7.x
    Do the right thing, substituting 1, 2, or 3 for x for RedHat 7.1, 7.2, or 7.3.
    1. Mount Install CD #1
      > mount /mnt/cdrom

    2. Copy the files from CD #1
      > cd /mnt/cdrom
      > find . | cpio -dump /install/rh7x

    3. Unmount Install CD #1
      > cd / ; umount /mnt/cdrom ; eject

    4. Mount Install CD #2
      > mount /mnt/cdrom

    5. Copy the files from CD #2 (and CD #3 for RH 7.3)
      > cd /mnt/cdrom
      > find RedHat | cpio -dump /install/rh7x

    6. Patch the files
      > /opt/xcat/build/rh7x/applypatch
      Check out this note from the mailing list, if you have storage nodes hooked to mass-storage.

  • RedHat 6.2
    1. Mount the Install CD
      > mount /mnt/cdrom

    2. Copy the files
      > cd /mnt/cdrom
      > find . | cpio -dump /install/rh62

    3. Patch the files
      Accept reverse patch errors (just hit enter)
      The following should be typed as one line
      > patch -p0 -d /install/rh62 < /opt/xcat/build/rh62/ks62.patch

17.14 Generate root's SSH Keypair      
The following command creates an SSH keypair for root with an empty passphrase, sets up root's ssh configuration, and copies the keypair and config to /install/post/.ssh so that all installed nodes will have the same root keypair/config.

> gensshkeys root

Note: This is a good candidate for setupxcat. Also, I've found that I need to modify gensshkeys so that it places "Host myri* node* master*" in .ssh/config (adding master* for the master node).

17.15 Clean Up the Unneeded .tab Files      
In /opt/xcat/etc/, move unneeded .tab files somewhere out of the way, e.g. rtel.tab, tty.tab, etc.
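
For example:

> cd /opt/xcat/etc
> mkdir -p unused
> mv rtel.tab tty.tab unused/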

Back to TOC

18. Collecting MAC Addresses (stage2)
In this section, we collect the MAC addresses of the compute nodes and create entries in dhcpd.conf for them.

18.1 Setup      
> cd /opt/xcat/stage
> ./mkstage

18.2 Prepare to Monitor stage2 Progress      
> wcons -t 8 compute (or a subset like rack01)
> tail -f /var/log/messages

You should always be watching messages; it's a very good way to see what's happening with your cluster, and a great habit to get into.

18.3 Reboot Compute Nodes      
You'll have to do this manually.

When the machines boot, they should PXE boot syslinux, get a dynamic IP address, and then load a Linux kernel and a special RAM disk containing a script that prints each machine's MAC address to the console.

18.4 Observe Output in wcons Windows      
If your terminal servers are working correctly, you should see the machines boot their kernels and then something like this:



A closeup:



18.5 Notes on wcons, xterms and Changing Font Size      
The wcons windows are xterms. When viewing a large number of consoles on the screen at the same time, the xterms come up with the 'unreadable' font size. xterms have a feature that allows you to change the size of the font very easily. This is very useful when you have a screen of 'unreadable' consoles and you want to zoom in on one to view the output in greater detail.

To do this, move the mouse over the text portion of the xterm in question, hold down the control key, and press the right mouse button. You'll see a menu like the following:



Move the mouse down to select a larger font and then release the mouse button as shown:



Using this xterm feature, you can switch to a large font for detailed viewing and back to the 'unreadable' font to view all the consoles at once.

18.6 Notes on wcons, conserver and 'Ctrl-E .'      
This is a placeholder to remind me to document the 'Ctrl-E .' escape sequence that conserver uses to provide a lot of terminal functionality.

18.7 Collect the MACs      
Once you see that all the compute nodes are spitting their MAC addresses out of their serial consoles...
> getmacs compute

18.8 Kill the wcons Windows      
> wkill

18.9 Populate dhcpd.conf      
> makedhcp compute

At this point dhcpd will be running, so you might again want to make certain that it is only listening on the interfaces you want it to.
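
One quick way to double-check (assuming the dhcpd setup from 17.10.4) is to restart dhcpd and look for its "Listening on ..." lines in syslog; you should see one line per listed interface and none for the external interface:

> service dhcpd restart
> grep "Listening on" /var/log/messages
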
18.10 Notes on Collecting MAC addresses without a terminal server      
Configure cisco3500.tab with entries like the following:

node01   ethernet01,1
node02   ethernet01,2
node03   ethernet01,3
node04   ethernet01,4
...

Make nodehm.tab have entries like:
nodexx  mp,mp,mp,mp,mp,conserver,mp,mp,rcons,cisco3500,eepro100,vnc


Make sure the switch has a hostname and that DNS resolves it.
Verify that the nodes plugged into the switch ports match what you put
into cisco3500.tab, i.e. node1 on port 1, node2 on port 2, and so on.

Make sure you can ping the switch, telnet to it, and log in. Make sure the
password you set on the switch is the same as the one in passwd.tab. Put the
nodes in stage2, power them on, and run getmacs as usual. What the getmacs
command does is issue 'show mac-address-table' on the switch and grab the MACs
from it.
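
If you want to check by hand, you can telnet to the switch and run the same command getmacs uses (a rough sketch; you may need to enter enable mode first, depending on how the switch is configured):

> telnet ethernet01
ethernet01> show mac-address-table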

Back to TOC

19. Configuring MPN/ASMA/RSA (stage3)
Stage3 is a mostly automated procedure for configuring the Management Processor Network (MPN) (formerly known as the Service Processor Network) on IBM xSeries machines. This section describes how to perform stage3 with ASMA or RSA adapters. If your cluster doesn't have a Management Processor Network, you can skip this section.

19.1 Read the xCAT xSeries Management Processor HOWTO      
Learn more about the Service Processor Network at: http://xcat.org/docs/managementprocessor-HOWTO.html

19.2 Perform the manual steps      
  • For ASMA adapters:
    1. Download the MPN/ASMA/RSA config utility:
      http://www.pc.ibm.com/qtechinfo/MIGR-4QHMCT.html
      v1.06 at the time of this writing

    2. Create the ASMA config floppy:
      Using DOS (be certain that you use 'command' and not 'cmd' under NT or Win2k), run the .exe and follow the prompts.

    3. Configure ASMA cards:
      For each node that contains an ASMA card, you need to take the following steps, using the MPN/ASMA/RSA setup floppy disk. Once you have booted this floppy disk, select Configuration Settings -> Systems Management Adapter and apply the following changes under Network Settings:
      • Enable network interface
      • Set local IP address for ASMA network interface
      • Set subnet mask
      • Set gateway

  • For RSA adapters:
    1. Record the adapters' MAC addresses:
      Place the MAC addresses in <management-network>.tab (172.17.0.0.tab in our example).

    2. Modify dhcpd.conf and boot the RSA adapters:
      > gendhcp -new
      This puts static entries into dhcpd.conf from <management-network>.tab. Remove the power from the RSA adapters and then reapply it. The adapters should boot with the correct IP addresses.
19.3 Check That the Management Processors Are on the Network      
> pping mpa

This does a parallel ping to all the nodes that are defined as being a member of the group 'asma' in nodelist.tab. If you're using RSA adapters, you may have a group 'rsa' instead of 'asma'. If any of the adapters show up as 'noping', you'll have to investigate why they are not connecting to the network before you continue.

19.4 Program the Management Processors      
> mpasetup asma

This command sets up things like alerts, SNMP information, etc. on each management adapter in the 'asma' group over IP. Again, you might want to use an 'rsa' group if you have RSA adapters.

19.5 Verify the Management Processors Were Programmed Correctly      
> mpacheck asma

At this point all of the Management Processors should be correctly programmed. Next, we'll configure the Service Processor devices that are in each compute node...

19.6 Nodeset      
The following command makes the nodes PXE boot the stage3 image. (it alters the files in /tftpboot/pxelinux.cfg/)
> nodeset compute stage3

19.7 Prepare to Monitor the stage3 Progress      
> wcons -t 8 compute (or a subset like rack01)
> tail -f /var/log/messages (you should always be watching messages)

19.8 Reboot Compute Nodes      
You'll have to do this manually.

19.9 Watch the Show      
You should see the SPN procedure move forward smoothly in all the compute nodes' wcons windows.

After a successful stage3 procedure, your wcons windows should look something like:

wcons windows after stage3

Closeup:

closeup wcons window after stage3

19.10 Test Out Some SPN Commands      
Read the man pages for rvitals, rinv, and rpower, etc. and then try out some of these commands on your cluster.

Example:
> rvitals compute ambtemp

19.11 mpncheck      
> mpncheck compute

Back to TOC

20. Configuring APC Master Switches
An APC Master Switch or similar remotely controlled Power Distribution Unit can be very valuable in easing administration headaches that are sometimes involved in maintaining clusters.

In our example cluster, the APC Master Switch is connected to the management ethernet VLAN and powers the MPAs, terminal servers, and Myrinet switch. After proper configuration, this allows the administrator to reset these devices by cycling their power from their desktop or another remote location with xCAT's rpower command-line utility. This is useful for debugging, troubleshooting, or just dealing with a flaky component, and becomes a real requirement when you only have remote access to a cluster... Who wants to get out of their chair and walk into the server room anyway?

APC's Installation guide can be found here.

20.1 Reset the APC MasterSwitch      
To be written

20.2 Connect to the Switch via a Serial Connection      
To be written

20.3 Configure the Switch's IP Settings      
To be written

20.4 Test Switch's IP Connectivity      
To be written

20.5 Figure out which Devices are connected to what Switch Ports      
To be written

20.6 Verify/Setup apc.tab and nodehm.tab      
To be written

20.7 Test with rpower      
To be written


Back to TOC

21. Installing Compute Nodes
This section describes installing RedHat on the compute nodes via unattended kickstart installs.

21.1 Edit/Generate Kickstart Scripts      
Modify the kickstart template file if needed, substituting your version of RedHat for xx (/opt/xcat/ksxx/computexx.kstmp).

Generate real kickstart scripts from the templates:
> cd /opt/xcat/ksxx; ./mkks

21.2 Nodeset      
The following command makes the nodes PXE boot the RedHat kickstart image. (it alters the files in /tftpboot/pxelinux.cfg/)
> nodeset compute install

21.3 Prepare to Monitor the Installation Progress      
> wcons -t 8 compute (or a subset like rack01)
> tail -f /var/log/messages (you should always be watching messages)

21.4 Reboot the Compute Nodes      
You might want to do only a subset of 'compute'
> rpower compute boot

21.5 A Better Way to Install      
Instead of the three steps nodeset, wcons, and rpower, it's much easier to use the single command:

> winstall -t 8 compute

This command accomplishes the above three commands in one step.

When going through the install procedure, you'll probably want to install onto only a single machine until you're fairly certain that the install is working well, then do installs on the whole 'compute' group.

When installing with wcons, you should see something like the following:

wcons windows during compute node installation

Closeup:

wcons windows during compute node installation

21.6 Installs with No Terminal Servers      
To be written.

21.7 Verify that the Compute Nodes Installed Correctly      
To be written.

21.8 Update the SSH Global Known Hosts File      
> makesshgkh compute (or, again, a subset of 'compute')

21.9 Test SSH and psh      
> psh compute date | sort
The output here will be a good way to see if SSH/gkh is setup correctly on all of the compute nodes (a requirement for most cluster tasks). If a node doesn't appear here correctly, you must go back and troubleshoot the individual node, make certain the install happens correctly, rerun makesshgkh, and finally test again with psh. You really must get psh working correctly before continuing.
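
If everything is working, the output should look something like the following (hypothetical; psh prefixes each line with the node name, and sort groups them):

node01: Mon Sep 23 10:15:01 MDT 2002
node02: Mon Sep 23 10:15:01 MDT 2002
node03: Mon Sep 23 10:15:02 MDT 2002
...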

Back to TOC

22. Installing/Configuring Myrinet Software
The following section only applies to clusters that use Myrinet. It gives an example of creating a GM rpm on RedHat 6.2 and installing this driver on the compute nodes. With RedHat 7.1, the procedure will be slightly different (the kernel version will be different). You may wish to do the rpm building part of this section before you do the previous step to avoid having to install the compute nodes multiple times.

22.1 Make Certain xCAT Configuration Is Ready for Myrinet      
Set up a separate subnet for the Myrinet interfaces, with a host name for each compute node's interface (node01-myri0, node02-myri0, etc.) in /etc/hosts, DNS, etc.
Verify forward and reverse name resolution for these host names.
Add all the hosts that have Myrinet cards to the 'myri' group in nodelist.tab.
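
For example, hypothetical /etc/hosts entries, assuming a dedicated 172.18.0.0/16 subnet is used for the Myrinet interfaces:

172.18.1.1    node01-myri0
172.18.1.2    node02-myri0
172.18.1.3    node03-myri0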

22.2 Get the Latest GM Source from Myricom      
ftp://ftp.myri.com/pub/GM/gm-1.5.1_Linux.tar.gz (contact Myricom for a username/password)

22.3 Copy the Source to a Build Location      
> cp gm-1.5.1_Linux.tar.gz /tmp

22.4 Build the GM rpm      
> cd /tmp ; /opt/xcat/build/gm/gmmaker 1.5.1_Linux
The result should be a GM rpm in /usr/src/redhat/RPMS/i686

If you're using a RedHat kernel, you'll need to prep the kernel tree before you do the above step (clearly some of the details change if you're using a different kernel than the one shown below)...

> cd /usr/src
> ln -s 2.4.9-31 linux
> cd /usr/src/linux
> make mrproper
> cp kernel-2.4.9-i686-smp.config ../.config
> make menuconfig (change PII to PIII), save and exit
> make dep
> make modules (let it run for about 30 seconds, then Ctrl-C); this usually sets up a few links that are required.

22.5 Copy the rpm to a Place That Is Accessible by the Kickstart Install Process      
> cp /usr/src/redhat/RPMS/i686/gm-1.5.1_Linux-2.2.19-4hpc.i686.rpm /install/post/kernel

22.6 Make Kickstart Files Aware of This GM Version      
Edit your .kstmp, setting the GMVER variable to the appropriate value... ( GMVER=1.5.1_Linux in our example )

22.7 Regenerate Kickstart Files      
> cd /opt/xcat/ksxx
> ./mkks

22.8 Do New Compute Node Installs      
Refer to the previous section.

22.9 Look at the Pretty Lights      
Check the Myrinet switch. If the compute nodes come up with the GM driver loaded and all cabling is correct, you should see lights on all the connected ports. Do you see the light?

22.10 Generate Myrinet Routing Maps      
> makegmroutes myri

22.11 Verify Connectivity Between All Myrinet Hosts      
> psh myri /opt/xcat/sbin/gmroutecheck myri
No output means success.

22.12 Install the GM rpm on the Management Node      
> rpm -ivh /usr/src/redhat/RPMS/i686/gm-1.5.1_Linux-2.2.19-4hpc.i686.rpm
If the management node doesn't have a Myrinet card, you'll want to keep GM from loading at boot...
> chkconfig --level 0123456 gm off

Back to TOC

23. Installing Portland Group Compilers
This section is out of date. You will benefit from reading the FAQ at: http://www.pgroup.com/faq.htm

Connect to the Portland Group's ftp server: ftp://ftp.pgroup.com/.

  • If a temporary installation:

    Get x86/linux86-HPF-CC.tar.gz
    Make a temporary directory and move the .tar file there.
    temp> tar -xzvf linux86-HPF-CC.tar.gz
    Read the INSTALL file.
    ./install
    Enter accept
    Enter 5 (PGI workstation or PGI server)
    Install Directory? /usr/local/pgi
    Create eval Licence? Yes
    Would you like the installation to be read only? Yes

    There are probably errors in this documentation... check and correct.
  • For a permanent IBM customer install:

    Get ftp.pgroup.com/x86_ibm/pgi-3.2.i386.rpm
    > rpm -ivh pgi-3.2.i386.rpm
    > cd /usr/src/redhat/BUILD/pgi-3.2
    Edit Install, modifying INSTALL_DIR to use /usr/local/pgi.
    > ./install
    accept
    Create a file with all of the nodes in your cluster when asked by the install script.
    > lmutil lmhostid (If you have a license key)
    Now go to https://www.pgroup.com/License .
    Enter Username: (Provided by PGI)
    Enter Password:
    Click Generate License Key
    Scroll down to the bottom and click the <Issue Main Keys> button. In the FLEXlm hostid field, enter the characters from the lmutil command above.

    Copy the output and replace the hostname in the generated license.
    Look at the path names in this file. Change /usr/pgi/... to /usr/local/pgi/...
    Place the contents in $PGI/license.dat

    Edit /etc/profile, adding (with PGI set to your install directory):
    export PGI=/usr/local/pgi
    export LM_LICENSE_FILE=${PGI}/license.dat


    Edit /usr/local/pgi/linux86/bin/lmgrd.rc, modifying the PGI environment variable to look like this:
    PGI=${PGI:-/usr/local/pgi}
    Start the license manager and have it start at boot:
    > cp /usr/local/pgi/linux86/bin/lmgrd.rc /etc/rc.d/init.d
    > chkconfig --add lmgrd.rc
    > chkconfig --level 345 lmgrd.rc on
    > service lmgrd.rc start

Back to TOC

24. Installing MPICH MPI Libraries
MPI is a standard library used for message passing in parallel applications. This section documents how to install the MPICH MPI implementations that are used over ethernet and Myrinet.

  • MPICH
    Use MPICH if you want to run MPI over ethernet. Skip this if you only want to run MPICH over Myrinet.

    1. Learn about MPICH:
      MPICH's homepage is at: http://www-unix.mcs.anl.gov/mpi/mpich/

    2. Download MPICH:
      ftp://ftp.mcs.anl.gov/pub/mpi/mpich.tar.gz

    3. Build MPICH:
      > cd /opt/xcat/build/mpi
      > cp /where/ever/you/put/it/mpich.tar.gz .
      > mv mpich.tar.gz mpich-1.2.3.tar.gz (or whatever the current version actually is)
      > ./mpimaker (This will show you the possible arguments. You may want to use different ones.)
      > ./mpimaker 1.2.3 smp gnu ssh
      You can 'tail -f mpich-1.2.3/make.log' to view the build's progress.
      When done, you should have stuff in /usr/local/mpich/1.2.3/ip/smp/gnu/ssh

    4. Adjust environment:
      Add the following to your ~/.bashrc. You can put it in /etc/profile if you are only using this MPI lib:
      export MPICH="/usr/local/mpich/1.2.3/ip/smp/gnu/ssh"
      export MPICH_PATH="${MPICH}/bin"
      export MPICH_LIB="${MPICH}/lib"
      export PATH="${MPICH_PATH}:${PATH}"
      export LD_LIBRARY_PATH="${MPICH_LIB}:${LD_LIBRARY_PATH}"

    5. Test the environment:
      After re-sourcing the environment changes that you've made, it's a good idea to validate that everything is correct. A simple, but not complete, way to do this is:

      > which mpicc

      If you're setup for MPICH like in the above example, the output of this command should be:

      /usr/local/mpich/1.2.3/ip/smp/gnu/ssh/bin/mpicc


  • MPICH-GM
    MPICH-GM is a special version of MPICH that communicates over Myrinet's low-level GM layer.
    Use MPICH-GM if you want to run MPI over Myrinet. Skip this if you don't have Myrinet.

    1. Learn about MPICH-GM:
      Some information is available in Myricom's Software FAQ

    2. Download MPICH-GM:
      ftp://ftp.myri.com/pub/MPICH-GM/mpich-1.2.1..7b.tar.gz (contact Myricom for a username / password)

    3. Build MPICH-GM:
      > cd /tmp
      > cp /where/ever/you/put/it/mpich-1.2.1..7b.tar.gz .
      > /opt/xcat/build/mpi/mpimaker (This will show you the possible arguments. You may want to use different ones.)
      > /opt/xcat/build/mpi/mpimaker 1.2.1..7b:1.5.1_Linux-2.4.18-5smp smp gnu ssh
      You can 'tail -f mpich-1.2.1..7b/make.log' to view the build's progress.
      When done, you should have stuff in /usr/local/mpich/1.2.1..7b/gm-1.5.1_Linux-2.4.18-5smp/smp/gnu/ssh

    4. Adjust environment:
      Add the following to your ~/.bashrc. You can put it in /etc/profile if you are only using this MPI lib:
      export MPICH="/usr/local/mpich/1.2.1..7b/gm-1.5.1_Linux-2.4.18-5smp/smp/gnu/ssh"
      export MPICH_PATH="${MPICH}/bin"
      export MPICH_LIB="${MPICH}/lib"
      export PATH="${MPICH_PATH}:${PATH}"
      export LD_LIBRARY_PATH="${MPICH_LIB}:${LD_LIBRARY_PATH}"

    5. Test the environment:
      After re-sourcing the environment changes that you've made, it's a good idea to validate that everything is correct. A simple, but not complete, way to do this is:

      > which mpicc

      If you're setup for MPICH like in the above example, the output of this command should be:

      /usr/local/mpich/1.2.1..7b/gm-1.5.1_Linux-2.4.18-5smp/smp/gnu/ssh/bin/mpicc

Back to TOC

25. Installing LAM MPI Libraries
LAM is a lesser-used alternative to MPICH for message passing with MPI. It is reportedly faster than MPICH over TCP/IP. The stable version of LAM runs only over TCP/IP (no GM). This section documents how to install LAM. Skip this section if you don't need to use LAM.
  1. Learn about LAM:
    LAM's homepage is at http://www.lam-mpi.org/

  2. Download LAM:
    http://www.lam-mpi.org/download/files/lam-6.5.6.tar.gz

  3. Build LAM:
    (You may wish to use /opt/xcat/build/lam/lammaker instead of these instructions. Run it without any arguments for usage.)
    > cd /usr/src
    > cp /where/ever/you/put/it/lam-6.5.6.tar.gz .
    > tar -xzvf lam-6.5.6.tar.gz
    > cd lam-6.5.6
    For the Portland Group compilers:
    > ./configure --prefix=/usr/local/lam-6.5.6/ip/pgi/ssh --with-rpi=usysv \
    --with-rsh='ssh -x' --with-fc=pgf90 \
    --with-cc=pgcc --with-cxx=pgCC
    > make
    > make install
    For the GNU compilers:
    > ./configure --prefix=/usr/local/lam-6.5.6/ip/gnu/ssh --with-rpi=usysv \
    --with-rsh='ssh -x'
    > make
    > make install

  4. Adjust environment:
    Add the following to your ~/.bashrc. You can put it in /etc/profile if you are only using this MPI lib:
    export LAM="/usr/local/lam-6.5.6/ip/gnu/ssh"
    export LAM_PATH="${LAM}/bin"
    export LAM_LIB="${LAM}/lib"
    export PATH="${LAM_PATH}:${PATH}"
    export LD_LIBRARY_PATH="${LAM_LIB}:${LD_LIBRARY_PATH}"

  5. Test the environment:
    After re-sourcing the environment changes that you've made, it's a good idea to validate that everything is correct. A simple, but not complete, way to do this is:

    > which mpicc

    If you're setup for LAM like in the gcc example from above, the output of this command should be:

    /usr/local/lam-6.5.6/ip/gnu/ssh/bin/mpicc

Back to TOC

26. Installing PBS Resource Manager
PBS is a free tool that enables you to run batch jobs on a cluster. Here's how it can be setup quickly to work with our example:

26.1 Learn About PBS      
Check out the PBS homepage at http://www.openpbs.org/. You need to go through some rigmarole to get a username and password. After you get a username and password, download and read the manual.

26.2 Download the PBS Source      
Make certain you have a username and password. You can't access the source without them.
Download the source from: http://www.openpbs.org/UserArea/Download/OpenPBS_2_3_12.tar.gz

26.3 Build PBS      
> cd /tmp
> cp /where/ever/you/put/it/OpenPBS_2_3_16.tar.gz /tmp
> /opt/xcat/build/pbs/pbsmaker OpenPBS_2_3_16.tar.gz scp
(substitute whichever OpenPBS version you actually downloaded for 2_3_16)

You should now have stuff in /usr/local/pbs. Configuration and environment setup is completed in the genpbs step later on.

Back to TOC

27. Installing Maui Scheduler
Maui is an OpenSource scheduler that offers advanced scheduling algorithms and integrates with PBS. Here's how it can be setup quickly to work with our example:

27.1 Learn About Maui      
The homepage is at: http://www.supercluster.org/.
It would be a good idea to read the docs.

27.2 Download Maui Source      
Download the source from: http://supercluster.org/downloads/maui/maui-3.0.7p2.tar.gz

27.3 Build Maui      
> cd /where/ever/you/put/the/tarball
> tar -xzvf maui-3.0.7p2.tar.gz
> cd maui-3.0.7
> ./configure
     Maui Installation Directory? /usr/local/maui
     Maui Home Directory? /usr/local/maui
     Compiler? gcc
     Checksum SEED? 123
     Correct? Y
     Do you want to use PBS? [Y|N] default (Y)
     PBS Target Directory: /usr/local/pbs
> make
> make install

You should now have stuff in /usr/local/maui. Environment setup and configuration will happen in the genpbs section later on.

> mkdir /var/log/maui


Back to TOC

28. Deploying PBS on the Cluster
If you're running PBS and Maui and you've installed them in the above two steps, you'll want to follow the instructions in this section to finish their setup and deploy them on the compute nodes.

28.1 Deploy      
> genpbs compute
(Where 'compute' is a nodelist.tab group that includes all your compute nodes)

28.2 Verify      
> showq
An example of part of the expected output on a 32 node, dual CPU cluster follows:
0 Active Jobs       0 of  64 Processors Active (0.00%)
                    0 of  32 Nodes Active      (0.00%)
If you don't see this kind of output, something is wrong with your PBS setup. You should fix it before you continue.

Back to TOC

29. Adding Users and Setting Up User Environment
There are a number of things to set up for each user before they can run jobs within the framework of the example architecture. Some of the things covered in this section have been covered previously. Use your judgment on if and where to apply the following:

29.1 Setup MPI and Other Environment in /etc/skel/      
If you don't set the MPI environment globally in /etc/profile, /etc/csh.login, or /etc/profile.d/mpi.sh, and you haven't done this already, you'll need to add something like the following to /etc/skel/.bashrc (and the csh equivalent to /etc/skel/.cshrc), so that the addclusteruser command below will automatically pick up this environment. This example is for MPICH-GM; use whatever MPI library your users plan on using:

export MPICH="/usr/local/mpich/1.2.1..7/gm-1.5_Linux-2.2.19-4hpc/smp/gnu/ssh"
export MPICH_PATH="${MPICH}/bin"
export MPICH_LIB="${MPICH}/lib"
export PATH="${MPICH_PATH}:${PATH}"
export LD_LIBRARY_PATH="${MPICH_LIB}:${LD_LIBRARY_PATH}"
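
A rough csh equivalent for /etc/skel/.cshrc (same hypothetical paths as above; this sketch assumes PATH and LD_LIBRARY_PATH are already set in the environment):

setenv MPICH "/usr/local/mpich/1.2.1..7/gm-1.5_Linux-2.2.19-4hpc/smp/gnu/ssh"
setenv MPICH_PATH "${MPICH}/bin"
setenv MPICH_LIB "${MPICH}/lib"
setenv PATH "${MPICH_PATH}:${PATH}"
setenv LD_LIBRARY_PATH "${MPICH_LIB}:${LD_LIBRARY_PATH}"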

29.2 Add User      
> addclusteruser username

This command automates a lot of user setup. More on what it does after I get a chance to play with it a little. I wish there was a man page, but the source tells all.


Back to TOC

30. Verifying Cluster Operation With Pseudo Jobs and PBS
At this point, the cluster is almost ready to go. This section outlines a number of tests that will show that the infrastructure is in place for jobs to be successfully run on the cluster:

30.1 Check that the Compute Nodes Know About Your Users      
> ssh node01 (as root)
> su - ibm (or whatever user you're testing)
> touch ~/test_of_read_write_home_dir (to verify the home directory is writable)

If you can't su to this user, something is wrong. If you're using NIS, there's an NIS problem... 'ypwhich; ypcat passwd' to test. If you're not using NIS, you probably haven't added this user to the compute node's /etc/passwd, etc. If you can't touch a file in the user's home directory, you don't have a writable home directory... fix it.

Make the above test work before continuing.

30.2 Test ssh to Compute Node as Regular User      
> su - ibm (or whatever user you're testing (on the management node))
> ssh node01

If you're using access control, like we are in the example xCAT configuration (ACCESS = Y in noderes.tab for node01), you should get a permission denied error. This is the correct behavior: users can only ssh to a node after PBS has allocated it to them. If you're not using access control, you should be able to ssh to node01 as a regular user.

30.3 Test Interactive PBS Job to a Single Node      
This will validate that PBS is working and that your user's ssh key-pair is setup correctly.

Request a job:
> su - ibm (or whatever user you're testing (on the management node))
> qsub -l nodes=1,walltime=10:00 -I

This submits an interactive job to PBS, asking for 1 node for 10 minutes. After a bit, PBS should put you on one of the compute nodes. This should look something like:
qsub: waiting for job 1.man-c to start
qsub: job 1.man-c ready

----------------------------------------
Begin PBS Prologue Tue Oct 30 16:44:56 MST 2001
Job ID:		1.man-c
Username:	ibm
Group:		ibm
Nodes:		node32
End PBS Prologue Tue Oct 30 16:44:56 MST 2001
----------------------------------------
[ibm@node32 ibm]$
If it doesn't, your PBS setup is broken. Fix it before continuing. If you get a permission denied type of error, your user's ssh key-pair or ssh configuration isn't setup correctly. Fix it before continuing.

When you can successfully get to the compute node via PBS as a regular user, try to ssh back to the head node. You should be able to without supplying a password. If you can't, something's broken. Fix it before continuing.

30.4 Test Interactive PBS Job to Multiple Nodes      
This will validate that PBS is working, that your user's ssh key-pair is setup correctly, and that jobs will work across more than one compute node.

Request a job:
> su - ibm (or whatever user you're testing (on the management node))
> qsub -l nodes=2,walltime=10:00 -I

This submits an interactive job to PBS, asking for 2 nodes for 10 minutes. After a bit, PBS should put you on one of the compute nodes and give you a list of the compute nodes that you have access to. This should look something like:
qsub: waiting for job 2.man-c to start
qsub: job 2.man-c ready

----------------------------------------
Begin PBS Prologue Tue Oct 30 16:47:36 MST 2001
Job ID:		2.man-c
Username:	ibm
Group:		ibm
Nodes:		node31 node32
End PBS Prologue Tue Oct 30 16:47:36 MST 2001
----------------------------------------
[ibm@node32 ibm]$ 
Test using ssh between the compute nodes you have access to and the headnode. You should be able to ssh to and from all these nodes without supplying a password. If you can't, something's broken. Fix it before continuing.

Back to TOC

31. Running a Simple MPI Job Interactively via PBS
This section outlines how to build and run a simple MPI job:
  1. Be ready:
    Make certain that everything from section 24 works.

  2. Build a simple test MPI program:
    Here we build the simple MPI program cpi with a few different MPI libraries. cpi is a C program that calculates the value of pi using numerical integration. You only need to build with the libs that you are interested in running. cpi is not a particularly good test in terms of completeness or performance, but it does serve as a good first step for validating MPI and parallel operation.

    Create a place to build:
    > su - ibm (or whatever user you're using to test)
    > mkdir ~/cpi

    Copy the cpi source to this directory:
    > cp /where/you/put/it/mpich-1.2.2.3.tar.gz ~/cpi/
    > cd ~/cpi
    > tar -xzvf mpich-1.2.2.3.tar.gz
    > cp mpich-1.2.2.3/examples/basic/cpi.c ~/cpi

    Build the program:
    These are just examples. Exact path names, etc. may vary with your setup.

    • MPICH-IP
      1. Verify environment:
        > su - ibm (or whatever user you're using to test)
        Make certain that your environment (section 23) is setup correctly for MPICH-IP. i.e. 'which mpicc' should result in /usr/local/mpich/1.2.2.3/ip/smp/gnu/ssh/bin/mpicc, etc.
      2. Build:
        > cd ~/cpi
        > mkdir mpich-ip; cd mpich-ip
        > cp ~/cpi/cpi.c ~/cpi/mpich-ip
        > mpicc -o cpi cpi.c

    • MPICH-GM
      1. Verify environment:
        > su - ibm (or whatever user you're using to test)
        Make certain that your environment (section 23) is setup correctly for MPICH-GM. i.e. 'which mpicc' should result in something like /usr/local/mpich/1.2.1..7/gm-1.5_Linux-2.2.19-4hpc/smp/gnu/ssh/bin/mpicc, etc.
      2. Build:
        > cd ~/cpi
        > mkdir mpich-gm; cd mpich-gm
        > cp ~/cpi/cpi.c ~/cpi/mpich-gm
        > mpicc -o cpi cpi.c

    • LAM-IP
      1. Verify environment:
        > su - ibm (or whatever user you're using to test)
        Make certain that your environment (section 23) is setup correctly for LAM-IP. i.e. 'which mpicc' should result in /usr/local/lam/blah-blah/mpicc, etc.
      2. Build:
        > cd ~/cpi
        > mkdir lam-ip; cd lam-ip
        > cp ~/cpi/cpi.c ~/cpi/lam-ip
        > mpicc -o cpi cpi.c

  3. Run simple MPI jobs interactively:
    These are just examples. Exact path names, etc. may vary with your setup.

    • MPICH-IP
      1. Verify environment:
        > su - ibm (or whatever user you're using to test)
        Make certain that your environment (section 23) is setup correctly for MPICH-IP. i.e. 'which mpicc' should result in /usr/local/mpich/1.2.2.3/ip/smp/gnu/ssh/bin/mpicc, etc.
      2. Request resources and run the job:
        Your session should look something like the following...
        [ibm@man-c ibm]$ cd ~/cpi/mpich-ip
        [ibm@man-c mpich-ip]$ qsub -l nodes=4,walltime=10:00:00 -I
        qsub: waiting for job 3.man-c to start
        qsub: job 3.man-c ready
        
        ----------------------------------------
        Begin PBS Prologue Tue Oct 30 19:35:17 MST 2001
        Job ID:         3.man-c
        Username:       ibm
        Group:          ibm
        Nodes:          node32 node31 node30 node29
        End PBS Prologue Tue Oct 30 19:35:17 MST 2001
        ----------------------------------------
        [ibm@node32 ibm]$ cd $PBS_O_WORKDIR
        [ibm@node32 mpich-ip]$ which mpirun
        /usr/local/mpich/1.2.2.3/ip/smp/gnu/ssh/bin/mpirun
        [ibm@node32 mpich-ip]$ NP=$(wc -l $PBS_NODEFILE | awk '{print $1}')
        [ibm@node32 mpich-ip]$ mpirun -machinefile $PBS_NODEFILE  -np $NP cpi
        Process 0 of 4 on node1
        pi is approximately 3.1415926544231239, Error is 0.0000000008333307
        wall clock time = 0.002015
        Process 3 of 4 on node4
        Process 1 of 4 on node2
        Process 2 of 4 on node3
        [ibm@node32 mpich-ip]$ logout
        qsub: job 3.man-c completed
        [ibm@man-c mpich-ip]$
        
    • MPICH-GM
      1. Verify environment:
        > su - ibm (or whatever user you're using to test)
        Make certain that your environment (section 23) is setup correctly for MPICH-GM. i.e. 'which mpicc' should result in something like /usr/local/mpich/1.2.1..7/gm-1.5_Linux-2.2.19-4hpc/smp/gnu/ssh/bin/mpicc, etc.
      2. Request resources and run the job:
        Your session should look something like the following...
        [ibm@man-c ibm]$ cd ~/cpi/mpich-gm
        [ibm@man-c mpich-gm]$ qsub -l nodes=4:ppn=2,walltime=10:00:00 -I
        qsub: waiting for job 4.man-c to start
        qsub: job 4.man-c ready
        
        ----------------------------------------
        Begin PBS Prologue Tue Oct 30 17:59:06 MST 2001
        Job ID:         3.man-c
        Username:       ibm
        Group:          ibm
        Nodes:          node32 node31 node30 node29
        End PBS Prologue Tue Oct 30 17:59:06 MST 2001
        ----------------------------------------
        [matt@node1 matt]$ test -d ~/.gmpi || mkdir ~/.gmpi
        [matt@node1 matt]$ GMCONF=~/.gmpi/conf.$PBS_JOBID
        [matt@node1 matt]$ /usr/local/xcat/bin/pbsnodefile2gmconf $PBS_NODEFILE \
        >$GMCONF
        [matt@node1 matt]$ NP=$(head -1 $GMCONF)
        [matt@node1 matt]$ cd $PBS_O_WORKDIR
        [matt@node1 mpich-gm]$ RECVMODE="polling"
        [matt@node1 mpich-gm]$ mpirun.ch_gm --gm-f $GMCONF --gm-recv $RECVMODE \
        --gm-use-shmem -np $NP PBS_JOBID=$PBS_JOBID cpi
        
        Process 4 of 8 on node32
        Process 1 of 8 on node31
        Process 6 of 8 on node30
        Process 7 of 8 on node29
        Process 5 of 8 on node31
        Process 2 of 8 on node30
        Process 0 of 8 on node32
        pi is approximately 3.1415926544231247, Error is 0.0000000008333316
        wall clock time = 0.000805
        Process 3 of 8 on node29
        [matt@node1 mpich-gm]$ logout
        qsub: job 4.man-c completed
        [matt@man-c mpich-gm]
        
    • LAM-IP
      1. Verify environment:
        > su - ibm (or whatever user you're using to test)
        Make certain that your environment (section 23) is setup correctly for LAM-IP. i.e. 'which mpicc' should result in /usr/local/lam/6.5.4/ip/gnu/ssh/bin/mpicc, etc.
      2. Request resources and run the job:
        Your session should look something like the following...
        [ibm@man-c ibm]$ cd ~/cpi/lam-ip
        [ibm@man-c lam-ip]$ qsub -l nodes=4:ppn=2,walltime=10:00:00 -I
        qsub: waiting for job 4.man-c to start
        qsub: job 4.neptune ready
        
        ----------------------------------------
        Begin PBS Prologue Tue Oct 30 20:19:06 MST 2001
        Job ID:         4.man-c
        Username:       ibm
        Group:          ibm
        Nodes:          node32 node31 node30 node29
        End PBS Prologue Tue Oct 30 20:19:06 MST 2001
        ----------------------------------------
        [ibm@node32 ibm]$ which lamboot
        /usr/local/lam/6.5.4/ip/gnu/ssh/bin/lamboot
        [ibm@node32 ibm]$ lamboot -v $PBS_NODEFILE
        
        LAM 6.5.4/MPI 2 C++/ROMIO - University of Notre Dame
        
        Executing hboot on n0 (node32 - 2 CPUs)...
        Executing hboot on n1 (node2 - 2 CPUs)...
        Executing hboot on n2 (node3 - 2 CPUs)...
        Executing hboot on n3 (node4 - 2 CPUs)...
        [ibm@node32 ibm]$ which mpirun
        /usr/local/lam/6.5.4/ip/gnu/ssh/bin/mpirun
        [ibm@node32 ibm]$ cd $PBS_O_WORKDIR
        [ibm@node32 lam-ip]$ mpirun C cpi
        Process 0 of 8 on node32
        Process 1 of 8 on node32
        Process 2 of 8 on node2
        Process 3 of 8 on node2
        Process 4 of 8 on node3
        Process 6 of 8 on node4
        Process 7 of 8 on node4
        pi is approximately 3.1415926544231247, Error is 0.0000000008333316
        wall clock time = 0.000807
        Process 5 of 8 on node3
        [ibm@node32 lam-ip]$ lamclean
        [ibm@node32 lam-ip]$ logout
        qsub: job 4.man-c completed
        [ibm@man-c lam-ip]$
        


Back to TOC

32. Running a Simple MPI Job in Batch Mode via PBS
Now we're ready to run cpi in batch mode via PBS:
  1. Grok /opt/xcat/samples/pbs/ in fullness:
    Before running batch jobs via PBS is a good time to scan the sample PBS batch files that ship with the xCAT distribution in /opt/xcat/samples/pbs. It is also a good idea to examine the PBS documentation again and to search the web for PBS examples that use the MPI libraries you are using.

  2. Again make certain your environment is correct:
    Understand that the user's environment ($PATH, etc.) has to be correct for the MPI library you are using.

  3. Run the sample MPI program via PBS:
    • MPICH-IP
      1. Create your PBS file:

        [ibm@man-c ibm]$ cd ~/cpi/mpich-ip

        cpi-mpichip.pbs should look something like the following...
        #!/bin/bash
        #PBS -l nodes=4:ppn=2,walltime=00:30:00
        #PBS -N cpi
        PROG=cpi
        
        #How many proc do I have?
        NP=$(wc -l $PBS_NODEFILE | awk '{print $1}')
        
        #messing with this can increase performance or break your code
        export P4_SOCKBUFSIZE=0x40000
        #export P4_GLOBMEMSIZE=33554432
        export P4_GLOBMEMSIZE=16777296
        #export P4_GLOBMEMSIZE=8388648
        #export P4_GLOBMEMSIZE=4194304
        
        #cd into the directory where I typed qsub
        cd $PBS_O_WORKDIR
        
        #run it
        mpirun -machinefile $PBS_NODEFILE -np $NP $PROG
        
      2. Submit your job and watch its progress.
        An example session follows:
        [ibm@man-c mpich-ip]$ pwd
        /home/ibm/cpi/mpich-ip
        [ibm@man-c mpich-ip]$ qsub cpi-mpichip.pbs
        [ibm@man-c mpich-ip]$ showq 
        ACTIVE JOBS--------------------
        JOBNAME USERNAME    STATE  PROC   REMAINING            STARTTIME
        
        5.man-c      ibm  Running     8     0:15:00  Wed Oct 31 12:06:33
        
        1 Active Job        8 of   64 Processors Active (12.50%)
                            4 of   32 Nodes Active      (12.50%)
      3. Observe the output
        [ibm@man-c mpich-ip]$ ls -lrt
        -rw-------    1 ibm     ibm         1031 Oct 31 12:06 cpi.o5
        -rw-------    1 ibm     ibm            0 Oct 31 12:06 cpi.e5
        You'll note some errors in this output. They don't seem to be fatal and only appear when using multiple CPUs/node (SMP/shared memory)
        # cpi.o5
        ----------------------------------------
        Begin PBS Prologue Thu Nov  1 14:29:45 MST 2001
        Job ID:         5.man-c
        Username:       ibm
        Group:          ibm
        Nodes:          node32 node31 node30 node29
        End PBS Prologue Thu Nov  1 14:29:45 MST 2001
        ----------------------------------------
        Process 0 of 8 on node32
        pi is approximately 3.1415926544231247, Error is 0.0000000008333316
        wall clock time = 0.002537
        Process 2 of 8 on node31
        Process 4 of 8 on node30
        Process 6 of 8 on node29
        Process 3 of 8 on node31
        p3_14667:  p4_error: OOPS: semop lock failed
        : 983043
        Process 5 of 8 on node30
        p5_14865:  p4_error: OOPS: semop lock failed
        : 983043
        Process 7 of 8 on node29
        p7_14148:  p4_error: OOPS: semop lock failed
        : 983043
        Process 1 of 8 on node32
        p1_3787: (0.829781) net_recv failed for fd = 5
        p1_3787:  p4_error: net_recv read, errno = : 9
        ----------------------------------------
        Begin PBS Epilogue Thu Nov  1 14:29:52 MST 2001
        Job ID:         5.man-c
        Username:       ibm
        Group:          ibm
        Job Name:       cpi
        Session:        3490
        Limits:         neednodes=4:ppn=2,walltime=00:30:00
        Resources:      cput=00:00:00,mem=10844kb,vmem=39444kb,walltime=00:00:02
        Queue:          dque
        Account:
        Nodes:          node32 node31 node30 node29
        
        Killing leftovers...
        node1:  killing node32 3792
        node3:  killing node30 14865
        
        End PBS Epilogue Thu Nov  1 14:29:53 MST 2001
        ----------------------------------------
        


    • MPICH-GM
      1. Create your PBS file:
        [ibm@man-c ibm]$ cd ~/cpi/mpich-gm

        It (cpi-mpichgm.pbs) should look something like the following...
        #!/bin/bash
        #PBS -l nodes=4:ppn=2,walltime=00:15:00
        #PBS -N cpi
        
        #prog name
        PROG=cpi
        PROGARGS=""
        
        #Make the ~/.gmpi directory for the gm conf files if it does not exist
        test -d ~/.gmpi || mkdir ~/.gmpi
        
        #Define unique gm conf filename
        GMCONF=~/.gmpi/conf.$PBS_JOBID
        
        #Make gm conf file from pbs nodefile
        if /usr/local/xcat/bin/pbsnodefile2gmconf $PBS_NODEFILE >$GMCONF
        then
                :
        #       echo "GM Nodefile:"
        #       echo
        #       cat $GMCONF
        else
                echo "pbsnodefile2gmconf failed to create gm conf file!"
                exit
        fi
        
        #How many proc do I have?
        NP=$(head -1 $GMCONF)
        
        #cd into the directory where I typed qsub
        cd $PBS_O_WORKDIR
        
        #Set receive mode, default: polling
        #RECVMODE="blocking"
        #RECVMODE="hybrid"
        RECVMODE="polling"
        
        #remove --gm-use-shmem if you do not want to use shared memory
        #
        #use --gm-v and --gm-recv-verb for additional info at run,
        #check both .o and .e files for output
        mpirun.ch_gm \
                --gm-f $GMCONF \
                --gm-recv $RECVMODE \
                --gm-use-shmem \
                --gm-kill 5 \
                -np $NP \
                PBS_JOBID=$PBS_JOBID \
                TMPDIR=/scr/$PBS_JOBID \
                $PROG $PROGARGS
        
        #clean up
        rm -f $GMCONF
        
        exit 0
      2. Submit your job and watch its progress.
        An example session follows:
        [ibm@man-c mpich-gm]$ pwd
        /home/ibm/cpi/mpich-gm
        [ibm@man-c mpich-gm]$ qsub cpi-mpichgm.pbs
        [ibm@man-c mpich-gm]$ showq 
        ACTIVE JOBS--------------------
        JOBNAME USERNAME  STATE    PROC  REMAINING            STARTTIME
        
        6.man-c      ibm  Running     8    0:15:00  Wed Oct 31 12:06:33
        
        1 Active Job      8 of   64 Processors Active (12.50%)
                          4 of   32 Nodes Active      (12.50%)
      3. Observe the output
        [ibm@man-c mpich-gm]$ ls -lrt
        -rw-------    1 ibm     ibm         1031 Oct 31 12:06 mcpi.o6
        -rw-------    1 ibm     ibm            0 Oct 31 12:06 mcpi.e6
        # mcpi.o6
        ----------------------------------------
        Begin PBS Prologue Wed Oct 31 12:06:34 MST 2001
        Job ID:         6.man-c
        Username:       ibm
        Group:          ibm
        Nodes:          node32 node31 node30 node29
        End PBS Prologue Wed Oct 31 12:06:34 MST 2001
        ----------------------------------------
        Process 7 of 8 on node29
        Process 4 of 8 on node32
        Process 5 of 8 on node31
        Process 1 of 8 on node31
        Process 2 of 8 on node30
        Process 3 of 8 on node29
        Process 6 of 8 on node30
        Process 0 of 8 on node32
        pi is approximately 3.1415926544231247, Error is 0.0000000008333316
        wall clock time = 0.000789
        ----------------------------------------
        Begin PBS Epilogue Wed Oct 31 12:06:48 MST 2001
        Job ID:         6.man-c
        Username:       ibm
        Group:          ibm
        Job Name:       cpi
        Session:        20298
        Resources:      cput=00:00:00,mem=320kb,vmem=1400kb,walltime=00:00:10
        Queue:          dque
        Account:
        Nodes:          node32 node31 node30 node29
        
        Killing leftovers...
        
        End PBS Epilogue Wed Oct 31 12:06:49 MST 2001
        ----------------------------------------


    • LAM-IP
      1. Create your PBS file:
        [ibm@man-c lam-ip]$ cd ~/cpi/lam-ip
        It (cpi-lam.pbs) should look something like the following...
        #!/bin/bash
        #PBS -l nodes=4:ppn=2,walltime=00:30:00
        #PBS -N cpi
        
        #How many processors do I have?
        NP=$(wc -l $PBS_NODEFILE | awk '{print $1}')
        
        #cd into the directory where I typed qsub
        cd $PBS_O_WORKDIR
        
        #lamboot
        lamboot $PBS_NODEFILE
        
        #run it
        mpirun C cpi
        
        #cleanup
        lamclean
      2. Submit your job and observe its progress
        An example session with a bit of commentary follows:
        [ibm@man-c lam-ip]$ qsub cpi-lam.pbs
        [ibm@man-c lam-ip]$ showq  (note the active nodes and processors)
        ACTIVE JOBS--------------------
        JOBNAME USERNAME    STATE PROC REMAINING            STARTTIME
        
        7.man-c      ibm  Running    8   0:30:00  Tue Oct 30 20:40:24
        
        1 Active Job        8 of   64 Processors Active (12.50%)
                            4 of   32 Nodes Active      (12.50%)
      3. Observe the output
        [ibm@man-c lam-ip]$ ls -l
        -rw-------    1 ibm     ibm            0 Oct 30 20:40 mcpi.e7
        -rw-------    1 ibm     ibm         1189 Oct 30 20:40 mcpi.o7
        
        # mcpi.o7
        ----------------------------------------
        Begin PBS Prologue Tue Oct 30 20:40:25 MST 2001
        Job ID:         7.man-c
        Username:       ibm
        Group:          ibm
        Nodes:          node32 node31 node30 node29
        End PBS Prologue Tue Oct 30 20:40:25 MST 2001
        ----------------------------------------
        
        LAM 6.5.4/MPI 2 C++/ROMIO - University of Notre Dame
        
        Process 0 of 8 on node32
        pi is approximately 3.1415926544231247, Error is 0.0000000008333316
        wall clock time = 0.000757
        Process 1 of 8 on node32
        Process 2 of 8 on node31
        Process 6 of 8 on node29
        Process 4 of 8 on node30
        Process 3 of 8 on node31
        Process 7 of 8 on node29
        Process 5 of 8 on node30
        ----------------------------------------
        Begin PBS Epilogue Tue Oct 30 20:40:31 MST 2001
        Job ID:         7.man-c
        Username:       ibm
        Group:          ibm
        Job Name:       cpi
        Session:        8009
        Resources:      cput=00:00:00,mem=320kb,vmem=1400kb,walltime=00:00:02
        Queue:          dque
        Account:
        Nodes:          node32 node31 node30 node29
        
        Killing leftovers...
        node32:  killing node32 8085
        node30:  killing node30 6201
        node31:  killing node31 6302
        node29:  killing node29 6201
        
        End PBS Epilogue Tue Oct 30 20:40:32 MST 2001
        ----------------------------------------


  4. Running jobs that are a bit more substantial:
    1. Adjust your .pbs file to make the cpi job run on all of the CPUs in your cluster (see the sketch after this list).

    2. Submit a bunch of cpi jobs into the queue and watch their progress with showq and qstat.

    3. Check out http://xcat.org/docs/running_mpi_jobs-HOWTO.html for more examples of running MPI benchmarks and test jobs.

    4. Check out http://xcat.org/docs/top500-HOWTO.html for documentation on running the HPL benchmark

    5. Look for more documentation on this subject in the future.
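
    For step 1, a hedged sketch of the resource line change, assuming the 32-node dual-CPU example cluster (64 CPUs total):

      #PBS -l nodes=32:ppn=2,walltime=00:30:00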



Back to TOC

33. Performing Advanced Tasks
This section is a placeholder to remind me to provide information about GPFS, storage nodes, and other more advanced stuff.

Back to TOC

34. Contributing to xCAT
Join the xCAT-dev mailing list and post your suggestions, bug-fixes, code, etc.

Back to TOC

35. Changelog
2002-09-27   Reference the new redbook.
2002-06-20   a few things added to the todo section. todos and notes added to main document. turn off conserver before configuring terminal servers. no need to chkconfig conserver on or copy init.d files for it.
2002-06-18   setupxcat must be run after .tab files are setup. nodetype.tab can not contain comments. More info on setup of extreme ethernet switches.
2002-06-16   Got rid of many references to /usr/local/xcat that are now /opt/xcat. Updated versions of PBS and Maui. Feature list. Got rid of blipfert download references.
2002-06-06   Updates to reflect xCAT 1.1RC9.2. RH 7.3 stuff.
2002-04-18   "Contributing to xCAT" section. New stuff in "Understanding What xCAT Does".
2002-04-11   Modified noderes.tab example to match xCAT v 1.1RC8.3
2002-03-26   Removed myri.com user/pass. Added note about bug in mpimaker.
2002-03-17   More ethernet switch setup up. xCAT is at 1.1RC8.3
2002-03-12   Added "Configuring the Ethernet Switch" section. Added stuff about collecting MAC addresses without term servers.
2002-03-06   New section: "Configuring APC Master Switches". Added info about x330 8674 stage1 image and howto create a custom stage1 image with xcmosutil.
2002-03-05   New sections: "Understanding What xCAT Does", "Getting the xCAT Software Distribution", "Getting Help", "Performing Advanced Tasks".
2002-03-01   A little more stage2 detail. Howto change font sizes in an xterm. Better screen shots for stage2, stage3, and install.
2002-02-23   Links to firmware upgrades. Updated GM/MPICH-GM versions. Updated ServeRAID to v4.84.
2002-02-17   Added images of wcons install process to stage3 and compute node install sections.
2002-02-12   xCAT 1.1RC8.1
2002-01-28   New versions of LAM-MPI and MPICH
2002-01-27   Spelling fixes.
2002-01-20   Added another diagram to the architecture section.
2002-01-22   Added stuff to be controlled by apc master switch to nodehm.tab
2002-01-16   BIOS v1.02 stage1 dd for 8674. Don't use COM2. noderes.tab changed to '0' (COM1). Changed serialmac to 0 in site.tab.
2002-01-09   gensshkeys, not makesshkeys. DHCPDARGS="eth1 eth2" (the quotes are important)
2002-01-06   Don't need nodeset for initial stage2. Fixed mpacheck/mpncheck confusion. winstall as alternative to 'nodeset; rpower'. Use addcluster user and recommend using /etc/skel, etc.
2002-01-04   watchlogd setup documented. Don't need to install ssh rpms for RH 7.x. Don't need to setup ssh keys for root by hand. dhcpd configuration options on RH7.x. Use maui 3.0.7p1.
2002-01-03   Subsection formating changes to they stick out better... not quite done yet though. More, new, and improved ELS and ESP config documentation.
2001-12-21   Updated to xCAT 1.1RC7.5. Added RSA stuff and documented the more automated way of configuring the ASMA adapters. Added Copyright stuff in the intro. All links to xCAT software are password protected. Links to bohnsack.com documentation have been removed.
2001-12-18   Added links to man pages where possible. More content in the architecture section.
2001-12-10   Updated to xCAT 1.1RC7.4
2001-12-09   Many formatting adjustments to make it work inside of xcat.org. Changed .png to .gif to make it display better with Netscape.
2001-12-08   Changed to PHP from HTML::Mason and moved to xcat.org.
2001-12-06   Updated xCAT version to 1.1RC7.3.
2001-11-09   Updated GM version to 1.5_Linux. Updated xCAT version to 1.1RC7.2.
2001-11-01   Added documentation on running MPICH-IP cpi via a PBS batch job.
2001-10-31   Added documentation on running MPICH-GM cpi via a PBS batch job.
2001-10-30   Added a .png diagram to the architecture section. Filled in the testing cluster operation section with major new content (and split it into 3 sections in the process). Updated GM version to 1.5_Linux_beta3. Moved additional reading to the top of the document.
2001-10-29   Fixed a bunch of spelling mistakes.... I'm certain new ones have already surfaced.
2001-10-26   Started keeping track of changes. Added xCAT configuration examples. Added network configuration examples. Added architecture section. Added more stuff in the introduction. Moved VLAN configuration examples to a separate document. Moved ESP configuration to a separate document. Added examples of what services to remove. Fixed sequence problem with installing the correct OpenSSH rpms. Split LAM and MPICH into separate sections. Added stuff about making ypbind come up at boot on the management node.


Back to TOC

36. TODO
  • Tonko's new stage1 stuff
  • /install/post/sync documentation
  • Document passwd/group/user sync with prsync
  • Patches for PGI on RedHat 7.x
  • New GM version. Running gm_stress
  • PVM stuff
  • 'makedns' section is very unclear... improve. Also explain why we use a bogus DNS sub-domain and DNS forwarders.
  • Include some simple stuff about using conserver cntrl-e-c, etc.
  • dhcp setup is unclear or incorrect in parts.
  • How to deal with 'user' or 'interactive' nodes... PBS, PXE and ks on an x340, etc.
  • How to install a RedHat update kernel as a part of kickstart
  • ia64 stuff.
  • Changing the enable and telnet passwords on cisco 3500s.
  • Make certain each section has a few sentences at the top that explain why we want to do what we're about to do.
  • Validate the correctness of licensed PGI compiler install... possibly condense into only talking about the licensed install
  • Give the stage1, stage2, and stage3 sections intro material that explains why we do each stage
  • Provide content in the 'before you begin' section
  • Flesh out 'verify cluster is set up for NIS'
  • Move a number of the other docs into this document
  • Spelling check
  • Go through the redbook and add any content that is in the redbook, missing from this document, and relevant.
  • Sync with Egan and others before the 1.1 release so this document can be included as an accurate guide with the release.
  • Work so it's easy to provide single-page and multi-page versions of the document in HTML.
  • Convert to DocBook, work on stylesheets, and auto-generate HTML, PDF, man, and text document versions... yeah right.

Back to TOC

37. Thanks
Thanks go out to the following people. They helped this document become what it is today.
  • Mike Galicki, Chris DeYoung, Mark Atkinson, Greg Kettmann, people from POSDATA, Kevin Rudd, and Tonko L De rooy, for suggestions and contributions.
  • Egan Ford for writing xCAT, answering my xCAT questions, and contributing to this document.
  • Jay Urbanski for answering my xCAT questions and getting me started with Linux clustering at IBM.

Back to TOC
