xCAT HOWTO
This document was most recently modified:
09/27/2002
The most recent version of this document is available at:
http://xcat.org/docs/xcat-HOWTO.html
Author:
Matt Bohnsack
bohnsack@bohnsack.com  http://bohnsack.com/
and others

Make certain you're using the most recent version of this document before beginning a cluster implementation, or even before you begin reading. Be aware that software versions referenced in this document may be out of date. Always check for newer software versions and be aware of the stability of these newer versions.

Send additions and corrections to the author, so this document can continue to be improved.


Table of Contents
  1. Introduction
  2. Understanding What xCAT does - Feature/Functionality Hierarchy
  3. Getting the xCAT Software Distribution
  4. Getting Other Required Software
  5. Reading Related Documentation
  6. Getting Help
  7. Understanding Cluster Components and the Example Cluster's Architecture
  8. Installing the OS on the Management Node
  9. Upgrading RedHat Software
  10. Installing Custom Kernel
  11. Installing xCAT
  12. Configuring xCAT
  13. Configuring the Ethernet Switch
  14. Configuring Networking on the Management Node
  15. Doing the Compute Node Preinstall (stage1)
  16. Configuring the Terminal Servers
  17. Completing Management Node Setup
  18. Collecting MAC Addresses (stage2)
  19. Configuring MPN/ASMA/RSA (stage3)
  20. Configuring APC Master Switches
  21. Installing Compute Nodes
  22. Installing / Configuring Myrinet Software
  23. Installing Portland Group Compilers
  24. Installing MPICH MPI Libraries
  25. Installing LAM MPI Libraries
  26. Installing PBS Resource Manager
  27. Installing Maui Scheduler
  28. Deploying PBS on the Cluster
  29. Adding Users and Setting Up User Environment
  30. Verifying Cluster Operation With Pseudo Jobs and PBS
  31. Running a Simple MPI Job Interactively via PBS
  32. Running a Simple MPI Job in Batch Mode via PBS
  33. Performing Advanced Tasks
  34. Contributing to xCAT
  35. ChangeLog
  36. TODO
  37. Thanks

1. Introduction
xCAT is a collection of mostly script based tools to build, configure, administer, and maintain Linux clusters.

xCAT is for use by IBM and IBM Linux cluster customers only. xCAT is copyright © 2000, 2001, 2002 IBM corporation. All rights reserved. Use and modify all you like, but do not redistribute. No warranty is expressed or implied. IBM assumes no liability or responsibility.

This document describes how to implement a Linux cluster on IBM xSeries hardware using xCAT and other third party software. It covers the latest version of xCAT - v1.1RC9.9.5 with RedHat 6.2, 7.0, 7.1, 7.2, or 7.3 as an OS base. All of the examples cover installation on ia32 machines. xCAT does work with ia64 machines, but specific ia64 configuration is not detailed. Specific configuration examples from a somewhat common 32 node PIII cluster configuration are included.

You will need to adjust the configuration examples shown in this document to suit your particular cluster and architecture, but the examples should give a good general idea of what needs to be done. Please don't use this document verbatim as an implementation guide. You should rather use it as an inspiration to your own implementation. Use the man pages, source and other documentation that is available to figure out why certain design/configuration choices are made and how you can make different choices.

This document covers only a little of the hardware connectivity, cabling, etc. that is required to implement a cluster. Additional documentation including hardware installation and configuration is available as a RedBook at http://publib-b.boulder.ibm.com/Redbooks.nsf/9445fa5b416f6e32852569ae006bb65f/7b1ce6b3913cafb386256bdb007595e8?OpenDocument&Highlight=0,SG24-6623-00. If you're serious about implementing a cluster and learning how things work, you should read the RedBook in addition to this document.

Back to TOC

2. Understanding What xCAT Does - Feature/Functionality Hierarchy
This section explains what you can do with xCAT, why xCAT is designed the way it is, and presents a feature/functionality hierarchy.

2.1 Understanding What Drives xCAT's Design and Architecture      
xCAT's architecture and feature set have two major drivers:
  1. Real world requirements - The features in xCAT are a result of the requirements met in hundreds of real cluster implementations. When users have had needs that xCAT or other cluster management solutions couldn't meet, xCAT has risen to the challenge. Over the last few years, this process has been repeatedly applied, resulting in a modular toolkit that represents best practices in cluster management and a flexibility that enables it to change rapidly in response to new requirements and work with many cluster topologies and architectures.
  2. Unmatched Linux clustering experience - The people involved with xCAT's development have used xCAT to implement many of the world's largest Linux clusters and a huge variety of different cluster types. The challenges faced during this work have resulted in features that enable xCAT to power all types of Linux clusters, from the very small to the largest ever built.
2.2 Understanding What Types of Clusters xCAT is Good For      
xCAT works well with the following cluster types:

HPC - High Performance Computing: Physics, Seismic, CFD, FEA, Weather, and other simulations; Bioinformatic work
HS - Horizontal Scaling: Web farms, etc.
Administrative: Not a traditional cluster, but a very convenient platform to install and administer a number of Linux machines
Windows or other OSes: With xCAT's cloning and imaging support, it can be used to rapidly deploy and conveniently manage clusters with compute nodes that run Windows or any other OS
Other: xCAT's modular toolkit approach makes it easy to adjust for building any type of cluster.


2.3 Understanding xCAT's Features      
A list of xCAT's current features, and features planned for the near future, follows:

Status Feature Type Feature
current OS/Distribution support RedHat 6.2, 7.0, 7.1, 7.2, 7.3, and 8.0-beta on management and compute / storage / interactive | head nodes
current OS/Distribution support RedHat for ia64 on management and compute / storage / interactive | head nodes
current OS/Distribution support Any OS on compute nodes via OS agnostic imaging support
current Hardware Control Remote Power control (on/off/state) via IBM Management Processor Network and/or APC Master Switch
current Hardware Control Remote software reset (Ctrl+Alt+Del)
current Hardware Control Remote Network BIOS/firmware update and configuration on a lot of IBM hardware
current Hardware Control Remote OS console via pluggable support for a number of different terminal servers
current Hardware Control Remote POST/BIOS console via IBM Management Processor Network and via terminal servers in upcoming IBM BIOS releases.
current Boot Control Ability to remotely change boot type (network or local disk) with syslinux
current Automated installation Parallel install via scripted RedHat kickstart on ia32 and ia64
current Automated installation Parallel install via imaging with other Linux distributions, Windows, or other OSes
current Automated installation Network installation with supported PXE NICs or via etherboot on supported NICs without PXE
current Monitoring Hardware alerts and email notification with IBM's Management Processor Network and SNMP alerts
current Monitoring Remote vitals (fan speed/temp/etc...) with IBM's Management Processor Network
future Monitoring Configurable and extensible monitoring support via mon
future Monitoring Graphical look at cluster status via ganglia.
current Monitoring Remote hardware event logs with IBM's Management Processor Network
current Administration Utilities Parallel remote shell, ping, rsync, and copy
current Administration Utilities Remote hardware inventory with IBM's Management Processor Network
current Software Stack PBS and Maui schedulers - Build scripts, documentation, automated setup, extra related utilities, and deep integration
current Software Stack Myrinet - automated setup and installation
current Software Stack MPI - Build scripts, documentation, automated setup for MPICH, MPICH-GM, and LAM
future Software Stack Sun Grid Engine scheduler support
current Usability Command line utilities for all cluster management functions
current Usability Single operations can be applied in parallel to multiple nodes with a very flexible and customizable group/range functionality
current Flexibility Support for various user defined node types

Back to TOC

3. Getting the xCAT Software Distribution
This section explains where and how you can get the xCAT software distribution.

Use the form available at http://xcat.org/download/ to request a username/password, if you don't already have one. Make certain that you fill out all the fields and provide a valid IBM sales contact. If you don't provide a valid IBM sales contact, your request may be ignored.

Back to TOC

4. Getting Other Required Software
This section will list all of the CDs, floppies, and software you might need to install a cluster, and where to get them (but it's not complete yet).

4.1 Firmware and Hardware Configuration Software      
4.2 Other Software      
For now, see the original sources cited throughout this document.

Back to TOC

5. Reading Related Documentation
There's quite a bit of related documentation available in various stages of completion. You should read it. It's all accessible at http://xcat.org/docs/.

Back to TOC

6. Getting Help
If you need assistance with building, maintaining, or administering your xCAT cluster, or you have an xCAT feature request, try the xCAT-user mailing list or contact your IBM sales rep or other IBM point of contact.

Back to TOC

7. Understanding Cluster Components and the Example Cluster's Architecture
This document bases most of its examples on a 32 node cluster that uses serial terminal servers for out-of-band console access, an APC Master Switch and IBM's Service Processor Network for remote hardware management, ethernet, and Myrinet. The following diagrams describe some of the detail of this example cluster:

7.1 Components / Rack Layout      
Here you see how the hardware is positioned in the rack. Starting from the bottom and moving towards the top, we have:
  • The Myrinet switch: Used for high-speed, low-latency inter-node communication. Your cluster may not have Myrinet, if you aren't running parallel jobs that do heavy message passing, or if it doesn't fit in your budget.

  • Nodes 1-16: The first 16 compute nodes. Note that every 8th node has an MPA (Management Processor Adapter) installed. You may have RSA adapters or ASMA adapters. These cards enable the SPN (Service Processor Network) to function and remote hardware management to be performed.

  • Monitor/Keyboard: You know what this is.

  • Terminal servers: The terminal servers enable serial consoles from all of the compute nodes to be accessible from the management node. You will find this feature very useful during system setup and for administration after setup.

  • APC master switch: This enables remote power control of devices that are not part of the Service Processor Network... terminal servers, Myrinet switch, ASMA adapters, etc.

  • The management node: The management node is where we install the rest of the nodes from, manage the cluster, etc.

  • Nodes 17-32: The rest of the compute nodes... again with Management Processor cards every 8th node.

  • Ethernet switch: Finally, at the top, we have the ethernet switch.
 
Ethernet Switch
node32
... nodes 27 - 31
node26
node25 MPA
node24
... nodes 19 - 23
node18
node17 MPA
Management Node
apc1 APC Master Switch
ts2 Terminal Servers
ts1
Monitor / Keyboard    
node16
... nodes 11 - 15
node10
node09 MPA
node08
... nodes 03 - 07
node02
node01 MPA has MPA card
Myrinet Switch

7.2 Networks      
Here you see the networks that are used in this document's examples. Note the listing of attached devices to the right. Important things to note are:
  • The external network is the organization's main network. In this example, only the management node has connectivity to the external network.

  • The ethernet switch hosts both the cluster and management network on separate VLANs.

  • The cluster network connects the management node to the compute nodes. We use a private class B network that has no connectivity to the external network. This is often the easiest way to do things and a good thing to do if you think your cluster might grow to more than 254 nodes. You may have a requirement to place the compute nodes on a network that is part of your external network.

  • The management network is a separate network used to connect all devices associated with cluster management... terminal servers, ASMA cards, etc. to the management node.

  • Parallel jobs use the message passing network for interprocess communication. Our example uses a separate private class B network over Myrinet. If you are not using Myrinet, this network could be the same as the cluster network. i.e. You could do any required message passing over the cluster network.
 

Cluster Network <- eth0 on management node (1Gb/s)
172.16.0.0/16 <- eth0 on compute nodes (100Mb/s)
  <- Myrinet switch's management port
 
Management Network <- eth1 on management node (100Mb/s)
172.17.0.0/16 <- terminal server's ethernet interfaces
  <- ASMA adapter's ethernet interfaces
  <- APC MasterSwitch's ethernet interface
   
 
External Network <- eth2 on management node (100Mb/s)
10.0.0.0/8
 
Message Passing Network <- myri0 on compute nodes (2Gb/s)
172.18.0.0/16

7.3 Connections      
(architecture connections diagram)

7.4 Another Connections Diagram      

7.5 Other Architecture Notes      
Other notes about this architecture (and areas where yours may differ and you may need to make adjustments to this document's examples):
  • The compute nodes have no access to the external network.
  • The compute nodes get DNS, DHCP, and NIS services from the management node.
  • NIS is used to distribute username/passwd information to the compute nodes and the management node is the NIS master.
  • The management node is the only node with access to the management network.
  • PBS and Maui are used to schedule/run jobs on the cluster.
  • Users can only access compute nodes when the scheduler has allocated nodes to them and then only with ssh.
  • Jobs will use MPICH or LAM for message passing.

Back to TOC

8. Installing the OS on the Management Node
The first step in building an xCAT cluster is installing Linux on the management node. This is, roughly, how to do just that:

8.1 Create and Configure RAID Devices if Necessary      
If you are using a ServeRAID card in the management node, use the ServeRAID flash/config CD to update the ServeRAID firmware to v4.84 and define your RAID volumes. If you have other nodes with hardware RAID, you might as well update and configure them now as well. You can get this CD from http://www.pc.ibm.com/qtechinfo/MIGR-495PES.html.

8.2 NIS Notes      
If you plan on interacting with an external NIS server, check if it supports MD5 passwords and shadow passwords. If it doesn't support these modern features, don't turn them on during the install of the management node. I'm not absolutely certain on this point, but it's bitten me hard in the past, so be careful.
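A quick sanity check (a sketch; it assumes the management node is already bound to the external NIS domain and that the passwd map is exported under its usual name):

> ypwhich                 # confirm which NIS server is answering
> ypcat passwd | head -1  # MD5 hashes start with $1$; 13-character hashes are old-style crypt

If the password fields are 13-character DES crypt strings, leave MD5 passwords turned off on the management node.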

8.3 Partition Notes      
A good minimum drive partitioning scheme for the management node follows. YMMV:
/boot (200 MB)
/install (4 GB)
/usr/local/ (2.5 GB)
/opt/ (1 GB)
/var (1 GB per every 128 nodes)
/ (the rest of the disk)
SWAP (1 GB)

8.4 User Notes      
It's a good idea to create a normal user other than root during the install. I usually make an 'ibm' user.
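If you'd rather add the user after the install, a minimal sketch ('ibm' is just the username used in this document):

> /usr/sbin/useradd ibm
> passwd ibm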

8.5 DMA Notes      
Some IDE CDROM drives in x340s have a DMA problem that can cause large data errors to crop up in your install and later CD copying. If you have a CDROM drive that has this error, or if you don't want to risk having the frustrating experience of finding out if you do have a bad drive, you need to use this workaround:

Pass ide=nodma to the installer (i.e., at the boot prompt type: text ide=nodma), and then after the install is complete, add append="ide=nodma" to /etc/lilo.conf, run /sbin/lilo, and reboot to a system with IDE DMA disabled.

If you're using grub, you'll have to add ide=nodma to the end of the kernel line in the stanza you want to boot.
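For example, the kernel stanza in /boot/grub/grub.conf might end up looking something like this (a sketch; the kernel version and root device shown are placeholders for whatever your system actually uses):

title Red Hat Linux (2.4.18-3smp)
        root (hd0,0)
        kernel /vmlinuz-2.4.18-3smp ro root=/dev/sda9 ide=nodma
        initrd /initrd-2.4.18-3smp.img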

I highly recommend that you do the above... I've seen this problem as recently as 2002-06-06.

8.6 Install RedHat      
  • RedHat 7.0, 7.1, 7.2, or 7.3
    Select custom installation. When asked for packages to install choose everything. As an added component, under Kernel options choose to additionally install the SMP kernel.

  • RedHat 6.2
    Install RedHat 6.2 with ServeRAID support using:
    • A RedHat 6.2 with updated ServeRAID CD (modified version of the RedHat 6.2 installation CD). You can get a copy of this CD at ftp://cartman.sl.dfw.ibm.com/OS/linux/ks62sr4.iso (IBM internal site).

      or

    • The standard issue RedHat 6.2 CD without the latest IBM ServeRAID support. You must create a support diskette with the software found at the following URL: http://www.pc.ibm.com/qtechinfo/MIGR-495PES.html. At the boot: prompt, type expert. When asked for disk support, insert your floppy and select the second ServeRAID choice. If you do not see two ServeRAID options in the scroll list, the device driver has not been loaded and you will need to restart your configuration.

    Select custom installation. When asked for packages to install choose everything. As an added component, under Kernel options choose to additionally install the SMP kernel.
8.7 Bring Up the Newly Installed System      
Reboot and login as root.

8.8 Turn Off Services We Don't Want (General)      
You probably want to turn off some of the network services that are turned on by default during installation for security and other reasons...

To view installed services:

> /sbin/chkconfig --list | grep ':on'

To turn off a service:

> /sbin/chkconfig <service> off

With RedHat 6.2, you'll also have to comment out the services you don't want run in /etc/inetd.conf and then restart inetd.

8.9 Turn Off Services We Don't Want (Specific)      
The following are examples of exactly what services to turn off for a system that works with our example architecture and will have nothing running that isn't necessary:
  • RedHat 6.2:
    /sbin/chkconfig --level 0123456 lpd off
    /sbin/chkconfig --level 0123456 linuxconf off
    /sbin/chkconfig --level 0123456 kudzu off
    /sbin/chkconfig --level 0123456 pcmcia off
    /sbin/chkconfig --level 0123456 isdn off
    /sbin/chkconfig --level 0123456 apmd off
    /sbin/chkconfig --level 0123456 autofs off
    /sbin/chkconfig --level 0123456 httpd off
    /sbin/chkconfig --level 0123456 reconfig off

    Also edit /etc/inetd.conf commenting out the following lines:

    #ftp stream tcp nowait root /usr/sbin/tcpd in.ftpd -l -a
    #telnet stream tcp nowait root /usr/sbin/tcpd in.telnet
    #shell stream tcp nowait root /usr/sbin/tcpd in.rshd
    #login stream tcp nowait root /usr/sbin/tcpd in.rlogind
    #talk dgram udp wait nobody.tty /usr/sbin/tcpd in.talkd
    #ntalk dgram udp wait nobody.tty /usr/sbin/tcpd in.ntalkd
    #finger stream tcp nowait nobody /usr/sbin/tcpd in.fingerd
    #linuxconf stream tcp wait root /bin/linuxconf linuxconf --http

    Then restart inetd:

    > /sbin/service inet restart

  • RedHat 7.1:
    /sbin/chkconfig --level 0123456 autofs off
    /sbin/chkconfig --level 0123456 linuxconf off
    /sbin/chkconfig --level 0123456 reconfig off
    /sbin/chkconfig --level 0123456 isdn off
    /sbin/chkconfig --level 0123456 pppoe off
    /sbin/chkconfig --level 0123456 iptables off
    /sbin/chkconfig --level 0123456 ipchains off
    /sbin/chkconfig --level 0123456 apmd off
    /sbin/chkconfig --level 0123456 pcmcia off
    /sbin/chkconfig --level 0123456 rawdevices off
    /sbin/chkconfig --level 0123456 lpd off
    /sbin/chkconfig --level 0123456 kudzu off
    /sbin/chkconfig --level 0123456 pxe off

  • RedHat 7.0, 7.2, 7.3:
    You get the point. Use your judgement.
8.10 Erase LAM Package      
You probably want to remove the RedHat LAM package. It can easily get in the way of the MPI software we install later on, because it's an old version and installs itself in /usr/bin:

> rpm --erase lam

Back to TOC

9. Upgrading RedHat Software
RedHat, like all software, has bugs. You should upgrade RedHat with all the available fixes to have the most stable and secure system possible (with the obvious caution that some of the updates might have undesired behaviours). Unless you want your new mega-buck cluster to get rooted by script kiddies, do this!

9.1 Create a Place To Put the Updates and Go There      
We'll use this directory (/install/post/updates/rhxx) later on during the installation of the compute nodes as well.

Substitute rh62, rh70, rh71, rh72, or rh73 for rhxx depending on what version of RedHat you are using.

> mkdir -p /install/post/updates/rhxx
> cd /install/post/updates/rhxx

9.2 Get the Updates      
Go to http://www.redhat.com/download/mirror.html and select a mirror site that has the "Updates" section of RedHat's FTP site.

Download all the rpms from updates/x.x/en/os/i386/ and updates/x.x/en/os/noarch/, to the directory you created above, where x.x is the RedHat release you are using. You will also want to download any glibc packages that are available in updates/x.x/en/os/i686/, so you have an optimized C library. If you're using the RedHat kernel, grab it from updates/x.x/en/os/i686/ as well.
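If the mirror you picked offers FTP or HTTP access, something like the following pulls everything down in one pass (a sketch; ftp.example-mirror.com is a placeholder for the mirror you selected, x.x is your RedHat release, and rhxx is the directory created above):

> cd /install/post/updates/rhxx
> wget -r -nd -l1 -A '*.rpm' ftp://ftp.example-mirror.com/pub/redhat/updates/x.x/en/os/i386/
> wget -r -nd -l1 -A '*.rpm' ftp://ftp.example-mirror.com/pub/redhat/updates/x.x/en/os/noarch/
> wget -r -nd -l1 -A 'glibc*.rpm' ftp://ftp.example-mirror.com/pub/redhat/updates/x.x/en/os/i686/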

9.3 Install the Updates      
Note the $(ls | egrep -v '^(kernel-)') stuff. We don't install RedHat's kernel updates, unless there's a very good reason.
  • Redhat 7.0 - 7.3:
    > cd /install/post/updates/rhxx
    > rpm -Fvh $(ls | egrep -v '^(kernel-)')

  • Redhat 6.2:
    Only if you're using RedHat 6.2...
    > cd /install/post/updates/rh62
    > rpm -Uivh --force --nodeps db3*
    > rpm -Fvh rpm*
    > rpm --rebuilddb
    > rpm -Fvh $(ls | egrep -v '^(kernel-)')

Back to TOC

10. Installing Custom Kernel
These instructions mention the use of pre-built xCAT custom kernels. In the past, you probably wanted to use these. Today, you should probably use a RedHat kernel update RPM or roll your own.

10.1 Download Custom Kernel      
Available at the link provided on the download page: (http://xcat.org/download/)

10.2 Install Kernel      
RedHat 7.2 or 7.3
Upgrade to the latest RedHat kernel RPM or roll your own from kernel.org.

RedHat 7.1
You may want to use the latest RedHat kernel or roll your own, but... then again, you may want to:
> cd / ; tar xzvf kernel-2.4.5-2hpc.tgz

RedHat 6.2, 7.0?
> cd / ; tar xzvf kernel-2.2.19-4hpc.tgz


10.3 Edit /etc/lilo.conf      
# BEGIN example of /etc/lilo.conf for Redhat 6.2 after editing
boot=/dev/sda
map=/boot/map
install=/boot/boot.b
prompt
timeout=50
linear
default=2.2.19-4hpc

image=/boot/vmlinuz-2.2.19-4hpc
        label=2.2.19-4hpc
        read-only
        root=/dev/sda9

image=/boot/vmlinuz-2.2.14-5.0smp
        label=linux
        initrd=/boot/initrd-2.2.14-5.0smp.img
        read-only
        root=/dev/sda9

image=/boot/vmlinuz-2.2.14-5.0
        label=linux-up
        initrd=/boot/initrd-2.2.14-5.0.img
        read-only
        root=/dev/sda9
# END example of /etc/lilo.conf
10.4 Run lilo to Install and Reboot      
> /sbin/lilo; reboot

Do the right thing if you're using grub instead of lilo.
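With grub there is no separate install step; just make sure the new kernel has its own stanza in /boot/grub/grub.conf and that 'default' points at it. A rough sketch, assuming the tarball dropped a /boot/vmlinuz-2.4.5-2hpc kernel and that your root filesystem is on /dev/sda9 as in the lilo.conf example above:

# BEGIN example grub.conf fragment
default=0
timeout=5

title 2.4.5-2hpc
        root (hd0,0)
        kernel /vmlinuz-2.4.5-2hpc ro root=/dev/sda9
# END example grub.conf fragment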

10.5 RH 7.2, 7.3 notes      
You will definitely want to use a RedHat kernel or roll your own if you're using RH 7.2 or 7.3.

10.6 e1000 notes      
You will want to download the latest e1000 driver from Intel's website and build it into your kernel. The e1000 driver supplied with RedHat performs very poorly and in some cases will not work in a VLANed environment. Do this.
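Roughly, the Intel driver source builds like this (a sketch; the version number is a placeholder and the exact procedure is described in the README that ships with the driver):

> tar xzvf e1000-x.y.z.tar.gz
> cd e1000-x.y.z/src
> make install   # builds and installs the e1000 module for the currently running kernel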

Back to TOC

11. Installing xCAT
Installing xCAT on the management node is very straightforward:

11.1 Download the Latest Version of xCAT      
http://xcat.org/download/
The latest version of xCAT is 1.1RC9.9.5.

11.2 Unpack xCAT Into /opt/      
> cd /opt
> tar xzvf /where/you/put/it/xcat-dist-1.1RC9.9.5.tgz

11.3 Install xCAT      
The following sets up some environment and other stuff. It will be a more all-inclusive script in the future.

> cd /opt/xcat/sbin
> ./setupxcat

Note: setupxcat must actually be run after xCAT .tab files are setup later on.

11.4 Understanding what setupxcat does      
TODO

11.5 Add xCAT Man Pages to $MANPATH and test out xCAT man pages      
Add the following line to /etc/man.config:

MANPATH /opt/xcat/man

Test out the man pages:

> man site.tab

Back to TOC

12. Configuring xCAT
This section describes some of the xCAT configuration necessary for the 32 node example cluster. If your cluster differs from this example, you'll have to make changes. xCAT configuration files are located in /opt/xcat/etc. You must setup these configuration files before proceeding.

12.1 Copy the Sample Config Files to Their Required Location      
> mkdir /opt/xcat/etc
> cp /opt/xcat/samples/etc/* /opt/xcat/etc

12.2 Create Your Own Custom Configuration      
Edit /opt/xcat/etc/* to suit your cluster. Please read the man pages 'man site.tab', etc., to learn more about the format of these configuration files. There is a bit more detail on some of these files in some of the later sections. The following are examples that will work with our example 32 node cluster...

12.3 site.tab      
/opt/xcat/etc/site.tab
# site.tab controls most of xCAT's global settings.
# man site.tab for information on what each field means.
# this example uses 'c' as a subdomain private to the cluster and
# 10.0.0.1 as the corp DNS server (forwarder).
rsh			/usr/bin/ssh
rcp			/usr/bin/scp
gkhfile			/opt/xcat/etc/gkh
tftpdir			/tftpboot
tftpxcatroot		xcat
domain			c.mydomain.com
dnssearch		c.mydomain.com,mydomain.com
nameservers		172.16.100.1
forwarders		10.0.0.1
nets			172.16.0.0:255.255.0.0,172.17.0.0:255.255.0.0,172.18.0.0:255.255.0.0
dnsdir			/var/named
dnsallowq		172.16.0.0:255.255.0.0,172.17.0.0:255.255.0.0,172.18.0.0:255.255.0.0
domainaliasip		172.16.100.1
mxhosts			c.mydomain.com,man-c.c.mydomain.com
mailhosts		man-c
master			man-c
homefs			man-c:/home
localfs			man-c:/usr/local
pbshome			/var/spool/pbs
pbsprefix		/usr/local/pbs
pbsserver		man-c
scheduler		maui
xcatprefix		/opt/xcat
keyboard		us
timezone		US/Central
offutc			-6
mapperhost		NA
serialmac		0
serialbps		9600
snmpc			public
snmpd			172.17.100.1
poweralerts		Y
timeservers		man-c
logdays			7
installdir		/install
clustername		Clever-cluster-name
dhcpver			2
dhcpconf		/etc/dhcpd.conf
dynamicr		eth0,ia32,172.30.0.1,255.255.0.0,172.30.1.1,172.30.254.254
usernodes		man-c
usermaster		man-c
nisdomain		NA
nismaster		NA
nisslaves		NA
homelinks		NA
chagemin		0
chagemax		60
chagewarn		10
chageinactive		0
mpcliroot		/opt/xcat/lib/mpcli
12.4 nodelist.tab      
/opt/xcat/etc/nodelist.tab
# nodelist.tab contains a list of nodes and defines groups that
# can be used in commands.  man nodelist.tab for more information.
node01	all,rack1,compute,myri,mpn1
node02	all,rack1,compute,myri,mpn1
node03	all,rack1,compute,myri,mpn1
node04	all,rack1,compute,myri,mpn1
node05	all,rack1,compute,myri,mpn1
node06	all,rack1,compute,myri,mpn1
node07	all,rack1,compute,myri,mpn1
node08	all,rack1,compute,myri,mpn1
node09	all,rack1,compute,myri,mpn2
node10	all,rack1,compute,myri,mpn2
node11	all,rack1,compute,myri,mpn2
node12	all,rack1,compute,myri,mpn2
node13	all,rack1,compute,myri,mpn2
node14	all,rack1,compute,myri,mpn2
node15	all,rack1,compute,myri,mpn2
node16	all,rack1,compute,myri,mpn2
node17	all,rack1,compute,myri,mpn3
node18	all,rack1,compute,myri,mpn3
node19	all,rack1,compute,myri,mpn3
node20	all,rack1,compute,myri,mpn3
node21	all,rack1,compute,myri,mpn3
node22	all,rack1,compute,myri,mpn3
node23	all,rack1,compute,myri,mpn3
node24	all,rack1,compute,myri,mpn3
node25	all,rack1,compute,myri,mpn4
node26	all,rack1,compute,myri,mpn4
node27	all,rack1,compute,myri,mpn4
node28	all,rack1,compute,myri,mpn4
node29	all,rack1,compute,myri,mpn4
node30	all,rack1,compute,myri,mpn4
node31	all,rack1,compute,myri,mpn4
node32	all,rack1,compute,myri,mpn4
rsa01	nan,mpa
rsa02	nan,mpa
rsa03	nan,mpa
rsa04	nan,mpa
ts01	nan,ts
ts02	nan,ts
myri01  nan
12.5 mpa.tab      
/opt/xcat/etc/mpa.tab
#service processor adapter management
#
#type      = asma,rsa
#name      = internal name (must be unique)
#            internal name should = node name
#            if rsa/asma is primary management
#            processor
#number    = internal number (must be unique and > 10000)
#command   = telnet,mpcli
#reset     = http(ASMA only),mpcli,NA
#dhcp      = Y/N(RSA only)
#gateway   = default gateway or NA (for DHCP assigned)
#
rsa01	rsa,rsa01,10001,mpcli,mpcli,NA,N,NA
rsa02	rsa,rsa02,10002,mpcli,mpcli,NA,N,NA
rsa03	rsa,rsa03,10003,mpcli,mpcli,NA,N,NA
rsa04	rsa,rsa04,10004,mpcli,mpcli,NA,N,NA
12.6 mp.tab      
/opt/xcat/etc/mp.tab
# mp.tab defines how the Service processor network is setup.
# node07 is accessed via the name 'node07' on the RSA 'rsa01', etc.
# man asma.tab for more information until the man page for mp.tab is ready
node01	rsa01,node01
node02	rsa01,node02
node03	rsa01,node03
node04	rsa01,node04
node05	rsa01,node05
node06	rsa01,node06
node07	rsa01,node07
node08	rsa01,node08
node09	rsa02,node09
node10	rsa02,node10
node11	rsa02,node11
node12	rsa02,node12
node13	rsa02,node13
node14	rsa02,node14
node15	rsa02,node15
node16	rsa02,node16
node17	rsa03,node17
node18	rsa03,node18
node19	rsa03,node19
node20	rsa03,node20
node21	rsa03,node21
node22	rsa03,node22
node23	rsa03,node23
node24	rsa03,node24
node25	rsa04,node25
node26	rsa04,node26
node27	rsa04,node27
node28	rsa04,node28
node29	rsa04,node29
node30	rsa04,node30
node31	rsa04,node31
node32	rsa04,node32
12.7 apc.tab      
/opt/xcat/etc/apc.tab
# apc.tab  defines  the  relationship  between nodes and APC
# MasterSwitches and their assigned outlets.  In our example,
# the power for rsa01 is plugged into the 1st outlet of the
# APC MasterSwitch, etc.
rsa01	apc1,1
rsa02	apc1,2
rsa03	apc1,3
rsa04	apc1,4
ts01	apc1,5
ts02	apc1,6
myri01	apc1,7
12.8 conserver.cf      
/opt/xcat/etc/conserver.cf
# conserver.cf defines how serial consoles are accessed.  Our example
# uses the ELS terminal servers and node01 is connected to port 1 
# on ts01, node02 is connected to port 2 on ts01, node17 is connected to
# port 1 on ts02, etc.
# man conserver.cf for more information
#
# The character '&' in logfile names are substituted with the console
# name.  Any logfile name that doesn't begin with a '/' has LOGDIR
# prepended to it.  So, most consoles will just have a '&' as the logfile
# name which causes /var/log/consoles/ to be used.
#
LOGDIR=/var/log/consoles
#
# list of consoles we serve
#    name : tty[@host] : baud[parity] : logfile : mark-interval[m|h|d]
#    name : !host : port : logfile : mark-interval[m|h|d]
#    name : |command : : logfile : mark-interval[m|h|d]
#
node01:!ts01:3001:&:
node02:!ts01:3002:&:
node03:!ts01:3003:&:
node04:!ts01:3004:&:
node05:!ts01:3005:&:
node06:!ts01:3006:&:
node07:!ts01:3007:&:
node08:!ts01:3008:&:
node09:!ts01:3009:&:
node10:!ts01:3010:&:
node11:!ts01:3011:&:
node12:!ts01:3012:&:
node13:!ts01:3013:&:
node14:!ts01:3014:&:
node15:!ts01:3015:&:
node16:!ts01:3016:&:
node17:!ts02:3001:&:
node18:!ts02:3002:&:
node19:!ts02:3003:&:
node20:!ts02:3004:&:
node21:!ts02:3005:&:
node22:!ts02:3006:&:
node23:!ts02:3007:&:
node24:!ts02:3008:&:
node25:!ts02:3009:&:
node26:!ts02:3010:&:
node27:!ts02:3011:&:
node28:!ts02:3012:&:
node29:!ts02:3013:&:
node30:!ts02:3014:&:
node31:!ts02:3015:&:
node32:!ts02:3016:&:
%%   
#
# list of clients we allow
# {trusted|allowed|rejected} : machines
#
trusted: 127.0.0.1
12.9 conserver.tab      
/opt/xcat/etc/conserver.tab
# conserver.tab  defines  the relationship between nodes and
# conserver servers.  Our example uses only one conserver on
# the localhost.  man conserver.tab for more information.
node01	localhost,node01
node02	localhost,node02
node03	localhost,node03
node04	localhost,node04
node05	localhost,node05
node06	localhost,node06
node07	localhost,node07
node08	localhost,node08
node09	localhost,node09
node10	localhost,node10
node11	localhost,node11
node12	localhost,node12
node13	localhost,node13
node14	localhost,node14
node15	localhost,node15
node16	localhost,node16
node17	localhost,node17
node18	localhost,node18
node19	localhost,node19
node20	localhost,node20
node21	localhost,node21
node22	localhost,node22
node23	localhost,node23
node24	localhost,node24
node25	localhost,node25
node26	localhost,node26
node27	localhost,node27
node28	localhost,node28
node29	localhost,node29
node30	localhost,node30
node31	localhost,node31
node32	localhost,node32
12.10 nodehm.tab      
/opt/xcat/etc/nodehm.tab
#
#node hardware management
#
#power     = mp,baytech,emp,apc,apcp,NA
#reset     = mp,apc,apcp,NA
#cad       = mp,NA
#vitals    = mp,NA
#inv       = mp,NA
#cons      = conserver,tty,rtel,NA
#bioscons  = rcons,mp,NA
#eventlogs = mp,NA
#getmacs   = rcons,cisco3500
#netboot   = pxe,eb,ks62,elilo,file:,NA
#eth0      = eepro100,pcnet32,e100,bcm5700
#gcons     = vnc,NA
#serialbios   = Y,N,NA
#
#node	power,reset,cad,vitals,inv,cons,bioscons,eventlogs,getmacs,netboot,eth0,gcons,serialbios
#
node01  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node02  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node03  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node04  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node05  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node06  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node07  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node08  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node09  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node10  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node11  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node12  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node13  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node14  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node15  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node16  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node17  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node18  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node19  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node20  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node21  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node22  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node23  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node24  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node25  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node26  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node27  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node28  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node29  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node30  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node31  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
node32  mp,mp,mp,mp,mp,conserver,rcons,mp,rcons,pxe,eepro100,vnc,N
rsa01   apc,apc,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,N
rsa02   apc,apc,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,N
rsa03   apc,apc,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,N
rsa04   apc,apc,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,N
ts01    apc,apc,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,N
ts02    apc,apc,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,N
myri01  apc,apc,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,N
12.11 noderes.tab      
/opt/xcat/etc/noderes.tab
#
#TFTP         = Where is my TFTP server? 
#               Used by makedhcp to setup /etc/dhcpd.conf
#               Used by mkks to setup update flag location
#NFS_INSTALL  = Where do I get my files?
#INSTALL_DIR  = From what directory?
#SERIAL       = Serial console port (0, 1, or NA)
#USENIS       = Use NIS to authenticate (Y or N)
#INSTALL_ROLL = Am I also an installation server? (Y or N)
#ACCT         = Turn on BSD accounting
#GM           = Load GM module (Y or N)
#PBS          = Enable PBS (Y or N)
#ACCESS       = access.conf support
#GPFS         = Install GPFS
#INSTALL NIC  = eth0, eth1, ... or NA
#
#node/group	TFTP,NFS_INSTALL,INSTALL_DIR,SERIAL,USENIS,INSTALL_ROLL,ACCT,GM,PBS,ACCESS,GPFS,INSTALL_NIC
#
compute man-c,man-c,/install,0,N,N,N,Y,Y,Y,N,eth0
nan     man-c,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA

12.12 nodetype.tab      
/opt/xcat/etc/nodetype.tab
# nodetype.tab maps nodes to types of installs.
# Our example uses only one type, but you might have a few
# different types.. a subset of nodes with GigE, storage nodes,
# etc.  man nodetype.tab for more information.
########### !!!!!!!!!!!! this file can not contain comments !!!!
########### !!!!!!!!!!!! this file can not contain comments !!!!
########### !!!!!!!!!!!! this file can not contain comments !!!!
node01	compute73
node02	compute73
node03	compute73
node04	compute73
node05	compute73
node06	compute73
node07	compute73
node08	compute73
node09	compute73
node10	compute73
node11	compute73
node12	compute73
node13	compute73
node14	compute73
node15	compute73
node16	compute73
node17	compute73
node18	compute73
node19	compute73
node20	compute73
node21	compute73
node22	compute73
node23	compute73
node24	compute73
node25	compute73
node26	compute73
node27	compute73
node28	compute73
node29	compute73
node30	compute73
node31	compute73
node32	compute73
12.13 passwd.tab      
/opt/xcat/etc/passwd.tab
# passwd.tab defines some passwords that will be used in the cluster
# man passwd.tab for more information.
cisco		cisco
rootpw		netfinity
asmauser	USERID
asmapass	PASSW0RD

Back to TOC

13. Configuring the Ethernet Switch
This section describes configuring the ethernet switch. Some of it is a bit rough. I'd appreciate some more accurate content.

13.1 Connecting to the Switch and Setting IP Address and Password      
  • Cisco:
    Connect the management node's COM1 to the switch's console port and...
    > cu -l /dev/ttyS0 -s 9600
    Log in and enable.

    Assign an IP to the default VLAN:
    cisco> conf t
    cisco> int vlan1
    cisco> ip address 172.17.5.1 255.255.0.0
    cisco> exit

    Allow telnet access and set telnet password to 'cisco':
    cisco> conf t
    cisco> line tty 0 15
    cisco> login password cisco
    cisco> exit

    Set enable and console passwords:
    Set enable password:
    cisco> conf t
    cisco> enable password cisco
    cisco> exit

    Set console password:
    cisco> conf t
    cisco> line vty 0 4
    cisco> password cisco
    cisco> exit

  • Extreme Networks:
    > config vlan Default ipaddress 172.16.5.1 255.255.0.0

13.2 Changing Port Settings      
  • Cisco:
    Setup 'spanning-tree portfast':
    Without this option, DHCP may fail because it takes too long for a port to come online after a machine powers up. Do not set spanning-tree portfast on ports that will connect to other switches. Do the following on each port on your switch:
    cisco> conf t
    cisco> int Fa0/1
    cisco> spanning-tree portfast
    cisco> exit
    cisco> int Fa0/2
    cisco> spanning-tree portfast
    cisco> exit
    etc.

  • Extreme Networks:
    None needed that I am aware of.

13.3 Setting up Remote Logging      
Here we have the switch send all its logging information to the management node's management interface. (We'll enable remote syslogging on the management node later.)
  • Cisco:
    cisco> conf t
    cisco> logging 172.17.5.1
    cisco> exit

  • Extreme Networks:
    > config syslog 172.17.5.1

13.4 Setting up VLANs      
  • Cisco:
    # for each interface you want to put in a VLAN Fa0/1 .. Fa0/32, and gig ports { # clearly pseudo-code
      cisco> interface Fa0/1
      cisco> switchport mode access
      cisco> switchport access vlan 2
      cisco> exit
    # }

  • Extreme Networks:
    > create vlan man
    > config vlan man tag 2
    > config vlan man ipaddress 172.17.5.1 255.255.0.0
    > config Default delete port 1,2,3,4 # the ports you want in the management VLAN
    > config man add port 1,2,3,4
    > show vlan

13.5 Notes on VLANs with Multiple Switches      
  • Cisco:
     cisco> configure terminal
     cisco> interface Gi0/1
     cisco> switchport mode trunk
     cisco> switchport trunk encapsulation isl # you should probably use the standard encapsulation instead
     cisco> exit

  • Extreme Networks:
     unconfigure switch
     config Default delete port ports,you,don't,want,in,management,VLAN
    
     create vlan cluster
     config vlan cluster tag 2
     config cluster add port ports,you,want,in,cluster,VLAN
    
     show vlan
    
     save
13.6 Saving Your Changes      
You want to make certain your switch configuration is saved in case the switch is rebooted.
  • Cisco:
    cisco> write mem

  • Extreme Networks:
    extreme> save


Back to TOC

14. Configuring Networking on the Management Node
This section describes network setup on the management node:

14.1 /etc/hosts      
Create your /etc/hosts file. Make sure all devices are entered... terminal servers, switches, hardware management devices, etc.
The following is a sample of the /etc/hosts file for the example cluster:
#  Localhost
127.0.0.1		localhost.localdomain localhost
# 
########## Management Node ###################
#
# cluster interface (eth0) GigE
172.16.100.1  man-c.c.mydomain.com        man-c
#
# management interface (eth1)
172.17.100.1  man-m.c.mydomain.com        man-m
#
# external interface (eth2)
10.0.0.1      man.c.mydomain.com          man
#
########## Management Equipment ##############
#
# RSA adapters. You might have ASMA cards instead
172.17.1.1    rsa01.c.mydomain.com        rsa01
172.17.1.2    rsa02.c.mydomain.com        rsa02
172.17.1.3    rsa03.c.mydomain.com        rsa03
172.17.1.4    rsa04.c.mydomain.com        rsa04
#
# Terminal Servers
172.17.2.1    ts01.c.mydomain.com         ts01
172.17.2.2    ts02.c.mydomain.com         ts02
#
# APC Master Switch
172.17.3.1    apc1.c.mydomain.com         apc01
#
# Myrinet Switch's ethernet management port
172.17.4.1    myri01.c.mydomain.com       myri01
#
# Ethernet Switch
172.17.5.1    ethernet01c.c.mydomain.com  ethernet01c
172.16.5.1    ethernet01.c.mydomain.com   ethernet01
#
########## Compute Nodes #####################
#
#
172.16.1.1    node01.c.mydomain.com       node01
172.18.1.1    node01-myri0.c.mydomain.com node01-myri0
172.16.1.2    node02.c.mydomain.com       node02
172.18.1.2    node02-myri0.c.mydomain.com node02-myri0
172.16.1.3    node03.c.mydomain.com       node03
172.18.1.3    node03-myri0.c.mydomain.com node03-myri0
172.16.1.4    node04.c.mydomain.com       node04
172.18.1.4    node04-myri0.c.mydomain.com node04-myri0
172.16.1.5    node05.c.mydomain.com       node05
172.18.1.5    node05-myri0.c.mydomain.com node05-myri0
172.16.1.6    node06.c.mydomain.com       node06
172.18.1.6    node06-myri0.c.mydomain.com node06-myri0
172.16.1.7    node07.c.mydomain.com       node07
172.18.1.7    node07-myri0.c.mydomain.com node07-myri0
172.16.1.8    node08.c.mydomain.com       node08
172.18.1.8    node08-myri0.c.mydomain.com node08-myri0
172.16.1.9    node09.c.mydomain.com       node09
172.18.1.9    node09-myri0.c.mydomain.com node09-myri0
172.16.1.10   node10.c.mydomain.com       node10
172.18.1.10   node10-myri0.c.mydomain.com node10-myri0
172.16.1.11   node11.c.mydomain.com       node11
172.18.1.11   node11-myri0.c.mydomain.com node11-myri0
172.16.1.12   node12.c.mydomain.com       node12
172.18.1.12   node12-myri0.c.mydomain.com node12-myri0
172.16.1.13   node13.c.mydomain.com       node13
172.18.1.13   node13-myri0.c.mydomain.com node13-myri0
172.16.1.14   node14.c.mydomain.com       node14
172.18.1.14   node14-myri0.c.mydomain.com node14-myri0
172.16.1.15   node15.c.mydomain.com       node15
172.18.1.15   node15-myri0.c.mydomain.com node15-myri0
172.16.1.16   node16.c.mydomain.com       node16
172.18.1.16   node16-myri0.c.mydomain.com node16-myri0
172.16.1.17   node17.c.mydomain.com       node17
172.18.1.17   node17-myri0.c.mydomain.com node17-myri0
172.16.1.18   node18.c.mydomain.com       node18
172.18.1.18   node18-myri0.c.mydomain.com node18-myri0
172.16.1.19   node19.c.mydomain.com       node19
172.18.1.19   node19-myri0.c.mydomain.com node19-myri0
172.16.1.20   node20.c.mydomain.com       node20
172.18.1.20   node20-myri0.c.mydomain.com node20-myri0
172.16.1.21   node21.c.mydomain.com       node21
172.18.1.21   node21-myri0.c.mydomain.com node21-myri0
172.16.1.22   node22.c.mydomain.com       node22
172.18.1.22   node22-myri0.c.mydomain.com node22-myri0
172.16.1.23   node23.c.mydomain.com       node23
172.18.1.23   node23-myri0.c.mydomain.com node23-myri0
172.16.1.24   node24.c.mydomain.com       node24
172.18.1.24   node24-myri0.c.mydomain.com node24-myri0
172.16.1.25   node25.c.mydomain.com       node25
172.18.1.25   node25-myri0.c.mydomain.com node25-myri0
172.16.1.26   node26.c.mydomain.com       node26
172.18.1.26   node26-myri0.c.mydomain.com node26-myri0
172.16.1.27   node27.c.mydomain.com       node27
172.18.1.27   node27-myri0.c.mydomain.com node27-myri0
172.16.1.28   node28.c.mydomain.com       node28
172.18.1.28   node28-myri0.c.mydomain.com node28-myri0
172.16.1.29   node29.c.mydomain.com       node29
172.18.1.29   node29-myri0.c.mydomain.com node29-myri0
172.16.1.30   node30.c.mydomain.com       node30
172.18.1.30   node30-myri0.c.mydomain.com node30-myri0
172.16.1.31   node31.c.mydomain.com       node31
172.18.1.31   node31-myri0.c.mydomain.com node31-myri0
172.16.1.32   node32.c.mydomain.com       node32
172.18.1.32   node32-myri0.c.mydomain.com node32-myri0
14.2 Configure Network Devices      
Edit /etc/modules.conf, /etc/sysconfig/network-scripts/*, and /etc/sysconfig/network, to create a network configuration that reflects the cluster's design. The following samples work with the example cluster:
/etc/modules.conf
alias eth0 e1000
alias eth1 pcnet32
alias eth2 e100
/etc/sysconfig/network
NETWORKING=yes
HOSTNAME="man-c"
GATEWAY="10.0.0.254"
GATEWAYDEV="eth2"
FORWARD_IPV4="no"
/etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE="eth0"
BOOTPROTO="none"
IPADDR="172.16.100.1"
NETMASK="255.255.0.0"
ONBOOT="yes"
/etc/sysconfig/network-scripts/ifcfg-eth1
DEVICE="eth1"
BOOTPROTO="none"
IPADDR="172.17.100.1"
NETMASK="255.255.0.0"
ONBOOT="yes"
/etc/sysconfig/network-scripts/ifcfg-eth2
DEVICE="eth2"
BOOTPROTO="none"
IPADDR="10.0.0.1"
NETMASK="255.0.0.0"
ONBOOT="yes"
14.3 Setup DNS      
Setup the resolver:

> makedns master

If the management node's IP address is listed first in site.tab's nameservers field (as it is in our example), this will generate zone files from the data in /etc/hosts and start a DNS server as a master.

If the management node's IP address is listed in the nameservers field, but not first, then it will become a slave DNS server and will do a zone transfer from the master (the first IP address listed). If the management node's IP address is not in site.tab, then it's possible for the first IP address not to be in the cluster at all. Some installations use their own DNS server external to the cluster and set up all the cluster names there. In that case, put the IP address of the external DNS server as the #1 address in the nameservers field and the IP of the management node as #2; #2 then becomes a slave and will do a zone transfer. You'll also need to update either the .kstmp files or /install/post/sync (this is what I do) to copy a resolv.conf to the clients that tells them to use only the slave (#2) for DNS.
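For example, the resolv.conf you push to the compute nodes from /install/post/sync in the external-master case might look something like this (a sketch using the example cluster's names; the management node 172.16.100.1 is the slave the clients should query):

search c.mydomain.com mydomain.com
nameserver 172.16.100.1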

14.4 Verify Name Resolution      
Verify that forward and reverse name resolution work like in the following examples:

> host node01
> host 172.16.1.1

Do not continue until forward and reverse name resolution are working.

14.5 Setup VLANs and Configure Ethernet Switches      
If you have separate subnets for the management and compute networks, like in our example, you need to setup VLANs on the ethernet switches. Experience has shown that many strange problems are solved with the introduction of VLANs. Use VLANs to separate the ports associated with the management and cluster subnets. A set of somewhat random notes for configuring VLANs on different switches and setting up "spanning-tree portfast" on Ciscos is available in the Configuring the Ethernet Switch section above.

Note: Tie this into the configuring the ethernet switch section and remove redundant information. There are other subsections in this section that need similar treatment.

14.6 Verify Management Node Network Setup      
You might have to wait until some of the management equipment is setup in the steps that follow, to do full verification. You need to check that:
  • You can ping all of the network interfaces
  • You can ping other devices on all of the subnets (cluster, management, external, etc.)
  • You can ping and route through your gateway
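A quick sketch of how you might check a handful of these from the management node (the host names are from the example cluster; substitute your own):

> for h in man-c man-m man ts01 ts02 apc01 rsa01 ethernet01; do ping -c 1 $h > /dev/null && echo "$h ok" || echo "$h FAILED"; done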

Back to TOC

15. Doing the Compute Node Preinstall (stage1)
'stage1' is an automated procedure for updating and configuring system BIOSes. This section describes an easy way to upgrade the x330 compute nodes' firmware and set their BIOS settings. It is x330 specific. You're on your own with non-x330s... just make certain that you enable PXE boot by setting the boot order to be CDROM, Floppy, Network, HD and that you disable virus detection or any other interactive BIOS features.

Note: Tonko's new unified stage1 stuff will replace this section very soon.

15.1 Create a CD/Floppy to Configure the BIOS and SPN in Each x330      
If you have a lot of compute nodes, it's a good idea to make multiple sets. The flash/configuration process can take up to 10 minutes per machine, so you don't want to wait around for 10 minutes for each node.

For 1 GHz machines (model 8654):
Create a CD from the ISO image located at:
/opt/xcat/stage/stage1/x330.iso

Create a floppy from the floppy image located at:
/opt/xcat/stage/stage1/x330.dd
This can be done under Linux by placing a blank DOS formatted floppy into the floppy drive and:
> dd if=x330.dd of=/dev/fd0 bs=1024 conv=sync; sync

For machines > 1 GHz (model 8674):
Create a floppy from the floppy image located at:
/opt/xcat/stage/stage1/x330-6874-1.02.dd
This can be done under Linux by placing a blank DOS formatted floppy into the floppy drive and:
> dd if=x330-6874-1.02.dd of=/dev/fd0 bs=1024 conv=sync; sync

You don't need a CD.

15.2 Paranoia      
Things just seem to work better when you do this... Remove AC Power from all compute nodes. Disconnect all the SPN serial networking.
After waiting a minute or so, restore AC Power to compute nodes and apply power to ASMA cards.

15.3 Flash the Compute Nodes      
Insert the floppy (and the CD if you're flashing a model 8654) into each compute node, power on the node and let the disks do their magic. There is no need for any manual intervention during this process... ignore any BIOS errors etc, they'll go away in 15 seconds or so. When the flash/config is finished, the CD will eject and the display should have an obvious "I'm done" message.

Reconnect the SPN serial network after all the machines have been flashed.
You shouldn't make any compute node BIOS modifications after this procedure.

15.4 Hardware Configurations That Are Different Than x330*.dd      
If your nodes have hardware configurations that are slightly different from the hardware configurations that x330.dd / x330-6874-1.02.dd were built for (different PCI cards or a different memory size), you'll probably get a POST error message like "Memory Size has Changed" or similar.

You will want to create a custom x330.dd floppy image. Here's how:
  • Flash the nodes as detailed above
  • Reboot the nodes and fix any BIOS issues
  • Create a xcmosutil floppy
    > dd if=/opt/xcat/stage/stage1/xcmosutil.dd of=/dev/fd0 bs=1024 conv=sync; sync
  • Boot the xcmosutil floppy and replace the 'compute' settings:
    DOS> del compute
    DOS> cmosutil /s compute
  • Put the xcmosutil floppy back into the management node and copy the compute file:
    > mcopy a:/compute /tmp/compute
  • Make a copy of the x330.dd that is appropriate for your machine type:
    > cp /opt/xcat/stage/stage1/x330-6874-1.02.dd /opt/xcat/stage/stage1/my_x330.dd
  • Mount the .dd image loopback and replace 'compute':
    > mount -o loop /opt/xcat/stage/stage1/my_x330.dd /mnt
    > rm /mnt/compute
    > cp /tmp/compute /mnt/compute
    > umount /mnt
  • Make a floppy of your new image and flash away:
    > dd if=/opt/xcat/stage/stage1/my_x330.dd of=/dev/fd0 bs=1024 conv=sync; sync
15.5 Serial Port Notes      
In the past, it was necessary to make a serial port hardware change inside of the x330. This is no longer necessary. All the bugs are fixed in the BIOS, so please use ttyS0/COM1/COMa.

Back to TOC

16. Configuring the Terminal Servers
This section describes setting up ELS and ESP terminal servers and conserver. Your cluster will probably have either ELSes or ESPs so you can skip the instructions for the terminal server type that is not a part of your cluster. Terminal servers enable out-of-band administration and access to the compute nodes... e.g. watching a compute node's console remotely before the compute node can be assigned an IP address or after the network config gets messed up, etc.

16.1 Learn About Conserver      
Check out conserver's website.

16.2 Shutdown Conserver      
Before setting up the terminal servers, make sure that the conserver service is stopped:

> /sbin/service conserver stop

16.3 Setup ELS Terminal Servers      
This section describes how to configure the Equinox ELS terminal server. If you're using the ESP terminal servers instead of the ELSes, you'll want to skip this section and skip ahead to 16.4 and follow the ESP instructions.
    16.3.1 conserver.cf Setup      
    Modify /opt/xcat/etc/conserver.cf
    This has already been covered in the configuring xCAT section, but this explains it...

    Each node gets a line like:

    nodeXXX:!tsx:yyyy:&:

    where x = ELS Unit number and yyyy = ELS port + 3000 e.g. node1:!ts1:3001:&: means access node1 via telnet to ts1 on port 3001. 'node1' should be connected to ts1's first serial port.
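    If you don't want to type 32 of these lines by hand, a one-liner like the following generates them for the example cluster (a sketch; adjust the node count and the node-to-terminal-server split to match your wiring, then paste the output into /opt/xcat/etc/conserver.cf):

    > for i in $(seq 1 32); do ts=$(( (i-1)/16 + 1 )); port=$(( (i-1)%16 + 3001 )); printf 'node%02d:!ts%02d:%d:&:\n' $i $ts $port; done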

    16.3.2 Set ELS's IP Address      
    For each ELS unit in your cluster...

    Reset the ELS to factory defaults. You usually have to push the reset button. If the button is green, just push it. If the button is white, you need to hold it down until the link light stops blinking. All the new units have green buttons.

    Connect the DB-9 adapter (Equinox part #210062) to the management node's first serial port (COM1) and connect a serial cable from the ELS to the DB-9 adapter. You can test that the serial connection is good with:

    > cu -l /dev/ttyS0 -s 9600

    Hit Return to connect and you should see:

    Username>

    Unplug the serial cable to have cu hangup and then reconnect it for the next step:
    > setupelsip <ELS_HOSTNAME>

    Test for success:
    > ping <ELS_HOSTNAME>

    16.3.3 Final ELS Setup      
    After assigning the ELS' IP address over the serial link, use

    > setupels <ELS_HOSTNAME>

    to finish the setup for each ELS in your cluster. This sets up the terminal server's serial settings. After the serial settings are set, you can not use setupelsip again, because the serial ports have been set for reverse use. A reset of the unit will have to be performed again, if you need to change the IP address.
16.4 Setup ESP Terminal Servers      
This section describes how to configure the Equinox ESP terminal server. If you're using ELS terminal servers, as most of the examples in this document do, you should skip this section and use the ELS section instead.

    16.4.1 conserver.cf Setup      
    Modify /opt/xcat/etc/conserver.cf

    Each node gets a line like:

    nodeXXX:/dev/ttyQxxyy:9600p:&:

    where xx = ESP Unit number and yy = ESP port (in hex) e.g. ttyQ01e0

    16.4.2 Build ESP Driver      
    Install the RPM (must be 3.03 or later!)

    16.4.3 ESP Startup Configuration      
    Type /opt/xcat/sbin/updaterclocal (you can run this multiple times without creating problems). You need to run this because the ESP RPM puts evil code in the rc.local file that forces the ESP driver to load very last, so any other service that needs the ESP at startup (e.g. conserver) will fail.

    > cp /opt/xcat/rc.d/espx /etc/rc.d/init.d/
    > chkconfig espx on

    16.4.4 ESP Driver Configuration      
    Note the MAC address of each ESP and manually create the /etc/eqnx/esp.conf file. All the esp utility does is create this file; you can do it yourself and save a lot of time. There is no need to set up DHCP for the ESPs this way.

    > service espx stop
    > rmmod espx
    > service espx start
16.5 Start Conserver      
> service conserver start

16.6 Understanding How To Tell if Conserver and Terminal Servers are Working      
To be written. Also bits about cntrl-e-c, etc.

Back to TOC

17. Completing Management Node Setup
Here, we setup the final services necessary for a functioning management node.

17.1 Copy xCAT init Files      
This will enable some services to start at boot time and change the behavior of some existing services.

> cd /opt/xcat/rc.d
> cp atftpd portmap snmptrapd syslog /etc/rc.d/init.d/

There are other init files in /opt/xcat/rc.d that you may wish to use, depending on your installation.

Note: Why portmap?

17.2 Copy the 'post' Files for RedHat 6.2 and RedHat 7.x      
Copy some install files from the xCAT distribution to the post directory that is used during unattended installs:

> cd /opt/xcat/post
> find . | cpio -dump /install/post

17.3 Setup syslog      
Here we enable remote logging...

> cp /opt/xcat/samples/syslog.conf /etc
> touch /var/log/pipemessages
> service syslog restart

On RH7.x based installs, you might want to edit /etc/sysconfig/syslog, adding the -r switch to SYSLOGD_OPTIONS, instead of copying the modified rc.d/syslogd. See the note here (but ignore the watchlogd stuff).
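
For example, a minimal sketch (the exact default options in your /etc/sysconfig/syslog may differ); restart syslog as above after the change:

SYSLOGD_OPTIONS="-r -m 0"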

17.4 Setup snmptrapd      
snmptrapd receives messages from the SPN.

> chkconfig snmptrapd on
> service snmptrapd start

17.5 Setup watchlogd      
watchlogd monitors syslog and sends email alerts to a user-defined list of admins in the event of a hardware error. This functionality (hardware error alerts) requires an SPN.

> chkconfig watchlogd on
> service watchlogd start

You must also set up an alias called 'alerts' in /etc/aliases. This alias is a comma-delimited list of the admins who should receive these hardware alerts as email.
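
For example (hypothetical addresses), add a line like the following to /etc/aliases and then rebuild the alias database:

alerts: admin1@example.com, admin2@example.com

> newaliases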

The above is no longer supported. See this note.

17.6 Install SSH RPMs for RedHat 6.2      
This step isn't required for RedHat 7.x

> cd /opt/xcat/post/rpm62
> rpm -ivh --force --nodeps openssh*.rpm

17.7 Setup NFS and NFS Exports      
Make /etc/exports look something like the following:

/install node*(ro,no_root_squash)
/tftpboot node*(ro,no_root_squash)
/usr/local node*(ro,no_root_squash)
/opt/xcat node*(ro,no_root_squash)
/home node*(rw,no_root_squash)

Turn on NFS:

> chkconfig nfs on
> service nfs start
> exportfs -ar # (to apply the exports)
> exportfs # (to verify)

17.8 Setup NTP      
RedHat 7.x:
> chkconfig ntpd on
> service ntpd start

RedHat 6.2:
> chkconfig xntpd on
> service xntpd start

Note: setupxcat does this now, but you might want to edit the ntpd config file (/etc/ntp.conf) to point to an external time server.
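
For example, a minimal /etc/ntp.conf sketch (ntp.example.com is a placeholder for your site's time server; the driftfile path is the usual RedHat default). Restart ntpd (or xntpd on 6.2) after editing:

server ntp.example.com
driftfile /etc/ntp/drift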

17.9 Setup TFTP      
The following works with RH6.2 and 7.x:

> rpm -ivh /opt/xcat/post/rpmxx/atftp-0.6-1.i386.rpm

> chkconfig atftpd on
> service atftpd start

Test that tftp is working by monitoring /var/log/messages and:

> tftp localhost
> get bogus_file
> quit

You should see tftp try to service the 'get' request in the log output.

Note: The tftpd that comes by default with RedHat 7.2/7.3 might work better than atftpd. atftpd scales well, but it has been known to die mysteriously.

17.10 Initial DHCP Setup      
    17.10.1 Collect the MAC Addresses of Cluster Equipment      
    Place the MAC addresses of cluster equipment that needs to DHCP for an IP address into /opt/xcat/etc/<MANAGEMENT_NET>.tab. See the man page for macnet.tab.

    If you have APC master switches, put their MAC addresses into this file.

    17.10.2 Make the Initial dhcpd.conf Config File      
    > makedhcp --new

    17.10.3 Edit dhcpd.conf      
    Check for anything out of the ordinary
    > vi /etc/dhcpd.conf

    17.10.4 Important DHCP Note      
    You probably don't want DHCP running on the network interface that is connected to the rest of the network. Except in special circumstances, you'll want to remove the network section from dhcpd.conf that corresponds to the external network and then explicitly list the interfaces you want dhcpd to listen on in /etc/rc.d/init.d/dhcpd (leaving out the external interface).

    # BEGIN portion of /etc/rc.d/init.d/dhcpd
    daemon /usr/sbin/dhcpd eth1 eth2
    # END portion of /etc/rc.d/init.d/dhcpd

    On RedHat 7.2, you edit /etc/sysconfig/dhcpd instead of modifying /etc/rc.d/init.d/dhcpd, with something like:

    DHCPDARGS="eth1 eth2"
17.11 Setup NIS      
If you're using the management node as a NIS server for your cluster:
    17.11.1 Verify That xCAT Is Configured for NIS      
    Check stuff in site.tab.

    17.11.2 Setup Management Node as an NIS Server      
    > gennis

17.12 Copy the Custom Kernel to a Place Where Installs Can Access It      
Copy the custom kernel tarball you installed in step 10 to /install/post/kernel/.
Edit the KERNELVER variable where appropriate in /opt/xcat/ksxx/*.kstmp.
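
For example (the tarball name and kernel version here are hypothetical; use whatever you built in step 10):

> cp /tmp/kernel-2.4.9-31custom.tgz /install/post/kernel/
> grep KERNELVER /opt/xcat/ksxx/*.kstmp

Then set KERNELVER in each template to match, e.g. KERNELVER=2.4.9-31custom.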

17.13 Copy the RedHat Install CD(s)      
  • RedHat 7.x
    Do the right thing, substituting 1, 2, or 3 for x for RedHat 7.1, 7.2, or 7.3.
    1. Mount Install CD #1
      > mount /mnt/cdrom

    2. Copy the files from CD #1
      > cd /mnt/cdrom
      > find . | cpio -dump /install/rh7x

    3. Unmount Install CD #1
      > cd / ; umount /mnt/cdrom ; eject

    4. Mount Install CD #2
      > mount /mnt/cdrom

    5. Copy the files from CD #2 (and CD #3 for RH 7.3)
      > cd /mnt/cdrom
      > find RedHat | cpio -dump /install/rh7x

    6. Patch the files
      > /opt/xcat/build/rh7x/applypatch
      Check out this note from the mailing list, if you have storage nodes hooked to mass-storage.

  • RedHat 6.2
    1. Mount the Install CD
      > mount /mnt/cdrom

    2. Copy the files
      > cd /mnt/cdrom
      > find . | cpio -dump /install/rh62

    3. Patch the files
      Accept reverse patch errors (just hit enter)
      The following should be typed as one line
      > patch -p0 -d /install/rh62 < /opt/xcat/build/rh62/ks62.patch

17.14 Generate root's SSH Keypair      
The following command creates an SSH keypair for root with an empty passphrase, sets up root's ssh configuration, and copies the keypair and config to /install/post/.ssh so that all installed nodes will have the same root keypair/config.

> gensshkeys root

Note: This is a good candidate for setupxcat. Also, I've found that I need to modify gensshkeys so that it places "Host myri* node* master*" in .ssh/config (adding master* for the master node).

17.15 Clean Up the Unneeded .tab Files      
In /opt/xcat/etc/, move unneeded .tab files somewhere out of the way, e.g. rtel.tab, tty.tab, etc.
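
For example:

> cd /opt/xcat/etc
> mkdir -p unused
> mv rtel.tab tty.tab unused/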

Back to TOC

18. Collecting MAC Addresses (stage2)
In this section, we collect the MAC addresses of the compute nodes and create entries in dhcpd.conf for them.

18.1 Setup      
> cd /opt/xcat/stage
> ./mkstage

18.2 Prepare to Monitor stage2 Progress      
> wcons -t 8 compute (or a subset like rack01)
> tail -f /var/log/messages

You should always be watching messages; it's a very good way to see what's happening with your cluster, and a great habit to get into.

18.3 Reboot Compute Nodes      
You'll have to do this manually.

When the machines boot, they should PXE boot syslinux, get a dynamic IP address, and then load a Linux kernel and a special RAM disk containing a script that prints each machine's MAC address to the console.

18.4 Observe Output in wcons Windows      
If your terminal servers are working correctly, you should see the machines boot their kernels and then something like this:



A closeup:



18.5 Notes on wcons, xterms and Changing Font Size      
The wcons windows are xterms. When viewing a large number of consoles on the screen at the same time, the xterms come up with the 'unreadable' font size. xterms have a feature that allows you to change the size of the font very easily. This is very useful when you have a screen of 'unreadable' consoles and you want to zoom in on one to view the output in greater detail.

To do this, move the mouse over the text portion of the xterm in question, hold down the control key, and press the right mouse button. You'll see a menu like the following:



Move the mouse down to select a larger font and then release the mouse button as shown:



Using this xterm feature, you can switch to a large font for detailed viewing and back to the 'unreadable' font to view all the consoles at once.

18.6 Notes on wcons, conserver and 'Ctrl-E .'      
This is a placeholder to remind me to document the 'Ctrl-E .' escape sequence that conserver uses to provide a lot of terminal functionality.

18.7 Collect the MACs      
Once you see that all the compute nodes are spitting their MAC addresses out of their serial consoles...
> getmacs compute

18.8 Kill the wcons Windows      
> wkill

18.9 Populate dhcpd.conf      
> makedhcp compute

At this point dhcpd will be running, so you might again want to make certain that it is only listening on the interfaces you want it to.
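
One quick way to double-check (assuming the dhcpd setup from 17.10.4) is to restart dhcpd and look for its "Listening on ..." lines in syslog; you should see one line per listed interface and none for the external interface:

> service dhcpd restart
> grep "Listening on" /var/log/messages
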
18.10 Notes on Collecting MAC addresses without a terminal server      
Configure cisco3500.tab with entries like the following:

node01   ethernet01,1
node02   ethernet01,2
node03   ethernet01,3
node04   ethernet01,4
...

Make nodehm.tab have entries like:
nodexx  mp,mp,mp,mp,mp,conserver,mp,mp,rcons,cisco3500,eepro100,vnc


Make sure the switch has a hostname and that DNS resolves it.
Verify that the nodes plugged into the switch ports match what you put
into cisco3500.tab, i.e. node1 on port 1, node2 on port 2, and so on.

Make sure you can ping the switch, telnet to it, and log in. Make sure the
password you set on the switch is the same as the one in passwd.tab. Put the
nodes in stage2, power them on, and run getmacs as usual. What the getmacs
command does is issue 'show mac-address-table' on the switch and grab the MACs
from it.
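
If you want to check by hand, you can telnet to the switch and run the same command getmacs uses (a rough sketch; you may need to enter enable mode first, depending on how the switch is configured):

> telnet ethernet01
ethernet01> show mac-address-table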

Back to TOC

19. Configuring MPN/ASMA/RSA (stage3)
Stage3 is a mostly automated procedure for configuring the Management Processor Network (MPN) (formerly known as the Service Processor Network) on IBM xSeries machines. This section describes how to perform stage3 with ASMA or RSA adapters. If your cluster doesn't have a Management Processor Network, you can skip this section.

19.1 Read the xCAT xSeries Management Processor HOWTO      
Learn more about the Service Processor Network at: http://xcat.org/docs/managementprocessor-HOWTO.html

19.2 Perform the manual steps      
  • For ASMA adapters:
    1. Download the MPN/ASMA/RSA config utility:
      http://www.pc.ibm.com/qtechinfo/MIGR-4QHMCT.html
      v1.06 at the time of this writing

    2. Create the ASMA config floppy:
      Using DOS (be certain that you use 'command' and not 'cmd' under NT or Win2k), run the .exe and follow the prompts.

    3. Configure ASMA cards:
      For each node that contains an ASMA card, you need to take the following steps, using the MPN/ASMA/RSA setup floppy disk. Once you have booted this floppy disk, select Configuration Settings -> Systems Management Adapter and apply the following changes under Network Settings:
      • Enable network interface
      • Set local IP address for ASMA network interface
      • Set subnet mask
      • Set gateway

  • For RSA adapters:
    1. Record the adapters' MAC addresses:
      Place the MAC addresses in <management-network>.tab (172.17.0.0.tab in our example).

    2. Modify dhcpd.conf and boot the RSA adapters:
      > gendhcp -new
      This puts static entries into dhcpd.conf from <management-network>.tab. Remove the power from the RSA adapters and then reapply it. The adapters should boot with the correct IP addresses.
19.3 Check That the Management Processors Are on the Network      
> pping mpa

This does a parallel ping to all the nodes that are defined as being a member of the group 'asma' in nodelist.tab. If you're using RSA adapters, you may have a group 'rsa' instead of 'asma'. If any of the adapters show up as 'noping', you'll have to investigate why they are not connecting to the network before you continue.

19.4 Program the Management Processors      
> mpasetup asma

This command sets up things like alerts, SNMP information, etc. on each management adapter in the 'asma' group over IP. Again, you might want to use an 'rsa' group if you have RSA adapters.

19.5 Verify the Management Processors Were Programmed Correctly      
> mpacheck asma

At this point all of the Management Processors should be correctly programmed. Next, we'll configure the Service Processor devices that are in each compute node...

19.6 Nodeset      
The following command makes the nodes PXE boot the stage3 image. (it alters the files in /tftpboot/pxelinux.cfg/)
> nodeset compute stage3

19.7 Prepare to Monitor the stage3 Progress      
> wcons -t 8 compute (or a subset like rack01)
> tail -f /var/log/messages (you should always be watching messages)

19.8 Reboot Compute Nodes      
You'll have to do this manually.

19.9 Watch the Show      
You should see the SPN procedure move forward smoothly in all the compute nodes' wcons windows.

After a successful stage3 procedure, your wcons windows should look something like:

wcons windows after stage3

Closeup:

closeup wcons window after stage3

19.10 Test Out Some SPN Commands      
Read the man pages for rvitals, rinv, and rpower, etc. and then try out some of these commands on your cluster.

Example:
> rvitals compute ambtemp

19.11 mpncheck      
> mpncheck compute

Back to TOC

20. Configuring APC Master Switches
An APC Master Switch or similar remotely controlled Power Distribution Unit can be very valuable in easing administration headaches that are sometimes involved in maintaining clusters.

In our example cluster, the APC Master Switch is connected to the management ethernet VLAN and powers the MPAs, terminal servers, and Myrinet switch. After proper configuration, this allows the administrator to reset these devices by cycling their power from their desktop or another remote location with xCAT's rpower command-line utility. This is useful for debugging, troubleshooting, or just dealing with a flaky component, and becomes a real requirement when you only have remote access to a cluster... Who wants to get out of their chair and walk into the server room anyway?

APC's Installation guide can be found here.

20.1 Reset the APC MasterSwitch      
To be written

20.2 Connect to the Switch via a Serial Connection      
To be written

20.3 Configure the Switch's IP Settings      
To be written

20.4 Test Switch's IP Connectivity      
To be written

20.5 Figure out which Devices are connected to what Switch Ports      
To be written

20.6 Verify/Setup apc.tab and nodehm.tab      
To be written

20.7 Test with rpower      
To be written


Back to TOC

21. Installing Compute Nodes
This section describes installing RedHat on the compute nodes via unattended kickstart installs.

21.1 Edit/Generate Kickstart Scripts      
Modify the kickstart template file if needed, substituting your version of RedHat for xx (/opt/xcat/ksxx/computexx.kstmp).

Generate real kickstart scripts from the templates:
> cd /opt/xcat/ksxx; ./mkks

21.2 Nodeset      
The following command makes the nodes PXE boot the RedHat kickstart image. (it alters the files in /tftpboot/pxelinux.cfg/)
> nodeset compute install

21.3 Prepare to Monitor the Installation Progress      
> wcons -t 8 compute (or a subset like rack01)
> tail -f /var/log/messages (you should always be watching messages)

21.4 Reboot the Compute Nodes      
You might want to do only a subset of 'compute'
> rpower compute boot

21.5 A Better Way to Install      
Instead of the three steps nodeset, wcons, and rpower, it's much easier to use the single command:

> winstall -t 8 compute

This command accomplishes the above three commands in one step.

When going through the install procedure, you'll probably want to install onto only a single machine until you're fairly certain that the install is working well, then do installs on the whole 'compute' group.

When installing with wcons, you should see something like the following:

wcons windows during compute node installation

Closeup:

wcons windows during compute node installation

21.6 Installs with No Terminal Servers      
To be written.

21.7 Verify that the Compute Nodes Installed Correctly      
To be written.

21.8 Update the SSH Global Known Hosts File      
> makesshgkh compute (or, again, a subset of 'compute')

21.9 Test SSH and psh      
> psh compute date | sort
The output here will be a good way to see if SSH/gkh is setup correctly on all of the compute nodes (a requirement for most cluster tasks). If a node doesn't appear here correctly, you must go back and troubleshoot the individual node, make certain the install happens correctly, rerun makesshgkh, and finally test again with psh. You really must get psh working correctly before continuing.
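
If everything is working, the output should look something like the following (hypothetical; psh prefixes each line with the node name, and sort groups them):

node01: Mon Sep 23 10:15:01 MDT 2002
node02: Mon Sep 23 10:15:01 MDT 2002
node03: Mon Sep 23 10:15:02 MDT 2002
...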

Back to TOC

22. Installing/Configuring Myrinet Software
The following section only applies to clusters that use Myrinet. It gives an example of creating a GM rpm on RedHat 6.2 and installing this driver on the compute nodes. With RedHat 7.1, the procedure will be slightly different (the kernel version will be different). You may wish to do the rpm building part of this section before you do the previous step to avoid having to install the compute nodes multiple times.

22.1 Make Certain xCAT Configuration Is Ready for Myrinet      
Set up a separate subnet for the Myrinet interfaces, with a host name for each compute node's interface (node01-myri0, node02-myri0, etc.) in /etc/hosts, DNS, etc.
Verify forward and reverse name resolution for these host names.
Add all the hosts that have Myrinet cards to the 'myri' group in nodelist.tab.
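
For example, hypothetical /etc/hosts entries, assuming a dedicated 172.18.0.0/16 subnet is used for the Myrinet interfaces:

172.18.1.1    node01-myri0
172.18.1.2    node02-myri0
172.18.1.3    node03-myri0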

22.2 Get the Latest GM Source from Myricom      
ftp://ftp.myri.com/pub/GM/gm-1.5.1_Linux.tar.gz (contact Myricom for a username/password)

22.3 Copy the Source to a Build Location      
> cp gm-1.5.1_Linux.tar.gz /tmp

22.4 Build the GM rpm      
> cd /tmp ; /opt/xcat/build/gm/gmmaker 1.5.1_Linux
The result should be a GM rpm in /usr/src/redhat/RPMS/i686

If you're using a RedHat kernel, you'll need to prep the kernel tree before you do the above step (clearly some of the details change if you're using a different kernel than the one shown below)...

> cd /usr/src
> ln -s 2.4.9-31 linux
> cd /usr/src/linux
> make mrproper
> cp kernel-2.4.9-i686-smp.config ../.config
> make menuconfig (change PII to PIII), save and exit
> make dep
> make modules (let it run for about 30 seconds, then Ctrl-C); this usually sets up a few links that are required.

22.5 Copy the rpm to a Place That Is Accessible by the Kickstart Install Process      
> cp /usr/src/redhat/RPMS/i686/gm-1.5.1_Linux-2.2.19-4hpc.i686.rpm /install/post/kernel

22.6 Make Kickstart Files Aware of This GM Version      
Edit your .kstmp, setting the GMVER variable to the appropriate value... ( GMVER=1.5.1_Linux in our example )

22.7 Regenerate Kickstart Files      
> cd /opt/xcat/ksxx
> ./mkks

22.8 Do New Compute Node Installs      
Refer to the previous section.

22.9 Look at the Pretty Lights      
Check the Myrinet switch. If the compute nodes come up with the GM driver loaded and all cabling is correct, you should see lights on all the connected ports. Do you see the light?

22.10 Generate Myrinet Routing Maps      
> makegmroutes myri

22.11 Verify Connectivity Between All Myrinet Hosts      
> psh myri /opt/xcat/sbin/gmroutecheck myri
No output means success.

22.12 Install the GM rpm on the Management Node      
> rpm -ivh /usr/src/redhat/RPMS/i686/gm-1.5.1_Linux-2.2.19-4hpc.i686.rpm
If the management node doesn't have a Myrinet card, you'll want to keep GM from loading at boot...
> chkconfig --level 0123456 gm off

Back to TOC

23. Installing Portland Group Compilers
This section is out of date. You will benefit from reading the FAQ at: http://www.pgroup.com/faq.htm

Connect to the Portland Group's ftp server: ftp://ftp.pgroup.com/.

  • If a temporary installation:

    Get x86/linux86-HPF-CC.tar.gz
    Make a temporary directory and move the .tar file there.
    temp> tar -xzvf linux86-HPF-CC.tar.gz
    Read the INSTALL file.
    ./install
    Enter accept
    Enter 5 (PGI workstation or PGI server)
    Install Directory? /usr/local/pgi
    Create eval Licence? Yes
    Would you like the installation to be read only? Yes

    There are probably errors in this documentation... check and correct.
  • For a permanent IBM customer install:

    Get ftp.pgroup.com/x86_ibm/pgi-3.2.i386.rpm
    > rpm -ivh pgi-3.2.i386.rpm
    > cd /usr/src/redhat/BUILD/pgi-3.2
    Edit Install, modifying INSTALL_DIR to use /usr/local/pgi.
    > ./install
    accept
    Create a file with all of the nodes in your cluster when asked by the install script.
    > lmutil lmhostid (If you have a license key)
    Now go to https://www.pgroup.com/License .
    Enter Username: (Provided by PGI)
    Enter Password:
    Click Generate License Key
    Scroll down to the bottom and click the <Issue Main Keys> button. In the FLEXlm hostid field, enter the characters from the lmutil command above.

    Copy the output and replace the hostname in the generated license.
    Look at the path names in this file. Change /usr/pgi/... to /usr/local/pgi/...
    Place the contents in $PGI/license.dat

    Edit /etc/profile, adding (with PGI set to your install directory):
    export PGI=/usr/local/pgi
    export LM_LICENSE_FILE=${PGI}/license.dat


    Edit /usr/local/pgi/linux86/bin/lmgrd.rc, modifying the PGI environment variable to look like this:
    PGI=${PGI:-/usr/local/pgi}
    Start the license manager and have it start at boot:
    > cp /usr/local/pgi/linux86/bin/lmgrd.rc /etc/rc.d/init.d
    > chkconfig --add lmgrd.rc
    > chkconfig --level 345 lmgrd.rc on
    > service lmgrd.rc start

Back to TOC

24. Installing MPICH MPI Libraries
MPI is a standard library used for message passing in parallel applications. This section documents how to install the MPICH MPI implementations that are used over ethernet and Myrinet.

  • MPICH
    Use MPICH if you want to run MPI over ethernet. Skip this if you only want to run MPICH over Myrinet.

    1. Learn about MPICH:
      MPICH's homepage is at: http://www-unix.mcs.anl.gov/mpi/mpich/

    2. Download MPICH:
      ftp://ftp.mcs.anl.gov/pub/mpi/mpich.tar.gz

    3. Build MPICH:
      > cd /opt/xcat/build/mpi
      > cp /where/ever/you/put/it/mpich.tar.gz .
      > mv mpich.tar.gz mpich-1.2.3.tar.gz (or whatever the current version actually is)
      > ./mpimaker (This will show you the possible arguments. You may want to use different ones.)
      > ./mpimaker 1.2.3 smp gnu ssh
      You can 'tail -f mpich-1.2.3/make.log' to view the build's progress.
      When done, you should have stuff in /usr/local/mpich/1.2.3/ip/smp/gnu/ssh

    4. Adjust environment:
      Add the following to your ~/.bashrc. You can put it in /etc/profile if you are only using this MPI lib:
      export MPICH="/usr/local/mpich/1.2.3/ip/smp/gnu/ssh"
      export MPICH_PATH="${MPICH}/bin"
      export MPICH_LIB="${MPICH}/lib"
      export PATH="${MPICH_PATH}:${PATH}"
      export LD_LIBRARY_PATH="${MPICH_LIB}:${LD_LIBRARY_PATH}"

    5. Test the environment:
      After re-sourcing the environment changes that you've made, it's a good idea to validate that everything is correct. A simple, but not complete, way to do this is:

      > which mpicc

      If you're setup for MPICH like in the above example, the output of this command should be:

      /usr/local/mpich/1.2.3/ip/smp/gnu/ssh/bin/mpicc


  • MPICH-GM
    MPICH-GM is a special version of MPICH that communicates over Myrinet's low-level GM layer.
    Use MPICH-GM if you want to run MPI over Myrinet. Skip this if you don't have Myrinet.

    1. Learn about MPICH-GM:
      Some information is available in Myricom's Software FAQ

    2. Download MPICH-GM:
      ftp://ftp.myri.com/pub/MPICH-GM/mpich-1.2.1..7b.tar.gz (contact Myricom for a username / password)

    3. Build MPICH-GM:
      > cd /tmp
      > cp /where/ever/you/put/it/mpich-1.2.1..7b.tar.gz .
      > /opt/xcat/build/mpi/mpimaker (This will show you the possible arguments. You may want to use different ones.)
      > /opt/xcat/build/mpi/mpimaker 1.2.1..7b:1.5.1_Linux-2.4.18-5smp smp gnu ssh
      You can 'tail -f mpich-1.2.1..7b/make.log' to view the build's progress.
      When done, you should have stuff in /usr/local/mpich/1.2.1..7b/gm-1.5.1_Linux-2.4.18-5smp/smp/gnu/ssh

    4. Adjust environment:
      Add the following to your ~/.bashrc. You can put it in /etc/profile if you are only using this MPI lib:
      export MPICH="/usr/local/mpich/1.2.1..7b/gm-1.5.1_Linux-2.4.18-5smp/smp/gnu/ssh"
      export MPICH_PATH="${MPICH}/bin"
      export MPICH_LIB="${MPICH}/lib"
      export PATH="${MPICH_PATH}:${PATH}"
      export LD_LIBRARY_PATH="${MPICH_LIB}:${LD_LIBRARY_PATH}"

    5. Test the environment:
      After re-sourcing the environment changes that you've made, it's a good idea to validate that everything is correct. A simple, but not complete, way to do this is:

      > which mpicc

      If you're setup for MPICH like in the above example, the output of this command should be:

      /usr/local/mpich/1.2.1..7b/gm-1.5.1_Linux-2.4.18-5smp/smp/gnu/ssh/bin/mpicc

Back to TOC

25. Installing LAM MPI Libraries
LAM is a lesser-used alternative to MPICH for message passing with MPI. It is reportedly faster than MPICH over TCP/IP. The stable version of LAM runs only over TCP/IP (no GM). This section documents how to install LAM. Skip this section if you don't need to use LAM.
  1. Learn about LAM:
    LAM's homepage is at http://www.lam-mpi.org/

  2. Download LAM:
    http://www.lam-mpi.org/download/files/lam-6.5.6.tar.gz

  3. Build LAM:
    (You may wish to use /opt/xcat/build/lam/lammaker instead of these instructions. Run it without any arguments for usage.)
    > cd /usr/src
    > cp /where/ever/you/put/it/lam-6.5.6.tar.gz .
    > tar -xzvf lam-6.5.6.tar.gz
    > cd lam-6.5.6
    For the Portland Group compilers:
    > ./configure --prefix=/usr/local/lam-6.5.6/ip/pgi/ssh --with-rpi=usysv \
    --with-rsh='ssh -x' --with-fc=pgf90 \
    --with-cc=pgcc --with-cxx=pgCC
    > make
    > make install
    For the GNU compilers:
    > ./configure --prefix=/usr/local/lam-6.5.6/ip/gnu/ssh --with-rpi=usysv \
    --with-rsh='ssh -x'
    > make
    > make install

  4. Adjust environment:
    Add the following to your ~/.bashrc. You can put it in /etc/profile if you are only using this MPI lib:
    export LAM="/usr/local/lam-6.5.6/ip/gnu/ssh"
    export LAM_PATH="${LAM}/bin"
    export LAM_LIB="${LAM}/lib"
    export PATH="${LAM_PATH}:${PATH}"
    export LD_LIBRARY_PATH="${LAM_LIB}:${LD_LIBRARY_PATH}"

  5. Test the environment:
    After re-sourcing the environment changes that you've made, it's a good idea to validate that everything is correct. A simple, but not complete, way to do this is:

    > which mpicc

    If you're setup for LAM like in the gcc example from above, the output of this command should be:

    /usr/local/lam-6.5.6/ip/gnu/ssh/bin/mpicc

Back to TOC

26. Installing PBS Resource Manager
PBS is a free tool that enables you to run batch jobs on a cluster. Here's how it can be setup quickly to work with our example:

26.1 Learn About PBS      
Check out the PBS homepage at http://www.openpbs.org/. You need to go through some rigmarole to get a username and password. After you get a username and password, download and read the manual.

26.2 Download the PBS Source      
Make certain you have a username and password. You can't access the source without them.
Download the source from: http://www.openpbs.org/UserArea/Download/OpenPBS_2_3_12.tar.gz

26.3 Build PBS      
> cd /tmp
> cp /where/ever/you/put/it/OpenPBS_2_3_16.tar.gz /tmp
> /opt/xcat/build/pbs/pbsmaker OpenPBS_2_3_16.tar.gz scp
(substitute whichever OpenPBS version you actually downloaded for 2_3_16)

You should now have stuff in /usr/local/pbs. Configuration and environment setup is completed in the genpbs step later on.

Back to TOC

27. Installing Maui Scheduler
Maui is an OpenSource scheduler that offers advanced scheduling algorithms and integrates with PBS. Here's how it can be setup quickly to work with our example:

27.1 Learn About Maui      
The homepage is at: http://www.supercluster.org/.
It would be a good idea to read the docs.

27.2 Download Maui Source      
Download the source from: http://supercluster.org/downloads/maui/maui-3.0.7p2.tar.gz

27.3 Build Maui      
> cd /where/ever/you/put/the/tarball
> tar -xzvf maui-3.0.7p2.tar.gz
> cd maui-3.0.7
> ./configure
     Maui Installation Directory? /usr/local/maui
     Maui Home Directory? /usr/local/maui
     Compiler? gcc
     Checksum SEED? 123
     Correct? Y
     Do you want to use PBS? [Y|N] default (Y)
     PBS Target Directory: /usr/local/pbs
> make
> make install

You should now have stuff in /usr/local/maui. Environment setup and configuration will happen in the genpbs section later on.

> mkdir /var/log/maui


Back to TOC

28. Deploying PBS on the Cluster
If you're running PBS and Maui and you've installed them in the above two steps, you'll want to follow the instructions in this section to finish their setup and deploy them on the compute nodes.

28.1 Deploy      
> genpbs compute
(Where 'compute' is a nodelist.tab group that includes all your compute nodes)

28.2 Verify      
> showq
An example of part of the expected output on a 32 node, dual CPU cluster follows:
0 Active Jobs       0 of  64 Processors Active (0.00%)
                    0 of  32 Nodes Active      (0.00%)
If you don't see this kind of output, something is wrong with your PBS setup. You should fix it before you continue.

Back to TOC

29. Adding Users and Setting Up User Environment
There are a number of things to set up for each user before they can run jobs within the framework of the example architecture. Some of the things covered in this section have been covered previously. Use your judgment on if and where to apply the following:

29.1 Setup MPI and Other Environment in /etc/skel/      
If you don't set the MPI environment globally in /etc/profile, /etc/csh.login, or /etc/profile.d/mpi.sh, and you haven't done this already, you'll need to add something like the following to /etc/skel/.bashrc (and the csh equivalent to /etc/skel/.cshrc), so that the addclusteruser command below will automatically pick up this environment. This example is for MPICH-GM; use whatever MPI library your users plan on using:

export MPICH="/usr/local/mpich/1.2.1..7/gm-1.5_Linux-2.2.19-4hpc/smp/gnu/ssh"
export MPICH_PATH="${MPICH}/bin"
export MPICH_LIB="${MPICH}/lib"
export PATH="${MPICH_PATH}:${PATH}"
export LD_LIBRARY_PATH="${MPICH_LIB}:${LD_LIBRARY_PATH}"
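
A rough csh equivalent for /etc/skel/.cshrc (same hypothetical paths as above; this sketch assumes PATH and LD_LIBRARY_PATH are already set in the environment):

setenv MPICH "/usr/local/mpich/1.2.1..7/gm-1.5_Linux-2.2.19-4hpc/smp/gnu/ssh"
setenv MPICH_PATH "${MPICH}/bin"
setenv MPICH_LIB "${MPICH}/lib"
setenv PATH "${MPICH_PATH}:${PATH}"
setenv LD_LIBRARY_PATH "${MPICH_LIB}:${LD_LIBRARY_PATH}"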

29.2 Add User      
> addclusteruser username

This command automates a lot of user setup. More on what it does after I get a chance to play with it a little. I wish there was a man page, but the source tells all.


Back to TOC

30. Verifying Cluster Operation With Pseudo Jobs and PBS
At this point, the cluster is almost ready to go. This section outlines a number of tests that will show that the infrastructure is in place for jobs to be successfully run on the cluster:

30.1 Check that the Compute Nodes Know About Your Users      
> ssh node01 (as root)
> su - ibm (or whatever user you're testing)
> touch ~/test_of_read_write_home_dir (to verify the home directory is writable)

If you can't su to this user, something is wrong. If you're using NIS, there's an NIS problem... 'ypwhich; ypcat passwd' to test. If you're not using NIS, you probably haven't added this user to the compute node's /etc/passwd, etc. If you can't touch a file in the user's home directory, you don't have a writable home directory... fix it.

Make the above test work before continuing.

30.2 Test ssh to Compute Node as Regular User      
> su - ibm (or whatever user you're testing (on the management node))
> ssh node01

If you're using access control, like we are in the example xCAT configuration (ACCESS = Y in noderes.tab for node01), you should get a permission denied error. This is the correct behavior: users can only ssh to a node after PBS has allocated it to them. If you're not using access control, you should be able to ssh to node01 as a regular user.

30.3 Test Interactive PBS Job to a Single Node      
This will validate that PBS is working and that your user's ssh key-pair is setup correctly.

Request a job:
> su - ibm (or whatever user you're testing (on the management node))
> qsub -l nodes=1,walltime=10:00 -I

This submits an interactive job to PBS, asking for 1 node for 10 minutes. After a bit, PBS should put you on one of the compute nodes. This should look something like:
qsub: waiting for job 1.man-c to start
qsub: job 1.man-c ready

----------------------------------------
Begin PBS Prologue Tue Oct 30 16:44:56 MST 2001
Job ID:		1.man-c
Username:	ibm
Group:		ibm
Nodes:		node32
End PBS Prologue Tue Oct 30 16:44:56 MST 2001
----------------------------------------
[ibm@node32 ibm]$
If it doesn't, your PBS setup is broken. Fix it before continuing. If you get a permission denied type of error, your user's ssh key-pair or ssh configuration isn't setup correctly. Fix it before continuing.

When you can successfully get to the compute node via PBS as a regular user, try to ssh back to the head node. You should be able to without supplying a password. If you can't, something's broken. Fix it before continuing.

30.4 Test Interactive PBS Job to Multiple Nodes      
This will validate that PBS is working, that your user's ssh key-pair is setup correctly, and that jobs will work across more than one compute node.

Request a job:
> su - ibm (or whatever user you're testing (on the management node))
> qsub -l nodes=2,walltime=10:00 -I

This submits an interactive job to PBS, asking for 2 nodes for 10 minutes. After a bit, PBS should put you on one of the compute nodes and give you a list of the compute nodes that you have access to. This should look something like:
qsub: waiting for job 2.man-c to start
qsub: job 2.man-c ready

----------------------------------------
Begin PBS Prologue Tue Oct 30 16:47:36 MST 2001
Job ID:		2.man-c
Username:	ibm
Group:		ibm
Nodes:		node31 node32
End PBS Prologue Tue Oct 30 16:47:36 MST 2001
----------------------------------------
[ibm@node32 ibm]$ 
Test using ssh between the compute nodes you have access to and the headnode. You should be able to ssh to and from all these nodes without supplying a password. If you can't, something's broken. Fix it before continuing.

Back to TOC

31. Running a Simple MPI Job Interactively via PBS
This section outlines how to build and run a simple MPI job:
  1. Be ready:
    Make certain that everything from section 24 works.

  2. Build a simple test MPI program:
    Here we build the simple MPI program cpi with a few different MPI libraries. cpi is a C program that calculates the value of pi using numerical integration. You only need to build with the libs that you are interested in running. cpi is not a particularly good test in terms of completeness or performance, but it does serve as a good first step for validating MPI and parallel operation.

    Create a place to build:
    > su - ibm (or whatever user you're using to test)
    > mkdir ~/cpi

    Copy the cpi source to this directory:
    > cp /where/you/put/it/mpich-1.2.2.3.tar.gz ~/cpi/
    > cd ~/cpi
    > tar -xzvf mpich-1.2.2.3.tar.gz
    > cp mpich-1.2.2.3/examples/basic/cpi.c ~/cpi

    Build the program:
    These are just examples. Exact path names, etc. may vary with your setup.

    • MPICH-IP
      1. Verify environment:
        > su - ibm (or whatever user you're using to test)
        Make certain that your environment (section 23) is setup correctly for MPICH-IP. i.e. 'which mpicc' should result in /usr/local/mpich/1.2.2.3/ip/smp/gnu/ssh/bin/mpicc, etc.
      2. Build:
        > cd ~/cpi
        > mkdir mpich-ip; cd mpich-ip
        > cp ~/cpi/cpi.c ~/cpi/mpich-ip
        > mpicc -o cpi cpi.c

    • MPICH-GM
      1. Verify environment:
        > su - ibm (or whatever user you're using to test)
        Make certain that your environment (section 23) is setup correctly for MPICH-GM. i.e. 'which mpicc' should result in something like /usr/local/mpich/1.2.1..7/gm-1.5_Linux-2.2.19-4hpc/smp/gnu/ssh/bin/mpicc, etc.
      2. Build:
        > cd ~/cpi
        > mkdir mpich-gm; cd mpich-gm
        > cp ~/cpi/cpi.c ~/cpi/mpich-gm
        > mpicc -o cpi cpi.c

    • LAM-IP
      1. Verify environment:
        > su - ibm (or whatever user you're using to test)
        Make certain that your environment (section 23) is setup correctly for LAM-IP. i.e. 'which mpicc' should result in /usr/local/lam/blah-blah/mpicc, etc.
      2. Build:
        > cd ~/cpi
        > mkdir lam-ip; cd lam-ip
        > cp ~/cpi/cpi.c ~/cpi/lam-ip
        > mpicc -o cpi cpi.c

  3. Run simple MPI jobs interactively:
    These are just examples. Exact path names, etc. may vary with your setup.

    • MPICH-IP
      1. Verify environment:
        > su - ibm (or whatever user you're using to test)
        Make certain that your environment (section 23) is setup correctly for MPICH-IP. i.e. 'which mpicc' should result in /usr/local/mpich/1.2.2.3/ip/smp/gnu/ssh/bin/mpicc, etc.
      2. Request resources and run the job:
        Your session should look something like the following...
        [ibm@man-c ibm]$ cd ~/cpi/mpich-ip
        [ibm@man-c mpich-ip]$ qsub -l nodes=4,walltime=10:00:00 -I
        qsub: waiting for job 3.man-c to start
        qsub: job 3.man-c ready
        
        ----------------------------------------
        Begin PBS Prologue Tue Oct 30 19:35:17 MST 2001
        Job ID:         3.man-c
        Username:       ibm
        Group:          ibm
        Nodes:          node32 node31 node30 node29
        End PBS Prologue Tue Oct 30 19:35:17 MST 2001
        ----------------------------------------
        [ibm@node32 ibm]$ cd $PBS_O_WORKDIR
        [ibm@node32 mpich-ip]$ which mpirun
        /usr/local/mpich/1.2.2.3/ip/smp/gnu/ssh/bin/mpirun
        [ibm@node32 mpich-ip]$ NP=$(wc -l $PBS_NODEFILE | awk '{print $1}')
        [ibm@node32 mpich-ip]$ mpirun -machinefile $PBS_NODEFILE  -np $NP cpi
        Process 0 of 4 on node1
        pi is approximately 3.1415926544231239, Error is 0.0000000008333307
        wall clock time = 0.002015
        Process 3 of 4 on node4
        Process 1 of 4 on node2
        Process 2 of 4 on node3
        [ibm@node32 mpich-ip]$ logout
        qsub: job 3.man-c completed
        [ibm@man-c mpich-ip]$
        
    • MPICH-GM
      1. Verify environment:
        > su - ibm (or whatever user you're using to test)
        Make certain that your environment (section 23) is setup correctly for MPICH-GM. i.e. 'which mpicc' should result in something like /usr/local/mpich/1.2.1..7/gm-1.5_Linux-2.2.19-4hpc/smp/gnu/ssh/bin/mpicc, etc.
      2. Request resources and run the job:
        Your session should look something like the following...
        [ibm@man-c ibm]$ cd ~/cpi/mpich-gm
        [ibm@man-c mpich-gm]$ qsub -l nodes=4:ppn=2,walltime=10:00:00 -I
        qsub: waiting for job 4.man-c to start
        qsub: job 4.man-c ready
        
        ----------------------------------------
        Begin PBS Prologue Tue Oct 30 17:59:06 MST 2001
        Job ID:         3.man-c
        Username:       ibm
        Group:          ibm
        Nodes:          node32 node31 node30 node29
        End PBS Prologue Tue Oct 30 17:59:06 MST 2001
        ----------------------------------------
        [matt@node1 matt]$ test -d ~/.gmpi || mkdir ~/.gmpi
        [matt@node1 matt]$ GMCONF=~/.gmpi/conf.$PBS_JOBID
        [matt@node1 matt]$ /usr/local/xcat/bin/pbsnodefile2gmconf $PBS_NODEFILE \
        >$GMCONF
        [matt@node1 matt]$ NP=$(head -1 $GMCONF)
        [matt@node1 matt]$ cd $PBS_O_WORKDIR
        [matt@node1 mpich-gm]$ RECVMODE="polling"
        [matt@node1 mpich-gm]$ mpirun.ch_gm --gm-f $GMCONF --gm-recv $RECVMODE \
        --gm-use-shmem -np $NP PBS_JOBID=$PBS_JOBID cpi
        
        Process 4 of 8 on node32
        Process 1 of 8 on node31
        Process 6 of 8 on node30
        Process 7 of 8 on node29
        Process 5 of 8 on node31
        Process 2 of 8 on node30
        Process 0 of 8 on node32
        pi is approximately 3.1415926544231247, Error is 0.0000000008333316
        wall clock time = 0.000805
        Process 3 of 8 on node29
        [matt@node1 mpich-gm]$ logout
        qsub: job 4.man-c completed
        [matt@man-c mpich-gm]
        
    • LAM-IP
      1. Verify environment:
        > su - ibm (or whatever user you're using to test)
        Make certain that your environment (section 23) is setup correctly for LAM-IP. i.e. 'which mpicc' should result in /usr/local/lam/6.5.4/ip/gnu/ssh/bin/mpicc, etc.
      2. Request resources and run the job:
        Your session should look something like the following...
        [ibm@man-c ibm]$ cd ~/cpi/lam-ip
        [ibm@man-c lam-ip]$ qsub -l nodes=4:ppn=2,walltime=10:00:00 -I
        qsub: waiting for job 4.man-c to start
        qsub: job 4.neptune ready
        
        ----------------------------------------
        Begin PBS Prologue Tue Oct 30 20:19:06 MST 2001
        Job ID:         4.man-c
        Username:       ibm
        Group:          ibm
        Nodes:          node32 node31 node30 node29
        End PBS Prologue Tue Oct 30 20:19:06 MST 2001
        ----------------------------------------
        [ibm@node32 ibm]$ which lamboot
        /usr/local/lam/6.5.4/ip/gnu/ssh/bin/lamboot
        [ibm@node32 ibm]$ lamboot -v $PBS_NODEFILE
        
        LAM 6.5.4/MPI 2 C++/ROMIO - University of Notre Dame
        
        Executing hboot on n0 (node32 - 2 CPUs)...
        Executing hboot on n1 (node2 - 2 CPUs)...
        Executing hboot on n2 (node3 - 2 CPUs)...
        Executing hboot on n3 (node4 - 2 CPUs)...
        [ibm@node32 ibm]$ which mpirun
        /usr/local/lam/6.5.4/ip/gnu/ssh/bin/mpirun
        [ibm@node32 ibm]$ cd $PBS_O_WORKDIR
        [ibm@node32 lam-ip]$ mpirun C cpi
        Process 0 of 8 on node32
        Process 1 of 8 on node32
        Process 2 of 8 on node2
        Process 3 of 8 on node2
        Process 4 of 8 on node3
        Process 6 of 8 on node4
        Process 7 of 8 on node4
        pi is approximately 3.1415926544231247, Error is 0.0000000008333316
        wall clock time = 0.000807
        Process 5 of 8 on node3
        [ibm@node32 lam-ip]$ lamclean
        [ibm@node32 lam-ip]$ logout
        qsub: job 4.man-c completed
        [ibm@man-c lam-ip]$
        


Back to TOC

32. Running a Simple MPI Job in Batch Mode via PBS
Now we're ready to run cpi in batch mode via PBS:
  1. Grok /opt/xcat/samples/pbs/ in fullness:
    Before running batch jobs via PBS is a good time to scan the sample PBS batch files that ship with the xCAT distribution in /opt/xcat/samples/pbs. It is also a good idea to examine the PBS documentation again and to search the web for PBS examples that use the MPI libraries you are using.

  2. Again make certain your environment is correct:
    Understand that the user's environment ($PATH, etc.) has to be correct for the MPI library you are using.

  3. Run the sample MPI program via PBS:
    • MPICH-IP
      1. Create your PBS file:

        [ibm@man-c ibm]$ cd ~/cpi/mpich-ip

        cpi-mpichip.pbs should look something like the following...
        #!/bin/bash
        #PBS -l nodes=4:ppn=2,walltime=00:30:00
        #PBS -N cpi
        PROG=cpi
        
        #How many proc do I have?
        NP=$(wc -l $PBS_NODEFILE | awk '{print $1}')
        
        #messing with this can increase performance or break your code
        export P4_SOCKBUFSIZE=0x40000
        #export P4_GLOBMEMSIZE=33554432
        export P4_GLOBMEMSIZE=16777296
        #export P4_GLOBMEMSIZE=8388648
        #export P4_GLOBMEMSIZE=4194304
        
        #cd into the directory where I typed qsub
        cd $PBS_O_WORKDIR
        
        #run it
        mpirun -machinefile $PBS_NODEFILE -np $NP $PROG
        
      2. Submit your job and watch its progress.
        An example session follows:
        [ibm@man-c mpich-ip]$ pwd
        /home/ibm/cpi/mpich-ip
        [ibm@man-c mpich-ip]$ qsub cpi-mpichip.pbs
        [ibm@man-c mpich-ip]$ showq 
        ACTIVE JOBS--------------------
        JOBNAME USERNAME    STATE  PROC   REMAINING            STARTTIME
        
        5.man-c      ibm  Running     8     0:15:00  Wed Oct 31 12:06:33
        
        1 Active Job        8 of   64 Processors Active (12.50%)
                            4 of   32 Nodes Active      (12.50%)
      3. Observe the output
        [ibm@man-c mpich-ip]$ ls -lrt
        -rw-------    1 ibm     ibm         1031 Oct 31 12:06 cpi.o5
        -rw-------    1 ibm     ibm            0 Oct 31 12:06 cpi.e5
        You'll note some errors in this output. They don't seem to be fatal and only appear when using multiple CPUs/node (SMP/shared memory)
        # cpi.o5
        ----------------------------------------
        Begin PBS Prologue Thu Nov  1 14:29:45 MST 2001
        Job ID:         5.man-c
        Username:       ibm
        Group:          ibm
        Nodes:          node32 node31 node30 node29
        End PBS Prologue Thu Nov  1 14:29:45 MST 2001
        ----------------------------------------
        Process 0 of 8 on node32
        pi is approximately 3.1415926544231247, Error is 0.0000000008333316
        wall clock time = 0.002537
        Process 2 of 8 on node31
        Process 4 of 8 on node30
        Process 6 of 8 on node29
        Process 3 of 8 on node31
        p3_14667:  p4_error: OOPS: semop lock failed
        : 983043
        Process 5 of 8 on node30
        p5_14865:  p4_error: OOPS: semop lock failed
        : 983043
        Process 7 of 8 on node29
        p7_14148:  p4_error: OOPS: semop lock failed
        : 983043
        Process 1 of 8 on node32
        p1_3787: (0.829781) net_recv failed for fd = 5
        p1_3787:  p4_error: net_recv read, errno = : 9
        ----------------------------------------
        Begin PBS Epilogue Thu Nov  1 14:29:52 MST 2001
        Job ID:         5.man-c
        Username:       ibm
        Group:          ibm
        Job Name:       cpi
        Session:        3490
        Limits:         neednodes=4:ppn=2,walltime=00:30:00
        Resources:      cput=00:00:00,mem=10844kb,vmem=39444kb,walltime=00:00:02
        Queue:          dque
        Account:
        Nodes:          node32 node31 node30 node29
        
        Killing leftovers...
        node1:  killing node32 3792
        node3:  killing node30 14865
        
        End PBS Epilogue Thu Nov  1 14:29:53 MST 2001
        ----------------------------------------
        


    • MPICH-GM
      1. Create your PBS file:
        [ibm@man-c ibm]$ cd ~/cpi/mpich-gm

        It (cpi-mpichgm.pbs) should look something like the following...
        #!/bin/bash
        #PBS -l nodes=4:ppn=2,walltime=00:15:00
        #PBS -N cpi
        
        #prog name
        PROG=cpi
        PROGARGS=""
        
        #Make the ~/.gmpi directory for the gm conf files if it does not exist
        test -d ~/.gmpi || mkdir ~/.gmpi
        
        #Define unique gm conf filename
        GMCONF=~/.gmpi/conf.$PBS_JOBID
        
        #Make gm conf file from pbs nodefile
        if /usr/local/xcat/bin/pbsnodefile2gmconf $PBS_NODEFILE >$GMCONF
        then
                :
        #       echo "GM Nodefile:"
        #       echo
        #       cat $GMCONF
        else
                echo "pbsnodefile2gmconf failed to create gm conf file!"
                exit
        fi
        
        #How many proc do I have?
        NP=$(head -1 $GMCONF)
        
        #cd into the directory where I typed qsub
        cd $PBS_O_WORKDIR
        
        #Set receive mode, default: polling
        #RECVMODE="blocking"
        #RECVMODE="hybrid"
        RECVMODE="polling"
        
        #remove --gm-use-shmem if you do not want to use shared memory
        #
        #use --gm-v and --gm-recv-verb for additional info at run,
        #check both .o and .e files for output
        mpirun.ch_gm \
                --gm-f $GMCONF \
                --gm-recv $RECVMODE \
                --gm-use-shmem \
                --gm-kill 5 \
                -np $NP \
                PBS_JOBID=$PBS_JOBID \
                TMPDIR=/scr/$PBS_JOBID \
                $PROG $PROGARGS
        
        #clean up
        rm -f $GMCONF
        
        exit 0
      2. Submit your job and watch its progress.
        An example session follows:
        [ibm@man-c mpich-gm]$ pwd
        /home/ibm/cpi/mpich-gm
        [ibm@man-c mpich-gm]$ qsub cpi-mpichgm.pbs
        [ibm@man-c mpich-gm]$ showq 
        ACTIVE JOBS--------------------
        JOBNAME USERNAME  STATE    PROC  REMAINING            STARTTIME
        
        6.man-c      ibm  Running     8    0:15:00  Wed Oct 31 12:06:33
        
        1 Active Job      8 of   64 Processors Active (12.50%)
                          4 of   32 Nodes Active      (12.50%)
      3. Observe the output
        [ibm@man-c mpich-gm]$ ls -lrt
        -rw-------    1 ibm     ibm         1031 Oct 31 12:06 mcpi.o6
        -rw-------    1 ibm     ibm            0 Oct 31 12:06 mcpi.e6
        # mcpi.o6
        ----------------------------------------
        Begin PBS Prologue Wed Oct 31 12:06:34 MST 2001
        Job ID:         6.man-c
        Username:       ibm
        Group:          ibm
        Nodes:          node32 node31 node30 node29
        End PBS Prologue Wed Oct 31 12:06:34 MST 2001
        ----------------------------------------
        Process 7 of 8 on node29
        Process 4 of 8 on node32
        Process 5 of 8 on node31
        Process 1 of 8 on node31
        Process 2 of 8 on node30
        Process 3 of 8 on node29
        Process 6 of 8 on node30
        Process 0 of 8 on node32
        pi is approximately 3.1415926544231247, Error is 0.0000000008333316
        wall clock time = 0.000789
        ----------------------------------------
        Begin PBS Epilogue Wed Oct 31 12:06:48 MST 2001
        Job ID:         6.man-c
        Username:       ibm
        Group:          ibm
        Job Name:       cpi
        Session:        20298
        Resources:      cput=00:00:00,mem=320kb,vmem=1400kb,walltime=00:00:10
        Queue:          dque
        Account:
        Nodes:          node32 node31 node30 node29
        
        Killing leftovers...
        
        End PBS Epilogue Wed Oct 31 12:06:49 MST 2001
        ----------------------------------------


    • LAM-IP
      1. Create your PBS file:
        [ibm@man-c lam-ip]$ cd ~/cpi/lam-ip
        It (cpi-lam.pbs) should look something like the following...
        #!/bin/bash
        #PBS -l nodes=4:ppn=2,walltime=00:30:00
        #PBS -N cpi
        
        #How many processors do I have?
        NP=$(wc -l $PBS_NODEFILE | awk '{print $1}')
        
        #cd into the directory where I typed qsub
        cd $PBS_O_WORKDIR
        
        #lamboot
        lamboot $PBS_NODEFILE
        
        #run it
        mpirun C cpi
        
        #cleanup
        lamclean
      2. Submit your job and observe its progress
        An example session with a bit of commentary follows:
        [ibm@man-c lam-ip]$ qsub cpi-lam.pbs
        [ibm@man-c lam-ip]$ showq  (note the active nodes and processors)
        ACTIVE JOBS--------------------
        JOBNAME USERNAME    STATE PROC REMAINING            STARTTIME
        
        7.man-c      ibm  Running    8   0:30:00  Tue Oct 30 20:40:24
        
        1 Active Job        8 of   64 Processors Active (12.50%)
                            4 of   32 Nodes Active      (12.50%)
      3. Observe the output
        [ibm@man-c lam-ip]$ ls -l
        -rw-------    1 ibm     ibm            0 Oct 30 20:40 mcpi.e7
        -rw-------    1 ibm     ibm         1189 Oct 30 20:40 mcpi.o7
        
        # mcpi.o7
        ----------------------------------------
        Begin PBS Prologue Tue Oct 30 20:40:25 MST 2001
        Job ID:         7.man-c
        Username:       ibm
        Group:          ibm
        Nodes:          node32 node31 node30 node29
        End PBS Prologue Tue Oct 30 20:40:25 MST 2001
        ----------------------------------------
        
        LAM 6.5.4/MPI 2 C++/ROMIO - University of Notre Dame
        
        Process 0 of 8 on node32
        pi is approximately 3.1415926544231247, Error is 0.0000000008333316
        wall clock time = 0.000757
        Process 1 of 8 on node32
        Process 2 of 8 on node31
        Process 6 of 8 on node29
        Process 4 of 8 on node30
        Process 3 of 8 on node31
        Process 7 of 8 on node29
        Process 5 of 8 on node30
        ----------------------------------------
        Begin PBS Epilogue Tue Oct 30 20:40:31 MST 2001
        Job ID:         7.man-c
        Username:       ibm
        Group:          ibm
        Job Name:       cpi
        Session:        8009
        Resources:      cput=00:00:00,mem=320kb,vmem=1400kb,walltime=00:00:02
        Queue:          dque
        Account:
        Nodes:          node32 node31 node30 node29
        
        Killing leftovers...
        node32:  killing node32 8085
        node30:  killing node30 6201
        node31:  killing node31 6302
        node29:  killing node29 6201
        
        End PBS Epilogue Tue Oct 30 20:40:32 MST 2001
        ----------------------------------------


  4. Running jobs that are a bit more substantial:
    1. Adjust your .pbs file to make the cpi job run on all of the CPUs in your cluster (see the sketch after this list).

    2. Submit a bunch of cpi jobs into the queue and watch their progress with showq and qstat.

    3. Check out http://xcat.org/docs/running_mpi_jobs-HOWTO.html for more examples of running MPI benchmarks and test jobs.

    4. Check out http://xcat.org/docs/top500-HOWTO.html for documentation on running the HPL benchmark

    5. Look for more documentation on this subject in the future.
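
    For step 1, a hedged sketch of the resource line change, assuming the 32-node dual-CPU example cluster (64 CPUs total):

      #PBS -l nodes=32:ppn=2,walltime=00:30:00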



Back to TOC

33. Performing Advanced Tasks
This section is a placeholder to remind me to provide information about GPFS, storage nodes, and other more advanced stuff.

Back to TOC

34. Contributing to xCAT
Join the xCAT-dev mailing list and post your suggestions, bug-fixes, code, etc.

Back to TOC

35. Changelog
2002-09-27   Reference the new redbook.
2002-06-20   a few things added to the todo section. todos and notes added to main document. turn off conserver before configuring terminal servers. no need to chkconfig conserver on or copy init.d files for it.
2002-06-18   setupxcat must be run after .tab files are setup. nodetype.tab can not contain comments. More info on setup of extreme ethernet switches.
2002-06-16   Got rid of many references to /usr/local/xcat that are now /opt/xcat. Updated versions of PBS and Maui. Feature list. Got rid of blipfert download references.
2002-06-06   Updates to reflect xCAT 1.1RC9.2. RH 7.3 stuff.
2002-04-18   "Contributing to xCAT" section. New stuff in "Understanding What xCAT Does".
2002-04-11   Modified noderes.tab example to match xCAT v 1.1RC8.3
2002-03-26   Removed myri.com user/pass. Added note about bug in mpimaker.
2002-03-17   More ethernet switch setup up. xCAT is at 1.1RC8.3
2002-03-12   Added "Configuring the Ethernet Switch" section. Added stuff about collecting MAC addresses without term servers.
2002-03-06   New section: "Configuring APC Master Switches". Added info about x330 8674 stage1 image and howto create a custom stage1 image with xcmosutil.
2002-03-05   New sections: "Understanding What xCAT Does", "Getting the xCAT Software Distribution", "Getting Help", "Performing Advanced Tasks".
2002-03-01   A little more stage2 detail. Howto change font sizes in an xterm. Better screen shots for stage2, stage3, and install.
2002-02-23   Links to firmware upgrades. Updated GM/MPICH-GM versions. Updated ServeRAID to v4.84.
2002-02-17   Added images of wcons install process to stage3 and compute node install sections.
2002-02-12   xCAT 1.1RC8.1
2002-01-28   New versions of LAM-MPI and MPICH
2002-01-27   Spelling fixes.
2002-01-20   Added another diagram to the architecture section.
2002-01-22   Added stuff to be controlled by apc master switch to nodehm.tab
2002-01-16   BIOS v1.02 stage1 dd for 8674. Don't use COM2. noderes.tab changed to '0' (COM1). Changed serialmac to 0 in site.tab.
2002-01-09   gensshkeys, not makesshkeys. DHCPDARGS="eth1 eth2" (the quotes are important)
2002-01-06   Don't need nodeset for initial stage2. Fixed mpacheck/mpncheck confusion. winstall as alternative to 'nodeset; rpower'. Use addcluster user and recommend using /etc/skel, etc.
2002-01-04   watchlogd setup documented. Don't need to install ssh rpms for RH 7.x. Don't need to setup ssh keys for root by hand. dhcpd configuration options on RH7.x. Use maui 3.0.7p1.
2002-01-03   Subsection formating changes to they stick out better... not quite done yet though. More, new, and improved ELS and ESP config documentation.
2001-12-21   Updated to xCAT 1.1RC7.5. Added RSA stuff and documented the more automated way of configuring the ASMA adapters. Added Copyright stuff in the intro. All links to xCAT software are password protected. Links to bohnsack.com documentation have been removed.
2001-12-18   Added links to man pages where possible. More content in the architecture section.
2001-12-10   Updated to xCAT 1.1RC7.4
2001-12-09   Many formatting adjustments to make it work inside of xcat.org. Changed .png to .gif to make it display better with Netscape.
2001-12-08   Changed to PHP from HTML::Mason and moved to xcat.org.
2001-12-06   Updated xCAT version to 1.1RC7.3.
2001-11-09   Updated GM version to 1.5_Linux. Updated xCAT version to 1.1RC7.2.
2001-11-01   Added documentation on running MPICH-IP cpi via a PBS batch job.
2001-10-31   Added documentation on running MPICH-GM cpi via a PBS batch job.
2001-10-30   Added a .png diagram to the architecture section. Filled in the testing cluster operation section with major new content (and split it into 3 sections in the process). Updated GM version to 1.5_Linux_beta3. Moved additional reading to the top of the document.
2001-10-29   Fixed a bunch of spelling mistakes.... I'm certain new ones have already surfaced.
2001-10-26   Started keeping track of changes. Added xCAT configuration examples. Added network configuration examples. Added architecture section. Added more stuff in the introduction. Moved VLAN configuration examples to a separate document. Moved ESP configuration to a separate document. Added examples of what services to remove. Fixed sequence problem with installing the correct OpenSSH rpms. Split LAM and MPICH into separate sections. Added stuff about making ypbind come up at boot on the management node.


Back to TOC

36. TODO
  • Tonko's new stage1 stuff
  • /install/post/sync documentation
  • Document passwd/group/user sync with prsync
  • Patches for PGI on RedHat 7.x
  • New GM version. Running gm_stress
  • PVM stuff
  • 'makedns' section is very unclear... improve. Also explain why we use a bogus DNS sub-domain and DNS forwarders.
  • Include some simple stuff about using conserver cntrl-e-c, etc.
  • dhcp setup is unclear or incorrect in parts.
  • How to deal with 'user' or 'interactive' nodes... PBS, PXE and ks on an x340, etc.
  • How to install a RedHat update kernel as a part of kickstart
  • ia64 stuff.
  • Changing the enable and telnet passwords on cisco 3500s.
  • Make certain each section has a few sentences at the top that explain why we want to do what we're about to do.
  • Validate the correctness of licensed PGI compiler install... possibly condense into only talking about the licensed install
  • Give the stage1, stage2, and stage3 sections intro material that explains why we do each stage
  • Provide content in the 'before you begin' section
  • Flesh out 'verify cluster is set up for NIS'
  • Move a number of the other docs into this document
  • Spelling check
  • Go through the redbook and add any content that is in the redbook, missing from this document, and relevant.
  • Sync with Egan and others before the 1.1 release so this document can be included as an accurate guide with the release.
  • Work so it's easy to provide single-page and multi-page versions of the document in HTML.
  • Convert to DocBook, work on stylesheets, and auto-generate HTML, PDF, man, and text document versions... yeah right.

Back to TOC

37. Thanks
Thanks go out to the following people. They helped this document become what it is today.
  • Mike Galicki, Chris DeYoung, Mark Atkinson, Greg Kettmann, people from POSDATA, Kevin Rudd, and Tonko L De rooy, for suggestions and contributions.
  • Egan Ford for writing xCAT, answering my xCAT questions, and contributing to this document.
  • Jay Urbanski for answering my xCAT questions and getting me started with Linux clustering at IBM.

Back to TOC
