xCAT Warewulf (wareCAT) HOWTO (WIP)

Introduction
Installation Notes (Homogeneous)
Advanced Notes (Heterogeneous)
Hybrid RAM/NFS File System
Local HD Notes
Warewulf Command Notes
MPI Notes
MTU 9000 Notes
PPC64 Notes
Support

Introduction

Warewulf is a popular, complete diskless solution for HPC clusters, and xCAT is a complete diskfull solution for large HPC clusters.  wareCAT is support added to xCAT for Warewulf, i.e. xCAT using Warewulf, not Warewulf using xCAT.  xCAT is the primary cluster management package; Warewulf is used for maintaining and monitoring diskless images and nodes.

In a nutshell, Warewulf creates a number of kernel and initrd images and places them in /tftpboot.  Nodes boot these images over the network using DHCP and TFTP.  xCAT provides a scalable DHCP and TFTP infrastructure.

Supported configurations:

  1. Homogeneous Simple.  A single xCAT/WW management node serving all nodes.  This single node acts as the only DHCP and TFTP server.  This solution scales to around 128 nodes; if staggered boot is used, it can scale to 1000s of nodes.  A reasonable limit would be 1024.


     

  2. Homogeneous Scale.  A single xCAT/WW management node with multiple other nodes providing some services to increase scaling.  This can scale to 100s and 1000s of nodes.  xCAT already provides this support for large diskfull clusters and it can be applied to WW as well.  The number of nodes and the network infrastructure determine the number of DHCP and TFTP servers required for a parallel reboot with no DHCP/TFTP timeouts.  The number of DHCP servers does not need to equal the number of TFTP servers.  TFTP servers NFS mount the /tftpboot directory read-only from the management node to provide a consistent set of kernel and initrd images.

    The xCAT/WW management node must export /tftpboot as read-only.  The additional TFTP servers must NFS mount /tftpboot from the management node (see the sketch below).  $XCATROOT/etc/noderes.tab must be configured by node attribute, group, or name to determine which nodes use which TFTP server.
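
    A minimal sketch of this arrangement, assuming the management node is named mgmt1 and one additional TFTP server is named tftp1 (a hypothetical host name):

    On mgmt1, /etc/exports:

    /tftpboot *(ro,async,no_root_squash)

    On tftp1, /etc/fstab:

    mgmt1:/tftpboot  /tftpboot  nfs timeo=14,intr 1 2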


     
  3. Heterogeneous.  A single xCAT/WW management node with one or more WW image management nodes (at least one per OS/ARCH type, e.g. rhas3/x86_64, rhas3u4/ia64).  The WW image management nodes may also double as TFTP servers for a heterogeneous scale solution.  Additional TFTP servers may also be added that are not WW image management nodes.  This can be configured as simple or scale.

    The xCAT/WW management node must export /tftpboot as read-write, /vnfs as read-write, /etc/warewulf as read-only, and /install as read-only.  The WW image management nodes must mount /tftpboot, /vnfs, /etc/warewulf, and /install from the management node.  The additional TFTP-only servers must mount /tftpboot from the management node.  $XCATROOT/etc/noderes.tab must be configured by node attribute, group, or name to determine which nodes use which TFTP server.

NOTE:  DHCP and TFTP setup can be as simple or as complicated as you like.  xCAT also supports sub DHCP servers for routed subnets sans DHCP forwarding for the ultimate in complexity.

NOTE:  Nodes with disks can toggle from diskfull to diskless.  e.g. nodeset noderange boot (or rboot noderange) for local disk, nodeset noderange netboot (or rnetboot noderange) for diskless boot.

NOTE:  Applications with heavy local file I/O requirements can still benefit from diskless administration, i.e. the cluster can be booted diskless but still access local disk.

This document is for xCAT 1.2.0 and Warewulf 2.2.4-2.

x86 (i386, i486, i586, i686) supported distributions:

x86_64 (Opteron and EMT64) supported distributions:

ia64 supported distributions:

ppc64 supported distributions:

NOTE:  SuSE OSes to be done.

NOTE:  2.6 kernel distributions to be done.


Installation Notes (Homogeneous)

  1. Install management node and xCAT per the xCAT mini HOWTO.  Follow all steps except node install.

    NOTE:  The copycds command must NOT be skipped.  It is needed for YUM.
     
  2. Got YUM?  Warewulf uses YUM to build the diskless images.

    Check with:

    yum --version

    No YUM?

    Type:

    rpm -i $XCATROOT/src/yum-2.0.8-1.noarch.rpm

    NOTE:  YUM 2.1.x unsupported.
     
  3. Edit YUM configuration (/etc/yum.conf).  Change remote directories to local directories created by copycds (usually /install/OSVER/ARCH).  For this document OSVER will be rhas3 and ARCH will be x86_64.

    e.g.:

    Change:

    [base]
    name=CentOS-$releasever - Base
    baseurl=http://mirror.centos.org/centos-3/$releasever/os/$basearch/


    to

    [base]
    name=CentOS-$releasever - Base
    baseurl=file:///install/rhas3/x86_64/RedHat/RPMS


    Change:

    [update]
    name=CentOS-$releasever - Updates
    baseurl=http://mirror.centos.org/centos-3/$releasever/updates/$basearch/
    gpgcheck=1

    #packages used/produced in the build but not released
    [addons]
    name=CentOS-$releasever - Addons
    baseurl=http://mirror.centos.org/centos-3/$releasever/addons/$basearch/
    gpgcheck=1

    #additional packages that may be useful
    [extras]
    name=CentOS-$releasever - Extras
    baseurl=http://mirror.centos.org/centos-3/$releasever/extras/$basearch/
    gpgcheck=1


    to

    [update]
    name=CentOS-$releasever - Updates
    baseurl=file:///install/post/updates/rhas3/x86_64


    Change all instances of gpgcheck from 1 to 0.
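
    One way to do this (a sketch using the same perl idiom used elsewhere in this HOWTO):

    perl -pi -e 's/gpgcheck=1/gpgcheck=0/' /etc/yum.conf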

    Sample /etc/yum.conf for RHAS3 file:

    [main]
    cachedir=/var/cache/yum
    debuglevel=2
    logfile=/var/log/yum.log
    pkgpolicy=newest
    distroverpkg=redhat-release
    tolerant=1
    exactarch=1

    [base]
    name=Red Hat Linux $releasever - $basearch - Base
    baseurl=file:///install/rhas3/x86_64/RedHat/RPMS
    gpgcheck=0

    [updates]
    name=Red Hat Linux $releasever - Updates
    baseurl=file:///install/post/updates/rhas3/x86_64
    gpgcheck=0

     
  4. Setup YUM for xCAT.  For this document OSVER will be rhas3 and ARCH will be x86_64.

    yum-arch /install/OSVER/ARCH/RedHat/RPMS/
    mkdir -p /install/post/updates/OSVER/ARCH
    yum-arch /install/post/updates/OSVER/ARCH


    e.g.:

    yum-arch /install/rhas3/x86_64/RedHat/RPMS/
    mkdir -p /install/post/updates/rhas3/x86_64
    yum-arch /install/post/updates/rhas3/x86_64

     
  5. Build MAKEDEV and perl-Term-Screen.

    Download from http://downloads.warewulf-cluster.org/DEPS/ and save in /tmp.

    Build with:

    cd /tmp
    rpm -i MAKEDEV-3.3.8-5.ww.src.rpm
    rpm -i perl-Term-Screen-1.02-2.caos.src.rpm
    cd /usr/src/redhat/SPECS
    rpmbuild -ba perl-Term-Screen.spec
    rpmbuild -ba MAKEDEV.spec

    PPC64 Note: rpmbuild --target ppc -ba MAKEDEV.spec
     
  6. Build Warewulf RPMs.

    Download from http://downloads.warewulf-cluster.org/ and save to /tmp.

    Build with:

    cd /tmp
    rpm -i warewulf-2.2.4-2.src.rpm
    cd /usr/src/redhat/SPECS
    rpmbuild -ba warewulf.spec

    PPC64 Note:

    patch -p0 <$XCATROOT/patch/warewulf/warewulf.spec.patch
    rpmbuild --target ppc -ba warewulf.spec
     
  7. Setup YUM for Warewulf.

    Type:

    yum-arch /usr/src/redhat/RPMS/

    Append to /etc/yum.conf:

    [local]
    name=Local repository
    baseurl=file:///usr/src/redhat/RPMS

     
  8. Install Warewulf.

    Type:

    yum install warewulf warewulf-tools
     
  9. Setup Warewulf.

    The previous steps follow the Warewulf documentation (http://warewulf-cluster.org/installation.html); however, the next steps are specific to a hybrid xCAT/Warewulf environment.  xCAT and Warewulf alter/rely on many of the same system files.  Since xCAT is a superset of Warewulf functionality, xCAT will maintain the system files.

    Out of the box, Warewulf will edit the following files:

    /etc/hosts
    /etc/exports
    /etc/dhcpd.conf
    /tftpboot/pxelinux.cfg/*


    In an xCAT environment this is undesirable.  /etc/hosts and /etc/exports will be edited manually.  /etc/dhcpd.conf and /tftpboot/pxelinux.cfg/* are managed by xCAT.
     
  10. Edit Warewulf /etc/warewulf/master.conf.

    nodename should equal the short node name of the Warewulf management node (e.g. mgmt1).  node prefix and node number format are used in WW-only environments for automatic discovery and setup; since xCAT controls this, leave the defaults.

    Change the boot, admin, sharedfs, and cluster devices to match the correct NIC (e.g. eth0).

    Sample configuration file:
    [network]
    # Hostname that the nodes know this host as
      nodename              = mgmt1
      node prefix           = node
    
    # The printf format string that would follow the % character, e.g.: 04d
      node number format    = 04d
    
    # WARNING: boot and admin devices should be the same on clusters that 
    # utilize a traditional net architecture (not FNN, or bonded)
      boot device           = eth0
      admin device          = eth0
      sharedfs device       = eth0
      cluster device        = eth0
    
  11. Edit Warewulf /etc/warewulf/vnfs.conf.

    This file defines the diskless images.  The defaults are a good start, but glibc.i386 must be changed to glibc and the disable virtual terminals = 1456 line must be commented out.

    Add a compute image, append:

    [compute]
       excludes file        = /etc/warewulf/vnfs/excludes
       sym links file       = /etc/warewulf/vnfs/symlinks
       fstab template       = /etc/warewulf/vnfs/fstab
       path                 = /compute


    NOTE:  To add an additional image (e.g. test) append:

    [test]
       excludes file        = /etc/warewulf/vnfs/excludes
       sym links file       = /etc/warewulf/vnfs/symlinks
       fstab template       = /etc/warewulf/vnfs/fstab
       path                 = /test


    Export WWIMAGE as the name of the image you are working on (e.g. compute), e.g.:

    export WWIMAGE=compute
     
  12. Edit Warewulf /etc/warewulf/nodes.conf.

    Edit the users and groups list and uncomment the pxeboot line.
     
  13. Setup NFS.

    Edit /etc/exports:

    Append:

    /usr/local *(ro,async,no_root_squash)
    /vnfs *(ro,async,no_root_squash)
    /home *(rw,async,root_squash)


    NOTE:  Only /vnfs is required if /home and /usr/local are serviced by another server.

    Type:

    service nfs restart
    chkconfig --level 345 nfs on

     
  14. Setup /etc/hosts and DNS.

    Alias your Warewulf management node host name entry with -admin, -clust, -sfs suffixes.

    E.g.:

    172.20.0.1        mgmt1 mgmt1-admin mgmt1-clust mgmt1-sfs

    Rebuild DNS, type:

    makedns
     
  15. Create VNFS.  NOTE:  If using a hybrid RAM/NFS file system, jump to the Hybrid RAM/NFS File System section before creating the VNFS.

    wwvnfs.create --vnfs=$WWIMAGE
     
  16. Put DNS info in VNFS.

    cp /etc/resolv.conf /vnfs/$WWIMAGE/etc/
     
  17. Add xCAT and /usr/local NFS support.

    Edit /etc/warewulf/vnfs/fstab and append:

    %{sharedfs ipaddr}:/opt/xcat /opt/xcat nfs ro,rsize=8192,wsize=8192 0 0
    %{sharedfs ipaddr}:/usr/local /usr/local nfs ro,rsize=8192,wsize=8192 0 0

    NOTE:  If xCAT, /home, or /usr/local are exported by another server, simply hardcode the IP or name in place of %{sharedfs ipaddr}.

    Create mount points, type:

    mkdir -p /vnfs/$WWIMAGE/opt/xcat
    mkdir -p /vnfs/$WWIMAGE/usr/local
     
  18. Setup xCAT sysconfig file.

    Type:

    cp /etc/sysconfig/xcat /vnfs/$WWIMAGE/etc/sysconfig/

    xCAT also needs /etc/redhat-release and xcat.sh, type:

    cp /etc/redhat-release /vnfs/$WWIMAGE/etc
    cp /etc/profile.d/xcat.sh /vnfs/$WWIMAGE/etc/profile.d

     
  19. Put NTP in VNFS.  (Optional, but recommended).

    Install NTP with YUM:

    yum --installroot /vnfs/$WWIMAGE install ntp

    Create NTP configuration files, type:

    cd /vnfs/$WWIMAGE
    mkdir -p etc/ntp
    chown ntp etc/ntp

    echo "server $(hostname -s)
    driftfile /etc/ntp/drift
    multicastclient
    broadcastdelay 0.008
    authenticate no
    keys /etc/ntp/keys
    trustedkey 65535
    requestkey 65535
    controlkey 65535" >etc/ntp.conf

    echo "$(hostname -s)" >etc/ntp/step-tickers

    wwvnfs.chroot --vnfs=$WWIMAGE
    chkconfig --level 345 ntpd on
    exit

     
  20. Put SSH in VNFS.  (Optional, but recommended).

    Edit /etc/warewulf/vnfs/excludes and append:

    + var/empty/sshd/

    after the line:

    + var/tmp/

    Install SSH  with YUM:

    yum --installroot /vnfs/$WWIMAGE install openssh-clients openssh-server

    Copy root SSH keys to VNFS:

    cp -r /root/.ssh /vnfs/$WWIMAGE/root

    Setup SSH on boot and generate keys, type:

    wwvnfs.chroot --vnfs=$WWIMAGE

    chkconfig --level 345 sshd on

    /usr/bin/ssh-keygen -t rsa1 -f /etc/ssh/ssh_host_key -C '' -N ''
    chmod 600 /etc/ssh/ssh_host_key
    chmod 644 /etc/ssh/ssh_host_key.pub
    /usr/bin/ssh-keygen -t rsa -f /etc/ssh/ssh_host_rsa_key -C '' -N ''
    chmod 600 /etc/ssh/ssh_host_rsa_key
    chmod 644 /etc/ssh/ssh_host_rsa_key.pub
    /usr/bin/ssh-keygen -t dsa -f /etc/ssh/ssh_host_dsa_key -C '' -N ''
    chmod 600 /etc/ssh/ssh_host_dsa_key
    chmod 644 /etc/ssh/ssh_host_dsa_key.pub

    exit
     
  21. Install HPC and xCAT dependencies.  (Optional, but recommended).

    Type:

    yum --installroot /vnfs/$WWIMAGE install perl pdksh libf2c elfutils-libelf libaio
     
  22. Got Serial?  Consider a serial terminal login.  (Optional, but recommended).

    Append something like:

    co:2345:respawn:/sbin/agetty ttyS0 19200 xterm

    to /vnfs/$WWIMAGE/etc/inittab, then:

    echo ttyS0 >>/vnfs/$WWIMAGE/etc/securetty

    PPC64 (JS20) Note:

    Use:

    co:2345:respawn:/sbin/agetty hvc0 38400 xterm

    Run:

    cd /vnfs/$WWIMAGE/
    echo hvc0 >>etc/securetty
    mknod -m 620 dev/hvc0 c 229 0
    chgrp tty dev/hvc0

     
  23. Install Myrinet RPM.  (See myrinet-HOWTO).  (Optional)

    E.g.:

    wwvnfs.rpm --vnfs $WWIMAGE --nodeps -i gm-2.0.16_Linux-2.4.21-20.EL.c0.x86_64.rpm

    Change GM to load after NFS, type:

    perl -pi -e 's/# chkconfig: 2345 11 60/# chkconfig: 2345 54 60/' /vnfs/$WWIMAGE/etc/rc.d/init.d/gm
    wwvnfs.chroot --vnfs=$WWIMAGE
    chkconfig --level 345 gm on
    exit


    Install RPM on master as well for development.

    Remove from /etc/warewulf/vnfs/master-includes all GM related files, i.e. remove:

    /dev/gm0
    /dev/gm1
    /dev/gm2
    /dev/gm3
    /dev/gm4
    /dev/gm5
    /dev/gm6
    /dev/gm7
    /dev/gmp0
    /dev/gmp1
    /dev/gmp2
    /dev/gmp3
    /dev/gmp4
    /dev/gmp5
    /dev/gmp6
    /dev/gmp7

    # Some defaults for Myrinet when installed in /usr/local
    /etc/init.d/gm
    /etc/rc3.d/S20gm
    /usr/local/bin/gm_allsize
    /usr/local/bin/gm_board_info
    /usr/local/bin/gm_board_mapping
    /usr/local/bin/gm_collide
    /usr/local/bin/gm_counters
    /usr/local/bin/gm_dirsend
    /usr/local/bin/gm_dma_malloc
    /usr/local/bin/gm_ethertune
    /usr/local/bin/gm_fault
    /usr/local/bin/gm_nway
    /usr/local/bin/gm_set_name
    /usr/local/bin/gm_set_speed
    /usr/local/bin/gm_simpleroute
    /usr/local/bin/gm_time_registration
    /usr/local/sbin/gm_mapper
    /usr/local/lib/libgm.so
    /usr/local/lib/libgm.so.0
    /usr/local/lib/libgm.so.0.0.0
    /usr/local/etc/gm/gm_config_info
    /usr/local/etc/gm/gm_libtool
    /usr/local/etc/gm/UTS_RELEASE

     
  24. Setup Torque.  (Optional).

    Edit /etc/warewulf/vnfs/excludes and append:

    + var/spool/pbs/

    after the line:

    + var/tmp/

    Append to /etc/warewulf/vnfs/master-includes /dev/pty*:

    ls -1 /dev/pty* >>/etc/warewulf/vnfs/master-includes

    Run the following script to set up the PBS directories:

    $XCATROOT/install/diskless/pbs_mom /vnfs/$WWIMAGE $PBS_SERVER

    where $PBS_SERVER is the short hostname of the PBS server.
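
    E.g., assuming the xCAT/WW management node mgmt1 doubles as the PBS server:

    export PBS_SERVER=mgmt1
    $XCATROOT/install/diskless/pbs_mom /vnfs/$WWIMAGE $PBS_SERVER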

    Optional, but recommended:  add support for /etc/security/access.conf.  Without this support any user can SSH/RSH to any node at any time.  xCAT's PBS startup and takedown script will manage /etc/security/access.conf.  To add this support type:

    $XCATROOT/install/diskless/setuppam /vnfs/$WWIMAGE

    Setup Torque on boot, type:

    wwvnfs.chroot --vnfs=$WWIMAGE

    chkconfig --level 345 pbs_mom on

    exit

    The rest of Torque setup is performed using the xCAT genpbs command, to be run after all nodes are up.
     

  25. Build network boot image.

    Type:

    >/vnfs/$WWIMAGE/var/log/lastlog
    wwvnfs.build --vnfs=$WWIMAGE


    IGNORE the Warewulf message:  NOTE: if any nodes use PXELINUX, you must rerun wwmaster.dhcpd!

    DO NOT run wwmaster.dhcpd; it will hose you.  (If you hose yourself, just type: makedhcp --new --allmac.)

    NOTE:  Copy the /tftpboot/compute.* files to all TFTP servers, unless the Warewulf server was installed by xCAT as a staging server, in which case this should be automatic.
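
    A minimal sketch of the manual copy, assuming two additional TFTP servers named tftp1 and tftp2 (hypothetical host names):

    for h in tftp1 tftp2; do
        scp /tftpboot/$WWIMAGE.* $h:/tftpboot/
    done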
     
  26. Setup and start wwautosyncd daemon.

    Type:

    cp -f $XCATROOT/rc.d/wwautosyncd /etc/rc.d/init.d
    chkconfig --level 345 wwautosyncd on
    service wwautosyncd restart

     
  27. Setup and start warewulfd daemon.

    Type:

    chkconfig --level 345 warewulfd on
    service warewulfd restart
     
  28. Boot up nodes!

    Verify the node entries in $XCATROOT/etc/nodetype.tab are:

    nodename    warewulf,ARCH,compute

    e.g.

    node13      warewulf,x86_64,compute
    node14      warewulf,x86_64,compute

    Then, type:

    rnetboot noderange

    NOTE:  Use wcons or wvid and try one node first.

    NOTE:  Use wwnode.list and wwtop to monitor nodes.
     
  29. Collect SSH host keys after boot:

    makesshgkh noderange
     
  30. Add users with addclusteruser.

    E.g.

    # addclusteruser
    Enter username: bob
    Enter group: users
    Enter UID (return for next): 500
    Enter absolute home directory root:
    addclusteruser: must start with /
    Enter absolute home directory root: /home
    Enter passwd (blank for random): NkeryW4a
    Creating home directory /home/bob
    Adding user: useradd -u 500 -g users -d /home/bob bob
    Setting passwd for bob to NkeryW4a
    Changing password for user bob.
    passwd: all authentication tokens updated successfully.

    Don't forget to run pushuser [noderange] bob


    IGNORE the pushuser command; use wwnode.sync for Warewulf.

    Edit /etc/warewulf/nodes.conf users and groups.

    Push out updates with:

    wwnode.sync
     
  31. Install compilers and libraries.  Read the xCAT HPC Benchmark HOWTO for details.
     
  32. Enjoy your cluster.  Do some work.
     

Advanced Notes (Heterogeneous)

  1. Complete the Homogeneous Installation Notes first.
     

  2. Install a diskfull node that matches the OS and ARCH of the target diskless nodes.  E.g. an x86 management node can use xCAT to install an x86_64 node with a different OS.  Install all packages.  This node will be referred to as the Warewulf Image Management node or WWIM.
     

  3. From the xCAT/WW management node (not the WWIM node) append to/edit /etc/exports:

    /install *(ro,async,no_root_squash)
    /tftpboot *(rw,async,no_root_squash)
    /vnfs *(rw,async,no_root_squash)
    /etc/warewulf *(ro,async,no_root_squash)

    Then export with:

    exportfs -a
     

  4. From the xCAT/WW management node (not the WWIM node) setup YUM for the new OS/ARCH.  copycds should have already been run against this new OS/ARCH since it was required to install the diskfull WWIM node.  Let's assume that your xCAT/WW management node is x86 (i686) based, that you want to add x86_64 based diskless nodes, and that your newly installed WWIM node is x86_64.  Let's also assume that you are running RHAS3 on both x86 and x86_64, but with the correct ARCH version for each node.

    The following commands should have already been run as part of the Homogeneous Installation Notes (above).

    yum-arch /install/rhas3/x86/RedHat/RPMS/
    mkdir -p /install/post/updates/rhas3/x86
    yum-arch /install/post/updates/rhas3/x86

    For the new OS/ARCH (assuming rhas3/x86_64) type:

    yum-arch /install/rhas3/x86_64/RedHat/RPMS/
    mkdir -p /install/post/updates/rhas3/x86_64
    yum-arch /install/post/updates/rhas3/x86_64
     

  5. From the xCAT/WW management node (not the WWIM node) edit /etc/warewulf/vnfs.conf.

    Add a compute-ARCH image, e.g. append:

    [compute-x86_64]
       excludes file        = /etc/warewulf/vnfs/excludes
       sym links file       = /etc/warewulf/vnfs/symlinks
       fstab template       = /etc/warewulf/vnfs/fstab
       path                 = /compute-x86_64


    NOTE:  for IA64/RHAS3 append:

    [compute-ia64]
       excludes file        = /etc/warewulf/vnfs/excludes
       sym links file       = /etc/warewulf/vnfs/symlinks
       fstab template       = /etc/warewulf/vnfs/fstab
       path                 = /compute-ia64
       include packages     = warewulf-vnfs glibc coreutils binutils


    NOTE:  Warewulf does not differentiate hardware architectures.  To avoid confusion and loss of images, please use the imagename-arch convention when defining heterogeneous images in vnfs.conf.
     

  6. From the WWIM node (not the xCAT/WW management node) setup /etc/fstab, append:

    xCATWW:/tftpboot      /tftpboot      nfs timeo=14,intr 1 2
    xCATWW:/install       /install       nfs timeo=14,intr 1 2
    xCATWW:/vnfs          /vnfs.master   nfs timeo=14,intr 1 2
    xCATWW:/etc/warewulf  /etc/warewulf  nfs timeo=14,intr 1 2

    Where xCATWW is the hostname of the xCAT/WW management node.
     

  7. From the WWIM node (not the xCAT/WW management node) create all of the mount points but mount /tftpboot and /install only (/etc/warewulf and /vnfs.master will be mounted in a later step), type:

    mkdir -p /tftpboot
    mount /tftpboot
    mkdir -p /install
    mount /install
    mkdir -p /etc/warewulf
    umount /etc/warewulf
    mkdir -p /vnfs.master
    umount /vnfs.master

     

  8. From the WWIM node (not the xCAT/WW management node) follow step 2. Got YUM? from the Homogeneous Installation Notes.
     

  9. From the WWIM node (not the xCAT/WW management node) follow step 3. Edit YUM configuration (/etc/yum.conf). from the Homogeneous Installation Notes.

    NOTE:  OS/ARCH must match WWIM.
     

  10. From the WWIM node (not the xCAT/WW management node) follow step 5. Build MAKEDEV and perl-Term-Screen. from the Homogeneous Installation Notes.
     

  11. From the WWIM node (not the xCAT/WW management node) follow step 6. Build Warewulf RPMs. from the Homogeneous Installation Notes.
     

  12. From the WWIM node (not the xCAT/WW management node) follow step 7. Setup YUM for Warewulf. from the Homogeneous Installation Notes.
     

  13. From the WWIM node (not the xCAT/WW management node) follow step 8. Install Warewulf. from the Homogeneous Installation Notes.
     

  14. From the WWIM node (not the xCAT/WW management node) patch wwvnfs.build to support WWIM nodes.

    cd /usr/sbin
    patch -p0 <$XCATROOT/patch/wwvnfs.build-2.2.4-2.patch

     

  15. From the WWIM node (not the xCAT/WW management node) copy /etc/hosts from xCAT/WW node.

    scp xCATWW:/etc/hosts /etc

    Where xCATWW is the hostname of the xCAT/WW management node.
     

  16. IA64/RHAS3 only notes:

    From the WWIM node (not the xCAT/WW management node) create a gzip'd kernel, type:

    cd /boot
    gzip -9 -c <vmlinux-2.4.21-27.EL >vmlinuz-2.4.21-27.EL
     

  17. Export WWIMAGE as image-arch.  E.g.:

    export WWIMAGE=compute-x86_64
     

  18. Mount /etc/warewulf and /vnfs.master, e.g.:

    mount /etc/warewulf
    mount /vnfs.master

     

  19. From the WWIM node (not the xCAT/WW management node) follow step 15. Create VNFS. from the Homogeneous Installation Notes.
     

  20. From the WWIM node (not the xCAT/WW management node) follow step 16. Put DNS info in VNFS. from the Homogeneous Installation Notes.
     

  21. From the WWIM node (not the xCAT/WW management node) follow step 17. Add xCAT and /usr/local NFS support. from the Homogeneous Installation Notes.

    NOTE:  Only create the VNFS mount points, e.g.:

    mkdir -p /vnfs/$WWIMAGE/opt/xcat
    mkdir -p /vnfs/$WWIMAGE/usr/local

     

  22. From the WWIM node (not the xCAT/WW management node) follow step 18. Setup xCAT sysconfig file. from the Homogeneous Installation Notes.
     

  23. From the WWIM node (not the xCAT/WW management node) follow step 19. Put NTP in VNFS. from the Homogeneous Installation Notes.  (Optional, but recommended).

    NOTE:  Replace $(hostname -s) with xCATWW.  Where xCATWW is the hostname of the xCAT/WW management node.
     

  24. From the WWIM node (not the xCAT/WW management node) follow step 20. Put SSH in VNFS. from the Homogeneous Installation Notes.  (Optional, but recommended).

    NOTE:  Skip /etc/warewulf/vnfs/excludes edit.  This was done as part of the Homogeneous Installation Notes.
     

  25. From the WWIM node (not the xCAT/WW management node) follow step 21. Install HPC and xCAT dependencies. from the Homogeneous Installation Notes.  (Optional, but recommended).
     

  26. From the WWIM node (not the xCAT/WW management node) follow step 22. Got Serial? from the Homogeneous Installation Notes.  (Optional, but recommended)
     

  27. From the WWIM node (not the xCAT/WW management node) follow step 23. Install Myrinet RPM. from the Homogeneous Installation Notes.  (Optional)
     

  28. From the WWIM node (not the xCAT/WW management node) follow step 24. Setup Torque. from the Homogeneous Installation Notes.  (Optional)
     

  29. From the WWIM node (not the xCAT/WW management node) follow step 25. Build network boot image. from the Homogeneous Installation Notes.
     

  30. Backup work to xCAT/WW management node.  From the WWIM node (not the xCAT/WW management node) type:

    >/vnfs/$WWIMAGE/var/log/lastlog
    rm -rf /vnfs.master/$WWIMAGE/
    cd /vnfs
    find $WWIMAGE -print | cpio -dump /vnfs.master/
     

  31. From the xCAT/WW management node (not the WWIM node), check /tftpboot and /vnfs for new heterogeneous image, e.g.:

    ls -l /tftpboot/compute-x86_64*
    ls -ld /vnfs/compute-x86_64

     

  32. Boot up nodes!

    Verify the node entries in $XCATROOT/etc/nodetype.tab are:

    nodename    warewulf,ARCH,image{-arch}

    e.g.

    node13      warewulf,x86_64,compute-x86_64
    node14      warewulf,x86_64,compute-x86_64

    NOTE:  ARCH is specified twice; this redundancy is required because WW does not directly support heterogeneous clusters.  Also note that imagename-arch is just a convention.  The image field in $XCATROOT/etc/nodetype.tab must match an entry in /etc/warewulf/vnfs.conf.

    Then, type:

    rnetboot noderange

    NOTE:  Use wcons or wvid and try one node first.

    NOTE:  Use wwnode.list and wwtop to monitor nodes.
     

  33. From the WWIM node (not the xCAT/WW management node) jump to step 29. Collect SSH host keys after boot. from the Homogeneous Installation Notes and continue from there.
     

Hybrid RAM/NFS File System

What?!?  My initrd image is huge (~32MB for x86_64) and I just lost a large chunk of short term memory (i.e. RAM ~100MB for x86_64) and I am having TFTP timeouts when I boot my 1024 node cluster. If this concerns you, consider WW Hybrid mode (~8MB initrd, ~20MB RAM usage).

What is Hybrid RAM/NFS?  Well, the notes above build a RAM root system: all OS system files are in memory and NFS is used only for user data and applications.  This large RAM root system can create TFTP scaling issues.  That can be addressed with multiple TFTP servers, which are easy to set up with xCAT and are recommended for any size initrd image.  However, the ~100MB of RAM used cannot be reduced unless OS system files are moved from RAM to NFS.  There is still some RAM used, but it is reduced to only ~20MB (x86_64 tested).

But... NFS could get pounded.  NFS can scale, and scale well; it is NFS mounting that has timeout problems.  A staggered boot or a script that retries the mounts can resolve this problem (a sketch follows below).  Once mounted, all should be fine.
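
A minimal sketch of such a retry loop, assuming an NFS mount point defined in the node's fstab (e.g. /opt/xcat); where it is hooked into the node boot scripts is up to you:

    # keep retrying an fstab-defined NFS mount until it succeeds
    until mount /opt/xcat; do
        echo "mount /opt/xcat failed, retrying in 5 seconds"
        sleep 5
    done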

NOTE:  You can have multiple NFS servers too.

Bottom line, test both.  A single WW cluster can have Hybrid and RAM root only nodes.

Get non-hybrid working first as a test.  Hybrid RAM/NFS works with homogeneous and heterogeneous clusters.  The steps below should be performed before creating the initial VNFS (see homogeneous notes).

  1. Create an /etc/warewulf/vnfs.conf entry like this:

    [compute-x86_64-hybrid]
       ramdisk nfs hybrid   = 1
       excludes file        = /etc/warewulf/vnfs/excludes-nfs
       sym links file       = /etc/warewulf/vnfs/symlinks-nfs
       fstab template       = /etc/warewulf/vnfs/fstab-nfs
       path                 = /compute-x86_64-hybrid
  2. Edit /etc/warewulf/vnfs/fstab-nfs like /etc/warewulf/vnfs/fstab per the homogeneous notes above; however, the %{vnfs dir} mount (the first line below) must also be added, e.g.:

    mgmt1:%{vnfs dir} %{vnfs dir} nfs ro,rsize=8192,wsize=8192 0 0
    mgmt1:%{sharedfs dir} %{sharedfs dir} nfs rw,rsize=8192,wsize=8192 0 0
    mgmt1:/opt/xcat /opt/xcat nfs ro,rsize=8192,wsize=8192 0 0
    mgmt1:/usr/local /usr/local nfs ro,rsize=8192,wsize=8192 0 0

     

  3. Torque setup.  (Optional)

    Edit /etc/warewulf/vnfs/excludes-nfs and append:

    + var/spool/pbs/

    after the line:

    + var/lib/

    This is required for xCAT Torque epilogue and prologue support to control access to each node.

    Edit /etc/warewulf/vnfs/excludes-nfs and change:

    /etc/security/

    to

    + /etc/security/

    Edit /etc/warewulf/vnfs/symlinks-nfs and change:

    %{vnfs}/%{link} etc/security

    to

    #%{vnfs}/%{link} etc/security
     

  4. Now perform the homogenous (and optionally the heterogeneous) notes.
     

  5. Additional Myrinet setup.  This step to be done after Myrinet step above.  (Optional)

    Obtain the name of the GM directory, e.g.:

    ls -d /vnfs/$WWIMAGE/opt/gm-*

    output:

    /vnfs/compute-x86_64-hybrid/opt/gm-2.0.17-2.4.21-27.ELsmp-x86_64

    Note the GM directory name in the output above and append it to /etc/warewulf/vnfs/excludes-nfs (IMPORTANT: do not skip a single slash or add any extra):

    #GM stuff
    /opt/gm-2.0.17-2.4.21-27.ELsmp-x86_64/


    Append to /etc/warewulf/vnfs/symlinks-nfs (IMPORTANT: do not skip a single slash or add any extra):

    #GM stuff
    %{vnfs}/%{link} opt/gm-2.0.17-2.4.21-27.ELsmp-x86_64
     

  6. YUM or RPM install more packages into $WWIMAGE.  The minimalist restrictions of RAM root only are lifted.
     

  7. Build image and boot per homogeneous/heterogeneous notes.
     

Local HD Notes (WIP)

Accessing the local HD is no different than with diskfull nodes.  However, since no OS was installed, partitions must be created and formatted for file systems and swap.  Consider scripting sfdisk, e.g.:

  1. Boot up a node.

  2. SSH/RSH to that node.

  3. Use fdisk to setup disk.

  4. Run something like 'sfdisk /dev/sda -d >{NFS mount point}/sda.sfdisk' to save partition information.

  5. Write a script to:

    A.  Dump partition information.
    B.  Compare partition information with {NFS mount point}/sda.sfdisk, if no difference jump to E.
    C.  If different, then run something like 'sfdisk /dev/sda <{NFS mount point}/sda.sfdisk'.
    D.  Use mke2fs and mkswap to format partitions.
    E.  Create mount points, mount file systems.
    F.  Run swapon for each swap partition.
     

  6. Put script in rc.local.

There are many ways to do this; the above is just an example.
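
For illustration, a minimal sketch of steps A-F, assuming a single disk /dev/sda, one data partition (sda1), one swap partition (sda2), and the saved layout in {NFS mount point}/sda.sfdisk (here called $NFSDIR; all names are illustrative):

    #!/bin/sh
    NFSDIR=/nfsmount              # hypothetical NFS mount point holding sda.sfdisk

    # A. dump the current partition information
    sfdisk -d /dev/sda >/tmp/sda.now 2>/dev/null

    # B. compare with the saved layout; C./D. repartition and format only if they differ
    if ! diff -q /tmp/sda.now $NFSDIR/sda.sfdisk >/dev/null 2>&1; then
        sfdisk /dev/sda <$NFSDIR/sda.sfdisk
        mke2fs -j /dev/sda1       # local scratch file system (ext3)
        mkswap /dev/sda2          # local swap
    fi

    # E. create mount points and mount file systems
    mkdir -p /scratch
    mount /dev/sda1 /scratch

    # F. enable swap
    swapon /dev/sda2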
 

Warewulf Command Notes

Once your xCAT/WW cluster is built and running, use it as you would any other cluster; however, the following additional commands may be useful.

wwnode.list shows a list of the currently running nodes in the cluster.  You can use it to make an MPI or PVM machine file, or just get a list of the nodes currently in the ready state (--help for options).  E.g.:

# Nodes ready:
node13
node14
node2
node6

# Nodes in unknown state:
node15
node5


wwnode.stats displays the current usage summary for your cluster by querying the warewulfd daemon directly.  It lists the current CPU, memory, swap, and network utilization (--help for options).  E.g.:

--------------------------------------------------------------------------------
Total CPU utilization: 0%                          
          Total Nodes: 6         
               Living: 6                           Warewulf
          Unavailable: 0                      Cluster Statistics
             Disabled: 0                 http://warewulf-cluster.org/
                Error: 0         
                 Dead: 0         
--------------------------------------------------------------------------------
 Node        CPU       Memory (MB)    Swap (MB)      Network (KB)   Current
 Name     [util/num] [% used/total] [% used/total] [send - receive] Status
node13       0% (2)    2% 37/1946        none           0 - 0       READY
node14       0% (2)    2% 35/1946        none           0 - 0       READY
node15       0% (2)    1% 38/5711        none           0 - 0       READY
node2        0% (2)    3% 25/1001        none           0 - 0       READY
node5        0% (2)   3% 104/4014        none           0 - 0       READY
node6        0% (2)   3% 110/4014        none           0 - 0       READY

wwnode.sync updates nodes that are already booted so they are accessible to users; it updates these files on the nodes: /etc/passwd, /etc/group, /etc/hosts, and /etc/hosts.equiv (--help for options).

wwsumary prints a summary of aggregate CPU utilization and the number of nodes ready (up and sync'd) (--help for options).  E.g.:

localhost:   0% CPU util, 6 Node(s) Ready

wwtop is the Warewulf 'top'-like monitor.  It shows the nodes ordered by highest utilization, along with important statistics about each node and general summary data.  This is an interactive curses-based tool (? for commands).

E.g.: a heterogeneous cluster of x86, x86_64, and ia64 nodes.

Top: 6 nodes, 12 cpus, 20062 MHz, 18.20 GB mem, 158 procs, uptime 0 days
Avg:    0% cputil,  58.00 MB memutil, load 0.00,  26 procs, uptime   0 day(s)
High:   0% cputil, 110.00 MB memutil, load 0.00,  27 procs, uptime   0 day(s)
Low:    0% cputil,  25.00 MB memutil, load 0.00,  25 procs, uptime   0 day(s)
Node status:    6 ready,    0 unavailable,    0 down,    0 unknown
18:38:06 mgmt1> 
Node name    CPU  MEM SWAP Uptime   MHz   ARCH  PROC   Load  Net:KB/s Stats/Util
node13        0%   2%   0%   0.05  3984  x86_64   27   0.00         0 |  IDLE  |
node14        0%   2%   0%   0.05  3984  x86_64   27   0.00         0 |  IDLE  |
node15        0%   1%   0%   0.01  4386  x86_64   26   0.00         0 |  IDLE  |
node2         0%   3%   0%   0.05  1728    i686   27   0.00         0 |  IDLE  |
node5         0%   3%   0%   0.01  2990    ia64   25   0.00         0 |  IDLE  |
node6         0%   3%   0%   0.20  2990    ia64   26   0.00         0 |  IDLE  |

E.g.: PPC64 cluster with 3 nodes down.
Top: 3 nodes, 6 cpus, 13164 MHz, 11.31 GB mem, 87 procs, uptime 0 days
Avg:    0% cputil,  27.00 MB memutil, load 0.00,  29 procs, uptime   0 day(s)
High:   0% cputil,  28.00 MB memutil, load 0.00,  29 procs, uptime   0 day(s)
Low:    0% cputil,  27.00 MB memutil, load 0.00,  29 procs, uptime   0 day(s)
Node status:    3 ready,    0 unavailable,    0 down,    3 unknown
21:19:45 wopr.cluster> 
Node name    CPU  MEM SWAP Uptime   MHz   ARCH  PROC   Load  Net:KB/s Stats/Util
blade001      0%   1%   0%   0.06  4388   ppc64   29   0.00         0 |  IDLE  |
blade002      0%   1%   0%   0.06  4388   ppc64   29   0.00         0 |  IDLE  |
blade003      0%   1%   0%   0.06  4388   ppc64   29   0.00         0 |  IDLE  |
blade005    ---- ---- ---- ------ ----- ------- ---- ------   ------- |UN-KNOWN|
blade006    ---- ---- ---- ------ ----- ------- ---- ------   ------- |UN-KNOWN|
blade007    ---- ---- ---- ------ ----- ------- ---- ------   ------- |UN-KNOWN|

For a list of specific nodes export NODES=node,node,node,... before running wwtop.

E.g. to use an xCAT noderange:

NODES=$(nr noderange) wwtop

E.g. to use PBS:

NODES=$(pbsjobnodes PBS_JOB_ID) wwtop
NODES=$(pbsusernodes USERID) wwtop



MPI Notes

No swap?  This can be a problem for large MPI jobs, specifically with MPICH.  MPICH uses RSH or SSH to launch one process per processor.  E.g. if you have a 1024 node cluster with 2 processors per node, you have 2048 processors and MPICH will launch 2048 instances of RSH or SSH.  Each instance stays in memory until the job completes.  (NOTE:  RSH cannot support more than 512 processors because it only uses privileged ports (there are only 1024 of them) and RSH uses 2 ports per processor, one for STDOUT and one for STDERR.)  At roughly half a megabyte or more each, 2048 SSH processes will take more than 1 GB of RAM.  Normally for diskfull clusters this is not an issue; the memory gets swapped out as application needs grow.

Solutions:

  1. Use LAM MPI.  LAM has a lamboot command to be run before mpirun is run.  lamboot will RSH/SSH serially to each node and startup a lightweight daemon then exit.  LAM's mpirun will work with the lightweight daemons to start the job.  (Tested.)

    E.g. for xhpl/LAM-IP:

    export LAM=/usr/local/lam7
    export PATH=$LAM/bin:$PATH
    export LAMRSH=/usr/bin/ssh
    export NP=$(wc -l $PBS_NODEFILE | awk '{print $1}')

    lamboot $PBS_NODEFILE

    mpirun -np $NP xhpl 2>&1 | tee output

     

  2. Use MPICH MPD.  MPICH MPD is like LAM's lamboot.  (Untested, failed to work.)
     

  3. Use mpiexec (if using Torque).  mpiexec is a replacement program for the script mpirun, which is part of the MPICH package. It is used to initialize a parallel job from within a Torque batch or interactive environment. mpiexec uses the task manager library of Torque to spawn copies of the executable on the nodes in a Torque allocation.  (Tested.)

    mpiexec installation notes:

    Install Torque/Maui per this document, the xCAT mini-HOWTO, and benchmark-HOWTO.

    Download mpiexec from http://www.osc.edu/~pw/mpiexec/ and save in /tmp.  Read the documentation for more information.

    Type:

    cd /usr/local/pbs/$(uname -m)
    ln -s ../include .
    cd /tmp
    tar zxvf mpiexec-0.77.tgz
    cd mpiexec-0.77
    ./configure \
        --with-pbs=/usr/local/pbs/$(uname -m) \
        --with-default-comm=mpich-gm \
        --prefix=/usr/local/pbs \
        --bindir=/usr/local/pbs/$(uname -m)/bin
    make
    make install


    To use it, put something like the following in your .pbs script, or run it manually after a qsub -I.

    E.g. for xhpl/MPICH-GM:

    export PATH=/usr/local/pbs/$(uname -m)/bin:$PATH
    export LD_LIBRARY_PATH=/usr/local/goto/lib:/usr/local/mpich/mpichgm-1.2.6..14a/gm/x86_64/smp/gnu64/ssh/lib/shared
    export OMP_NUM_THREADS=1

    mpiexec -comm mpich-gm xhpl

    E.g. for xhpl/MPICH-IP:

    export PATH=/usr/local/pbs/$(uname -m)/bin:$PATH
    export LD_LIBRARY_PATH=/usr/local/goto/lib
    export P4_GLOBMEMSIZE=16000000

    mpiexec -comm mpich-p4 xhpl

     

  4. Use the new --gm-tree-spawn option of MPICH-GM 1.2.6..14.  It distributes the SSH connections in a tree to limit per-node resource usage for large jobs.  (Tested.)  A sketch follows.
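
    A minimal sketch, assuming MPICH-GM's mpirun.ch_gm launcher and a PBS node file (paths and arguments will vary with your installation):

    export NP=$(wc -l $PBS_NODEFILE | awk '{print $1}')
    mpirun.ch_gm --gm-tree-spawn -np $NP -machinefile $PBS_NODEFILE xhpl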
     

MTU 9000 Notes

Warewulf does not have native support for MTU 9000.  Below are notes for the most complex solution (i.e. single NIC).

For MTU 9000 to function correctly everything must be MTU 9000.  The problem with a single NIC Warewulf solution is that PXE booting uses MTU 1500.  The solution is to have two NICs on the server in the MTU 9000 VLAN or switch.  One NIC should be set to MTU 1500 and the other to MTU 9000.  The MTU 9000 NIC should be the primary cluster NIC.  The MTU 1500 NIC will be used for PXE/TFTP booting only.

In the notes below the primary cluster VLAN NIC is eth0 (MTU 9000), the PXE/TFTP NIC is eth1 (MTU 1500), and the systems management VLAN NIC is eth2 (MTU 1500 required for many embedded devices).

NOTE:  Your switch must support MTU 9000.

HINT:  Cisco 3750 commands:  system mtu jumbo 9000 and switchport access vlan number.

  1. Setup management node NICs.  Remember to set one NIC to MTU 9000 and the other to MTU 1500; both must be in the same VLAN.  Each NIC must be on a different subnet.  A sketch follows.
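
    E.g., sketches of the Red Hat ifcfg files, assuming the addresses used later in these notes and /16 netmasks (adjust device names, addresses, and masks to your site):

    /etc/sysconfig/network-scripts/ifcfg-eth0:

    DEVICE=eth0
    ONBOOT=yes
    IPADDR=172.90.0.1
    NETMASK=255.255.0.0
    MTU=9000

    /etc/sysconfig/network-scripts/ifcfg-eth1:

    DEVICE=eth1
    ONBOOT=yes
    IPADDR=172.20.0.1
    NETMASK=255.255.0.0
    MTU=1500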
     

  2. Setup /etc/hosts per this HOWTO and the xCAT mini HOWTO.  E.g.:

    172.90.0.1 mgmt1 mgmt1-admin mgmt1-clust mgmt1-sfs
    172.20.0.1 mgmt1-eth1
    172.30.0.1 mgmt1-eth2

    172.90.1.1 blade001
    172.20.1.1 blade001-eth1
    172.90.1.2 blade002
    172.20.1.2 blade002-eth1
    ...

    Although the nodes are only using one NIC, two entries are required: one for PXE/TFTP booting at MTU 1500 and the other at MTU 9000 for runtime.
     

  3. Setup $XCATROOT/etc/mac.tab.  Because we are using the same NIC as eth1 (PXE/TFTP booting at MTU 1500) and as eth0 (runtime at MTU 9000), we must have the same MAC defined for both the physical and virtual NIC.  E.g.:

    blade001-eth0 00:11:25:4a:35:a2
    blade001-eth1 00:11:25:4a:35:a2
    ...

     

  4. Update $XCATROOT/etc/noderes.tab.  The TFTP field must be the host name of the TFTP server NIC (e.g. mgmt1-eth1).  The rest of the fields should be unchanged.

    E.g. old single NIC solution for MTU 1500.

    compute mgmt1,mgmt1,/install,NA,N,N,N,Y,N,N,N,eth0,eth0,mgmt1

    New single NIC solution for MTU 9000.

    compute mgmt1-eth1,mgmt1,/install,NA,N,N,N,Y,N,N,N,eth1,eth0,mgmt1

    The mgmt1-eth1, eth1, and eth0 fields determine what host should be used in DHCP for TFTP booting, the host NIC used for TFTP booting, and the host NIC used for the primary runtime NIC.  In the example above the virtual eth1 NIC will get a 172.20 address to TFTP download its kernel and initrd image from mgmt1-eth1 (172.20.0.1), but eth0 will use a 172.90 address for runtime and will communicate with mgmt1 (172.90.0.1) at MTU 9000.
     

  5. Update $XCATROOT/etc/site.tab with any new networks for DNS, then update DNS with:

    makedns
     

  6. Update DHCP, type:

    makedhcp noderange
     

  7. Verify that /vnfs/$WWIMAGE/etc/resolv.conf is correct for the MTU 9000 network.
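
    E.g., a minimal sketch of /vnfs/$WWIMAGE/etc/resolv.conf, assuming mgmt1 (172.90.0.1) is the DNS server on the MTU 9000 network (the search domain is illustrative):

    search cluster
    nameserver 172.90.0.1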
     

  8. Update/create /vnfs/$WWIMAGE/etc/sysconfig/network-scripts/ifcfg-eth0:

    echo "MTU=9000" >> /vnfs/$WWIMAGE/etc/sysconfig/network-scripts/ifcfg-eth0
     

  9. Rebuild network boot image.

    Type:

    >/vnfs/$WWIMAGE/var/log/lastlog
    wwvnfs.build --vnfs=$WWIMAGE
     

  10. rnetboot nodes, e.g.:

    rnetboot noderange
     

PPC64 Notes

Follow homogeneous or heterogeneous notes with the following exceptions:

  1. OpenFirmware does not have a network downloadable secondary boot loader like x86/x86_64 (pxelinux.0) or ia64 (elilo).  A combined kernel/initrd image must be created using the following instructions:

    cp -f /tftpboot/$WWIMAGE.img.gz /usr/src/linux-2.4/arch/ppc64/boot/ramdisk.image.gz
    cd /usr/src/linux-2.4
    perl -pi -e 's/custom$//' Makefile
    make distclean
    cp -f configs/kernel-2.4.21-ppc64.config .config
    make oldconfig
    make dep
    make zImage.initrd
    cp -f /usr/src/linux-2.4/arch/ppc64/boot/zImage.initrd /tftpboot/$WWIMAGE.zimage
     

  2. Do not delete .img.gz files from /tftpboot.
     

  3. The OpenFirmware combined kernel/initrd image must be less than 12MB in size.  Hybrid RAM/NFS should be used.

    NOTE:  A 7582217 byte image was tested successfully; larger images failed.  If you have a problem with this you may need to slim down your image.  There are 2 ways to do this.

    A.  Use the Warewulf exclude and symlink lists.
    B.  Use the following hack (you may want to run the script internals by hand to be safe).  WWIMAGE must be exported first.

    cd /tftpboot
    cp $WWIMAGE.img.gz /tmp
    cd /tmp
    $XCATROOT/build/warewulf/slimit $WWIMAGE.img.gz
    cp $WWIMAGE.img.gz /tftpboot


    Rebuild the zImage.
     

Support

http://xcat.org
http://warewulf-cluster.org


Egan Ford
egan@us.ibm.com
February  2005