xCAT Warewulf (wareCAT) HOWTO (WIP)
Introduction
Installation Notes (Homogeneous)
Advanced Notes (Heterogeneous)
Hybrid RAM/NFS File System
Local HD Notes
Warewulf Command Notes
MPI Notes
MTU 9000 Notes
PPC64 Notes
Support
Warewulf is a popular complete diskless solution for HPC clusters and xCAT is a complete diskfull solution for large HPC clusters. wareCAT is support added to xCAT for Warewulf. i.e. xCAT using Warewulf, not Warewulf using xCAT. xCAT will be the primary cluster management package. Warewulf will be used for maintaining and monitoring diskless images and nodes.
In a nutshell, Warewulf creates a number of kernel and initrd images and places them in /tftpboot. Nodes then network boot these images from DHCP and TFTP servers. xCAT provides a scalable infrastructure for DHCP and TFTP.
Supported configurations:
Homogeneous Simple. A single xCAT/WW management node for X
number of nodes. A single node will act as the only DHCP and TFTP server.
This solution scales to around 128 nodes. If staggered boot is used, this
can scale to 1000s of nodes. A reasonable limit would be 1024.
NOTE: DHCP and TFTP setup can be as simple or as complicated as
you like. xCAT also supports sub DHCP servers for routed subnets sans DHCP
forwarding for the ultimate in complexity.
NOTE: Nodes with disks can toggle between diskfull and diskless,
e.g. nodeset noderange boot (or rboot noderange) for local disk,
nodeset noderange netboot (or rnetboot noderange) for diskless boot.
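For example, with a hypothetical noderange of node001-node128:
nodeset node001-node128 netboot
rnetboot node001-node128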
NOTE: Applications with heavy local file I/O requirements are not
excluded from the benefits of diskless administration. i.e. the cluster
can be booted diskless, but still access local disk.
This document is for xCAT 1.2.0 and Warewulf 2.2.4-2.
x86 (i386, i486, i586, i686) supported distributions:
x86_64 (Opteron and EMT64) supported distributions:
ia64 supported distributions:
ppc64 supported distributions:
NOTE: SuSE OSes to be done.
NOTE: 2.6 kernel distributions to be done.
Installation Notes (Homogeneous)
[network]
# Hostname that the nodes know this host as
nodename = mgmt1
node prefix = node
# The printf format string that would follow the % character, e.g.: 04d
node number format = 04d
# WARNING: boot and admin devices should be the same on clusters that
# utilize a traditional net architecture (not FNN, or bonded)
boot device = eth0
admin device = eth0
sharedfs device = eth0
cluster device = eth0
ls -1 /dev/pty* >>/etc/warewulf/vnfs/master-includes
Run the following script to setup the PBS directories:
$XCATROOT/install/diskless/pbs_mom /vnfs/$WWIMAGE $PBS_SERVER
where $PBS_SERVER is the short hostname of the PBS server.
Optional but recommended: add support for
/etc/security/access.conf. Without this support any user can SSH/RSH
to any node at any time. xCAT's PBS startup and takedown script will manage
/etc/security/access.conf. To add this support type:
$XCATROOT/install/diskless/setuppam /vnfs/$WWIMAGE
Setup Torque on boot, type:
wwvnfs.chroot --vnfs=$WWIMAGE
chkconfig --level 345 pbs_mom on
exit
The rest of Torque setup is performed using the xCAT
genpbs command, to be run after all
nodes are up.
Advanced Notes (Heterogeneous)
Complete the Homogeneous Installation Notes first.
Install a diskfull node that matches the OS and ARCH of the
target diskless nodes. E.g. an x86 management node can use xCAT to install an
x86_64 node with a different OS. Install all packages. This node will be
referred to as the Warewulf Image Management node or WWIM.
From the xCAT/WW management node (not the WWIM node) append
to/edit
/etc/exports:
/install *(ro,async,no_root_squash)
/tftpboot *(rw,async,no_root_squash)
/vnfs *(rw,async,no_root_squash)
/etc/warewulf *(ro,async,no_root_squash)
Then export with:
exportfs -a
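To verify that the exports are active (a quick check, not part of the original procedure):
showmount -e localhost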
From the xCAT/WW management node (not the WWIM node) setup YUM
for the new OS/ARCH. copycds
should have already been run against this new OS/ARCH since it was required to
install the diskfull WWIM node. Let's assume that your xCAT/WW management
node is x86 (i686) based, that you want to add x86_64 based diskless nodes, and
that your newly installed WWIM node is x86_64. Let's also assume that you are
running RHAS3 on both x86 and x86_64, but with the correct ARCH version for each
node.
The following commands should have already been run as part of the Homogenous
Installation Notes (above).
yum-arch /install/rhas3/x86/RedHat/RPMS/
mkdir -p /install/post/updates/rhas3/x86
yum-arch /install/post/updates/rhas3/x86
For the new OS/ARCH (assuming rhas3/x86_64) type:
yum-arch /install/rhas3/x86_64/RedHat/RPMS/
mkdir -p /install/post/updates/rhas3/x86_64
yum-arch /install/post/updates/rhas3/x86_64
From the xCAT/WW management node (not the WWIM node) edit /etc/warewulf/vnfs.conf.
Add a compute-ARCH image, e.g. append:
[compute-x86_64]
excludes file = /etc/warewulf/vnfs/excludes
sym links file = /etc/warewulf/vnfs/symlinks
fstab template = /etc/warewulf/vnfs/fstab
path = /compute-x86_64
NOTE: for IA64/RHAS3 append:
[compute-ia64]
excludes file = /etc/warewulf/vnfs/excludes
sym links file = /etc/warewulf/vnfs/symlinks
fstab template = /etc/warewulf/vnfs/fstab
path = /compute-ia64
include packages = warewulf-vnfs glibc coreutils binutils
NOTE: Warewulf does not differentiate hardware architectures. To
avoid confusion and loss of images, please use the convention
imagename-arch when defining heterogeneous images in vnfs.conf.
From the WWIM node (not the xCAT/WW management node) setup
/etc/fstab, append:
xCATWW:/tftpboot      /tftpboot      nfs timeo=14,intr 1 2
xCATWW:/install       /install       nfs timeo=14,intr 1 2
xCATWW:/vnfs          /vnfs.master   nfs timeo=14,intr 1 2
xCATWW:/etc/warewulf  /etc/warewulf  nfs timeo=14,intr 1 2
Where xCATWW is the
hostname of the xCAT/WW management node.
From the WWIM node (not the xCAT/WW management node) create
mount points and mount /tftpboot and
/install only, type:
mkdir -p /tftpboot
mount /tftpboot
mkdir -p /install
mount /install
mkdir -p /etc/warewulf
umount /etc/warewulf
mkdir -p /vnfs.master
umount /vnfs.master
From the WWIM node (not the xCAT/WW management node) follow step
2. Got YUM? from the Homogeneous Installation Notes.
From the WWIM node (not the xCAT/WW management node) follow step
3. Edit YUM configuration (/etc/yum.conf).
from the Homogeneous Installation Notes.
NOTE: OS/ARCH must match WWIM.
From the WWIM node (not the xCAT/WW management node) follow step
5. Build MAKEDEV and perl-Term-Screen. from the Homogeneous Installation
Notes.
From the WWIM node (not the xCAT/WW management node) follow step
6. Build Warewulf RPMs. from the Homogeneous Installation Notes.
From the WWIM node (not the xCAT/WW management node) follow step
7. Setup YUM for Warewulf. from the Homogeneous Installation Notes.
From the WWIM node (not the xCAT/WW management node) follow step
8. Install Warewulf. from the Homogeneous Installation Notes.
From the WWIM node (not the xCAT/WW management node) patch
wwvnfs.build to support WWIM nodes.
cd /usr/sbin
patch -p0 <$XCATROOT/patch/wwvnfs.build-2.2.4-2.patch
From the WWIM node (not the xCAT/WW management node) copy
/etc/hosts from xCAT/WW node.
scp xCATWW:/etc/hosts /etc
Where xCATWW is the hostname of
the xCAT/WW management node.
IA64/RHAS3 only notes:
From the WWIM node (not the xCAT/WW management node) create a gzip'd kernel, type:
cd /boot
gzip -9 -c <vmlinux-2.4.21-27.EL >vmlinuz-2.4.21-27.EL
Export WWIMAGE as
image-arch. E.g.:
export WWIMAGE=compute-x86_64
Mount /etc/warewulf and /vnfs.master, e.g.:
mount /etc/warewulf
mount /vnfs.master
From the WWIM node (not the xCAT/WW management node) follow step
15. Create VNFS. from the Homogeneous Installation Notes.
From the WWIM node (not the xCAT/WW management node) follow step
16. Put DNS info in VNFS. from the Homogeneous Installation Notes.
From the WWIM node (not the xCAT/WW management node) follow step
17. Add xCAT and /usr/local NFS support.
from the Homogeneous Installation Notes.
NOTE: Only create the VNFS mount points, e.g.:
mkdir -p /vnfs/$WWIMAGE/opt/xcat
mkdir -p /vnfs/$WWIMAGE/usr/local
From the WWIM node (not the xCAT/WW management node) follow step
18. Setup xCAT sysconfig file.
from the Homogeneous Installation Notes.
From the WWIM node (not the xCAT/WW management node) follow step
19. Put NTP in VNFS. from the Homogeneous Installation Notes.
(Optional, but recommended).
NOTE: Replace $(hostname -s) with xCATWW, where
xCATWW is the hostname of the xCAT/WW management node.
From the WWIM node (not the xCAT/WW management node) follow step
20. Put SSH in VNFS. from the Homogeneous Installation Notes.
(Optional, but recommended).
NOTE: Skip /etc/warewulf/vnfs/excludes
edit. This was done as part of the Homogeneous Installation Notes.
From the WWIM node (not the xCAT/WW management node) follow step
21. Install HPC and xCAT dependencies. from the Homogeneous Installation
Notes. (Optional, but recommended).
From the WWIM node (not the xCAT/WW management node) follow step
22. Got Serial? from the Homogeneous Installation Notes.
(Optional, but recommended)
From the WWIM node (not the xCAT/WW management node) follow step
23. Install Myrinet RPM. from the Homogeneous Installation Notes.
(Optional)
From the WWIM node (not the xCAT/WW management node) follow step
24. Setup Torque. from the Homogeneous Installation Notes.
(Optional)
From the WWIM node (not the xCAT/WW management node) follow step
25. Build network boot image. from the Homogeneous Installation Notes.
Back up your work to the xCAT/WW management node. From the WWIM node (not the
xCAT/WW management node) type:
>/vnfs/$WWIMAGE/var/log/lastlog
rm -rf /vnfs.master/$WWIMAGE/
cd /vnfs
find $WWIMAGE -print | cpio -dump /vnfs.master/
From the xCAT/WW management node (not the WWIM node), check
/tftpboot and
/vnfs for new heterogeneous image,
e.g.:
ls -l /tftpboot/compute-x86_64*
ls -ld /vnfs/compute-x86_64
Boot up nodes!
Verify the node entries in $XCATROOT/etc/nodetype.tab
are:
nodename warewulf,ARCH,image{-arch}
e.g.
node13 warewulf,x86_64,compute-x86_64
node14 warewulf,x86_64,compute-x86_64
NOTE: ARCH is effectively specified twice. This redundancy
is required because WW does not directly support heterogeneous clusters.
Also note that this is just a convention. The
image field in $XCATROOT/etc/nodetype.tab
must match an entry in /etc/warewulf/vnfs.conf.
Then, type:
rnetboot noderange
NOTE: Use wcons
or wvid and try one node first.
NOTE: Use wwnode.list
and wwtop to monitor nodes.
From the WWIM node (not the xCAT/WW management node) jump to
step 28. Collect SSH host keys after boot. from the Homogeneous
Installation Notes and continue from there.
Hybrid RAM/NFS File System

What?!? My initrd image is huge (~32MB for x86_64), I just lost a large chunk of short-term memory (i.e. RAM, ~100MB for x86_64), and I am having TFTP timeouts when I boot my 1024 node cluster. If this concerns you, consider WW Hybrid mode (~8MB initrd, ~20MB RAM usage).
What is Hybrid RAM/NFS? The above notes build a RAM root system: all OS system files are in memory and NFS is used only for user data and applications. This large RAM root system can create TFTP scaling issues. These can be addressed with multiple TFTP servers, which are easy to set up with xCAT and recommended for any size initrd image. However, the ~100MB of RAM used cannot be reduced unless OS system files are moved from RAM to NFS. There is still some RAM used, but it is reduced to only ~20MB (x86_64 tested).
But... NFS could get pounded. NFS can scale, and scale
well; it is NFS mounting that has timeout problems. A staggered boot
or a script that retries the mount can resolve this problem. Once mounted, all
should be fine.
NOTE: You can have multiple NFS servers too.
Bottom line, test both. A single WW cluster can have
Hybrid and RAM root only nodes.
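As a hedged illustration of a staggered boot, assuming the nodes are grouped into hypothetical xCAT noderanges rack01 through rack04:
for r in rack01 rack02 rack03 rack04; do
    rnetboot $r   # boot one group at a time to avoid TFTP/NFS mount storms
    sleep 60      # delay between groups; tune for your environment
done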
Get non-hybrid working first as a test. Hybrid RAM/NFS works with
homogeneous and heterogeneous clusters. The steps below should be
performed before creating the initial VNFS (see homogeneous notes).
Create an /etc/warewulf/vnfs.conf entry like this:
[compute-x86_64-hybrid]
ramdisk nfs hybrid = 1
excludes file = /etc/warewulf/vnfs/excludes-nfs
sym links file = /etc/warewulf/vnfs/symlinks-nfs
fstab template = /etc/warewulf/vnfs/fstab-nfs
path = /compute-x86_64-hybrid
Edit /etc/warewulf/vnfs/fstab-nfs
like /etc/warewulf/vnfs/fstab per the
homogeneous notes above; however, the first line below (the %{vnfs dir} NFS mount) must also be added, e.g.:
mgmt1:%{vnfs dir}     %{vnfs dir}     nfs ro,rsize=8192,wsize=8192 0 0
mgmt1:%{sharedfs dir} %{sharedfs dir} nfs rw,rsize=8192,wsize=8192 0 0
mgmt1:/opt/xcat       /opt/xcat       nfs ro,rsize=8192,wsize=8192 0 0
mgmt1:/usr/local      /usr/local      nfs ro,rsize=8192,wsize=8192 0 0
Torque setup. (Optional)
Edit /etc/warewulf/vnfs/excludes-nfs and
append:
+ var/spool/pbs/
after the line:
+ var/lib/
This is required for xCAT Torque epilogue and prologue support to
control access to each node.
Edit /etc/warewulf/vnfs/excludes-nfs
and change:
/etc/security/
to
+ /etc/security/
Edit /etc/warewulf/vnfs/symlinks-nfs
and change:
%{vnfs}/%{link} etc/security
to
#%{vnfs}/%{link} etc/security
Now perform the homogeneous (and optionally the heterogeneous)
notes.
Additional Myrinet setup. This step is to be done after the
Myrinet step above. (Optional)
Obtain the name of the GM directory, e.g.:
ls -d /vnfs/$WWIMAGE/opt/gm-*
output:
/vnfs/compute-x86_64-hybrid/opt/gm-2.0.17-2.4.21-27.ELsmp-x86_64
Note the GM directory name above (gm-2.0.17-2.4.21-27.ELsmp-x86_64 in this
example) and append to /etc/warewulf/vnfs/excludes-nfs
(IMPORTANT: do not skip a single slash or add any extra):
#GM stuff
/opt/gm-2.0.17-2.4.21-27.ELsmp-x86_64/
Append to /etc/warewulf/vnfs/symlinks-nfs
(IMPORTANT: do not skip a single slash or add any extra):
#GM stuff
%{vnfs}/%{link} opt/gm-2.0.17-2.4.21-27.ELsmp-x86_64
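As a hedged convenience, the two appends above could be scripted; GMDIR is a helper variable introduced here (assumes exactly one gm-* directory in the VNFS):
GMDIR=$(basename /vnfs/$WWIMAGE/opt/gm-*)   # e.g. gm-2.0.17-2.4.21-27.ELsmp-x86_64
echo "#GM stuff" >> /etc/warewulf/vnfs/excludes-nfs
echo "/opt/$GMDIR/" >> /etc/warewulf/vnfs/excludes-nfs
echo "#GM stuff" >> /etc/warewulf/vnfs/symlinks-nfs
echo "%{vnfs}/%{link} opt/$GMDIR" >> /etc/warewulf/vnfs/symlinks-nfs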
YUM or RPM install more packages into
$WWIMAGE. The minimalist
restrictions of RAM root only are lifted.
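For example (a hedged sketch; assumes YUM inside the VNFS is configured as in the homogeneous notes, and strace is just a placeholder package):
wwvnfs.chroot --vnfs=$WWIMAGE
yum -y install strace   # any additional package(s)
exit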
Build image and boot per homogeneous/heterogeneous notes.
Local HD Notes (WIP)
Accessing the local HD is no different than with diskfull nodes. However, since no OS was installed, partitions must be created and formatted for file systems and swap. Consider scripting sfdisk, e.g. (a sketch follows the steps below):
Boot up a node.
SSH/RSH to that node.
Use fdisk to setup disk.
Run something like 'sfdisk /dev/sda -d >{NFS mount point}/sda.sfdisk' to save partition information.
Write a script to:
A. Dump partition information.
B. Compare partition information with {NFS mount point}/sda.sfdisk; if no difference jump to E.
C. If different then run something like 'sfdisk /dev/sda <{NFS mount point}/sda.sfdisk'.
D. Use mke2fs and mkswap to format partitions.
E. Create mount points, mount file systems.
F. Run swapon for each swap partition.
Put script in rc.local.
There are many ways to do this, the above is just an example.
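As one such example, here is a minimal sketch of steps A-F for rc.local, under stated assumptions: the disk is /dev/sda, the saved layout lives on an NFS mount at /install/post/sda.sfdisk, sda1 is swap, and sda2 is scratch space (all of these are illustrative, not from the original notes):
#!/bin/sh
NFS=/install/post                              # assumed NFS mount point
sfdisk -d /dev/sda > /tmp/sda.now              # A. dump current partition information
if ! cmp -s /tmp/sda.now $NFS/sda.sfdisk; then # B. compare with the saved layout
    sfdisk /dev/sda < $NFS/sda.sfdisk          # C. repartition to match
    mkswap /dev/sda1                           # D. format swap and file system
    mke2fs -j /dev/sda2
fi
mkdir -p /scratch                              # E. create mount point and mount
mount /dev/sda2 /scratch
swapon /dev/sda1                               # F. enable swap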
Warewulf Command Notes

Once your xCAT/WW cluster is built and running, use it as you would any other cluster; however, the following additional commands may be useful.
wwnode.list
shows a list of the currently running nodes in the cluster. You can use it
to make an MPI or PVM machine file, or just to get a list of
the nodes currently in the ready state (--help
for options). A sketch for building a machine file follows the example output below. E.g.:
# Nodes ready:
node13
node14
node2
node6
# Nodes in unknown state:
node15
node5
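A hedged sketch for turning the ready section of the output above into a machine file (the awk filter is based on the sample output, not on a documented wwnode.list option):
wwnode.list | awk '/unknown/ {exit} !/^#/ && NF {print}' > machines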
wwnode.stats displays the current usage summary for your cluster
by querying and pulling the information from the
warewulfd daemon directly. It
lists the current CPU, memory, swap, and network utilization (--help
for options). E.g.:
--------------------------------------------------------------------------------
Total CPU utilization: 0%                        Total Nodes:    6
                                                 Living:         6
        Warewulf                                 Unavailable:    0
   Cluster Statistics                            Disabled:       0
http://warewulf-cluster.org/                     Error:          0
                                                 Dead:           0
--------------------------------------------------------------------------------
 Node        CPU        Memory (MB)      Swap (MB)      Network (KB)    Current
 Name     [util/num]  [% used/total]  [% used/total]  [send - receive]  Status
node13     0% (2)       2% 37/1946         none            0 - 0        READY
node14     0% (2)       2% 35/1946         none            0 - 0        READY
node15     0% (2)       1% 38/5711         none            0 - 0        READY
node2      0% (2)       3% 25/1001         none            0 - 0        READY
node5      0% (2)       3% 104/4014        none            0 - 0        READY
node6      0% (2)       3% 110/4014        none            0 - 0        READY
wwnode.sync will
update nodes that are already booted so they are accessible to the users;
it updates these files on the nodes: /etc/passwd,
/etc/group,
/etc/hosts, and
/etc/hosts.equiv (--help
for options).
wwsumary shows a summary of
aggregate CPU utilization and the number of nodes ready (up and sync'd) (--help
for options). E.g.:
localhost: 0% CPU util, 6 Node(s) Ready
wwtop is the Warewulf 'top'
like monitor. It shows the nodes ordered by the highest utilization and
important statistics about each node and general summary type data. This is an
interactive curses based tool (? for
commands).
E.g.: Hybrid cluster of x86, x86_64, and ia64.
Top: 6 nodes, 12 cpus, 20062 MHz, 18.20 GB mem, 158 procs, uptime 0 days
Avg:  0% cputil,  58.00 MB memutil, load 0.00, 26 procs, uptime 0 day(s)
High: 0% cputil, 110.00 MB memutil, load 0.00, 27 procs, uptime 0 day(s)
Low:  0% cputil,  25.00 MB memutil, load 0.00, 25 procs, uptime 0 day(s)
Node status: 6 ready, 0 unavailable, 0 down, 0 unknown
18:38:06 mgmt1>
Node name   CPU  MEM SWAP Uptime   MHz ARCH   PROC Load Net:KB/s Stats/Util
node13       0%   2%   0%   0.05  3984 x86_64   27 0.00        0 |  IDLE  |
node14       0%   2%   0%   0.05  3984 x86_64   27 0.00        0 |  IDLE  |
node15       0%   1%   0%   0.01  4386 x86_64   26 0.00        0 |  IDLE  |
node2        0%   3%   0%   0.05  1728 i686     27 0.00        0 |  IDLE  |
node5        0%   3%   0%   0.01  2990 ia64     25 0.00        0 |  IDLE  |
node6        0%   3%   0%   0.20  2990 ia64     26 0.00        0 |  IDLE  |

E.g.: PPC64 cluster with 3 nodes down.
Top: 3 nodes, 6 cpus, 13164 MHz, 11.31 GB mem, 87 procs, uptime 0 days
Avg:  0% cputil, 27.00 MB memutil, load 0.00, 29 procs, uptime 0 day(s)
High: 0% cputil, 28.00 MB memutil, load 0.00, 29 procs, uptime 0 day(s)
Low:  0% cputil, 27.00 MB memutil, load 0.00, 29 procs, uptime 0 day(s)
Node status: 3 ready, 0 unavailable, 0 down, 3 unknown
21:19:45 wopr.cluster>
Node name   CPU  MEM SWAP Uptime   MHz ARCH   PROC Load Net:KB/s Stats/Util
blade001     0%   1%   0%   0.06  4388 ppc64    29 0.00        0 |  IDLE  |
blade002     0%   1%   0%   0.06  4388 ppc64    29 0.00        0 |  IDLE  |
blade003     0%   1%   0%   0.06  4388 ppc64    29 0.00        0 |  IDLE  |
blade005   ---- ---- ---- ------ ----- ------- ---- ------ ------- |UN-KNOWN|
blade006   ---- ---- ---- ------ ----- ------- ---- ------ ------- |UN-KNOWN|
blade007   ---- ---- ---- ------ ----- ------- ---- ------ ------- |UN-KNOWN|
For a list of specific nodes, export NODES=node,node,node,... before running wwtop.
E.g. to use an xCAT noderange:
NODES=$(nr noderange) wwtop
E.g. to use PBS:
NODES=$(pbsjobnodes PBS_JOB_ID) wwtop
NODES=$(pbsusernodes USERID) wwtop
MPI Notes
No swap? This can be a problem for large MPI jobs, specifically MPICH. MPICH uses RSH or SSH to launch one process per processor. E.g. if you have a 1024 node cluster with 2 processors per node you have 2048 processors, and MPICH will launch 2048 instances of RSH or SSH. Each instance stays in memory until the job completes. (NOTE: RSH cannot support more than 512 processors because it only uses privileged ports (there are only 1024 of them) and RSH uses 2 ports per processor: one for STDOUT and one for STDERR.) 2048 SSH processes will take more than 1 GB of RAM. Normally for diskfull clusters this is not an issue; it gets swapped out as application needs grow to use more RAM.
Solutions:
Use LAM MPI. LAM has a
lamboot command to be run before mpirun
is run. lamboot will RSH/SSH
serially to each node and startup a lightweight daemon then exit. LAM's
mpirun will work with the lightweight
daemons to start the job. (Tested.)
E.g. for xhpl/LAM-IP:
export LAM=/usr/local/lam7
export PATH=$LAM/bin:$PATH
export LAMRSH=/usr/bin/ssh
export NP=$(wc -l $PBS_NODEFILE | awk '{print $1}')
lamboot $PBS_NODEFILE
mpirun -np $NP xhpl 2>&1 | tee output
Use MPICH MPD. MPICH MPD is like LAM's
lamboot. (Untested, failed to
work.)
Use mpiexec (if using
Torque). mpiexec is a replacement
program for the script mpirun, which is
part of the MPICH package. It is used to initialize a parallel job from within a
Torque batch or interactive environment.
mpiexec uses the task manager library of Torque to spawn copies of the
executable on the nodes in a Torque allocation. (Tested.)
mpiexec installation notes:
Install Torque/Maui per this document, the xCAT
mini-HOWTO, and benchmark-HOWTO.
Download mpiexec from
http://www.osc.edu/~pw/mpiexec/
and save in /tmp. Read the
documentation for more information.
Type:
cd /usr/local/pbs/$(uname -m)
ln -s ../include .
cd /tmp
tar zxvf mpiexec-0.77.tgz
cd mpiexec-0.77
./configure \
--with-pbs=/usr/local/pbs/$(uname -m) \
--with-default-comm=mpich-gm \
--prefix=/usr/local/pbs \
--bindir=/usr/local/pbs/$(uname -m)/bin
make
make install
To use it, put this in your .pbs script,
or run it manually after a qsub -I.
E.g. for xhpl/MPICH-GM:
export PATH=/usr/local/pbs/$(uname -m)/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/goto/lib:/usr/local/mpich/mpichgm-1.2.6..14a/gm/x86_64/smp/gnu64/ssh/lib/shared
export OMP_NUM_THREADS=1
mpiexec -comm mpich-gm xhpl
E.g. for xhpl/MPICH-IP:
export PATH=/usr/local/pbs/$(uname -m)/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/goto/lib
export P4_GLOBMEMSIZE=16000000
mpiexec -comm mpich-p4 xhpl
Use the new --gm-tree-spawn
option of MPICH-GM 1.2.6..14. It distributes the SSH connections over a tree
to limit resource usage per node for large jobs. (Tested.)
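A hedged usage sketch (the mpirun.ch_gm launcher and the placement of the option are assumptions; check the MPICH-GM documentation for your version):
mpirun.ch_gm --gm-tree-spawn -np $NP -machinefile $PBS_NODEFILE xhpl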
MTU 9000 Notes

Warewulf does not have native support for MTU 9000. Below are notes for the most complex solution (i.e. single NIC).
For MTU 9000 to function correctly everything must be MTU 9000.
The problem with a single NIC Warewulf solution is that PXE booting uses MTU
1500. The solution is to have two NICs on the server in the MTU 9000 VLAN
or switch. One NIC should be set to MTU 1500 and the other to MTU 9000.
The MTU 9000 NIC should be the primary cluster NIC. The MTU 1500 NIC will
be used for PXE/TFTP booting only.
In the notes below the primary cluster VLAN NIC is eth0 (MTU 9000), the PXE/TFTP
NIC is eth1 (MTU 1500), and the systems management VLAN NIC is eth2 (MTU 1500
required for many embedded devices).
NOTE: Your switch must support MTU 9000.
HINT: Cisco 3750 commands:
system mtu jumbo 9000 and switchport
access vlan number.
Setup management node NICs. Remember to set one for MTU
9000 and one for MTU 1500; both must be in the same VLAN. Each NIC must
be on a different subnet.
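For illustration, a hedged ifcfg fragment for the MTU 9000 NIC on the management node (the address and netmask are assumptions drawn from the sample /etc/hosts below):
# /etc/sysconfig/network-scripts/ifcfg-eth0 on the management node (assumed values)
DEVICE=eth0
ONBOOT=yes
IPADDR=172.90.0.1
NETMASK=255.255.0.0
MTU=9000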
Setup /etc/hosts per
this HOWTO and the xCAT mini HOWTO. E.g.:
172.90.0.1 mgmt1 mgmt1-admin mgmt1-clust mgmt1-sfs
172.20.0.1 mgmt1-eth1
172.30.0.1 mgmt1-eth2
172.90.1.1 blade001
172.20.1.1 blade001-eth1
172.90.1.2 blade002
172.20.1.2 blade002-eth1
...
Although the nodes are only using one NIC, two entries are required:
one for PXE/TFTP booting at MTU 1500 and the other at MTU 9000 for runtime.
Setup $XCATROOT/etc/mac.tab.
Because we are using the same NIC as eth1 (PXE/TFTP booting at MTU 1500) and as
eth0 (runtime MTU 9000) we must have the same MAC defined for both the physical
and virtual NIC. E.g.:
blade001-eth0 00:11:25:4a:35:a2
blade001-eth1 00:11:25:4a:35:a2
...
Update $XCATROOT/etc/noderes.tab.
The TFTP field must be the host name of the TFTP server NIC (e.g. mgmt1-eth1).
The rest of the fields should be unchanged.
E.g. old single NIC solution for MTU 1500.
compute mgmt1,mgmt1,/install,NA,N,N,N,Y,N,N,N,eth0,eth0,mgmt1
New single NIC solution for MTU 9000.
compute mgmt1-eth1,mgmt1,/install,NA,N,N,N,Y,N,N,N,eth1,eth0,mgmt1
The changed fields (the TFTP host mgmt1-eth1 and the NIC fields eth1 and eth0)
determine what host should be used in DHCP for TFTP booting,
the host NIC used for TFTP booting, and the host NIC used for the primary
runtime NIC. In the example above the virtual eth1 NIC will get a 172.20
address to TFTP download its kernel and initrd image from mgmt1-eth1
(172.20.0.1), but eth0 will use a 172.90 address for runtime and will
communicate with mgmt1 (172.90.0.1) at MTU 9000.
Update $XCATROOT/etc/site.tab
with any new networks for DNS, then update DNS with:
makedns
Update DHCP, type:
makedhcp noderange
Verify that /vnfs/$WWIMAGE/etc/resolv.conf
is correct for the MTU 9000 network.
Update/create /vnfs/$WWIMAGE/etc/sysconfig/network-scripts/ifcfg-eth0:
echo "MTU=9000" >> /vnfs/$WWIMAGE/etc/sysconfig/network-scripts/ifcfg-eth0
Rebuild network boot image.
Type:
>/vnfs/$WWIMAGE/var/log/lastlog
wwvnfs.build --vnfs=$WWIMAGE
rnetboot nodes, e.g.:
rnetboot noderange
PPC64 Notes

Follow homogeneous or heterogeneous notes with the following exceptions:
OpenFirmware does not have a network downloadable secondary boot
loader like x86/x86_64 (pxelinux.0) or
ia64 (elilo). A combined kernel/initrd
image must be created using the following instructions:
cp -f /tftpboot/$WWIMAGE.img.gz /usr/src/linux-2.4/arch/ppc64/boot/ramdisk.image.gz
cd /usr/src/linux-2.4
perl -pi -e 's/custom$//' Makefile
make distclean
cp -f configs/kernel-2.4.21-ppc64.config .config
make oldconfig
make dep
make zImage.initrd
cp -f /usr/src/linux-2.4/arch/ppc64/boot/zImage.initrd /tftpboot/$WWIMAGE.zimage
Do not delete .img.gz
files from /tftpboot.
The OpenFirmware combined kernel/initrd image must be less than
12MB in size. Hybrid RAM/NFS should be used.
NOTE: An image of 7582217 bytes was tested successfully; anything larger failed.
If you have a problem with this you may need to slim down your image.
There are 2 ways to do this.
A. Use the Warewulf exclude and symlink lists.
B. Use the following hack (you may want to run the script internals by
hand to be safe). WWIMAGE must be
exported first.
cd /tftpboot
cp $WWIMAGE.img.gz /tmp
cd /tmp
$XCATROOT/build/warewulf/slimit $WWIMAGE.img.gz
cp $WWIMAGE.img.gz /tftpboot
Rebuild the zImage.
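To verify the rebuilt image stays under the ~12MB OpenFirmware limit (a quick sanity check, not part of the original notes):
ls -l /tftpboot/$WWIMAGE.zimage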
Support

http://xcat.org
http://warewulf-cluster.org
Egan Ford
egan@us.ibm.com
February 2005