xCAT Warewulf (wareCAT) HOWTO (WIP)

Introduction
Installation Notes (Homogeneous)
Advanced Notes (Heterogeneous)
Hybrid RAM/NFS File System
Local HD Notes
Warewulf Command Notes
MPI Notes
MTU 9000 Notes
PPC64 Notes
Support

Introduction

Warewulf is a popular, complete diskless solution for HPC clusters, and xCAT is a complete diskfull solution for large HPC clusters.  wareCAT is support added to xCAT for Warewulf, i.e. xCAT using Warewulf, not Warewulf using xCAT.  xCAT is the primary cluster management package; Warewulf is used for maintaining and monitoring diskless images and nodes.

In a nutshell, Warewulf creates a number of kernel and initrd images and places them in /tftpboot.  Nodes boot these images over the network using DHCP and TFTP.  xCAT provides a scalable DHCP and TFTP infrastructure.

Supported configurations:

  1. Homogeneous Simple.  A single xCAT/WW management node serving all nodes.  This single node acts as the only DHCP and TFTP server.  This solution scales to around 128 nodes; if staggered boot is used, it can scale to 1000s of nodes.  A reasonable limit would be 1024.


     

  2. Homogeneous Scale.  A single xCAT/WW management node with multiple other nodes providing some services to increase scaling.  This can scale to 100s and 1000s of nodes.  xCAT already provides this support for large diskfull clusters and it can be applied to WW as well.  The number of nodes and the network infrastructure determine the number of DHCP and TFTP servers required for a parallel reboot with no DHCP/TFTP timeouts.  The number of DHCP servers does not need to equal the number of TFTP servers.  TFTP servers NFS mount the /tftpboot directory read-only from the management node to provide a consistent set of kernel and initrd images.

    The xCAT/WW management node must export /tftpboot as read-only.  The additional TFTP servers must NFS mount /tftpboot from the management node (see the sketch below).  $XCATROOT/etc/noderes.tab must be configured by node attribute, group, or name to determine which nodes use which TFTP server.
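
    A minimal sketch of this arrangement, assuming the management node is named mgmt1 and one additional TFTP server is named tftp1 (a hypothetical host name):

    On mgmt1, /etc/exports:

    /tftpboot *(ro,async,no_root_squash)

    On tftp1, /etc/fstab:

    mgmt1:/tftpboot  /tftpboot  nfs timeo=14,intr 1 2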


     
  3. Heterogeneous.  A single xCAT/WW management node with one or more WW image management nodes (at least one per OS/ARCH type, e.g. rhas3/x86_64, rhas3u4/ia64).  The WW image management nodes may also double as TFTP servers for a heterogeneous scale solution.  Additional TFTP servers may also be added that are not WW image management nodes.  This can be configured as simple or scale.

    The xCAT/WW management node must export /tftpboot as read-write, /vnfs as read-write, /etc/warewulf as read-only, and /install as read-only.  The WW image management nodes must mount /tftpboot, /vnfs, /etc/warewulf, and /install from the management node.  The additional TFTP-only servers must mount /tftpboot from the management node.  $XCATROOT/etc/noderes.tab must be configured by node attribute, group, or name to determine which nodes use which TFTP server.

NOTE:  DHCP and TFTP setup can be as simple or as complicated as you like.  xCAT also supports sub DHCP servers for routed subnets sans DHCP forwarding for the ultimate in complexity.

NOTE:  Nodes with disks can toggle from diskfull to diskless.  e.g. nodeset noderange boot (or rboot noderange) for local disk, nodeset noderange netboot (or rnetboot noderange) for diskless boot.

NOTE:  Applications with heavy local file I/O requirements can still benefit from diskless administration, i.e. the cluster can be booted diskless but still access local disk.

This document is for xCAT 1.2.0 and Warewulf 2.2.4-2.

x86 (i386, i486, i586, i686) supported distributions:

x86_64 (Opteron and EMT64) supported distributions:

ia64 supported distributions:

ppc64 supported distributions:

NOTE:  SuSE OSes to be done.

NOTE:  2.6 kernel distributions to be done.


Installation Notes (Homogeneous)

  1. Install management node and xCAT per the xCAT mini HOWTO.  Follow all steps except node install.

    NOTE:  The copycds command must NOT be skipped.  It is needed for YUM.
     
  2. Got YUM?  Warewulf uses YUM to build the diskless images.

    Check with:

    yum --version

    No YUM?

    Type:

    rpm -i $XCATROOT/src/yum-2.0.8-1.noarch.rpm

    NOTE:  YUM 2.1.x unsupported.
     
  3. Edit YUM configuration (/etc/yum.conf).  Change remote directories to local directories created by copycds (usually /install/OSVER/ARCH).  For this document OSVER will be rhas3 and ARCH will be x86_64.

    e.g.:

    Change:

    [base]
    name=CentOS-$releasever - Base
    baseurl=http://mirror.centos.org/centos-3/$releasever/os/$basearch/


    to

    [base]
    name=CentOS-$releasever - Base
    baseurl=file:///install/rhas3/x86_64/RedHat/RPMS


    Change:

    [update]
    name=CentOS-$releasever - Updates
    baseurl=http://mirror.centos.org/centos-3/$releasever/updates/$basearch/
    gpgcheck=1

    #packages used/produced in the build but not released
    [addons]
    name=CentOS-$releasever - Addons
    baseurl=http://mirror.centos.org/centos-3/$releasever/addons/$basearch/
    gpgcheck=1

    #additional packages that may be useful
    [extras]
    name=CentOS-$releasever - Extras
    baseurl=http://mirror.centos.org/centos-3/$releasever/extras/$basearch/
    gpgcheck=1


    to

    [update]
    name=CentOS-$releasever - Updates
    baseurl=file:///install/post/updates/rhas3/x86_64


    Change all instances of gpgcheck from 1 to 0.
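
    One way to do this (a sketch using the same perl idiom used elsewhere in this HOWTO):

    perl -pi -e 's/gpgcheck=1/gpgcheck=0/' /etc/yum.conf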

    Sample /etc/yum.conf for RHAS3 file:

    [main]
    cachedir=/var/cache/yum
    debuglevel=2
    logfile=/var/log/yum.log
    pkgpolicy=newest
    distroverpkg=redhat-release
    tolerant=1
    exactarch=1

    [base]
    name=Red Hat Linux $releasever - $basearch - Base
    baseurl=file:///install/rhas3/x86_64/RedHat/RPMS
    gpgcheck=0

    [updates]
    name=Red Hat Linux $releasever - Updates
    baseurl=file:///install/post/updates/rhas3/x86_64
    gpgcheck=0

     
  4. Setup YUM for xCAT.  For this document OSVER will be rhas3 and ARCH will be x86_64.

    yum-arch /install/OSVER/ARCH/RedHat/RPMS/
    mkdir -p /install/post/updates/OSVER/ARCH
    yum-arch /install/post/updates/OSVER/ARCH


    e.g.:

    yum-arch /install/rhas3/x86_64/RedHat/RPMS/
    mkdir -p /install/post/updates/rhas3/x86_64
    yum-arch /install/post/updates/rhas3/x86_64

     
  5. Build MAKEDEV and perl-Term-Screen.

    Download from http://downloads.warewulf-cluster.org/DEPS/ and save in /tmp.

    Build with:

    cd /tmp
    rpm -i MAKEDEV-3.3.8-5.ww.src.rpm
    rpm -i perl-Term-Screen-1.02-2.caos.src.rpm
    cd /usr/src/redhat/SPECS
    rpmbuild -ba perl-Term-Screen.spec
    rpmbuild -ba MAKEDEV.spec

    PPC64 Note: rpmbuild --target ppc -ba MAKEDEV.spec
     
  6. Build Warewulf RPMs.

    Download from http://downloads.warewulf-cluster.org/ and save to /tmp.

    Build with:

    cd /tmp
    rpm -i warewulf-2.2.4-2.src.rpm
    cd /usr/src/redhat/SPECS
    rpmbuild -ba warewulf.spec

    PPC64 Note:

    patch -p0 <$XCATROOT/patch/warewulf/warewulf.spec.patch
    rpmbuild --target ppc -ba warewulf.spec
     
  7. Setup YUM for Warewulf.

    Type:

    yum-arch /usr/src/redhat/RPMS/

    Append to /etc/yum.conf:

    [local]
    name=Local repository
    baseurl=file:///usr/src/redhat/RPMS

     
  8. Install Warewulf.

    Type:

    yum install warewulf warewulf-tools
     
  9. Setup Warewulf.

    The previous steps follow the Warewulf documentation (http://warewulf-cluster.org/installation.html); however, the next steps are specific to a hybrid xCAT/Warewulf environment.  xCAT and Warewulf alter/rely on many of the same system files.  Since xCAT is a superset of Warewulf functionality, xCAT will maintain the system files.

    Out of the box, Warewulf will edit the following files:

    /etc/hosts
    /etc/exports
    /etc/dhcpd.conf
    /tftpboot/pxelinux.cfg/*


    In an xCAT environment this is undesirable.  /etc/hosts and /etc/exports will be edited manually.  /etc/dhcpd.conf and /tftpboot/pxelinux.cfg/* are managed by xCAT.
     
  10. Edit Warewulf /etc/warewulf/master.conf.

    nodename should equal the short node name of the Warewulf management node (e.g. mgmt1).  node prefix and node number format are used in WW-only environments for automatic discovery and setup; since xCAT controls this, leave the defaults.

    Change the boot, admin, sharedfs, and cluster devices to match the correct NIC (e.g. eth0).

    Sample configuration file:
    [network]
    # Hostname that the nodes know this host as
      nodename              = mgmt1
      node prefix           = node
    
    # The printf format string that would follow the % character, e.g.: 04d
      node number format    = 04d
    
    # WARNING: boot and admin devices should be the same on clusters that 
    # utilize a traditional net architecture (not FNN, or bonded)
      boot device           = eth0
      admin device          = eth0
      sharedfs device       = eth0
      cluster device        = eth0
    
  11. Edit Warewulf /etc/warewulf/vnfs.conf.

    This file defines the diskless images.  The defaults are a good start, but glibc.i386 must be changed to glibc and the disable virtual terminals = 1456 line must be commented out.

    Add a compute image, append:

    [compute]
       excludes file        = /etc/warewulf/vnfs/excludes
       sym links file       = /etc/warewulf/vnfs/symlinks
       fstab template       = /etc/warewulf/vnfs/fstab
       path                 = /compute


    NOTE:  To add an additional image (e.g. test) append:

    [test]
       excludes file        = /etc/warewulf/vnfs/excludes
       sym links file       = /etc/warewulf/vnfs/symlinks
       fstab template       = /etc/warewulf/vnfs/fstab
       path                 = /test


    Export WWIMAGE as the name of the image you are working on (e.g. compute), e.g.:

    export WWIMAGE=compute
     
  12. Edit Warewulf /etc/warewulf/nodes.conf.

    Edit the users and groups list and uncomment the pxeboot line.
     
  13. Setup NFS.

    Edit /etc/exports:

    Append:

    /usr/local *(ro,async,no_root_squash)
    /vnfs *(ro,async,no_root_squash)
    /home *(rw,async,root_squash)


    NOTE:  Only /vnfs is required if /home and /usr/local are serviced by another server.

    Type:

    service nfs restart
    chkconfig --level 345 nfs on

     
  14. Setup /etc/hosts and DNS.

    Alias your Warewulf management node host name entry with -admin, -clust, -sfs suffixes.

    E.g.:

    172.20.0.1        mgmt1 mgmt1-admin mgmt1-clust mgmt1-sfs

    Rebuild DNS, type:

    makedns
     
  15. Create VNFS.  NOTE:  If using a hybrid RAM/NFS file system, jump to the Hybrid RAM/NFS File System section before creating the VNFS.

    wwvnfs.create --vnfs=$WWIMAGE
     
  16. Put DNS info in VNFS.

    cp /etc/resolv.conf /vnfs/$WWIMAGE/etc/
     
  17. Add xCAT and /usr/local NFS support.

    Edit /etc/warewulf/vnfs/fstab and append:

    %{sharedfs ipaddr}:/opt/xcat /opt/xcat nfs ro,rsize=8192,wsize=8192 0 0
    %{sharedfs ipaddr}:/usr/local /usr/local nfs ro,rsize=8192,wsize=8192 0 0

    NOTE:  If xCAT, /home, or /usr/local are exported by another server, simply hardcode the IP or name in place of %{sharedfs ipaddr}.

    Create mount points, type:

    mkdir -p /vnfs/$WWIMAGE/opt/xcat
    mkdir -p /vnfs/$WWIMAGE/usr/local
     
  18. Setup xCAT sysconfig file.

    Type:

    cp /etc/sysconfig/xcat /vnfs/$WWIMAGE/etc/sysconfig/

    xCAT also needs /etc/redhat-release and xcat.sh, type:

    cp /etc/redhat-release /vnfs/$WWIMAGE/etc
    cp /etc/profile.d/xcat.sh /vnfs/$WWIMAGE/etc/profile.d

     
  19. Put NTP in VNFS.  (Optional, but recommended).

    Install NTP with YUM:

    yum --installroot /vnfs/$WWIMAGE install ntp

    Create NTP configuration files, type:

    cd /vnfs/$WWIMAGE
    mkdir -p etc/ntp
    chown ntp etc/ntp

    echo "server $(hostname -s)
    driftfile /etc/ntp/drift
    multicastclient
    broadcastdelay 0.008
    authenticate no
    keys /etc/ntp/keys
    trustedkey 65535
    requestkey 65535
    controlkey 65535" >etc/ntp.conf

    echo "$(hostname -s)" >etc/ntp/step-tickers

    wwvnfs.chroot --vnfs=$WWIMAGE
    chkconfig --level 345 ntpd on
    exit

     
  20. Put SSH in VNFS.  (Optional, but recommended).

    Edit /etc/warewulf/vnfs/excludes and append:

    + var/empty/sshd/

    after the line:

    + var/tmp/

    Install SSH  with YUM:

    yum --installroot /vnfs/$WWIMAGE install openssh-clients openssh-server

    Copy root SSH keys to VNFS:

    cp -r /root/.ssh /vnfs/$WWIMAGE/root

    Setup SSH on boot and generate keys, type:

    wwvnfs.chroot --vnfs=$WWIMAGE

    chkconfig --level 345 sshd on

    /usr/bin/ssh-keygen -t rsa1 -f /etc/ssh/ssh_host_key -C '' -N ''
    chmod 600 /etc/ssh/ssh_host_key
    chmod 644 /etc/ssh/ssh_host_key.pub
    /usr/bin/ssh-keygen -t rsa -f /etc/ssh/ssh_host_rsa_key -C '' -N ''
    chmod 600 /etc/ssh/ssh_host_rsa_key
    chmod 644 /etc/ssh/ssh_host_rsa_key.pub
    /usr/bin/ssh-keygen -t dsa -f /etc/ssh/ssh_host_dsa_key -C '' -N ''
    chmod 600 /etc/ssh/ssh_host_dsa_key
    chmod 644 /etc/ssh/ssh_host_dsa_key.pub

    exit
     
  21. Install HPC and xCAT dependencies.  (Optional, but recommended).

    Type:

    yum --installroot /vnfs/$WWIMAGE install perl pdksh libf2c elfutils-libelf libaio
     
  22. Got Serial?  Consider a serial terminal login.  (Optional, but recommended).

    Append something like:

    co:2345:respawn:/sbin/agetty ttyS0 19200 xterm

    to /vnfs/$WWIMAGE/etc/inittab, then:

    echo ttyS0 >>/vnfs/$WWIMAGE/etc/securetty

    PPC64 (JS20) Note:

    Use:

    co:2345:respawn:/sbin/agetty hvc0 38400 xterm

    Run:

    cd /vnfs/$WWIMAGE/
    echo hvc0 >>etc/securetty
    mknod -m 620 dev/hvc0 c 229 0
    chgrp tty dev/hvc0

     
  23. Install Myrinet RPM.  (See myrinet-HOWTO).  (Optional)

    E.g.:

    wwvnfs.rpm --vnfs $WWIMAGE --nodeps -i gm-2.0.16_Linux-2.4.21-20.EL.c0.x86_64.rpm

    Change GM to load after NFS, type:

    perl -pi -e 's/# chkconfig: 2345 11 60/# chkconfig: 2345 54 60/' /vnfs/$WWIMAGE/etc/rc.d/init.d/gm
    wwvnfs.chroot --vnfs=$WWIMAGE
    chkconfig --level 345 gm on
    exit


    Install RPM on master as well for development.

    Remove from /etc/warewulf/vnfs/master-includes all GM related files, i.e. remove:

    /dev/gm0
    /dev/gm1
    /dev/gm2
    /dev/gm3
    /dev/gm4
    /dev/gm5
    /dev/gm6
    /dev/gm7
    /dev/gmp0
    /dev/gmp1
    /dev/gmp2
    /dev/gmp3
    /dev/gmp4
    /dev/gmp5
    /dev/gmp6
    /dev/gmp7

    # Some defaults for Myrinet when installed in /usr/local
    /etc/init.d/gm
    /etc/rc3.d/S20gm
    /usr/local/bin/gm_allsize
    /usr/local/bin/gm_board_info
    /usr/local/bin/gm_board_mapping
    /usr/local/bin/gm_collide
    /usr/local/bin/gm_counters
    /usr/local/bin/gm_dirsend
    /usr/local/bin/gm_dma_malloc
    /usr/local/bin/gm_ethertune
    /usr/local/bin/gm_fault
    /usr/local/bin/gm_nway
    /usr/local/bin/gm_set_name
    /usr/local/bin/gm_set_speed
    /usr/local/bin/gm_simpleroute
    /usr/local/bin/gm_time_registration
    /usr/local/sbin/gm_mapper
    /usr/local/lib/libgm.so
    /usr/local/lib/libgm.so.0
    /usr/local/lib/libgm.so.0.0.0
    /usr/local/etc/gm/gm_config_info
    /usr/local/etc/gm/gm_libtool
    /usr/local/etc/gm/UTS_RELEASE

     
  24. Setup Torque.  (Optional).

    Edit /etc/warewulf/vnfs/excludes and append:

    + var/spool/pbs/

    after the line:

    + var/tmp/

    Append to /etc/warewulf/vnfs/master-includes /dev/pty*:

    ls -1 /dev/pty* >>/etc/warewulf/vnfs/master-includes

    Run the following script to set up the PBS directories:

    $XCATROOT/install/diskless/pbs_mom /vnfs/$WWIMAGE $PBS_SERVER

    where $PBS_SERVER is the short hostname of the PBS server.
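
    E.g., assuming the xCAT/WW management node mgmt1 doubles as the PBS server:

    export PBS_SERVER=mgmt1
    $XCATROOT/install/diskless/pbs_mom /vnfs/$WWIMAGE $PBS_SERVER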

    Optional, but recommended:  add support for /etc/security/access.conf.  Without this support any user can SSH/RSH to any node at any time.  xCAT's PBS startup and takedown script will manage /etc/security/access.conf.  To add this support type:

    $XCATROOT/install/diskless/setuppam /vnfs/$WWIMAGE

    Setup Torque on boot, type:

    wwvnfs.chroot --vnfs=$WWIMAGE

    chkconfig --level 345 pbs_mom on

    exit

    The rest of Torque setup is performed using the xCAT genpbs command, to be run after all nodes are up.
     

  25. Build network boot image.

    Type:

    >/vnfs/$WWIMAGE/var/log/lastlog
    wwvnfs.build --vnfs=$WWIMAGE


    IGNORE the Warewulf message:  NOTE: if any nodes use PXELINUX, you must rerun wwmaster.dhcpd!

    DO NOT run wwmaster.dhcpd; it will hose you.  (If you hose yourself, just type: makedhcp --new --allmac.)

    NOTE:  Copy the /tftpboot/compute.* files to all TFTP servers, unless the Warewulf server was installed by xCAT as a staging server, in which case this should be automatic.
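
    A minimal sketch of the manual copy, assuming two additional TFTP servers named tftp1 and tftp2 (hypothetical host names):

    for h in tftp1 tftp2; do
        scp /tftpboot/$WWIMAGE.* $h:/tftpboot/
    done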
     
  26. Setup and start wwautosyncd daemon.

    Type:

    cp -f $XCATROOT/rc.d/wwautosyncd /etc/rc.d/init.d
    chkconfig --level 345 wwautosyncd on
    service wwautosyncd restart

     
  27. Setup and start warewulfd daemon.

    Type:

    chkconfig --level 345 warewulfd on
    service warewulfd restart
     
  28. Boot up nodes!

    Verify the node entries in $XCATROOT/etc/nodetype.tab are:

    nodename    warewulf,ARCH,compute

    e.g.

    node13      warewulf,x86_64,compute
    node14      warewulf,x86_64,compute

    Then, type:

    rnetboot noderange

    NOTE:  Use wcons or wvid and try one node first.

    NOTE:  Use wwnode.list and wwtop to monitor nodes.
     
  29. Collect SSH host keys after boot:

    makesshgkh noderange
     
  30. Add users with addclusteruser.

    E.g.

    # addclusteruser
    Enter username: bob
    Enter group: users
    Enter UID (return for next): 500
    Enter absolute home directory root:
    addclusteruser: must start with /
    Enter absolute home directory root: /home
    Enter passwd (blank for random): NkeryW4a
    Creating home directory /home/bob
    Adding user: useradd -u 500 -g users -d /home/bob bob
    Setting passwd for bob to NkeryW4a
    Changing password for user bob.
    passwd: all authentication tokens updated successfully.

    Don't forget to run pushuser [noderange] bob


    IGNORE the pushuser command; use wwnode.sync for Warewulf.

    Edit /etc/warewulf/nodes.conf users and groups.

    Push out updates with:

    wwnode.sync
     
  31. Install compilers and libraries.  Read the xCAT HPC Benchmark HOWTO for details.
     
  32. Enjoy your cluster.  Do some work.
     

Advanced Notes (Heterogeneous)

  1. Complete the Homogeneous Installation Notes first.
     

  2. Install a diskfull node that matches the OS and ARCH of the target diskless nodes.  E.g. an x86 management node can use xCAT to install an x86_64 node with a different OS.  Install all packages.  This node will be referred to as the Warewulf Image Management node or WWIM.
     

  3. From the xCAT/WW management node (not the WWIM node) append to/edit /etc/exports:

    /install *(ro,async,no_root_squash)
    /tftpboot *(rw,async,no_root_squash)
    /vnfs *(rw,async,no_root_squash)
    /etc/warewulf *(ro,async,no_root_squash)

    Then export with:

    exportfs -a
     

  4. From the xCAT/WW management node (not the WWIM node) setup YUM for the new OS/ARCH.  copycds should have already been run against this new OS/ARCH since it was required to install the diskfull WWIM node.  Let's assume that your xCAT/WW management node is x86 (i686) based, that you want to add x86_64 based diskless nodes, and that your newly installed WWIM node is x86_64.  Let's also assume that you are running RHAS3 on both x86 and x86_64, but with the correct ARCH version for each node.

    The following commands should have already been run as part of the Homogeneous Installation Notes (above).

    yum-arch /install/rhas3/x86/RedHat/RPMS/
    mkdir -p /install/post/updates/rhas3/x86
    yum-arch /install/post/updates/rhas3/x86

    For the new OS/ARCH (assuming rhas3/x86_64) type:

    yum-arch /install/rhas3/x86_64/RedHat/RPMS/
    mkdir -p /install/post/updates/rhas3/x86_64
    yum-arch /install/post/updates/rhas3/x86_64
     

  5. From the xCAT/WW management node (not the WWIM node) edit /etc/warewulf/vnfs.conf.

    Add a compute-ARCH image, e.g. append:

    [compute-x86_64]
       excludes file        = /etc/warewulf/vnfs/excludes
       sym links file       = /etc/warewulf/vnfs/symlinks
       fstab template       = /etc/warewulf/vnfs/fstab
       path                 = /compute-x86_64


    NOTE:  for IA64/RHAS3 append:

    [compute-ia64]
       excludes file        = /etc/warewulf/vnfs/excludes
       sym links file       = /etc/warewulf/vnfs/symlinks
       fstab template       = /etc/warewulf/vnfs/fstab
       path                 = /compute-ia64
       include packages     = warewulf-vnfs glibc coreutils binutils


    NOTE:  Warewulf does not differentiate hardware architectures.  To avoid confusion and loss of images, please use the imagename-arch convention when defining heterogeneous images in vnfs.conf.
     

  6. From the WWIM node (not the xCAT/WW management node) setup /etc/fstab, append:

    xCATWW:/tftpboot      /tftpboot      nfs timeo=14,intr 1 2
    xCATWW:/install       /install       nfs timeo=14,intr 1 2
    xCATWW:/vnfs          /vnfs.master   nfs timeo=14,intr 1 2
    xCATWW:/etc/warewulf  /etc/warewulf  nfs timeo=14,intr 1 2

    Where xCATWW is the hostname of the xCAT/WW management node.
     

  7. From the WWIM node (not the xCAT/WW management node) create all of the mount points but mount /tftpboot and /install only (/etc/warewulf and /vnfs.master will be mounted in a later step), type:

    mkdir -p /tftpboot
    mount /tftpboot
    mkdir -p /install
    mount /install
    mkdir -p /etc/warewulf
    umount /etc/warewulf
    mkdir -p /vnfs.master
    umount /vnfs.master

     

  8. From the WWIM node (not the xCAT/WW management node) follow step 2. Got YUM? from the Homogeneous Installation Notes.
     

  9. From the WWIM node (not the xCAT/WW management node) follow step 3. Edit YUM configuration (/etc/yum.conf). from the Homogeneous Installation Notes.

    NOTE:  OS/ARCH must match WWIM.
     

  10. From the WWIM node (not the xCAT/WW management node) follow step 5. Build MAKEDEV and perl-Term-Screen. from the Homogeneous Installation Notes.
     

  11. From the WWIM node (not the xCAT/WW management node) follow step 6. Build Warewulf RPMs. from the Homogeneous Installation Notes.
     

  12. From the WWIM node (not the xCAT/WW management node) follow step 7. Setup YUM for Warewulf. from the Homogeneous Installation Notes.
     

  13. From the WWIM node (not the xCAT/WW management node) follow step 8. Install Warewulf. from the Homogeneous Installation Notes.
     

  14. From the WWIM node (not the xCAT/WW management node) patch wwvnfs.build to support WWIM nodes.

    cd /usr/sbin
    patch -p0 <$XCATROOT/patch/wwvnfs.build-2.2.4-2.patch

     

  15. From the WWIM node (not the xCAT/WW management node) copy /etc/hosts from xCAT/WW node.

    scp xCATWW:/etc/hosts /etc

    Where xCATWW is the hostname of the xCAT/WW management node.
     

  16. IA64/RHAS3 only notes:

    From the WWIM node (not the xCAT/WW management node) create a gzip'd kernel, type:

    cd /boot
    gzip -9 -c <vmlinux-2.4.21-27.EL >vmlinuz-2.4.21-27.EL
     

  17. Export WWIMAGE as image-arch.  E.g.:

    export WWIMAGE=compute-x86_64
     

  18. Mount /etc/warewulf and /vnfs.master, e.g.:

    mount /etc/warewulf
    mount /vnfs.master

     

  19. From the WWIM node (not the xCAT/WW management node) follow step 15. Create VNFS. from the Homogeneous Installation Notes.
     

  20. From the WWIM node (not the xCAT/WW management node) follow step 16. Put DNS info in VNFS. from the Homogeneous Installation Notes.
     

  21. From the WWIM node (not the xCAT/WW management node) follow step 17. Add xCAT and /usr/local NFS support. from the Homogeneous Installation Notes.

    NOTE:  Only create the VNFS mount points, e.g.:

    mkdir -p /vnfs/$WWIMAGE/opt/xcat
    mkdir -p /vnfs/$WWIMAGE/usr/local

     

  22. From the WWIM node (not the xCAT/WW management node) follow step 18. Setup xCAT sysconfig file. from the Homogeneous Installation Notes.
     

  23. From the WWIM node (not the xCAT/WW management node) follow step 19. Put NTP in VNFS. from the Homogeneous Installation Notes.  (Optional, but recommended).

    NOTE:  Replace $(hostname -s) with xCATWW.  Where xCATWW is the hostname of the xCAT/WW management node.
     

  24. From the WWIM node (not the xCAT/WW management node) follow step 20. Put SSH in VNFS. from the Homogeneous Installation Notes.  (Optional, but recommended).

    NOTE:  Skip /etc/warewulf/vnfs/excludes edit.  This was done as part of the Homogeneous Installation Notes.
     

  25. From the WWIM node (not the xCAT/WW management node) follow step 21. Install HPC and xCAT dependencies. from the Homogeneous Installation Notes.  (Optional, but recommended).
     

  26. From the WWIM node (not the xCAT/WW management node) follow step 22. Got Serial? from the Homogeneous Installation Notes.  (Optional, but recommended)
     

  27. From the WWIM node (not the xCAT/WW management node) follow step 23. Install Myrinet RPM. from the Homogeneous Installation Notes.  (Optional)
     

  28. From the WWIM node (not the xCAT/WW management node) follow step 24. Setup Torque. from the Homogeneous Installation Notes.  (Optional)
     

  29. From the WWIM node (not the xCAT/WW management node) follow step 25. Build network boot image. from the Homogeneous Installation Notes.
     

  30. Backup work to xCAT/WW management node.  From the WWIM node (not the xCAT/WW management node) type:

    >/vnfs/$WWIMAGE/var/log/lastlog
    rm -rf /vnfs.master/$WWIMAGE/
    cd /vnfs
    find $WWIMAGE -print | cpio -dump /vnfs.master/
     

  31. From the xCAT/WW management node (not the WWIM node), check /tftpboot and /vnfs for new heterogeneous image, e.g.:

    ls -l /tftpboot/compute-x86_64*
    ls -ld /vnfs/compute-x86_64

     

  32. Boot up nodes!

    Verify the node entries in $XCATROOT/etc/nodetype.tab are:

    nodename    warewulf,ARCH,image{-arch}

    e.g.

    node13      warewulf,x86_64,compute-x86_64
    node14      warewulf,x86_64,compute-x86_64

    NOTE:  ARCH is specified twice; this redundancy is required because WW does not directly support heterogeneous clusters.  Also note that imagename-arch is just a convention.  The image field in $XCATROOT/etc/nodetype.tab must match an entry in /etc/warewulf/vnfs.conf.

    Then, type:

    rnetboot noderange

    NOTE:  Use wcons or wvid and try one node first.

    NOTE:  Use wwnode.list and wwtop to monitor nodes.
     

  33. From the WWIM node (not the xCAT/WW management node) jump to step 29. Collect SSH host keys after boot. from the Homogeneous Installation Notes and continue from there.
     

Hybrid RAM/NFS File System

What?!?  My initrd image is huge (~32MB for x86_64) and I just lost a large chunk of short term memory (i.e. RAM ~100MB for x86_64) and I am having TFTP timeouts when I boot my 1024 node cluster. If this concerns you, consider WW Hybrid mode (~8MB initrd, ~20MB RAM usage).

What is Hybrid RAM/NFS?  Well, the notes above build a RAM root system: all OS system files are in memory and NFS is used only for user data and applications.  This large RAM root system can create TFTP scaling issues.  That can be addressed with multiple TFTP servers, which are easy to set up with xCAT and are recommended for any size initrd image.  However, the ~100MB of RAM used cannot be reduced unless OS system files are moved from RAM to NFS.  There is still some RAM used, but it is reduced to only ~20MB (x86_64 tested).

But... NFS could get pounded.  NFS can scale, and scale well; it is NFS mounting that has timeout problems.  A staggered boot or a script that retries the mounts can resolve this problem (a sketch follows below).  Once mounted, all should be fine.
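
A minimal sketch of such a retry loop, assuming an NFS mount point defined in the node's fstab (e.g. /opt/xcat); where it is hooked into the node boot scripts is up to you:

    # keep retrying an fstab-defined NFS mount until it succeeds
    until mount /opt/xcat; do
        echo "mount /opt/xcat failed, retrying in 5 seconds"
        sleep 5
    done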

NOTE:  You can have multiple NFS servers too.

Bottom line, test both.  A single WW cluster can have Hybrid and RAM root only nodes.

Get non-hybrid working first as a test.  Hybrid RAM/NFS works with homogeneous and heterogeneous clusters.  The steps below should be performed before creating the initial VNFS (see homogeneous notes).

  1. Create an /etc/warewulf/vnfs.conf entry like this:

    [compute-x86_64-hybrid]
       ramdisk nfs hybrid   = 1
       excludes file        = /etc/warewulf/vnfs/excludes-nfs
       sym links file       = /etc/warewulf/vnfs/symlinks-nfs
       fstab template       = /etc/warewulf/vnfs/fstab-nfs
       path                 = /compute-x86_64-hybrid
  2. Edit /etc/warewulf/vnfs/fstab-nfs like /etc/warewulf/vnfs/fstab per the homogeneous notes above; however, the %{vnfs dir} mount (the first line below) must also be added, e.g.:

    mgmt1:%{vnfs dir} %{vnfs dir} nfs ro,rsize=8192,wsize=8192 0 0
    mgmt1:%{sharedfs dir} %{sharedfs dir} nfs rw,rsize=8192,wsize=8192 0 0
    mgmt1:/opt/xcat /opt/xcat nfs ro,rsize=8192,wsize=8192 0 0
    mgmt1:/usr/local /usr/local nfs ro,rsize=8192,wsize=8192 0 0

     

  3. Torque setup.  (Optional)

    Edit /etc/warewulf/vnfs/excludes-nfs and append:

    + var/spool/pbs/

    after the line:

    + var/lib/

    This is required for xCAT Torque epilogue and prologue support to control access to each node.

    Edit /etc/warewulf/vnfs/excludes-nfs and change:

    /etc/security/

    to

    + /etc/security/

    Edit /etc/warewulf/vnfs/symlinks-nfs and change:

    %{vnfs}/%{link} etc/security

    to

    #%{vnfs}/%{link} etc/security
     

  4. Now perform the homogenous (and optionally the heterogeneous) notes.
     

  5. Additional Myrinet setup.  This step to be done after Myrinet step above.  (Optional)

    Obtain the name of the GM directory, e.g.:

    ls -d /vnfs/$WWIMAGE/opt/gm-*

    output:

    /vnfs/compute-x86_64-hybrid/opt/gm-2.0.17-2.4.21-27.ELsmp-x86_64

    Note the GM directory name in the output above and append it to /etc/warewulf/vnfs/excludes-nfs (IMPORTANT: do not skip a single slash or add any extra):

    #GM stuff
    /opt/gm-2.0.17-2.4.21-27.ELsmp-x86_64/


    Append to /etc/warewulf/vnfs/symlinks-nfs (IMPORTANT: do not skip a single slash or add any extra):

    #GM stuff
    %{vnfs}/%{link} opt/gm-2.0.17-2.4.21-27.ELsmp-x86_64
     

  6. YUM or RPM install more packages into $WWIMAGE.  The minimalist restrictions of RAM root only are lifted.
     

  7. Build image and boot per homogeneous/heterogeneous notes.
     

Local HD Notes (WIP)

Accessing the local HD is no different than with diskfull nodes.  However, since no OS was installed, partitions must be created and formatted for file systems and swap.  Consider scripting sfdisk, e.g.:

  1. Boot up a node.

  2. SSH/RSH to that node.

  3. Use fdisk to setup disk.

  4. Run something like 'sfdisk /dev/sda -d >{NFS mount point}/sda.sfdisk' to save partition information.

  5. Write a script to:

    A.  Dump partition information.
    B.  Compare partition information with {NFS mount point}/sda.sfdisk, if no difference jump to E.
    C.  If different, then run something like 'sfdisk /dev/sda <{NFS mount point}/sda.sfdisk'.
    D.  Use mke2fs and mkswap to format partitions.
    E.  Create mount points, mount file systems.
    F.  Run swapon for each swap partition.
     

  6. Put script in rc.local.

There are many ways to do this; the above is just an example.
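
For illustration, a minimal sketch of steps A-F, assuming a single disk /dev/sda, one data partition (sda1), one swap partition (sda2), and the saved layout in {NFS mount point}/sda.sfdisk (here called $NFSDIR; all names are illustrative):

    #!/bin/sh
    NFSDIR=/nfsmount              # hypothetical NFS mount point holding sda.sfdisk

    # A. dump the current partition information
    sfdisk -d /dev/sda >/tmp/sda.now 2>/dev/null

    # B. compare with the saved layout; C./D. repartition and format only if they differ
    if ! diff -q /tmp/sda.now $NFSDIR/sda.sfdisk >/dev/null 2>&1; then
        sfdisk /dev/sda <$NFSDIR/sda.sfdisk
        mke2fs -j /dev/sda1       # local scratch file system (ext3)
        mkswap /dev/sda2          # local swap
    fi

    # E. create mount points and mount file systems
    mkdir -p /scratch
    mount /dev/sda1 /scratch

    # F. enable swap
    swapon /dev/sda2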
 

Warewulf Command Notes

Once your xCAT/WW cluster is built and running, use it as you would any other cluster; however, the following additional commands may be useful.

wwnode.list shows a list of the currently running nodes in the cluster.  You can use it to make an MPI or PVM machine file, or just get a list of the nodes currently in the ready state (--help for options).  E.g.:

# Nodes ready:
node13
node14
node2
node6

# Nodes in unknown state:
node15
node5


wwnode.stats displays the current usage summary for your cluster by querying the warewulfd daemon directly.  It lists the current CPU, memory, swap, and network utilization (--help for options).  E.g.:

--------------------------------------------------------------------------------
Total CPU utilization: 0%                          
          Total Nodes: 6         
               Living: 6                           Warewulf
          Unavailable: 0                      Cluster Statistics
             Disabled: 0                 http://warewulf-cluster.org/
                Error: 0         
                 Dead: 0         
--------------------------------------------------------------------------------
 Node        CPU       Memory (MB)    Swap (MB)      Network (KB)   Current
 Name     [util/num] [% used/total] [% used/total] [send - receive] Status
node13       0% (2)    2% 37/1946        none           0 - 0       READY
node14       0% (2)    2% 35/1946        none           0 - 0       READY
node15       0% (2)    1% 38/5711        none           0 - 0       READY
node2        0% (2)    3% 25/1001        none           0 - 0       READY
node5        0% (2)   3% 104/4014        none           0 - 0       READY
node6        0% (2)   3% 110/4014        none           0 - 0       READY

wwnode.sync updates nodes that are already booted so they are accessible to users; it updates these files on the nodes: /etc/passwd, /etc/group, /etc/hosts, and /etc/hosts.equiv (--help for options).

wwsumary prints a summary of aggregate CPU utilization and the number of nodes ready (up and sync'd) (--help for options).  E.g.:

localhost:   0% CPU util, 6 Node(s) Ready

wwtop is the Warewulf 'top'-like monitor.  It shows the nodes ordered by highest utilization, along with important statistics about each node and general summary data.  This is an interactive curses-based tool (? for commands).

E.g.: a heterogeneous cluster of x86, x86_64, and ia64 nodes.

Top: 6 nodes, 12 cpus, 20062 MHz, 18.20 GB mem, 158 procs, uptime 0 days
Avg:    0% cputil,  58.00 MB memutil, load 0.00,  26 procs, uptime   0 day(s)
High:   0% cputil, 110.00 MB memutil, load 0.00,  27 procs, uptime   0 day(s)
Low:    0% cputil,  25.00 MB memutil, load 0.00,  25 procs, uptime   0 day(s)
Node status:    6 ready,    0 unavailable,    0 down,    0 unknown
18:38:06 mgmt1> 
Node name    CPU  MEM SWAP Uptime   MHz   ARCH  PROC   Load  Net:KB/s Stats/Util
node13        0%   2%   0%   0.05  3984  x86_64   27   0.00         0 |  IDLE  |
node14        0%   2%   0%   0.05  3984  x86_64   27   0.00         0 |  IDLE  |
node15        0%   1%   0%   0.01  4386  x86_64   26   0.00         0 |  IDLE  |
node2         0%   3%   0%   0.05  1728    i686   27   0.00         0 |  IDLE  |
node5         0%   3%   0%   0.01  2990    ia64   25   0.00         0 |  IDLE  |
node6         0%   3%   0%   0.20  2990    ia64   26   0.00         0 |  IDLE  |

E.g.: PPC64 cluster with 3 nodes down.
Top: 3 nodes, 6 cpus, 13164 MHz, 11.31 GB mem, 87 procs, uptime 0 days
Avg:    0% cputil,  27.00 MB memutil, load 0.00,  29 procs, uptime   0 day(s)
High:   0% cputil,  28.00 MB memutil, load 0.00,  29 procs, uptime   0 day(s)
Low:    0% cputil,  27.00 MB memutil, load 0.00,  29 procs, uptime   0 day(s)
Node status:    3 ready,    0 unavailable,    0 down,    3 unknown
21:19:45 wopr.cluster> 
Node name    CPU  MEM SWAP Uptime   MHz   ARCH  PROC   Load  Net:KB/s Stats/Util
blade001      0%   1%   0%   0.06  4388   ppc64   29   0.00         0 |  IDLE  |
blade002      0%   1%   0%   0.06  4388   ppc64   29   0.00         0 |  IDLE  |
blade003      0%   1%   0%   0.06  4388   ppc64   29   0.00         0 |  IDLE  |
blade005    ---- ---- ---- ------ ----- ------- ---- ------   ------- |UN-KNOWN|
blade006    ---- ---- ---- ------ ----- ------- ---- ------   ------- |UN-KNOWN|
blade007    ---- ---- ---- ------ ----- ------- ---- ------   ------- |UN-KNOWN|

For a list of specific nodes export NODES=node,node,node,... before running wwtop.

E.g. to use an xCAT noderange:

NODES=$(nr noderange) wwtop

E.g. to use PBS:

NODES=$(pbsjobnodes PBS_JOB_ID) wwtop
NODES=$(pbsusernodes USERID) wwtop



MPI Notes

No swap?  This can be a problem for large MPI jobs, specifically with MPICH.  MPICH uses RSH or SSH to launch one process per processor.  E.g. if you have a 1024 node cluster with 2 processors per node, you have 2048 processors and MPICH will launch 2048 instances of RSH or SSH.  Each instance stays in memory until the job completes.  (NOTE:  RSH cannot support more than 512 processors because it only uses privileged ports (there are only 1024 of them) and RSH uses 2 ports per processor, one for STDOUT and one for STDERR.)  At roughly half a megabyte or more each, 2048 SSH processes will take more than 1 GB of RAM.  Normally for diskfull clusters this is not an issue; the memory gets swapped out as application needs grow.

Solutions:

  1. Use LAM MPI.  LAM has a lamboot command to be run before mpirun is run.  lamboot will RSH/SSH serially to each node and startup a lightweight daemon then exit.  LAM's mpirun will work with the lightweight daemons to start the job.  (Tested.)

    E.g. for xhpl/LAM-IP:

    export LAM=/usr/local/lam7
    export PATH=$LAM/bin:$PATH
    export LAMRSH=/usr/bin/ssh
    export NP=$(wc -l $PBS_NODEFILE | awk '{print $1}')

    lamboot $PBS_NODEFILE

    mpirun -np $NP xhpl 2>&1 | tee output

     

  2. Use MPICH MPD.  MPICH MPD is like LAM's lamboot.  (Untested, failed to work.)
     

  3. Use mpiexec (if using Torque).  mpiexec is a replacement program for the script mpirun, which is part of the MPICH package. It is used to initialize a parallel job from within a Torque batch or interactive environment. mpiexec uses the task manager library of Torque to spawn copies of the executable on the nodes in a Torque allocation.  (Tested.)

    mpiexec installation notes:

    Install Torque/Maui per this document, the xCAT mini-HOWTO, and benchmark-HOWTO.

    Download mpiexec from http://www.osc.edu/~pw/mpiexec/ and save in /tmp.  Read the documentation for more information.

    Type:

    cd /usr/local/pbs/$(uname -m)
    ln -s ../include .
    cd /tmp
    tar zxvf mpiexec-0.77.tgz
    cd mpiexec-0.77
    ./configure \
        --with-pbs=/usr/local/pbs/$(uname -m) \
        --with-default-comm=mpich-gm \
        --prefix=/usr/local/pbs \
        --bindir=/usr/local/pbs/$(uname -m)/bin
    make
    make install


    To use it, put something like the following in your .pbs script, or run it manually after a qsub -I.

    E.g. for xhpl/MPICH-GM:

    export PATH=/usr/local/pbs/$(uname -m)/bin:$PATH
    export LD_LIBRARY_PATH=/usr/local/goto/lib:/usr/local/mpich/mpichgm-1.2.6..14a/gm/x86_64/smp/gnu64/ssh/lib/shared
    export OMP_NUM_THREADS=1

    mpiexec -comm mpich-gm xhpl

    E.g. for xhpl/MPICH-IP:

    export PATH=/usr/local/pbs/$(uname -m)/bin:$PATH
    export LD_LIBRARY_PATH=/usr/local/goto/lib
    export P4_GLOBMEMSIZE=16000000

    mpiexec -comm mpich-p4 xhpl

     

  4. Use the new --gm-tree-spawn option of MPICH-GM 1.2.6..14.  It distributes the SSH connections in a tree to limit per-node resource usage for large jobs.  (Tested.)  A sketch follows.
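
    A minimal sketch, assuming MPICH-GM's mpirun.ch_gm launcher and a PBS node file (paths and arguments will vary with your installation):

    export NP=$(wc -l $PBS_NODEFILE | awk '{print $1}')
    mpirun.ch_gm --gm-tree-spawn -np $NP -machinefile $PBS_NODEFILE xhpl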
     

MTU 9000 Notes

Warewulf does not have native support for MTU 9000.  Below are notes for the most complex solution (i.e. single NIC).

For MTU 9000 to function correctly everything must be MTU 9000.  The problem with a single NIC Warewulf solution is that PXE booting uses MTU 1500.  The solution is to have two NICs on the server in the MTU 9000 VLAN or switch.  One NIC should be set to MTU 1500 and the other to MTU 9000.  The MTU 9000 NIC should be the primary cluster NIC.  The MTU 1500 NIC will be used for PXE/TFTP booting only.

In the notes below the primary cluster VLAN NIC is eth0 (MTU 9000), the PXE/TFTP NIC is eth1 (MTU 1500), and the systems management VLAN NIC is eth2 (MTU 1500 required for many embedded devices).

NOTE:  Your switch must support MTU 9000.

HINT:  Cisco 3750 commands:  system mtu jumbo 9000 and switchport access vlan number.

  1. Setup management node NICs.  Remember to set one NIC to MTU 9000 and the other to MTU 1500; both must be in the same VLAN.  Each NIC must be on a different subnet.  A sketch follows.
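
    E.g., sketches of the Red Hat ifcfg files, assuming the addresses used later in these notes and /16 netmasks (adjust device names, addresses, and masks to your site):

    /etc/sysconfig/network-scripts/ifcfg-eth0:

    DEVICE=eth0
    ONBOOT=yes
    IPADDR=172.90.0.1
    NETMASK=255.255.0.0
    MTU=9000

    /etc/sysconfig/network-scripts/ifcfg-eth1:

    DEVICE=eth1
    ONBOOT=yes
    IPADDR=172.20.0.1
    NETMASK=255.255.0.0
    MTU=1500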
     

  2. Setup /etc/hosts per this HOWTO and the xCAT mini HOWTO.  E.g.:

    172.90.0.1 mgmt1 mgmt1-admin mgmt1-clust mgmt1-sfs
    172.20.0.1 mgmt1-eth1
    172.30.0.1 mgmt1-eth2

    172.90.1.1 blade001
    172.20.1.1 blade001-eth1
    172.90.1.2 blade002
    172.20.1.2 blade002-eth1
    ...

    Although the nodes are only using one NIC, two entries are required: one for PXE/TFTP booting at MTU 1500 and the other at MTU 9000 for runtime.
     

  3. Setup $XCATROOT/etc/mac.tab.  Because we are using the same NIC as eth1 (PXE/TFTP booting at MTU 1500) and as eth0 (runtime at MTU 9000), we must have the same MAC defined for both the physical and virtual NIC.  E.g.:

    blade001-eth0 00:11:25:4a:35:a2
    blade001-eth1 00:11:25:4a:35:a2
    ...

     

  4. Update $XCATROOT/etc/noderes.tab.  The TFTP field must be the host name of the TFTP server NIC (e.g. mgmt1-eth1).  The rest of the fields should be unchanged.

    E.g. old single NIC solution for MTU 1500.

    compute mgmt1,mgmt1,/install,NA,N,N,N,Y,N,N,N,eth0,eth0,mgmt1

    New single NIC solution for MTU 9000.

    compute mgmt1-eth1,mgmt1,/install,NA,N,N,N,Y,N,N,N,eth1,eth0,mgmt1

    The mgmt1-eth1, eth1, and eth0 fields determine what host should be used in DHCP for TFTP booting, the host NIC used for TFTP booting, and the host NIC used for the primary runtime NIC.  In the example above the virtual eth1 NIC will get a 172.20 address to TFTP download its kernel and initrd image from mgmt1-eth1 (172.20.0.1), but eth0 will use a 172.90 address for runtime and will communicate with mgmt1 (172.90.0.1) at MTU 9000.
     

  5. Update $XCATROOT/etc/site.tab with any new networks for DNS, then update DNS with:

    makedns
     

  6. Update DHCP, type:

    makedhcp noderange
     

  7. Verify that /vnfs/$WWIMAGE/etc/resolv.conf is correct for the MTU 9000 network.
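
    E.g., a minimal sketch of /vnfs/$WWIMAGE/etc/resolv.conf, assuming mgmt1 (172.90.0.1) is the DNS server on the MTU 9000 network (the search domain is illustrative):

    search cluster
    nameserver 172.90.0.1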
     

  8. Update/create /vnfs/$WWIMAGE/etc/sysconfig/network-scripts/ifcfg-eth0:

    echo "MTU=9000" >> /vnfs/$WWIMAGE/etc/sysconfig/network-scripts/ifcfg-eth0
     

  9. Rebuild network boot image.

    Type:

    >/vnfs/$WWIMAGE/var/log/lastlog
    wwvnfs.build --vnfs=$WWIMAGE
     

  10. rnetboot nodes, e.g.:

    rnetboot noderange
     

PPC64 Notes

Follow homogeneous or heterogeneous notes with the following exceptions:

  1. OpenFirmware does not have a network downloadable secondary boot loader like x86/x86_64 (pxelinux.0) or ia64 (elilo).  A combined kernel/initrd image must be created using the following instructions:

    cp -f /tftpboot/$WWIMAGE.img.gz /usr/src/linux-2.4/arch/ppc64/boot/ramdisk.image.gz
    cd /usr/src/linux-2.4
    perl -pi -e 's/custom$//' Makefile
    make distclean
    cp -f configs/kernel-2.4.21-ppc64.config .config
    make oldconfig
    make dep
    make zImage.initrd
    cp -f /usr/src/linux-2.4/arch/ppc64/boot/zImage.initrd /tftpboot/$WWIMAGE.zimage
     

  2. Do not delete .img.gz files from /tftpboot.
     

  3. The OpenFirmware combined kernel/initrd image must be less than 12MB in size.  Hybrid RAM/NFS should be used.

    NOTE:  A 7582217 byte image was tested successfully; larger images failed.  If you have a problem with this you may need to slim down your image.  There are 2 ways to do this.

    A.  Use the Warewulf exclude and symlink lists.
    B.  Use the following hack (you may want to run the script internals by hand to be safe).  WWIMAGE must be exported first.

    cd /tftpboot
    cp $WWIMAGE.img.gz /tmp
    cd /tmp
    $XCATROOT/build/warewulf/slimit $WWIMAGE.img.gz
    cp $WWIMAGE.img.gz /tftpboot


    Rebuild the zImage.
     

Support

http://xcat.org
http://warewulf-cluster.org


Egan Ford
egan@us.ibm.com
February  2005