xCAT Diskless HOWTO (WIP)

What is Diskless?

Diskless means booting an OS from a source other than a disk (SAN booting is NOT diskless). The most common methods are network booting from firmware (e.g. PXE, PXE64, CHRP) or from floppy (e.g. Etherboot).

Diskless booting can be done on systems with or without disk.

Why Diskless?

1000 locally installed nodes means 1000 images to manage, one image per node. Although xCAT provides tools to ease the administration of large clusters, there are still 1000 individual nodes to manage, and pitfalls still exist.

Problem:

If node 666 has a hardware problem that takes 3 days to diagnose and fix, then for 3 days only 999 nodes are available. If local disk administration was performed outside of a complete reinstall during that time, the steps taken to modify the other 999 nodes will need to be repeated for node 666. Often such actions are not documented and may be forgotten.
Reinstallation is the safest method to guarantee that all nodes are equal, but it can take a lot of time. Imaging solutions were created to solve this problem; however, undocumented bloat can still occur. A 1000 node install using Kickstart or AutoYaST with xCAT staging can take 60 minutes.
Drives have moving parts, and moving parts fail more often than any other component. A 1000 node system with 1000 local HDs, each with an MTBF of 500,000 hours (standard IDE disk), should expect a drive failure, and therefore a node failure, every 500 hours (about once every 3 weeks).
Hot swap drives = hot swap identity theft.
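
The drive failure estimate above can be checked with a few lines of Python. This is only a sketch: it assumes drives fail independently at a constant rate, so the cluster-wide interval is simply the per-drive MTBF divided by the number of drives. The function name is made up for illustration.

```python
# Expected time between drive failures across a cluster, assuming
# independent drives with a constant failure rate (a simplification).

HOURS_PER_WEEK = 24 * 7


def cluster_failure_interval_hours(drive_mtbf_hours: float, drive_count: int) -> float:
    """Mean hours between drive failures anywhere in the cluster."""
    if drive_count <= 0:
        raise ValueError("drive_count must be positive")
    return drive_mtbf_hours / drive_count


# Figures from the text: 1000 drives, 500,000 hours MTBF each.
interval = cluster_failure_interval_hours(500_000, 1000)
print(f"{interval:.0f} hours, about {interval / HOURS_PER_WEEK:.1f} weeks")
# prints "500 hours, about 3.0 weeks"
```

Doubling the node count halves the interval, which is why local disks become a larger administrative burden as clusters grow.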

Solution:

Diskless nodes maintain no state. Changes are lost on the next reboot, so there is no need to document ad hoc temporary changes. Permanent changes are made on centralized servers and take effect on the next reboot.
With proper design (staging nodes), a reboot into a new diskless image should take less than 5 minutes for all 1000 nodes. As with imaging solutions, undocumented bloat can occur with diskless images; however, there is a penalty (high use of RAM) if the images are not kept under control.
No drives to fail. Only fans and power supplies remain as low MTBF components. The high availability of blade-based solutions can mitigate this.
Locally stored authentication data may be vulnerable to theft. Diskless solutions do not store any user information locally.

Why not Diskless?

Heavy RAM usage. A chunk of RAM must be sacrificed to hold the root file system. A reasonable RAM root solution (one that can run MPI applications) could use as much as 10% of RAM (source: Warewulf x86_64 system with 1GB of RAM).
Heavy NFS I/O. To reduce the size of the RAM root, put more on NFS. This may require a faster, more capable network and NFS server solution. Without proper design, NFS could become a single point of failure.
No swap. Using MPICH with SSH, a 1000 node system may use 500MB of RAM (x86 system tested) just for the SSH sessions. With a queuing system in place, the first of the assigned nodes usually starts MPICH with an SSH command to each processor. That RAM lingers until the end of the job. On a node with swap it usually gets paged out as the MPI application uses RAM. LAM and MPICH both have solutions for this; however, MPICH MPD is not very popular and has had issues with stability and scalability.
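
To make the RAM penalty concrete, the following Python sketch combines the figures quoted above into a rough budget for the node that launches an MPI job. It is illustrative only: it assumes that the 10% RAM root figure and the 500MB SSH figure, which were measured on different systems, apply to the same 1GB node, and that nothing can be paged out because there is no swap. The function name is hypothetical.

```python
# Rough RAM budget for the launching node of a diskless cluster.
# Assumption: the RAM root and SSH figures from the text are combined
# on one 1 GB node, which the original measurements did not do.

def ram_left_for_application_mb(total_mb: float,
                                ram_root_fraction: float,
                                ssh_sessions_mb: float) -> float:
    """RAM (in MB) left after the RAM root and lingering SSH sessions."""
    ram_root_mb = total_mb * ram_root_fraction
    return total_mb - ram_root_mb - ssh_sessions_mb


total_mb = 1024.0  # 1 GB node
left_mb = ram_left_for_application_mb(total_mb, 0.10, 500.0)
print(f"RAM root: {total_mb * 0.10:.0f} MB, SSH sessions: 500 MB, "
      f"left for the application: {left_mb:.0f} MB")
```

Under these assumptions well under half of the node's RAM remains for the application, which is the case the hybrid solutions below are meant to address.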

Hybrid solutions

Diskfull nodes with diskless boot. The disk is not used for the OS, only for swap and application data. A number of HPC applications have very high aggregate I/O requirements, and a common solution is to stage the data on the local disks. This hybrid solves the problems of heavy NFS I/O, security (no local authentication data), and swap. Nodes with bad disks can still be used for diskless applications. Diskless and diskfull attributes can be defined in a queuing system to help users target the right resources.
SystemImager. Although SystemImager is not a diskless solution, it is worth mentioning because it was designed to solve many of the same problems that diskless solves, i.e. rapid install/boot and centralized file system management. SystemImager, like diskless solutions, maintains a central set of root file systems that can be altered and pushed out. Complete reinstalls with SystemImager multicast can take as little as 5 minutes.

xCAT Diskless and Diskfull operational differences

xCAT's installation and OS boot support essentially network boots a kernel and initrd image. Since any installation method is essentially a diskless OS boot, it stands to reason that diskless support in xCAT works the same way. The Network Boot, Bootloader, and Installer sections of the nodeinstall-HOWTO still apply to diskless.

However there are two differences.

nodeset noderange install is for diskfull, and nodeset noderange netboot is for diskless.
rinstall noderange is really a front-end to nodeset noderange install;rpower noderange boot. Similarly, rnetboot noderange is a front-end to nodeset noderange netboot;rpower noderange boot. The initial reason for making rinstall and rnetboot different commands was to let users experiment with diskless on nodes that already have an OS installed without accidentally affecting the installed OS. Another reason for the difference is that install states change to boot states after the install, whereas netboot states always remain netboot. A future version of xCAT may define the boot state as diskless or diskfull in yet another .tab file.
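
The relationship between the front-end commands and the commands they wrap can be sketched in Python. This only builds the command lines described above and does not run them. The function name and the example noderange node001-node010 are hypothetical and chosen for illustration.

```python
# Expand rinstall/rnetboot into the nodeset and rpower commands they
# front-end, as described in the text. Nothing is executed here.
from typing import List

# Front-end command -> state passed to nodeset.
FRONTEND_STATES = {"rinstall": "install", "rnetboot": "netboot"}


def expand_frontend(command: str, noderange: str) -> List[str]:
    """Return the equivalent nodeset/rpower command lines."""
    if command not in FRONTEND_STATES:
        raise ValueError(f"unknown front-end command: {command}")
    state = FRONTEND_STATES[command]
    return [f"nodeset {noderange} {state}", f"rpower {noderange} boot"]


# Hypothetical noderange, for illustration only.
for line in expand_frontend("rnetboot", "node001-node010"):
    print(line)
```

The only difference between the two expansions is the state given to nodeset, which is what keeps a diskless experiment from touching an installed OS.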

Diskless support in xCAT is still evolving, and there may be more (or preferably fewer) differences in the future.

xCAT Diskless solutions:

Warewulf (warewulf-HOWTO)
DIM (WIP)

Support

http://xcat.org

Egan Ford
egan@us.ibm.com
January 2005