xCAT Diskless HOWTO (WIP)
What is Diskless?
Diskless means booting an OS from a source other than local disk (SAN booting
is NOT diskless). The most common method is network booting from firmware
(e.g. PXE, PXE64, CHRP) or floppy (e.g. Etherboot).
Diskless booting can be done on systems with or without a disk.
Why Diskless?
1000 locally installed nodes means 1000 images to manage, one image
per node. Although xCAT provides tools to ease the administration of large
clusters, there are still 1000 individual nodes to manage, and pitfalls still
exist.
Problem:
- If node 666 has a hardware problem that takes 3 days to diagnose and fix,
only 999 nodes are available for those 3 days. If local disk administration
was performed outside of a complete reinstall during that time, the steps
taken to modify the other 999 nodes will need to be repeated for node 666.
Often such actions are not documented and may be forgotten.
- Reinstallation is the safest method to guarantee that all nodes are
identical, but it can take a lot of time. Imaging solutions were created to
solve this problem; however, undocumented bloat can still occur. A 1000-node
install using Kickstart or Autoyast with xCAT staging can take 60 minutes.
- Drives have moving parts, and moving parts fail more often than any other
component. A 1000-node system with 1000 local HDs, each with an MTBF of
500,000 hours (standard IDE disk), should expect a node failure every 500
hours (about once every 3 weeks).
- Hot swap drives = hot swap identity theft.
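The drive-failure estimate in the list above is simple division. The sketch
below reproduces it, assuming independent drives with a constant failure rate,
so that the cluster-wide interval is the per-drive MTBF divided by the number
of drives. The function name hours_between_failures is hypothetical and exists
only for this illustration.

```shell
#!/bin/sh
# Expected hours between drive failures across a whole cluster.
# Assumption: drives fail independently at a constant rate, so the
# interval is simply (per-drive MTBF) / (number of drives).
hours_between_failures() {
    # $1 = per-drive MTBF in hours, $2 = number of drives
    echo $(( $1 / $2 ))
}

hours=$(hours_between_failures 500000 1000)
days=$(( hours / 24 ))   # integer division, rounds down
echo "One drive failure roughly every ${hours} hours (about ${days} days)"
```

With the figures quoted above this gives 500 hours, or about 20 days, which
matches the "once every 3 weeks" estimate.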
Solution:
- Diskless nodes maintain no state. Changes are lost on the next reboot, so
there is no need to document ad hoc temporary changes. Permanent changes
are made on centralized servers and are available on the next reboot.
- With proper design (staging nodes), a reboot into a new diskless image
should take less than 5 minutes for all 1000 nodes. As with imaging
solutions, undocumented bloat can occur with diskless images; however, there
is a penalty (high use of RAM) if the images are not kept under control.
- No drives to fail. Only need to worry about fans and power supplies
as the low MTBF components. The high availability of blade-based
solutions can mitigate this.
- Locally stored authentication data may be vulnerable to theft.
Diskless solutions do not store any user information.
Why not Diskless?
- Heavy RAM usage. A chunk of RAM must be sacrificed to mount the root
file system. A reasonable RAM root solution (one that can run MPI
applications) could use as much as 10% of RAM (source: Warewulf x86_64 system
with 1GB of RAM).
- Heavy NFS I/O. To reduce the size of the RAM root, put more on NFS.
This may require a faster, more capable network and NFS server solution.
Without proper design, NFS could become a single point of failure.
- No swap. Using MPICH with SSH, a 1000-node system may use 500MB of
RAM (x86 system tested) just for all the SSH sessions. With a queuing
system in place, usually the first of the assigned nodes starts MPICH with
an SSH command to each processor. This is a lot of RAM that lingers
until the end of the job. On a node with swap, this usually gets moved to
swap as the MPI application uses RAM. LAM and MPICH both have solutions to
resolve this. MPICH MPD is not very popular and has had issues with stability
and scalability.
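To get a feel for how the RAM costs listed above add up, here is a rough
budget for the node that launches an MPI job. It is only a sketch: the 10% RAM
root figure and the 500MB SSH figure come from two separate tests quoted
above, and applying both to one 1GB node is an assumption made purely for
illustration. The function name ram_left_mb is hypothetical.

```shell
#!/bin/sh
# Rough RAM budget for the node that starts an MPI job over SSH.
# Assumption: both figures quoted in the HOWTO apply to the same node.
total_mb=1024      # 1GB of RAM
ramroot_pct=10     # share of RAM used by the RAM root
ssh_mb=500         # RAM held by the SSH sessions of a 1000-node job

ram_left_mb() {
    # $1 = total MB, $2 = RAM root percent, $3 = MB held by SSH sessions
    echo $(( $1 - ($1 * $2 / 100) - $3 ))
}

left=$(ram_left_mb "$total_mb" "$ramroot_pct" "$ssh_mb")
echo "RAM left for the application: ${left} MB"
```

Under these assumptions less than half of the RAM remains for the application
on that node, and without swap none of the overhead can be paged out.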
Hybrid solutions
- Diskfull nodes with diskless boot. The disk is not used for the OS,
but for swap and application data only. There are a number of HPC
applications that have very high aggregate I/O requirements. A common
solution is to stage the data on the local disks. This hybrid solves the
problems of heavy NFS I/O, security (no local authentication data), and swap.
Nodes with bad disks can still be used for diskless applications.
Diskless and diskfull attributes can be applied to a queuing system to help
users target the right resources.
- SystemImager. Although SystemImager is not a diskless solution, it is
worth mentioning because it was designed to solve a lot of the same problems
that diskless solves, i.e. rapid install/boot and centralized file
system management. SystemImager, like diskless solutions, maintains a
central set of root file systems that can be altered and pushed out.
Complete reinstalls with SystemImager multicast can take as little as 5
minutes.
xCAT Diskless and Diskfull operational differences
xCAT's installation and OS boot support essentially network boots a kernel
and initrd image. Since any installation method is essentially a diskless
OS boot, reason would dictate that diskless support in xCAT would be
identical. The Network Boot, Bootloader, and Installer sections of the
nodeinstall-HOWTO still apply to diskless.
However, there are two differences.
- nodeset noderange install is for diskfull, and nodeset noderange netboot
is for diskless.
- rinstall noderange is really a front-end to
nodeset noderange install;rpower noderange boot. Similarly,
rnetboot noderange is a front-end to
nodeset noderange netboot;rpower noderange boot. The initial reason for
rinstall and rnetboot to be different commands was for users to experiment
with diskless on nodes that already have an OS installed without
accidentally affecting the installed OS. Another reason for the difference
is that install states change to boot states after install, whereas netboot
states are always netboot. A future version of xCAT may define the boot
state as diskless or diskfull in yet another .tab file.
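The relationship between the two front-ends and the commands they wrap can be
sketched as a dry run. The script below only prints the command pairs named
above and does not call xCAT. The function names show_rinstall_steps and
show_rnetboot_steps and the noderange node001-node010 are placeholders chosen
for this illustration.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) the command pairs that the
# HOWTO says rinstall and rnetboot wrap. Function names are hypothetical.

show_rinstall_steps() {
    # $1 = noderange; diskfull path: set the install state, then boot
    printf 'nodeset %s install\n' "$1"
    printf 'rpower %s boot\n' "$1"
}

show_rnetboot_steps() {
    # $1 = noderange; diskless path: set the netboot state, then boot
    printf 'nodeset %s netboot\n' "$1"
    printf 'rpower %s boot\n' "$1"
}

# Placeholder noderange, used only for illustration.
show_rinstall_steps node001-node010
show_rnetboot_steps node001-node010
```

The only difference between the two sequences is the state passed to nodeset,
which is why the separate front-ends mainly serve as a safeguard against
reinstalling a node by accident.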
Diskless support in xCAT is still evolving and there may be more (or
preferably fewer) differences in the future.
xCAT Diskless solutions:
- Warewulf (warewulf-HOWTO)
- DIM (WIP)
Support
http://xcat.org
Egan Ford
egan@us.ibm.com
January 2005