Getting started installing a large Linux cluster
Summary: Create a working Linux cluster from many separate pieces of hardware and software, including IBM System x and IBM TotalStorage systems. This part of this multipart series covers hardware configuration, including understanding architecture, planning logical network design, setting up terminal servers, and updating firmware.
Date: 06 Dec 2006
Level: Advanced
Introduction to the large Linux cluster series
This is the first of multiple articles that cover the installation and setup of a large Linux computer cluster. The aim of the series is to bring together in one place up-to-date information from various places in the public domain on the process required to create a working Linux cluster from many separate pieces of hardware and software. These articles are not intended, however, to provide the basis for the complete design of a new large Linux cluster. Refer to the reference materials and Redbooks under Resources for general architecture pointers.
The first two parts of this series address the base installation of the cluster and include an overview of the hardware configuration and installation using the IBM systems management software, Cluster Systems Management (CSM). The first article introduces you to the topic and takes you through hardware configuration. The second article covers management server configuration and node installation. Subsequent parts of the series deal with the storage back-end of the cluster. They cover the storage hardware configuration and the installation and configuration of the IBM shared file system, General Parallel File System (GPFS).
This series is intended for systems architects and systems engineers planning and implementing a Linux cluster using the IBM eServer Cluster 1350 framework (see Resources). Some parts might also be relevant to cluster administrators for educational purposes and during normal cluster operation.
Part 1: General cluster architecture
A good design is critically important before you undertake any configuration steps. The design has two parts: the physical hardware layout and the logical network design.
The example cluster (see Figure 1) consists entirely of Intel or AMD-based IBM Systems computers with attached TotalStorage subsystems (see Resources for more information about these systems). For simplicity, copper gigabit Ethernet cable provides cluster interconnection. This cable provides good speed in most circumstances, with bandwidth increases available between racks using bonded/port-channeled/etherchannel (insert-your-favourite-trunking-term-here) links.
The network topology takes a star shape, with all racks connecting back to a main switch in the management rack. The example cluster uses three networks: one for management/data (the compute network), one for the clustered file system (the storage network), and one for administrative device management. The first two networks are normal IP networks. The compute network is used for most tasks, including inter-process communications (such as MPI) and cluster management. The storage network is used exclusively for clustered file system communication and access.
Some additional design and layout details for the example cluster include:
After you assemble the racks and put them into place with all cabling completed, there is still a large amount of hardware configuration. Specific cabling details of any particular cluster are not covered in this article. The hardware configuration steps required before cluster installation are described with some specific examples using the example cluster design outlined above.
One of the most commonly overlooked tasks when installing a cluster is the logical network design. Ideally, the logical design should be on paper before cluster implementation. Once you have the logical network design, use it to create a hosts file. In a small cluster, you can write out the hosts file manually if there are not many devices on the network. However, it is usually best to produce a naming convention and write a custom script to produce the file.
Ensure all the devices on the network are represented in the hosts file. Examples include the management server (mgmt001), storage servers (stor001), user nodes (user001), scheduler nodes (schd001), and compute nodes (node001).
This naming convention covers only the five types of computer systems in the network and only one network, which is not nearly good enough. There are also the storage network and compute networks to factor in, plus a device management network. So this file needs to be expanded. Each node requiring access to the clustered file system needs an address on the storage network. Each node requires two addresses on the compute network: one for the compute address and another for the Baseboard Management Controller (BMC), which is used for hardware monitoring and power control. Table 1 outlines a much more comprehensive naming convention with example IP address ranges.
Table 1. Naming convention with example IP address ranges

| Device | Compute 192.168.0.0/24 | BMC 192.168.0.0/24 | Storage 192.168.1.0/24 | Device 192.168.2.0/24 | External (ext n/w) |
|---|---|---|---|---|---|
| Management server | mgmt001 | mgmt001_d | mgmt001_s | mgmt001_m | mgmt001_e |
| Storage server | stor001 | stor001_d | stor001_s | stor001_m | stor001_e |
| User nodes | user001 | user001_d | user001_s | none | none |
| Scheduler nodes | schd001 | schd001_d | schd001_s | none | none |
| Compute nodes | node001 | node001_d | node001_s | none | none |
| Compute switches | none | none | none | gigb01a | none |
| Storage switches | none | none | none | gigb01b | none |
| Terminal servers | none | none | none | term001 | none |
| Storage controller A/B | none | none | none | disk01a/b | none |
| LCM/KVM/RCM | none | none | none | cons001 | none |
When implemented, this scheme produces a hosts file like the example you can access under Downloads. This small example cluster consists of sixteen compute nodes, one management server, one storage server, one user node, and one scheduler node in two racks with the relevant devices attached. While not a large cluster, it is sufficient as an example, and you can easily extend the scheme to represent far larger clusters if required.
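A convention like the one in Table 1 lends itself to script generation rather than hand-editing. The shell sketch below emits hosts entries for the compute nodes on the three networks. The exact IP layout (compute addresses counting up from .1, BMC addresses offset by 100 on the same subnet) is an assumption for illustration only, not part of the article's scheme:

```shell
# gen_hosts NUM -- emit /etc/hosts entries for NUM compute nodes following
# the Table 1 naming convention. The address layout (nodes from .1 upward,
# BMC addresses offset by 100 on the compute subnet) is a hypothetical
# choice for this sketch.
gen_hosts() {
    count=$1
    i=1
    while [ "$i" -le "$count" ]; do
        name=$(printf 'node%03d' "$i")
        printf '192.168.0.%s\t%s\n'   "$i"           "$name"    # compute address
        printf '192.168.0.%s\t%s_d\n' "$((i + 100))" "$name"    # BMC address
        printf '192.168.1.%s\t%s_s\n' "$i"           "$name"    # storage address
        i=$((i + 1))
    done
}
```

Appending the output of `gen_hosts 16` to hand-written entries for the management, storage, user, and scheduler nodes and the attached devices gives a hosts file along the lines of the downloadable example.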
There are two physical networks: one for compute traffic and one for storage. A standard 32 nodes per rack requires two 48-port switches in each rack, one for each network. In smaller clusters, the management rack also requires two of the same switches. For larger clusters, 48 ports might not be enough, so a larger central switch might be required.
Each switch for the two main networks (ignoring the device management network) requires a slightly different configuration because, as in the example, the Gigabit Ethernet interconnects use jumbo frames for the storage network and a standard frame size for the compute network. The device management network setup is usually very simple: a flat layer-two network on a 10/100 switch is acceptable for device management purposes, so no further explanation is needed.
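The hosts attached to the storage network must agree with whatever frame size the storage switches carry. As a sketch only: on a Red Hat-style system of this era, the storage interface configuration file could carry an MTU line like the following. The interface name (eth1), the address, and the 9000-byte payload MTU are assumptions for illustration; the value chosen must fit within the switch's configured frame size.

```
# /etc/sysconfig/network-scripts/ifcfg-eth1 -- assumed storage interface
DEVICE=eth1
ONBOOT=yes
BOOTPROTO=static
IPADDR=192.168.1.1
NETMASK=255.255.255.0
MTU=9000
```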
Example A: Extreme Networks switch
Here are the configuration steps for an Extreme Networks Summit 400-48t 48-port Gigabit Ethernet switch.
First, connect to each switch using the serial console port with a straight serial cable (9600, 8-N-1, no flow control), logging in with the default user ID admin and no password. (Just press the Enter key at the password prompt.)

For all switches, follow these steps:

unconfig switch all -- Wipes any existing configuration, if required.
configure vlan mgmt ipaddress 192.168.2.XXX/24 -- Sets the management IP address.
configure snmp sysname gigbXXX.cluster.com -- Sets the switch name.
configure sntp-client primary server 192.168.2.XXX -- Sets the NTP server to the management server.
configure sntp-client update-interval 3600 -- Sets time synchronization to hourly.
configure timezone 0 -- Sets the time zone.
enable sntp-client -- Turns on NTP.
configure ports 1-4 preferred-medium copper -- Changes the default preferred medium from fiber to copper on ports 1-4, if required.

Now, to configure jumbo frames on the storage network switches, follow these steps:

create vlan jumbo -- Creates the jumbo frames VLAN.
configure "mgmt" delete ports 1-48 -- Removes ports from the mgmt VLAN.
configure "jumbo" add ports 1-48 -- Adds ports to the jumbo VLAN.
configure jumbo-frame size 9216 -- Sets the maximum transmission unit (MTU) size.
enable jumbo-frame ports 1-48 -- Turns on jumbo frame support.

To enable trunking on a 2-port link, use enable sharing 47 grouping 47-48 (group ports 47 and 48, with 47 as the primary).

To complete the configuration, enter the following:

save configuration primary -- Writes the switch configuration to flash in order to survive reboots.
use configuration primary -- Selects the saved primary configuration for use at the next boot.
Example B: Force 10 Networks switch
Here are the configuration steps for a Force 10 Networks e600 multi-blade Gigabit Ethernet switch (with two 48-port blades) for routed networks where a central 48-port switch is not big enough.
Configure the chassis, line cards, and ports for an initial layer two configuration by doing the following:

enable -- Enters super-user mode; no password is required by default.
chassis chassis-mode TeraScale -- Initializes the switch to TeraScale mode.
enable -- Re-enters super-user mode.
configure -- Enters configuration mode. The prompt looks like Force10(conf)#.
Interface Range GigabitEthernet 0/0 - 47 -- Configures line card 0, ports 0 through 47. The prompt looks like Force10(conf-if-range-ge0/1-47)#.
mtu 9252 -- Sets jumbo frames, if required.
no shutdown -- Allows the ports to activate.
exit -- Goes back to configuration mode.

Configure the line cards and ports for layer 3 (VLAN routing) by doing the following:

enable -- Enters super-user mode.
int port channel 1 -- Configures port channel 1.
channel-member gig 0/46-47 -- Adds line card 0, ports 46 and 47, to the port channel.
no shutdown -- Allows the port channel to activate; this option overrides port configuration for inactive/active ports.
ip add 192.168.x.x/24 -- Sets the IP address for the port channel; this is the gateway for your subnet.
mtu 9252 -- Sets jumbo frames, if they are required.

Now, turn on the DHCP helper to forward DHCP broadcasts across subnet boundaries by doing the following:

int range po 1-X -- Applies the configuration to all the port channels you have configured.
ip helper 192.168.0.253 -- Forwards DHCP to your management server IP address.

Next, configure the switch for remote management (using telnet or SSH) by doing the following:

interface managementethernet 0 -- Configures the management port from the configure prompt.
ip add 192.168.2.x/24 -- Sets an IP address on the device management network; connect the admin port to the device management switch.

Finally, save the switch configuration by entering write mem.
After the switch configuration is complete, you can run a few sanity checks on your configuration. Plug in a device, such as a laptop, at various points on the network to check connectivity. Most switches have the capability to export their configuration. Consider making a backup copy of your running switch configuration once you have the network set up correctly.
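Once the hosts file exists, the connectivity check can also be mechanized from the management node. The sketch below pings every named host in a hosts file and reports failures; the overridable PING variable is an assumption of this sketch (so the probe command can be swapped out, for example for fping), not part of the article's procedure:

```shell
# check_hosts FILE -- try to reach every named host in a hosts file and
# report failures. Sketch only: PING is overridable so the probe command
# can be substituted for testing or for a different tool.
PING=${PING:-"ping -c 1 -w 2"}

check_hosts() {
    fail=0
    # Skip blank lines and comments; use the first hostname on each line.
    while read -r addr name _; do
        case $addr in ''|\#*) continue ;; esac
        if $PING "$name" >/dev/null 2>&1; then
            echo "ok   $name"
        else
            echo "FAIL $name ($addr)"
            fail=1
        fi
    done < "$1"
    return "$fail"
}
```

Typical use from the management node would be `check_hosts /etc/hosts`, repeated with the laptop plugged into different switches.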
The two example switches are described because they are working, 100-percent non-blocking, high-performance Gigabit Ethernet switches. Cisco Systems switches do not provide 100% non-blocking throughput, but can be used nonetheless.
Terminal servers play an important role in large cluster installations that use versions of CSM earlier than CSM 1.4. Clusters using the early versions relied on terminal servers to gather MAC addresses for installation. With the compatibility of CSM and system UUIDs, terminal servers are not as important for the installation of a more modern IBM cluster. However, if you have slightly older hardware or software in a large cluster, terminal servers are still vital during system setup. Ensuring the correct setup of the terminal server itself can save a great deal of time later in the installation process. In addition to collecting MAC addresses, terminal servers can also be used to view consoles from a single point, from POST on into the operating system.
Ensure that the terminal server baud speed for each port matches that of the connecting computer. Most computers are set to a default of 9600 baud, so this might not be an issue. Also ensure the connection settings and flow control between the terminal server and each connecting system are the same. If the terminal server expects an authenticated connection, set this up in CSM or turn off authentication altogether.
Here is an example configuration for the MRV InReach LX series terminal server (see Resources for more information about this device). Configure the MRV card by doing the following:
enable -- Enters super-user mode; the default password is system. You see a configuration screen the first time the device is configured. Otherwise, enter setup to get to the same screen.
config -- Enters configuration mode.
port async 1 48 -- Configures ports 1 through 48.
no authentication outbound -- Turns off internal authentication.
no authentication inbound -- Turns off external authentication.
no autobaud -- Fixes the baud rate.
access remote -- Allows remote connection.
flowcontrol cts -- Sets hardware flow control to CTS, which is the default on most IBM computers.
exit -- Goes back to configuration mode.
exit -- Goes back to default super-user mode.
save config flash -- Saves the configuration and makes it persistent across reboots.

After this initial configuration, you should have little else to do. Again, make sure the settings you made here match the settings on the connecting computers. You should now be able to telnet to your terminal servers in order to manage them in the future. As with the Ethernet switches, you can view the running configuration in order to do some sanity checking of the configuration on the terminal servers, if required. For example, the command show port async all char returns detailed information about each port on the terminal server.
Firmware updates and setting BMC addresses
If it is appropriate, check and update the firmware across your entire cluster, including the system BIOS, the BMC firmware, and network adapter (for example, Broadcom) firmware.
You can obtain IBM system updates on the IBM support Web site, and vendor-specific hardware updates are usually available directly from the vendors' Web sites (see Resources).
Updating firmware on IBM systems
Note: The following method of firmware update might not be supported in your area or for your hardware. You are advised to check with your local IBM representative before proceeding. This information is offered for example purposes only.
CSM code for remotely flashing firmware is still under development. Currently, if you need to flash many computers for BIOS, BMC, or other firmware updates, you are presented with a large problem. It is not reasonable to flash a large cluster with current methods, which involve writing a floppy disk or CD image and attending to each computer individually; an alternative is required. If you have no hardware power control (no BMC IP address is set), start by flashing the BMC firmware, which enables you to set the IP address at the same time. You only need to press all the power buttons once. For other firmware flashes, you can remotely power the systems on and off.
The following example is for IBM Systems 325 or 326 AMD processor-based systems. However, only small alterations are required to apply it to System x computers. The idea is to take a default firmware update image and modify it so that you can use it as a PXE boot image. Then you can boot a system over the network and have it unpack and flash the relevant firmware. Once the system is set to PXE boot, you only need to turn it on for the flash to take place.
A computer on the network running DHCP and TFTP servers is required. A CSM management node installed and running with CSM is a suitable candidate. However, if there are currently no installed computers on the network, use a laptop running Linux connected to the network. Make sure the PXE server is on the correct part of the network (in the same subnet), or that your switches are forwarding DHCP requests to the correct server across subnet boundaries. Then, complete the following steps:
1. Configure the DHCP server with an entry like the following:

   ddns-update-style ad-hoc;
   subnet 192.168.0.0 netmask 255.255.255.0 {
       range 192.168.0.2 192.168.0.254;
       filename "/pxelinux.0";
       next-server 192.168.0.1;
   }

2. Create the TFTP root directory, /tftpboot/. Install syslinux, which is provided as an RPM package for both Suse and Red Hat Linux.

3. Copy the memdisk and pxelinux.0 files installed with the syslinux package into /tftpboot/.

4. Create the directories /tftpboot/pxelinux.cfg/ to hold the configuration files and /tftpboot/firmware/ to hold the firmware images.

5. Create a PXE boot configuration file, such as /tftpboot/pxelinux.cfg/default, like the following:

   serial 0 9600
   default local
   #default bmc
   #default bios
   #default broadcom

   label local
       localboot 0
   label bmc
       kernel memdisk
       append initrd=firmware/bmc.img
   label bios
       kernel memdisk
       append initrd=firmware/bios.img
   label broadcom
       kernel memdisk
       append initrd=firmware/broadcom.img
For reference, when a computer receives a DHCP address during PXE, the configuration files in /tftpboot/pxelinux.cfg are searched in a specific order, with the first file found being the one used for the boot configuration of the requesting computer. The search order is determined by converting the requesting DHCP address into 8 hexadecimal digits and searching for the first matching filename in the configuration directory, widening one subnet at a time by removing a digit right-to-left on each pass of the search.
As an example, consider a client computer getting the address 192.168.0.2 from the server during PXE boot. The first file searched for is the hexadecimal version of this IP address, /tftpboot/pxelinux.cfg/C0A80002. If this configuration file is not present, the next searched for is C0A8000, and so on. If no matches are found, the file named default is used. Therefore, putting the above PXE configuration in a file named default works for all computers, regardless of your DHCP configuration. However, for the example, writing the configuration to C0A800 (the 192.168.0.0/24 subnet) reduces the amount of searching.
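The search order is easy to reproduce in a few lines of shell, which is handy when deciding what to name a per-subnet configuration file. This is a sketch that assumes a dotted-quad IPv4 address as input:

```shell
# pxe_names IP -- print the pxelinux.cfg file names searched for a client
# IP address, most specific first, ending with "default". Mirrors the
# right-to-left digit-removal search order described above.
pxe_names() {
    # Convert the dotted quad into 8 upper-case hex digits, e.g. C0A80002.
    hex=$(printf '%02X%02X%02X%02X' $(echo "$1" | tr '.' ' '))
    while [ -n "$hex" ]; do
        echo "$hex"
        hex=${hex%?}    # drop one digit from the right on each pass
    done
    echo "default"
}
```

For example, `pxe_names 192.168.0.2` prints C0A80002 first and default last.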
Updating the Baseboard Management Controller (BMC) firmware and setting an IP address
Note: The procedure described here is for the AMD-based cluster nodes. However, you can use a similar procedure for the Intel-based nodes. Intel BMC updates are provided with the bmc_cfg.exe program (instead of lancfg.exe) to set the BMC address. You can drive this using the terminal servers with a script such as the sample script available under Downloads. Also, for Intel-based computers, you can usually set the BMC address in the system BIOS.
After you set the BMC address on a node, you have remote power control, which makes life easier when configuring the cluster. However, this method of updating the BMC relies on network boot, so if your computers are not set to PXE boot in the BIOS yet, you can update the BIOS first and return to the BMC update afterwards.
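With the BMC addresses set, CSM's own remote power commands are the usual interface, but any generic IPMI client can exercise the same control. The sketch below assumes the ipmitool utility with IPMI-over-LAN, the _d naming from Table 1, and placeholder credentials (USERID/PASSW0RD); all three are assumptions to adjust for a real cluster:

```shell
# power_all ACTION NODE... -- issue a chassis power action (on, off,
# status) to each node's BMC. The _d hostname suffix follows the Table 1
# naming scheme; the USERID/PASSW0RD credentials are placeholders.
# IPMI is overridable so the command can be substituted for a dry run.
IPMI=${IPMI:-ipmitool}

power_all() {
    action=$1; shift
    for node in "$@"; do
        # Per Table 1, the BMC for node001 is reachable as node001_d.
        $IPMI -I lan -H "${node}_d" -U USERID -P PASSW0RD \
            chassis power "$action"
    done
}
```

For example, `power_all on node001 node002` would power on two nodes, and `power_all status node001` would query one.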
Download the latest BMC firmware update DOS image and follow the instructions to create a floppy disk boot image. This image contains a program called lancfg.exe that allows you to set an IP address on the BMC. The usual process is to insert the floppy disk and boot from it in order to apply the update. However, first create a PXE boot image from the floppy disk on your PXE boot server computer with the following command:
dd if=/dev/fd0 of=/tftpboot/firmware/bmc.img bs=1024
Now you can edit the DOS image as needed. For the BMC update, no modifications are required to the base image itself, except to copy a DOS power-off program into the image. At a high level, you power on the computer, it PXE boots to flash the BMC firmware, and it leaves the computer running in the DOS image. Using a script, you can then set the BMC address through the terminal server and power the computer off. In this way, you know all the computers powered on are either flashing their BMC firmware or waiting for the IP address to be set; on any computers that are powered off, this process is complete. Download a suitable DOS-based power-off command, such as the atxoff.com utility. Once you have a power-off utility, copy it to the image as follows:
mount -o loop /tftpboot/firmware/bmc.img /mnt
cp /path/to/poweroff.exe /mnt
umount /mnt
Now ensure your PXE boot configuration file serves the correct image by changing the appropriate comment to set the default to bmc in the /tftpboot/pxelinux.cfg/default file previously created. After testing on a single node, boot all computers from the powered-off state so the flash takes place across all the required nodes. When all the nodes have booted the PXE image, change the configuration back to local boot in order to minimize the chance of accidentally flashing a computer if one were to be rebooted.
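Because the active label in /tftpboot/pxelinux.cfg/default changes several times during these updates, a small helper makes the switch between flashing and local boot less error-prone. This is a hypothetical convenience, assuming the file keeps the commented alternative default lines shown earlier:

```shell
# set_pxe_default LABEL FILE -- make LABEL (local, bmc, bios, broadcom)
# the active "default" entry, commenting out whichever label was active
# before. Assumes the file keeps the commented "#default ..." lines.
set_pxe_default() {
    label=$1 file=$2
    sed -e 's/^default /#default /' \
        -e "s/^#default $label\$/default $label/" "$file" > "$file.tmp" &&
    mv "$file.tmp" "$file"
}
```

For example, run `set_pxe_default bmc /tftpboot/pxelinux.cfg/default` before a round of BMC flashing and `set_pxe_default local /tftpboot/pxelinux.cfg/default` afterwards.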
You can now call the lancfg program and operate it through the terminal server (assuming the BIOS settings export the console over serial with the same settings as configured on the terminal server). The BMC IP address can be set using lancfg in a Perl script, such as the unsupported sample script available under Downloads. For example, to set the BMC address of all computers in a node group called Rack1 with gateway address 192.168.10.254 and netmask 255.255.255.0, run the following from the PXE boot server computer:
perl set-bmc-address.pl -N Rack1 -g 192.168.10.254 -m 255.255.255.0
You can customize this script based on your setup. When the script completes, each computer turns off automatically after having its BMC IP address set, using the DOS power-off program you copied to the boot image.
Updating the BIOS

If you have the default BIOS settings applied on all computers, you can do this step before the BMC update above. Flashing the BIOS is a two-stage process; performed without changes, it results in the factory default settings being applied. Therefore, you need to flash and also apply a new appropriate configuration with any required changes for your cluster. Download the latest BIOS update DOS image, and follow the instructions to create a floppy disk boot image.
You need a saved configuration for the appropriate BIOS level and settings you require. To create one, manually update one computer. Boot a computer with the floppy disk image (use a USB floppy drive if the computer does not have one). Apply the update according to the readme file, and wait for it to finish as normal. Reboot the computer and make all the changes to settings you require in the BIOS. Options to consider are turning Numlock off (if you don't have a number keypad on your keyboard), enabling the serial port, setting console redirection through the serial port with the appropriate settings configured to match the terminal servers, and setting the boot order to ensure Network appears before Hard Disk. When the changes are complete, save them, and turn off the computer.
On another computer (such as the one you have set up for PXE booting), mount the floppy disk containing the BIOS update. Rename the autoexec.bat file to keep it as a backup on the floppy for later; this prevents the system from flashing the BIOS if this disk were booted again. Insert the disk back into the computer where the updated and configured BIOS options are set, and boot from your modified floppy disk image.
When the DOS prompt appears, ensure your current working directory is on the a: drive. There is a program on the floppy called cmosram.exe that allows you to save the configuration of the BIOS to disk and load it again later. Save the current BIOS settings to a file such as cmos.dat on the floppy disk, and add a line like the following to the autoexec.bat file so the saved settings are applied after the flash:

cmosram /load:cmos.dat

Once the settings are loaded from the autoexec.bat file, you are ready to apply the update. As a sanity check, test the floppy image in a computer to check that the flash happens automatically and the correct settings are applied. You will also notice that the system remains on after flashing the BIOS. You can get the system to turn off automatically after the BIOS update in a similar way as described in the BMC update section, by using a DOS power-off utility and calling it from the autoexec.bat file.
Once you are satisfied with your modified BIOS update image, you can create a PXE boot image from the floppy disk with the following command:
dd if=/dev/fd0 of=/tftpboot/firmware/bios.img bs=1024
Change the default PXE boot configuration file, /tftpboot/pxelinux.cfg/default, so it serves the BIOS image when the systems PXE boot. Now, by powering on a system connected to the network, it automatically flashes the BIOS without any user input, applies the correct BIOS settings, and powers off again. When all updates are complete, return the default PXE boot configuration to boot from local disk to avoid any accidents if a computer were to make a PXE request.
Updating the Broadcom firmware
After updating the BMC firmware and BIOS, updating the Broadcom firmware is a simple repeat of the same ideas. Follow these steps:
1. Create a PXE boot image from the Broadcom firmware update floppy disk:

   dd if=/dev/fd0 of=/tftpboot/firmware/broadcom.img bs=1024

2. Mount the image so it can be edited:

   mount -o loop /tftpboot/firmware/broadcom.img /mnt

3. Edit the autoexec.bat file to automatically update the Broadcom firmware in unattended mode, and turn off the computer when it finishes. For example, for an IBM Systems 326, machine type 8848, the autoexec.bat might look like the following:

   @echo off
   call sramdrv.bat
   echo.
   echo Extracting files...
   call a:\bin.exe -d -o %ramdrv%\update >NULL
   copy a:\command.com %ramdrv%\command.com
   copy a:\atxoff.com %ramdrv%\atxoff.com
   set COMSPEC=%ramdrv%\command.com
   if exist NULL del NULL
   %ramdrv%
   cd \update
   call update.bat 8848
   cd \
   atxoff

4. Update /tftpboot/pxelinux.cfg/default to ensure the computer can boot the firmware update for the Broadcom adapter.

After you have updated the firmware across the cluster, you can continue hardware setup with the knowledge that fewer problems will arise later as a result of having the latest firmware code. However, you can repeat the process at any time should you need another update. Also, the principles behind this type of firmware update can be applied to any other type of firmware you might need to flash, as long as you can obtain a firmware update image to PXE boot.
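As a final illustration of that idea, the staging step for any new firmware image can be wrapped in a tiny helper. This is a hypothetical convenience, not part of the article's procedure; TFTPROOT is overridable purely so the sketch can be exercised outside /tftpboot:

```shell
# stage_firmware IMAGE NAME -- copy a floppy image into the TFTP tree and
# print a reminder about the matching pxelinux label. Hypothetical helper;
# TFTPROOT is overridable so the sketch can be tried in a scratch directory.
TFTPROOT=${TFTPROOT:-/tftpboot}

stage_firmware() {
    img=$1 name=$2
    # dd works on plain files as well as /dev/fd0, matching the earlier steps.
    dd if="$img" of="$TFTPROOT/firmware/$name.img" bs=1024 2>/dev/null &&
    echo "staged $name.img; add a pxelinux label with initrd=firmware/$name.img"
}
```

For example, `stage_firmware /dev/fd0 raid` would stage a hypothetical RAID controller update alongside the bmc, bios, and broadcom images.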
That concludes the instructions for hardware configuration for a large Linux cluster. Subsequent articles in the Installing a large Linux cluster series contain the steps to set up the software side of the cluster, including management server configuration and node installation procedures in the next article.