Large Linux cluster storage backend
Summary: Create a working Linux® cluster from many separate pieces of hardware and software, including System x® and IBM TotalStorage® systems. Part 3 provides the first half of the instructions you need to set up the storage backend, including details on storage architecture, needed hardware, and the Storage Area Network.
Date: 04 May 2007
Level: Advanced
This is the third in a series of articles that cover the installation and setup of a large Linux computer cluster. The purpose of the series is to bring together in one place up-to-date information from various sources in the public domain about the process required to create a working Linux cluster from many separate pieces of hardware and software. These articles are not intended to provide the basis for the complete design of a new large Linux cluster; refer to the relevant reference materials and Redbooks® mentioned throughout for general architecture pointers.
This series is written for systems architects and systems engineers planning and implementing a Linux cluster using the IBM eServer Cluster 1350 framework (see Resources for more information about the framework). Some parts might also be relevant to cluster administrators for educational purposes and during normal cluster operation. Each part of the series refers to the same example installation.
Part 1 of the series provides detailed instructions for setting up the hardware for the cluster. Part 2 takes you through the next steps after hardware configuration: software installation using the IBM systems management software, Cluster Systems Management (CSM), and node installation.
This third part is the first of two articles that describe the storage backend of the cluster. Together, these two articles cover the storage hardware configuration and the installation and configuration of the IBM shared file system, General Parallel File System (GPFS). This third part takes you through the architecture of the storage system, hardware preparation, and details about setting up a Storage Area Network. The fourth and final part of the series provides details about CSM specifics related to the storage backend of our example cluster, notably performing node installation for the storage system, and GPFS cluster configuration.
Before continuing, you will benefit from reviewing the General cluster architecture section in Part 1 of this series.
Figure 1 shows an overview of the storage configuration used for the example cluster described in this series. The configuration is explained in more detail throughout this article. This setup is based on GPFS version 2.3. It includes one large GPFS cluster split into two logical halves with a single large file system. The example design provides resilience in case of a disaster: if one half of the storage backend is lost, the other can continue operation.
Figure 1 shows four storage servers that manage the storage provided by two disk subsystems. In the top right-hand corner, you can see a tie-breaker server. The network connections and fiber channel connections are shown for reference. All are described in further detail in the following sections. The rest of the cluster is shown as a cloud and will not be addressed in this article. For more details about the rest of the cluster, see Part 1 and Part 2 of this series.
The majority of the nodes within this GPFS cluster are running Red Hat Enterprise Linux 3. The example uses a server/client architecture, where a small subset of servers has visibility of the storage using fiber channel. They act as network shared disk (NSD) servers to the rest of the cluster. This means that most of the members of the GPFS cluster access the storage over IP using the NSD servers. There are four NSD nodes (also known here as storage nodes) in total: two in each logical half of the GPFS cluster. These are grouped into pairs, where each pair manages one of the storage subsystems.
As each half of the cluster contains exactly the same number of nodes, should one half be lost, quorum becomes an issue. With GPFS, for the file system to remain available, a quorum of nodes needs to be available. Quorum is defined as quorum = (number of quorum nodes / 2) + 1. For example, with five quorum nodes, at least three must remain available for the file system to stay accessible.
In a case such as this configuration, where the cluster is made of two identical halves, the GPFS file system becomes unavailable if either half is lost. To avoid this situation, the system employs a tie-breaker node. This node is physically located away from the main cluster. This means that should either half become unavailable, the other half can continue accessing the GPFS file system. This is also made possible by the use of three failure groups, which are further explained under Data replication. This means two copies of the data are available: one in each half of the cluster.
As illustrated in Figure 1, each node is connected to two networks. The first of these is used for compute traffic and general cluster communication. The second network is dedicated to GPFS and is used for storage access over IP for those nodes that do not have a direct view of the Storage Area Network (SAN) storage system. This second network uses jumbo frames for performance. See the GPFS network tuning section in Part 4 of the series for more details on the storage network.
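As a rough illustration (not part of the original configuration steps), jumbo frames on Red Hat systems are typically enabled by setting the MTU in the interface configuration file for the storage network; the interface name and addresses below are placeholders only:

    # Hypothetical /etc/sysconfig/network-scripts/ifcfg-eth1 for the GPFS network
    # (eth1, the IP address, and the netmask are assumptions for illustration)
    DEVICE=eth1
    BOOTPROTO=static
    IPADDR=10.1.1.10
    NETMASK=255.255.255.0
    ONBOOT=yes
    MTU=9000    # jumbo frames; the storage network switches must also allow this MTU

The change takes effect the next time the interface is brought up, for example with ifdown eth1 followed by ifup eth1.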
The storage backend of this solution comprises two IBM TotalStorage DS4500 (formerly FAStT 900) disk subsystems, each with a number of fully populated EXP710 expansion disk drawers attached. Each DS4500 is configured into RAID 5 4+P arrays plus some hot spare disks.
Each DS4500 is owned by a pair of storage servers. The architecture splits the 4+P arrays between the two servers so that each server is the primary server for half of the arrays and the secondary server for the other half. This way, should one of the servers fail, the other server can take over as primary for the disks from the failed server.
This example has GPFS replicate the data and metadata on the GPFS file system. The storage is split into three failure groups. A failure group is a set of logical disks that share a common point of failure. (As seen from the operating system, here a disk corresponds to one LUN, which is one disk array on the DS4500.) The failure groups in this system are made up of the following: the LUNs presented by the first DS4500, the LUNs presented by the second DS4500, and a third, small failure group used only to maintain disk quorum.
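To make the failure-group layout concrete, a GPFS 2.3 disk descriptor file (the input to mmcrnsd) might look roughly like the following sketch; the device names, server host names, and the use of a descOnly tie-breaker disk are assumptions for illustration, not the actual values from the example cluster:

    # Sketch of an mmcrnsd descriptor file, one line per LUN:
    # DiskName:PrimaryServer:BackupServer:DiskUsage:FailureGroup
    /dev/sdb:stor01:stor02:dataAndMetadata:1
    /dev/sdc:stor02:stor01:dataAndMetadata:1
    /dev/sdb:stor03:stor04:dataAndMetadata:2
    /dev/sdc:stor04:stor03:dataAndMetadata:2
    /dev/sda:tiebreak1::descOnly:3

Here failure group 1 corresponds to the first DS4500, failure group 2 to the second, and failure group 3 to a tie-breaker disk whose only purpose is to hold a copy of the file system descriptor.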
When you create the GPFS file system, specify the number of copies of data and metadata as two. With the failure groups defined above, each half of the cluster then contains one copy of the file system. The third failure group is required to solve disk quorum issues so that, should either half of the storage go offline, disk quorum is satisfied and the file system remains accessible.
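The file system itself is created in Part 4; as a hedged sketch only, specifying two copies of data and metadata with GPFS 2.3 looks something like the following, where the mount point, device name, and descriptor file are placeholder names:

    # Sketch: create a GPFS file system with two replicas of data and metadata
    # (/gpfs, gpfs0, and disks.desc are hypothetical names)
    mmcrfs /gpfs gpfs0 -F disks.desc -m 2 -M 2 -r 2 -R 2 -A yes

The -m and -r flags set the default number of metadata and data replicas, and -M and -R set the maximums, so the replication described above is enabled from the start.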
As mentioned, this cluster contains two IBM TotalStorage DS4500 devices, which form the storage backend of the solution. You can find more information about this hardware under Resources.
IBM couples each DS4500 system with IBM TotalStorage DS4000 EXP710 fiber channel (FC) storage expansion units. Each of these is a 14-bay, 2 Gbps rack-mountable FC enclosure. You can find more details about this hardware in the Resources section.
The following section covers in some detail the configuration of the DS4500 and EXP710 units within the example solution.
Note that you need to power the SAN system on and off in a specific order so that all storage is discovered correctly: power on the expansion drawers before the DS4500 controllers so that the controllers discover all of the attached disks, and power on the attached storage servers last. Power off in the opposite order.
Figure 2 shows the rear of a DS4500 unit. On the left-hand side are four mini-hub ports for host connectivity. In this article, these are referred to as slots 1 to 4, numbered from left to right, as shown in Figure 2. Slots 1 and 3 correspond to the top controller, which is controller A. Slots 2 and 4 correspond to the bottom controller, which is controller B. On the right-hand side are four mini-hub ports for expansion drawer (EXP710) connectivity.
Each DS4500 is cabled into two loops as shown in Figure 3.
Each EXP710 drawer must have a unique ID. These IDs are set using the panel on the back of each enclosure.
Configure IP addresses for DS4500 controllers
Set the IP address of each controller using the serial port at the back of each enclosure. You could use the HyperTerminal application on Windows® or minicom on Linux. The example uses the following settings:
Make the connection by sending a break (Ctrl-Break in HyperTerminal), then hitting the space bar to set the speed. Then, send another break and use the escape key to enter the shell. The default password is infiniti. Use the command netCfgShow to show the current IP settings of the controller, and use the command netCfgSet to set the desired IP address, subnet mask, and gateway.
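For reference, the flow on the controller shell amounts to the two commands below; the prompt shown and the interactive behavior of netCfgSet are assumptions based on typical DS4000 controller firmware:

    -> netCfgShow      # display the current IP address, subnet mask, and gateway
    -> netCfgSet       # prompts for the new values; apply them and repeat on the second controller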
Discover DS4500 from Storage Manager
After this point, the DS4500 is managed using the Storage Manager (SM) software. Use the latest version (9.1 or higher) with new hardware.
You can use Storage Manager to configure the subsystem, including creating arrays, logical drives, and storage partitions. You can also troubleshoot and perform management tasks, such as checking the status of the TotalStorage subsystem and updating the firmware of RAID controllers. See Resources for the latest version of Storage Manager for your hardware.
The SM client can be installed on a variety of operating systems. In the example described in this article, the SM client is installed on the management server. Discover the newly configured DS4500 from the SM client using the first button on the left, which has a wand on it. To perform operations on a DS4500 seen through this interface, double-click the computer name to open a new window.
General DS4500 controller configuration steps
First, rename the DS4500 by going to Storage Subsystem > Rename… and entering a new name. Next, go to Storage Subsystem > Set Controller Clock and check that the controller clocks are synchronized. Finally, set the system password by going to Storage Subsystem > Change > Password.
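The same housekeeping can also be scripted with the Storage Manager command-line client (SMcli); the following sketch uses placeholder controller addresses, subsystem name, and password, and the exact script syntax should be verified against the SMcli reference for your SM version:

    # Hypothetical SMcli script file (ds4500a.scr), run with:
    #   SMcli 192.168.0.10 192.168.0.11 -f ds4500a.scr
    set storageSubsystem userLabel="ds4500a";
    set storageSubsystem time;
    set storageSubsystem password="newpassword";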
Update firmware for DS4500 and EXP 710 drawers
To check system firmware levels from the Storage Manager, go to Advanced > Maintenance > Download > Firmware. The current levels are listed at the top of this window. You can download newer versions onto the subsystem from here, but be sure to use the correct firmware for the model and to upgrade levels in the order specified in any notes that come with the firmware code. The firmware for the disks and the ESMs can also be checked from the Download menu.
Manual configuration versus scripted configuration
The following sections detail the manual setup of a DS4500. Follow these steps for the initial configuration of the first DS4500 in this solution, then save its configuration. Saving the configuration produces a script that you can then use to reproduce the configuration on the same DS4500 should it be reset or replaced with new hardware.
You can replicate this script and edit it for use on the other DS4500 to allow easy and accurate reproduction of a similar setup. You need to change the fields containing the name for the DS4500, disk locations, array names, and mapping details for hosts (that is, the World Wide Port Names [WWPNs]). Note that these scripts leave the Access LUN in the host group definition. This is removed manually on each DS4500.
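If you prefer to capture the configuration from the command line rather than the GUI, a hedged SMcli example follows; the controller addresses and file name are placeholders:

    # Hypothetical: save the full configuration of one DS4500 as a replayable script
    SMcli 192.168.0.10 192.168.0.11 -c 'save storageSubsystem configuration file="ds4500a-config.scr" allConfig;'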
This example leaves a number of disks on each DS4500 as hot spares. These are added by right-clicking the disk to be assigned as a hot spare, choosing the manual option, and entering the password for the DS4500 (set in the General DS4500 controller configuration section).
Next, create the RAID 5 4+P arrays, naming each one following the convention <ds4500 name>_array<number>. Under Advanced Parameters, choose Customize Settings. You see a green cylinder with a clock next to it while the array is being created. You can check your progress by right-clicking the logical drive name and choosing Properties.
Note that the steps beyond this point require that you have configured the SAN switches and installed and run the storage servers with the host bus adapters (HBAs) configured so that the WWPNs of the HBAs are seen at the SAN switches and, therefore, by the DS4500. See the SAN infrastructure and the HBA configuration sections in Part 4 of the series for details about these steps.
Storage partitioning and disk mapping
Once LUNs are created, they need to be assigned to hosts. In this example, use storage partitioning. Define storage partitions by creating a logical-drive-to-LUN mapping. This grants a host or host group access to a particular logical drive. Perform the following steps in order when defining storage partitioning: you initially define the topology and then the actual storage partition.
As already described, in this setup there is only one group per DS4500, containing the two storage nodes between which all disks on that DS4500 will be twin-tailed. All LUNs are assigned to this group, with the exception of the Access LUN, which must not be assigned to this group. The Access LUN is used for in-band management of the DS4500. However, it is not supported by Linux and must be removed from any node groups created.
Create a new host group by right-clicking the Default Group section and selecting Define New Host Group. Enter the host group name. Create a new host by right-clicking the host group you created and selecting Define Host Port. In the pull-down menu, select the WWPN corresponding to the HBA to be added. Note that for the WWPN to appear in this menu, you must have configured and zoned the host correctly in the SAN. Storage Manager then sees the port under Show All Host Port Information. Choose Linux as the Host Type, and enter the Host port name in the final box.
Repeat this step so that each host has both ports defined. Next, create the storage partition by right-clicking the newly created host group and selecting Define Storage Partition. This opens the Storage Partitioning wizard. Click Next to start the wizard. Select the Host Group you just created, and click Next. Choose the LUNs you previously defined to include them here. Note that you must not include the Access LUN here. Click Finish to finalize this selection.
This section explains the steps to set up the SAN infrastructure in a cluster. The SAN switches used in the example configuration are IBM TotalStorage SAN Switch H16 switches (2005-H16). See Resources for more details about this hardware.
This section covers in some detail the steps to configure the SAN switches, referring specifically to commands and interfaces for H16 switches as examples.
Configure IP addresses and hostnames for H16 SAN switches
To perform the initial configuration of the IP addresses on the H16 SAN switches, connect using the serial cable that comes with the switch (black ends, not null modem) into the port at the back of the computer. Use these connection settings:
Use the default login details: username admin and password password. Change the hostname and IP address using the command ipAddrSet. Verify the settings using the command ipAddrShow.
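A serial console session therefore amounts to the following; the prompt shown and the interactive prompts of ipAddrSet are illustrative only:

    switch:admin> ipaddrshow     # display the current IP address, subnet mask, and gateway
    switch:admin> ipaddrset      # prompts for the new values
    switch:admin> ipaddrshow     # confirm the new settings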
Once the IP addresses are configured, you can manage the SAN switches with the Web interface. Connect to a SAN switch using its IP address from a browser with a Java™ plugin. To access the Admin interface, click the Admin button and enter the username and password. At this point, you can enter the new name of the switch into the box indicated and apply the changes.
The domain ID must be unique for every switch in a fabric. In this example, the switches are contained in their own fabric, but the IDs are changed in case of future merges. Note that the switch needs to be disabled before you can change the domain ID.
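The domain ID can also be changed from the switch CLI; this is a sketch assuming a Brocade-based H16 running a 4.x firmware level:

    switch:admin> switchdisable   # the switch must be disabled before the change
    switch:admin> configure       # interactive; set the Domain value under the Fabric parameters section
    switch:admin> switchenable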
For future reference, once the network can access the switch, you can change the IP address of the SAN switch using the Admin interface from the Network Config tab. This is an alternative to using a serial connection.
The example cluster uses the zoning rules described below: each zone contains a single HBA and a single DS4500 controller port, and hosts are isolated from one another at the SAN level.
You set the zoning of the SAN switches using the Web interface on each switch, as described in the previous section. The zoning page can be reached using the far right button in the group in the bottom left-hand corner of the window. To simplify the management of zoning, assign aliases to each WWPN to identify the device attached to the port.
Here is how to create the aliases and assign them to hosts. First, add an alias by clicking Create and entering the name of the alias. Then, choose a WWPN to assign to this newly created alias. You see three levels of detail at each port, as follows:
Add the second level to the alias by choosing the second level and selecting Add member.
Once you create aliases, the next step is to create zones by combining groups of aliases. In this configuration, you have used zones where each HBA on each host sees only one controller on the relevant DS4500. As explained in the previous section, in this example setup, each DS4500 presents its disks to only two hosts. Each host uses a different connection to the controller to spread the load and maximize throughput. This type of zoning is known as single HBA zoning. All hosts are isolated from each other at the SAN level. This zoning removes unnecessary PLOGI activity from host to host, as well as removing the risk of problems caused by a faulty HBA affecting others. As a result, the management of the switch becomes safer, because modifying each individual zone does not affect the other hosts. When you add a new host, create new zones as well, instead of adding the host to an existing zone.
The final step is to add the zones you defined into a configuration that can be saved and then activated. It is useful to produce a switch report, which you can do by clicking the Admin button and then choosing Switch Report. This report contains, in HTML format, all the information you need to manually recreate the configuration of the switch.
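The equivalent zoning can also be built from the switch CLI, which is convenient for scripting; the alias names, zone names, and WWPNs below are placeholders for illustration only:

    switch:admin> alicreate "stor01_hba0", "10:00:00:00:c9:aa:bb:cc"
    switch:admin> alicreate "ds4500a_ctlA_s1", "20:04:00:a0:b8:dd:ee:ff"
    switch:admin> zonecreate "z_stor01_hba0__ds4500a_ctlA", "stor01_hba0; ds4500a_ctlA_s1"
    switch:admin> cfgcreate "cluster_cfg", "z_stor01_hba0__ds4500a_ctlA"
    switch:admin> cfgsave                    # commit the zoning database
    switch:admin> cfgenable "cluster_cfg"    # activate the configuration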
Saving configuration to another server
Once the SAN switch is configured, the configuration can be uploaded to another server using FTP. You can use this saved configuration again if necessary to automatically reconfigure the switch. Here are the steps to save the configuration file to a server:

1. Log in to the switch (username admin, password password) using telnet.
2. Use the configupload command, providing the FTP server details when prompted.

You can make firmware updates using a download from an FTP server. Here are the steps to follow:

1. Use the firmwareshow command to check the current firmware level.
2. Use the firmwaredownload command to start the download process.
3. When prompted for the file, give the path to release.plist (for example, /pub/v4.4.0b/release.plist). Do not be confused at this point that the release.plist file does not appear to exist.
4. The switch downloads and installs the software and then reboots. You can check the status of the upgrade with the firmwaredownloadstatus command.
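Pulling the telnet steps above together, the command sequence looks roughly like this; the prompts are interactive at this firmware level, so the FTP host, user, and path details are entered when asked for:

    switch:admin> configupload              # save the current configuration to an FTP server
    switch:admin> firmwareshow              # check the current firmware level
    switch:admin> firmwaredownload          # download and install new firmware (path to release.plist)
    switch:admin> firmwaredownloadstatus    # check progress after the switch reboots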
This is only part of setting up the backend of your example cluster. The next steps involve using CSM to complete setup of the storage backend, which includes performing node installation for the storage system and GPFS cluster configuration. The fourth and final part of this series covers those processes.
Graham White is a systems management specialist in the Linux Integration Centre within Emerging Technology Services at the IBM Hursley Park office in the United Kingdom. He is a Red Hat Certified Engineer, and he specializes in a wide range of open-source, open-standard, and IBM technologies. Graham's areas of expertise include LAMP, Linux, security, clustering, and all IBM Systems hardware platforms. He received a BSc with honors in Computer Science with Management Science from Exeter University in 2000.