# DFM_Service_Node_Hierarchy_support
## Table of Contents
- Overview
- MN vs. SN ownership and responsibilities
- xCAT DB attributes
- Installing RPMs for DFM on the SN
- Conserver Setting
- Setting servicenode, xcatmaster and conserver for DFM HW Ctrl Cmds
- Configuring xCAT SN Hierarchy Ethernet Adapters (Power 775 DFM Only)
- DFM configuration for hardware server connections
- DFM rpower on flow during OS provisioning
- CEC down/up policy
- Firmware update sequence
## Overview
This document describes hierarchy support for Direct FSP Management (DFM).
Previously, all Direct FSP Management had to be performed from the Management Node (MN). With this new support, administrators can configure a Service Node (SN) as the DFM control point for the nodes it manages. This document describes the changes to xCAT required to support this configuration, as well as the configuration information that must be defined for it to work properly.
This support covers only the configuration of the xCAT DB and the Service Nodes that allows xCAT commands to automatically determine when to run on the SN and when to run on the EMS.
- Every node/nodegroup has explicit noderes.servicenode, noderes.xcatmaster, etc., entries. The user must ensure that each servicenode listed in one of those attributes is also added to the servicenode table for correct servicenode install/configuration (see the sketch after this list).
- The SN will need to have the Cluster Service Network Ethernet interfaces defined during OS installation. The xCAT DB entries needed and the postscripts that must run to configure these network interfaces are described below.
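A minimal sketch of the servicenode table entry, assuming a hypothetical service node sn1 and an arbitrary choice of services (setupnfs, setuptftp and setupconserver are real servicenode table columns; the node name and chosen services are assumptions):
```
# Add sn1 to the servicenode table by setting the services it should provide
chdef -t node -o sn1 setupnfs=1 setuptftp=1 setupconserver=1
# Verify the servicenode table entry
lsdef -t node -o sn1 -i setupnfs,setuptftp,setupconserver
```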
For a general description of hierarchy, it is worth reviewing the following:
[Hierarchical_Design], section New_servicenode_Table
For the flow of setting up a hierarchical cluster, refer to the following two docs:
- [Setting_Up_an_AIX_Hierarchical_Cluster]
- [Setting_Up_a_Linux_Hierarchical_Cluster]
## MN vs. SN ownership and responsibilities
To support DFM hierarchy, the roles and responsibilities of the MN and the SN with respect to DFM and OS image management must be clearly defined.
The SN cannot manage power for the CEC in which it resides, since it could never power itself back on after a power off. The MN will therefore be the DFM HCP (hardware control point) for the SN CEC.
The MN must also control all of the Frames, since the SNs cannot control the Frame in which they reside.
Here is a summary of the split in MN and SN ownership for DFM hierarchy:
MN and SN Functional Ownership

| function / hardware target    | MN | SN |
|-------------------------------|----|----|
| DFM Frames                    | X  |    |
| DFM SN CEC                    | X  |    |
| DFM SN LPAR                   | X  |    |
| OS image SN LPAR              | X  |    |
| DFM SN CEC non-SN LPARs       | X  |    |
| OS image SN CEC non-SN LPARs  |    | X  |
| DFM non-SN CECs               |    | X  |
| DFM non-SN CEC LPARs          |    | X  |
| OS image non-SN CEC LPARs     |    | X  |
Note: While DFM for the SN CEC non-SN LPARs is performed by the MN, these LPARs have their OS support provided by the SN. This split is done for scaling reasons: a very large cluster would have too many LPARs for the MN to serve OS images for all of them from a single place.
## xCAT DB attributes
This section discusses the xCAT DB attributes that determine the MN and SN hierarchy configuration and control.
The noderes.servicenode attribute overrides the default of the MN by naming a specific SN to be used instead. It can control areas like xdsh by associating an LPAR node definition with a noderes.servicenode setting, and DFM can be redirected to an SN by associating an FSP or BPA node definition with a specific SN via the same setting. This control allows the administrator to define specific associations depending on the capabilities being managed. In the example above, some CEC LPARs needed to have the OS served by the SN but the DFM controlled by the MN; using the noderes.servicenode settings, we can accomplish this mixed support.
- The LPAR node definitions for the SN CEC non-SN LPARs will set noderes.servicenode to this SN.
- The LPAR node definitions for the LPARs on all non-SN CECs will set noderes.servicenode to this SN.
- The SN CEC definition, which is the "parent" of the SN LPAR and all the other LPARs on the SN CEC, will NOT set noderes.servicenode and will therefore use the MN.
- All the non-SN CEC definitions will set noderes.servicenode to the SN.
- All Frame definitions will NOT set noderes.servicenode and will therefore use the MN.

A chdef sketch of these settings follows this list.
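A minimal sketch of these settings with chdef, assuming hypothetical names: service node sn1, SN CEC cec01, non-SN CEC cec02, and node groups snceclpars and cec02lpars:
```
# SN CEC non-SN LPARs: OS served by sn1 (their DFM stays with the MN via cec01)
chdef -t group -o snceclpars servicenode=sn1 xcatmaster=sn1
# LPARs on the non-SN CEC: OS served by sn1
chdef -t group -o cec02lpars servicenode=sn1 xcatmaster=sn1
# non-SN CEC: DFM redirected to sn1
chdef -t node -o cec02 servicenode=sn1 xcatmaster=sn1
# SN CEC (cec01) and the Frames: leave servicenode/xcatmaster unset, so the MN is used
```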
The xCAT hosts table allows customers to define any additional network interfaces on a host that are not configured during OS installation. We will use this capability to define the SN Cluster Service Network Ethernet interfaces. Adding the interface information for each SN to this table results in the creation of the hostname entries required for hostname resolution.
Here is an example of what would be done to define a Cluster Service Network Ethernet interface for an SN.
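A minimal sketch, assuming a hypothetical service node sn1 whose Cluster Service Network interface is eth1 at 10.1.2.10 (interface name, IP, and node name are placeholders):
```
# Record the extra interface in the hosts table (hosts.otherinterfaces)
chdef -t node -o sn1 otherinterfaces="sn1-eth1:10.1.2.10"
# Regenerate /etc/hosts so the new hostname resolves
makehosts sn1
```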
## Installing RPMs for DFM on the SN
This section discusses which RPMs are required for the SN to be able to perform hardware control, and how the required RPMs are distributed to the SN on AIX and Linux. Currently the DFM and hardware server RPMs are required.
The "[Setting_Up_an_AIX_Hierarchical_Cluster]" and "[Setting_Up_a_Linux_Hierarchical_Cluster]" documents will need to be updated to include the tasks of installing the DFM and hardware server RPMs on the service node. This information is currently missing from those sections.
Here is the excerpt from this document:
For most operations, the Power 775 is managed directly by xCAT, not using the HMC. This requires the new xCAT Direct FSP Management plugin (xCAT-dfm-*.ppc64.rpm), which is not part of the core xCAT open source, but is available as a free download from IBM. You must download this and install it on your xCAT management node (and possibly on your service nodes, depending on your configuration) before proceeding with this document.
Download DFM and the prerequisite hardware server package from Fix Central:
- Product Group: Power; Product: Cluster Software; Cluster Software: direct FSP management plug-in for xCAT
- Product Group: Power; Product: Cluster Software; Cluster Software: HPC Hardware Server

The packages are:
- xCAT-dfm RPM
- ISNM-hdwr_svr RPM (Linux)
- isnm.hdwr_svr installp package (AIX)
On Linux, follow the Set Up the Service Nodes for Diskfull Installation section of [Setting_Up_a_Linux_Hierarchical_Cluster] to add DFM and hdwr_svr to the list of packages installed on the SN. First create a directory for the packages:
```
mkdir -p /install/post/otherpkgs/<osver>/<arch>/dfm
```
For example, for rhels6:
```
mkdir -p /install/post/otherpkgs/rhels6/ppc64/dfm
```
Then put the DFM and hdwr_svr packages in the dfm directory just created.
You should create repodata in your /install/post/otherpkgs/<os>/<arch>/dfm directory so that yum or zypper can be used to install these packages and automatically resolve dependencies for you:
```
createrepo /install/post/otherpkgs/<os>/<arch>/dfm
```
If the createrepo command is not found, you may need to install the createrepo rpm package that is shipped with your Linux OS. (For SLES I think it is on the SDK DVD.)
Next, add the rpm names to the service.<osver>.<arch>.otherpkgs.pkglist file. In most cases, this file is already created under the /opt/xcat/share/xcat/install/<os> directory. If it is not, you can create your own by referencing the existing ones.
```
vi /install/custom/install/<os>/service.<osver>.<arch>.otherpkgs.pkglist
```
Keep the required xCAT packages for the service node, and append the following:
```
dfm/xCAT-dfm
dfm/ISNM-hdwr_svr-RHEL
```
After the Initialize network boot to install Service Nodes in Power 775 section in [Setting_Up_a_Linux_Hierarchical_Cluster], the DFM and hdwr_svr will be installed on the Linux SN automatically.
On AIX, follow the Add required service node software section of [Setting_Up_an_AIX_Hierarchical_Cluster] and add DFM and hdwr_svr to the bundle file. First create a directory for the packages:
```
mkdir -p /install/post/otherpkgs/aix/ppc64/dfm
```
Copy the DFM and hdwr_svr packages to the suggested target location on the xCAT MN:
```
/install/post/otherpkgs/aix/ppc64/dfm
```
Edit the bundle file for the AIX Service Node. Assuming you are using AIX 7.1, edit:
```
/opt/xcat/share/xcat/installp_bundles/xCATaixSN71.bnd
```
and add the following to the bundle file:
```
I:isnm.hdwr_svr
R:xCAT-dfm*
```
The required software must be copied to the NIM lpp_source that is being used for the service node image. Assuming you are using AIX 7.1, you could copy all the appropriate rpms to your lpp_source resource (e.g. 710SNimage_lpp_source) using the following command:
```
nim -o update packages=all -a source=/install/post/otherpkgs/aix/ppc64/dfm 710SNimage_lpp_source
```
The NIM command will find the correct directories and update the appropriate lpp_source resource directories.
After the Initiate a network boot or Initiate a network boot for Power 775 support step in [Setting_Up_an_AIX_Hierarchical_Cluster], the DFM and hdwr_svr will be installed on the AIX SN automatically.
## Conserver Setting
- Conserver will always use nodehm.conserver to determine which server should have the console for each node.
- The makeconservercf command will be distributed to the service nodes based on nodehm.conserver.

Note: nodehm.conserver should be set to the same service node that is set for the node's hardware control point (CEC); otherwise getmacs/rnetboot will not work.
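A minimal sketch, assuming node1's hardware control point CEC is served by a hypothetical service node sn1:
```
# Point node1's console at the same SN that controls its CEC
chdef -t node -o node1 conserver=sn1
# Rebuild the conserver configuration; it is distributed based on nodehm.conserver
makeconservercf node1
rcons node1   # the console session is now served via sn1
```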
## Setting servicenode, xcatmaster and conserver for DFM HW Ctrl Cmds
Five different node types should be considered: SN-CEC, SN, SN-CEC-nonSN-LPARs, non-SN-CEC, and non-SN-CEC-LPARs. The values of the noderes.servicenode and noderes.xcatmaster attributes of the CEC nodes determine whether the HW ctrl commands are run by DFM on the MN directly or are distributed to the SN. The values of the noderes.servicenode and noderes.xcatmaster attributes of the LPAR nodes determine whether software management is done on the MN directly or is distributed to the SN. rcons uses the nodehm.conserver attribute described in the [#Conserver_Setting] section, which determines which server has the console for each node.
1. For the SN-CECs:
- Key Attributes: The noderes.servicenode and noderes.xcatmaster attributes of these CECs are not set, or are set to the MN.
- HW Ctrl: The HW ctrl commands will be run by DFM on the MN directly.
2. For the SN:
- Key Attributes: The nodehm.conserver, noderes.servicenode and noderes.xcatmaster attributes of the SN are not set, or are set to the MN.
- HW Ctrl: The HW ctrl commands will be run by DFM on the MN directly.
- Software Management (serving the boot image, xdsh, etc.): The software management commands will be run on the MN directly.
3. For the SN-CEC-nonSN-LPARs:
- Key Attributes: The nodehm.conserver attribute is not set, or is set to the MN. The noderes.servicenode and noderes.xcatmaster attributes of these LPAR nodes are set to the SN.
- HW Ctrl: The HW ctrl commands will be run by DFM on the MN directly.
- Software Management (serving the boot image, xdsh, etc.): The software management commands will be distributed to the SN.
4. For the non-SN-CECs:
- Key Attributes: The noderes.servicenode and noderes.xcatmaster attributes of these CECs are set to the SN.
- HW Ctrl: The HW ctrl commands will be distributed to the SN.
5. For the non-SN-CEC-LPARs:
- Key Attributes: The nodehm.conserver, noderes.servicenode and noderes.xcatmaster attributes of these LPARs are set to the SN.
- HW Ctrl: The HW ctrl commands will be distributed to the SN.
- Software Management (serving the boot image, xdsh, etc.): The software management commands will be distributed to the SN.
For example:
- If noderes.servicenode is set for a hardware control point (CEC), then hardware control commands are distributed to the SN.
- I.e., if "rpower node1 on" is run, xCAT first looks up the hcp (e.g. ppc.hcp) of node1; assume it is called CEC1. Then it looks up noderes.servicenode for CEC1. If that is set, for example, to sn1, the rpower command is dispatched to sn1, and sn1 then contacts CEC1 to power on node1.
- If noderes.servicenode is not set for a hardware control point (CEC), then hardware control commands run on the MN directly.
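The lookup chain can be inspected with lsdef; a sketch using the hypothetical names above:
```
# How xCAT decides where to run "rpower node1 on"
lsdef node1 -i hcp           # e.g. hcp=CEC1: node1's hardware control point
lsdef CEC1 -i servicenode    # e.g. servicenode=sn1: the dispatch target
rpower node1 on              # dispatched to sn1, which contacts CEC1
```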
## Configuring xCAT SN Hierarchy Ethernet Adapters (Power 775 DFM Only)
The SN Cluster Service Network Ethernet interfaces are defined through the xCAT hosts table, as described in the xCAT DB attributes section above.
## DFM configuration for hardware server connections
This section discusses what needs to be configured and defined on the SN to allow DFM to use hardware server to communicate with the CECs and Frames.
Since the CEC and Frame node definitions are already created on the MN, this step primarily uses that data to create the entries in the hardware server configuration file. This task is done by running the mkhwconn command for each FSP of the CECs and each BPA of the Frames. The mkhwconn command runs on the MN.
Currently, xCAT supports creating hdwr_svr connections for xCAT (tool type: lpar) and CNM (tool type: fnm).
1. hdwr_svr connections for xCAT (tool type: lpar)
According to the CEC definitions in the [#xCAT_DB_attributes] section: for SN-CECs, the noderes.servicenode and noderes.xcatmaster attributes are not set, or are set to the MN; for non-SN-CECs, they are set to the SN.
For the SN CECs, the following commands create the connections between the hdwr_svr on the MN and the CECs, and check the connection state (the final expected state is "LINE UP"):
```
mkhwconn <sn-CECs> -t
lshwconn <sn-CECs>
```
Before running hardware control commands that will be distributed to the SN, the OS provisioning of the SN must be complete.
For the non-SN CECs, this command creates the connections between the hdwr_svr on the SN and the CECs:
```
mkhwconn <non-SN-CECs> -t
```
According to the Frame definitions in the [#xCAT_DB_attributes] section, Frames are controlled by the MN, so the following command creates the connections between the hdwr_svr on the MN and the Frames:
```
mkhwconn frame -t
```
Then run the following command to check the connection state; the final expected state is "LINE UP":
```
lshwconn frame
```
2. hdwr_svr connections for CNM (tool type: fnm)
In an xCAT DFM hierarchy environment, CNM is not supported on the SN, and CNM needs a connection to every CEC in the cluster to manage the details of the HFI connectivity. The hdwr_svr "fnm" hardware connections are therefore available only on the xCAT EMS. mkhwconn creates the hdwr_svr connections for CNM on the MN directly to the CECs:
```
mkhwconn cec -t -T fnm
lshwconn -T fnm
```
## DFM rpower on flow during OS provisioning
In the rpower flow for a P775 IH cluster, all GPFS I/O server nodes must be powered on before the compute nodes. The compute nodes cannot get their image from the SNs as soon as the CEC powers up: the SNs must be powered on first, and only after the OS provisioning of the SNs has finished can the compute nodes be powered on. The required rpower on order is therefore (see the sketch after this list):
1. Power on the CECs to standby state
2A. Power on the SNs (includes LL servers)
2B. Power on the GPFS I/O servers working with DEs (disk enclosures)
2C. Power on the Utility nodes (not as important, but needed)
3. Power on the compute nodes
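A sketch of this order with rpower, assuming hypothetical node group names (cec, service, ionodes, utility, compute):
```
rpower cec onstandby   # 1. bring the CECs to standby state
rpower service on      # 2A. service nodes; wait for their OS provisioning to finish
rpower ionodes on      # 2B. GPFS I/O server nodes
rpower utility on      # 2C. utility nodes
rpower compute on      # 3. compute nodes
```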
## CEC down/up policy
In the DFM hierarchical environment, power operations on the non-SN CECs are performed through the service node, so the following CEC up/down sequences should be followed.
Power-up sequence:
1. Power on SN-CECs to standby/operating state
2. Power SNs on with OS
3. Create the connections from the SNs to the non-sn-CECs
4. Power on the non-sn-CECs
5. Power on all the compute LPARs
Power-down sequence (see the sketch after this list):
1. Power off all the compute LPARs
2. Power off the non-sn-CECs
3. Power off the SNs
4. Power off the SN-CECs
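A sketch of the power-down sequence, assuming hypothetical group names (compute, nonsncec, service, sncec):
```
rpower compute off     # 1. compute LPARs
rpower nonsncec off    # 2. non-SN CECs
rpower service off     # 3. service nodes
rpower sncec off       # 4. SN CECs
```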
## Firmware update sequence
The firmware update itself is performed with rflash (see the sketch after this list).
1. Power off all the CECs
2. Do the firmware update for the Frames' BPAs
3. Do the firmware update for the sn-CECs' FSPs
4. Power on the sn-CECs
5. Power SNs on with OS
6. Create the connections from the SNs to the non-sn-CECs
7. Do the firmware update for the non-sn-CECs' FSPs
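A hedged sketch of steps 2, 3 and 7 with rflash, assuming the firmware packages are in a hypothetical directory /install/fw and the group names from the previous sketches:
```
rflash frame -p /install/fw --activate disruptive    # 2. BPAs of the Frames
rflash sncec -p /install/fw --activate disruptive    # 3. FSPs of the SN CECs
# ... power on the SN CECs, provision the SNs, run mkhwconn (steps 4-6), then:
rflash nonsncec -p /install/fw --activate disruptive # 7. FSPs of the non-SN CECs
```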