Configuration
[Basic]
head
node_size
ckpt_dir
glbl_dir
meta_dir
ckpt_L1
ckpt_L2
ckpt_L3
ckpt_L4
dcp_L4
inline_L2
inline_L3
inline_L4
keep_last_ckpt
keep_l4_ckpt
group_size
max_synch_intv
ckpt_io
enable_staging
enable_dcp
dcp_mode
dcp_block_size
verbosity
[Restart]
failure
exec_id
[Advanced]
block_size
transfer_size
general_tag
ckpt_tag
stage_tag
final_tag
local_test
lustre_striping_unit
lustre_striping_factor
lustre_striping_offset
head
The checkpointing safety levels L2, L3 and L4 produce additional overhead due to the necessary postprocessing work on the checkpoints. FTI offers the possibility to create an MPI process, called HEAD, in which this postprocessing is carried out. This allows the application processes to continue the execution immediately after the checkpointing.
| Value | Meaning |
|---|---|
| 0 | The checkpoint postprocessing work is covered by the application processes |
| 1 | The HEAD process accomplishes the checkpoint postprocessing work (notice: In this case, the number of application processes will be n-1 per node) |
(default = 0)
node_size
Tells FTI how many processes will run on each node (ppn). In most cases this will be the number of processing units within the node (e.g. 2 CPUs/node and 8 cores/CPU → 16 processes/node).
| Value | Meaning |
|---|---|
|  | Number of processing units within each node (notice: The total number of processes must be a multiple of group_size*node_size) |
(default = 2)
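As a rough sketch of the example above, the fragment below describes nodes with 16 processing units that are grouped four at a time. The values are illustrative assumptions, and the section name is written as it appears in the headings of this page.

```ini
[Basic]
# 2 CPUs/node * 8 cores/CPU = 16 processes per node
node_size  = 16
# 4 nodes form one group
group_size = 4
# the total number of processes must be a multiple of 4 * 16 = 64
```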
ckpt_dir
This entry defines the path to the local hard drive on the nodes.
| Value | Meaning |
|---|---|
| string | Path to the local hard drive on the nodes |
(default = /scratch/username/)
glbl_dir
This entry defines the path to the checkpoint folder on the parallel file system (PFS), where L4 checkpoints are stored.
| Value | Meaning |
|---|---|
| string | Path to the checkpoint directory on the PFS |
(default = /work/project/)
meta_dir
This entry defines the path to the meta files directory. The directory has to be accessible from each node. It keeps files with information about the topology of the execution.
| Value | Meaning |
|---|---|
| string | Path to the meta files directory |
(default = /home/user/.fti)
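Taken together, the three directory entries could be set as in the following sketch. The paths are simply the default values listed above and would normally be replaced by site-specific locations.

```ini
[Basic]
# node-local storage for the checkpoints
ckpt_dir = /scratch/username/
# directory on the PFS for L4 checkpoints
glbl_dir = /work/project/
# meta files; must be accessible from each node
meta_dir = /home/user/.fti
```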
ckpt_L1
Here, the user sets the interval between L1 checkpoints when using FTI_Snapshot().
| Value | Meaning |
|---|---|
|  | L1 checkpointing interval in minutes |
| 0 | Disable L1 checkpointing |
(default = 3)
ckpt_L2
Here, the user sets the interval between L2 checkpoints when using FTI_Snapshot().
| Value | Meaning |
|---|---|
|  | L2 checkpointing interval in minutes |
| 0 | Disable L2 checkpointing |
(default = 5)
ckpt_L3
Here, the user sets the interval between L3 checkpoints when using FTI_Snapshot().
| Value | Meaning |
|---|---|
|  | L3 checkpointing interval in minutes |
| 0 | Disable L3 checkpointing |
(default = 7)
ckpt_L4
Here, the user sets the interval between L4 checkpoints when using FTI_Snapshot().
| Value | Meaning |
|---|---|
|  | L4 checkpointing interval in minutes |
| 0 | Disable L4 checkpointing |
(default = 11)
dcp_L4
Here, the user sets the interval between L4 differential checkpoints (dCP) when using FTI_Snapshot().
| Value | Meaning |
|---|---|
|  | L4 dCP checkpointing interval in minutes |
| 0 | Disable L4 dCP checkpointing |
(default = 0)
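The interval entries can be combined freely. The sketch below uses the default values listed above, except that L3 checkpointing is switched off for illustration.

```ini
[Basic]
# intervals in minutes, used by FTI_Snapshot(); 0 disables a level
ckpt_L1 = 3
ckpt_L2 = 5
ckpt_L3 = 0
ckpt_L4 = 11
# L4 differential checkpoints disabled
dcp_L4  = 0
```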
inline_L2
With this setting, the user chooses whether the post-processing work on the L2 checkpoints is done by an FTI process or by the application process.
| Value | Meaning |
|---|---|
| 0 | The post-processing work of the L2 checkpoints is done by an FTI process (notice: This setting is only allowed if head = 1) |
| 1 | The post-processing work of the L2 checkpoints is done by the application process |
(default = 1)
inline_L3
With this setting, the user chooses whether the post-processing work on the L3 checkpoints is done by an FTI process or by the application process.
| Value | Meaning |
|---|---|
| 0 | The post-processing work of the L3 checkpoints is done by an FTI process (notice: This setting is only allowed if head = 1) |
| 1 | The post-processing work of the L3 checkpoints is done by the application process |
(default = 1)
inline_L4
With this setting, the user chooses whether the post-processing work on the L4 checkpoints is done by an FTI process or by the application process.
| Value | Meaning |
|---|---|
| 0 | The post-processing work of the L4 checkpoints is done by an FTI process (notice: This setting is only allowed if head = 1) |
| 1 | The post-processing work of the L4 checkpoints is done by the application process |
(default = 1)
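Because the value 0 for the inline settings is only allowed together with head = 1, the entries are typically changed as a set. A minimal sketch, assuming all post-processing should be moved off the application processes:

```ini
[Basic]
# create the HEAD process for post-processing
head      = 1
# 0 = post-processing is done by an FTI process (only allowed if head = 1)
inline_L2 = 0
inline_L3 = 0
inline_L4 = 0
```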
keep_last_ckpt
This setting tells FTI whether the last checkpoint taken during the execution is kept after a successful run.
| Value | Meaning |
|---|---|
| 0 | During FTI_Finalize(), all checkpoints will be removed (except when keep_l4_ckpt = 1) |
| 1 | After FTI_Finalize(), the last checkpoint will be kept and stored on the PFS as an L4 checkpoint (notice: Additionally, the setting failure in the configuration file is set to 2. This will lead to a restart from the last checkpoint if the application is executed again) |
(default = 0)
keep_l4_ckpt
This setting makes FTI keep all level 4 checkpoints taken during the execution. The checkpoint files will be saved in glbl_dir/l4_archive.
| Value | Meaning |
|---|---|
| 0 | During FTI_Finalize(), all checkpoints will be removed (except when keep_last_ckpt = 1) |
| 1 | All level 4 checkpoints taken during the execution will be stored under glbl_dir/l4_archive. This folder will not be deleted during the FTI_Finalize() call. |
(default = 0)
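The two keep settings are independent of each other. The following sketch enables both, so that the last checkpoint survives FTI_Finalize() and every L4 checkpoint is archived.

```ini
[Basic]
# keep the last checkpoint as an L4 checkpoint on the PFS
keep_last_ckpt = 1
# store all L4 checkpoints under glbl_dir/l4_archive
keep_l4_ckpt   = 1
```

As described above, with keep_last_ckpt = 1 the failure entry is set to 2 after FTI_Finalize(), so running the application again restarts from the kept checkpoint.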
group_size
The group size entry sets how many nodes (members) form a group.
| Value | Meaning |
|---|---|
|  | Number of nodes contained in a group (notice: The total number of processes must be a multiple of group_size*node_size) |
(default = 4)
max_synch_intv
Sets the maximum number of iterations between synchronisations of the iteration length (used for FTI_Snapshot()). Internally, the value is rounded down to the next lower power of 2.
| Value | Meaning |
|---|---|
|  | Maximum number of iterations between measurements of the global mean iteration time (MPI_Allreduce call) |
| 0 | Sets the value to 512, the default value for FTI |
(default = 0)
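To illustrate the rounding rule, assume a requested maximum of 100 iterations. The next lower power of 2 is 64, so that is the value FTI would use internally.

```ini
[Basic]
# requested: 100 iterations; rounded internally to 64 (2^6)
max_synch_intv = 100
```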
ckpt_io
Sets the I/O mode.
| Value | Meaning |
|---|---|
| 1 | POSIX I/O mode |
| 2 | MPI-IO I/O mode |
| 3 | FTI-FF I/O mode |
| 4 | SIONLib I/O mode |
| 5 | HDF5 I/O mode |
(default = 1)
enable_staging
Enables the staging feature. This feature allows files to be staged asynchronously from local storage (e.g. node-local NVMe storage) to the PFS. FTI offers the API functions FTI_SendFile, FTI_GetStageDir and FTI_GetStageStatus for this purpose.
| Value | Meaning |
|---|---|
| 0 | Staging disabled |
| 1 | Staging enabled (creation of the staging directory in folder ckpt_dir) |
(default = 0)
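A short sketch of how the I/O mode and the staging switch might appear side by side in the configuration file. The choice of POSIX I/O here is only the default and is not required for staging.

```ini
[Basic]
# 1 = POSIX I/O mode (default)
ckpt_io        = 1
# 1 = staging enabled; the staging directory is created in ckpt_dir
enable_staging = 1
```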
enable_dcp
Enables differential checkpointing (dCP). In order to use this feature, ckpt_io has to be set to 3 (FTI-FF). To trigger differential checkpoints, either use the level FTI_L4_DCP in FTI_Checkpoint or set the interval in dcp_L4 for usage in FTI_Snapshot.
| Value | Meaning |
|---|---|
| 0 | dCP disabled |
| 1 | dCP enabled |
dcp_mode
Sets the hash algorithm used for differential checkpointing.
| Value | Meaning |
|---|---|
| 0 | MD5 |
| 1 | CRC32 |
(default = 0)
dcp_block_size
Sets the desired partition block size for differential checkpointing in bytes. The block size must lie between 512 and USHRT_MAX (65535 on most systems).
| Value | Meaning |
|---|---|
|  | Block size for the dataset partition for dCP |
(default = 16384)
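The dCP-related entries depend on each other, so a combined fragment may help. The sketch below assumes CRC32 hashing and a 10 minute interval purely for illustration.

```ini
[Basic]
# dCP requires the FTI-FF I/O mode
ckpt_io        = 3
enable_dcp     = 1
# 0 = MD5, 1 = CRC32
dcp_mode       = 1
# in bytes; must lie between 512 and USHRT_MAX
dcp_block_size = 16384
# L4 dCP interval in minutes, used by FTI_Snapshot()
dcp_L4         = 10
```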
verbosity
Sets the level of verbosity.
| Value | Meaning |
|---|---|
| 1 | Debug sensitive. Besides warnings, errors and information, FTI debugging information will be printed |
| 2 | Information sensitive. FTI prints warnings, errors and information |
| 3 | FTI prints only warnings and errors |
| 4 | FTI prints only errors |
(default = 2)
[Restart]
failure
This setting should mainly be set by FTI itself. The behaviour within FTI is the following:
- Within FTI_Init(), it remains at its initial value.
- After the first checkpoint is taken, it is set to 1.
- After FTI_Finalize() with keep_last_ckpt = 0, it is set to 0.
- After FTI_Finalize() with keep_last_ckpt = 1, it is set to 2.
| Value | Meaning |
|---|---|
| 0 | The application starts with its initial conditions (notice: In order to force a clean start, the value may be set to 0 manually. In this case the user has to take care of removing the checkpoint data from the last execution) |
| 1 | FTI searches for checkpoints and starts from the highest checkpoint level (notice: If no readable checkpoints are found, the execution stops) |
| 2 | FTI searches for the last L4 checkpoint and restarts the execution from there (notice: If the checkpoint is not L4 or is not readable, the execution stops) |
(default = 0)
exec_id
This setting should mainly be set by FTI itself. During FTI_Init(), the execution ID is set if the application starts for the first time (failure = 0). In the case of a restart (failure = 1 or 2), FTI uses the execution ID to find the checkpoint files.
| Value | Meaning |
|---|---|
|  | Execution ID (notice: If several sets of checkpoint data are available, the execution ID may be set by the user to select the desired starting point) |
(default = NULL)
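Since FTI normally maintains both restart entries itself, the only common manual change is forcing a clean start. A sketch of that case, using the section name as written on this page:

```ini
[Restart]
# 0 = start from the initial conditions; old checkpoint data must be removed by the user
failure = 0
# normally set by FTI during FTI_Init()
exec_id = NULL
```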
[Advanced]
The settings in this section should ONLY be changed by advanced users.
block_size
FTI temporarily copies small blocks of the L2 and L3 checkpoints to send them through MPI. The size of the data blocks can be set here.
| Value | Meaning |
|---|---|
| int | Size in KB of the data blocks sent by FTI through MPI for the checkpoint levels L2 and L3 |
(default = 1024)
transfer_size
FTI transfers local checkpoint files to the PFS in chunks. The size of the chunks can be set here.
| Value | Meaning |
|---|---|
| int | Size in MB of the chunks sent by FTI from local storage to the PFS |
(default = 16)
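Note that the two size entries use different units. The sketch below shows the default values with the units spelled out.

```ini
[Advanced]
# in KB: data blocks sent through MPI for L2 and L3
block_size    = 1024
# in MB: chunks sent from local storage to the PFS
transfer_size = 16
```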
general_tag
FTI uses certain tags for its MPI messages. The tag for general messages can be set here.
| Value | Meaning |
|---|---|
| int | Tag used for general MPI messages within FTI |
(default = 2612)
ckpt_tag
FTI uses certain tags for its MPI messages. The tag for messages related to checkpoint communication can be set here.
| Value | Meaning |
|---|---|
| int | Tag used for MPI messages related to a checkpoint context within FTI |
(default = 711)
stage_tag
FTI uses certain tags for its MPI messages. The tag for messages related to staging communication can be set here.
| Value | Meaning |
|---|---|
| int | Tag used for MPI messages related to a staging context within FTI |
(default = 406)
final_tag
FTI uses certain tags for its MPI messages. The tag for the message to the heads that triggers the end of the execution can be set here.
| Value | Meaning |
|---|---|
| int | Tag used for the MPI message, sent from the application processes to the heads, that marks the end of the execution within FTI |
(default = 3107)
lustre_striping_unit
This option only takes effect if -DENABLE_LUSTRE was added to the CMake command. It sets the striping unit for the MPI-IO file.
| Value | Meaning |
|---|---|
|  | Striping size in bytes. The default in Lustre systems is 1 MB (1048576 bytes); FTI uses 4 MB (4194304 bytes) as the default value |
| 0 | Assigns the Lustre default value |
(default = 4194304)
lustre_striping_factor
This option only takes effect if -DENABLE_LUSTRE was added to the CMake command. It sets the striping factor for the MPI-IO file.
| Value | Meaning |
|---|---|
|  | Striping factor. The striping factor determines the number of OSTs to use for striping. |
| -1 | Stripe over all available OSTs. This is the default in FTI. |
| 0 | Assigns the Lustre default value |
(default = -1)
lustre_striping_offset
This option only takes effect if -DENABLE_LUSTRE was added to the CMake command. It sets the striping offset for the MPI-IO file.
| Value | Meaning |
|---|---|
|  | Striping offset. The striping offset selects a particular OST to begin striping at. |
| -1 | Assigns the Lustre default value |
(default = -1)
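The three striping entries could be combined as in the following sketch. The factor of 8 OSTs is an assumed value for illustration; the other two entries use the defaults listed above.

```ini
[Advanced]
# 4 MB stripes (FTI default); 0 would assign the Lustre default
lustre_striping_unit   = 4194304
# stripe over 8 OSTs; -1 would stripe over all available OSTs
lustre_striping_factor = 8
# -1 = Lustre default value for the starting OST
lustre_striping_offset = -1
```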
local_test
FTI builds the topology of the execution by determining the hostnames of the nodes on which each process runs. Depending on the settings for group_size, node_size and head, FTI assigns each process to a group and decides which processes will be dedicated as Head or Application processes. This topology check is referred to as the local test. In certain situations (e.g. to run FTI on a local machine) it is necessary to disable this function.
| Value | Meaning |
|---|---|
| 0 | Local test is disabled. FTI will simulate the situation set in the configuration |
| 1 | Local test is enabled (notice: FTI will check if the settings are correct on initialization and if necessary stop the execution) |
(default = 1)
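For a run on a single local machine, the check can be switched off so that FTI simulates the configured layout. A minimal sketch, assuming the default node and group sizes:

```ini
[Basic]
# simulated layout: 2 processes per node, 4 nodes per group
node_size  = 2
group_size = 4

[Advanced]
# 0 = local test disabled; FTI simulates the situation set in the configuration
local_test = 0
```

With these values the total number of processes still has to be a multiple of 4 * 2 = 8.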