API Reference
FTI Datatypes
FTI Constants
FTI_Init
FTI_InitType
FTI_Protect
FTI_GetStoredSize
FTI_Realloc
FTI_Checkpoint
FTI_Status
FTI_Recover
FTI_Snapshot
FTI_GetStageDir
FTI_GetStageStatus
FTI_SendFile
FTI_Finalize
FTI_CHAR
: FTI data type for chars.
FTI_SHRT
: FTI data type for short integers.
FTI_INTG
: FTI data type for integers.
FTI_LONG
: FTI data type for long integers.
FTI_UCHR
: FTI data type for unsigned chars.
FTI_USHT
: FTI data type for unsigned short integers.
FTI_UINT
: FTI data type for unsigned integers.
FTI_ULNG
: FTI data type for unsigned long integers.
FTI_SFLT
: FTI data type for single floating point.
FTI_DBLE
: FTI data type for double floating point.
FTI_LDBE
: FTI data type for long double floating point.
FTI_BUFS
: 256
FTI_DONE
: 1
FTI_SCES
: 0
FTI_NSCS
: -1
FTI_NREC
: -2
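The constants above are what most FTI calls return, so it can help to turn them into readable log output. The following is a minimal sketch, not part of FTI: `fti_code_name` and `fti_report` are hypothetical helpers, and the `#define` lines are stand-ins that mirror the values listed above so the snippet compiles without `fti.h`. In a real program the constants come from `fti.h`.

```c
#include <assert.h>
#include <stdio.h>

/* Stand-ins mirroring the constants listed above (assumption: a real
 * program gets these from fti.h, in which case this block is skipped). */
#ifndef FTI_SCES
#define FTI_DONE  1
#define FTI_SCES  0
#define FTI_NSCS -1
#define FTI_NREC -2
#endif

/* Hypothetical helper: name of a return code, for log messages. */
static const char *fti_code_name(int code) {
    switch (code) {
    case FTI_DONE: return "FTI_DONE";
    case FTI_SCES: return "FTI_SCES";
    case FTI_NSCS: return "FTI_NSCS";
    case FTI_NREC: return "FTI_NREC";
    default:       return "unknown";
    }
}

/* Hypothetical helper: logs a failed call to stderr.
 * Returns 1 if the code is one of the two failure codes, 0 otherwise. */
static int fti_report(const char *what, int code) {
    assert(what != NULL);
    if (code == FTI_NSCS || code == FTI_NREC) {
        fprintf(stderr, "%s failed with %s (%d)\n",
                what, fti_code_name(code), code);
        return 1;
    }
    return 0;
}
```

A call site could then read `if (fti_report("FTI_Init", res)) { ... }`, keeping the error handling in one place.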
- Reads configuration file.
- Creates checkpoint directories.
- Detects topology of the system.
- Regenerates data upon recovery.
DEFINITION
int FTI_Init ( char * configFile , MPI_Comm globalComm )
INPUT
Variable | What for? |
---|---|
char * configFile | Path to the config file |
MPI_Comm globalComm | MPI communicator used for the execution |
OUTPUT
Value | Reason |
---|---|
FTI_SCES | Success |
FTI_NSCS | No success |
FTI_NREC | FTI could not recover ckpt files |
DESCRIPTION
FTI_Init initializes the FTI context. It must be called before any other FTI function and after MPI_Init.
EXAMPLE
int main ( int argc , char **argv ) {
MPI_Init (&argc , &argv );
char *path = "config.fti"; // config file path
int res = FTI_Init ( path , MPI_COMM_WORLD );
if (res == FTI_NREC) {
printf("Recovery not possible, terminating...");
FTI_Finalize();
MPI_Finalize();
return 1;
}
.
.
.
return 0;
}
- Initializes a data type.
DEFINITION
int FTI_InitType ( FTIT_type *type , int size )
INPUT
Variable | What for? |
---|---|
FTIT_type * type | The data type to be initialized |
int size | The size of the data type to be initialized |
OUTPUT
Value | Reason |
---|---|
FTI_SCES | Success |
DESCRIPTION
FTI_InitType initializes an FTI data type. A data type that FTI does not define by default (see: FTI Datatypes) must be defined using this function before variables of that type can be protected with FTI_Protect.
EXAMPLE
typedef struct A {
int a;
int b;
} A;
FTIT_type structAinfo ;
// sizeof(struct) is safest due to padding
// in more complex structs
FTI_InitType (&structAinfo , sizeof(A));
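The comment about padding in the example above can be made concrete. The sketch below uses a hypothetical struct `Particle` (not from FTI) whose members typically force the compiler to insert padding, and compares the sum of the member sizes with `sizeof`. The actual sizes depend on the platform, so the code only checks relations that always hold.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical struct for illustration: a char followed by a double
 * usually makes the compiler insert padding after 'tag'. */
typedef struct Particle {
    char   tag;
    double mass;
} Particle;

/* Sum of the member sizes, ignoring any padding. */
static size_t particle_member_bytes(void) {
    return sizeof(char) + sizeof(double);
}

/* The size that should be passed as the second argument of FTI_InitType. */
static size_t particle_protect_bytes(void) {
    /* sizeof covers members plus padding, so it is never smaller. */
    assert(sizeof(Particle) >= particle_member_bytes());
    return sizeof(Particle);
}
```

Passing `particle_member_bytes()` instead would risk truncating the stored data whenever padding is present, which is why the example uses `sizeof(A)`.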
- Stores metadata concerning the variable to protect.
DEFINITION
int FTI_Protect ( int id, void *ptr, long count, FTIT_type type )
INPUT
Variable | What for? |
---|---|
int id | Unique ID of the variable to protect |
void * ptr | Pointer to memory address of variable |
long count | Number of elements at memory address |
FTIT_type type | FTI data type of variable to protect |
OUTPUT
Value | Reason |
---|---|
FTI_SCES | Success |
FTI_NSCS | No success |
DESCRIPTION
FTI_Protect is used to add data fields to the list of protected variables. Data protected by this function will be stored during a call to FTI_Checkpoint or FTI_Snapshot and restored during a call to FTI_Recover.
If the dimension of a protected variable changes during the execution, a subsequent call to FTI_Protect will update the metadata within FTI so that the correct size is stored during a later call to FTI_Checkpoint or FTI_Snapshot.
EXAMPLE
int A;
float *B = malloc (sizeof(float) * 10) ;
FTI_Protect(1, &A, 1, FTI_INTG );
FTI_Protect(2, B, 10, FTI_SFLT );
// changing B size
B = realloc(B, sizeof(float) * 20) ;
// updating B size in protected list
FTI_Protect(2, B, 20, FTI_SFLT);
- Returns size of protected variable saved in metadata
DEFINITION
long FTI_GetStoredSize ( int id )
INPUT
Variable | What for? |
---|---|
int id | ID of the protected variable |
OUTPUT
Value | Reason |
---|---|
long | Size of a variable |
0 | No success |
DESCRIPTION
FTI_GetStoredSize returns the size of a protected variable with id from the FTI metadata. The result may differ from the size of the variable known to the application at that moment. If the function is called on a restart, it returns the size stored in the metadata file. Called during the execution, it returns the value stored in the FTI runtime metadata, i.e. the size of the variable at the moment of the last checkpoint.
The function is needed to manually reallocate memory for protected variables with variable size on a recovery. Another possibility for the reallocation of memory is provided by FTI_Realloc.
EXAMPLE
...
long* array = calloc(arraySize, sizeof(long));
FTI_Protect(1, array, arraySize, FTI_LONG);
if (FTI_Status() != 0) {
long arraySizeInBytes = FTI_GetStoredSize(1);
if (arraySizeInBytes == 0) {
printf("No stored size in metadata!\n");
return GETSTOREDSIZE_FAILED;
}
array = realloc(array, arraySizeInBytes);
int res = FTI_Recover();
if (res != 0) {
printf("Recovery failed!\n");
return RECOVERY_FAILED;
}
//update arraySize
arraySize = arraySizeInBytes / sizeof(long);
}
for (i = 0; i < max; i++) {
if (i % CKTP_STEP == 0) {
//update FTI array size information
FTI_Protect(1, array, arraySize, FTI_LONG);
// checkpoint ID must be non-zero and unique per checkpoint
int res = FTI_Checkpoint(i / CKTP_STEP + 1, 1);
if (res != FTI_DONE) {
printf("Checkpoint failed!.\n");
return CHECKPOINT_FAILED;
}
}
...
//add element to array
arraySize += 1;
array = realloc(array, arraySize * sizeof(long));
}
...
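The manual reallocation in the example above can be wrapped in a small helper. This is a sketch under the assumption, taken from the example, that the stored size is a byte count. `resize_to_stored` is a hypothetical name and not part of FTI. Unlike the example, it keeps the old pointer when `realloc` fails, so the buffer is not leaked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper (not part of FTI): resizes *array to stored_bytes,
 * e.g. the value returned by FTI_GetStoredSize, and returns the new
 * element count. Returns -1 and leaves *array untouched if the size is
 * not positive, not a multiple of sizeof(long), or realloc fails. */
static long resize_to_stored(long **array, long stored_bytes) {
    assert(array != NULL);
    if (stored_bytes <= 0) {
        return -1;
    }
    if ((size_t)stored_bytes % sizeof(long) != 0) {
        return -1;
    }
    long *tmp = realloc(*array, (size_t)stored_bytes);
    if (tmp == NULL) {
        return -1; /* old buffer is still valid */
    }
    *array = tmp;
    return (long)((size_t)stored_bytes / sizeof(long));
}
```

On a restart, the result would replace `arraySize` before the call to FTI_Recover.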
- Reallocates dataset to last checkpoint size.
DEFINITION
void* FTI_Realloc ( int id, void* ptr )
INPUT
Variable | What for? |
---|---|
int id | ID of the protected variable |
void * ptr | Pointer to memory address of variable |
OUTPUT
Value | Reason |
---|---|
void* | Pointer to reallocated data |
NULL | On failure |
DESCRIPTION
FTI_Realloc is called on recovery for protected variables with dynamic size. It reallocates the memory at the given address so that it is large enough to hold the checkpoint data. It must be called before FTI_Recover to prevent segmentation faults. If the reallocation must or should be done within the application, FTI provides the function FTI_GetStoredSize to query the size of the variable in the checkpoint to recover.
EXAMPLE
...
FTI_Protect(1, &arraySize, 1, FTI_INTG);
long* array = calloc(arraySize, sizeof(long));
FTI_Protect(2, array, arraySize, FTI_LONG);
if (FTI_Status() != 0) {
array = FTI_Realloc(2, array);
if (array == NULL) {
printf("Reallocation failed!\n");
return REALLOC_FAILED;
}
int res = FTI_Recover();
if (res != 0) {
printf("Recovery failed!\n");
return RECOVERY_FAILED;
}
}
for (i = 0; i < max; i++) {
if (i % CKTP_STEP == 0) {
//update FTI array size information
FTI_Protect(2, array, arraySize, FTI_LONG);
// checkpoint ID must be non-zero and unique per checkpoint
int res = FTI_Checkpoint(i / CKTP_STEP + 1, 1);
if (res != FTI_DONE) {
printf("Checkpoint failed!.\n");
return CHECKPOINT_FAILED;
}
}
...
//add element to array
arraySize += 1;
array = realloc(array, arraySize * sizeof(long));
}
...
- Stores protected variables in the checkpoint of a desired safety level.
DEFINITION
int FTI_Checkpoint( int id, int level )
INPUT
Variable | What for? |
---|---|
int id | Unique checkpoint ID |
int level | Checkpoint level (1=L1, 2=L2, 3=L3, 4=L4) |
OUTPUT
Value | Reason |
---|---|
FTI_DONE | Success |
FTI_NSCS | Failure |
DESCRIPTION
FTI_Checkpoint is used to store the current values of protected variables into a checkpoint of safety level level (see Multilevel-Checkpointing for descriptions of the particular levels).
NOTICE: The checkpoint id must be different from 0!
EXAMPLE
int i;
for (i = 0; i < 100; i ++) {
if (i % 10 == 0) {
FTI_Checkpoint ( i /10 + 1, 1) ;
}
.
. // some computations
.
}
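Because the checkpoint ID must never be 0, the `i / 10 + 1` pattern from the example can be factored into a helper that makes the rule explicit. The function below is a hypothetical sketch, not an FTI call: it returns 0 to mean "no checkpoint due", which is safe precisely because 0 is not a valid ID.

```c
#include <assert.h>

/* Hypothetical helper (not part of FTI): returns the checkpoint ID for
 * iteration i when i is a multiple of step, or 0 when no checkpoint is
 * due. Valid IDs start at 1, so 0 never collides with a real ID. */
static int checkpoint_id_for(int i, int step) {
    assert(i >= 0 && step > 0);
    if (i % step != 0) {
        return 0;
    }
    return i / step + 1;
}
```

In the loop above, this could be used as `int id = checkpoint_id_for(i, 10); if (id != 0) FTI_Checkpoint(id, 1);`.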
- Returns the current status of the recovery flag.
DEFINITION
int FTI_Status()
OUTPUT
Value | Reason |
---|---|
 | No checkpoints taken yet or recovered successfully |
 | At least one checkpoint is taken. If execution fails, the next start will be a restart |
 | The execution is a restart from checkpoint level L4 and keep_last_checkpoint was enabled during the last execution |
DESCRIPTION
FTI_Status returns the current status of the recovery flag.
EXAMPLE
if ( FTI_Status () != 0) {
.
. // this section will be executed during restart
.
}
- Recovers the data of the protected variables from the checkpoint file.
DEFINITION
int FTI_Recover()
OUTPUT
Value | Reason |
---|---|
FTI_SCES | Success |
FTI_NSCS | Failure |
DESCRIPTION
FTI_Recover loads the data from the checkpoint file into the protected variables. It only recovers variables which are protected by a preceding call to FTI_Protect. If a variable changes its size during execution, the proper amount of memory has to be allocated for that variable before the call to FTI_Recover. FTI provides the API functions FTI_GetStoredSize and FTI_Realloc for this case.
EXAMPLE
Basic example:
if ( FTI_Status() == 1 ) {
FTI_Recover() ;
}
- Invokes the recovery of protected variables on a restart.
- Writes multilevel checkpoints regarding their requested frequencies during execution.
DEFINITION
int FTI_Snapshot()
OUTPUT
Value | Reason |
---|---|
FTI_SCES | Successful call (without checkpointing) or successful recovery |
FTI_NSCS | Failure of FTI_Checkpoint |
FTI_DONE | Success of FTI_Checkpoint |
FTI_NREC | Failure on recovery |
DESCRIPTION
On a restart, FTI_Snapshot loads the data from the checkpoint file into the protected variables. During execution it performs checkpoints according to the checkpoint frequencies for the various safety levels. The frequencies may be set in the configuration file (see e.g.: ckpt_L1).
FTI_Snapshot can only take care of variables which are protected by a preceding call to FTI_Protect.
EXAMPLE
int res = FTI_Snapshot();
if ( res == FTI_SCES ) {
.
. // executed after successful recover
. // or when checkpoint is not required
}
else { // res == FTI_DONE
.
. // executed after successful checkpointing
.
}
- Returns the local staging directory.
DEFINITION
int FTI_GetStageDir ( char* stageDir, int maxLen )
INPUT
Variable | What for? |
---|---|
int maxLen | The length of the string buffer stageDir |
OUTPUT
Value | Reason |
---|---|
char * stageDir | Path to the local staging directory |
DESCRIPTION
FTI_GetStageDir initializes the string stageDir with the path to the local stage directory. This is a directory in the local ckpt path, set by the user in the configuration file.
EXAMPLE
see example for FTI_SendFile
- Returns the status of the stage request.
DEFINITION
int FTI_GetStageStatus ( int ID )
INPUT
Variable | What for? |
---|---|
int ID | Request ID (returned from FTI_SendFile) |
OUTPUT
Value | Reason |
---|---|
int FTI_SI_PEND | Head is occupied and request is pending |
int FTI_SI_ACTV | Head is processing the request |
int FTI_SI_SCES | Request was successfully processed |
int FTI_SI_FAIL | Request failed |
int FTI_SI_NINI | Request does not exist or was already processed |
DESCRIPTION
FTI_GetStageStatus queries the status of the staging request. If the request succeeded or failed, the function returns the respective status and resets the ID so that it can be reassigned. That is, after the function returns FTI_SI_SCES or FTI_SI_FAIL, a subsequent call with the same ID returns FTI_SI_NINI.
EXAMPLE
see example for FTI_SendFile
- Triggers the asynchronous transfer of local file to the PFS.
DEFINITION
int FTI_SendFile( char* lpath, char *rpath )
OUTPUT
Value | Reason |
---|---|
int ID | On success, the request ID is returned. This ID may be used to query the status of the request with FTI_GetStageStatus |
FTI_NSCS | On failure |
DESCRIPTION
The user may store files locally on the nodes in a fast storage layer (e.g. NVMe) and send these files to the PFS asynchronously to the execution. The transfer is performed by the FTI head process, so the asynchronous transfer requires the head feature to be enabled. If the head feature is disabled, the files are sent by the calling process itself.
EXAMPLE
#include "fti.h"
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
int main() {
MPI_Init(NULL,NULL);
FTI_Init("config.fti", MPI_COMM_WORLD);
int rank;
MPI_Comm_rank( FTI_COMM_WORLD, &rank );
char local_dir[512];
char remote_dir[] = "./";
// get local stage directory
if ( FTI_GetStageDir( local_dir, 512 ) != FTI_SCES ) {
fprintf( stderr, "Failed to get the local directory.\n" );
exit( EXIT_FAILURE );
}
char filename[512];
snprintf( filename, 512, "testfile-%d", rank );
char local_fn[512];
char remote_fn[512];
snprintf( local_fn, 512, "%s/%s", local_dir, filename );
snprintf( remote_fn, 512, "%s/%s", remote_dir, filename );
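// create local dummy file (1MB)
FILE *fstream = fopen( local_fn, "wb+" );
if ( fstream == NULL ) {
fprintf( stderr, "Failed to create %s.\n", local_fn );
exit( EXIT_FAILURE );
}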
fsync(fileno(fstream));
fclose( fstream );
truncate( local_fn, 1024L*1024L );
int reqID;
// send local file to PFS
if ( (reqID = FTI_SendFile( local_fn, remote_fn )) == FTI_NSCS ) {
fprintf( stderr, "Failed to stage %s.\n", local_fn );
exit( EXIT_FAILURE );
}
// check status of staging request
int reqStatus = FTI_SI_NINI; // set status to not initialized (null)
while( 1 ) {
int request_final = 0;
reqStatus = FTI_GetStageStatus( reqID );
switch( reqStatus ) {
case FTI_SI_ACTV:
printf("Stage Status: ACTIVE\n");
break;
case FTI_SI_PEND:
printf("Stage Status: PENDING\n");
break;
case FTI_SI_SCES:
printf("Stage Status: SUCCESS\n");
request_final = 1;
break;
case FTI_SI_FAIL:
printf("Stage Status: FAILED\n");
request_final = -1;
break;
}
if ( request_final == -1) {
fprintf( stderr, "Staging request with ID: %d failed!\n", reqID );
break;
}
if ( request_final == 1) {
printf( "Staging request with ID: %d succeeded!\n", reqID );
break;
}
}
FTI_Finalize();
MPI_Finalize();
exit( EXIT_SUCCESS );
}
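The status loop in the example above spins until the request reaches a final state. A bounded variant can be sketched as follows. Everything here is hypothetical and independent of FTI: the `ST_*` values are placeholders (the real `FTI_SI_*` constants come from `fti.h` and their numeric values are not given on this page), `wait_for_stage` takes the status query as a callback, and `fake_query` is a scripted stand-in used only to exercise the loop.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder status values for illustration only. */
enum stage_status { ST_PEND, ST_ACTV, ST_SCES, ST_FAIL, ST_NINI };

/* Signature of a status query, e.g. a thin wrapper that maps the
 * result of FTI_GetStageStatus onto the placeholders above. */
typedef int (*status_fn)(int id);

/* Polls until the request reaches a final state or max_polls is used up.
 * Returns 1 on success, -1 on failure or unknown request, 0 on timeout. */
static int wait_for_stage(status_fn query, int id, int max_polls) {
    assert(query != NULL && max_polls > 0);
    for (int n = 0; n < max_polls; n++) {
        switch (query(id)) {
        case ST_SCES:
            return 1;
        case ST_FAIL:
        case ST_NINI:
            return -1;
        default:
            break; /* pending or active: poll again */
        }
    }
    return 0;
}

/* Scripted stand-in: reports pending, then active, then success. */
static int fake_calls = 0;
static int fake_query(int id) {
    static const int script[] = { ST_PEND, ST_ACTV, ST_SCES };
    int k = fake_calls < 2 ? fake_calls : 2;
    (void)id;
    fake_calls++;
    return script[k];
}
```

Treating the "not initialized" state as final avoids an endless loop when the ID is invalid or was already consumed by an earlier query.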
- Frees the allocated memory.
- Communicates the end of the execution to dedicated threads.
- Cleans checkpoints and metadata.
DEFINITION
int FTI_Finalize()
OUTPUT
Value | Reason |
---|---|
FTI_SCES | For application process |
exit(0) | For FTI process |
DESCRIPTION
FTI_Finalize notifies the FTI processes that the execution is over, frees FTI internal data structures, and cleans up the checkpoint folders after a normal execution. If the setting keep_last_ckpt is set, it flushes local checkpoint files (if present) to the PFS. If the setting head is set to 1, it will also terminate the FTI processes. It should be called before MPI_Finalize().
EXAMPLE
int main ( int argc , char ** argv ) {
.
.
.
FTI_Finalize () ;
MPI_Finalize () ;
return 0;
}