
API Reference

Kai Keller edited this page Sep 17, 2018 · 21 revisions

FTI Datatypes
FTI Constants
FTI_Init
FTI_InitType
FTI_Protect
FTI_GetStoredSize
FTI_Realloc
FTI_Checkpoint
FTI_Status
FTI_Recover
FTI_Snapshot
FTI_GetStageDir
FTI_GetStageStatus
FTI_SendFile
FTI_Finalize

FTI Datatypes and Constants

FTI Datatypes

⬆️ Top

FTI_CHAR : FTI data type for chars.
FTI_SHRT : FTI data type for short integers.
FTI_INTG : FTI data type for integers.
FTI_LONG : FTI data type for long integers.
FTI_UCHR : FTI data type for unsigned chars.
FTI_USHT : FTI data type for unsigned short integers.
FTI_UINT : FTI data type for unsigned integers.
FTI_ULNG : FTI data type for unsigned long integers.
FTI_SFLT : FTI data type for single floating point.
FTI_DBLE : FTI data type for double floating point.
FTI_LDBE : FTI data type for long double floating point.

FTI Constants

⬆️ Top

FTI_BUFS : 256
FTI_DONE : 1
FTI_SCES : 0
FTI_NSCS : -1
FTI_NREC : -2


FTI_Init

⬆️ Top

  • Reads configuration file.
  • Creates checkpoint directories.
  • Detects topology of the system.
  • Regenerates data upon recovery.

DEFINITION

int FTI_Init ( char * configFile , MPI_Comm globalComm )

INPUT

Variable What for?
char * configFile Path to the config file
MPI_Comm globalComm MPI communicator used for the execution

OUTPUT

Value Reason
FTI_SCES Success
FTI_NSCS No Success
FTI_NREC FTI could not recover ckpt files

DESCRIPTION

FTI_Init initializes the FTI context. It must be called before any other FTI function and after MPI_Init.

EXAMPLE

int main ( int argc , char **argv ) {
    MPI_Init (&argc , &argv );
    char *path = "config.fti"; // config file path
    int res = FTI_Init ( path , MPI_COMM_WORLD );
    if (res == FTI_NREC) {
        printf("Recovery not possible, terminating...");
        FTI_Finalize();
        MPI_Finalize();
        return 1;
    }
.
.
.
    return 0;
}

FTI_InitType

⬆️ Top

  • Initializes a data type.

DEFINITION

int FTI_InitType ( FTIT_type *type , int size )

INPUT

Variable What for?
FTIT_type * type The data-type to be initialized
int size The size of the data-type to be initialized

OUTPUT

Value Reason
FTI_SCES Success

DESCRIPTION

FTI_InitType initializes an FTI data type. A data type that FTI does not define by default (see: FTI Datatypes) must be defined using this function before variables of that type can be protected with FTI_Protect.

EXAMPLE

typedef struct A {
    int a;
    int b;
} A;
FTIT_type structAinfo ;
// sizeof(struct) is safest because of padding
// in more complex structs
FTI_InitType (&structAinfo , sizeof(A));
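The padding remark in the example can be checked with plain C. The sketch below uses a hypothetical struct `Mixed` and two hypothetical helper functions, none of which are part of FTI. On typical platforms the compiler reserves more bytes for the struct than the sum of its members, which is why `sizeof` on the whole struct is the value to pass as `size`.

```c
#include <stddef.h>

/* Hypothetical struct for illustration only; not part of FTI. */
typedef struct Mixed {
    char   tag;    /* 1 byte, usually followed by padding */
    double value;  /* typically aligned to a multiple of its size */
} Mixed;

/* Sum of the member sizes, ignoring any padding. */
size_t mixed_members_size(void) {
    return sizeof(char) + sizeof(double);
}

/* Size the compiler actually reserves, including padding.
 * This is the value one would pass as `size` to FTI_InitType. */
size_t mixed_struct_size(void) {
    return sizeof(Mixed);
}
```

Adding the member sizes by hand can under-report the storage needed, so a checkpoint sized that way could truncate the last member.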

FTI_Protect

⬆️ Top

  • Stores metadata concerning the variable to protect.

DEFINITION

int FTI_Protect ( int id, void *ptr, long count, FTIT_type type )

INPUT

Variable What for?
int id Unique ID of the variable to protect
void * ptr Pointer to memory address of variable
long count Number of elements at memory address
FTIT_type type FTI data type of variable to protect

OUTPUT

Value Reason
FTI_SCES Success
FTI_NSCS No success

DESCRIPTION

FTI_Protect adds data fields to the list of protected variables. Data protected by this function is stored during a call to FTI_Checkpoint or FTI_Snapshot and restored during a call to FTI_Recover.

If the dimension of a protected variable changes during execution, a subsequent call to FTI_Protect with the same ID updates the metadata within FTI, so that the correct size is stored by the next call to FTI_Checkpoint or FTI_Snapshot.

EXAMPLE

int A;
float *B = malloc (sizeof(float) * 10) ;
FTI_Protect(1, &A, 1, FTI_INTG );
FTI_Protect(2, B, 10, FTI_SFLT );
// changing B size
B = realloc(B, sizeof(float) * 20) ;
// updating B size in protected list
FTI_Protect(2, B, 20, FTI_SFLT);
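The example above assigns the result of `realloc` directly to `B`, which loses the original block if the reallocation fails. A minimal sketch of a safer resize step follows; `grow_floats` is a hypothetical helper, not an FTI function, and it only covers the memory handling, not the FTI call.

```c
#include <stdlib.h>

/* Hypothetical helper: resize *buf to hold new_count floats.
 * Returns 0 on success. On failure it returns -1 and leaves *buf
 * pointing at the original, still valid block. */
int grow_floats(float **buf, size_t new_count) {
    float *tmp = realloc(*buf, new_count * sizeof(float));
    if (tmp == NULL) {
        return -1;  /* the old block is still allocated */
    }
    *buf = tmp;     /* the address may have changed */
    return 0;
}
```

After a successful `grow_floats(&B, 20)`, the call `FTI_Protect(2, B, 20, FTI_SFLT)` from the example would then register the possibly new address and the new element count.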

FTI_GetStoredSize

⬆️ Top

  • Returns size of protected variable saved in metadata

DEFINITION

long FTI_GetStoredSize ( int id )

INPUT

Variable What for?
int id ID of the protected variable

OUTPUT

Value Reason
long Size of a variable
0 No success

DESCRIPTION

FTI_GetStoredSize returns the size of a protected variable with id from the FTI metadata. The result may differ from the size of the variable known to the application at that moment. If the function is called on a restart, it returns the size stored in the metadata file. Called during the execution, it returns the value stored in the FTI runtime metadata, i.e. the size of the variable at the moment of the last checkpoint.

The function is needed to manually reallocate memory for protected variables with variable size on a recovery. Another possibility for the reallocation of memory is provided by FTI_Realloc.

EXAMPLE

...
long* array = calloc(arraySize, sizeof(long));
FTI_Protect(1, array, arraySize, FTI_LONG);
if (FTI_Status() != 0) {
    long arraySizeInBytes = FTI_GetStoredSize(1);
    if (arraySizeInBytes == 0) {
        printf("No stored size in metadata!\n");
        return GETSTOREDSIZE_FAILED;
    }
    array = realloc(array, arraySizeInBytes);
    //update arraySize
    arraySize = arraySizeInBytes / sizeof(long);
    //realloc may move the block: register the new address and size
    FTI_Protect(1, array, arraySize, FTI_LONG);
    int res = FTI_Recover();
    if (res != 0) {
        printf("Recovery failed!\n");
        return RECOVERY_FAILED;
    }
}
for (i = 0; i < max; i++) {
    //checkpoint every CKTP_STEP iterations
    if (i % CKTP_STEP == 0) {
        //update FTI array size information
        FTI_Protect(1, array, arraySize, FTI_LONG);
        int res = FTI_Checkpoint((i / CKTP_STEP) + 1, 1);
        if (res != FTI_DONE) {
            printf("Checkpoint failed!\n");
            return CHECKPOINT_FAILED;
        }
    }
    ...
    //add element to array
    arraySize += 1;
    array = realloc(array, arraySize * sizeof(long));
}
...
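The example treats the value returned by FTI_GetStoredSize as a size in bytes and divides it by `sizeof(long)` to obtain the element count. A minimal sketch of that conversion with basic sanity checks is shown below; `elements_from_stored_size` is a hypothetical helper, not part of FTI.

```c
#include <stddef.h>

/* Hypothetical helper: convert a stored size in bytes into an element
 * count. Returns -1 if the size is 0 (the documented failure value),
 * negative, or not a multiple of the element size. */
long elements_from_stored_size(long stored_bytes, size_t elem_size) {
    if (stored_bytes <= 0 || elem_size == 0) return -1;
    if ((size_t)stored_bytes % elem_size != 0) return -1;
    return (long)((size_t)stored_bytes / elem_size);
}
```

A result of -1 indicates that the stored size should not be used to reallocate the buffer, for example because the variable was protected with a different type in the previous run.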

FTI_Realloc

⬆️ Top

  • Reallocates dataset to last checkpoint size.

DEFINITION

void* FTI_Realloc ( int id, void* ptr )

INPUT

Variable What for?
int id ID of the protected variable
void * ptr Pointer to memory address of variable

OUTPUT

Value Reason
void* Pointer to reallocated data
NULL On failure

DESCRIPTION

FTI_Realloc is called on recovery for protected variables with dynamic size. It reallocates enough memory at the given address to hold the checkpoint data. It must be called before FTI_Recover to prevent segmentation faults. If the application must, or prefers to, perform the reallocation itself, FTI provides FTI_GetStoredSize to query the size of the variable in the checkpoint to be recovered.

EXAMPLE

...
FTI_Protect(1, &arraySize, 1, FTI_INTG);
long* array = calloc(arraySize, sizeof(long));
FTI_Protect(2, array, arraySize, FTI_LONG);
if (FTI_Status() != 0) {
    array = FTI_Realloc(2, array);
    if (array == NULL) {
            printf("Reallocation failed!\n");
            return REALLOC_FAILED;
    }

    int res = FTI_Recover();
    if (res != 0) {
        printf("Recovery failed!\n");
        return RECOVERY_FAILED;
    }
}
for (i = 0; i < max; i++) {
    //checkpoint every CKTP_STEP iterations
    if (i % CKTP_STEP == 0) {
        //update FTI array size information
        FTI_Protect(2, array, arraySize, FTI_LONG);
        int res = FTI_Checkpoint((i / CKTP_STEP) + 1, 1);
        if (res != FTI_DONE) {
            printf("Checkpoint failed!\n");
            return CHECKPOINT_FAILED;
        }
    }
    ...
    //add element to array
    arraySize += 1;
    array = realloc(array, arraySize * sizeof(long));
}
...

FTI_Checkpoint

⬆️ Top

  • Stores protected variables in the checkpoint of a desired safety level.

DEFINITION

int FTI_Checkpoint( int id, int level )

INPUT

Variable What for?
int id Unique checkpoint ID
int level Checkpoint level (1=L1, 2=L2, 3=L3, 4=L4)

OUTPUT

Value Reason
FTI_DONE Success
FTI_NSCS Failure

DESCRIPTION

FTI_Checkpoint stores the current values of the protected variables in a checkpoint of safety level level (see Multilevel-Checkpointing for descriptions of the particular levels).

NOTICE: The checkpoint id must be different from 0!

EXAMPLE

int i;
for (i = 0; i < 100; i ++) {
    if (i % 10 == 0) {
        FTI_Checkpoint ( i /10 + 1, 1) ;
    }
.
. // some computations
.
}
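The ID expression `i / 10 + 1` in the example starts at 1, which respects the rule that the checkpoint ID must not be 0. The same schedule can be sketched as a small function; `checkpoint_id_for` is a hypothetical helper, not an FTI function, and it uses 0 as its own "no checkpoint due" marker precisely because 0 is never a valid ID.

```c
/* Hypothetical helper: returns the checkpoint ID to use at iteration i,
 * or 0 if no checkpoint is due. IDs start at 1, so a due checkpoint
 * never receives the forbidden ID 0. */
int checkpoint_id_for(int i, int step) {
    if (step <= 0 || i < 0) return 0;  /* invalid input: never checkpoint */
    if (i % step != 0) return 0;       /* not on a checkpoint boundary */
    return i / step + 1;
}
```

In a loop, a nonzero return value would be passed as the first argument to FTI_Checkpoint, and a return value of 0 would skip the call.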

FTI_Status

⬆️ Top

  • Returns the current status of the recovery flag.

DEFINITION

int FTI_Status()

OUTPUT

Value Reason
0 No checkpoints taken yet or recovered successfully
1 At least one checkpoint is taken. If execution fails, the next start will be a restart
2 The execution is a restart from checkpoint level L4 and keep_last_checkpoint was enabled during the last execution

DESCRIPTION

FTI_Status returns the current status of the recovery flag.

EXAMPLE

if ( FTI_Status () != 0) {
    .
    . // this section will be executed during restart
    .
}

FTI_Recover

⬆️ Top

  • Recovers the data of the protected variables from the checkpoint file.

DEFINITION

int FTI_Recover()

OUTPUT

Value Reason
FTI_SCES Success
FTI_NSCS Failure

DESCRIPTION

FTI_Recover loads the data from the checkpoint file into the protected variables. It only recovers variables that are protected by a preceding call to FTI_Protect. If a variable changes its size during execution, the proper amount of memory has to be allocated for that variable before the call to FTI_Recover. FTI provides the API functions FTI_GetStoredSize and FTI_Realloc for this case.

EXAMPLE

Basic example:

if ( FTI_Status() != 0 ) { // 1 or 2: the execution is a restart
    FTI_Recover() ;
}

FTI_Snapshot

⬆️ Top

  • Invokes the recovery of protected variables on a restart.
  • Writes multilevel checkpoints regarding their requested frequencies during execution.

DEFINITION

int FTI_Snapshot()

OUTPUT

Value Reason
FTI_SCES Successful call (without checkpointing) or successful recovery
FTI_NSCS Failure of FTI_Checkpoint
FTI_DONE Success of FTI_Checkpoint
FTI_NREC Failure on recovery

DESCRIPTION

On a restart, FTI_Snapshot loads the data from the checkpoint file to the protected variables. During execution it performs checkpoints according to the checkpoint frequencies for the various safety levels. The frequencies may be set in the configuration file (see e.g.: ckpt_L1).

FTI_Snapshot can only take care of variables which are protected by a preceding call to FTI_Protect.

EXAMPLE

int res = FTI_Snapshot();
if ( res == FTI_SCES ) {
    .
    . // executed after successful recover
    . // or when checkpoint is not required
}
else if ( res == FTI_DONE ) {
    .
    . // executed after successful checkpointing
    .
}
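FTI_Snapshot has four documented return values, so an application may want to handle the two failure codes as well. The sketch below maps each code to an action. The `DOC_` constants are local copies of the values in the FTI Constants table (a real program would use the names from fti.h), and the `snapshot_action` policy is a hypothetical application-level choice, not something FTI prescribes.

```c
/* Values copied from the FTI Constants table. In a real program they
 * come from fti.h, so these local copies carry a different prefix. */
enum { DOC_FTI_SCES = 0, DOC_FTI_DONE = 1, DOC_FTI_NSCS = -1, DOC_FTI_NREC = -2 };

/* Hypothetical application-level policy; not part of FTI. */
typedef enum {
    ACT_CONTINUE,      /* recovered, or no checkpoint was due */
    ACT_CHECKPOINTED,  /* a checkpoint was written */
    ACT_WARN,          /* checkpoint failed: log it and keep computing */
    ACT_ABORT          /* recovery failed or unknown code: stop */
} snapshot_action;

snapshot_action classify_snapshot(int res) {
    switch (res) {
        case DOC_FTI_SCES: return ACT_CONTINUE;
        case DOC_FTI_DONE: return ACT_CHECKPOINTED;
        case DOC_FTI_NSCS: return ACT_WARN;
        case DOC_FTI_NREC: return ACT_ABORT;
        default:           return ACT_ABORT;
    }
}
```

Treating a failed checkpoint as a warning rather than an error is one possible choice: the computation itself is intact, and a later snapshot may still succeed. A failed recovery is different, because the protected variables may not hold consistent data.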

FTI_SendFile

⬆️ Top

  • Triggers the asynchronous transfer of a local file to the PFS.

DEFINITION

int FTI_SendFile( char* lpath, char *rpath )

OUTPUT

Value Reason
int ID On success, the request ID is returned. This ID may be used to query the status of the request with FTI_GetStageStatus
FTI_NSCS On failure

DESCRIPTION

The user may store files locally on the nodes in a fast storage layer (e.g. NVMe) and send these files to the PFS asynchronously to the execution. The transfer is performed by the FTI head process, so asynchronous transfer requires the head feature to be enabled. If the head feature is disabled, the files are sent by the calling process itself.

EXAMPLE

#include "fti.h"
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>


int main() {

    MPI_Init(NULL,NULL);
    FTI_Init("config.fti", MPI_COMM_WORLD);

    int rank;
    MPI_Comm_rank( FTI_COMM_WORLD, &rank );

    char local_dir[512];
    char remote_dir[] = "./";

    // get local stage directory
    if ( FTI_GetStageDir( local_dir, 512 ) != FTI_SCES ) {
        fprintf( stderr, "Failed to get the local directory.\n" );
        exit( EXIT_FAILURE );
    }


    char filename[512];
    snprintf( filename, 512, "testfile-%d", rank ); 

    char local_fn[512];
    char remote_fn[512];

    snprintf( local_fn, 512, "%s/%s", local_dir, filename );
    snprintf( remote_fn, 512, "%s/%s", remote_dir, filename );

    // create local dummy file (1MB)
    FILE *fstream = fopen( local_fn, "wb+" );
    fsync(fileno(fstream));
    fclose( fstream );
    truncate( local_fn, 1024L*1024L );
    
    int reqID; 
    // send local file to PFS
    if ( (reqID = FTI_SendFile( local_fn, remote_fn )) == FTI_NSCS ) {
        fprintf( stderr, "Failed to stage %s.\n", local_fn );
        exit( EXIT_FAILURE );
    }
 
    // check status of staging request
    int reqStatus = FTI_SI_NINI; // set status to not initialized (null) 
    while( 1 ) {
        int request_final = 0;
        reqStatus = FTI_GetStageStatus( reqID );
        switch( reqStatus ) {
            case FTI_SI_ACTV:
                printf("Stage Status: ACTIVE\n");
                break;
            case FTI_SI_PEND:
                printf("Stage Status: PENDING\n");
                break;
            case FTI_SI_SCES:
                printf("Stage Status: SUCCESS\n");
                request_final = 1;
                break;
            case FTI_SI_FAIL:
                printf("Stage Status: FAILED\n");
                request_final = -1;
                break;
        }
        if ( request_final == -1) {
            fprintf( stderr, "Staging request with ID: %d failed!\n", reqID );
            break;
        }
        if ( request_final == 1) {
            printf( "Staging request with ID: %d succeeded!\n", reqID );
            break;
        }
    }

    FTI_Finalize();
    MPI_Finalize();

    exit( EXIT_SUCCESS );

}
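The example builds the local and remote file names with `snprintf` into fixed 512-byte buffers but does not check whether the result was truncated. A minimal sketch of a checked variant is shown below; `join_path` is a hypothetical helper, not part of FTI.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: writes "dir/name" into out.
 * Returns 0 on success, -1 on an encoding error or if the result
 * would not fit in cap bytes (including the terminating '\0'). */
int join_path(char *out, size_t cap, const char *dir, const char *name) {
    int n = snprintf(out, cap, "%s/%s", dir, name);
    if (n < 0 || (size_t)n >= cap) return -1;
    return 0;
}
```

Checking for truncation before calling FTI_SendFile avoids staging a file under a shortened, unintended name when the stage directory path is long.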

FTI_Finalize

⬆️ Top

  • Frees the allocated memory.
  • Communicates the end of the execution to dedicated threads.
  • Cleans checkpoints and metadata.

DEFINITION

int FTI_Finalize()

OUTPUT

Value Reason
FTI_SCES For application process
exit(0) For FTI process

DESCRIPTION

FTI_Finalize notifies the FTI processes that the execution is over, frees FTI internal data structures, and cleans up the checkpoint folders after a normal execution. If the setting keep_last_ckpt is set, it flushes local checkpoint files (if present) to the PFS. If the setting head is set to 1, it also terminates the FTI processes. It should be called before MPI_Finalize().

EXAMPLE

int main ( int argc , char ** argv ) {
    .
    .
    .
    FTI_Finalize () ;
    MPI_Finalize () ;
    return 0;
}