Tutorial
On this page we present a tutorial of FTI. The purpose of the practice section is to familiarize you with the FTI API and the configuration file, so it gives only limited guidance on how to proceed.
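The commands in this tutorial call git, cmake, make and mpirun. As a minimal sketch, a small helper can report whether such tools are on the PATH before you start. The function name check_tools is hypothetical and not part of FTI.

```shell
# Hypothetical helper (not part of FTI): report which tools are missing from PATH.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    # Return non-zero if at least one tool was not found.
    return "$missing"
}
```

For example, `check_tools git cmake make mpirun` prints one line per tool and returns a non-zero status if any of them is missing.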
- Create FTI directory
mkdir FTI
cd FTI
- Create Installation Directory
mkdir install-fti
- Set environment variable to the installation path
export FTI_INSTALL_DIR=$PWD/install-fti
- Download FTI.
git clone https://github.com/leobago/fti
- Change into base directory
cd fti
- Set environment variable to the FTI root
export FTI_ROOT=$PWD
- Create build directory and change into it
mkdir build
cd build
- Build FTI
cmake -DCMAKE_INSTALL_PREFIX:PATH=$FTI_INSTALL_DIR -DENABLE_TUTORIAL=1 ..
make
make install
Besides building FTI, the flag -DENABLE_TUTORIAL=1 also builds the tutorial files.
The library is installed at the
export TUTORIAL_EXEC=${FTI_ROOT}/build/tutorial/
export TUTORIAL_SRC=${FTI_ROOT}/tutorial/
You should export these variables every time you start or continue the tutorial. Under the ${TUTORIAL_SRC} directory you can find various directories, each corresponding to a step presented in the tutorial.
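Since the variables have to be exported again in every new shell, one option is to wrap the exports in a small function. This is only a sketch: the function name fti_tutorial_env is hypothetical, and it assumes the directory layout created by the steps above.

```shell
# Hypothetical helper (not part of FTI): re-export the tutorial variables.
# $1 is the path to the cloned fti repository; falls back to an existing FTI_ROOT.
fti_tutorial_env() {
    FTI_ROOT="${1:-${FTI_ROOT:-}}"
    if [ -z "$FTI_ROOT" ]; then
        echo "usage: fti_tutorial_env /path/to/fti" >&2
        return 1
    fi
    export FTI_ROOT
    export TUTORIAL_EXEC="${FTI_ROOT}/build/tutorial/"
    export TUTORIAL_SRC="${FTI_ROOT}/tutorial/"
}
```

You could place the function in a file, source that file in each new shell, and then call it with the path of your fti checkout.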
To demonstrate the various safety levels of FTI, we will execute an example which uses the API function ‘FTI_Snapshot()’. Run the example in each case for at least one minute and then interrupt the execution by pressing ‘ctrl+c’. On some systems 'ctrl+c' does not kill all executing MPI processes; to kill all of them, run killall 'executable'.
Change into folder ${TUTORIAL_EXEC}/L1 and run the execution with ‘make hdl1’. While the program is running, you may follow the events by observing the contents in the ‘local’ folder. In order to do that you can use the commands:
watch -n 1 "find local"
watch -n 1 "du -kh local"
or
cd local; watch -n 1 "ls -lR"
(It may be illuminating to open the files in the ‘${TUTORIAL_EXEC}/L1/meta’ folder, using a text editor. What kind of information do you think is kept in these files?)
After interrupting the execution, run ‘make hdl1’ again. The execution will (hopefully) resume from where the checkpoint was taken.
After the successful restart, interrupt the execution and delete one of the checkpoint files (you can also simply delete the whole node directory). The files are stored as: ${TUTORIAL_EXEC}/L1//local///l1/ckpt-Rank.fti. You will notice that in this case the program won’t be able to resume the execution.
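To locate and remove a single checkpoint file without typing the full path, something like the following sketch could be used. The function name delete_one_ckpt is hypothetical, and the pattern 'ckpt*.fti' is an assumption based on the file name shown above.

```shell
# Hypothetical helper (not part of FTI): delete the first checkpoint file found.
# Assumes checkpoint files match 'ckpt*.fti' somewhere below the given directory.
delete_one_ckpt() {
    ckpt_root="${1:-local}"
    victim=$(find "$ckpt_root" -type f -name 'ckpt*.fti' 2>/dev/null | sort | head -n 1)
    if [ -z "$victim" ]; then
        echo "No checkpoint file found under $ckpt_root" >&2
        return 1
    fi
    echo "Deleting $victim"
    rm -- "$victim"
}
```

Called without arguments from the example folder, it searches the ‘local’ folder. Calling it repeatedly removes one more file each time, which is also convenient for the deletion experiments at the other levels.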
Change into folder ${TUTORIAL_EXEC}/L2 and run the execution with ‘make hdl2’. While the program is running, you may follow the events by observing the contents in the ‘local’ folder.
After interrupting the execution, run ‘make hdl2’ again. The execution will (hopefully) resume from where the checkpoint was taken in this case as well.
After the successful restart, interrupt the execution and delete one of the checkpoint files. You will notice that now the program (hopefully) will be able to resume the execution. Try to delete more than one file.
- How many files can you delete?
- Which files can you delete?
L3 – local checkpoint on the nodes + copy to the neighbor node + RS encoding:
Change into folder ${TUTORIAL_EXEC}/L3 and run the execution with ‘make hdl3’. While the program is running, you may follow the events by observing the contents in the ‘local’ folder.
After interrupting the execution, run ‘make hdl3’ again. The execution will (surprisingly) resume from where the checkpoint was taken in this case as well.
After the successful restart, interrupt the execution and delete one of the checkpoint files. The program will be able to resume.
- How many files can you delete?
- Which files can you delete?
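When exploring these questions it can help to see how many files each directory holds before and after a deletion. The following is a minimal sketch. The function name count_files_per_dir is hypothetical, and it makes no assumption about how FTI names its files.

```shell
# Hypothetical helper (not part of FTI): count regular files per directory.
count_files_per_dir() {
    root="${1:-local}"
    # Strip the file name from each path, then count identical directory names.
    find "$root" -type f | sed 's|/[^/]*$||' | sort | uniq -c
}
```

Each output line shows a file count followed by the directory it belongs to.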
Change into folder ${TUTORIAL_EXEC}/L4 and run the execution with ‘make hdl4’. While the program is running, you may follow the events by observing the contents in the ‘global’ folder. After interrupting the execution, run ‘make hdl4’ again. The execution will resume from where the checkpoint was taken.
Change into folder ${TUTORIAL_EXEC}/DCP/ and run the execution with ‘make hdDCP’. While the program is running you may follow the “blue” messages in the terminal. What is actually happening? After a couple of checkpoints, you can kill the application and restart it.
Delete all files under ./local, ./global/ and ./meta/, and open the file config.DCP.fti with your favorite text editor. Change the following parameters:
- ckpt_io = 3 to ckpt_io = 1
- failure = “x” to failure = 0
The first option changes the file format and the second option indicates that we will do a fresh run (not a recovery). Run the execution with ‘make hdDCP’. Do you observe any difference in the timings of the checkpoints?
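If you prefer the command line over a text editor, the two parameter changes can be sketched with sed. The function name set_fti_option is hypothetical, and the sketch assumes each option sits on its own line in the form ‘key = value’.

```shell
# Hypothetical helper (not part of FTI): set KEY to VALUE in a 'key = value' file.
# usage: set_fti_option FILE KEY VALUE
set_fti_option() {
    file="$1"; key="$2"; value="$3"
    tmp=$(mktemp)
    # Keep everything up to and including '=', replace the rest of the line.
    sed "s|^\([[:space:]]*${key}[[:space:]]*=\).*|\1 ${value}|" "$file" > "$tmp"
    # Write back in place so the original file permissions are preserved.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}
```

With this helper, `set_fti_option config.DCP.fti ckpt_io 1` followed by `set_fti_option config.DCP.fti failure 0` would apply both changes.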
- In the ‘${TUTORIAL_SRC}/practice’ folder you will find the source code of the program we used to demonstrate the FTI features, in this case without FTI implemented. Try to implement FTI. You can use either the ‘FTI_Snapshot’ or the ‘FTI_Checkpoint’ function to make FTI take a checkpoint. To build your code changes, run:
cd $FTI_ROOT/build
make
To execute your implementation, change directory to ${TUTORIAL_EXEC}/practice and execute the binary hdp.exe.
Besides implementing the source code, you also need to create an appropriate configuration file. Information about the options in the configuration file can be found here and example configuration files can be found here.
cd $TUTORIAL_EXEC/practice
make
mpirun -n 4 ./hdp.exe GRID_SIZE
GRID_SIZE is an integer defining the size of the grid to be solved, in MB.
- Change into the folder ‘${TUTORIAL_EXEC}/tutorial/experiment’ and play with the settings of the configuration file. To run the program, type: ‘mpirun -n 8 hdex.exe config.fti’. Perform executions with ‘Head=0’ and ‘Head=1’. Do you notice any difference in the execution duration? (Note: You may use frequent L3 checkpointing and a grid size of 256 or higher. In that case you will most likely see a difference.) (Remark: denotes the dynamic memory of each MPI process in MB)
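To compare the execution duration of the two runs, a small wrapper that reports wall-clock time is enough. This is only a sketch: the function name time_run is hypothetical, and it relies on `date +%s`, which is widely available but gives only one-second resolution.

```shell
# Hypothetical helper (not part of FTI): run a command and print its wall-clock time.
time_run() {
    start=$(date +%s)
    status=0
    "$@" || status=$?
    end=$(date +%s)
    echo "elapsed: $((end - start)) s (exit status $status)"
    return "$status"
}
```

For example, prefix the run command from above with time_run once for each setting of the head option and compare the two reported times.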