Skip to content

Using TAU to Profile and or Trace ADIOS

Kevin Huck edited this page Sep 18, 2019 · 17 revisions

Instructions tested with ADIOS2 v2.4.0 release on September 5, 2019

Quick Links

TAU installation

ADIOS2 performance can be measured using the tau_exec program which is included in TAU (https://github.com/UO-OACISS/tau2). TAU configuration should auto-detect the architecture that it is building on, and with systems that require cross-compilation (Cray) it will build the measurement library for the compute nodes and the front-end tools for the login nodes. To configure/install TAU, do one of the following (for more details, see the main Installing TAU page):

...with Spack

With a recent Spack (http://github.com:spack/spack) installation, do:

spack install tau@develop

By default, that will build TAU with binutils, libdwarf, libelf, libunwind (all used for sampling, call site resolution and compiler-based instrumentation for address resolution), otf2 (for tracing support), papi (hardware counter support), PDT (auto-instrumentation support), pthread (for threading support). That is, the TAU configuration in Spack will look something like: tau@develop%[email protected]~bgq+binutils~comm~craycnl~cuda~gasnet+libdwarf+libelf+libunwind~likwid~mpi~ompt~opari~openmp+otf2+papi+pdt~phase~ppc64le+pthreads~python~scorep~shmem~x86_64. If your only interest is in using TAU to measure ADIOS2, you can save time/space by installing TAU without PDT:

spack install tau@develop~pdt

Note: As of September 5, 2019, the TAU spack package was missing support for the POSIX I/O wrapper. That's been requested as a pull request, but if not yet merged and you want POSIX I/O wrapper support, use this installation package: https://github.com/khuck/spack/blob/develop/var/spack/repos/builtin/packages/tau/package.py

...or Manually

TAU will download and build many dependencies, with the exception of PAPI - usually pre-installed on systems. TAU assumes the GCC compilers by default. For ABI compatibility, TAU requires you build with the same compiler and MPI configuration used with ADIOS2 and the application(s). By default TAU will install in-place, in tau2/$arch.

$ git clone https://github.com/UO-OACISS/tau2.git
$ cd tau2
$ ./configure -bfd=download -dwarf=download -elf=download -unwind=download -otf=download -papi=/path/to/papi -pthread -mpi -iowrapper -prefix=/path/to/installation
$ make -j install

When using different compilers (i.e. PGI), you specify them on the configuration line like so (for historical reasons, the fortran compiler is specified by vendor, not name...in most cases):

$ ./configure -cc=pgcc -c++=pgc++ -fortran=pgi ...

Running with TAU

Using TAU at runtime just requires the tau_exec wrapper script at runtime. First, make sure that TAU is in your path. If you installed with Spack and you have used either spack bootstrap or installed modules with Spack, you can use spack load tau. For all others:

# bash:
$ export PATH=$PATH:/home/users/khuck/src/tau2/x86_64/bin
# csh:
> setenv PATH $PATH:/home/users/khuck/src/tau2/x86_64/bin

The primary argument passed to tau_exec is the TAU configuration (because many different TAU configurations/builds can live in the same installation directory). To see which configurations are available, use the tau-config program. In the following example, there is only one configuration.

$ tau-config --list-matching
shared-papi-mpi-pthread

By replacing the dashes (-) with commas (,), add tau_exec to the mpirun/srun/jsrun command. For example:

$ mpirun -n 12 ./heatSimulation sim.bp  4 3  5 10 10 10 :  \
         -n 2 ./heatAnalysis sim.bp analysis.bp 2 1

becomes (to measure the heatSimulation application):

$ mpirun -n 12 tau_exec -T shared,papi,mpi,pthread ./heatSimulation sim.bp  4 3  5 10 10 10 :  \
         -n 2 ./heatAnalysis sim.bp analysis.bp 2 1

Other useful arguments to tau_exec:

  • -ebs will enable sampling
  • -memory will enable memory usage tracking (malloc/free)
  • -io will enable the posix IO wrapper
  • -cupti will enable CUDA/CUPTI measurement (TAU needs to be configured with -cuda=${OLCF_CUDA_ROOT})
  • -opencl will enable the OpenCL wrapper (TAU needs to be configured with -opencl=/path/to/opencl/headers)
  • -papi_components will enable a plugin to read OS/HW metrics periodically (see tau2/examples/papi_components or https://github.com/UO-OACISS/tau2/wiki/Using-the-PAPI-Components-Plugin for more information)

Working with measurement data

After execution, you should have a profile.* file for each thread of execution and each MPI rank. (There are TAU options for only generating one file, but for now we will work with the profile.* files.) To get a quick text dump of the measurement, use the pprof program in the TAU path:

$ pprof
Reading Profile files in profile.*

NODE 0;CONTEXT 0;THREAD 0:
---------------------------------------------------------------------------------------
%Time    Exclusive    Inclusive       #Call      #Subrs  Inclusive Name
              msec   total msec                          usec/call 
---------------------------------------------------------------------------------------
100.0            4          630           1         412     630275 .TAU application
 64.3          405          405           1           2     405193 MPI_Init() 
 16.3           46          102           1           2     102709 MPI_Finalize() 
 14.9           44           93           1           3      93820 BP4Writer::Close
  9.0           56           56           2           0      28320 pthread_join
  7.9           33           49           1           5      49630 BP4Writer::WriteProfilingJSONFile
  2.8        0.189           17           1           4      17398 IO::Open
  2.7           17           17           1           5      17103 BP4Writer::Open
  2.6           16           16          31           0        532 MPI_Gather() 
  0.5         0.48            3          10          26        319 BP4Writer::EndStep
  0.4        0.285            2          10          20        229 BP4Writer::Flush
  0.3            1            1          11         280        175 BP4Writer::WriteCollectiveMetadataFile
  0.2            1            1         182           0          7 MPI_Recv() 
  0.1        0.767        0.767         182           0          4 MPI_Send() 
  0.1        0.696        0.696           1           0        696 MPI_Comm_split() 
  0.1        0.381        0.409          10          20         41 BP4Writer::PerformPuts
  0.0        0.271        0.271          10           0         27 MPI_Reduce() 
  0.0        0.198        0.198          31           0          6 MPI_Gatherv() 
  0.0        0.163        0.163           2           0         82 MPI_Comm_dup() 
  0.0        0.127        0.127          11           0         12 BP4Writer::WriteData
  0.0        0.112        0.112           2           0         56 pthread_create
  0.0        0.104        0.104           2           0         52 MPI_Bcast() 
  0.0        0.087        0.087           1           0         87 IO::DefineVariable
  0.0        0.087        0.087         182           0          0 MPI_Comm_rank() 
  0.0         0.07         0.07          15           0          5 IO::other
  0.0        0.062        0.062           2           0         31 MPI_Allreduce() 
  0.0        0.051        0.051           2           0         26 MPI_Comm_free() 
  0.0        0.043        0.049           3           2         16 IO::DefineAttribute
  0.0        0.033        0.033           2           0         16 MPI_Barrier() 
  0.0        0.019        0.019          10           0          2 BP4Writer::BeginStep
  0.0        0.015        0.015          34           0          0 MPI_Comm_size() 
  0.0        0.014        0.014          10           0          1 BP4Writer::PopulateMetadataIndexFileContent
  0.0        0.013        0.013          10           0          1 IO::InquireVariable
  0.0        0.009        0.009           6           0          2 IO::InquireAttribute
  0.0        0.004        0.004           1           0          4 MPI_Finalized()
---------------------------------------------------------------------------------------

USER EVENTS Profile :NODE 0, CONTEXT 0, THREAD 0
---------------------------------------------------------------------------------------
NumSamples   MaxValue   MinValue  MeanValue  Std. Dev.  Event Name
---------------------------------------------------------------------------------------
         2          8          8          8          0  Message size for all-reduce
         2       1816          8        912        904  Message size for broadcast
        62       8676          0      530.9       1250  Message size for gather
        10          8          8          8          0  Message size for reduce
---------------------------------------------------------------------------------------

NODE 0;CONTEXT 0;THREAD 1:
---------------------------------------------------------------------------------------
%Time    Exclusive    Inclusive       #Call      #Subrs  Inclusive Name
              msec   total msec                          usec/call 
---------------------------------------------------------------------------------------
100.0            6          565           1           1     565356 .TAU application
 98.8          558          558           1           0     558697 progress_engine 

NODE 0;CONTEXT 0;THREAD 2:
---------------------------------------------------------------------------------------
%Time    Exclusive    Inclusive       #Call      #Subrs  Inclusive Name
              msec   total msec                          usec/call 
---------------------------------------------------------------------------------------
100.0        0.693          531           1           1     531646 .TAU application
 99.9          530          530           1           0     530953 addr=<7f6cebb6de60>

(more ranks follow)

In this case (BP4), the main thread did all the work, and the two additional threads were spawned by OpenMPI as progress threads.

Adding the -ebs flag, we get:

$ pprof
Reading Profile files in profile.*

NODE 0;CONTEXT 0;THREAD 0:
---------------------------------------------------------------------------------------
%Time    Exclusive    Inclusive       #Call      #Subrs  Inclusive Name
              msec   total msec                          usec/call 
---------------------------------------------------------------------------------------
100.0            5          317           1         412     317863 .TAU application
 43.8          139          139           1           2     139270 MPI_Init() 
 33.2          105          105           2           0      52700 [SAMPLE] __close_nocancel 
 28.8           67           91           1           3      91469 BP4Writer::Close
 25.4            0           80           1           0      80841 BP4Writer::Close => [CONTEXT] BP4Writer::Close
 25.4           80           80           1           0      80841 BP4Writer::Close => [CONTEXT] BP4Writer::Close => [SAMPLE] __close_nocancel 
 25.4            0           80           1           0      80841 [CONTEXT] BP4Writer::Close
 21.7           61           68           1           2      68870 MPI_Finalize() 
  7.7            0           24           1           0      24558 BP4Writer::WriteProfilingJSONFile => [CONTEXT] BP4Writer::WriteProfilingJSONFile
  7.7           24           24           1           0      24558 BP4Writer::WriteProfilingJSONFile => [CONTEXT] BP4Writer::WriteProfilingJSONFile => [SAMPLE] __close_nocancel 
  7.7            0           24           1           0      24558 [CONTEXT] BP4Writer::WriteProfilingJSONFile
  7.6           23           24           1           5      24017 BP4Writer::WriteProfilingJSONFile
  2.4            7            7           2           0       3788 pthread_join
  1.8        0.197            5           1           4       5651 IO::Open
  1.6            4            4           1           5       4941 BP4Writer::Open
  1.3        0.836            4          10          26        414 BP4Writer::EndStep
  0.9        0.286            2          10          20        296 BP4Writer::Flush
  0.8        0.984            2          11         280        239 BP4Writer::WriteCollectiveMetadataFile
  0.4            1            1         182           0          7 MPI_Recv() 
  0.4            1            1          10           0        112 MPI_Reduce() 
  0.3        0.864        0.864           1           0        864 MPI_Comm_split() 
  0.2        0.681        0.681           2           0        340 MPI_Comm_dup() 
  0.2        0.523        0.523         182           0          3 MPI_Send() 
  0.1        0.319        0.339          10          20         34 BP4Writer::PerformPuts
  0.1        0.288        0.288          31           0          9 MPI_Gather() 
  0.1        0.191        0.191          31           0          6 MPI_Gatherv() 
  0.1        0.189        0.189           2           0         94 MPI_Barrier() 
  0.0        0.117        0.117           2           0         58 pthread_create
  0.0        0.083        0.083          11           0          8 BP4Writer::WriteData
  0.0         0.08         0.08           2           0         40 MPI_Bcast() 
  0.0        0.063        0.063           1           0         63 IO::DefineVariable
  0.0        0.062        0.062         182           0          0 MPI_Comm_rank() 
  0.0        0.048        0.048           2           0         24 MPI_Allreduce() 
  0.0        0.046        0.046           2           0         23 MPI_Comm_free() 
  0.0        0.035        0.035          15           0          2 IO::other
  0.0        0.029        0.033           3           2         11 IO::DefineAttribute
  0.0        0.016        0.016          10           0          2 BP4Writer::BeginStep
  0.0        0.011        0.011          34           0          0 MPI_Comm_size() 
  0.0         0.01         0.01          10           0          1 IO::InquireVariable
  0.0        0.009        0.009          10           0          1 BP4Writer::PopulateMetadataIndexFileContent
  0.0        0.006        0.006           6           0          1 IO::InquireAttribute
  0.0        0.003        0.003           1           0          3 MPI_Finalized()
---------------------------------------------------------------------------------------

USER EVENTS Profile :NODE 0, CONTEXT 0, THREAD 0
---------------------------------------------------------------------------------------
NumSamples   MaxValue   MinValue  MeanValue  Std. Dev.  Event Name
---------------------------------------------------------------------------------------
         2          8          8          8          0  Message size for all-reduce
         2       1816          8        912        904  Message size for broadcast
        62       8676          0      530.9       1249  Message size for gather
        10          8          8          8          0  Message size for reduce
---------------------------------------------------------------------------------------

NODE 0;CONTEXT 0;THREAD 1:
---------------------------------------------------------------------------------------
%Time    Exclusive    Inclusive       #Call      #Subrs  Inclusive Name
              msec   total msec                          usec/call 
---------------------------------------------------------------------------------------
100.0            5          267           1           1     267043 .TAU application
 98.0          261          261           1           0     261670 progress_engine 

NODE 0;CONTEXT 0;THREAD 2:
---------------------------------------------------------------------------------------
%Time    Exclusive    Inclusive       #Call      #Subrs  Inclusive Name
              msec   total msec                          usec/call 
---------------------------------------------------------------------------------------
100.0        0.993          240           1           1     240100 .TAU application
 99.6          239          239           1           0     239107 addr=<7f09eb916e60> 

(more ranks follow)

Adding the -memory flag, we get:

USER EVENTS Profile :NODE 0, CONTEXT 0, THREAD 0
---------------------------------------------------------------------------------------
NumSamples   MaxValue   MinValue  MeanValue  Std. Dev.  Event Name
---------------------------------------------------------------------------------------
        89       3789     0.0166      42.74      399.4  Decrease in Heap Memory (KB)
        10    0.02051    0.02051    0.02051  3.293E-10  Decrease in Heap Memory (KB) : BP4Writer::BeginStep
         1     0.0166     0.0166     0.0166  2.328E-10  Decrease in Heap Memory (KB) : BP4Writer::Close
         1    0.03809    0.03809    0.03809  4.657E-10  Decrease in Heap Memory (KB) : BP4Writer::Close => BP4Writer::WriteCollectiveMetadataFile
         1    0.02051    0.02051    0.02051  2.328E-10  Decrease in Heap Memory (KB) : BP4Writer::Close => BP4Writer::WriteData
         1     0.0332     0.0332     0.0332          0  Decrease in Heap Memory (KB) : BP4Writer::Close => BP4Writer::WriteProfilingJSONFile
         9     0.1279     0.1279     0.1279          0  Decrease in Heap Memory (KB) : BP4Writer::EndStep
         9     0.0166     0.0166     0.0166  2.328E-10  Decrease in Heap Memory (KB) : BP4Writer::EndStep => BP4Writer::Flush
         9    0.03809    0.03809    0.03809  6.585E-10  Decrease in Heap Memory (KB) : BP4Writer::EndStep => BP4Writer::Flush => BP4Writer::WriteCollectiveMetadataFile
        10    0.04297    0.04297    0.04297          0  Decrease in Heap Memory (KB) : BP4Writer::EndStep => BP4Writer::Flush => BP4Writer::WriteCollectiveMetadataFile => BP4Writer::PopulateMetadataIndexFileContent
        10    0.02051    0.02051    0.02051  3.293E-10  Decrease in Heap Memory (KB) : BP4Writer::EndStep => BP4Writer::Flush => BP4Writer::WriteData
         9     0.1318     0.1318     0.1318  2.634E-09  Decrease in Heap Memory (KB) : BP4Writer::EndStep => BP4Writer::PerformPuts
        10    0.01953    0.01953    0.01953          0  Decrease in Heap Memory (KB) : BP4Writer::EndStep => BP4Writer::PerformPuts => IO::InquireVariable
         6    0.02051    0.02051    0.02051  2.328E-10  Decrease in Heap Memory (KB) : BP4Writer::EndStep => IO::InquireAttribute
         2       5.34      5.309      5.324    0.01562  Decrease in Heap Memory (KB) : MPI_Comm_free() 
         1       3789       3789       3789  6.104E-05  Decrease in Heap Memory (KB) : MPI_Finalize() 
  2.84E+04  1.052E+06          1      228.1       7723  Heap Allocate
       517  3.277E+04         16      156.5       1494  Heap Allocate : .TAU application
         6         39         21      31.33      7.587  Heap Allocate : BP4Writer::Close
         2         40         38         39          1  Heap Allocate : BP4Writer::Close => BP4Writer::WriteCollectiveMetadataFile
        45       8192          3      321.3       1291  Heap Allocate : BP4Writer::Close => BP4Writer::WriteProfilingJSONFile
        83        944          8      56.06      125.7  Heap Allocate : BP4Writer::EndStep
        40         39         21         30          9  Heap Allocate : BP4Writer::EndStep => BP4Writer::Flush
       361  1.058E+04         22        212      763.5  Heap Allocate : BP4Writer::EndStep => BP4Writer::Flush => BP4Writer::WriteCollectiveMetadataFile
        22       1040          8      61.27      215.6  Heap Allocate : BP4Writer::EndStep => BP4Writer::Flush => BP4Writer::WriteCollectiveMetadataFile => MPI_Reduce() 
        89        200          1      25.63         23  Heap Allocate : BP4Writer::EndStep => BP4Writer::PerformPuts
        12        397         24      145.5      126.2  Heap Allocate : IO::DefineAttribute
        12        552         16      95.67      157.3  Heap Allocate : IO::DefineVariable
         3       1328         80        536      562.2  Heap Allocate : IO::Open
       114  1.638E+04          1      198.4       1524  Heap Allocate : IO::Open => BP4Writer::Open
        91       1200          1      97.81      214.5  Heap Allocate : IO::Open => MPI_Comm_dup() 
         3         96         96         96          0  Heap Allocate : IO::other
         2          8          8          8          0  Heap Allocate : MPI_Allreduce() 
         2        148         32         90         58  Heap Allocate : MPI_Bcast() 
        91       1200          1      112.8      216.6  Heap Allocate : MPI_Comm_dup() 
       100       1920          1      140.2        324  Heap Allocate : MPI_Comm_split() 
       267  3.282E+04          7      656.9       4444  Heap Allocate : MPI_Finalize() 
 2.654E+04  1.052E+06          1      228.3       7973  Heap Allocate : MPI_Init() 
         1          4          4          4          0  Heap Allocate : progress_engine 
 2.805E+04  1.052E+06          1      222.9       7771  Heap Free
       544  3.277E+04         16      214.5       1675  Heap Free : .TAU application
        10         21         21         21          0  Heap Free : BP4Writer::BeginStep
         4         39         17      27.75      9.038  Heap Free : BP4Writer::Close
         3         40         38         39     0.8165  Heap Free : BP4Writer::Close => BP4Writer::WriteCollectiveMetadataFile
         1         21         21         21          0  Heap Free : BP4Writer::Close => BP4Writer::WriteData
        46       8192          3      315.1       1278  Heap Free : BP4Writer::Close => BP4Writer::WriteProfilingJSONFile
        56        944          8      54.25      139.5  Heap Free : BP4Writer::EndStep
        30         39         17      25.67      9.568  Heap Free : BP4Writer::EndStep => BP4Writer::Flush
       361       8676         22      182.7      533.5  Heap Free : BP4Writer::EndStep => BP4Writer::Flush => BP4Writer::WriteCollectiveMetadataFile
        10         44         44         44          0  Heap Free : BP4Writer::EndStep => BP4Writer::Flush => BP4Writer::WriteCollectiveMetadataFile => BP4Writer::PopulateMetadataIndexFileContent
        20          8          8          8          0  Heap Free : BP4Writer::EndStep => BP4Writer::Flush => BP4Writer::WriteCollectiveMetadataFile => MPI_Reduce() 
        10         21         21         21          0  Heap Free : BP4Writer::EndStep => BP4Writer::Flush => BP4Writer::WriteData
       126         64          1      24.63      14.81  Heap Free : BP4Writer::EndStep => BP4Writer::PerformPuts
        10         20         20         20          0  Heap Free : BP4Writer::EndStep => BP4Writer::PerformPuts => IO::InquireVariable
         6         21         21         21          0  Heap Free : BP4Writer::EndStep => IO::InquireAttribute
         6        397         20      84.83      139.6  Heap Free : IO::DefineAttribute
         7        296         16         65      95.51  Heap Free : IO::DefineVariable
        57         96          1      28.82      19.14  Heap Free : IO::Open => BP4Writer::Open
        39        525          4      80.51      143.8  Heap Free : IO::Open => MPI_Comm_dup() 
         2          8          8          8          0  Heap Free : MPI_Allreduce() 
        44        525          4      87.52        138  Heap Free : MPI_Comm_dup() 
        23       1200         32      474.1      412.4  Heap Free : MPI_Comm_free() 
        46        525          4      83.57      134.6  Heap Free : MPI_Comm_split() 
 1.161E+04  1.052E+06          1      348.7  1.191E+04  Heap Free : MPI_Finalize() 
 1.498E+04  6.554E+04          1      131.8       1712  Heap Free : MPI_Init() 
         1         48         48         48          0  Heap Free : progress_engine 
 6.214E+04       4454    0.03906       1608       1207  Heap Memory Used (KB)
       782       4445      136.6       4411      319.3  Heap Memory Used (KB) at Entry
         1      136.6      136.6      136.6          0  Heap Memory Used (KB) at Entry : .TAU application
        10       4441       4428       4440      3.985  Heap Memory Used (KB) at Entry : BP4Writer::BeginStep
         1       4441       4441       4441  6.104E-05  Heap Memory Used (KB) at Entry : BP4Writer::Close
         1       4441       4441       4441  6.104E-05  Heap Memory Used (KB) at Entry : BP4Writer::Close => BP4Writer::WriteCollectiveMetadataFile
         1       4441       4441       4441          0  Heap Memory Used (KB) at Entry : BP4Writer::Close => BP4Writer::WriteData
         1       4441       4441       4441  6.104E-05  Heap Memory Used (KB) at Entry : BP4Writer::Close => BP4Writer::WriteProfilingJSONFile
         2       4445       4441       4443      1.736  Heap Memory Used (KB) at Entry : BP4Writer::Close => BP4Writer::WriteProfilingJSONFile => MPI_Comm_rank() 
         1       4441       4441       4441          0  Heap Memory Used (KB) at Entry : BP4Writer::Close => BP4Writer::WriteProfilingJSONFile => MPI_Comm_size() 
         1       4441       4441       4441          0  Heap Memory Used (KB) at Entry : BP4Writer::Close => BP4Writer::WriteProfilingJSONFile => MPI_Gather() 
         1       4445       4445       4445  6.104E-05  Heap Memory Used (KB) at Entry : BP4Writer::Close => BP4Writer::WriteProfilingJSONFile => MPI_Gatherv() 
        10       4441       4428       4440      3.899  Heap Memory Used (KB) at Entry : BP4Writer::EndStep
        10       4441       4430       4440      3.428  Heap Memory Used (KB) at Entry : BP4Writer::EndStep => BP4Writer::Flush
        10       4441       4430       4440      3.428  Heap Memory Used (KB) at Entry : BP4Writer::EndStep => BP4Writer::Flush => BP4Writer::WriteCollectiveMetadataFile
        10       4441       4441       4441          0  Heap Memory Used (KB) at Entry : BP4Writer::EndStep => BP4Writer::Flush => BP4Writer::WriteCollectiveMetadataFile => BP4Writer::PopulateMetadataIndexFileContent
       170       4444       4430       4441      2.619  Heap Memory Used (KB) at Entry : BP4Writer::EndStep => BP4Writer::Flush => BP4Writer::WriteCollectiveMetadataFile => MPI_Comm_rank() 
        30       4441       4431       4440      2.777  Heap Memory Used (KB) at Entry : BP4Writer::EndStep => BP4Writer::Flush => BP4Writer::WriteCollectiveMetadataFile => MPI_Comm_size() 
        30       4442       4431       4441      2.777  Heap Memory Used (KB) at Entry : BP4Writer::EndStep => BP4Writer::Flush => BP4Writer::WriteCollectiveMetadataFile => MPI_Gather() 
        30       4443       4432       4441      2.499  Heap Memory Used (KB) at Entry : BP4Writer::EndStep => BP4Writer::Flush => BP4Writer::WriteCollectiveMetadataFile => MPI_Gatherv() 
        10       4441       4430       4440      3.428  Heap Memory Used (KB) at Entry : BP4Writer::EndStep => BP4Writer::Flush => BP4Writer::WriteCollectiveMetadataFile => MPI_Reduce() 
        10       4441       4430       4440      3.428  Heap Memory Used (KB) at Entry : BP4Writer::EndStep => BP4Writer::Flush => BP4Writer::WriteData
        10       4441       4428       4440      3.899  Heap Memory Used (KB) at Entry : BP4Writer::EndStep => BP4Writer::PerformPuts
        10       4441       4428       4440      3.899  Heap Memory Used (KB) at Entry : BP4Writer::EndStep => BP4Writer::PerformPuts => IO::InquireVariable
        10       4441       4428       4440      3.899  Heap Memory Used (KB) at Entry : BP4Writer::EndStep => BP4Writer::PerformPuts => IO::other
         6       4430       4429       4429     0.3899  Heap Memory Used (KB) at Entry : BP4Writer::EndStep => IO::InquireAttribute
         3       4401       4400       4400       0.49  Heap Memory Used (KB) at Entry : IO::DefineAttribute
         2       4400       4400       4400     0.1445  Heap Memory Used (KB) at Entry : IO::DefineAttribute => IO::other
         1       4399       4399       4399          0  Heap Memory Used (KB) at Entry : IO::DefineVariable
         1       4402       4402       4402          0  Heap Memory Used (KB) at Entry : IO::Open
         1       4408       4408       4408  6.104E-05  Heap Memory Used (KB) at Entry : IO::Open => BP4Writer::Open
         1       4425       4425       4425  6.104E-05  Heap Memory Used (KB) at Entry : IO::Open => BP4Writer::Open => MPI_Barrier() 
         4       4428       4425       4427      1.146  Heap Memory Used (KB) at Entry : IO::Open => BP4Writer::Open => MPI_Comm_rank() 
         1       4402       4402       4402          0  Heap Memory Used (KB) at Entry : IO::Open => MPI_Comm_dup() 
         1       4408       4408       4408  6.104E-05  Heap Memory Used (KB) at Entry : IO::Open => MPI_Comm_rank() 
         1       4408       4408       4408  6.104E-05  Heap Memory Used (KB) at Entry : IO::Open => MPI_Comm_size() 
         3       4433       4429       4431      1.518  Heap Memory Used (KB) at Entry : IO::other
         2       4428       4428       4428          0  Heap Memory Used (KB) at Entry : MPI_Allreduce() 
         1       4441       4441       4441          0  Heap Memory Used (KB) at Entry : MPI_Barrier() 
         2       4398       4396       4397     0.9751  Heap Memory Used (KB) at Entry : MPI_Bcast() 
         1       4388       4388       4388          0  Heap Memory Used (KB) at Entry : MPI_Comm_dup() 
         2       4441       4436       4438      2.684  Heap Memory Used (KB) at Entry : MPI_Comm_free() 
         5       4396       4378       4390      6.982  Heap Memory Used (KB) at Entry : MPI_Comm_rank() 
         2       4387       4378       4382      4.501  Heap Memory Used (KB) at Entry : MPI_Comm_size() 
         1       4378       4378       4378  6.104E-05  Heap Memory Used (KB) at Entry : MPI_Comm_split() 
         1       4387       4387       4387  6.104E-05  Heap Memory Used (KB) at Entry : MPI_Finalize() 
         2       1629       1185       1407      222.3  Heap Memory Used (KB) at Entry : MPI_Finalize() => pthread_join
         1       4436       4436       4436          0  Heap Memory Used (KB) at Entry : MPI_Finalized()
         1      254.1      254.1      254.1          0  Heap Memory Used (KB) at Entry : MPI_Init() 
         2       1421      360.5      890.5        530  Heap Memory Used (KB) at Entry : MPI_Init() => pthread_create
       182       4441       4428       4441      1.385  Heap Memory Used (KB) at Entry : MPI_Recv() 
       182       4441       4428       4441      1.385  Heap Memory Used (KB) at Entry : MPI_Send() 
         0          0          0          0          0  Heap Memory Used (KB) at Entry : progress_engine 
       782       4445      360.5       4413      306.1  Heap Memory Used (KB) at Exit
         1      597.6      597.6      597.6          0  Heap Memory Used (KB) at Exit : .TAU application
        10       4441       4428       4440      3.985  Heap Memory Used (KB) at Exit : BP4Writer::BeginStep
         1       4441       4441       4441  6.104E-05  Heap Memory Used (KB) at Exit : BP4Writer::Close
         1       4441       4441       4441  6.104E-05  Heap Memory Used (KB) at Exit : BP4Writer::Close => BP4Writer::WriteCollectiveMetadataFile
         1       4441       4441       4441          0  Heap Memory Used (KB) at Exit : BP4Writer::Close => BP4Writer::WriteData
         1       4441       4441       4441  6.104E-05  Heap Memory Used (KB) at Exit : BP4Writer::Close => BP4Writer::WriteProfilingJSONFile
         2       4445       4441       4443      1.736  Heap Memory Used (KB) at Exit : BP4Writer::Close => BP4Writer::WriteProfilingJSONFile => MPI_Comm_rank() 
         1       4441       4441       4441          0  Heap Memory Used (KB) at Exit : BP4Writer::Close => BP4Writer::WriteProfilingJSONFile => MPI_Comm_size() 
         1       4441       4441       4441          0  Heap Memory Used (KB) at Exit : BP4Writer::Close => BP4Writer::WriteProfilingJSONFile => MPI_Gather() 
         1       4445       4445       4445  6.104E-05  Heap Memory Used (KB) at Exit : BP4Writer::Close => BP4Writer::WriteProfilingJSONFile => MPI_Gatherv() 
        10       4441       4441       4441  6.104E-05  Heap Memory Used (KB) at Exit : BP4Writer::EndStep
        10       4441       4441       4441  6.104E-05  Heap Memory Used (KB) at Exit : BP4Writer::EndStep => BP4Writer::Flush
        10       4441       4441       4441  6.104E-05  Heap Memory Used (KB) at Exit : BP4Writer::EndStep => BP4Writer::Flush => BP4Writer::WriteCollectiveMetadataFile
        10       4441       4441       4441          0  Heap Memory Used (KB) at Exit : BP4Writer::EndStep => BP4Writer::Flush => BP4Writer::WriteCollectiveMetadataFile => BP4Writer::PopulateMetadataIndexFileContent
       170       4444       4430       4441      2.619  Heap Memory Used (KB) at Exit : BP4Writer::EndStep => BP4Writer::Flush => BP4Writer::WriteCollectiveMetadataFile => MPI_Comm_rank() 
        30       4441       4431       4440      2.777  Heap Memory Used (KB) at Exit : BP4Writer::EndStep => BP4Writer::Flush => BP4Writer::WriteCollectiveMetadataFile => MPI_Comm_size() 
        30       4442       4431       4441      2.777  Heap Memory Used (KB) at Exit : BP4Writer::EndStep => BP4Writer::Flush => BP4Writer::WriteCollectiveMetadataFile => MPI_Gather() 
        30       4443       4432       4441      2.499  Heap Memory Used (KB) at Exit : BP4Writer::EndStep => BP4Writer::Flush => BP4Writer::WriteCollectiveMetadataFile => MPI_Gatherv() 
        10       4441       4431       4440       3.08  Heap Memory Used (KB) at Exit : BP4Writer::EndStep => BP4Writer::Flush => BP4Writer::WriteCollectiveMetadataFile => MPI_Reduce() 
        10       4441       4430       4440      3.428  Heap Memory Used (KB) at Exit : BP4Writer::EndStep => BP4Writer::Flush => BP4Writer::WriteData
        10       4441       4429       4440      3.802  Heap Memory Used (KB) at Exit : BP4Writer::EndStep => BP4Writer::PerformPuts
        10       4441       4428       4440      3.899  Heap Memory Used (KB) at Exit : BP4Writer::EndStep => BP4Writer::PerformPuts => IO::InquireVariable
        10       4441       4428       4440      3.899  Heap Memory Used (KB) at Exit : BP4Writer::EndStep => BP4Writer::PerformPuts => IO::other
         6       4430       4429       4429     0.3899  Heap Memory Used (KB) at Exit : BP4Writer::EndStep => IO::InquireAttribute
         3       4402       4400       4401     0.6682  Heap Memory Used (KB) at Exit : IO::DefineAttribute
         2       4400       4400       4400     0.1445  Heap Memory Used (KB) at Exit : IO::DefineAttribute => IO::other
         1       4400       4400       4400          0  Heap Memory Used (KB) at Exit : IO::DefineVariable
         1       4428       4428       4428          0  Heap Memory Used (KB) at Exit : IO::Open
         1       4428       4428       4428  6.104E-05  Heap Memory Used (KB) at Exit : IO::Open => BP4Writer::Open
         1       4425       4425       4425  6.104E-05  Heap Memory Used (KB) at Exit : IO::Open => BP4Writer::Open => MPI_Barrier() 
         4       4428       4425       4427      1.146  Heap Memory Used (KB) at Exit : IO::Open => BP4Writer::Open => MPI_Comm_rank() 
         1       4406       4406       4406          0  Heap Memory Used (KB) at Exit : IO::Open => MPI_Comm_dup() 
         1       4408       4408       4408  6.104E-05  Heap Memory Used (KB) at Exit : IO::Open => MPI_Comm_rank() 
         1       4408       4408       4408  6.104E-05  Heap Memory Used (KB) at Exit : IO::Open => MPI_Comm_size() 
         3       4433       4429       4431      1.518  Heap Memory Used (KB) at Exit : IO::other
         2       4428       4428       4428          0  Heap Memory Used (KB) at Exit : MPI_Allreduce() 
         1       4441       4441       4441          0  Heap Memory Used (KB) at Exit : MPI_Barrier() 
         2       4398       4396       4397     0.8872  Heap Memory Used (KB) at Exit : MPI_Bcast() 
         1       4394       4394       4394  6.104E-05  Heap Memory Used (KB) at Exit : MPI_Comm_dup() 
         2       4436       4430       4433        2.7  Heap Memory Used (KB) at Exit : MPI_Comm_free() 
         5       4396       4378       4390      6.982  Heap Memory Used (KB) at Exit : MPI_Comm_rank() 
         2       4387       4378       4382      4.501  Heap Memory Used (KB) at Exit : MPI_Comm_size() 
         1       4387       4387       4387  6.104E-05  Heap Memory Used (KB) at Exit : MPI_Comm_split() 
         1      597.6      597.6      597.6          0  Heap Memory Used (KB) at Exit : MPI_Finalize() 
         2       1629       1185       1407      222.3  Heap Memory Used (KB) at Exit : MPI_Finalize() => pthread_join
         1       4436       4436       4436          0  Heap Memory Used (KB) at Exit : MPI_Finalized()
         1       4378       4378       4378  6.104E-05  Heap Memory Used (KB) at Exit : MPI_Init() 
         2       1421      360.5      890.5        530  Heap Memory Used (KB) at Exit : MPI_Init() => pthread_create
       182       4441       4428       4441      1.385  Heap Memory Used (KB) at Exit : MPI_Recv() 
       182       4441       4428       4441      1.385  Heap Memory Used (KB) at Exit : MPI_Send() 
         0          0          0          0          0  Heap Memory Used (KB) at Exit : progress_engine 
        20       4123    0.09375      234.5      897.7  Increase in Heap Memory (KB)
         1        461        461        461  5.395E-06  Increase in Heap Memory (KB) : .TAU application
         1      12.87      12.87      12.87  2.384E-07  Increase in Heap Memory (KB) : BP4Writer::EndStep
         1      11.41      11.41      11.41          0  Increase in Heap Memory (KB) : BP4Writer::EndStep => BP4Writer::Flush
         1      11.39      11.39      11.39  1.686E-07  Increase in Heap Memory (KB) : BP4Writer::EndStep => BP4Writer::Flush => BP4Writer::WriteCollectiveMetadataFile
         1       1.16       1.16       1.16  2.107E-08  Increase in Heap Memory (KB) : BP4Writer::EndStep => BP4Writer::Flush => BP4Writer::WriteCollectiveMetadataFile => MPI_Reduce() 
         1     0.1885     0.1885     0.1885  2.634E-09  Increase in Heap Memory (KB) : BP4Writer::EndStep => BP4Writer::PerformPuts
         3      0.665     0.2461     0.4027     0.1867  Increase in Heap Memory (KB) : IO::DefineAttribute
         1     0.6768     0.6768     0.6768  7.451E-09  Increase in Heap Memory (KB) : IO::DefineVariable
         1      26.45      26.45      26.45          0  Increase in Heap Memory (KB) : IO::Open
         1      20.48      20.48      20.48          0  Increase in Heap Memory (KB) : IO::Open => BP4Writer::Open
         1      4.395      4.395      4.395   5.96E-08  Increase in Heap Memory (KB) : IO::Open => MPI_Comm_dup() 
         3    0.09375    0.09375    0.09375          0  Increase in Heap Memory (KB) : IO::other
         1     0.1758     0.1758     0.1758          0  Increase in Heap Memory (KB) : MPI_Bcast() 
         1      5.517      5.517      5.517          0  Increase in Heap Memory (KB) : MPI_Comm_dup() 
         1      9.002      9.002      9.002          0  Increase in Heap Memory (KB) : MPI_Comm_split() 
         1       4123       4123       4123  6.104E-05  Increase in Heap Memory (KB) : MPI_Init() 
         0          0          0          0          0  Increase in Heap Memory (KB) : progress_engine 
       285  1.857E+05          8       1191  1.134E+04  MEMORY LEAK! Heap Allocate : .TAU application
         1       1040       1040       1040          0  MEMORY LEAK! Heap Allocate : BP4Writer::EndStep => BP4Writer::Flush => BP4Writer::WriteCollectiveMetadataFile => MPI_Reduce() 
         1         15         15         15          0  MEMORY LEAK! Heap Allocate : IO::Open => BP4Writer::Open
         3        396          4      149.7      175.2  MEMORY LEAK! Heap Allocate : IO::Open => MPI_Comm_dup() 
         4        200          4       82.5      77.09  MEMORY LEAK! Heap Allocate : MPI_Comm_dup() 
         4        396        102      211.2      112.2  MEMORY LEAK! Heap Allocate : MPI_Comm_split() 
       657  1.782E+04          1      362.8       1196  MEMORY LEAK! Heap Allocate : MPI_Init() 
        16        136         51      56.31      20.58  MEMORY LEAK! Heap Allocate : progress_engine 
         2          8          8          8          0  Message size for all-reduce
         2       1816          8        912        904  Message size for broadcast
        62       8676          0      530.9       1249  Message size for gather
        10          8          8          8          0  Message size for reduce
---------------------------------------------------------------------------------------

Adding the -io flag, we get:

$ pprof 0
Reading Profile files in profile.*

NODE 0;CONTEXT 0;THREAD 0:
---------------------------------------------------------------------------------------
%Time    Exclusive    Inclusive       #Call      #Subrs  Inclusive Name
              msec   total msec                          usec/call 
---------------------------------------------------------------------------------------
100.0            2          357           1         459     357101 .TAU application
 38.5          124          137           1         336     137443 MPI_Init() 
 35.3           75          126           1          21     126186 MPI_Finalize() 
 22.4        0.119           80           1           6      80089 BP4Writer::Close
 14.0           49           49           2           0      24952 pthread_join
 11.2        0.091           40           1           8      40168 BP4Writer::WriteProfilingJSONFile
 11.2           39           39          68           0        587 close() 
  9.1           32           32          27           0       1210 fclose() 
  2.2            7            7          29           0        271 fopen() 
  1.9            6            6          31           0        220 MPI_Gather() 
  1.9            6            6          51           0        131 open() 
  1.4        0.118            5           1           4       5006 IO::Open
  1.3        0.357            4           1           8       4725 BP4Writer::Open
  0.7        0.291            2          10          26        260 BP4Writer::EndStep
  0.6        0.192            2          10          20        206 BP4Writer::Flush
  0.5        0.674            1          11         303        170 BP4Writer::WriteCollectiveMetadataFile
  0.4            1            1         147           0         10 write() 
  0.4            1            1           2           0        677 fopen64() 
  0.3        0.904        0.904         182           0          5 MPI_Recv() 
  0.2        0.873        0.873          42           0         21 read() 
  0.2        0.543        0.543           1           0        543 MPI_Comm_split() 
  0.1        0.534        0.534          10           0         53 MPI_Reduce() 
  0.1        0.497        0.497         182           0          3 MPI_Send() 
  0.1         0.29         0.29           8           0         36 socket() 
  0.1        0.255        0.255          43           0          6 fwrite() 
  0.1        0.246        0.246           3           0         82 socketpair() 
  0.1        0.225        0.242          10          20         24 BP4Writer::PerformPuts
  0.1        0.203        0.203           2           0        102 MPI_Comm_dup() 
  0.1        0.181        0.181          31           0          6 MPI_Gatherv() 
  0.0        0.027        0.142          11          10         13 BP4Writer::WriteData
  0.0        0.098        0.098           8           0         12 fscanf() 
  0.0        0.086        0.086           2           0         43 pthread_create
  0.0        0.078        0.078           1           0         78 send() 
  0.0        0.069        0.069           2           0         34 MPI_Barrier() 
  0.0        0.061        0.061           1           0         61 connect() 
  0.0        0.057        0.057           2           0         28 MPI_Bcast() 
  0.0        0.054        0.054           1           0         54 writev() 
  0.0        0.048        0.048         182           0          0 MPI_Comm_rank() 
  0.0        0.048        0.048           2           0         24 recv() 
  0.0        0.038        0.038           2           0         19 MPI_Allreduce() 
  0.0        0.029        0.029           1           0         29 IO::DefineVariable
  0.0        0.026        0.026           2           0         13 MPI_Comm_free() 
  0.0        0.015        0.017           3           2          6 IO::DefineAttribute
  0.0        0.017        0.017          15           0          1 IO::other
  0.0        0.016        0.016           5           0          3 bind() 
  0.0        0.014        0.014           2           0          7 rewind() 
  0.0        0.011        0.011          10           0          1 BP4Writer::BeginStep
  0.0        0.009        0.009          10           0          1 BP4Writer::PopulateMetadataIndexFileContent
  0.0        0.009        0.009          34           0          0 MPI_Comm_size() 
  0.0        0.008        0.008          10           0          1 IO::InquireVariable
  0.0        0.006        0.006           6           0          1 IO::InquireAttribute
  0.0        0.002        0.002           2           0          1 lseek() 
  0.0        0.001        0.001           1           0          1 MPI_Finalized()
---------------------------------------------------------------------------------------

USER EVENTS Profile :NODE 0, CONTEXT 0, THREAD 0
---------------------------------------------------------------------------------------
NumSamples   MaxValue   MinValue  MeanValue  Std. Dev.  Event Name
---------------------------------------------------------------------------------------
        44       8192          1       1535       2814  Bytes Read
         7         21         21         21          0  Bytes Read : MPI_Init() => fscanf() 
        34       8192          1       1928       3079  Bytes Read : MPI_Init() => read() 
         2          4          4          4          0  Bytes Read : MPI_Init() => recv() 
         0          0          0          0          0  Bytes Read : progress_engine => read() 
         1       1816       1816       1816          0  Bytes Read : read() 
         7         21         21         21          0  Bytes Read <file=/proc/net/if_inet6>
         7         21         21         21          0  Bytes Read <file=/proc/net/if_inet6> : MPI_Init() => fscanf() 
         1          1          1          1          0  Bytes Read <file=/proc/sys/kernel/yama/ptrace_scope>
         1          1          1          1          0  Bytes Read <file=/proc/sys/kernel/yama/ptrace_scope> : MPI_Init() => read() 
         0          0          0          0          0  Bytes Read <file=/storage/packages/openmpi/4.0.1-gcc8.1/lib/libopen-pal.so.40.20.1>
         0          0          0          0          0  Bytes Read <file=/storage/packages/openmpi/4.0.1-gcc8.1/lib/libopen-pal.so.40.20.1> : read() 
         0          0          0          0          0  Bytes Read <file=/storage/packages/openmpi/4.0.1-gcc8.1/lib/openmpi/mca_pmix_pmix3x.so>
         0          0          0          0          0  Bytes Read <file=/storage/packages/openmpi/4.0.1-gcc8.1/lib/openmpi/mca_pmix_pmix3x.so> : read() 
         2         20         20         20          0  Bytes Read <file=/sys/class/infiniband/hfi1_0/node_guid>
         2         20         20         20          0  Bytes Read <file=/sys/class/infiniband/hfi1_0/node_guid> : MPI_Init() => read() 
         1          6          6          6          0  Bytes Read <file=/sys/class/infiniband/hfi1_0/node_type>
         1          6          6          6          0  Bytes Read <file=/sys/class/infiniband/hfi1_0/node_type> : MPI_Init() => read() 
         3         40         40         40          0  Bytes Read <file=/sys/class/infiniband/hfi1_0/ports/1/gids/0>
         3         40         40         40          0  Bytes Read <file=/sys/class/infiniband/hfi1_0/ports/1/gids/0> : MPI_Init() => read() 
         1          2          2          2          0  Bytes Read <file=/sys/class/infiniband_verbs/abi_version>
         1          2          2          2          0  Bytes Read <file=/sys/class/infiniband_verbs/abi_version> : MPI_Init() => read() 
         4          2          2          2          0  Bytes Read <file=/sys/class/infiniband_verbs/uverbs0/abi_version>
         4          2          2          2          0  Bytes Read <file=/sys/class/infiniband_verbs/uverbs0/abi_version> : MPI_Init() => read() 
         4         54         54         54          0  Bytes Read <file=/sys/class/infiniband_verbs/uverbs0/device/modalias>
         4         54         54         54          0  Bytes Read <file=/sys/class/infiniband_verbs/uverbs0/device/modalias> : MPI_Init() => read() 
         5          7          7          7          0  Bytes Read <file=/sys/class/infiniband_verbs/uverbs0/ibdev>
         5          7          7          7          0  Bytes Read <file=/sys/class/infiniband_verbs/uverbs0/ibdev> : MPI_Init() => read() 
         1          2          2          2          0  Bytes Read <file=/sys/class/misc/rdma_cm/abi_version>
         1          2          2          2          0  Bytes Read <file=/sys/class/misc/rdma_cm/abi_version> : MPI_Init() => read() 
         2       2886       2886       2886          0  Bytes Read <file=/usr/local/packages/openmpi/4.0.1-gcc8.1/etc/openmpi-mca-params.conf>
         2       2886       2886       2886          0  Bytes Read <file=/usr/local/packages/openmpi/4.0.1-gcc8.1/etc/openmpi-mca-params.conf> : MPI_Init() => read() 
         2       2902       2902       2902          0  Bytes Read <file=/usr/local/packages/openmpi/4.0.1-gcc8.1/etc/pmix-mca-params.conf>
         2       2902       2902       2902          0  Bytes Read <file=/usr/local/packages/openmpi/4.0.1-gcc8.1/etc/pmix-mca-params.conf> : MPI_Init() => read() 
         6       8192        986       6991       2686  Bytes Read <file=/usr/local/packages/openmpi/4.0.1-gcc8.1/share/openmpi/help-mpi-btl-openib.txt>
         6       8192        986       6991       2686  Bytes Read <file=/usr/local/packages/openmpi/4.0.1-gcc8.1/share/openmpi/help-mpi-btl-openib.txt> : MPI_Init() => read() 
         2       8192       3408       5800       2392  Bytes Read <file=/usr/local/packages/openmpi/4.0.1-gcc8.1/share/openmpi/mca-btl-openib-device-params.ini>
         2       8192       3408       5800       2392  Bytes Read <file=/usr/local/packages/openmpi/4.0.1-gcc8.1/share/openmpi/mca-btl-openib-device-params.ini> : MPI_Init() => read() 
         2          4          4          4          0  Bytes Read <file=127.0.0.1,port=34423>
         2          4          4          4          0  Bytes Read <file=127.0.0.1,port=34423> : MPI_Init() => recv() 
         0          0          0          0          0  Bytes Read <file=127.0.0.1,port=34423> : progress_engine => read() 
         1       1816       1816       1816          0  Bytes Read <file=adios2.xml>
         1       1816       1816       1816          0  Bytes Read <file=adios2.xml> : read() 
         0          0          0          0          0  Bytes Read <file=unknown>
         0          0          0          0          0  Bytes Read <file=unknown> : progress_engine => read() 
       192       3461          1      166.2      492.3  Bytes Written
         1          1          1          1          0  Bytes Written : BP4Writer::Close => BP4Writer::WriteCollectiveMetadataFile => write() 
         1       3461       3461       3461          0  Bytes Written : BP4Writer::Close => BP4Writer::WriteProfilingJSONFile => writev() 
        20       2555         64      964.5      909.2  Bytes Written : BP4Writer::EndStep => BP4Writer::Flush => BP4Writer::WriteCollectiveMetadataFile => write() 
        10       1183        590      656.6      175.6  Bytes Written : BP4Writer::EndStep => BP4Writer::Flush => BP4Writer::WriteData => write() 
         8          8          8          8          0  Bytes Written : MPI_Finalize() => write() 
         1         66         66         66          0  Bytes Written : MPI_Init() => send() 
       108        144          8      19.59       19.3  Bytes Written : MPI_Init() => write() 
        43         34          1       7.86       9.27  Bytes Written : fwrite() 
         0          0          0          0          0  Bytes Written : progress_engine => writev() 
         5        144         24       49.6       47.3  Bytes Written <file=/dev/infiniband/rdma_cm>
         5        144         24       49.6       47.3  Bytes Written <file=/dev/infiniband/rdma_cm> : MPI_Init() => write() 
        51         64         12      28.47      16.28  Bytes Written <file=/dev/infiniband/uverbs0>
        51         64         12      28.47      16.28  Bytes Written <file=/dev/infiniband/uverbs0> : MPI_Init() => write() 
         1         66         66         66          0  Bytes Written <file=127.0.0.1,port=34423>
         1         66         66         66          0  Bytes Written <file=127.0.0.1,port=34423> : MPI_Init() => send() 
         0          0          0          0          0  Bytes Written <file=127.0.0.1,port=34423> : progress_engine => writev() 
        10       1183        590      656.6      175.6  Bytes Written <file=sim.bp/data.0>
        10       1183        590      656.6      175.6  Bytes Written <file=sim.bp/data.0> : BP4Writer::EndStep => BP4Writer::Flush => BP4Writer::WriteData => write() 
        10       2555       1780       1859      232.1  Bytes Written <file=sim.bp/md.0>
        10       2555       1780       1859      232.1  Bytes Written <file=sim.bp/md.0> : BP4Writer::EndStep => BP4Writer::Flush => BP4Writer::WriteCollectiveMetadataFile => write() 
        11        128          1      64.09      27.08  Bytes Written <file=sim.bp/md.idx>
         1          1          1          1          0  Bytes Written <file=sim.bp/md.idx> : BP4Writer::Close => BP4Writer::WriteCollectiveMetadataFile => write() 
        10        128         64       70.4       19.2  Bytes Written <file=sim.bp/md.idx> : BP4Writer::EndStep => BP4Writer::Flush => BP4Writer::WriteCollectiveMetadataFile => write() 
         1       3461       3461       3461          0  Bytes Written <file=sim.bp/profiling.json>
         1       3461       3461       3461          0  Bytes Written <file=sim.bp/profiling.json> : BP4Writer::Close => BP4Writer::WriteProfilingJSONFile => writev() 
        43         34          1       7.86       9.27  Bytes Written <file=stdout>
        43         34          1       7.86       9.27  Bytes Written <file=stdout> : fwrite() 
        60          8          8          8          0  Bytes Written <file=unknown>
         8          8          8          8          0  Bytes Written <file=unknown> : MPI_Finalize() => write() 
        52          8          8          8          0  Bytes Written <file=unknown> : MPI_Init() => write() 
         2          8          8          8          0  Message size for all-reduce
         2       1816          8        912        904  Message size for broadcast
        62       8676          0      530.9       1249  Message size for gather
        10          8          8          8          0  Message size for reduce
        44       8192     0.3333      476.8       1458  Read Bandwidth (MB/s)
         7         21     0.7778      9.611      5.654  Read Bandwidth (MB/s) : MPI_Init() => fscanf() 
        34       8192     0.3333      604.3       1636  Read Bandwidth (MB/s) : MPI_Init() => read() 
         2          2     0.6667      1.333     0.6667  Read Bandwidth (MB/s) : MPI_Init() => recv() 
         0          0          0          0          0  Read Bandwidth (MB/s) : progress_engine => read() 
         1      363.2      363.2      363.2          0  Read Bandwidth (MB/s) : read() 
         7         21     0.7778      9.611      5.654  Read Bandwidth (MB/s) <file=/proc/net/if_inet6>
         7         21     0.7778      9.611      5.654  Read Bandwidth (MB/s) <file=/proc/net/if_inet6> : MPI_Init() => fscanf() 
         1     0.3333     0.3333     0.3333          0  Read Bandwidth (MB/s) <file=/proc/sys/kernel/yama/ptrace_scope>
         1     0.3333     0.3333     0.3333          0  Read Bandwidth (MB/s) <file=/proc/sys/kernel/yama/ptrace_scope> : MPI_Init() => read() 
         0          0          0          0          0  Read Bandwidth (MB/s) <file=/storage/packages/openmpi/4.0.1-gcc8.1/lib/libopen-pal.so.40.20.1>
         0          0          0          0          0  Read Bandwidth (MB/s) <file=/storage/packages/openmpi/4.0.1-gcc8.1/lib/libopen-pal.so.40.20.1> : read() 
         0          0          0          0          0  Read Bandwidth (MB/s) <file=/storage/packages/openmpi/4.0.1-gcc8.1/lib/openmpi/mca_pmix_pmix3x.so>
         0          0          0          0          0  Read Bandwidth (MB/s) <file=/storage/packages/openmpi/4.0.1-gcc8.1/lib/openmpi/mca_pmix_pmix3x.so> : read() 
         2      6.667      6.667      6.667          0  Read Bandwidth (MB/s) <file=/sys/class/infiniband/hfi1_0/node_guid>
         2      6.667      6.667      6.667          0  Read Bandwidth (MB/s) <file=/sys/class/infiniband/hfi1_0/node_guid> : MPI_Init() => read() 
         1        1.5        1.5        1.5          0  Read Bandwidth (MB/s) <file=/sys/class/infiniband/hfi1_0/node_type>
         1        1.5        1.5        1.5          0  Read Bandwidth (MB/s) <file=/sys/class/infiniband/hfi1_0/node_type> : MPI_Init() => read() 
         3         20      13.33      15.56      3.143  Read Bandwidth (MB/s) <file=/sys/class/infiniband/hfi1_0/ports/1/gids/0>
         3         20      13.33      15.56      3.143  Read Bandwidth (MB/s) <file=/sys/class/infiniband/hfi1_0/ports/1/gids/0> : MPI_Init() => read() 
         1        0.5        0.5        0.5          0  Read Bandwidth (MB/s) <file=/sys/class/infiniband_verbs/abi_version>
         1        0.5        0.5        0.5          0  Read Bandwidth (MB/s) <file=/sys/class/infiniband_verbs/abi_version> : MPI_Init() => read() 
         4          1          1          1          0  Read Bandwidth (MB/s) <file=/sys/class/infiniband_verbs/uverbs0/abi_version>
         4          1          1          1          0  Read Bandwidth (MB/s) <file=/sys/class/infiniband_verbs/uverbs0/abi_version> : MPI_Init() => read() 
         4         18       13.5      16.88      1.949  Read Bandwidth (MB/s) <file=/sys/class/infiniband_verbs/uverbs0/device/modalias>
         4         18       13.5      16.88      1.949  Read Bandwidth (MB/s) <file=/sys/class/infiniband_verbs/uverbs0/device/modalias> : MPI_Init() => read() 
         5        3.5       1.75      2.333      0.639  Read Bandwidth (MB/s) <file=/sys/class/infiniband_verbs/uverbs0/ibdev>
         5        3.5       1.75      2.333      0.639  Read Bandwidth (MB/s) <file=/sys/class/infiniband_verbs/uverbs0/ibdev> : MPI_Init() => read() 
         1          1          1          1          0  Read Bandwidth (MB/s) <file=/sys/class/misc/rdma_cm/abi_version>
         1          1          1          1          0  Read Bandwidth (MB/s) <file=/sys/class/misc/rdma_cm/abi_version> : MPI_Init() => read() 
         2      169.8      151.9      160.8      8.935  Read Bandwidth (MB/s) <file=/usr/local/packages/openmpi/4.0.1-gcc8.1/etc/openmpi-mca-params.conf>
         2      169.8      151.9      160.8      8.935  Read Bandwidth (MB/s) <file=/usr/local/packages/openmpi/4.0.1-gcc8.1/etc/openmpi-mca-params.conf> : MPI_Init() => read() 
         2      193.5      90.69      142.1      51.39  Read Bandwidth (MB/s) <file=/usr/local/packages/openmpi/4.0.1-gcc8.1/etc/pmix-mca-params.conf>
         2      193.5      90.69      142.1      51.39  Read Bandwidth (MB/s) <file=/usr/local/packages/openmpi/4.0.1-gcc8.1/etc/pmix-mca-params.conf> : MPI_Init() => read() 
         6       8192        493       3024       2794  Read Bandwidth (MB/s) <file=/usr/local/packages/openmpi/4.0.1-gcc8.1/share/openmpi/help-mpi-btl-openib.txt>
         6       8192        493       3024       2794  Read Bandwidth (MB/s) <file=/usr/local/packages/openmpi/4.0.1-gcc8.1/share/openmpi/help-mpi-btl-openib.txt> : MPI_Init() => read() 
         2       1136        512        824        312  Read Bandwidth (MB/s) <file=/usr/local/packages/openmpi/4.0.1-gcc8.1/share/openmpi/mca-btl-openib-device-params.ini>
         2       1136        512        824        312  Read Bandwidth (MB/s) <file=/usr/local/packages/openmpi/4.0.1-gcc8.1/share/openmpi/mca-btl-openib-device-params.ini> : MPI_Init() => read() 
         2          2     0.6667      1.333     0.6667  Read Bandwidth (MB/s) <file=127.0.0.1,port=34423>
         2          2     0.6667      1.333     0.6667  Read Bandwidth (MB/s) <file=127.0.0.1,port=34423> : MPI_Init() => recv() 
         0          0          0          0          0  Read Bandwidth (MB/s) <file=127.0.0.1,port=34423> : progress_engine => read() 
         1      363.2      363.2      363.2          0  Read Bandwidth (MB/s) <file=adios2.xml>
         1      363.2      363.2      363.2          0  Read Bandwidth (MB/s) <file=adios2.xml> : read() 
         0          0          0          0          0  Read Bandwidth (MB/s) <file=unknown>
         0          0          0          0          0  Read Bandwidth (MB/s) <file=unknown> : progress_engine => read() 
       168      597.3    0.06667      44.01      114.3  Write Bandwidth (MB/s)
         1       0.25       0.25       0.25          0  Write Bandwidth (MB/s) : BP4Writer::Close => BP4Writer::WriteCollectiveMetadataFile => write() 
         1      384.6      384.6      384.6  9.344E-06  Write Bandwidth (MB/s) : BP4Writer::Close => BP4Writer::WriteProfilingJSONFile => writev() 
        20      597.3       25.6      232.3      227.3  Write Bandwidth (MB/s) : BP4Writer::EndStep => BP4Writer::Flush => BP4Writer::WriteCollectiveMetadataFile => write() 
        10      200.3      76.88      153.6      36.51  Write Bandwidth (MB/s) : BP4Writer::EndStep => BP4Writer::Flush => BP4Writer::WriteData => write() 
         8      1.143     0.7273     0.9488      0.141  Write Bandwidth (MB/s) : MPI_Finalize() => write() 
         1      2.538      2.538      2.538  5.162E-08  Write Bandwidth (MB/s) : MPI_Init() => send() 
       108         36     0.3636      6.716      7.112  Write Bandwidth (MB/s) : MPI_Init() => write() 
        19         25    0.06667      4.724       7.48  Write Bandwidth (MB/s) : fwrite() 
         0          0          0          0          0  Write Bandwidth (MB/s) : progress_engine => writev() 
         5         36        6.4      19.68       11.1  Write Bandwidth (MB/s) <file=/dev/infiniband/rdma_cm>
         5         36        6.4      19.68       11.1  Write Bandwidth (MB/s) <file=/dev/infiniband/rdma_cm> : MPI_Init() => write() 
        51         24     0.9275      9.902      6.884  Write Bandwidth (MB/s) <file=/dev/infiniband/uverbs0>
        51         24     0.9275      9.902      6.884  Write Bandwidth (MB/s) <file=/dev/infiniband/uverbs0> : MPI_Init() => write() 
         1      2.538      2.538      2.538  5.162E-08  Write Bandwidth (MB/s) <file=127.0.0.1,port=34423>
         1      2.538      2.538      2.538  5.162E-08  Write Bandwidth (MB/s) <file=127.0.0.1,port=34423> : MPI_Init() => send() 
         0          0          0          0          0  Write Bandwidth (MB/s) <file=127.0.0.1,port=34423> : progress_engine => writev() 
        10      200.3      76.88      153.6      36.51  Write Bandwidth (MB/s) <file=sim.bp/data.0>
        10      200.3      76.88      153.6      36.51  Write Bandwidth (MB/s) <file=sim.bp/data.0> : BP4Writer::EndStep => BP4Writer::Flush => BP4Writer::WriteData => write() 
        10      597.3      196.5      423.7      172.7  Write Bandwidth (MB/s) <file=sim.bp/md.0>
        10      597.3      196.5      423.7      172.7  Write Bandwidth (MB/s) <file=sim.bp/md.0> : BP4Writer::EndStep => BP4Writer::Flush => BP4Writer::WriteCollectiveMetadataFile => write() 
        11         64       0.25      37.26      18.63  Write Bandwidth (MB/s) <file=sim.bp/md.idx>
         1       0.25       0.25       0.25          0  Write Bandwidth (MB/s) <file=sim.bp/md.idx> : BP4Writer::Close => BP4Writer::WriteCollectiveMetadataFile => write() 
        10         64       25.6      40.96       15.2  Write Bandwidth (MB/s) <file=sim.bp/md.idx> : BP4Writer::EndStep => BP4Writer::Flush => BP4Writer::WriteCollectiveMetadataFile => write() 
         1      384.6      384.6      384.6  9.344E-06  Write Bandwidth (MB/s) <file=sim.bp/profiling.json>
         1      384.6      384.6      384.6  9.344E-06  Write Bandwidth (MB/s) <file=sim.bp/profiling.json> : BP4Writer::Close => BP4Writer::WriteProfilingJSONFile => writev() 
        19         25    0.06667      4.724       7.48  Write Bandwidth (MB/s) <file=stdout>
        19         25    0.06667      4.724       7.48  Write Bandwidth (MB/s) <file=stdout> : fwrite() 
        60          8     0.3636      2.158      1.233  Write Bandwidth (MB/s) <file=unknown>
         8      1.143     0.7273     0.9488      0.141  Write Bandwidth (MB/s) <file=unknown> : MPI_Finalize() => write() 
        52          8     0.3636      2.344      1.222  Write Bandwidth (MB/s) <file=unknown> : MPI_Init() => write() 
---------------------------------------------------------------------------------------

NODE 0;CONTEXT 0;THREAD 1:
---------------------------------------------------------------------------------------
%Time    Exclusive    Inclusive       #Call      #Subrs  Inclusive Name
              msec   total msec                          usec/call 
---------------------------------------------------------------------------------------
100.0            7          301           1          61     301393 .TAU application
 97.2          292          292           1           6     292867 progress_engine 
  0.2        0.648        0.648          63           0         10 read() 
  0.1        0.259        0.259           1           0        259 fopen64() 
  0.0        0.102        0.102           1           0        102 fopen() 
  0.0        0.028        0.028           1           0         28 fclose() 
---------------------------------------------------------------------------------------

(other threads/nodes follow)

Tracing

Note: on a test system at uoregon.edu, the machine has an Infiniband interface and LibFabric installed. The CMake configuration for ADIOS2 detected LibFabric and decided to use RDMA as the default protocol. Runtime errors led us to force the SST transport to be WAN instead of RDMA. That is done with:

    <io name="SimulationOutput">
        <engine type="SST">
            <parameter key="OpenTimeoutSecs" value="10.0"/>
            <parameter key="DataTransport" value="WAN"/>
        </engine>
    </io>

In order to collect a trace of an ADIOS2 coupled application, the easiest way is to launch the writer and reader using MPMD. The ADIOS2 tutorial example HeatTransfer is normally launched like this:

mpirun \
    -n 16 $te ./heatSimulation sim.bp  4 4  64 64 100 10 : \
    -n 4 $te ./heatAnalysis sim.bp analysis.bp 2 2

To collect a trace, tau_exec is used to wrap both applications, and a TAU environment variable is set:

te="tau_exec -T mpi,pthread"
export TAU_TRACE=1

rm -f tautrace* events*
mpirun \
    -n 16 $te ./heatSimulation sim.bp  4 4  64 64 100 10 : \
    -n 4 $te ./heatAnalysis sim.bp analysis.bp 2 2

...and after the trace files are generated, they have to be merged and converted to either OTF2 (to be read by Vampir) or SLOG2 (to be read by JumpShot). Keep in mind that both applications will be merged into the same trace.

rm -f tau.trc tau.edf
tau_treemerge.pl
rm -rf traces*
tau2otf2 tau.trc tau.edf traces
tau2slog2 tau.trc tau.edf -o traces.slog2

If you are eventually targeting OTF2 (Vampir), you can set the TAU_TRACE_FORMAT=otf2 environment variable before writing, and the trace will be written directly. Regardless, this is what the trace of the above example looks like in Vampir (after setting colors for groups of events):

Vampir image with Heat Transfer tutorial example, 16 simulation nodes and 4 analysis nodes.

...and this is what it looks like in Vampir when written straight to OTF2 (no conversion):

Vampir native OTF2 image with Heat Transfer tutorial example, 16 simulation nodes and 4 analysis nodes.

...and this is what it looks like in JumpShot:

JumpShot image with Heat Transfer tutorial example, 16 simulation nodes and 4 analysis nodes.