Hi~
Question: Is there a code path in MVAPICH that uses the same algorithms as SST-Macro? In other words, if the two benchmarks use the same parameter file (same hardware information), can SST-Macro produce results similar to MVAPICH?
Take MPI_Allreduce and MPI_Barrier as an example:
Algorithms in SST-Macro:
1. MPI_Barrier: Bruck algorithm
2. MPI_Allreduce: Wilke-Halving (the Wilke algorithm is a variation of the binary blocks algorithm)
2.1 First, reduce rounds (similar to the recursive-halving algorithm)
2.2 Second, recv rounds (similar to the Bruck algorithm)
Algorithms in MVAPICH:
1. MPI_Barrier:
1.1 If mv2_use_osu_collectives is set (default): pairwise exchange with the recursive doubling algorithm
1.2 Else: dissemination algorithm (the Bruck algorithm)
2. MPI_Allreduce:
2.1 If mv2_use_osu_collectives is set (default): I have not yet analyzed which algorithm is used
2.2 Else:
short messages: size <= MPIR_CVAR_ALLREDUCE_LONG_MSG_SIZE
long messages: size > MPIR_CVAR_ALLREDUCE_LONG_MSG_SIZE
2.2.1 For long messages, Rabenseifner's algorithm is used.
First, a recursive-halving algorithm is used.
Second, a recursive doubling algorithm is used.
2.2.2 For short messages, a recursive doubling algorithm is used.
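To make the difference between the two barrier variants concrete, here is a minimal sketch of the per-round communication partners for the dissemination (Bruck) scheme and for pairwise exchange with recursive doubling. This is the textbook form of both schedules, not code taken from SST-Macro or MVAPICH; the function names are hypothetical, and the recursive doubling part assumes a power-of-two number of ranks.

```python
def dissemination_rounds(p):
    """Textbook dissemination (Bruck) barrier schedule.

    In the round with distance d = 1, 2, 4, ..., rank r sends to (r + d) % p
    and receives from (r - d) % p. Returns one list per round holding a
    (send_to, recv_from) pair for every rank.
    """
    if p < 1:
        raise ValueError("need at least one rank")
    rounds = []
    dist = 1
    while dist < p:
        rounds.append([((r + dist) % p, (r - dist) % p) for r in range(p)])
        dist *= 2
    return rounds


def recursive_doubling_rounds(p):
    """Textbook pairwise exchange with recursive doubling.

    Assumption: p is a power of two. In the round with distance d, rank r
    exchanges with rank r XOR d. Returns one list of partners per round.
    """
    if p < 1 or p & (p - 1):
        raise ValueError("this sketch handles power-of-two rank counts only")
    rounds = []
    dist = 1
    while dist < p:
        rounds.append([r ^ dist for r in range(p)])
        dist *= 2
    return rounds


if __name__ == "__main__":
    # 4 ranks, matching "aprun -n 4" in the parameter file.
    for name, sched in (("dissemination", dissemination_rounds(4)),
                        ("recursive doubling", recursive_doubling_rounds(4))):
        print(name)
        for i, rnd in enumerate(sched):
            print("  round", i, rnd)
```

For 4 ranks both schedules need two rounds, but the first round differs: dissemination is a one-directional ring shift, while recursive doubling is a two-way exchange between neighbouring pairs. How much that changes the timing depends on how each tool models simultaneous send and receive.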
Comparing the implementations of MPI_Allreduce and MPI_Barrier, SST-Macro and MVAPICH do not use the same algorithms by default.
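As a rough illustration of why the choice of allreduce algorithm matters, the sketch below evaluates the standard latency-bandwidth (alpha-beta-gamma) cost estimates for recursive doubling and for Rabenseifner's algorithm on a power-of-two number of ranks. The function names and the numeric values of alpha, beta and gamma are assumptions chosen for illustration only; they are not the model that SST-Macro or MVAPICH actually uses.

```python
from math import log2


def recursive_doubling_cost(p, n, alpha, beta, gamma):
    """Estimated allreduce time with recursive doubling (power-of-two p).

    log2(p) rounds, each exchanging and reducing the full n-byte vector.
    alpha: per-message latency (s), beta: transfer time per byte (s),
    gamma: reduction time per byte (s).
    """
    return log2(p) * (alpha + n * beta + n * gamma)


def rabenseifner_cost(p, n, alpha, beta, gamma):
    """Estimated allreduce time with Rabenseifner's algorithm (power-of-two p).

    Reduce-scatter by recursive halving followed by allgather by recursive
    doubling: twice the number of rounds, but less data moved per rank.
    """
    frac = (p - 1) / p
    return 2 * log2(p) * alpha + 2 * frac * n * beta + frac * n * gamma


if __name__ == "__main__":
    # Illustrative assumptions: 1 us per message, 12.5 GB/s, free reduction.
    alpha, beta, gamma = 1e-6, 1 / 12.5e9, 0.0
    for n in (8, 1024, 65536, 1 << 20):
        rd = recursive_doubling_cost(4, n, alpha, beta, gamma)
        rab = rabenseifner_cost(4, n, alpha, beta, gamma)
        print(f"{n:>8} bytes  recursive doubling {rd:.3e} s  Rabenseifner {rab:.3e} s")
```

Under these assumed values, recursive doubling is cheaper for small messages and Rabenseifner's algorithm for large ones, so two tools that pick different algorithms at the same message size can report noticeably different times even with identical hardware parameters.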
I ran the osu_allreduce and osu_barrier benchmarks in both SST-Macro and MVAPICH, and the results are quite different.
As shown in the figure below. The configuration is given in parameters.ini (same hardware information).
parameters.ini (all benchmarks use the same one)
node {
  name = simple
  app1 {
    launch_cmd = aprun -n 4 -N 1
    exe=./osu_allreduce_sst
    allocation = node_id
    node_id_allocation_file = andy-node_id_allocation_topo1_4.txt
    mpi {
      max_vshort_msg_size = 16384
      max_eager_msg_size = 16384
      post_header_delay = 0.81us
      post_rdma_delay = 0.13us
      rdma_pin_latency = 0.9us
      rdma_page_delay = 1ns
      eager_cutoff = 524288
      allgather = ring
    }
  }
  proc {
    frequency = 2.6 GHz
    ncores = 8
    parallelism = 16
  }
  memory {
    name = pisces
    total_bandwidth = 12.8GB/s
    latency = 12.5ns
  }
  nic {
    name = pisces
    negligible_size = 0
    injection {
      mtu = 4096
      arbitrator = cut_through
      bandwidth = 100Gb/s
      latency = 300ns
      credits = 64KB
    }
    ejection {
      mtu = 4096
      arbitrator = cut_through
      bandwidth = 100Gb/s
      latency = 300ns
      credits = 64KB
    }
  }
  os {
    compute_scheduler = simple
    stack_size = 128KB
    stack_chunk_size = 2MB
  }
}
switch {
  router {
    name = table
  }
  name = pisces
  arbitrator = cut_through
  mtu = 512
  link {
    bandwidth = 200Gb/s
    latency = 130ns
    credits = 64KB
  }
  xbar {
    bandwidth = 16Tb/s
  }
  logp {
    bandwidth = 200Gb/s
    hop_latency = 116ns
    out_in_latency = 60ns
  }
}
topology {
  name = file
  filename = topology.json
  routing_tables = routing-table.json
}
Using a performance KPI to compare the osu_allreduce and osu_barrier results between MVAPICH and SST-Macro, the results are only 60% and 70% similar.
Hence the question: Is there a code path in MVAPICH that uses the same algorithms as SST-Macro? In other words, if the two benchmarks use the same parameter file (same hardware information), can SST-Macro produce results similar to MVAPICH?
Thanks a lot,