Official repository for "IPDPS'24 QSync: Quantization-Minimized Synchronous Distributed Training Across Hybrid Devices".
QSync explores the potential of removing unnecessary quantized operations to improve training accuracy. It achieves this through the following components:
- Quantization perturbation indicator and replayer for analyzing the memory and latency of the global data-flow graph under mixed precision (Predictor)
- Allocator for selecting the optimal set of quantized operations for training (Allocator / Syncer)
- Support for low-precision backends (CUTLASS, cuDNN) (LP-PyTorch)
In particular, QSync addresses a specific practical scenario: hybrid-cluster training, which mixes inference GPUs with lower capabilities (memory, compute) and training GPUs with higher capabilities.
The provided scripts support both convolution-based and transformer-based models.
NOTE: This project is somewhat old; its kernel implementations may not match the performance of the latest PyTorch.
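To make the division of labor concrete, here is a hedged sketch of how the pieces fit together. Only `QModule(model)` appears in this README; the import path and the `predictor`/`allocator` objects and their methods below are hypothetical placeholders, not the actual QSync API.

```python
# Conceptual flow of the three components described above. Only QModule(model)
# comes from this README; `predictor` and `allocator` are hypothetical
# placeholders, not the actual QSync API.
import torch.nn as nn
from qsync.LpTorch import QModule  # import path assumed from the repo layout

model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU(), nn.Flatten())

# LP-PyTorch: wrap the model so supported ops can run on low-precision backends.
qmodel = QModule(model)

# Predictor (hypothetical): replay the global data-flow graph to estimate the
# memory and latency of a candidate mixed-precision plan.
# cost = predictor.replay(qmodel, candidate_plan)

# Allocator / Syncer (hypothetical): pick the bit-width plan that minimizes
# quantization perturbation while keeping all workers synchronous.
# plan = allocator.solve(cost, memory_budget)
```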
Clone the repo:

```bash
git clone --recursive https://github.com/bytedance/QSync.git
```
- Run `build.sh` under `dockerfile`.
- Run `run.sh`, specifying the necessary path mounting inside.
- Run `pip install -e .` in the root folder of QSync; compilation of the kernels will start.
- Some libs may be hard to install without a proxy. Change `<abspath_to_root>` in `m_install.sh` to the absolute path of the root folder, then run `bash m_install.sh`.
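After installation, a quick import check can confirm the build succeeded. This is a minimal sketch assuming the editable install exposes a top-level `qsync` package (matching the folder layout described below); adjust if it differs.

```python
# Sanity check after installation; the package name is assumed from the
# repository layout (a top-level qsync/ folder installed via `pip install -e .`).
import torch
import qsync

print(torch.__version__, torch.cuda.is_available())
print("qsync imported from:", qsync.__file__)
```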
QSync is implemented under the `qsync` folder, composed of `syncer`, `predictor`, and `LpTorch`.
- To use LpTorch and convert your model to a mixed-bit-width model, use `model = QModule(model)` (see the sketch after this list).
- See the corresponding pages for details on using the predictor and syncer.
- See the samples under `benchmark_convs` / `benchmark_transformers`.
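As referenced above, here is a minimal usage sketch for `QModule`. Only the `QModule(model)` call is taken from this README; the import path and the assumption that the wrapper behaves like a standard `nn.Module` (forward, `parameters()`) are mine.

```python
# Minimal QModule usage sketch. Only QModule(model) comes from this README;
# the import path and the nn.Module-compatible behavior of the wrapper are
# assumptions.
import torch
import torchvision
from qsync.LpTorch import QModule  # assumed import path

model = torchvision.models.resnet18().cuda()
model = QModule(model)  # convert to a mixed-bit-width model

# Training proceeds as usual with the wrapped model.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(8, 3, 224, 224, device="cuda")
y = torch.randint(0, 1000, (8,), device="cuda")

loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()
```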
Note that cross-node cost modeling is not as accurate as single-node modeling; extra effort is required to align the communication start.