All of the papers below are about machine learning systems.
PipeDream: Generalized Pipeline Parallelism for DNN Training (SOSP 2019) [Paper] [Slide] [Talk]
GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism [Paper] [Code]
Efficient and Robust Parallel DNN Training through Model Parallelism on Multi-GPU Platform [Paper]
PipeMare: Asynchronous Pipeline Parallel DNN Training [Paper]
ElasticPipe: An Efficient and Dynamic Model-Parallel Solution to DNN Training [Paper]
Horizontal or Vertical? A Hybrid Approach to Large-Scale Distributed Machine Learning [Paper]
XPipe: Efficient Pipeline Model Parallelism for Multi-GPU DNN Training [Paper]
Reduce the Training Time of Neural Networks by Partitioning [Paper]
STRADS: A Distributed Framework for Scheduled Model Parallel Machine Learning (EuroSys 2016) [Paper]
Beyond Data and Model Parallelism for Deep Neural Networks [Paper]
A Generic Communication Scheduler for Distributed DNN Training Acceleration (SOSP 2019) [Paper] [BytePS]
TicTac: Accelerating Distributed Deep Learning with Communication Scheduling (SysML 2019) [Paper]
Distributed Equivalent Substitution Training for Large-Scale Recommender Systems (SysML 2019) [Paper]
Geryon: Accelerating Distributed CNN Training by Network-Level Flow Scheduling (INFOCOM 2020) [Paper]
Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters [Paper]
Gandiva: Introspective Cluster Scheduling for Deep Learning (OSDI 2018) [Paper]
Optimizing Network Performance for Distributed DNN Training on GPU Clusters: ImageNet/AlexNet Training in 1.5 Minutes [Paper]
Optimizing CNN Model Inference on CPUs (ATC 2019) [Paper]