diff --git a/.nojekyll b/.nojekyll
index f2cab720..1ed029f5 100644
--- a/.nojekyll
+++ b/.nojekyll
@@ -1 +1 @@
-bdc47399
\ No newline at end of file
+cde099d5
\ No newline at end of file
diff --git a/contents/contributors.html b/contents/contributors.html
index c0525556..eea99276 100644
--- a/contents/contributors.html
+++ b/contents/contributors.html
@@ -552,7 +552,7 @@
- Annie Laurie Cook
+ Annie Laurie Cook
|
Emeka Ezike
@@ -694,7 +694,7 @@ Contributors
abigailswallow
|
- Costin-Andrei Oncescu
+ Costin-Andrei Oncescu
|
Vijay Edupuganti
diff --git a/contents/responsible_ai/images/png/human_compatible_ai.png b/contents/responsible_ai/images/png/human_compatible_ai.png
deleted file mode 100644
index 151ba162..00000000
Binary files a/contents/responsible_ai/images/png/human_compatible_ai.png and /dev/null differ
diff --git a/contents/training/training.html b/contents/training/training.html
index 775838c2..9d39751f 100644
--- a/contents/training/training.html
+++ b/contents/training/training.html
@@ -997,18 +997,18 @@ “An Overview of Gradient Descent Optimization Algorithms.” ArXiv Preprint abs/1609.04747. https://arxiv.org/abs/1609.04747.
Momentum: Accumulates a velocity vector in directions of persistent gradient across iterations. This helps accelerate progress by dampening oscillations and maintains progress in consistent directions.
Nesterov Accelerated Gradient (NAG): A variant of momentum that computes gradients at the “look ahead” position rather than the current parameter position. This anticipatory update prevents overshooting while the momentum maintains the accelerated progress.
-RMSProp: Divides the learning rate by an exponentially decaying average of squared gradients. This has a similar normalizing effect as Adagrad but does not accumulate the gradients over time, avoiding a rapid decay of learning rates. (Hinton 2017)
+RMSProp: Divides the learning rate by an exponentially decaying average of squared gradients. This has a normalizing effect similar to Adagrad's but does not accumulate the gradients over time, avoiding a rapid decay of learning rates (Hinton 2017).
Hinton, Geoffrey. 2017. “Overview of Minibatch Gradient Descent.” University of Toronto; University Lecture.
Duchi, John C., Elad Hazan, and Yoram Singer. 2010. “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.” In COLT 2010 - the 23rd Conference on Learning Theory, Haifa, Israel, June 27-29, 2010, edited by Adam Tauman Kalai and Mehryar Mohri, 257–69. Omnipress. http://colt2010.haifa.il.ibm.com/papers/COLT2010proceedings.pdf\#page=265.
- Adagrad: An adaptive learning rate algorithm that maintains a per-parameter learning rate that is scaled down proportionate to the historical sum of gradients on each parameter. This helps eliminate the need to manually tune learning rates. (Duchi, Hazan, and Singer 2010)
-Adadelta: A modification to Adagrad which restricts the window of accumulated past gradients thus reducing the aggressive decay of learning rates. (Zeiler 2012)
+Adagrad: An adaptive learning rate algorithm that maintains a per-parameter learning rate scaled down in proportion to the historical sum of gradients for each parameter. This helps eliminate the need to manually tune learning rates (Duchi, Hazan, and Singer 2010).
+Adadelta: A modification of Adagrad that restricts the window of accumulated past gradients, thus reducing the aggressive decay of learning rates (Zeiler 2012).
Zeiler, Matthew D. 2012. “ADADELTA: An Adaptive Learning Rate Method.” ArXiv Preprint abs/1212.5701. https://arxiv.org/abs/1212.5701.
Kingma, Diederik P., and Jimmy Ba. 2015. “Adam: A Method for Stochastic Optimization.” In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, edited by Yoshua Bengio and Yann LeCun. http://arxiv.org/abs/1412.6980.
- Adam: - Combination of momentum and rmsprop where rmsprop modifies the learning rate based on average of recent magnitudes of gradients. Displays very fast initial progress and automatically tunes step sizes. (Kingma and Ba 2015)
+Adam: Combines momentum and RMSProp, where the RMSProp component adapts the learning rate based on an average of recent gradient magnitudes. Displays very fast initial progress and automatically tunes step sizes (Kingma and Ba 2015).
Of these methods, Adam is widely considered the go-to optimization algorithm for many deep learning tasks, consistently outperforming vanilla SGD in terms of both training speed and performance. Other optimizers may be better suited in some cases, particularly for simpler models.
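To make the update rules above concrete, here is a minimal NumPy sketch of the momentum, RMSProp, and Adam updates as they are commonly formulated. This is an illustrative sketch only: the function names, state variables, and default hyperparameters are assumptions for this example, not code from the chapter or from any specific library.

import numpy as np

def momentum_step(w, grad, v, lr=0.01, beta=0.9):
    # Accumulate a velocity vector in the direction of persistent gradients.
    v = beta * v - lr * grad
    return w + v, v

def rmsprop_step(w, grad, s, lr=0.001, rho=0.9, eps=1e-8):
    # Scale the step by an exponentially decaying average of squared gradients.
    s = rho * s + (1 - rho) * grad ** 2
    return w - lr * grad / (np.sqrt(s) + eps), s

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Combine momentum (first moment) with RMSProp-style scaling (second moment),
    # with bias correction for the early iterations (t starts at 1).
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

In practice one would rely on the optimizer implementations shipped with a framework such as PyTorch or TensorFlow rather than hand-rolled updates; the sketch is only meant to show how the methods differ in the state they carry between iterations.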
@@ -1098,9 +1098,9 @@
Grid Search: The most basic search method, where you manually define a grid of values to check for each hyperparameter. For example, checking learning rates = [0.01, 0.1, 1] and batch sizes = [32, 64, 128]. The key advantage is simplicity, but exploring all combinations leads to exponential search space explosion. Best for fine-tuning a few params.
Random Search: Instead of a grid, you define a random distribution per hyperparameter to sample values from during search. It is more efficient at searching a vast hyperparameter space. However, still somewhat arbitrary compared to more adaptive methods.
-Bayesian Optimization: An advanced probabilistic approach for adaptive exploration based on a surrogate function to model performance over iterations. It is very sample efficient - finds highly optimized hyperparameters in fewer evaluation steps. Requires more investment in setup. (Snoek, Larochelle, and Adams 2012)
+Bayesian Optimization: An advanced probabilistic approach for adaptive exploration based on a surrogate function that models performance over iterations. It is very sample-efficient, finding highly optimized hyperparameters in fewer evaluation steps, but requires more investment in setup (Snoek, Larochelle, and Adams 2012).
Evolutionary Algorithms: Mimic natural selection principles - generate populations of hyperparameter combinations, evolve them over time based on performance. These algorithms offer robust search capabilities better suited for complex response surfaces. But many iterations required for reasonable convergence.
-Neural Architecture Search: An approach to designing well-performing architectures for neural networks. Traditionally, NAS approaches use some form of reinforcement learning to propose neural network architectures which are then repeatedly evaluated. (Zoph and Le 2023)
+Neural Architecture Search: An approach to designing well-performing architectures for neural networks. Traditionally, NAS approaches use some form of reinforcement learning to propose neural network architectures which are then repeatedly evaluated (Zoph and Le 2023).
Snoek, Jasper, Hugo Larochelle, and Ryan P. Adams. 2012. “Practical Bayesian Optimization of Machine Learning Algorithms.” In Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a Meeting Held December 3-6, 2012, Lake Tahoe, Nevada, United States, edited by Peter L. Bartlett, Fernando C. N. Pereira, Christopher J. C. Burges, Léon Bottou, and Kilian Q. Weinberger, 2960–68. https://proceedings.neurips.cc/paper/2012/hash/05311655a15b75fab86956663e1819cd-Abstract.html.
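As a rough illustration of the grid and random search strategies described above, the short sketch below enumerates the example grid from the text (learning rates [0.01, 0.1, 1] and batch sizes [32, 64, 128]) and, for comparison, samples the same number of random configurations. The train_and_evaluate function is a placeholder assumption standing in for an actual training run; it is not part of the chapter.

import itertools, random

learning_rates = [0.01, 0.1, 1.0]
batch_sizes = [32, 64, 128]

def train_and_evaluate(lr, bs):
    # Placeholder: in practice, train a model with (lr, bs) and return validation accuracy.
    return -abs(lr - 0.1) - abs(bs - 64) / 64.0

# Grid search: evaluate every combination (3 x 3 = 9 runs; grows exponentially with more knobs).
grid_best = max(itertools.product(learning_rates, batch_sizes),
                key=lambda cfg: train_and_evaluate(*cfg))

# Random search: draw configurations from distributions instead of a fixed grid.
random.seed(0)
random_configs = [(10 ** random.uniform(-3, 0), random.choice([32, 64, 128, 256]))
                  for _ in range(9)]
random_best = max(random_configs, key=lambda cfg: train_and_evaluate(*cfg))

print("grid best:", grid_best)
print("random best:", random_best)

Bayesian optimization and evolutionary methods replace the blind sampling step with a model of which configurations look promising, which is where their sample efficiency comes from.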
@@ -1396,7 +1396,7 @@ Batch Size
Hardware Characteristics
Modern hardware like CPUs and GPUs are highly optimized for computational throughput as opposed to memory bandwidth. For example, high-end H100 Tensor Core GPUs can deliver over 60 TFLOPS of double-precision performance but only provide up to 3 TB/s of memory bandwidth. This means there is almost a 20x imbalance between arithmetic units and memory access. Consequently, for hardware like GPU accelerators, neural network training workloads need to be made as computationally intensive as possible in order to fully utilize the available resources.
This further motivates the need for using large batch sizes during training. When using a small batch, the matrix multiplication is bounded by memory bandwidth, underutilizing the abundant compute resources. However, with sufficiently large batches, we can shift the bottleneck more towards computation and attain much higher arithmetic intensity. For instance, batches of 256 or 512 samples may be needed to saturate a high-end GPU. The downside is that larger batches provide less frequent parameter updates, which can impact convergence. Still, the parameter serves as an important tuning knob to balance memory vs compute limitations.
- Therefore, given the imbalanced compute-memory architectures of modern hardware, employing large batch sizes is essential to alleviate bottlenecks and maximize throughput. The subsequent software and algorithms also need to accommodate such batch sizes, as mentioned, since larger batch sizes may have diminishing returns towards the convergence of the network. Using very small batch sizes may lead to suboptimal hardware utilization, ultimately limiting training efficiency. Scaling up to large batch sizes is a topic of research and has been explored in various works that aim to do large scale training. (You et al. 2018)
+ Therefore, given the imbalanced compute-memory architectures of modern hardware, employing large batch sizes is essential to alleviate bottlenecks and maximize throughput. As mentioned, the surrounding software and algorithms also need to accommodate such batch sizes, since very large batches may yield diminishing returns for network convergence. Using very small batch sizes may lead to suboptimal hardware utilization, ultimately limiting training efficiency. Scaling up to large batch sizes is an active topic of research and has been explored in various works on large-scale training (You et al. 2018).
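To see why larger batches shift the bottleneck from memory toward compute, the following sketch computes the arithmetic intensity of a single layer's matrix multiply, counting multiply-accumulates against the matrix elements that must be moved. It follows the N x M by M x B formulation used elsewhere in this chapter; the specific dimensions and batch sizes are illustrative assumptions.

def arithmetic_intensity(N, M, B):
    # (N x M) weight matrix times (M x B) activation batch.
    macs = N * M * B                 # multiply-accumulate operations
    elements_moved = N * M + M * B   # inputs read (outputs ignored, as in the chapter's estimate)
    return macs / elements_moved

N, M = 1000, 500                     # illustrative layer output/input widths
for B in (1, 32, 256, 512):
    print(f"batch size {B:4d}: ~{arithmetic_intensity(N, M, B):.1f} MACs per element moved")

The intensity grows with the batch size, which is why, as noted above, batches of 256 or 512 samples may be needed to saturate a high-end GPU.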
diff --git a/search.json b/search.json
index e445b651..b91c1b67 100644
--- a/search.json
+++ b/search.json
@@ -473,14 +473,14 @@
"href": "contents/training/training.html#optimization-algorithms",
"title": "8 AI Training",
"section": "8.4 Optimization Algorithms",
- "text": "8.4 Optimization Algorithms\nStochastic gradient descent (SGD) is a simple yet powerful optimization algorithm commonly used to train machine learning models. SGD works by estimating the gradient of the loss function with respect to the model parameters using a single training example, and then updating the parameters in the direction that reduces the loss.\nWhile conceptually straightforward, SGD suffers from a few shortcomings. First, choosing a proper learning rate can be difficult - too small and progress is very slow, too large and parameters may oscillate and fail to converge. Second, SGD treats all parameters equally and independently, which may not be ideal in all cases. Finally, vanilla SGD uses only first order gradient information which results in slow progress on ill-conditioned problems.\n\n8.4.1 Optimizations\nOver the years, various optimizations have been proposed to accelerate and improve upon vanilla SGD. Ruder (2016) gives an excellent overview of the different optimizers. Briefly, several commonly used SGD optimization techniques include:\n\nRuder, Sebastian. 2016. “An Overview of Gradient Descent Optimization Algorithms.” ArXiv Preprint abs/1609.04747. https://arxiv.org/abs/1609.04747.\nMomentum: Accumulates a velocity vector in directions of persistent gradient across iterations. This helps accelerate progress by dampening oscillations and maintains progress in consistent directions.\nNesterov Accelerated Gradient (NAG): A variant of momentum that computes gradients at the “look ahead” position rather than the current parameter position. This anticipatory update prevents overshooting while the momentum maintains the accelerated progress.\nRMSProp: Divides the learning rate by an exponentially decaying average of squared gradients. This has a similar normalizing effect as Adagrad but does not accumulate the gradients over time, avoiding a rapid decay of learning rates. (Hinton 2017)\n\nHinton, Geoffrey. 2017. “Overview of Minibatch Gradient Descent.” University of Toronto; University Lecture.\n\nDuchi, John C., Elad Hazan, and Yoram Singer. 2010. “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.” In COLT 2010 - the 23rd Conference on Learning Theory, Haifa, Israel, June 27-29, 2010, edited by Adam Tauman Kalai and Mehryar Mohri, 257–69. Omnipress. http://colt2010.haifa.il.ibm.com/papers/COLT2010proceedings.pdf\\#page=265.\nAdagrad: An adaptive learning rate algorithm that maintains a per-parameter learning rate that is scaled down proportionate to the historical sum of gradients on each parameter. This helps eliminate the need to manually tune learning rates. (Duchi, Hazan, and Singer 2010)\nAdadelta: A modification to Adagrad which restricts the window of accumulated past gradients thus reducing the aggressive decay of learning rates. (Zeiler 2012)\n\nZeiler, Matthew D. 2012. “Reinforcement and Systemic Machine Learning for Decision Making.” Wiley. https://doi.org/10.1002/9781118266502.ch6.\n\nKingma, Diederik P., and Jimmy Ba. 2015. “Adam: A Method for Stochastic Optimization.” In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, edited by Yoshua Bengio and Yann LeCun. http://arxiv.org/abs/1412.6980.\nAdam: - Combination of momentum and rmsprop where rmsprop modifies the learning rate based on average of recent magnitudes of gradients. Displays very fast initial progress and automatically tunes step sizes. 
(Kingma and Ba 2015)\nOf these methods, Adam is widely considered the go-to optimization algorithm for many deep learning tasks, consistently outperforming vanilla SGD in terms of both training speed and performance. Other optimizers may be better suited in some cases, particularly for simpler models.\n\n\n8.4.2 Trade-offs\nHere is a pros and cons table for some of the main optimization algorithms for neural network training:\n\n\n\n\n\n\n\n\nAlgorithm\nPros\nCons\n\n\n\n\nMomentum\nFaster convergence due to acceleration along gradients Less oscillation than vanilla SGD\nRequires tuning of momentum parameter\n\n\nNesterov Accelerated Gradient (NAG)\nFaster than standard momentum in some cases Anticipatory updates prevent overshooting\nMore complex to understand intuitively\n\n\nAdagrad\nEliminates need to manually tune learning rates Performs well on sparse gradients\nLearning rate may decay too quickly on dense gradients\n\n\nAdadelta\nLess aggressive learning rate decay than Adagrad\nStill sensitive to initial learning rate value\n\n\nRMSProp\nAutomatically adjusts learning rates Works well in practice\nNo major downsides\n\n\nAdam\nCombination of momentum and adaptive learning rates Efficient and fast convergence\nSlightly worse generalization performance in some cases\n\n\nAMSGrad\nImprovement to Adam addressing generalization issue\nNot as extensively used/tested as Adam\n\n\n\n\n\n8.4.3 Benchmarking Algorithms\nNo single method is best for all problem types. This means we need a comprehensive benchmarking to identify the most effective optimizer for specific datasets and models. The performance of algorithms like Adam, RMSProp, and Momentum varies due to factors such as batch size, learning rate schedules, model architecture, data distribution, and regularization. These variations underline the importance of evaluating each optimizer under diverse conditions.\nTake Adam, for example, which often excels in computer vision tasks, in contrast to RMSProp that may show better generalization in certain natural language processing tasks. Momentum’s strength lies in its acceleration in scenarios with consistent gradient directions, whereas Adagrad’s adaptive learning rates are more suited for sparse gradient problems.\nThis wide array of interactions among different optimizers demonstrates the challenge in declaring a single, universally superior algorithm. Each optimizer has unique strengths, making it crucial to empirically evaluate a range of methods to discover their optimal application conditions.\nA comprehensive benchmarking approach should assess not just the speed of convergence but also factors like generalization error, stability, hyperparameter sensitivity, and computational efficiency, among others. This entails monitoring training and validation learning curves across multiple runs and comparing optimizers on a variety of datasets and models to understand their strengths and weaknesses.\nAlgoPerf, introduced by Dahl et al. (2021), addresses the need for a robust benchmarking system. This platform evaluates optimizer performance using criteria such as training loss curves, generalization error, sensitivity to hyperparameters, and computational efficiency. AlgoPerf tests various optimization methods, including Adam, LAMB, and Adafactor, across different model types like CNNs and RNNs/LSTMs on established datasets. 
It utilizes containerization and automatic metric collection to minimize inconsistencies and allows for controlled experiments across thousands of configurations, providing a reliable basis for comparing different optimizers.\n\nDahl, George E, Frank Schneider, Zachary Nado, Naman Agarwal, Chandramouli Shama Sastry, Philipp Hennig, Sourabh Medapati, et al. 2021. “CSF Findings in Acute NMDAR and LGI1 AntibodyAssociated Autoimmune Encephalitis.” Neurology Neuroimmunology &Amp; Neuroinflammation 8 (6). https://doi.org/10.1212/nxi.0000000000001086.\nThe insights gained from AlgoPerf and similar benchmarks are invaluable for guiding the optimal choice or tuning of optimizers. By enabling reproducible evaluations, these benchmarks contribute to a deeper understanding of each optimizer’s performance, paving the way for future innovations and accelerated progress in the field."
+ "text": "8.4 Optimization Algorithms\nStochastic gradient descent (SGD) is a simple yet powerful optimization algorithm commonly used to train machine learning models. SGD works by estimating the gradient of the loss function with respect to the model parameters using a single training example, and then updating the parameters in the direction that reduces the loss.\nWhile conceptually straightforward, SGD suffers from a few shortcomings. First, choosing a proper learning rate can be difficult - too small and progress is very slow, too large and parameters may oscillate and fail to converge. Second, SGD treats all parameters equally and independently, which may not be ideal in all cases. Finally, vanilla SGD uses only first order gradient information which results in slow progress on ill-conditioned problems.\n\n8.4.1 Optimizations\nOver the years, various optimizations have been proposed to accelerate and improve upon vanilla SGD. Ruder (2016) gives an excellent overview of the different optimizers. Briefly, several commonly used SGD optimization techniques include:\n\nRuder, Sebastian. 2016. “An Overview of Gradient Descent Optimization Algorithms.” ArXiv Preprint abs/1609.04747. https://arxiv.org/abs/1609.04747.\nMomentum: Accumulates a velocity vector in directions of persistent gradient across iterations. This helps accelerate progress by dampening oscillations and maintains progress in consistent directions.\nNesterov Accelerated Gradient (NAG): A variant of momentum that computes gradients at the “look ahead” position rather than the current parameter position. This anticipatory update prevents overshooting while the momentum maintains the accelerated progress.\nRMSProp: Divides the learning rate by an exponentially decaying average of squared gradients. This has a similar normalizing effect as Adagrad but does not accumulate the gradients over time, avoiding a rapid decay of learning rates (Hinton 2017).\n\nHinton, Geoffrey. 2017. “Overview of Minibatch Gradient Descent.” University of Toronto; University Lecture.\n\nDuchi, John C., Elad Hazan, and Yoram Singer. 2010. “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.” In COLT 2010 - the 23rd Conference on Learning Theory, Haifa, Israel, June 27-29, 2010, edited by Adam Tauman Kalai and Mehryar Mohri, 257–69. Omnipress. http://colt2010.haifa.il.ibm.com/papers/COLT2010proceedings.pdf\\#page=265.\nAdagrad: An adaptive learning rate algorithm that maintains a per-parameter learning rate that is scaled down proportionate to the historical sum of gradients on each parameter. This helps eliminate the need to manually tune learning rates (Duchi, Hazan, and Singer 2010).\nAdadelta: A modification to Adagrad which restricts the window of accumulated past gradients thus reducing the aggressive decay of learning rates (Zeiler 2012).\n\nZeiler, Matthew D. 2012. “Reinforcement and Systemic Machine Learning for Decision Making.” Wiley. https://doi.org/10.1002/9781118266502.ch6.\n\nKingma, Diederik P., and Jimmy Ba. 2015. “Adam: A Method for Stochastic Optimization.” In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, edited by Yoshua Bengio and Yann LeCun. http://arxiv.org/abs/1412.6980.\nAdam: - Combination of momentum and rmsprop where rmsprop modifies the learning rate based on average of recent magnitudes of gradients. 
Displays very fast initial progress and automatically tunes step sizes (Kingma and Ba 2015).\nOf these methods, Adam is widely considered the go-to optimization algorithm for many deep learning tasks, consistently outperforming vanilla SGD in terms of both training speed and performance. Other optimizers may be better suited in some cases, particularly for simpler models.\n\n\n8.4.2 Trade-offs\nHere is a pros and cons table for some of the main optimization algorithms for neural network training:\n\n\n\n\n\n\n\n\nAlgorithm\nPros\nCons\n\n\n\n\nMomentum\nFaster convergence due to acceleration along gradients Less oscillation than vanilla SGD\nRequires tuning of momentum parameter\n\n\nNesterov Accelerated Gradient (NAG)\nFaster than standard momentum in some cases Anticipatory updates prevent overshooting\nMore complex to understand intuitively\n\n\nAdagrad\nEliminates need to manually tune learning rates Performs well on sparse gradients\nLearning rate may decay too quickly on dense gradients\n\n\nAdadelta\nLess aggressive learning rate decay than Adagrad\nStill sensitive to initial learning rate value\n\n\nRMSProp\nAutomatically adjusts learning rates Works well in practice\nNo major downsides\n\n\nAdam\nCombination of momentum and adaptive learning rates Efficient and fast convergence\nSlightly worse generalization performance in some cases\n\n\nAMSGrad\nImprovement to Adam addressing generalization issue\nNot as extensively used/tested as Adam\n\n\n\n\n\n8.4.3 Benchmarking Algorithms\nNo single method is best for all problem types. This means we need a comprehensive benchmarking to identify the most effective optimizer for specific datasets and models. The performance of algorithms like Adam, RMSProp, and Momentum varies due to factors such as batch size, learning rate schedules, model architecture, data distribution, and regularization. These variations underline the importance of evaluating each optimizer under diverse conditions.\nTake Adam, for example, which often excels in computer vision tasks, in contrast to RMSProp that may show better generalization in certain natural language processing tasks. Momentum’s strength lies in its acceleration in scenarios with consistent gradient directions, whereas Adagrad’s adaptive learning rates are more suited for sparse gradient problems.\nThis wide array of interactions among different optimizers demonstrates the challenge in declaring a single, universally superior algorithm. Each optimizer has unique strengths, making it crucial to empirically evaluate a range of methods to discover their optimal application conditions.\nA comprehensive benchmarking approach should assess not just the speed of convergence but also factors like generalization error, stability, hyperparameter sensitivity, and computational efficiency, among others. This entails monitoring training and validation learning curves across multiple runs and comparing optimizers on a variety of datasets and models to understand their strengths and weaknesses.\nAlgoPerf, introduced by Dahl et al. (2021), addresses the need for a robust benchmarking system. This platform evaluates optimizer performance using criteria such as training loss curves, generalization error, sensitivity to hyperparameters, and computational efficiency. AlgoPerf tests various optimization methods, including Adam, LAMB, and Adafactor, across different model types like CNNs and RNNs/LSTMs on established datasets. 
It utilizes containerization and automatic metric collection to minimize inconsistencies and allows for controlled experiments across thousands of configurations, providing a reliable basis for comparing different optimizers.\n\nDahl, George E, Frank Schneider, Zachary Nado, Naman Agarwal, Chandramouli Shama Sastry, Philipp Hennig, Sourabh Medapati, et al. 2021. “CSF Findings in Acute NMDAR and LGI1 AntibodyAssociated Autoimmune Encephalitis.” Neurology Neuroimmunology &Amp; Neuroinflammation 8 (6). https://doi.org/10.1212/nxi.0000000000001086.\nThe insights gained from AlgoPerf and similar benchmarks are invaluable for guiding the optimal choice or tuning of optimizers. By enabling reproducible evaluations, these benchmarks contribute to a deeper understanding of each optimizer’s performance, paving the way for future innovations and accelerated progress in the field."
},
{
"objectID": "contents/training/training.html#hyperparameter-tuning",
"href": "contents/training/training.html#hyperparameter-tuning",
"title": "8 AI Training",
"section": "8.5 Hyperparameter Tuning",
- "text": "8.5 Hyperparameter Tuning\nHyperparameters are important settings in machine learning models that have a large impact on how well your models ultimately perform. Unlike other model parameters that are learned during training, hyperparameters are specified by the data scientists or machine learning engineers prior to training the model.\nChoosing the right hyperparameter values is crucial for enabling your models to effectively learn patterns from data. Some examples of key hyperparameters across ML algorithms include:\n\nNeural networks: Learning rate, batch size, number of hidden units, activation functions\nSupport vector machines: Regularization strength, kernel type and parameters\nRandom forests: Number of trees, tree depth\nK-means: Number of clusters\n\nThe problem is that there are no reliable rules-of-thumb for choosing optimal hyperparameter configurations - you typically have to try out different values and evaluate performance. This process is called hyperparameter tuning.\nIn the early years of modern deep learning, researchers were still grappling with unstable and slow convergence issues. Common pain points included training losses fluctuating wildly, gradients exploding or vanishing, and extensive trial-and-error needed to train networks reliably. As a result, an early focal point was using hyperparameters to control model optimization. For instance, seminal techniques like batch normalization allowed much faster model convergence by tuning aspects of internal covariate shift. Adaptive learning rate methods also mitigated the need for extensive manual schedules. These addressed optimization issues during training like uncontrolled gradient divergence. Carefully adapted learning rates are also the primary control factor even today for achieving rapid and stable convegence.\nAs computational capacity expanded exponentially in subsequent years, much larger models could be trained without falling prey to pure numerical optimization issues. The focus shifted towards generalization - though efficient convergence was a core prerequisite. State-of-the-art techniques like Transformers brought in parameters in billions. At such sizes, hyperparameters around capacity, regularization, ensembling etc. took center stage for tuning rather than only raw convergence metrics.\nThe lesson is that understanding acceleration and stability of the optimization process itself constitutes the groundwork. Even today initialization schemes, batch sizes, weight decays and other training hyperparameters remain indispensable. Mastering fast and flawless convergence allows practitioners to expand focus on emerging needs around tuning for metrics like accuracy, robustness and efficiency at scale.\n\n8.5.1 Search Algorithms\nWhen it comes to the critical process of hyperparameter tuning, there are several sophisticated algorithms machine learning practitioners rely on to systematically search through the vast space of possible model configurations. Some of the most prominent hyperparameter search algorithms include:\n\nGrid Search: The most basic search method, where you manually define a grid of values to check for each hyperparameter. For example, checking learning rates = [0.01, 0.1, 1] and batch sizes = [32, 64, 128]. The key advantage is simplicity, but exploring all combinations leads to exponential search space explosion. Best for fine-tuning a few params.\nRandom Search: Instead of a grid, you define a random distribution per hyperparameter to sample values from during search. 
It is more efficient at searching a vast hyperparameter space. However, still somewhat arbitrary compared to more adaptive methods.\nBayesian Optimization: An advanced probabilistic approach for adaptive exploration based on a surrogate function to model performance over iterations. It is very sample efficient - finds highly optimized hyperparameters in fewer evaluation steps. Requires more investment in setup. (Snoek, Larochelle, and Adams 2012)\nEvolutionary Algorithms: Mimic natural selection principles - generate populations of hyperparameter combinations, evolve them over time based on performance. These algorithms offer robust search capabilities better suited for complex response surfaces. But many iterations required for reasonable convergence.\nNeural Architecture Search: An approach to designing well-performing architectures for neural networks. Traditionally, NAS approaches use some form of reinforcement learning to propose neural network architectures which are then repeatedly evaluated. (Zoph and Le 2023)\n\n\nSnoek, Jasper, Hugo Larochelle, and Ryan P. Adams. 2012. “Practical Bayesian Optimization of Machine Learning Algorithms.” In Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a Meeting Held December 3-6, 2012, Lake Tahoe, Nevada, United States, edited by Peter L. Bartlett, Fernando C. N. Pereira, Christopher J. C. Burges, Léon Bottou, and Kilian Q. Weinberger, 2960–68. https://proceedings.neurips.cc/paper/2012/hash/05311655a15b75fab86956663e1819cd-Abstract.html.\n\nZoph, Barret, and Quoc V. Le. 2023. “Cybernetical Intelligence.” Wiley. https://doi.org/10.1002/9781394217519.ch17.\n\n\n8.5.2 System Implications\nHyperparameter tuning can significantly impact time to convergence during model training, directly affecting overall runtime. Selecting the right values for key training hyperparameters is crucial for efficient model convergence. For example, the learning rate hyperparameter controls the step size during gradient descent optimization. Setting a properly tuned learning rate schedule ensures the optimization algorithm converges quickly towards a good minimum. Too small a learning rate leads to painfully slow convergence, while too large a value causes the losses to fluctuate wildly. Proper tuning ensures rapid movement towards optimal weights and biases.\nSimilarly, batch size for stochastic gradient descent impacts convergence stability. The right batch size smooths out fluctuations in parameter updates to approach the minimum faster. Insufficient batch sizes cause noisy convergence, while large batch sizes fail to generalize and also slow down convergence due to less frequent parameter updates. Tuning hyperparameters for faster convergence and reduced training duration has direct implications on cost and resource requirements for scaling machine learning systems:\n\nLower computatioanal costs: Shorter time to convergence means lower computational costs for training models. ML training often leverages large cloud compute instances like GPU and TPU clusters that incur heavy charges per hour. Minimizing training time directly brings down this resource rental cost that tends to dominate ML budgets for organizations. Quicker iteration also lets data scientists experiment more freely within the same budget.\nReduced training time: Reduced training time unlocks opportunities to train more models using the same computational budget. 
Optimized hyperparameters stretch available resources further allowing businesses to develop and experiment with more models under resource constraints to maximize performance.\nResource efficiency: Quicker training allows allocating smaller compute instances in cloud since models require access to the resources for a shorter duration. For example, a 1-hour training job allows using less powerful GPU instances compared to multi-hour training requiring sustained compute access over longer intervals. This achieves cost savings especially for large workloads.\n\nThere are other benefits as well. For instance, faster convergence reduces pressure on ML engineering teams around provisioning training resources. Simple model retraining routines can use lower powered resources as opposed to requesting for access to high priority queues for constrained production-grade GPU clusters. This frees up deployment resources for other applications.\n\n\n8.5.3 Auto Tuners\nThere are a wide array of commercial offerings to help with hyperparameter tuning given how important it is. We will briefly touch on two examples focused on optimization for machine learning models targeting microcontrollers and another focused on cloud-scale ML.\n\nBigML\nThere are several commercial auto tuning platforms available to deal with this problem. One such solution is Google’s Vertex AI Cloud, which has extensive integrated support for state-of-the-art tuning techniques.\nOne of the most salient capabilities offered by Google’s Vertex AI managed machine learning platform is efficient, integrated hyperparameter tuning for model development. Successfully training performant ML models requires identifying optimal configurations for a set of external hyperparameters that dictate model behavior - which poses a challenging high-dimensional search problem. Vertex AI aims to simplify this through Automated Machine Learning (AutoML) tooling.\nSpecifically, data scientists can leverage Vertex AI’s hyperparameter tuning engines by providing a labeled dataset and choosing a model type such as Neural Network or Random Forest classifier. Vertex launches a Hyperparameter Search job transparently on the backend, fully handling resource provisioning, model training, metric tracking and result analysis automatically using advanced optimization algorithms.\nUnder the hood, Vertex AutoML employs a wide array of different search strategies to intelligently explore the most promising hyperparameter configurations based on previous evaluation results. Compared to standard Grid Search or Random Search methods, Bayesian Optimization offers superior sample efficiency requiring fewer training iterations to arrive at optimized model quality. For more complex neural architecture search spaces, Vertex AutoML utilizes Population Based Training approaches which evolve candidate solutions over time analogous to natural selection principles.\nVertex AI aims to democratize state-of-the-art hyperparameter search techniques at cloud scale for all ML developers, abstracting away the underlying orchestration and execution complexity. Users focus solely on their dataset, model requirements and accuracy goals while Vertex manages the tuning cycle, resource allocation, model training, accuracy tracking and artifact storage under the hood. 
The end result is getting deployment-ready, optimized ML models faster for the target problem.\n\n\nTinyML\nEdge Impulse’s Efficient On-device Neural Network Tuner (EON Tuner) is an automated hyperparameter optimization tool designed specifically for developing machine learning models for microcontrollers. The EON Tuner streamlines the model development process by automatically finding the best neural network configuration for efficient and accurate deployment on resource-constrained devices.\nThe key functionality of the EON Tuner is as follows. First, developers define the model hyperparameters, such as number of layers, nodes per layer, activation functions, and learning rate annealing schedule. These parameters constitute the search space that will be optimized. Next, the target microcontroller platform is selected, providing embedded hardware constraints. The user can also specify optimization objectives, such as minimizing memory footprint, lowering latency, reducing power consumption or maximizing accuracy.\nWith the search space and optimization goals defined, the EON Tuner leverages Bayesian hyperparameter optimization to intelligently explore possible configurations. Each prospective configuration is automatically implemented as a full model specification, trained and evaluated for quality metrics. The continual process balances exploration and exploitation to arrive at optimized settings tailored to the developer’s chosen chip architecture and performance requirements.\nBy automatically tuning models for embedded deployment, the EON Tuner frees machine learning engineers from the demandingly iterative process of hand-tuning models. The tool integrates seamlessly into the Edge Impulse workflow for taking models from concept to efficiently optimized implementations on microcontrollers. The expertise encapsulated in EON Tuner regarding ML model optimization for microcontrollers ensures beginner and experienced developers alike can rapidly iterate to models fitting their project needs."
+ "text": "8.5 Hyperparameter Tuning\nHyperparameters are important settings in machine learning models that have a large impact on how well your models ultimately perform. Unlike other model parameters that are learned during training, hyperparameters are specified by the data scientists or machine learning engineers prior to training the model.\nChoosing the right hyperparameter values is crucial for enabling your models to effectively learn patterns from data. Some examples of key hyperparameters across ML algorithms include:\n\nNeural networks: Learning rate, batch size, number of hidden units, activation functions\nSupport vector machines: Regularization strength, kernel type and parameters\nRandom forests: Number of trees, tree depth\nK-means: Number of clusters\n\nThe problem is that there are no reliable rules-of-thumb for choosing optimal hyperparameter configurations - you typically have to try out different values and evaluate performance. This process is called hyperparameter tuning.\nIn the early years of modern deep learning, researchers were still grappling with unstable and slow convergence issues. Common pain points included training losses fluctuating wildly, gradients exploding or vanishing, and extensive trial-and-error needed to train networks reliably. As a result, an early focal point was using hyperparameters to control model optimization. For instance, seminal techniques like batch normalization allowed much faster model convergence by tuning aspects of internal covariate shift. Adaptive learning rate methods also mitigated the need for extensive manual schedules. These addressed optimization issues during training like uncontrolled gradient divergence. Carefully adapted learning rates are also the primary control factor even today for achieving rapid and stable convegence.\nAs computational capacity expanded exponentially in subsequent years, much larger models could be trained without falling prey to pure numerical optimization issues. The focus shifted towards generalization - though efficient convergence was a core prerequisite. State-of-the-art techniques like Transformers brought in parameters in billions. At such sizes, hyperparameters around capacity, regularization, ensembling etc. took center stage for tuning rather than only raw convergence metrics.\nThe lesson is that understanding acceleration and stability of the optimization process itself constitutes the groundwork. Even today initialization schemes, batch sizes, weight decays and other training hyperparameters remain indispensable. Mastering fast and flawless convergence allows practitioners to expand focus on emerging needs around tuning for metrics like accuracy, robustness and efficiency at scale.\n\n8.5.1 Search Algorithms\nWhen it comes to the critical process of hyperparameter tuning, there are several sophisticated algorithms machine learning practitioners rely on to systematically search through the vast space of possible model configurations. Some of the most prominent hyperparameter search algorithms include:\n\nGrid Search: The most basic search method, where you manually define a grid of values to check for each hyperparameter. For example, checking learning rates = [0.01, 0.1, 1] and batch sizes = [32, 64, 128]. The key advantage is simplicity, but exploring all combinations leads to exponential search space explosion. Best for fine-tuning a few params.\nRandom Search: Instead of a grid, you define a random distribution per hyperparameter to sample values from during search. 
It is more efficient at searching a vast hyperparameter space. However, still somewhat arbitrary compared to more adaptive methods.\nBayesian Optimization: An advanced probabilistic approach for adaptive exploration based on a surrogate function to model performance over iterations. It is very sample efficient - finds highly optimized hyperparameters in fewer evaluation steps. Requires more investment in setup (Snoek, Larochelle, and Adams 2012).\nEvolutionary Algorithms: Mimic natural selection principles - generate populations of hyperparameter combinations, evolve them over time based on performance. These algorithms offer robust search capabilities better suited for complex response surfaces. But many iterations required for reasonable convergence.\nNeural Architecture Search: An approach to designing well-performing architectures for neural networks. Traditionally, NAS approaches use some form of reinforcement learning to propose neural network architectures which are then repeatedly evaluated (Zoph and Le 2023).\n\n\nSnoek, Jasper, Hugo Larochelle, and Ryan P. Adams. 2012. “Practical Bayesian Optimization of Machine Learning Algorithms.” In Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a Meeting Held December 3-6, 2012, Lake Tahoe, Nevada, United States, edited by Peter L. Bartlett, Fernando C. N. Pereira, Christopher J. C. Burges, Léon Bottou, and Kilian Q. Weinberger, 2960–68. https://proceedings.neurips.cc/paper/2012/hash/05311655a15b75fab86956663e1819cd-Abstract.html.\n\nZoph, Barret, and Quoc V. Le. 2023. “Cybernetical Intelligence.” Wiley. https://doi.org/10.1002/9781394217519.ch17.\n\n\n8.5.2 System Implications\nHyperparameter tuning can significantly impact time to convergence during model training, directly affecting overall runtime. Selecting the right values for key training hyperparameters is crucial for efficient model convergence. For example, the learning rate hyperparameter controls the step size during gradient descent optimization. Setting a properly tuned learning rate schedule ensures the optimization algorithm converges quickly towards a good minimum. Too small a learning rate leads to painfully slow convergence, while too large a value causes the losses to fluctuate wildly. Proper tuning ensures rapid movement towards optimal weights and biases.\nSimilarly, batch size for stochastic gradient descent impacts convergence stability. The right batch size smooths out fluctuations in parameter updates to approach the minimum faster. Insufficient batch sizes cause noisy convergence, while large batch sizes fail to generalize and also slow down convergence due to less frequent parameter updates. Tuning hyperparameters for faster convergence and reduced training duration has direct implications on cost and resource requirements for scaling machine learning systems:\n\nLower computatioanal costs: Shorter time to convergence means lower computational costs for training models. ML training often leverages large cloud compute instances like GPU and TPU clusters that incur heavy charges per hour. Minimizing training time directly brings down this resource rental cost that tends to dominate ML budgets for organizations. Quicker iteration also lets data scientists experiment more freely within the same budget.\nReduced training time: Reduced training time unlocks opportunities to train more models using the same computational budget. 
Optimized hyperparameters stretch available resources further allowing businesses to develop and experiment with more models under resource constraints to maximize performance.\nResource efficiency: Quicker training allows allocating smaller compute instances in cloud since models require access to the resources for a shorter duration. For example, a 1-hour training job allows using less powerful GPU instances compared to multi-hour training requiring sustained compute access over longer intervals. This achieves cost savings especially for large workloads.\n\nThere are other benefits as well. For instance, faster convergence reduces pressure on ML engineering teams around provisioning training resources. Simple model retraining routines can use lower powered resources as opposed to requesting for access to high priority queues for constrained production-grade GPU clusters. This frees up deployment resources for other applications.\n\n\n8.5.3 Auto Tuners\nThere are a wide array of commercial offerings to help with hyperparameter tuning given how important it is. We will briefly touch on two examples focused on optimization for machine learning models targeting microcontrollers and another focused on cloud-scale ML.\n\nBigML\nThere are several commercial auto tuning platforms available to deal with this problem. One such solution is Google’s Vertex AI Cloud, which has extensive integrated support for state-of-the-art tuning techniques.\nOne of the most salient capabilities offered by Google’s Vertex AI managed machine learning platform is efficient, integrated hyperparameter tuning for model development. Successfully training performant ML models requires identifying optimal configurations for a set of external hyperparameters that dictate model behavior - which poses a challenging high-dimensional search problem. Vertex AI aims to simplify this through Automated Machine Learning (AutoML) tooling.\nSpecifically, data scientists can leverage Vertex AI’s hyperparameter tuning engines by providing a labeled dataset and choosing a model type such as Neural Network or Random Forest classifier. Vertex launches a Hyperparameter Search job transparently on the backend, fully handling resource provisioning, model training, metric tracking and result analysis automatically using advanced optimization algorithms.\nUnder the hood, Vertex AutoML employs a wide array of different search strategies to intelligently explore the most promising hyperparameter configurations based on previous evaluation results. Compared to standard Grid Search or Random Search methods, Bayesian Optimization offers superior sample efficiency requiring fewer training iterations to arrive at optimized model quality. For more complex neural architecture search spaces, Vertex AutoML utilizes Population Based Training approaches which evolve candidate solutions over time analogous to natural selection principles.\nVertex AI aims to democratize state-of-the-art hyperparameter search techniques at cloud scale for all ML developers, abstracting away the underlying orchestration and execution complexity. Users focus solely on their dataset, model requirements and accuracy goals while Vertex manages the tuning cycle, resource allocation, model training, accuracy tracking and artifact storage under the hood. 
The end result is getting deployment-ready, optimized ML models faster for the target problem.\n\n\nTinyML\nEdge Impulse’s Efficient On-device Neural Network Tuner (EON Tuner) is an automated hyperparameter optimization tool designed specifically for developing machine learning models for microcontrollers. The EON Tuner streamlines the model development process by automatically finding the best neural network configuration for efficient and accurate deployment on resource-constrained devices.\nThe key functionality of the EON Tuner is as follows. First, developers define the model hyperparameters, such as number of layers, nodes per layer, activation functions, and learning rate annealing schedule. These parameters constitute the search space that will be optimized. Next, the target microcontroller platform is selected, providing embedded hardware constraints. The user can also specify optimization objectives, such as minimizing memory footprint, lowering latency, reducing power consumption or maximizing accuracy.\nWith the search space and optimization goals defined, the EON Tuner leverages Bayesian hyperparameter optimization to intelligently explore possible configurations. Each prospective configuration is automatically implemented as a full model specification, trained and evaluated for quality metrics. The continual process balances exploration and exploitation to arrive at optimized settings tailored to the developer’s chosen chip architecture and performance requirements.\nBy automatically tuning models for embedded deployment, the EON Tuner frees machine learning engineers from the demandingly iterative process of hand-tuning models. The tool integrates seamlessly into the Edge Impulse workflow for taking models from concept to efficiently optimized implementations on microcontrollers. The expertise encapsulated in EON Tuner regarding ML model optimization for microcontrollers ensures beginner and experienced developers alike can rapidly iterate to models fitting their project needs."
},
{
"objectID": "contents/training/training.html#regularization",
@@ -508,7 +508,7 @@
"href": "contents/training/training.html#system-bottlenecks",
"title": "8 AI Training",
"section": "8.9 System Bottlenecks",
- "text": "8.9 System Bottlenecks\nAs introduced earlier, neural networks are comprised of linear operations (matrix multiplications) interleaved with element-wise nonlinear activation functions. The most computationally expensive portion of neural networks is the linear transformations, specifically the matrix multiplications between each layer. These linear layers map the activations from the previous layer to a higher dimensional space that serves as inputs to the next layer’s activation function.\n\n8.9.1 Runtime Complexity of Matrix Multiplication\n\nLayer Multiplications vs. Activations\nThe bulk of computation in neural networks arises from the matrix multiplications between layers. Consider a neural network layer with an input dimension of \\(M\\) = 500 and output dimension of \\(N\\) = 1000, the matrix multiplication requires \\(O(N \\cdot M) = O(1000 \\cdot 500) = 500,000\\) multiply-accumulate (MAC) operations between those layers.\nContrast this with the preceding layer which had \\(M\\) = 300 inputs, requiring \\(O(500 \\cdot 300) = 150,000\\) ops. We can see how the computations scale exponentially as the layer widths increase, with the total computations across \\(L\\) layers being \\(\\sum_{l=1}^{L-1} O\\big(N^{(l)} \\cdot M^{(l-1)}\\big)\\).\nNow comparing the matrix multiplication to the activation function which requires only \\(O(N) = 1000\\) element-wise nonlinearities for \\(N = 1000\\) outputs, we can clearly see the linear transformations dominating the activations computationally.\nThese large matrix multiplications directly impact hardware choices, inference latency, and power constraints for real-world neural network applications. For example, a typical DNN layer may require 500,000 multiply-accumulates vs. only 1000 nonlinear activations, demonstrating a 500x increase in mathematical operations.\nWhen training neural networks, we typically use mini-batch gradient descent, operating on small batches of data at a time. Considering a batch size of \\(B\\) training examples, the input to the matrix multiplication becomes a \\(M \\times B\\) matrix, while the output is an \\(N \\times B\\) matrix.\n\n\nMini-batch\nIn training neural networks, we need to repeatedly estimate the gradient of the loss function with respect to the network parameters (i.e. weights and biases). This gradient indicates which direction the parameters should be updated in order to minimize the loss. As introduced previously, use perform updates over a batch of datapoints every update, also known as stochastic gradient descent, or mini-batch gradient descent.\nThe most straightforward approach is to estimate the gradient based on a single training example, compute the parameter update, lather, rinse, and repeat for the next example. However, this involves very small and frequent parameter updates that can be computationally inefficient, and may additionally be inaccurate in terms of convergence due to the stochasticity of using just a single datapoint for a model update.\nInstead, mini-batch gradient descent strikes a balance between convergence stability and computational efficiency. Rather than compute the gradient on single examples, we estimate the gradient based on small “mini-batches” of data - usually between 8 to 256 examples in practice.\nThis provides a noisy but consistent gradient estimate that leads to more stable convergence. 
Additionally, the parameter update only needs to be performed once per mini-batch rather than once per example, reducing computational overhead.\nBy tuning the mini-batch size, we can control the tradeoff between the smoothness of the estimate (larger batches are generally better) and the frequency of updates (smaller batches allow more frequent updates). Mini-batch sizes are usually powers of 2 so they can leverage parallelism across GPU cores efficiently.\nSo the total computation is performing an \\(N \\times M\\) by \\(M \\times B\\) matrix multiplication, yielding \\(O(N \\cdot M \\cdot B)\\) floating point operations. As a numerical example, with \\(N=1000\\) hidden units, \\(M=500\\) input units, and a batch size \\(B=64\\), this equates to 1000 x 500 x 64 = 32 million multiply-accumulates per training iteration!\nIn contrast, the activation functions are applied element-wise to the \\(N \\times B\\) output matrix, requiring only \\(O(N \\cdot B)\\) computations. For \\(N=1000\\) and \\(B=64\\), that is just 64,000 nonlinearities - 500X less work than the matrix multiplication.\nAs we increase the batch size to fully leverage parallel hardware like GPUs, the discrepancy between matrix multiplication and activation function cost grows even larger. This reveals how optimizing the linear algebra operations offers tremendous efficiency gains.\nTherefore, when analyzing where and how neural networks spend computation, matrix multiplication clearly plays a central role. For example, matrix multiplications often account for over 90% of both inference latency and training time in common convolutional and recurrent neural networks.\n\n\nOptimizing Matrix Multiplication\nA number of techniques enhance the efficiency of general dense/sparse matrix-matrix and matrix-vector operations to directly improve overall efficiency. Some key methods include:\n\nLeveraging optimized math libraries like cuBLAS for GPU acceleration\nEnabling lower precision formats like FP16 or INT8 where accuracy permits\nEmploying Tensor Processing Units with hardware matrix multiplication\nSparsity-aware computations and data storage formats to exploit zero parameters\nApproximating matrix multiplications with algorithms like Fast Fourier Transforms\nModel architecture design to reduce layer widths and activations\nQuantization, pruning, distillation and other compression techniques\nParallelization of computation across available hardware\nCaching/pre-computing results where possible to reduce redundant operations\n\nThe potential optimization techniques are vast given the outsized portion of time models spend in matrix and vector math. Even incremental improvements would directly speed up runtimes and lower energy usage. Finding new ways to enhance these linear algebra primitives continues to be an active area of research aligned with the future demands of machine learning. We will discuss these in detail in the Optimizations and AI Acceleration chapters.\n\n\n\n8.9.2 Compute vs Memory Bottleneck\nAt this point, it should be clear that the core mathematical operation underpinning neural networks is the matrix-matrix multiplication. Both training and inference for neural networks heavily utilize these matrix multiply operations. Analysis shows that over 90% of computational requirements in state-of-the-art neural networks arise from matrix multiplications. 
Consequently, the performance of matrix multiplication has an enormous influence on overall model training or inference time.\n\nTraining versus Inference\nWhile both training and inference rely heavily on matrix multiplication performance, their precise computational profiles differ. Specifically, neural network inference tends to be more compute-bound compared to training for an equivalent batch size. The key difference lies in the backpropagation pass which is only required during training. Backpropagation involves a sequence matrix multiply operations to calculate gradients with respect to activations across each network layer. Critically though, no additional memory bandwidth is needed here - the inputs, outputs, and gradients are read/written from cache or registers.\nAs a result, training exhibits lower arithmetic intensities, with gradient calculations bounded by memory access instead of FLOPs. In contrast, neural network inference is dominated by the forward propagation which corresponds to a series of matrix-matrix multiplies. With no memory-intensive gradient retrospecting, larger batch sizes readily push inference into being extremely compute-bound. This is exhibited by the high measured arithmetic intensities. Note that for some inference applications, response times may be a critical requirement, which might force the application-provider to use a smaller batch size to meet these response-time requirements, thereby reducing hardware efficiency; hence in these cases inference may see lower hardware utilization.\nThe implications are that hardware provisioning and bandwidth vs FLOP tradeoffs differ based on whether a system targets training or inference. High-throughput low-latency servers for inference should emphasize computational power instead of memory while training clusters require a more balanced architecture.\nHowever, matrix multiplication exhibits an interesting tension - it can either be bound by the memory bandwidth or arithmetic throughput capabilities of the underlying hardware. The system’s ability to fetch and supply matrix data versus its ability to perform computational operations determines this direction.\nThis phenomenon has profound impacts; hardware must be designed judiciously and software optimizations need to keep this in mind. Optimizing and balancing compute versus memory to alleviate this underlying matrix multiplication bottleneck is crucial for both efficient model training as well as deployment.\nFinally, the batch size used may impact convergence rates during neural network training, which is another important consideration. For example, there is generally diminishing returns in benefits to convergence with extremely large batch sizes (i.e: > 16384), and hence while extremely large batch sizes may be increasingly beneficial from a hardware/arithmetic intensity perspective, using such large batches may not translate to faster convergence vs wall-clock time due to their diminishing benefits to convergence. These tradeoffs are part of the design decisions core to systems for machine-learning type of research.\n\n\nBatch Size\nThe batch size used during neural network training and inference has a significant impact on whether matrix multiplication poses more of a computational or memory bottleneck. Concretely, the batch size refers to the number of samples that are propagated through the network together in one forward/backward pass. 
In terms of matrix multiplication, this equates to larger matrix sizes.\nSpecifically, let’s look at the arithmetic intensity of matrix multiplication during neural network training. This measures the ratio between computational operations and memory transfers. The matrix multiply of two matrices of size \\(N \\times M\\) and \\(M \\times B\\) requires \\(N \\times M \\times B\\) multiply-accumulate operations, but only transfers of \\(N \\times M + M \\times B\\) matrix elements.\nAs we increase the batch size \\(B\\), the number of arithmetic operations grows much faster than the memory transfers. For example, with a batch size of 1, we need \\(N \\times M\\) operations and \\(N + M\\) transfers, giving an arithmetic intensity ratio of around \\(\\frac{N \\times M}{N+M}\\). But with a large batch size of 128, the intensity ratio becomes \\(\\frac{128 \\times N \\times M}{N \\times M + M \\times 128} \\approx 128\\). Using a larger batch size shifts the overall computation from being more memory-bounded to being more compute-bounded. In practice, AI training uses large batch sizes and is generally limited by peak arithmetic computational performance, i.e: Application 3 in Figure 8.1.\nTherefore, batched matrix multiplication is far more computationally intensive than memory access bound. This has implications on hardware design as well as software optimizations, which we will cover next. The key insight is that by tuning the batch size, we can significantly alter the computational profile and bottlenecks posed by neural network training and inference.\n\n\n\nFigure 8.1: AI training is typically compute bound due to the high arithmetic intensity of matrix-multiplication when batch size is large.\n\n\n\n\nHardware Characteristics\nModern hardware like CPUs and GPUs are highly optimized for computational throughput as opposed to memory bandwidth. For example, high-end H100 Tensor Core GPUs can deliver over 60 TFLOPS of double-precision performance but only provide up to 3 TB/s of memory bandwidth. This means there is almost a 20x imbalance between arithmetic units and memory access. Consequently, for hardware like GPU accelerators, neural network training workloads need to be made as computationally intensive as possible in order to fully utilize the available resources.\nThis further motivates the need for using large batch sizes during training. When using a small batch, the matrix multiplication is bounded by memory bandwidth, underutilizing the abundant compute resources. However, with sufficiently large batches, we can shift the bottleneck more towards computation and attain much higher arithmetic intensity. For instance, batches of 256 or 512 samples may be needed to saturate a high-end GPU. The downside is that larger batches provide less frequent parameter updates, which can impact convergence. Still, the parameter serves as an important tuning knob to balance memory vs compute limitations.\nTherefore, given the imbalanced compute-memory architectures of modern hardware, employing large batch sizes is essential to alleviate bottlenecks and maximize throughput. The subsequent software and algorithms also need to accommodate such batch sizes, as mentioned, since larger batch sizes may have diminishing returns towards the convergence of the network. Using very small batch sizes may lead to suboptimal hardware utilization, ultimately limiting training efficiency. 
Scaling up to large batch sizes is a topic of research and has been explored in various works that aim to do large scale training. (You et al. 2018)\n\nYou, Yang, Zhao Zhang, Cho-Jui Hsieh, James Demmel, and Kurt Keutzer. 2018. “ImageNet Training in Minutes.” https://arxiv.org/abs/1709.05011.\n\n\nModel Architectures\nThe underlying neural network architecture also affects whether matrix multiplication poses more of a computational or memory bottleneck during execution. Transformers and MLPs tend to be much more compute-bound compared to CNN convolutional neural networks. This stems from the types of matrix multiplication operations involved in each model. Transformers rely on self-attention - multiplying large activation matrices by massive parameter matrices to relate elements. MLPs stack fully-connected layers also requiring large matrix multiplies.\nIn contrast, the convolutional layers in CNNs have a sliding window that reuses activations and parameters across the input. This means fewer unique matrix operations are needed. However, the convolutions require repeatedly accessing small parts of the input and moving partial sums to populate each window. Even though the arithmetic operations in convolutions are intense, this data movement and buffer manipulation imposes huge memory access overheads. Additionally, CNNs comprise several layered stages so intermediate outputs need to be materialized to memory frequently.\nAs a result, CNN training tends to be more memory bandwidth bound relative to arithmetic bound compared to Transformers and MLPs. Therefore, the matrix multiplication profile and in turn the bottleneck posed varies significantly based on model choice. Hardware and systems need to be designed with appropriate compute-memory bandwidth balance depending on target model deployment. Models relying more on attention and MLP layers require higher arithmetic throughput compared to CNNs which necessitate high memory bandwidth."
+ "text": "8.9 System Bottlenecks\nAs introduced earlier, neural networks are comprised of linear operations (matrix multiplications) interleaved with element-wise nonlinear activation functions. The most computationally expensive portion of neural networks is the linear transformations, specifically the matrix multiplications between each layer. These linear layers map the activations from the previous layer to a higher dimensional space that serves as inputs to the next layer’s activation function.\n\n8.9.1 Runtime Complexity of Matrix Multiplication\n\nLayer Multiplications vs. Activations\nThe bulk of computation in neural networks arises from the matrix multiplications between layers. Consider a neural network layer with an input dimension of \\(M\\) = 500 and output dimension of \\(N\\) = 1000, the matrix multiplication requires \\(O(N \\cdot M) = O(1000 \\cdot 500) = 500,000\\) multiply-accumulate (MAC) operations between those layers.\nContrast this with the preceding layer which had \\(M\\) = 300 inputs, requiring \\(O(500 \\cdot 300) = 150,000\\) ops. We can see how the computations scale exponentially as the layer widths increase, with the total computations across \\(L\\) layers being \\(\\sum_{l=1}^{L-1} O\\big(N^{(l)} \\cdot M^{(l-1)}\\big)\\).\nNow comparing the matrix multiplication to the activation function which requires only \\(O(N) = 1000\\) element-wise nonlinearities for \\(N = 1000\\) outputs, we can clearly see the linear transformations dominating the activations computationally.\nThese large matrix multiplications directly impact hardware choices, inference latency, and power constraints for real-world neural network applications. For example, a typical DNN layer may require 500,000 multiply-accumulates vs. only 1000 nonlinear activations, demonstrating a 500x increase in mathematical operations.\nWhen training neural networks, we typically use mini-batch gradient descent, operating on small batches of data at a time. Considering a batch size of \\(B\\) training examples, the input to the matrix multiplication becomes a \\(M \\times B\\) matrix, while the output is an \\(N \\times B\\) matrix.\n\n\nMini-batch\nIn training neural networks, we need to repeatedly estimate the gradient of the loss function with respect to the network parameters (i.e. weights and biases). This gradient indicates which direction the parameters should be updated in order to minimize the loss. As introduced previously, use perform updates over a batch of datapoints every update, also known as stochastic gradient descent, or mini-batch gradient descent.\nThe most straightforward approach is to estimate the gradient based on a single training example, compute the parameter update, lather, rinse, and repeat for the next example. However, this involves very small and frequent parameter updates that can be computationally inefficient, and may additionally be inaccurate in terms of convergence due to the stochasticity of using just a single datapoint for a model update.\nInstead, mini-batch gradient descent strikes a balance between convergence stability and computational efficiency. Rather than compute the gradient on single examples, we estimate the gradient based on small “mini-batches” of data - usually between 8 to 256 examples in practice.\nThis provides a noisy but consistent gradient estimate that leads to more stable convergence. 
Additionally, the parameter update only needs to be performed once per mini-batch rather than once per example, reducing computational overhead.\nBy tuning the mini-batch size, we can control the tradeoff between the smoothness of the estimate (larger batches are generally better) and the frequency of updates (smaller batches allow more frequent updates). Mini-batch sizes are usually powers of 2, so they can leverage parallelism across GPU cores efficiently.\nSo the total computation is an \\(N \\times M\\) by \\(M \\times B\\) matrix multiplication, yielding \\(O(N \\cdot M \\cdot B)\\) floating point operations. As a numerical example, with \\(N=1000\\) hidden units, \\(M=500\\) input units, and a batch size \\(B=64\\), this equates to 1000 x 500 x 64 = 32 million multiply-accumulates per training iteration!\nIn contrast, the activation functions are applied element-wise to the \\(N \\times B\\) output matrix, requiring only \\(O(N \\cdot B)\\) computations. For \\(N=1000\\) and \\(B=64\\), that is just 64,000 nonlinearities - 500x less work than the matrix multiplication.\nAs we increase the batch size to fully leverage parallel hardware like GPUs, the discrepancy between matrix multiplication and activation function cost grows even larger. This reveals how optimizing the linear algebra operations offers tremendous efficiency gains.\nTherefore, when analyzing where and how neural networks spend computation, matrix multiplication clearly plays a central role. For example, matrix multiplications often account for over 90% of both inference latency and training time in common convolutional and recurrent neural networks.\n\n\nOptimizing Matrix Multiplication\nA number of techniques enhance the efficiency of general dense/sparse matrix-matrix and matrix-vector operations, directly improving overall performance. Some key methods include:\n\nLeveraging optimized math libraries like cuBLAS for GPU acceleration\nEnabling lower precision formats like FP16 or INT8 where accuracy permits\nEmploying Tensor Processing Units with hardware matrix multiplication\nSparsity-aware computations and data storage formats to exploit zero parameters\nApproximating matrix multiplications with algorithms like Fast Fourier Transforms\nModel architecture design to reduce layer widths and activations\nQuantization, pruning, distillation, and other compression techniques\nParallelization of computation across available hardware\nCaching/pre-computing results where possible to reduce redundant operations\n\nThe potential optimization techniques are vast given the outsized portion of time models spend in matrix and vector math. Even incremental improvements would directly speed up runtimes and lower energy usage. Finding new ways to enhance these linear algebra primitives continues to be an active area of research aligned with the future demands of machine learning. We will discuss these in detail in the Optimizations and AI Acceleration chapters.\n\n\n\n8.9.2 Compute vs Memory Bottleneck\nAt this point, it should be clear that the core mathematical operation underpinning neural networks is matrix-matrix multiplication. Both training and inference for neural networks heavily utilize these matrix multiply operations. Analysis shows that over 90% of computational requirements in state-of-the-art neural networks arise from matrix multiplications. 
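The dominance of matrix multiplication can be illustrated with a simple operation count. The sketch below tallies MACs from the layer matrix multiplies against element-wise activation operations for a small, hypothetical MLP; the layer widths and batch size are assumptions for this example, and an operation count is only a proxy for measured time, which also depends on memory behavior and hardware.

```python
# Rough operation count for a hypothetical MLP: MACs from layer matrix multiplies
# vs. element-wise activation ops, for a batch of B examples. Widths are illustrative.

layer_widths = [300, 500, 1000, 1000, 10]   # input dim followed by each layer's output dim
B = 64                                      # batch size

macs = sum(m * n * B for m, n in zip(layer_widths[:-1], layer_widths[1:]))
act_ops = sum(n * B for n in layer_widths[1:])

print(f"matmul MACs:    {macs:,}")                         # 106,240,000
print(f"activation ops: {act_ops:,}")                      # 160,640
print(f"matmul share:   {macs / (macs + act_ops):.1%}")    # ~99.8% of counted operations
```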
Consequently, the performance of matrix multiplication has an enormous influence on overall model training or inference time.\n\nTraining versus Inference\nWhile both training and inference rely heavily on matrix multiplication performance, their precise computational profiles differ. Specifically, neural network inference tends to be more compute-bound than training at an equivalent batch size. The key difference lies in the backpropagation pass, which is only required during training. Backpropagation involves a sequence of matrix multiply operations to calculate gradients with respect to activations across each network layer. Critically, these gradient calculations must read the activations saved during the forward pass along with the weights, and write out gradients, which adds substantial memory traffic relative to the extra arithmetic.\nAs a result, training exhibits lower arithmetic intensities, with gradient calculations bounded by memory access instead of FLOPs. In contrast, neural network inference is dominated by the forward propagation, which corresponds to a series of matrix-matrix multiplies. With no memory-intensive gradient computation, larger batch sizes readily push inference into being extremely compute-bound. This is reflected in the high measured arithmetic intensities. Note that for some inference applications, response times may be a critical requirement, which might force the application provider to use a smaller batch size to meet these response-time requirements, thereby reducing hardware efficiency; hence in these cases inference may see lower hardware utilization.\nThe implication is that hardware provisioning and bandwidth vs FLOP tradeoffs differ based on whether a system targets training or inference. High-throughput, low-latency servers for inference should emphasize computational power instead of memory, while training clusters require a more balanced architecture.\nHowever, matrix multiplication exhibits an interesting tension - it can be bound either by the memory bandwidth or by the arithmetic throughput of the underlying hardware. The system’s ability to fetch and supply matrix data versus its ability to perform computational operations determines which of the two limits performance.\nThis phenomenon has profound impacts; hardware must be designed judiciously, and software optimizations need to keep this in mind. Optimizing and balancing compute versus memory to alleviate this underlying matrix multiplication bottleneck is crucial for both efficient model training and deployment.\nFinally, the batch size used may impact convergence rates during neural network training, which is another important consideration. For example, there are generally diminishing returns to convergence with extremely large batch sizes (e.g., > 16384), and hence while extremely large batch sizes may be increasingly beneficial from a hardware/arithmetic intensity perspective, using such large batches may not translate into faster convergence in wall-clock time. These tradeoffs are core design decisions in machine learning systems research.\n\n\nBatch Size\nThe batch size used during neural network training and inference has a significant impact on whether matrix multiplication poses more of a computational or memory bottleneck. Concretely, the batch size refers to the number of samples that are propagated through the network together in one forward/backward pass. 
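To picture what batching means for the matrix shapes, here is a minimal sketch in which B = 64 examples are stacked column-wise into an M x B matrix, so the whole batch's forward pass through one layer is a single N x M by M x B multiply. The column-wise layout, random values, and choice of ReLU are illustrative assumptions.

```python
# Sketch: one layer's forward pass over a batch as a single matrix-matrix multiply.
# Shapes follow the running example (M = 500, N = 1000, B = 64); values are random placeholders.
import numpy as np

M, N, B = 500, 1000, 64
rng = np.random.default_rng(0)

W = rng.normal(size=(N, M))        # layer weights, N x M
batch = rng.normal(size=(M, B))    # B input examples stacked column-wise, M x B

Z = W @ batch                      # one N x M by M x B multiply covers the whole batch
A = np.maximum(Z, 0.0)             # element-wise activation (ReLU) on the N x B output

print(Z.shape, A.shape)            # (1000, 64) (1000, 64): one column of outputs per example
```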
In terms of matrix multiplication, a larger batch equates to larger matrix sizes.\nSpecifically, let’s look at the arithmetic intensity of matrix multiplication during neural network training. This measures the ratio between computational operations and memory transfers. The matrix multiply of two matrices of size \\(N \\times M\\) and \\(M \\times B\\) requires \\(N \\times M \\times B\\) multiply-accumulate operations, but only transfers of \\(N \\times M + M \\times B\\) matrix elements (ignoring, for simplicity, the \\(N \\times B\\) output elements that must be written back).\nAs we increase the batch size \\(B\\), the number of arithmetic operations grows much faster than the memory transfers. For example, with a batch size of 1, we need \\(N \\times M\\) operations but must transfer \\(N \\times M + M\\) elements, giving an arithmetic intensity ratio of roughly \\(\\frac{N \\times M}{N \\times M + M} \\approx 1\\). But with a large batch size of 128, the intensity ratio becomes \\(\\frac{128 \\times N \\times M}{N \\times M + M \\times 128}\\), which approaches 128 when \\(N \\gg 128\\). Using a larger batch size shifts the overall computation from being more memory-bounded to being more compute-bounded. In practice, AI training uses large batch sizes and is generally limited by peak arithmetic computational performance, i.e., Application 3 in Figure 8.1.\nTherefore, batched matrix multiplication is bound far more by computation than by memory access. This has implications for hardware design as well as software optimizations, which we will cover next. The key insight is that by tuning the batch size, we can significantly alter the computational profile and bottlenecks posed by neural network training and inference.\n\n\n\nFigure 8.1: AI training is typically compute bound due to the high arithmetic intensity of matrix multiplication when the batch size is large.\n\n\n\n\nHardware Characteristics\nModern hardware such as CPUs and GPUs is highly optimized for computational throughput relative to memory bandwidth. For example, high-end H100 Tensor Core GPUs can deliver over 60 TFLOPS of double-precision performance but provide only around 3 TB/s of memory bandwidth. This means there is roughly a 20x imbalance between arithmetic throughput and memory bandwidth - about 20 floating point operations for every byte that can be moved. Consequently, for hardware like GPU accelerators, neural network training workloads need to be made as computationally intensive as possible in order to fully utilize the available resources.\nThis further motivates the need for using large batch sizes during training. When using a small batch, the matrix multiplication is bounded by memory bandwidth, underutilizing the abundant compute resources. However, with sufficiently large batches, we can shift the bottleneck more towards computation and attain much higher arithmetic intensity. For instance, batches of 256 or 512 samples may be needed to saturate a high-end GPU. The downside is that larger batches provide less frequent parameter updates, which can impact convergence. Still, the batch size serves as an important tuning knob for balancing memory vs compute limitations.\nTherefore, given the imbalanced compute-memory architectures of modern hardware, employing large batch sizes is essential to alleviate bottlenecks and maximize throughput. The software and algorithms also need to accommodate such batch sizes since, as mentioned, larger batch sizes may offer diminishing returns for the convergence of the network. Using very small batch sizes may lead to suboptimal hardware utilization, ultimately limiting training efficiency. 
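The shift from memory-bound to compute-bound with batch size can be sketched numerically. The snippet below uses the layer dimensions from the running example, assumes FP32 (4-byte) elements, counts only weight and input reads (matching the simplified accounting above), and compares against the roughly 20 FLOP/byte machine balance implied by the 60 TFLOPS / 3 TB/s figures; a real roofline analysis would also account for output writes, caching, and kernel-level details.

```python
# Sketch: arithmetic intensity (FLOPs per byte) of an N x M by M x B matrix multiply
# as the batch size B grows. FP32 elements and the 20 FLOP/byte threshold are assumptions
# taken from the surrounding discussion, not measured values.

N, M = 1000, 500                                        # layer dimensions from the running example
BYTES_PER_ELEMENT = 4                                   # FP32
MACHINE_BALANCE = 20.0                                  # ~60 TFLOPS / 3 TB/s

def flops_per_byte(B: int) -> float:
    flops = 2 * N * M * B                               # each MAC is one multiply plus one add
    bytes_moved = BYTES_PER_ELEMENT * (N * M + M * B)   # weight matrix + input batch reads
    return flops / bytes_moved

for B in (1, 8, 64, 128, 512):
    ai = flops_per_byte(B)
    regime = "compute-bound" if ai > MACHINE_BALANCE else "memory-bound"
    print(f"B={B:4d}  intensity={ai:6.1f} FLOP/byte  ->  likely {regime}")
```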
Scaling up to large batch sizes is a topic of research and has been explored in various works on large-scale training (You et al. 2018).\n\nYou, Yang, Zhao Zhang, Cho-Jui Hsieh, James Demmel, and Kurt Keutzer. 2018. “ImageNet Training in Minutes.” https://arxiv.org/abs/1709.05011.\n\n\nModel Architectures\nThe underlying neural network architecture also affects whether matrix multiplication poses more of a computational or memory bottleneck during execution. Transformers and MLPs tend to be much more compute-bound compared to convolutional neural networks (CNNs). This stems from the types of matrix multiplication operations involved in each model. Transformers rely on self-attention - multiplying large activation matrices by massive parameter matrices to relate elements. MLPs stack fully-connected layers, which also require large matrix multiplies.\nIn contrast, the convolutional layers in CNNs have a sliding window that reuses activations and parameters across the input. This means fewer unique matrix operations are needed. However, the convolutions require repeatedly accessing small parts of the input and moving partial sums to populate each window. Even though the arithmetic operations in convolutions are intense, this data movement and buffer manipulation imposes huge memory access overheads. Additionally, CNNs comprise several layered stages, so intermediate outputs need to be materialized to memory frequently.\nAs a result, CNN training tends to be more memory bandwidth bound, relative to arithmetic, than Transformers and MLPs. Therefore, the matrix multiplication profile, and in turn the bottleneck posed, varies significantly with model choice. Hardware and systems need to be designed with an appropriate compute-memory bandwidth balance depending on target model deployment. Models relying more on attention and MLP layers require higher arithmetic throughput compared to CNNs, which necessitate high memory bandwidth."
},
{
"objectID": "contents/training/training.html#training-parallelization",
|