Commit

Built site for gh-pages
profvjreddi committed Dec 10, 2023
1 parent 9b55170 commit 83b0449
Showing 5 changed files with 15 additions and 15 deletions.
2 changes: 1 addition & 1 deletion .nojekyll
@@ -1 +1 @@
bdc47399
cde099d5
8 changes: 4 additions & 4 deletions contents/contributors.html
@@ -552,7 +552,7 @@ <h1 class="title">Contributors</h1>
</tr>
<tr>
<td align="center" valign="top" width="20%">
<a href="https://github.com/harvard-edge/cs249r_book/graphs/contributors"><img src="https://www.gravatar.com/avatar/613d3a2e949553ea5ddca6b86cfa0165?d=identicon&amp;s=100?s=100" width="100px;" alt="Maximilian Lam"><br><sub><b>Maximilian Lam</b></sub></a><br>
<a href="https://github.com/harvard-edge/cs249r_book/graphs/contributors"><img src="https://www.gravatar.com/avatar/efaae2f4eb6a9447faf76cdb1d4bd513?d=identicon&amp;s=100?s=100" width="100px;" alt="Maximilian Lam"><br><sub><b>Maximilian Lam</b></sub></a><br>
</td>
<td align="center" valign="top" width="20%">
<a href="https://github.com/jaysonzlin"><img src="https://avatars.githubusercontent.com/jaysonzlin?s=100" width="100px;" alt="Jayson Lin"><br><sub><b>Jayson Lin</b></sub></a><br>
@@ -589,7 +589,7 @@ <h1 class="title">Contributors</h1>
<a href="https://github.com/aptl26"><img src="https://avatars.githubusercontent.com/aptl26?s=100" width="100px;" alt="Aghyad Deeb"><br><sub><b>Aghyad Deeb</b></sub></a><br>
</td>
<td align="center" valign="top" width="20%">
<a href="https://github.com/harvard-edge/cs249r_book/graphs/contributors"><img src="https://www.gravatar.com/avatar/2e775865be2da637187b2079acb92e83?d=identicon&amp;s=100?s=100" width="100px;" alt="Aghyad Deeb"><br><sub><b>Aghyad Deeb</b></sub></a><br>
<a href="https://github.com/harvard-edge/cs249r_book/graphs/contributors"><img src="https://www.gravatar.com/avatar/6f7005975970f417d37873b5471ca2b1?d=identicon&amp;s=100?s=100" width="100px;" alt="Aghyad Deeb"><br><sub><b>Aghyad Deeb</b></sub></a><br>
</td>
<td align="center" valign="top" width="20%">
<a href="https://github.com/zishenwan"><img src="https://avatars.githubusercontent.com/zishenwan?s=100" width="100px;" alt="Zishen"><br><sub><b>Zishen</b></sub></a><br>
@@ -671,7 +671,7 @@ <h1 class="title">Contributors</h1>
</tr>
<tr>
<td align="center" valign="top" width="20%">
<a href="https://github.com/harvard-edge/cs249r_book/graphs/contributors"><img src="https://www.gravatar.com/avatar/c3bc0d62ac9c74d7a8d414af692451bb?d=identicon&amp;s=100?s=100" width="100px;" alt="Annie Laurie Cook"><br><sub><b>Annie Laurie Cook</b></sub></a><br>
<a href="https://github.com/harvard-edge/cs249r_book/graphs/contributors"><img src="https://www.gravatar.com/avatar/56964fa935089608fa13c9865fe44424?d=identicon&amp;s=100?s=100" width="100px;" alt="Annie Laurie Cook"><br><sub><b>Annie Laurie Cook</b></sub></a><br>
</td>
<td align="center" valign="top" width="20%">
<a href="https://github.com/eezike"><img src="https://avatars.githubusercontent.com/eezike?s=100" width="100px;" alt="Emeka Ezike"><br><sub><b>Emeka Ezike</b></sub></a><br>
@@ -694,7 +694,7 @@ <h1 class="title">Contributors</h1>
<a href="https://github.com/abigailswallow"><img src="https://avatars.githubusercontent.com/abigailswallow?s=100" width="100px;" alt="abigailswallow"><br><sub><b>abigailswallow</b></sub></a><br>
</td>
<td align="center" valign="top" width="20%">
<a href="https://github.com/harvard-edge/cs249r_book/graphs/contributors"><img src="https://www.gravatar.com/avatar/41b331507fc9006e6d38b9e5e24550cd?d=identicon&amp;s=100?s=100" width="100px;" alt="Costin-Andrei Oncescu"><br><sub><b>Costin-Andrei Oncescu</b></sub></a><br>
<a href="https://github.com/harvard-edge/cs249r_book/graphs/contributors"><img src="https://www.gravatar.com/avatar/894d3532fa1a3bfbc47af000453a2f21?d=identicon&amp;s=100?s=100" width="100px;" alt="Costin-Andrei Oncescu"><br><sub><b>Costin-Andrei Oncescu</b></sub></a><br>
</td>
<td align="center" valign="top" width="20%">
<a href="https://github.com/vijay-edu"><img src="https://avatars.githubusercontent.com/vijay-edu?s=100" width="100px;" alt="Vijay Edupuganti"><br><sub><b>Vijay Edupuganti</b></sub></a><br>
Binary file not shown.
14 changes: 7 additions & 7 deletions contents/training/training.html
@@ -997,18 +997,18 @@ <h3 data-number="8.4.1" class="anchored" data-anchor-id="optimizations"><span cl
Ruder, Sebastian. 2016. <span>“An Overview of Gradient Descent Optimization Algorithms.”</span> <em>ArXiv Preprint</em> abs/1609.04747. <a href="https://arxiv.org/abs/1609.04747">https://arxiv.org/abs/1609.04747</a>.
</div></div><p><strong>Momentum:</strong> Accumulates a velocity vector in directions of persistent gradient across iterations. This helps accelerate progress by dampening oscillations and maintains progress in consistent directions.</p>
<p><strong>Nesterov Accelerated Gradient (NAG):</strong> A variant of momentum that computes gradients at the “look ahead” position rather than the current parameter position. This anticipatory update prevents overshooting while the momentum maintains the accelerated progress.</p>
<p><strong>RMSProp:</strong> Divides the learning rate by an exponentially decaying average of squared gradients. This has a similar normalizing effect as Adagrad but does not accumulate the gradients over time, avoiding a rapid decay of learning rates. <span class="citation" data-cites="rmsprop">(<a href="../references.html#ref-rmsprop" role="doc-biblioref">Hinton 2017</a>)</span></p>
<p><strong>RMSProp:</strong> Divides the learning rate by an exponentially decaying average of squared gradients. This has a similar normalizing effect as Adagrad but does not accumulate the gradients over time, avoiding a rapid decay of learning rates <span class="citation" data-cites="rmsprop">(<a href="../references.html#ref-rmsprop" role="doc-biblioref">Hinton 2017</a>)</span>.</p>
<div class="no-row-height column-margin column-container"><div id="ref-rmsprop" class="csl-entry" role="listitem">
Hinton, Geoffrey. 2017. <span>“Overview of Minibatch Gradient Descent.”</span> University of Toronto; University Lecture.
</div><div id="ref-adagrad" class="csl-entry" role="listitem">
Duchi, John C., Elad Hazan, and Yoram Singer. 2010. <span>“Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.”</span> In <em><span>COLT</span> 2010 - the 23rd Conference on Learning Theory, Haifa, Israel, June 27-29, 2010</em>, edited by Adam Tauman Kalai and Mehryar Mohri, 257–69. Omnipress. <a href="http://colt2010.haifa.il.ibm.com/papers/COLT2010proceedings.pdf\#page=265">http://colt2010.haifa.il.ibm.com/papers/COLT2010proceedings.pdf\#page=265</a>.
</div></div><p><strong>Adagrad:</strong> An adaptive learning rate algorithm that maintains a per-parameter learning rate that is scaled down proportionate to the historical sum of gradients on each parameter. This helps eliminate the need to manually tune learning rates. <span class="citation" data-cites="adagrad">(<a href="../references.html#ref-adagrad" role="doc-biblioref">Duchi, Hazan, and Singer 2010</a>)</span></p>
<p><strong>Adadelta:</strong> A modification to Adagrad which restricts the window of accumulated past gradients thus reducing the aggressive decay of learning rates. <span class="citation" data-cites="adelta">(<a href="../references.html#ref-adelta" role="doc-biblioref">Zeiler 2012</a>)</span></p>
</div></div><p><strong>Adagrad:</strong> An adaptive learning rate algorithm that maintains a per-parameter learning rate that is scaled down proportionate to the historical sum of gradients on each parameter. This helps eliminate the need to manually tune learning rates <span class="citation" data-cites="adagrad">(<a href="../references.html#ref-adagrad" role="doc-biblioref">Duchi, Hazan, and Singer 2010</a>)</span>.</p>
<p><strong>Adadelta:</strong> A modification to Adagrad which restricts the window of accumulated past gradients thus reducing the aggressive decay of learning rates <span class="citation" data-cites="adelta">(<a href="../references.html#ref-adelta" role="doc-biblioref">Zeiler 2012</a>)</span>.</p>
<div class="no-row-height column-margin column-container"><div id="ref-adelta" class="csl-entry" role="listitem">
Zeiler, Matthew D. 2012. <span>“Reinforcement and Systemic Machine Learning for Decision Making.”</span> Wiley. <a href="https://doi.org/10.1002/9781118266502.ch6">https://doi.org/10.1002/9781118266502.ch6</a>.
</div><div id="ref-adam" class="csl-entry" role="listitem">
Kingma, Diederik P., and Jimmy Ba. 2015. <span>“Adam: <span>A</span> Method for Stochastic Optimization.”</span> In <em>3rd International Conference on Learning Representations, <span>ICLR</span> 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings</em>, edited by Yoshua Bengio and Yann LeCun. <a href="http://arxiv.org/abs/1412.6980">http://arxiv.org/abs/1412.6980</a>.
</div></div><p><strong>Adam:</strong> - Combination of momentum and rmsprop where rmsprop modifies the learning rate based on average of recent magnitudes of gradients. Displays very fast initial progress and automatically tunes step sizes. <span class="citation" data-cites="adam">(<a href="../references.html#ref-adam" role="doc-biblioref">Kingma and Ba 2015</a>)</span></p>
</div></div><p><strong>Adam:</strong> - Combination of momentum and rmsprop where rmsprop modifies the learning rate based on average of recent magnitudes of gradients. Displays very fast initial progress and automatically tunes step sizes <span class="citation" data-cites="adam">(<a href="../references.html#ref-adam" role="doc-biblioref">Kingma and Ba 2015</a>)</span>.</p>
<p>Of these methods, Adam is widely considered the go-to optimization algorithm for many deep learning tasks, consistently outperforming vanilla SGD in terms of both training speed and performance. Other optimizers may be better suited in some cases, particularly for simpler models.</p>
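A minimal sketch, assuming PyTorch's torch.optim module as the training framework (the chapter text above does not fix one) and a placeholder linear model, shows how the optimizers discussed here are constructed and swapped on the same set of parameters:

```python
# Illustrative sketch only: instantiating the optimizers discussed above in
# PyTorch. The model and hyperparameter values are placeholders, not values
# recommended by the chapter.
import torch
import torch.nn as nn

model = nn.Linear(128, 10)  # stand-in model

optimizers = {
    "momentum": torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9),
    "nesterov": torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True),
    "rmsprop":  torch.optim.RMSprop(model.parameters(), lr=0.001),
    "adagrad":  torch.optim.Adagrad(model.parameters(), lr=0.01),
    "adadelta": torch.optim.Adadelta(model.parameters()),
    "adam":     torch.optim.Adam(model.parameters(), lr=0.001),
}

# One training step looks the same regardless of which optimizer is chosen.
opt = optimizers["adam"]
x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))
loss = nn.functional.cross_entropy(model(x), y)
opt.zero_grad()
loss.backward()
opt.step()
```

Because the update rule is encapsulated in the optimizer object, switching from SGD with momentum to Adam is a one-line change, which is why empirical comparisons like the one above are cheap to run.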
</section>
<section id="trade-offs" class="level3" data-number="8.4.2">
@@ -1098,9 +1098,9 @@ <h3 data-number="8.5.1" class="anchored" data-anchor-id="search-algorithms"><spa
<ul>
<li><p><strong>Grid Search:</strong> The most basic search method, where you manually define a grid of values to check for each hyperparameter. For example, checking learning rates = [0.01, 0.1, 1] and batch sizes = [32, 64, 128]. The key advantage is simplicity, but exploring all combinations leads to exponential search space explosion. Best for fine-tuning a few params.</p></li>
<li><p><strong>Random Search:</strong> Instead of a grid, you define a random distribution per hyperparameter to sample values from during search. It is more efficient at searching a vast hyperparameter space. However, still somewhat arbitrary compared to more adaptive methods.</p></li>
<li><p><strong>Bayesian Optimization:</strong> An advanced probabilistic approach for adaptive exploration based on a surrogate function to model performance over iterations. It is very sample efficient - finds highly optimized hyperparameters in fewer evaluation steps. Requires more investment in setup. <span class="citation" data-cites="bayes_hyperparam">(<a href="../references.html#ref-bayes_hyperparam" role="doc-biblioref">Snoek, Larochelle, and Adams 2012</a>)</span></p></li>
<li><p><strong>Bayesian Optimization:</strong> An advanced probabilistic approach for adaptive exploration based on a surrogate function to model performance over iterations. It is very sample efficient - finds highly optimized hyperparameters in fewer evaluation steps. Requires more investment in setup <span class="citation" data-cites="bayes_hyperparam">(<a href="../references.html#ref-bayes_hyperparam" role="doc-biblioref">Snoek, Larochelle, and Adams 2012</a>)</span>.</p></li>
<li><p><strong>Evolutionary Algorithms:</strong> Mimic natural selection principles - generate populations of hyperparameter combinations, evolve them over time based on performance. These algorithms offer robust search capabilities better suited for complex response surfaces. But many iterations required for reasonable convergence.</p></li>
<li><p><strong>Neural Architecture Search:</strong> An approach to designing well-performing architectures for neural networks. Traditionally, NAS approaches use some form of reinforcement learning to propose neural network architectures which are then repeatedly evaluated. <span class="citation" data-cites="nas">(<a href="../references.html#ref-nas" role="doc-biblioref">Zoph and Le 2023</a>)</span></p></li>
<li><p><strong>Neural Architecture Search:</strong> An approach to designing well-performing architectures for neural networks. Traditionally, NAS approaches use some form of reinforcement learning to propose neural network architectures which are then repeatedly evaluated <span class="citation" data-cites="nas">(<a href="../references.html#ref-nas" role="doc-biblioref">Zoph and Le 2023</a>)</span>.</p></li>
</ul>
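As a concrete sketch of the first two strategies in the list above, the snippet below enumerates a small grid and samples random configurations under the same evaluation budget; train_and_score is a hypothetical stand-in for the actual training-plus-validation routine being tuned, not something defined in the chapter:

```python
# Illustrative sketch: grid search vs. random search over two hyperparameters.
# `train_and_score` is a hypothetical placeholder for the real training routine.
import itertools
import random

def train_and_score(lr, batch_size):
    # Placeholder objective; in practice this trains a model and returns a
    # validation metric for the given hyperparameter setting.
    return -abs(lr - 0.1) - abs(batch_size - 64) / 1000

# Grid search: every combination of the manually chosen values.
grid = {"lr": [0.01, 0.1, 1.0], "batch_size": [32, 64, 128]}
grid_results = [
    ((lr, bs), train_and_score(lr, bs))
    for lr, bs in itertools.product(grid["lr"], grid["batch_size"])
]

# Random search: sample each hyperparameter from a distribution instead.
random.seed(0)
random_results = [
    ((lr, bs), train_and_score(lr, bs))
    for lr, bs in (
        (10 ** random.uniform(-3, 0), random.choice([32, 64, 128, 256]))
        for _ in range(9)  # same budget as the 3x3 grid
    )
]

best_config, best_score = max(grid_results + random_results, key=lambda r: r[1])
```

The grid's cost grows multiplicatively with each added hyperparameter, while random search keeps the budget fixed and simply spreads it over the space, which is the trade-off the list above describes.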
<div class="no-row-height column-margin column-container"><div id="ref-bayes_hyperparam" class="csl-entry" role="listitem">
Snoek, Jasper, Hugo Larochelle, and Ryan P. Adams. 2012. <span>“Practical Bayesian Optimization of Machine Learning Algorithms.”</span> In <em>Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a Meeting Held December 3-6, 2012, Lake Tahoe, Nevada, United States</em>, edited by Peter L. Bartlett, Fernando C. N. Pereira, Christopher J. C. Burges, Léon Bottou, and Kilian Q. Weinberger, 2960–68. <a href="https://proceedings.neurips.cc/paper/2012/hash/05311655a15b75fab86956663e1819cd-Abstract.html">https://proceedings.neurips.cc/paper/2012/hash/05311655a15b75fab86956663e1819cd-Abstract.html</a>.
@@ -1396,7 +1396,7 @@ <h4 class="anchored" data-anchor-id="batch-size">Batch Size</h4>
<h4 class="anchored" data-anchor-id="hardware-characteristics">Hardware Characteristics</h4>
<p>Modern hardware like CPUs and GPUs are highly optimized for computational throughput as opposed to memory bandwidth. For example, high-end H100 Tensor Core GPUs can deliver over 60 TFLOPS of double-precision performance but only provide up to 3 TB/s of memory bandwidth. This means there is almost a 20x imbalance between arithmetic units and memory access. Consequently, for hardware like GPU accelerators, neural network training workloads need to be made as computationally intensive as possible in order to fully utilize the available resources.</p>
<p>This further motivates the need for using large batch sizes during training. When using a small batch, the matrix multiplication is bounded by memory bandwidth, underutilizing the abundant compute resources. However, with sufficiently large batches, we can shift the bottleneck more towards computation and attain much higher arithmetic intensity. For instance, batches of 256 or 512 samples may be needed to saturate a high-end GPU. The downside is that larger batches provide less frequent parameter updates, which can impact convergence. Still, the parameter serves as an important tuning knob to balance memory vs compute limitations.</p>
<p>Therefore, given the imbalanced compute-memory architectures of modern hardware, employing large batch sizes is essential to alleviate bottlenecks and maximize throughput. The subsequent software and algorithms also need to accommodate such batch sizes, as mentioned, since larger batch sizes may have diminishing returns towards the convergence of the network. Using very small batch sizes may lead to suboptimal hardware utilization, ultimately limiting training efficiency. Scaling up to large batch sizes is a topic of research and has been explored in various works that aim to do large scale training. <span class="citation" data-cites="bigbatch">(<a href="../references.html#ref-bigbatch" role="doc-biblioref">You et al. 2018</a>)</span></p>
<p>Therefore, given the imbalanced compute-memory architectures of modern hardware, employing large batch sizes is essential to alleviate bottlenecks and maximize throughput. The subsequent software and algorithms also need to accommodate such batch sizes, as mentioned, since larger batch sizes may have diminishing returns towards the convergence of the network. Using very small batch sizes may lead to suboptimal hardware utilization, ultimately limiting training efficiency. Scaling up to large batch sizes is a topic of research and has been explored in various works that aim to do large scale training <span class="citation" data-cites="bigbatch">(<a href="../references.html#ref-bigbatch" role="doc-biblioref">You et al. 2018</a>)</span>.</p>
<div class="no-row-height column-margin column-container"><div id="ref-bigbatch" class="csl-entry" role="listitem">
You, Yang, Zhao Zhang, Cho-Jui Hsieh, James Demmel, and Kurt Keutzer. 2018. <span><span>ImageNet</span> Training in Minutes.”</span> <a href="https://arxiv.org/abs/1709.05011">https://arxiv.org/abs/1709.05011</a>.
</div></div></section>
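A rough back-of-the-envelope sketch, assuming fp32 operands, a square 4096 by 4096 weight matrix, and each tensor moving through memory exactly once (all assumptions added here, not taken from the chapter), illustrates how the arithmetic intensity of a single matrix multiply rises with batch size toward the roughly 20 FLOPs per byte that a 60 TFLOPS, 3 TB/s accelerator needs to stay compute-bound:

```python
# Illustrative sketch: arithmetic intensity (FLOPs per byte) of Y = X @ W for a
# (B x K) activation against a (K x N) weight, assuming fp32 operands and that
# X, W, and Y each traverse memory exactly once.
BYTES = 4      # fp32
K = N = 4096   # assumed layer width

def arithmetic_intensity(batch):
    flops = 2 * batch * K * N                              # multiply-accumulates
    bytes_moved = BYTES * (batch * K + K * N + batch * N)  # read X, W; write Y
    return flops / bytes_moved

for batch in (1, 8, 64, 256, 512):
    print(f"batch {batch:4d}: {arithmetic_intensity(batch):7.1f} FLOPs/byte")

# An accelerator delivering ~60 TFLOPS over ~3 TB/s needs roughly 20 FLOPs/byte
# to stay compute-bound; batch 1 here sits at about 0.5 FLOPs/byte, far below
# that threshold, while batches in the hundreds comfortably exceed it.
```

Real kernels move data more than once and rarely hit peak bandwidth, so saturation in practice tends to require even larger batches than this idealized model suggests.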