diff --git a/_config.yml b/_config.yml index fad5a5a1..bc4e8f99 100644 --- a/_config.yml +++ b/_config.yml @@ -338,4 +338,4 @@ mdb: js: "sha256-NdbiivsvWt7VYCt6hYNT3h/th9vSTL4EDWeGs5SN3DA=" medium_zoom: version: "1.0.6" - integrity: "sha256-EdPgYcPk/IIrw7FYeuJQexva49pVRZNmt3LculEr7zM=" + integrity: "sha256-EdPgYcPk/IIrw7FYeuJQexva49pVRZNmt3LculEr7zM=" \ No newline at end of file diff --git a/_posts/2023-11-09-eunhae-project.md b/_posts/2023-11-09-eunhae-project.md new file mode 100644 index 00000000..c097754b --- /dev/null +++ b/_posts/2023-11-09-eunhae-project.md @@ -0,0 +1,311 @@ +--- +layout: distill +title: How does model size impact catastrophic forgetting in online continual learning? +description: Yes, model size matters. +date: 2023-11-09 +htmlwidgets: true + +authors: + - name: Eunhae Lee + url: "https://www.linkedin.com/in/eunhaelee/" + affiliations: + name: MIT + +# must be the exact same name as your blogpost +bibliography: 2023-11-09-eunhae-project.bib + +# Add a table of contents to your post. +# - make sure that TOC names match the actual section names +# for hyperlinks within the post to work correctly. +toc: + - name: Introduction + - name: Related Work + - name: Method + - name: Experiment + - name: Results + - name: Discussion + - name: Conclusion + # - name: Appendix +_styles: > + .caption { + font-size: 0.8em; + text-align: center; + color: grey; + } + h1 { + font-size: 2.5em; + margin: 0.3em 0em 0.3em; + } + h2 { + font-size: 2em; + } + h3 { + font-size: 1.5em; + margin-top: 0; + } + .fake-img { + margin-bottom: 12px; + } + .fake-img p { + font-family: monospace; + color: white; + text-align: left; + margin: 12px 0; + text-align: center; + font-size: 16px; + } +--- + + + + + +# Introduction + +One of the biggest unsolved challenges in continual learning is preventing forgetting previously learned information upon acquiring new information. Known as “catastrophic forgetting,” this phenomenon is particularly pertinent in scenarios where AI systems must adapt to new data without losing valuable insights from past experiences. Numerous studies have investigated different approaches to solving this problem in the past years, mostly around proposing innovative strategies to modify the way models are trained and measuring its impact on model performance, such as accuracy and forgetting. + +Yet, compared to the numerous amount of studies done in establishing new strategies and evaluative approaches in visual continual learning, there is surprisingly little discussion on the impact of model size. It is commonly known that the size of a deep learning model (the number of parameters) is known to play a crucial role in its learning capabilities . Given the limitations in computational resources in most real-world circumstances, it is often not practical or feasible to choose the largest model available. In addition, sometimes smaller models perform just as well as larger models in specific contexts. Given this context, a better understanding of how model size impacts performance in a continual learning setting can provide insights and implications on real-world deployment of continual learning systems. + +In this blog post, I explore the following research question: _How do network depth and width impact model performance in an online continual learning setting?_ I set forth a hypothesis based on existing literature and conduct a series experiments with models of varying sizes to explore this relationship. 
This study aims to shed light on whether larger models truly offer an advantage in mitigating catastrophic forgetting, or whether the reality is more nuanced.
+
+
+# Related Work
+### Online continual learning
+Continual learning (CL), also known as lifelong learning or incremental learning, is an approach that seeks to continually learn from non-iid data streams without forgetting previously acquired knowledge. The central challenge in continual learning is known as the stability-plasticity dilemma, and the goal is to strike a balance between learning stability and plasticity.
+
+Traditional CL models assume new data arrives task by task, each task with a stable data distribution, enabling *offline* training. However, this requires access to all task data, which can be impractical due to privacy or resource limitations. In this study, I consider the more realistic setting of Online Continual Learning (OCL), where data arrives in small batches and is not accessible after training, requiring models to learn from a single pass over an online data stream. This allows the model to learn from data in real time.
+
+Online continual learning can involve adapting to new classes (class-incremental) or changing data characteristics (domain-incremental). Specifically, for class-incremental learning, the goal is to continually expand the model's ability to recognize an increasing number of classes, maintaining its performance on all classes it has seen so far despite not having continued access to the old class data. There has also been more recent work on unsupervised continual learning. To narrow the scope of the vast CL landscape and focus on the impact of model size on CL performance, this study concentrates on the more common problem of class-incremental learning in supervised image classification.
+
+### Continual learning techniques
+
+Popular methods to mitigate catastrophic forgetting in continual learning generally fall into three buckets:
+1. *regularization-based* approaches that modify the classification objective to preserve past representations or foster more insightful representations, such as Elastic Weight Consolidation (EWC) and Learning without Forgetting (LwF);
+2. *memory-based* approaches that replay samples retrieved from a memory buffer along with every incoming mini-batch, including Experience Replay (ER) and Maximally Interfered Retrieval, with variations on how the memory is retrieved and how the model and memory are updated; and
+3. *architectural* approaches, including parameter-isolation methods where new parameters are added for new tasks while previous parameters are left unchanged, such as Progressive Neural Networks (PNNs).
+
+Moreover, many methods combine two or more of these techniques, such as Averaged Gradient Episodic Memory (A-GEM) and Incremental Classifier and Representation Learning (iCaRL).
+
+Among these methods, **Experience Replay (ER)** is a classic replay-based method that is widely used for online continual learning. Despite its simplicity, recent studies have shown that ER still outperforms many of the methods proposed after it, especially in online continual learning.
+
+
+### Model size and performance
+
+It is generally accepted in the literature that deeper models tend to perform better. Bianco et al. conducted a survey of key performance-related metrics across various architectures, including accuracy, model complexity, computational complexity, and accuracy density.
The relationship between model width and performance has also been discussed, albeit less frequently.
+
+He et al. introduced Residual Networks (ResNets), a major innovation in computer vision that tackles the degradation problem in deeper networks. ResNets do this with residual blocks, which increase the accuracy of deeper models. Residual blocks containing two or more layers are stacked together, and "skip connections" are used between these blocks. The skip connections act as an alternate shortcut for the gradient to pass through, which alleviates the vanishing-gradient problem. They also make it easier for the model to learn identity functions. As a result, ResNets improve the efficiency of deep neural networks with more layers while keeping error rates low. The authors compare models of different depths (18, 34, 50, 101, and 152 layers) and show that accuracy increases with the depth of the model.
+
+
+
+| | **ResNet18** | **ResNet34** | **ResNet50** | **ResNet101** | **ResNet152** |
+|:------------------------:|:-------------:|:-------------:|:-------------:|:-------------:|:-------------:|
+| **Number of Layers** | 18 | 34 | 50 | 101 | 152 |
+| **Number of Parameters** | ~11.7 million | ~21.8 million | ~25.6 million | ~44.5 million | ~60 million |
+| **Top-1 Accuracy** | 69.76% | 73.31% | 76.13% | 77.37% | 78.31% |
+| **Top-5 Accuracy** | 89.08% | 91.42% | 92.86% | 93.68% | 94.05% |
+| **FLOPs** | 1.8 billion | 3.6 billion | 3.8 billion | 7.6 billion | 11.3 billion |
+
<div class="caption">Table 1: Comparison of ResNet Architectures</div>
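As a quick sanity check on the parameter counts in Table 1, the snippet below instantiates the same ResNet variants with torchvision and counts their trainable parameters. This is a minimal sketch (assuming torchvision ≥ 0.13 for the `weights=None` argument); exact counts shift slightly with the number of output classes.

```python
import torch
from torchvision.models import resnet18, resnet34, resnet50

def count_parameters(model: torch.nn.Module) -> int:
    # Total number of trainable parameters in the network.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# ImageNet-style heads (1000 classes), randomly initialized.
for name, builder in [("ResNet18", resnet18), ("ResNet34", resnet34), ("ResNet50", resnet50)]:
    model = builder(weights=None)
    print(f"{name}: {count_parameters(model) / 1e6:.1f}M parameters")
# Prints roughly 11.7M, 21.8M, and 25.6M, matching the parameter counts in Table 1.
```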
+
+This leads to the question: do larger models perform better in continual learning? While much of the focus in continual learning research has been on developing strategies, methods, and benchmarks, the impact of model scale remains a less explored path.
+
+Moreover, recent studies on model scale in slightly different contexts have shown conflicting results. Luo et al. highlight a direct correlation between increasing model size and the severity of catastrophic forgetting in large language models (LLMs), testing models ranging from 1 to 7 billion parameters. Yet Dyer et al. offer a contrasting perspective in the context of pretrained deep learning models. Their results show that large, pretrained ResNets and Transformers are significantly more resistant to forgetting than randomly-initialized, trained-from-scratch models, and that this resistance increases with both model scale and pretraining dataset size.
+
+The relative lack of discussion on model size and the conflicting perspectives among existing studies indicate that the answer to this question is far from definitive. In the next section, I describe how I approach this study.
+
+
+
+# Method
+### Problem definition
+
+Online continual learning can be defined as follows:
+
+The objective is to learn a function $f_\theta : \mathcal X \rightarrow \mathcal Y$ with parameters $\theta$ that predicts the label $Y \in \mathcal Y$ of the input $\mathbf X \in \mathcal X$. Over time steps $t \in \lbrace 1, 2, \ldots \infty \rbrace$, a distribution-varying stream $\mathcal S$ reveals data sequentially, which is different from classical supervised learning.
+
+At every time step,
+
+1. $\mathcal S$ reveals a set of data points (images) $\mathbf X_t \sim \pi_t$ from a non-stationary distribution $\pi_t$
+2. Learner $f_\theta$ makes predictions $\hat Y_t$ based on current parameters $\theta_t$
+3. $\mathcal S$ reveals true labels $Y_t$
+4. Learner compares the predictions with the true labels and computes the training loss $L(Y_t, \hat Y_t)$
+5. Learner updates the parameters of the model to $\theta_{t+1}$
+
+
+### Task-agnostic and boundary-agnostic
+In the context of class-incremental learning, I adopt the definitions of task-agnostic and boundary-agnostic from Soutif et al. 2023. A *task-agnostic* setting is one in which task labels are not available, meaning the model does not know which task a given sample belongs to. A *boundary-agnostic* setting is one in which information on task boundaries is not available, meaning the model does not know when the data distribution changes to a new task.
+
+| | **Yes** | **No** |
+|:-------------------:|:--------------:|:-----------------:|
+| **Task labels** | Task-aware | Task-agnostic |
+| **Task boundaries** | Boundary-aware | Boundary-agnostic |
+
<div class="caption">Table 2: Task labels and task boundaries. This project assumes task-agnostic and boundary-agnostic settings.</div>
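To make this setting concrete, below is a minimal sketch of the online protocol from the problem definition above, written in PyTorch. Note that the loop never touches task labels or boundaries, which is exactly the task-agnostic, boundary-agnostic assumption; the `model` and `stream` objects are placeholders rather than the exact code used later in the experiments.

```python
import torch
import torch.nn.functional as F

def online_learning_loop(model, stream, lr=0.01, momentum=0.9):
    """Single pass over an online data stream (steps 1-5 of the protocol above).

    `stream` is assumed to yield (x_t, y_t) mini-batches drawn from a
    non-stationary distribution; no task labels or boundaries are used.
    """
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    model.train()
    for x_t, y_t in stream:                   # 1. stream reveals a batch X_t ~ pi_t
        logits = model(x_t)                   # 2. learner predicts with current theta_t
        loss = F.cross_entropy(logits, y_t)   # 3.-4. labels revealed, loss L(Y_t, Y_hat_t)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                      # 5. parameters updated to theta_{t+1}
    return model
```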
+
+
+### Experience Replay (ER)
+In a class-incremental learning setting, the Experience Replay (ER) method aligns well with task-agnostic and boundary-agnostic settings. This is because ER replays a subset of past experiences, which helps maintain knowledge of previous classes without needing explicit task labels or boundaries. This allows ER to adapt to new classes as they are introduced while retaining the ability to recognize previously learned classes, making it inherently suitable for task-agnostic and boundary-agnostic continual learning scenarios.
+
+Implementation-wise, ER involves randomly initializing an external memory buffer $\mathcal M$, then implementing `before_training_exp` and `after_training_exp` callbacks so that the dataloader creates mini-batches with samples from both the training stream and the memory buffer. Each mini-batch is balanced so that all tasks or experiences are equally represented in terms of stored samples. As ER is known to be well-suited for online continual learning, it is the go-to method used here to compare performance across models of varying sizes.
+
+### Benchmark
+For this study, SplitCIFAR-10 is used as the main benchmark. SplitCIFAR-10 splits the popular CIFAR-10 dataset into 5 tasks with disjoint classes, each task containing 2 classes. Each task has 10,000 3×32×32 images for training and 2,000 images for testing. The model is exposed to these tasks or experiences sequentially, which simulates a real-world scenario where a learning system encounters new categories of data over time. This setup is suitable for class-incremental learning. The benchmark is used for testing both online and offline continual learning in this study.
+
+### Metrics
+
+Key metrics established in earlier work on online continual learning are used to evaluate the performance of each model.
+
+**Average Anytime Accuracy (AAA)**
+as defined in
+
+The concept of average anytime accuracy serves as an indicator of a model's overall performance throughout its learning phase, extending the idea of average incremental accuracy to continuous assessment scenarios. This metric assesses the effectiveness of the model across all stages of training, rather than at a single endpoint, offering a more comprehensive view of its learning trajectory.
+
+$$\text{AAA} = \frac{1}{T} \sum_{t=1}^{T} (\text{AA})_t$$
+
+**Average Cumulative Forgetting (ACF)** as defined in
+
+The equation below defines the **Cumulative Accuracy** ($b_k^t$) for task $k$ after the model has been trained up to task $t$. It computes the mean accuracy over the evaluation set $E^k_\Sigma$, which contains all instances $x$ and their true labels $y$ up to task $k$. The model's prediction for each instance is given by $\underset{c \in C^k_\Sigma}{\text{arg max }} f^t(x)_c$, which selects the class $c$ with the highest predicted logit $f^t(x)_c$. The indicator function $1_y(\hat{y})$ outputs 1 if the prediction matches the true label and 0 otherwise. The sum of these outputs is then averaged over the size of the evaluation set to compute the cumulative accuracy.
+
+
+$$ b_k^t = \frac{1}{|E^k_\Sigma|} \sum_{(x,y) \in E^k_\Sigma} 1_y(\underset{c \in C^k_\Sigma}{\text{arg max }} f^t(x)_c)$$
+
+From Cumulative Accuracy, we can calculate the **Average Cumulative Forgetting** ($F_{\Sigma}^t$) by taking, for each previously seen task $k$, the maximum drop in its cumulative accuracy, then averaging over all tasks learned so far:
+
+$$F_{\Sigma}^t = \frac{1}{t-1} \sum_{k=1}^{t-1} \max_{i=1,...,t} \left( b_k^i - b_k^t \right)$$
+
+**Average Accuracy (AA) and Average Forgetting (AF)**
+as defined in
+
+$a_{i,j}$ is the accuracy evaluated on the test set of task $j$ after training the network from task 1 to $i$, where $i$ is the current task being trained. Average Accuracy (AA) is computed by averaging this over the number of tasks.
+
+$$\text{Average Accuracy} (AA_i) = \frac{1}{i} \sum_{j=1}^{i} a_{i,j}$$
+
+Average Forgetting measures how much a model's performance on a previous task $j$ decreases after it has learned a new task $k$. It is calculated by comparing the highest accuracy $\max_{l \in \{1, \ldots, k-1\}} (a_{l, j})$ the model had on task $j$ before learning task $k$ with the accuracy $a_{k, j}$ on task $j$ after learning task $k$.
+
+$$\text{Average Forgetting}(F_i) = \frac{1}{i - 1} \sum_{j=1}^{i-1} f_{i,j} $$
+
+$$f_{k,j} = \max_{l \in \{1,...,k-1\}} (a_{l,j}) - a_{k,j}, \quad \forall j < k$$
+
+In the context of class-incremental learning, classical forgetting may not provide meaningful insight because it tends to increase as the complexity of the task grows (as more classes are added to the classification problem). Therefore, prior work recommends avoiding reliance on classical forgetting as a metric in class-incremental learning, in both online and offline settings. Thus, Average Anytime Accuracy (AAA) and Average Cumulative Forgetting (ACF) are used throughout this experiment, although AA and AF are computed as part of the process.
+
+### Model selection
+To compare learning performance across varying model depths, I chose the popular ResNet architectures, particularly ResNet18, ResNet34, and ResNet50. As mentioned earlier in this blog, ResNets were designed to increase the performance of deeper neural networks, and their performance metrics are well known. While using custom models for more variability in sizes was a consideration, existing popular architectures were chosen for better reproducibility.
+
+Moreover, while there are newer variants (e.g., ResNeXt) that have been shown to perform better without a large increase in computational complexity, for this study the original, smaller models were chosen to avoid introducing unnecessary variables. ResNet18 and ResNet34 use the basic residual block, whereas ResNet50, ResNet101, and ResNet152 use slightly modified building blocks that have 3 layers instead of 2. This “bottleneck design” was introduced to reduce training time. The specifics of these models are detailed in the table from the original paper by He et al.
+
+{% include figure.html path="assets/img/2023-11-09-eunhae-project/resnets_comparison.png" class="img-fluid" caption="ResNet architecture. Table from He et al. (2015)"%}
+
+Moreover, in order to observe the effect of model width on performance, I also test a slim version of ResNet18 that has been used in previous works. The slim version uses fewer filters per layer, reducing the model width and computational load while keeping the original depth.
+
+### Saliency maps
+
+I use saliency maps to visualize the “attention” of the networks.
Saliency maps are known to be useful for understanding which parts of the input image are most influential for the model's predictions. By visualizing the specific areas of an image that a CNN considers important for classification, saliency maps provide insights into the internal representation and decision-making process of the network.
+
+
+# Experiment
+
+### The setup
+
+- Each model was trained from scratch using the Split-CIFAR10 benchmark with 2 classes per task, for 3 epochs with a mini-batch size of 64.
+- The SGD optimizer with momentum 0.9 and weight decay 1e-5 was used. The initial learning rate is set to 0.01 and the scheduler reduces it by a factor of 0.1 every 30 epochs, as done in .
+- Cross-entropy loss is used as the criterion, as is common for image classification in continual learning.
+- Basic data augmentation is applied to the training data to enhance model robustness and generalization by artificially expanding the dataset with varied, modified versions of the original images.
+- Each model is also trained offline to serve as a baseline.
+- A memory size of 500 is used to implement Experience Replay, which represents 1% of the training dataset.
+
+
+### Implementation
+
+The continual learning benchmark was implemented using the Avalanche framework, an open-source continual learning library, as well as the code for online continual learning by Soutif et al. The experiments were run on Google Colab using NVIDIA GPUs (see Table 3).
+
+| | **Experiment 1** | **Experiment 2** | **Experiment 3** | **Experiment 4** | **Experiment 5** | **Experiment 6** | **Experiment 7** |
+|:----------------------------:|:-----------------:|:-----------------:|:-----------------:|:-----------------:|:-----------------:|:-----------------:|:-----------------:|
+| **Model** | ResNet18 | ResNet34 | ResNet50 | SlimResNet18 | ResNet18 | ResNet34 | ResNet50 |
+| **Strategy** | Experience Replay | Experience Replay | Experience Replay | Experience Replay | Experience Replay | Experience Replay | Experience Replay |
+| **Benchmark** | SplitCIFAR10 | SplitCIFAR10 | SplitCIFAR10 | SplitCIFAR10 | SplitCIFAR10 | SplitCIFAR10 | SplitCIFAR10 |
+| **Training** | Online | Online | Online | Online | Offline | Offline | Offline |
+| **GPU** | V100 | T4 | A100 | T4 | T4 | T4 | T4 |
+| **Training time (estimate)** | 3h | 4.5h | 5h | 1h | <5m | <5m | <5m |
+
<div class="caption">Table 3: Details of experiments conducted in this study</div>
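For reference, the sketch below shows how one of the online runs in Table 3 can be wired up with Avalanche, pairing the `Naive` strategy with a `ReplayPlugin` for ER. Class and module names follow recent Avalanche releases and may differ across versions, and the actual experiments additionally rely on the ocl_survey code by Soutif et al., so treat this as an approximation of the setup rather than the exact implementation.

```python
import torch
from avalanche.benchmarks.classic import SplitCIFAR10
from avalanche.models import SlimResNet18
from avalanche.training.plugins import ReplayPlugin
from avalanche.training.supervised import Naive

# 5 experiences with 2 disjoint classes each, as described in the Benchmark section.
benchmark = SplitCIFAR10(n_experiences=5, seed=0)

model = SlimResNet18(nclasses=10)  # swap in ResNet18/34/50 for the other experiments
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-5)
criterion = torch.nn.CrossEntropyLoss()

strategy = Naive(
    model, optimizer, criterion,
    train_mb_size=64, train_epochs=3, eval_mb_size=64,
    plugins=[ReplayPlugin(mem_size=500)],  # Experience Replay with a 500-sample buffer
    device="cuda" if torch.cuda.is_available() else "cpu",
)

# Train on the experiences sequentially and evaluate on the full test stream.
for experience in benchmark.train_stream:
    strategy.train(experience)
    strategy.eval(benchmark.test_stream)
```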
+
+
+# Results
+
+Average Anytime Accuracy (AAA) decreases with model size (Chart 1), with a sharper drop from ResNet34 to ResNet50. The decrease in AAA is more significant in online learning than in offline learning.
+
+{% include figure.html path="assets/img/2023-11-09-eunhae-project/AAA_on_off.png" class="img-fluid" caption="Chart 1: Average Anytime Accuracy (AAA) of different sized ResNets in online and offline continual learning"%}
+
+Looking at the average accuracy on the validation stream in the online CL setting (Chart 2), we see that the rate at which accuracy increases with each task degrades with larger models. Slim-ResNet18 shows the highest accuracy and the strongest growth trend. This could indicate that larger models are worse at generalizing in a class-incremental learning scenario.
+
+{% include figure.html path="assets/img/2023-11-09-eunhae-project/stream_acc1.png" class="img-fluid" caption="Chart 2: Validation stream accuracy (Online CL)"%}
+
+| | **Average Anytime Acc (AAA)** | **Final Average Acc** |
+|:-----------------:|:-----------------------------:|:---------------------:|
+| **Slim ResNet18** | 0.664463 | 0.5364 |
+| **ResNet18** | 0.610965 | 0.3712 |
+| **ResNet34** | 0.576129 | 0.3568 |
+| **ResNet50** | 0.459375 | 0.3036 |
+
<div class="caption">Table 4: Accuracy metrics across differently sized models (Online CL)</div>
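To make the connection between these numbers and the metric definitions explicit, here is a small, self-contained helper that computes Average Accuracy and Average Anytime Accuracy from a matrix of task accuracies. It is a simplified stand-in for the Avalanche/ocl_survey evaluation code, assuming `acc[i, j]` holds the accuracy on task `j` after training up to task `i`.

```python
import numpy as np

def average_accuracy(acc: np.ndarray, i: int) -> float:
    # AA_i: mean accuracy over tasks j = 0..i after training on task i.
    return float(acc[i, : i + 1].mean())

def average_anytime_accuracy(acc: np.ndarray) -> float:
    # AAA: mean of AA_t over all evaluation points t seen so far.
    return float(np.mean([average_accuracy(acc, t) for t in range(acc.shape[0])]))

# Toy accuracy matrix for 3 tasks: acc[i, j] = accuracy on task j after training task i.
acc = np.array([
    [0.90, 0.00, 0.00],
    [0.55, 0.85, 0.00],
    [0.40, 0.50, 0.80],
])
print("Final Average Accuracy:", average_accuracy(acc, 2))          # ~0.567
print("Average Anytime Accuracy:", average_anytime_accuracy(acc))   # ~0.722
```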
+
+Now we turn to forgetting.
+
+Looking at Average Cumulative Forgetting (ACF), we see that in the online CL setting, ResNet34 performs best (with a slight overlap at the end with ResNet18), while ResNet50 shows the most forgetting. A noticeable observation in both ACF and AF is that ResNet50 performed better initially, but its forgetting started to increase after a few tasks.
+
+{% include figure.html path="assets/img/2023-11-09-eunhae-project/forgetting_online.png" class="img-fluid" caption="Chart 3: Forgetting curves, Online CL (Solid: Average Forgetting (AF); Dotted: Average Cumulative Forgetting (ACF))"%}
+
+However, results look different in the offline CL setting. ResNet50 has the lowest Average Cumulative Forgetting (ACF) (albeit with a slight increase in the middle), followed by ResNet18, and finally ResNet34. This difference in forgetting between the online and offline CL settings aligns with the accuracy metrics above, where the performance of ResNet50 decreases more starkly in the online CL setting.
+
+{% include figure.html path="assets/img/2023-11-09-eunhae-project/forgetting_offline.png" class="img-fluid" caption="Chart 4: Forgetting curves, Offline CL (Solid: Average Forgetting (AF); Dotted: Average Cumulative Forgetting (ACF))"%}
+
+
+Visual inspection of the saliency maps revealed some interesting observations. When it comes to the ability to highlight intuitive areas of interest in the images, there seemed to be a noticeable improvement from ResNet18 to ResNet34, but this was not necessarily the case from ResNet34 to ResNet50. This phenomenon was more pronounced in the online CL setting.
+
+
+**Online**
+
+{% include figure.html path="assets/img/2023-11-09-eunhae-project/saliency_online.png" class="img-fluid" caption="Image: Saliency map visualizations for Online CL"%}
+
+
+**Offline**
+
+{% include figure.html path="assets/img/2023-11-09-eunhae-project/saliency_offline.png" class="img-fluid" caption="Image: Saliency map visualization for Offline CL"%}
+
+Interestingly, Slim-ResNet18 seems to do better than most of the other models, certainly better than its plain counterpart ResNet18. A further exploration of the effect of model width on performance and representation quality would be an interesting avenue of research.
+
+**Slim-ResNet18**
+
+{% include figure.html path="assets/img/2023-11-09-eunhae-project/saliencymap_exp4.png" class="img-fluid" caption="Image: Saliency map visualization (Slim ResNet18)"%}
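The saliency maps above follow the classic input-gradient approach of Simonyan et al.; the sketch below shows the core computation. Function and variable names are illustrative and not the exact visualization code used to generate these figures.

```python
import torch

def vanilla_saliency(model, image, target_class=None):
    """Input-gradient saliency map for a single image tensor of shape (C, H, W)."""
    model.eval()
    x = image.unsqueeze(0).clone().requires_grad_(True)   # add batch dimension
    logits = model(x)
    if target_class is None:
        target_class = logits.argmax(dim=1).item()        # explain the predicted class
    logits[0, target_class].backward()                    # d(class score) / d(input)
    saliency = x.grad.detach().abs().max(dim=1).values    # max over color channels
    return saliency.squeeze(0)                            # (H, W) map, ready to overlay
```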
+
+
+# Discussion
+
+In this study, I compared key accuracy and forgetting metrics in online continual learning across ResNets of different depths and widths, along with a brief qualitative inspection of the models' internal representations. These results show that larger models do not necessarily lead to better continual learning performance. Average Anytime Accuracy (AAA) and stream accuracy dropped progressively with model size, hinting that larger models struggle to generalize to newly trained tasks, especially in an online CL setting. Forgetting curves showed similar trends but with more nuance: larger models perform well at first but suffer from increased forgetting as more tasks arrive. Interestingly, the problem was not as pronounced in the offline CL setting, which highlights the challenges of training models in a more realistic, online continual learning context.
+
+Why do larger models perform worse at continual learning? One possible reason is that models with more parameters find it harder to maintain stability in their learned features as new data is introduced. This makes them more prone to overfitting and to forgetting previously learned information, reducing their ability to generalize.
+
+Building on this work, future research could investigate the impact of model size on CL performance by exploring the following questions:
+
+- Do larger pre-trained models (vs. trained-from-scratch models) generalize better in continual learning settings?
+- Does longer training improve the relative performance of larger models in a CL setting?
+- Can different CL strategies (other than Experience Replay) mitigate the degradation of performance in larger models?
+- Do slimmer versions of existing models always perform better?
+- How might different hyperparameters (e.g., learning rate) impact the CL performance of larger models?
+
+# Conclusion
+
+To conclude, this study has empirically explored the effect of model size on performance in the context of online continual learning. Specifically, it has shown that model size matters when it comes to continual learning and forgetting, albeit in nuanced ways. These findings contribute to the ongoing discussion on how the scale of deep learning models affects performance and have implications for future areas of research.
diff --git a/_posts/2022-12-01-distill-example.md b/_posts/template.md
similarity index 99%
rename from _posts/2022-12-01-distill-example.md
rename to _posts/template.md
index 2d133452..17bfa4f5 100644
--- a/_posts/2022-12-01-distill-example.md
+++ b/_posts/template.md
@@ -428,4 +428,4 @@ Here's a line for us to start with.
 This line is separated from the one above by two newlines, so it will be a *separate paragraph*.
 
 This line is also a separate paragraph, but...
-This line is only separated by a single newline, so it's a separate line in the *same paragraph*.
+This line is only separated by a single newline, so it's a separate line in the *same paragraph*.
\ No newline at end of file
diff --git a/assets/bibliography/2023-11-09-eunhae-project.bib b/assets/bibliography/2023-11-09-eunhae-project.bib
index 56d762d5..8f1259da 100644
--- a/assets/bibliography/2023-11-09-eunhae-project.bib
+++ b/assets/bibliography/2023-11-09-eunhae-project.bib
@@ -1,3 +1,147 @@
+@article{Bressem_2020,
+ title={Comparing different deep learning architectures for classification of chest radiographs},
+ url={http://dx.doi.org/10.1038/s41598-020-70479-z},
+ DOI={10.1038/s41598-020-70479-z},
+ journal={Scientific Reports},
+ publisher={Springer Science and Business Media LLC},
+ author={Bressem, Keno K. and Adams, Lisa C. and Erxleben, Christoph and Hamm, Bernd and Niehues, Stefan M.
and Vahldiek, Janis L.}, + year={2020} + } + + +@misc{hu2021model, + title={Model Complexity of Deep Learning: A Survey}, + author={Xia Hu and Lingyang Chu and Jian Pei and Weiqing Liu and Jiang Bian}, + year={2021}, + eprint={2103.05127}, + archivePrefix={arXiv}, + primaryClass={cs.LG} +} + +@misc{xie2017aggregated, + title={Aggregated Residual Transformations for Deep Neural Networks}, + author={Saining Xie and Ross Girshick and Piotr Dollár and Zhuowen Tu and Kaiming He}, + year={2017}, + eprint={1611.05431}, + archivePrefix={arXiv}, + primaryClass={cs.CV} +} + +@article{dyer2022, + title={Effect of scale on catastrophic forgetting in neural networks}, + author={Ethan Dyer and Aitor Lewkowycz and Vinay Ramasesh}, + year={2022}, + URL={https://openreview.net/forum?id=GhVS8_yPeEa}, + booktitle={ICLR}, +} + + +@misc{luo2023empirical, + title={An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning}, + author={Yun Luo and Zhen Yang and Fandong Meng and Yafu Li and Jie Zhou and Yue Zhang}, + year={2023}, + archivePrefix={arXiv}, +} + +@misc{soutifcormerais2021importance, + title={On the importance of cross-task features for class-incremental learning}, + author={Albin Soutif--Cormerais and Marc Masana and Joost Van de Weijer and Bartłomiej Twardowski}, + year={2021}, + eprint={2106.11930}, + archivePrefix={arXiv}, +} + +@article{Bianco_2018, + title={Benchmark Analysis of Representative Deep Neural Network Architectures}, + volume={6}, + ISSN={2169-3536}, + url={http://dx.doi.org/10.1109/ACCESS.2018.2877890}, + DOI={10.1109/access.2018.2877890}, + journal={IEEE Access}, + publisher={Institute of Electrical and Electronics Engineers (IEEE)}, + author={Bianco, Simone and Cadene, Remi and Celona, Luigi and Napoletano, Paolo}, + year={2018}, + pages={64270–64277} } + + +@misc{he2015deep, + title={Deep Residual Learning for Image Recognition}, + author={Kaiming He and Xiangyu Zhang and Shaoqing Ren and Jian Sun}, + year={2015}, + eprint={1512.03385}, + archivePrefix={arXiv}, + primaryClass={cs.CV} +} + +@misc{chaudhry2019efficient, + title={Efficient Lifelong Learning with A-GEM}, + author={Arslan Chaudhry and Marc'Aurelio Ranzato and Marcus Rohrbach and Mohamed Elhoseiny}, + year={2019}, + eprint={1812.00420}, + archivePrefix={arXiv}, + primaryClass={cs.LG} +} + +@misc{rebuffi2017icarl, + title={iCaRL: Incremental Classifier and Representation Learning}, + author={Sylvestre-Alvise Rebuffi and Alexander Kolesnikov and Georg Sperl and Christoph H. Lampert}, + year={2017}, + eprint={1611.07725}, + archivePrefix={arXiv}, + primaryClass={cs.CV} +} + +@misc{aljundi2019online, + title={Online Continual Learning with Maximally Interfered Retrieval}, + author={Rahaf Aljundi and Lucas Caccia and Eugene Belilovsky and Massimo Caccia and Min Lin and Laurent Charlin and Tinne Tuytelaars}, + year={2019}, + eprint={1908.04742}, + archivePrefix={arXiv}, + primaryClass={cs.LG} +} + +@misc{chaudhry2019tiny, + title={On Tiny Episodic Memories in Continual Learning}, + author={Arslan Chaudhry and Marcus Rohrbach and Mohamed Elhoseiny and Thalaiyasingam Ajanthan and Puneet K. Dokania and Philip H. S. Torr and Marc'Aurelio Ranzato}, + year={2019}, + eprint={1902.10486}, + archivePrefix={arXiv}, + primaryClass={cs.LG} +} + +@misc{rusu2022progressive, + title={Progressive Neural Networks}, + author={Andrei A. Rusu and Neil C. 
Rabinowitz and Guillaume Desjardins and Hubert Soyer and James Kirkpatrick and Koray Kavukcuoglu and Razvan Pascanu and Raia Hadsell}, + year={2022}, + eprint={1606.04671}, + archivePrefix={arXiv}, + primaryClass={cs.LG} +} + +@article{mermillod2013-dilemma, + title={The stability-plasticity dilemma: investigating the continuum from catastrophic forgetting to age-limited learning effects}, + author={Martial Mermillod and Aur{\'e}lia Bugaiska and Patrick Bonin}, + journal={Frontiers in Psychology}, + year={2013}, + volume={4}, + url={https://api.semanticscholar.org/CorpusID:8262392} +} + +@misc{simonyan2014deep, + title={Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps}, + author={Karen Simonyan and Andrea Vedaldi and Andrew Zisserman}, + year={2014}, + eprint={1312.6034}, + archivePrefix={arXiv}, + primaryClass={cs.CV} +} + +@InProceedings{lomonaco2021avalanche, + title={Avalanche: an End-to-End Library for Continual Learning}, + author={Vincenzo Lomonaco and Lorenzo Pellegrini and Andrea Cossu and Antonio Carta and Gabriele Graffieti and Tyler L. Hayes and Matthias De Lange and Marc Masana and Jary Pomponi and Gido van de Ven and Martin Mundt and Qi She and Keiland Cooper and Jeremy Forest and Eden Belouadah and Simone Calderara and German I. Parisi and Fabio Cuzzolin and Andreas Tolias and Simone Scardapane and Luca Antiga and Subutai Amhad and Adrian Popescu and Christopher Kanan and Joost van de Weijer and Tinne Tuytelaars and Davide Bacciu and Davide Maltoni}, + booktitle={Proceedings of IEEE Conference on Computer Vision and Pattern Recognition}, + series={2nd Continual Learning in Computer Vision Workshop}, + year={2021} @article{luo2023empirical, title={An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning}, author={Luo, Yun and Yang, Zhen and Meng, Fandong and Li, Yafu and Zhou, Jie and Zhang, Yue}, @@ -18,6 +162,346 @@ @article{kirkpatrick2017overcoming URL = {https://www.pnas.org/doi/abs/10.1073/pnas.1611835114} } +@article{perkonigg_dynamic_2021, + title = {Dynamic memory to alleviate catastrophic forgetting in continual learning with medical imaging}, + volume = {12}, + issn = {2041-1723}, + url = {https://www.nature.com/articles/s41467-021-25858-z}, + doi = {10.1038/s41467-021-25858-z}, + urldate = {2023-11-09}, + journal = {Nature Communications}, + author = {Perkonigg, Matthias and Hofmanninger, Johannes and Herold, Christian J. and Brink, James A. and Pianykh, Oleg and Prosch, Helmut and Langs, Georg}, + month = sep, + year = {2021}, +} + +@article{ramasesh_effect_2022, + title = {Effect of model and pretraining scale on catastrophic forgetting in neural networks}, + abstract = {Catastrophic forgetting presents a challenge in developing deep learning models capable of continual learning, i.e. learning tasks sequentially. Recently, both computer vision and natural-language processing have witnessed great progress through the use of large-scale pretrained models. In this work, we present an empirical study of catastrophic forgetting in this pretraining paradigm. Our experiments indicate that large, pretrained ResNets and Transformers are significantly more resistant to forgetting than randomly-initialized, trained-from-scratch models; this robustness systematically improves with scale of both model and pretraining dataset size. 
We take initial steps towards characterizing what aspect of model representations allows them to perform continual learning so well, finding that in the pretrained models, distinct class representations grow more orthogonal with scale. Our results suggest that, when possible, scale and a diverse pretraining dataset can be useful ingredients in mitigating catastrophic forgetting.}, + language = {en}, + author = {Ramasesh, Vinay and Lewkowycz, Aitor and Dyer, Ethan}, + year = {2022}, +} + +@misc{kamra_deep_2018, + title = {Deep {Generative} {Dual} {Memory} {Network} for {Continual} {Learning}}, + url = {http://arxiv.org/abs/1710.10368}, + abstract = {Despite advances in deep learning, neural networks can only learn multiple tasks when trained on them jointly. When tasks arrive sequentially, they lose performance on previously learnt tasks. This phenomenon called catastrophic forgetting is a fundamental challenge to overcome before neural networks can learn continually from incoming data. In this work, we derive inspiration from human memory to develop an architecture capable of learning continuously from sequentially incoming tasks, while averting catastrophic forgetting. Specifically, our contributions are: (i) a dual memory architecture emulating the complementary learning systems (hippocampus and the neocortex) in the human brain, (ii) memory consolidation via generative replay of past experiences, (iii) demonstrating advantages of generative replay and dual memories via experiments, and (iv) improved performance retention on challenging tasks even for low capacity models. Our architecture displays many characteristics of the mammalian memory and provides insights on the connection between sleep and learning.}, + language = {en}, + urldate = {2023-11-23}, + publisher = {arXiv}, + author = {Kamra, Nitin and Gupta, Umang and Liu, Yan}, + month = may, + year = {2018}, + note = {arXiv:1710.10368 [cs]}, + keywords = {Computer Science - Machine Learning}, +} + +@misc{wang_comprehensive_2023, + title = {A {Comprehensive} {Survey} of {Continual} {Learning}: {Theory}, {Method} and {Application}}, + shorttitle = {A {Comprehensive} {Survey} of {Continual} {Learning}}, + url = {http://arxiv.org/abs/2302.00487}, + abstract = {To cope with real-world dynamics, an intelligent agent needs to incrementally acquire, update, accumulate, and exploit knowledge throughout its lifetime. This ability, known as continual learning, provides a foundation for AI systems to develop themselves adaptively. In a general sense, continual learning is explicitly limited by catastrophic forgetting, where learning a new task usually results in a dramatic performance degradation of the old tasks. Beyond this, increasingly numerous advances have emerged in recent years that largely extend the understanding and application of continual learning. The growing and widespread interest in this direction demonstrates its realistic significance as well as complexity. In this work, we present a comprehensive survey of continual learning, seeking to bridge the basic settings, theoretical foundations, representative methods, and practical applications. Based on existing theoretical and empirical results, we summarize the general objectives of continual learning as ensuring a proper stability-plasticity trade-off and an adequate intra/inter-task generalizability in the context of resource efficiency. 
Then we provide a state-of-the-art and elaborated taxonomy, extensively analyzing how representative strategies address continual learning, and how they are adapted to particular challenges in various applications. Through an in-depth discussion of promising directions, we believe that such a holistic perspective can greatly facilitate subsequent exploration in this field and beyond.}, + language = {en}, + urldate = {2023-11-24}, + publisher = {arXiv}, + author = {Wang, Liyuan and Zhang, Xingxing and Su, Hang and Zhu, Jun}, + month = jun, + year = {2023}, + note = {arXiv:2302.00487 [cs]}, + keywords = {Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition}, +} + +@incollection{ferrari_lifelong_2018, + address = {Cham}, + title = {Lifelong {Learning} via {Progressive} {Distillation} and {Retrospection}}, + volume = {11207}, + isbn = {978-3-030-01218-2 978-3-030-01219-9}, + url = {https://link.springer.com/10.1007/978-3-030-01219-9_27}, + abstract = {Lifelong learning aims at adapting a learned model to new tasks while retaining the knowledge gained earlier. A key challenge for lifelong learning is how to strike a balance between the preservation on old tasks and the adaptation to a new one within a given model. Approaches that combine both objectives in training have been explored in previous works. Yet the performance still suffers from considerable degradation in a long sequence of tasks. In this work, we propose a novel approach to lifelong learning, which tries to seek a better balance between preservation and adaptation via two techniques: Distillation and Retrospection. Specifically, the target model adapts to the new task by knowledge distillation from an intermediate expert, while the previous knowledge is more effectively preserved by caching a small subset of data for old tasks. The combination of Distillation and Retrospection leads to a more gentle learning curve for the target model, and extensive experiments demonstrate that our approach can bring consistent improvements on both old and new tasks4 .}, + language = {en}, + urldate = {2023-11-24}, + booktitle = {Computer {Vision} – {ECCV} 2018}, + publisher = {Springer International Publishing}, + author = {Hou, Saihui and Pan, Xinyu and Loy, Chen Change and Wang, Zilei and Lin, Dahua}, + editor = {Ferrari, Vittorio and Hebert, Martial and Sminchisescu, Cristian and Weiss, Yair}, + year = {2018}, + doi = {10.1007/978-3-030-01219-9_27}, +} + +@misc{wang_learning_2022, + title = {Learning to {Prompt} for {Continual} {Learning}}, + url = {http://arxiv.org/abs/2112.08654}, + abstract = {The mainstream paradigm behind continual learning has been to adapt the model parameters to non-stationary data distributions, where catastrophic forgetting is the central challenge. Typical methods rely on a rehearsal buffer or known task identity at test time to retrieve learned knowledge and address forgetting, while this work presents a new paradigm for continual learning that aims to train a more succinct memory system without accessing task identity at test time. Our method learns to dynamically prompt (L2P) a pre-trained model to learn tasks sequentially under different task transitions. In our proposed framework, prompts are small learnable parameters, which are maintained in a memory space. 
The objective is to optimize prompts to instruct the model prediction and explicitly manage task-invariant and task-specific knowledge while maintaining model plasticity. We conduct comprehensive experiments under popular image classification benchmarks with different challenging continual learning settings, where L2P consistently outperforms prior state-ofthe-art methods. Surprisingly, L2P achieves competitive results against rehearsal-based methods even without a rehearsal buffer and is directly applicable to challenging taskagnostic continual learning. Source code is available at https://github.com/google- research/l2p.}, + language = {en}, + urldate = {2023-11-24}, + publisher = {arXiv}, + author = {Wang, Zifeng and Zhang, Zizhao and Lee, Chen-Yu and Zhang, Han and Sun, Ruoxi and Ren, Xiaoqi and Su, Guolong and Perot, Vincent and Dy, Jennifer and Pfister, Tomas}, + month = mar, + year = {2022}, + note = {arXiv:2112.08654 [cs]}, +} + +@misc{radford_learning_2021, + title = {Learning {Transferable} {Visual} {Models} {From} {Natural} {Language} {Supervision}}, + url = {http://arxiv.org/abs/2103.00020}, + abstract = {State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.}, + language = {en}, + urldate = {2023-11-27}, + publisher = {arXiv}, + author = {Radford, Alec and Kim, Jong Wook and Hallacy, Chris and Ramesh, Aditya and Goh, Gabriel and Agarwal, Sandhini and Sastry, Girish and Askell, Amanda and Mishkin, Pamela and Clark, Jack and Krueger, Gretchen and Sutskever, Ilya}, + month = feb, + year = {2021}, + note = {arXiv:2103.00020 [cs]}, + keywords = {Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition}, +} + + +@inproceedings{yu_scale_2023, + address = {Vancouver, BC, Canada}, + title = {SCALE: Online Self-Supervised Lifelong Learning without Prior Knowledge}, + isbn = {9798350302493}, + shorttitle = {{SCALE}}, + url = {https://ieeexplore.ieee.org/document/10208356/}, + doi = {10.1109/CVPRW59228.2023.00247}, + abstract = {Unsupervised lifelong learning refers to the ability to learn over time while memorizing previous patterns without supervision. 
Although great progress has been made in this direction, existing work often assumes strong prior knowledge about the incoming data (e.g., knowing the class boundaries), which can be impossible to obtain in complex and unpredictable environments. In this paper, motivated by real-world scenarios, we propose a more practical problem setting called online self-supervised lifelong learning without prior knowledge. The proposed setting is challenging due to the non-iid and single-pass data, the absence of external supervision, and no prior knowledge. To address the challenges, we propose Self-Supervised ContrAstive Lifelong LEarning without Prior Knowledge (SCALE) which can extract and memorize representations on the fly purely from the data continuum. SCALE is designed around three major components: a pseudo-supervised contrastive loss, a self-supervised forgetting loss, and an online memory update for uniform subset selection. All three components are designed to work collaboratively to maximize learning performance. We perform comprehensive experiments of SCALE under iid and four non-iid data streams. The results show that SCALE outperforms the state-of-the-art algorithm in all settings with improvements up to 3.83\%, 2.77\% and 5.86\% in terms of kNN accuracy on CIFAR-10, CIFAR100, and TinyImageNet datasets. We release the implementation at https://github.com/Orienfish/ SCALE.}, + language = {en}, + urldate = {2023-11-28}, + booktitle = {2023 {IEEE}/{CVF} {Conference} on {Computer} {Vision} and {Pattern} {Recognition} {Workshops} ({CVPRW})}, + publisher = {IEEE}, + author = {Yu, Xiaofan and Guo, Yunhui and Gao, Sicun and Rosing, Tajana}, + month = jun, + year = {2023}, + pages = {2484--2495}, + file = {Yu et al. - 2023 - SCALE Online Self-Supervised Lifelong Learning wi.pdf:/Users/eunhaelee/Zotero/storage/8BSEBLCE/Yu et al. - 2023 - SCALE Online Self-Supervised Lifelong Learning wi.pdf:application/pdf}, +} + +@misc{madaan_representational_2022, + title = {Representational Continuity for Unsupervised Continual Learning}, + url = {http://arxiv.org/abs/2110.06976}, + abstract = {Continual learning (CL) aims to learn a sequence of tasks without forgetting the previously acquired knowledge. However, recent CL advances are restricted to supervised continual learning (SCL) scenarios. Consequently, they are not scalable to real-world applications where the data distribution is often biased and unannotated. In this work, we focus on unsupervised continual learning (UCL), where we learn the feature representations on an unlabelled sequence of tasks and show that reliance on annotated data is not necessary for continual learning. We conduct a systematic study analyzing the learned feature representations and show that unsupervised visual representations are surprisingly more robust to catastrophic forgetting, consistently achieve better performance, and generalize better to out-ofdistribution tasks than SCL. Furthermore, we find that UCL achieves a smoother loss landscape through qualitative analysis of the learned representations and learns meaningful feature representations. Additionally, we propose Lifelong Unsupervised Mixup (LUMP), a simple yet effective technique that interpolates between the current task and previous tasks’ instances to alleviate catastrophic forgetting for unsupervised representations. 
We release our code online.}, + language = {en}, + urldate = {2023-11-28}, + publisher = {arXiv}, + author = {Madaan, Divyam and Yoon, Jaehong and Li, Yuanchun and Liu, Yunxin and Hwang, Sung Ju}, + month = apr, + year = {2022}, + note = {arXiv:2110.06976 [cs]}, + keywords = {Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition}, + file = {Madaan et al. - 2022 - Representational Continuity for Unsupervised Conti.pdf:/Users/eunhaelee/Zotero/storage/TEKMCYFV/Madaan et al. - 2022 - Representational Continuity for Unsupervised Conti.pdf:application/pdf}, +} + +@inproceedings{davari_probing_2022, + address = {New Orleans, LA, USA}, + title = {Probing {Representation} {Forgetting} in {Supervised} and {Unsupervised} {Continual} {Learning}}, + isbn = {978-1-66546-946-3}, + url = {https://ieeexplore.ieee.org/document/9879096/}, + doi = {10.1109/CVPR52688.2022.01621}, + language = {en}, + urldate = {2023-11-28}, + booktitle = {2022 {IEEE}/{CVF} {Conference} on {Computer} {Vision} and {Pattern} {Recognition} ({CVPR})}, + publisher = {IEEE}, + author = {Davari, MohammadReza and Asadi, Nader and Mudur, Sudhir and Aljundi, Rahaf and Belilovsky, Eugene}, + month = jun, + year = {2022}, + pages = {16691--16700}, + file = {Davari et al. - 2022 - Probing Representation Forgetting in Supervised an.pdf:/Users/eunhaelee/Zotero/storage/B3Q3R7C3/Davari et al. - 2022 - Probing Representation Forgetting in Supervised an.pdf:application/pdf}, +} + +@misc{chaudhry_tiny_2019, + title = {On Tiny Episodic Memories in Continual Learning}, + url = {http://arxiv.org/abs/1902.10486}, + abstract = {In continual learning (CL), an agent learns from a stream of tasks leveraging prior experience to transfer knowledge to future tasks. It is an ideal framework to decrease the amount of supervision in the existing learning algorithms. But for a successful knowledge transfer, the learner needs to remember how to perform previous tasks. One way to endow the learner the ability to perform tasks seen in the past is to store a small memory, dubbed episodic memory, that stores few examples from previous tasks and then to replay these examples when training for future tasks. In this work, we empirically analyze the effectiveness of a very small episodic memory in a CL setup where each training example is only seen once. Surprisingly, across four rather different supervised learning benchmarks adapted to CL, a very simple baseline, that jointly trains on both examples from the current task as well as examples stored in the episodic memory, significantly outperforms specifically designed CL approaches with and without episodic memory. Interestingly, we find that repetitive training on even tiny memories of past tasks does not harm generalization, on the contrary, it improves it, with gains between 7{\textbackslash}\% and 17{\textbackslash}\% when the memory is populated with a single example per class.}, + language = {en}, + urldate = {2023-11-28}, + publisher = {arXiv}, + author = {Chaudhry, Arslan and Rohrbach, Marcus and Elhoseiny, Mohamed and Ajanthan, Thalaiyasingam and Dokania, Puneet K. and Torr, Philip H. S. and Ranzato, Marc'Aurelio}, + month = jun, + year = {2019}, + note = {arXiv:1902.10486 [cs, stat]}, + keywords = {Computer Science - Machine Learning, Statistics - Machine Learning}, + file = {Chaudhry et al. - 2019 - On Tiny Episodic Memories in Continual Learning.pdf:/Users/eunhaelee/Zotero/storage/Z8S7RXN2/Chaudhry et al. 
- 2019 - On Tiny Episodic Memories in Continual Learning.pdf:application/pdf}, +} + +@article{lopez-paz_gradient_2017, + title = {Gradient Episodic Memory for Continual Learning}, + abstract = {One major obstacle towards AI is the poor ability of models to solve new problems quicker, and without forgetting previously acquired knowledge. To better understand this issue, we study the problem of continual learning, where the model observes, once and one by one, examples concerning a sequence of tasks. First, we propose a set of metrics to evaluate models learning over a continuum of data. These metrics characterize models not only by their test accuracy, but also in terms of their ability to transfer knowledge across tasks. Second, we propose a model for continual learning, called Gradient Episodic Memory (GEM) that alleviates forgetting, while allowing beneficial transfer of knowledge to previous tasks. Our experiments on variants of the MNIST and CIFAR-100 datasets demonstrate the strong performance of GEM when compared to the state-of-the-art.}, + language = {en}, + author = {Lopez-Paz, David and Ranzato, Marc'Aurelio}, + year = {2017}, + file = {Lopez-Paz and Ranzato - Gradient Episodic Memory for Continual Learning.pdf:/Users/eunhaelee/Zotero/storage/VZMBUMG3/Lopez-Paz and Ranzato - Gradient Episodic Memory for Continual Learning.pdf:application/pdf}, +} + +@article{rao_continual_2019, + title = {Continual {Unsupervised} {Representation} {Learning}}, + abstract = {Continual learning aims to improve the ability of modern learning systems to deal with non-stationary distributions, typically by attempting to learn a series of tasks sequentially. Prior art in the field has largely considered supervised or reinforcement learning tasks, and often assumes full knowledge of task labels and boundaries. In this work, we propose an approach (CURL) to tackle a more general problem that we will refer to as unsupervised continual learning. The focus is on learning representations without any knowledge about task identity, and we explore scenarios when there are abrupt changes between tasks, smooth transitions from one task to another, or even when the data is shuffled. The proposed approach performs task inference directly within the model, is able to dynamically expand to capture new concepts over its lifetime, and incorporates additional rehearsal-based techniques to deal with catastrophic forgetting. We demonstrate the efficacy of CURL in an unsupervised learning setting with MNIST and Omniglot, where the lack of labels ensures no information is leaked about the task. Further, we demonstrate strong performance compared to prior art in an i.i.d setting, or when adapting the technique to supervised tasks such as incremental class learning.}, + language = {en}, + author = {Rao, Dushyant and Visin, Francesco and Rusu, Andrei and Pascanu, Razvan and Teh, Yee Whye and Hadsell, Raia}, + year = {2019}, + file = {Rao et al. - Continual Unsupervised Representation Learning.pdf:/Users/eunhaelee/Zotero/storage/V7DHDC4S/Rao et al. - Continual Unsupervised Representation Learning.pdf:application/pdf}, +} + +@misc{zhang_self-supervised_2020, + title = {Self-{Supervised} {Learning} {Aided} {Class}-{Incremental} {Lifelong} {Learning}}, + url = {http://arxiv.org/abs/2006.05882}, + abstract = {Lifelong or continual learning remains to be a challenge for artificial neural network, as it is required to be both stable for preservation of old knowledge and plastic for acquisition of new knowledge. 
It is common to see previous experience get overwritten, which leads to the well-known issue of catastrophic forgetting, especially in the scenario of class-incremental learning (Class-IL). Recently, many lifelong learning methods have been proposed to avoid catastrophic forgetting. However, models which learn without replay of the input data, would encounter another problem which has been ignored, and we refer to it as prior information loss (PIL). In training procedure of Class-IL, as the model has no knowledge about following tasks, it would only extract features necessary for tasks learned so far, whose information is insufficient for joint classification. In this paper, our empirical results on several image datasets show that PIL limits the performance of current state-of-the-art method for Class-IL, the orthogonal weights modification (OWM) algorithm. Furthermore, we propose to combine self-supervised learning, which can provide effective representations without requiring labels, with Class-IL to partly get around this problem. Experiments show superiority of proposed method to OWM, as well as other strong baselines.}, + language = {en}, + urldate = {2023-11-28}, + publisher = {arXiv}, + author = {Zhang, Song and Shen, Gehui and Huang, Jinsong and Deng, Zhi-Hong}, + month = oct, + year = {2020}, + note = {arXiv:2006.05882 [cs, stat]}, + keywords = {Computer Science - Machine Learning, Statistics - Machine Learning}, + file = {Zhang et al. - 2020 - Self-Supervised Learning Aided Class-Incremental L.pdf:/Users/eunhaelee/Zotero/storage/TBB6S9PB/Zhang et al. - 2020 - Self-Supervised Learning Aided Class-Incremental L.pdf:application/pdf}, +} + +@misc{gallardo_self-supervised_2021, + title = {Self-{Supervised} {Training} {Enhances} {Online} {Continual} {Learning}}, + url = {http://arxiv.org/abs/2103.14010}, + abstract = {In continual learning, a system must incrementally learn from a non-stationary data stream without catastrophic forgetting. Recently, multiple methods have been devised for incrementally learning classes on large-scale image classification tasks, such as ImageNet. State-of-the-art continual learning methods use an initial supervised pre-training phase, in which the first 10\% - 50\% of the classes in a dataset are used to learn representations in an offline manner before continual learning of new classes begins. We hypothesize that self-supervised pre-training could yield features that generalize better than supervised learning, especially when the number of samples used for pre-training is small. We test this hypothesis using the self-supervised MoCo-V2, Barlow Twins, and SwAV algorithms. On ImageNet, we find that these methods outperform supervised pretraining considerably for online continual learning, and the gains are larger when fewer samples are available. Our findings are consistent across three online continual learning algorithms. Our best system achieves a 14.95\% relative increase in top-1 accuracy on class incremental ImageNet over the prior state of the art for online continual learning.}, + language = {en}, + urldate = {2023-11-28}, + publisher = {arXiv}, + author = {Gallardo, Jhair and Hayes, Tyler L. and Kanan, Christopher}, + month = oct, + year = {2021}, + note = {arXiv:2103.14010 [cs]}, + keywords = {Computer Science - Computer Vision and Pattern Recognition}, + file = {Gallardo et al. - 2021 - Self-Supervised Training Enhances Online Continual.pdf:/Users/eunhaelee/Zotero/storage/5MZM9P9F/Gallardo et al. 
+} + +@misc{van_de_ven_three_2019, + title = {Three scenarios for continual learning}, + url = {http://arxiv.org/abs/1904.07734}, + abstract = {Standard artificial neural networks suffer from the well-known issue of catastrophic forgetting, making continual or lifelong learning difficult for machine learning. In recent years, numerous methods have been proposed for continual learning, but due to differences in evaluation protocols it is difficult to directly compare their performance. To enable more structured comparisons, we describe three continual learning scenarios based on whether at test time task identity is provided and—in case it is not—whether it must be inferred. Any sequence of well-defined tasks can be performed according to each scenario. Using the split and permuted MNIST task protocols, for each scenario we carry out an extensive comparison of recently proposed continual learning methods. We demonstrate substantial differences between the three scenarios in terms of difficulty and in terms of how efficient different methods are. In particular, when task identity must be inferred (i.e., class incremental learning), we find that regularization-based approaches (e.g., elastic weight consolidation) fail and that replaying representations of previous experiences seems required for solving this scenario.}, + language = {en}, + urldate = {2023-11-28}, + publisher = {arXiv}, + author = {van de Ven, Gido M. and Tolias, Andreas S.}, + month = apr, + year = {2019}, + note = {arXiv:1904.07734 [cs, stat]}, + keywords = {Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Statistics - Machine Learning, Computer Science - Computer Vision and Pattern Recognition}, +} + +@inproceedings{fini_self-supervised_2022, + title = {Self-Supervised Models are Continual Learners}, + isbn = {978-1-66546-946-3}, + url = {https://ieeexplore.ieee.org/document/9878593/}, + doi = {10.1109/CVPR52688.2022.00940}, + abstract = {Self-supervised models have been shown to produce comparable or better visual representations than their supervised counterparts when trained offline on unlabeled data at scale. However, their efficacy is catastrophically reduced in a Continual Learning (CL) scenario where data is presented to the model sequentially. In this paper, we show that self-supervised loss functions can be seamlessly converted into distillation mechanisms for CL by adding a predictor network that maps the current state of the representations to their past state. This enables us to devise a framework for Continual self-supervised visual representation Learning that (i) significantly improves the quality of the learned representations, (ii) is compatible with several state-of-the-art self-supervised objectives, and (iii) needs little to no hyperparameter tuning. We demonstrate the effectiveness of our approach empirically by training six popular self-supervised models in various CL settings. Code: github.com/DonkeyShot21/cassle.}, + language = {en}, + urldate = {2023-12-02}, + booktitle = {2022 {IEEE}/{CVF} {Conference} on {Computer} {Vision} and {Pattern} {Recognition} ({CVPR})}, + publisher = {IEEE}, + author = {Fini, Enrico and Da Costa, Victor G.
Turrisi and Alameda-Pineda, Xavier and Ricci, Elisa and Alahari, Karteek and Mairal, Julien}, + month = jun, + year = {2022}, + pages = {9611--9620}, +} + +@misc{soutif-cormerais_comprehensive_2023, + title = {A Comprehensive Empirical Evaluation on Online Continual Learning}, + url = {http://arxiv.org/abs/2308.10328}, + abstract = {Online continual learning aims to get closer to a live learning experience by learning directly on a stream of data with temporally shifting distribution and by storing a minimum amount of data from that stream. In this empirical evaluation, we evaluate various methods from the literature that tackle online continual learning. More specifically, we focus on the class-incremental setting in the context of image classification, where the learner must learn new classes incrementally from a stream of data. We compare these methods on the Split-CIFAR100 and Split-TinyImagenet benchmarks, and measure their average accuracy, forgetting, stability, and quality of the representations, to evaluate various aspects of the algorithm at the end but also during the whole training period. We find that most methods suffer from stability and underfitting issues. However, the learned representations are comparable to i.i.d. training under the same computational budget. No clear winner emerges from the results and basic experience replay, when properly tuned and implemented, is a very strong baseline. We release our modular and extensible codebase at https://github.com/AlbinSou/ocl\_survey based on the avalanche framework to reproduce our results and encourage future research.}, + language = {en}, + urldate = {2023-12-02}, + publisher = {arXiv}, + author = {Soutif-Cormerais, Albin and Carta, Antonio and Cossu, Andrea and Hurtado, Julio and Hemati, Hamed and Lomonaco, Vincenzo and Van de Weijer, Joost}, + month = sep, + year = {2023}, + note = {arXiv:2308.10328 [cs]}, + keywords = {Computer Science - Machine Learning}, +} + +@misc{caccia_new_2022, + title = {New Insights on Reducing Abrupt Representation Change in Online Continual Learning}, + url = {http://arxiv.org/abs/2104.05025}, + abstract = {In the online continual learning paradigm, agents must learn from a changing distribution while respecting memory and compute constraints. Experience Replay (ER), where a small subset of past data is stored and replayed alongside new data, has emerged as a simple and effective learning strategy. In this work, we focus on the change in representations of observed data that arises when previously unobserved classes appear in the incoming data stream, and new classes must be distinguished from previous ones. We shed new light on this question by showing that applying ER causes the newly added classes’ representations to overlap significantly with the previous classes, leading to highly disruptive parameter updates. Based on this empirical analysis, we propose a new method which mitigates this issue by shielding the learned representations from drastic adaptation to accommodate new classes. We show that using an asymmetric update rule pushes new classes to adapt to the older ones (rather than the reverse), which is more effective especially at task boundaries, where much of the forgetting typically occurs. 
Empirical results show significant gains over strong baselines on standard continual learning benchmarks 1.}, + language = {en}, + urldate = {2023-12-03}, + publisher = {arXiv}, + author = {Caccia, Lucas and Aljundi, Rahaf and Asadi, Nader and Tuytelaars, Tinne and Pineau, Joelle and Belilovsky, Eugene}, + month = may, + year = {2022}, + note = {arXiv:2104.05025 [cs]}, + keywords = {Computer Science - Machine Learning}, +} + +@misc{ghunaim_real-time_2023, + title = {Real-Time Evaluation in Online Continual Learning: A New Hope}, + url = {http://arxiv.org/abs/2302.01047}, + abstract = {Current evaluations of Continual Learning (CL) methods typically assume that there is no constraint on training time and computation. This is an unrealistic assumption for any real-world setting, which motivates us to propose: a practical real-time evaluation of continual learning, in which the stream does not wait for the model to complete training before revealing the next data for predictions. To do this, we evaluate current CL methods with respect to their computational costs. We conduct extensive experiments on CLOC, a large-scale dataset containing 39 million time-stamped images with geolocation labels. We show that a simple baseline outperforms state-of-the-art CL methods under this evaluation, questioning the applicability of existing methods in realistic settings. In addition, we explore various CL components commonly used in the literature, including memory sampling strategies and regularization approaches. We find that all considered methods fail to be competitive against our simple baseline. This surprisingly suggests that the majority of existing CL literature is tailored to a specific class of streams that is not practical. We hope that the evaluation we provide will be the first step towards a paradigm shift to consider the computational cost in the development of online continual learning methods.}, + language = {en}, + urldate = {2023-12-03}, + publisher = {arXiv}, + author = {Ghunaim, Yasir and Bibi, Adel and Alhamoud, Kumail and Alfarra, Motasem and Hammoud, Hasan Abed Al Kader and Prabhu, Ameya and Torr, Philip H. S. and Ghanem, Bernard}, + month = mar, + year = {2023}, + note = {arXiv:2302.01047 [cs]}, + keywords = {Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition}, +} + +@inproceedings{cai_online_2021, + title = {Online Continual Learning with Natural Distribution Shifts: An Empirical Study with Visual Data}, + isbn = {978-1-66542-812-5}, + url = {https://ieeexplore.ieee.org/document/9710740/}, + doi = {10.1109/ICCV48922.2021.00817}, + abstract = {Continual learning is the problem of learning and retaining knowledge through time over multiple tasks and environments. Research has primarily focused on the incremental classification setting, where new tasks/classes are added at discrete time intervals. Such an “offline” setting does not evaluate the ability of agents to learn effectively and efficiently, since an agent can perform multiple learning epochs without any time limitation when a task is added. We argue that “online” continual learning, where data is a single continuous stream without task boundaries, enables evaluating both information retention and online learning efficacy. In online continual learning, each incoming small batch of data is first used for testing and then added to the training set, making the problem truly online. 
Trained models are later evaluated on historical data to assess information retention. We introduce a new benchmark for online continual visual learning that exhibits large scale and natural distribution shifts. Through a large-scale analysis, we identify critical and previously unobserved phenomena of gradient-based optimization in continual learning, and propose effective strategies for improving gradient-based online continual learning with real data. The source code and dataset are available in: https://github.com/ IntelLabs/continuallearning.}, + language = {en}, + urldate = {2023-12-04}, + booktitle = {2021 {IEEE}/{CVF} {International} {Conference} on {Computer} {Vision} ({ICCV})}, + publisher = {IEEE}, + author = {Cai, Zhipeng and Sener, Ozan and Koltun, Vladlen}, + month = oct, + year = {2021}, + pages = {8261--8270}, +} + +@misc{lin_clear_2022, + title = {The CLEAR Benchmark: Continual Learning on Real-World Imagery}, + url = {http://arxiv.org/abs/2201.06289}, + abstract = {Continual learning (CL) is widely regarded as crucial challenge for lifelong AI. However, existing CL benchmarks, e.g. Permuted-MNIST and Split-CIFAR, make use of artificial temporal variation and do not align with or generalize to the realworld. In this paper, we introduce CLEAR, the first continual image classification benchmark dataset with a natural temporal evolution of visual concepts in the real world that spans a decade (2004-2014). We build CLEAR from existing large-scale image collections (YFCC100M) through a novel and scalable low-cost approach to visio-linguistic dataset curation. Our pipeline makes use of pretrained vision-language models (e.g. CLIP) to interactively build labeled datasets, which are further validated with crowd-sourcing to remove errors and even inappropriate images (hidden in original YFCC100M). The major strength of CLEAR over prior CL benchmarks is the smooth temporal evolution of visual concepts with real-world imagery, including both high-quality labeled data along with abundant unlabeled samples per time period for continual semi-supervised learning. We find that a simple unsupervised pre-training step can already boost state-of-the-art CL algorithms that only utilize fully-supervised data. Our analysis also reveals that mainstream CL evaluation protocols that train and test on iid data artificially inflate performance of CL system. To address this, we propose novel "streaming" protocols for CL that always test on the (near) future. Interestingly, streaming protocols (a) can simplify dataset curation since today’s testset can be repurposed for tomorrow’s trainset and (b) can produce more generalizable models with more accurate estimates of performance since all labeled data from each time-period is used for both training and testing (unlike classic iid train-test splits). Project webpage at: https://clear-benchmark.github.io.}, + language = {en}, + urldate = {2023-12-04}, + publisher = {arXiv}, + author = {Lin, Zhiqiu and Shi, Jia and Pathak, Deepak and Ramanan, Deva}, + month = jun, + year = {2022}, + note = {arXiv:2201.06289 [cs]}, + keywords = {Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning}, +} + +@misc{li_learning_2017, + title = {Learning without Forgetting}, + url = {http://arxiv.org/abs/1606.09282}, + abstract = {When building a unified vision system or gradually adding new capabilities to a system, the usual assumption is that training data for all tasks is always available. 
However, as the number of tasks grows, storing and retraining on such data becomes infeasible. A new problem arises where we add new capabilities to a Convolutional Neural Network (CNN), but the training data for its existing capabilities are unavailable. We propose our Learning without Forgetting method, which uses only new task data to train the network while preserving the original capabilities. Our method performs favorably compared to commonly used feature extraction and fine-tuning adaption techniques and performs similarly to multitask learning that uses original task data we assume unavailable. A more surprising observation is that Learning without Forgetting may be able to replace fine-tuning with similar old and new task datasets for improved new task performance.}, + language = {en}, + urldate = {2023-12-05}, + publisher = {arXiv}, + author = {Li, Zhizhong and Hoiem, Derek}, + month = feb, + year = {2017}, + note = {arXiv:1606.09282 [cs, stat]}, + keywords = {Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Statistics - Machine Learning}, +} + +@misc{mai_online_2021, + title = {Online Continual Learning in Image Classification: An Empirical Survey}, + shorttitle = {Online {Continual} {Learning} in {Image} {Classification}}, + url = {http://arxiv.org/abs/2101.10423}, + abstract = {Online continual learning for image classification studies the problem of learning to classify images from an online stream of data and tasks, where tasks may include new classes (class incremental) or data nonstationarity (domain incremental). One of the key challenges of continual learning is to avoid catastrophic forgetting (CF), i.e., forgetting old tasks in the presence of more recent tasks. Over the past few years, a large range of methods and tricks have been introduced to address the continual learning problem, but many have not been fairly and systematically compared under a variety of realistic and practical settings.}, + language = {en}, + urldate = {2023-12-10}, + publisher = {arXiv}, + author = {Mai, Zheda and Li, Ruiwen and Jeong, Jihwan and Quispe, David and Kim, Hyunwoo and Sanner, Scott}, + month = oct, + year = {2021}, + note = {arXiv:2101.10423 [cs]}, + keywords = {Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning}, +} @article{perkonigg2021dynamic, title={Dynamic memory to alleviate catastrophic forgetting in continual learning with medical imaging}, author={Perkonigg, Matthias and Hofmanninger, Johannes and Herold, Christian J and Brink, James A and Pianykh, Oleg and Prosch, Helmut and Langs, Georg}, @@ -59,4 +543,4 @@ @article{elikok2019InteractiveAW year={2019}, volume={abs/1912.05284}, url={https://arxiv.org/abs/1912.05284} -} \ No newline at end of file +} diff --git a/assets/img/2022-12-01-distill-example/10.jpg b/assets/img/2022-12-01-distill-example/10.jpg deleted file mode 100644 index e9958d49..00000000 Binary files a/assets/img/2022-12-01-distill-example/10.jpg and /dev/null differ diff --git a/assets/img/2022-12-01-distill-example/11.jpg b/assets/img/2022-12-01-distill-example/11.jpg deleted file mode 100644 index 9db0d27f..00000000 Binary files a/assets/img/2022-12-01-distill-example/11.jpg and /dev/null differ diff --git a/assets/img/2022-12-01-distill-example/12.jpg b/assets/img/2022-12-01-distill-example/12.jpg deleted file mode 100644 index e343b391..00000000 Binary files a/assets/img/2022-12-01-distill-example/12.jpg and /dev/null differ diff --git 
a/assets/img/2022-12-01-distill-example/7.jpg b/assets/img/2022-12-01-distill-example/7.jpg deleted file mode 100644 index f581ccdd..00000000 Binary files a/assets/img/2022-12-01-distill-example/7.jpg and /dev/null differ diff --git a/assets/img/2022-12-01-distill-example/8.jpg b/assets/img/2022-12-01-distill-example/8.jpg deleted file mode 100644 index 498432a1..00000000 Binary files a/assets/img/2022-12-01-distill-example/8.jpg and /dev/null differ diff --git a/assets/img/2022-12-01-distill-example/9.jpg b/assets/img/2022-12-01-distill-example/9.jpg deleted file mode 100644 index d5b79728..00000000 Binary files a/assets/img/2022-12-01-distill-example/9.jpg and /dev/null differ diff --git a/assets/img/2022-12-01-distill-example/iclr.png b/assets/img/2022-12-01-distill-example/iclr.png deleted file mode 100644 index daf520ee..00000000 Binary files a/assets/img/2022-12-01-distill-example/iclr.png and /dev/null differ diff --git a/assets/img/2022-12-01-distill-example/plotly_demo_1.html b/assets/img/2022-12-01-distill-example/plotly_demo_1.html deleted file mode 100644 index 61b76eb0..00000000 --- a/assets/img/2022-12-01-distill-example/plotly_demo_1.html +++ /dev/null @@ -1,64 +0,0 @@ - - - -
-
- - \ No newline at end of file diff --git a/assets/img/2023-11-09-eunhae-project/AAA_on_off.png b/assets/img/2023-11-09-eunhae-project/AAA_on_off.png new file mode 100644 index 00000000..73fbf0f9 Binary files /dev/null and b/assets/img/2023-11-09-eunhae-project/AAA_on_off.png differ diff --git a/assets/img/2023-11-09-eunhae-project/acc_comparison.png b/assets/img/2023-11-09-eunhae-project/acc_comparison.png new file mode 100644 index 00000000..b2a8f123 Binary files /dev/null and b/assets/img/2023-11-09-eunhae-project/acc_comparison.png differ diff --git a/assets/img/2023-11-09-eunhae-project/forgetting_curves.png b/assets/img/2023-11-09-eunhae-project/forgetting_curves.png new file mode 100644 index 00000000..b5707150 Binary files /dev/null and b/assets/img/2023-11-09-eunhae-project/forgetting_curves.png differ diff --git a/assets/img/2023-11-09-eunhae-project/forgetting_offline.png b/assets/img/2023-11-09-eunhae-project/forgetting_offline.png new file mode 100644 index 00000000..cf946fc9 Binary files /dev/null and b/assets/img/2023-11-09-eunhae-project/forgetting_offline.png differ diff --git a/assets/img/2023-11-09-eunhae-project/forgetting_online.png b/assets/img/2023-11-09-eunhae-project/forgetting_online.png new file mode 100644 index 00000000..9c59df33 Binary files /dev/null and b/assets/img/2023-11-09-eunhae-project/forgetting_online.png differ diff --git a/assets/img/2023-11-09-eunhae-project/resnets_comparison.png b/assets/img/2023-11-09-eunhae-project/resnets_comparison.png new file mode 100644 index 00000000..d7a8618a Binary files /dev/null and b/assets/img/2023-11-09-eunhae-project/resnets_comparison.png differ diff --git a/assets/img/2023-11-09-eunhae-project/saliency_offline.png b/assets/img/2023-11-09-eunhae-project/saliency_offline.png new file mode 100644 index 00000000..0d9f0ac8 Binary files /dev/null and b/assets/img/2023-11-09-eunhae-project/saliency_offline.png differ diff --git a/assets/img/2023-11-09-eunhae-project/saliency_online.png b/assets/img/2023-11-09-eunhae-project/saliency_online.png new file mode 100644 index 00000000..e74990eb Binary files /dev/null and b/assets/img/2023-11-09-eunhae-project/saliency_online.png differ diff --git a/assets/img/2023-11-09-eunhae-project/saliencymap_exp4.png b/assets/img/2023-11-09-eunhae-project/saliencymap_exp4.png new file mode 100644 index 00000000..e4c85e71 Binary files /dev/null and b/assets/img/2023-11-09-eunhae-project/saliencymap_exp4.png differ diff --git a/assets/img/2023-11-09-eunhae-project/stream_acc1.png b/assets/img/2023-11-09-eunhae-project/stream_acc1.png new file mode 100644 index 00000000..94a831b9 Binary files /dev/null and b/assets/img/2023-11-09-eunhae-project/stream_acc1.png differ