The curse of memory refers to the difficulty of learning long-term memory with recurrent models. Although recurrent models benefit from low inference costs, this curse restricts their effectiveness on tasks involving long sequences. In this paper, we study the curse-of-memory phenomenon for nonlinear RNNs. It is shown that simply adding nonlinear activations such as hardtanh and tanh does not relax the curse. Using a stable reparameterisation, such as the exp parameterisation or the softplus parameterisation, relaxes the curse of memory and achieves stable approximation of long-term memory.
Curse of memory in linear RNNs
Let the (continuous-time) linear RNN be

$$\frac{\mathrm{d}h_t}{\mathrm{d}t} = W h_t + U x_t, \qquad \hat{y}_t = c^\top h_t.$$

Its output can be written as $\hat{y}_t = \int_0^{\infty} \rho(s)\, x_{t-s}\, \mathrm{d}s$ with memory function $\rho(s) = c^\top e^{W s} U$; for a stable $W$ (eigenvalues with negative real part), $\rho$ decays exponentially.
*Figure: exponential decaying memory can be stably approximated, while polynomial decaying memory cannot be stably approximated.*
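As a quick numerical illustration (not code from this repo; the dimensions, random initialisation, and use of numpy/scipy are arbitrary choices), one can compute the memory function of a randomly initialised stable linear RNN and observe its exponential decay:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
d = 16  # hidden dimension (arbitrary choice)

# Random recurrent matrix, shifted so that all eigenvalues have negative real part (stable).
W = rng.normal(size=(d, d)) / np.sqrt(d)
W -= (np.max(np.linalg.eigvals(W).real) + 0.5) * np.eye(d)

U = rng.normal(size=(d, 1))
c = rng.normal(size=(d, 1))

# Memory function rho(t) = c^T exp(Wt) U: influence of an input at time 0 on the output at time t.
ts = np.linspace(0.0, 20.0, 201)
rho = np.array([(c.T @ expm(W * t) @ U).item() for t in ts])

# Exponential decay: log|rho| falls off roughly linearly in t.
print(np.log(np.abs(rho[::50]) + 1e-300))
```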
Curse of memory in nonlinear RNNs
Next, we stay with the polynomial decaying memory task. We show that commonly-used activations (hardtanh and tanh) do not by themselves relax the difficulty of approximating polynomial decaying memory.
*Figure: hardtanh and tanh activations on the polynomial decaying memory task.*
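For concreteness, here is a minimal sketch of such a target (our own illustration rather than the repo's data pipeline; the kernel exponent, sequence length, and batch size are arbitrary): the label is a causal convolution of the input with a polynomially decaying kernel, so predicting it requires retaining information over long horizons.

```python
import numpy as np

rng = np.random.default_rng(0)
T, batch = 128, 32                 # sequence length and batch size (illustrative)

# Polynomially decaying memory kernel rho(s) ~ (1 + s)^(-1.1).
s = np.arange(T)
rho = (1.0 + s) ** -1.1

x = rng.normal(size=(batch, T))

# Target y_t = sum_{s <= t} rho(s) * x_{t-s}: a linear functional with long (polynomial) memory.
y = np.stack([np.convolve(xi, rho)[:T] for xi in x])

print(x.shape, y.shape)            # (32, 128) (32, 128)
```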
Proper parameterisation enables stable approximation for long memory
We call parameterisations that enable stable approximation of long-term memory *stable parameterisations*.
Parameterisation | Exponential decay | Polynomial decay |
---|---|---|
Diagonal RNN | Stable | Unstable |
Vanilla RNN | Stable | Unstable |
State-space model | Stable | Unstable |
Linear Recurrent Unit | Stable | Unstable |
Stable Reparameterisation | Stable | Stable |
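As a code-level sketch of the last row of the table (an illustration only, not the repo's implementation; the module name, diagonal recurrence, and Euler discretisation are assumptions), the recurrent eigenvalues are trained through a map that lands in the stable region for every parameter value, for example $\lambda = -e^{m}$ (exp parameterisation) or $\lambda = -\mathrm{softplus}(m)$ (softplus parameterisation):

```python
import torch
import torch.nn.functional as F
from torch import nn


class StableDiagonalRNN(nn.Module):
    """Diagonal linear recurrence whose eigenvalues are kept stable by reparameterisation.

    Sketch only: lambda = -exp(m) ("exp") or lambda = -softplus(m) ("softplus"),
    so Re(lambda) < 0 for every value of the trainable parameter m.
    """

    def __init__(self, d_hidden: int, param: str = "exp", dt: float = 0.1):
        super().__init__()
        self.m = nn.Parameter(torch.randn(d_hidden))                   # free (unconstrained) parameter
        self.u = nn.Parameter(torch.randn(d_hidden))                   # input weights
        self.c = nn.Parameter(torch.randn(d_hidden) / d_hidden**0.5)   # readout
        self.param, self.dt = param, dt

    def eigenvalues(self) -> torch.Tensor:
        if self.param == "exp":
            return -torch.exp(self.m)      # exp parameterisation
        return -F.softplus(self.m)         # softplus parameterisation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time). Explicit-Euler step of dh/dt = lam * h + u * x_t.
        lam = self.eigenvalues()
        h = x.new_zeros(x.shape[0], lam.shape[0])
        outputs = []
        for t in range(x.shape[1]):
            h = h + self.dt * (lam * h + self.u * x[:, t, None])
            outputs.append(h @ self.c)
        return torch.stack(outputs, dim=1)  # (batch, time)


# Usage sketch: y = StableDiagonalRNN(64, param="softplus")(torch.randn(8, 128))
```

Because the reparameterisation always maps onto the stable region, gradient updates cannot push the recurrence into instability.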
*Figure: vanilla RNN vs. stable parameterisation.*
Vanilla RNNs place the nonlinear activation inside the recurrence.

Discrete-time case: $h_{k+1} = \sigma(W h_k + U x_k + b), \quad \hat{y}_k = c^\top h_k$.

Continuous-time case: $\frac{\mathrm{d}h_t}{\mathrm{d}t} = \sigma(W h_t + U x_t + b), \quad \hat{y}_t = c^\top h_t$.

State-space models refer to linear RNNs with layer-wise nonlinear activations: the recurrence stays linear and the nonlinearity is applied to the readout.

Discrete-time case: $h_{k+1} = W h_k + U x_k + b, \quad \hat{y}_k = \sigma(c^\top h_k)$.

Continuous-time case: $\frac{\mathrm{d}h_t}{\mathrm{d}t} = W h_t + U x_t + b, \quad \hat{y}_t = \sigma(c^\top h_t)$.
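To make the distinction concrete, here is a minimal PyTorch-style sketch (our own illustration, not code from this repo) of a single discrete-time update for each model:

```python
import torch


def rnn_step(h, x, W, U, b):
    # Vanilla RNN: the nonlinearity sits inside the recurrence,
    # h_{k+1} = sigma(W h_k + U x_k + b).
    return torch.tanh(h @ W.T + x @ U.T + b)


def ssm_step(h, x, W, U, b):
    # State-space model: the recurrence stays linear,
    # h_{k+1} = W h_k + U x_k + b.
    return h @ W.T + x @ U.T + b


def ssm_readout(h, c):
    # The nonlinearity is applied layer-wise, outside the recurrence.
    return torch.tanh(h @ c)
```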
```bash
# clone project
git clone https://github.com/radarFudan/Curse-of-memory
cd Curse-of-memory

conda create -n CoM python=3.10
conda activate CoM

pip install -r requirements.txt
```
```bibtex
@inproceedings{wang2024inverse,
  title={Inverse Approximation Theory for Nonlinear Recurrent Neural Networks},
  author={Shida Wang and Zhong Li and Qianxiao Li},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024},
  url={https://openreview.net/forum?id=yC2waD70Vj}
}

@inproceedings{li2021on,
  title={On the Curse of Memory in Recurrent Neural Networks: Approximation and Optimization Analysis},
  author={Zhong Li and Jiequn Han and Weinan E and Qianxiao Li},
  booktitle={International Conference on Learning Representations},
  year={2021},
  url={https://openreview.net/forum?id=8Sqhl-nF50}
}

@inproceedings{wang2023statespace,
  title={State-space models with layer-wise nonlinearity are universal approximators with exponential decaying memory},
  author={Shida Wang and Beichen Xue},
  booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
  year={2023},
  url={https://openreview.net/forum?id=i0OmcF14Kf}
}

@misc{wang2023stablessm,
  title={StableSSM: Alleviating the Curse of Memory in State-space Models through Stable Reparameterization},
  author={Shida Wang and Qianxiao Li},
  year={2023},
  eprint={2311.14495},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}

@article{JML-2-1,
  author={Haotian Jiang and Qianxiao Li and Zhong Li and Shida Wang},
  title={A Brief Survey on the Approximation Theory for Sequence Modelling},
  journal={Journal of Machine Learning},
  year={2023},
  volume={2},
  number={1},
  pages={1--30},
  abstract={We survey current developments in the approximation theory of sequence modelling in machine learning. Particular emphasis is placed on classifying existing results for various model architectures through the lens of classical approximation paradigms, and the insights one can gain from these results. We also outline some future research directions towards building a theory of sequence modelling.},
  issn={2790-2048},
  doi={https://doi.org/10.4208/jml.221221},
  url={http://global-sci.org/intro/article_detail/jml/21511.html}
}
```