8 Conclusion
In this work, we sought to improve our understanding of catastrophic forgetting in ANNs by revisiting two fundamental questions: (1) how can we quantify catastrophic forgetting, and (2) how do the design choices we make when building learning systems affect the amount of catastrophic forgetting that occurs during training? To answer these questions we explored four metrics for measuring catastrophic forgetting: retention, relearning, activation overlap, and pairwise interference. We applied these four metrics to four testbeds from the reinforcement learning and supervised learning literature and showed that (1) catastrophic forgetting is not a phenomenon that can be effectively described by a single metric, or even a single family of metrics, and (2) the choice of modern gradient-based optimizer used to train an ANN has a substantial effect on the amount of catastrophic forgetting. Our results suggest that users should be wary of the optimization algorithm they pair with their ANN in problems susceptible to catastrophic forgetting; this concern applies especially to Adam and much less to SGD. When in doubt, we recommend simply using SGD without any kind of momentum, and we advise against using Adam. Our results also suggest that studies of catastrophic forgetting should consider many different metrics. We recommend using at least one retention-based metric and one relearning-based metric; if the testbed prohibits those metrics, we recommend pairwise interference. Regardless of the metric used, research into catastrophic forgetting, like much research in AI, must be cognisant that different testbeds are likely to favor different algorithms, and results on a single testbed are at high risk of not generalizing.
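To make two of these metric families concrete, the sketch below shows a simplified, first-order version of a retention-style measure and a pairwise-interference-style measure. The helper names (`retention`, `pairwise_interference`) are illustrative, not the exact definitions used in our experiments: retention is taken as the fraction of task-A performance that survives training on task B, and pairwise interference is approximated by the negated alignment of per-sample gradients, so that positive values indicate that an update on one sample is expected to hurt the other.

```python
import numpy as np

def retention(acc_after_task_a: float, acc_after_task_b: float) -> float:
    """Fraction of task-A performance retained after subsequently
    training on task B (1.0 = no forgetting, 0.0 = total forgetting)."""
    return acc_after_task_b / acc_after_task_a

def pairwise_interference(grad_i: np.ndarray, grad_j: np.ndarray) -> float:
    """First-order interference proxy: the negated dot product of the
    loss gradients for two samples. Positive values mean a gradient
    step on sample i is expected to increase the loss on sample j;
    negative values indicate positive transfer."""
    return -float(np.dot(grad_i, grad_j))

# Toy usage with made-up numbers:
print(retention(0.90, 0.45))  # half of task-A accuracy survived -> 0.5
print(pairwise_interference(np.array([1.0, -2.0]),
                            np.array([-1.0, 0.5])))  # misaligned -> 2.0
```

Averaging the interference proxy over many sample pairs drawn during training gives a single scalar that can be tracked alongside retention and relearning curves.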
9 Future Work
While we used various testbeds and metrics to quantify catastrophic forgetting, we applied them only to ask whether one particular set of mechanisms affects catastrophic forgetting. Moreover, no attempt was made to use the testbeds to examine the effect of mechanisms specifically designed to mitigate catastrophic forgetting. We chose not to focus on such methods because Kemker et al. (2018) already showed that their effectiveness varies substantially as both the testbed and the metric used to quantify catastrophic forgetting change. Kemker et al., however, considered only a retention metric, so some value remains in revisiting these methods under the broader set of metrics we explore here. In this work, we also considered only shallow ANNs. Contemporary deep learning frequently uses networks with many hidden layers, sometimes hundreds. While Ghiassian, Rafiee, Lo, et al. (2020) showed that depth might not be the most impactful factor in catastrophic forgetting (p. 444), how deeper networks affect the nature of catastrophic forgetting remains largely unexplored and warrants further research.
One final opportunity for future research lies in the fact that, while we explored several testbeds and multiple metrics for quantifying catastrophic forgetting, many other, more complicated testbeds exist, as do several still-unexplored metrics that also quantify catastrophic forgetting (e.g., Fedus et al., 2020). Whether the results of this work extend to significantly more complicated testbeds remains an important open question, as does whether these results carry over to the control case of the reinforcement learning problem. Notably, how exactly forgetting should be measured in the control case itself remains an open problem.
Acknowledgements
The authors would like to thank Patrick Pilarski and Mark Ring for their comments on an earlier version of this work. The authors would also like to thank Compute Canada for generously providing the computational resources needed to carry out the experiments contained herein. This work was partially funded by the European Research Council Advanced Grant AlgoRNN to Jürgen Schmidhuber (ERC no: 742870).
References
Barnes, J. M., & Underwood, B. J. (1959). “Fate” of first-list associations in transfer theory. Journal of Experimental Psychology, 58(2), 97–105. https://doi.org/10.1037/h0047507
Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., & Zaremba, W. (2016). OpenAI gym. arXiv. https://arxiv.org/abs/1606.01540
Chen, Z., & Liu, B. (2018). Lifelong machine learning (2nd ed.). Morgan & Claypool Publishers. https://doi.org/gd8g2p
DeJong, G., & Spong, M. W. (1994). Swinging up the acrobot: An example of intelligent control. Proceedings of the 1994 American Control Conference, 2, 2158–2162. https://doi.org/10.1109/ACC.1994.752458
Ebbinghaus, H. (1913). Memory: A contribution to experimental psychology (H. A. Ruger & C. E. Bussenius, Trans.). Teachers College Press. (Original work published 1885).
Farquhar, S., & Gal, Y. (2018). Towards robust evaluations of continual learning. arXiv. https://arxiv.org/abs/1805.09733
Fedus, W., Ghosh, D., Martin, J. D., Bellemare, M. G., Bengio, Y., & Larochelle, H. (2020). On catastrophic interference in Atari 2600 games. arXiv. https://arxiv.org/abs/2002.12499
French, R. M. (1991). Using semi-distributed representations to overcome catastrophic forgetting in connectionist networks. Proceedings of the Thirteenth Annual Conference of the Cognitive Science Society, 173–178. https://cognitivesciencesociety.org/wp-content/uploads/2019/01/cogsci_13.pdf
Gatys, L. A., Ecker, A. S., & Bethge, M. (2016). Image style transfer using convolutional neural networks. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, 2414–2423. https://doi.org/10.1109/CVPR.2016.265
Geramifard, A., Dann, C., Klein, R. H., Dabney, W., & How, J. P. (2015). RLPy: A value-function-based reinforcement learning framework for education and research. Journal of Machine Learning Research, 16(46), 1573–1578. http://jmlr.org/papers/v16/geramifard15a.html
Ghiassian, S., Rafiee, B., Lo, Y. L., & White, A. (2020). Improving performance in reinforcement learning by breaking generalization in neural networks. Proceedings of the 19th International Conference on Autonomous Agents and Multiagent Systems, 438–446. http://ifaamas.org/Proceedings/aamas2020/pdfs/p438.pdf
Ghiassian, S., Rafiee, B., & Sutton, R. S. (2017). A first empirical study of emphatic temporal difference learning. arXiv. https://arxiv.org/abs/1705.04185
Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, 9, 249–256. http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf
Glorot, X., Bordes, A., & Bengio, Y. (2011). Deep sparse rectifier neural networks. Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, 15, 315–323. http://proceedings.mlr.press/v15/glorot11a/glorot11a.pdf
Goodfellow, I. J., Mirza, M., Xiao, D., Courville, A., & Bengio, Y. (2013). An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv. https://arxiv.org/abs/1312.6211
He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. Proceedings of the 2015 IEEE International Conference on Computer Vision, 1026–1034. https://doi.org/10.1109/ICCV.2015.123
Hetherington, P. A., & Seidenberg, M. S. (1989). Is there ‘catastrophic interference’ in connectionist networks? Proceedings of the Eleventh Annual Conference of the Cognitive Science Society, 26–33. https://cognitivesciencesociety.org/wp-content/uploads/2019/01/cogsci_11.pdf
Hinton, G. E., Srivastava, N., & Swersky, K. (n.d.). RMSProp: Divide the gradient by a running average of its recent magnitude [PDF slides]. https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
Jarrett, K., Kavukcuoglu, K., Ranzato, M., & LeCun, Y. (2009). What is the best multi-stage architecture for object recognition? Proceedings of the 2009 IEEE International Conference on Computer Vision, 2146–2153. https://doi.org/10.1109/ICCV.2009.5459469
Kemker, R., McClure, M., Abitino, A., Hayes, T. L., & Kanan, C. (2018). Measuring catastrophic forgetting in neural networks. Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 3390–3398. https://aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16410
Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv. https://arxiv.org/abs/1412.6980
Kirkpatrick, J., Pascanu, R., Rabinowitz, N. C., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., Hassabis, D., Clopath, C., Kumaran, D., & Hadsell, R. (2017). Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13), 3521–3526. https://doi.org/10.1073/pnas.1611835114
Kornblith, S., Norouzi, M., Lee, H., & Hinton, G. E. (2019). Similarity of neural network representations revisited. Proceedings of the 36th International Conference on Machine Learning, 97, 3519–3529. http://proceedings.mlr.press/v97/kornblith19a/kornblith19a.pdf
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324. https://doi.org/10.1109/5.726791
Lee, S.-W., Kim, J.-H., Jun, J., Ha, J.-W., & Zhang, B.-T. (2017). Overcoming catastrophic forgetting by incremental moment matching. Advances in Neural Information Processing Systems, 30, 4652–4662. http://papers.nips.cc/paper/7051-overcoming-catastrophic-forgetting-by-incremental-moment-matching.pdf
Liu, V. (2019). Sparse representation neural networks for online reinforcement learning [Master's thesis, University of Alberta]. https://era.library.ualberta.ca/items/b4cd1257-69ae-4349-9de6-3feed2648eb1
Masse, N. Y., Grant, G. D., & Freedman, D. J. (2018). Alleviating catastrophic forgetting using context-dependent gating and synaptic stabilization. Proceedings of the National Academy of Sciences, 115(44), E10467–E10475. https://doi.org/10.1073/pnas.1803839115
McCloskey, M., & Cohen, N. J. (1989). Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of Learning and Motivation, 24, 109–165. https://doi.org/10.1016/S0079-7421(08)60536-8
Mirzadeh, S.-I., Farajtabar, M., Pascanu, R., & Ghasemzadeh, H. (2020). Understanding the role of training regimes in continual learning. Advances in Neural Information Processing Systems, 33. https://papers.nips.cc/paper/2020/file/518a38cc9a0173d0b2dc088166981cf8-Paper.pdf
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M. A., Fidjeland, A., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., & Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533. https://doi.org/10.1038/nature14236
Moore, A. W. (1990). Efficient memory-based learning for robot control [Doctoral dissertation, University of Cambridge]. https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-209.pdf
Nair, V., & Hinton, G. E. (2010). Rectified linear units improve restricted Boltzmann machines. Proceedings of the 27th International Conference on Machine Learning, 807–814. http://www.icml2010.org/papers/432.pdf
Qian, N. (1999). On the momentum term in gradient descent learning algorithms. Neural Networks, 12(1), 145–151. https://doi.org/10.1016/S0893-6080(98)00116-6
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners (tech. rep.). OpenAI. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
Ratcliff, R. (1990). Connectionist models of recognition memory: Constraints imposed by learning and forgetting functions. Psychological Review, 97(2), 285–308. https://doi.org/10.1037/0033-295X.97.2.285
Riemer, M., Cases, I., Ajemian, R., Liu, M., Rish, I., Tu, Y., & Tesauro, G. (2019). Learning to learn without forgetting by maximizing transfer and minimizing interference. Proceedings of the International Conference on Learning Representations. https://openreview.net/forum?id=B1gTShAct7
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by backpropagating errors. Nature, 323(6088), 533–536. https://doi.org/10.1038/323533a0
Senior, A. W., Evans, R., Jumper, J., Kirkpatrick, J., Sifre, L., Green, T., Qin, C., Žídek, A., Nelson, A. W. R., Bridgland, A., Penedones, H., Petersen, S., Simonyan, K., Crossan, S., Kohli, P., Jones, D. T., Silver, D., Kavukcuoglu, K., & Hassabis, D. (2020). Improved protein structure prediction using potentials from deep learning. Nature, 577(7792), 706–710. https://doi.org/10.1038/s41586-019-1923-7
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T. P., Leach, M., Kavukcuoglu, K., Graepel, T., & Hassabis, D. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484–489. https://doi.org/10.1038/nature16961
Sodhani, S., Chandar, S., & Bengio, Y. (2020). Toward training recurrent neural networks for lifelong learning. Neural Computation, 32(1), 1–35. https://doi.org/10.1162/neco_a_01246
Spong, M. W., & Vidyasagar, M. (1989). Robot dynamics and control. Wiley.
Sutton, R. S. (1995). Generalization in reinforcement learning: Successful examples using sparse coarse coding. Advances in Neural Information Processing Systems, 8, 1038–1044. http://papers.nips.cc/paper/1109-generalization-in-reinforcement-learning-successful-examples-using-sparse-coarse-coding.pdf
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction (1st ed.). MIT Press.
Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction (2nd ed.). MIT Press.
Taigman, Y., Yang, M., Ranzato, M., & Wolf, L. (2014). Deepface: Closing the gap to human-level performance in face verification. Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, 1701–1708. https://doi.org/10.1109/CVPR.2014.220
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30, 5998–6008. http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
Xiao, H., Rasul, K., & Vollgraf, R. (2017). Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv. http://arxiv.org/abs/1708.07747
Zenke, F., Poole, B., & Ganguli, S. (2017). Continual learning through synaptic intelligence. Proceedings of the 34th International Conference on Machine Learning, 70, 3987–3995. http://proceedings.mlr.press/v70/zenke17a/zenke17a.pdf