Abstract and 1 Introduction

2 Related Work

3 Model and 3.1 Associative memories

3.2 Transformer blocks

4 A New Energy Function

4.1 The layered structure

5 Cross-Entropy Loss

6 Empirical Results and 6.1 Empirical evaluation of the radius

6.2 Training GPT-2

6.3 Training Vanilla Transformers

7 Conclusion and Acknowledgments

Appendix A. Deferred Tables

Appendix B. Some Properties of the Energy Functions

Appendix C. Deferred Proofs from Section 5

Appendix D. Transformer Details: Using GPT-2 as an Example

References

4 A New Energy Function

We first introduce a new energy function that does not rely on additional regularization terms. We then adapt this function to the layered transformer blocks using the majorization-minimization technique. For reference, related energy functions for Hopfield networks are listed in Table 1 in Appendix A. In particular, the energy function for the modern continuous Hopfield network (MCHN) (Ramsauer et al., 2020) is
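In the notation of Ramsauer et al. (2020), with stored patterns collected as columns of X = (x_1, …, x_N), state (query) ξ, inverse temperature β > 0, and M = max_i ‖x_i‖, the MCHN energy reads

$$ E_{\mathrm{MCHN}}(\xi) \;=\; -\operatorname{lse}\big(\beta, X^{\top}\xi\big) \;+\; \tfrac{1}{2}\,\xi^{\top}\xi \;+\; \beta^{-1}\log N \;+\; \tfrac{1}{2}M^{2}, \qquad \operatorname{lse}(\beta, z) \;=\; \beta^{-1}\log\sum_{i=1}^{N}\exp(\beta z_i). $$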

Notice that the negative LogSumExp term was adapted from Demircigil et al. (2017). In the continuous domain, however, the negative LogSumExp function is not convex, which makes it a less suitable candidate for an energy function on its own. The MCHN energy therefore adds regularization terms to obtain a convex energy function; these terms involve the squared norm of the state, the largest norm among the stored patterns, and the number of patterns.

Instead of designing different regularization terms, we define a new energy function through an auxiliary function g(x).
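Consistent with the description below, a natural reading (stated here as a reconstruction, with ρ_1, …, ρ_N denoting the stored patterns) is the distance from x to the nearest stored pattern,

$$ g(x) \;=\; \min_{1 \le \mu \le N} \lVert x - \rho_{\mu} \rVert . $$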

We consider a new energy function E(x) that also takes the form of a LogSumExp; it is worth noting that the softmax function is the gradient of the LogSumExp function. By aggregating the negative distances between x and the stored patterns, the function assigns smaller values to points near the patterns. Our proposed energy function is
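Under the same assumed notation, a form matching this description, a LogSumExp of negative distances that smooths the minimum appearing in g(x), is

$$ E(x) \;=\; -\frac{1}{\beta}\,\log\sum_{\mu=1}^{N}\exp\!\big(-\beta\,\lVert x - \rho_{\mu}\rVert\big), $$

which satisfies g(x) − β^{-1} log N ≤ E(x) ≤ g(x); the exact scaling and choice of distance follow the original paper.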

By replacing the dot product in the MCHN energy with a distance metric, E(x) achieves the same goal without additional regularization. As shown in Figures 1a and 1b, the negative LogSumExp that extends Demircigil et al. (2017) to the real domain is not convex, which is why the MCHN applies regularization terms. Figures 1c and 1d show that the landscape of the proposed energy resembles that of the MCHN energy. Ramsauer et al. (2020) show that the MCHN energy induces stationary points near the stored patterns. Here, the proposed function E(x) serves as a smooth surrogate of the desired function g(x) in Eq. (3) and therefore also exhibits this retrieval ability, as the sketch below illustrates.
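The following minimal sketch illustrates the surrogate behavior numerically, assuming the reconstructed forms above; the Euclidean distance, the inverse-temperature parameter beta, and the pattern notation rho are assumptions for illustration rather than the paper's exact choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stored patterns rho_1, ..., rho_N (rows) and a query near one of them.
N, d = 8, 16
rho = rng.normal(size=(N, d))
x = rho[3] + 0.05 * rng.normal(size=d)

def g(x, rho):
    """Auxiliary function: distance from x to the nearest stored pattern."""
    return np.linalg.norm(rho - x, axis=1).min()

def energy(x, rho, beta):
    """Assumed energy form: -1/beta * log sum_mu exp(-beta * ||x - rho_mu||)."""
    dists = np.linalg.norm(rho - x, axis=1)
    a = -beta * dists
    m = a.max()  # stable LogSumExp
    return -(m + np.log(np.exp(a - m).sum())) / beta

for beta in (1.0, 4.0, 16.0, 64.0):
    print(f"beta={beta:5.1f}  E(x)={energy(x, rho, beta): .4f}  g(x)={g(x, rho):.4f}")
# E(x) sits below g(x) by at most log(N)/beta, so it converges to g(x) as beta grows.
```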

Since the proposed energy and the MCHN energy both approximate the search for the nearest pattern (the desired stationary point), by Theorem 4 in Ramsauer et al. (2020), the probability density associated with each transformer layer, corresponding to retrieval, is
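Presumably this density takes a Boltzmann–Gibbs form (an assumed reading, with E_t denoting the energy associated with layer t),

$$ p_{t}(x) \;\propto\; \exp\!\big(-E_{t}(x)\big), $$

so that points near the stored patterns, where the energy is low, receive most of the probability mass.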

Authors:

(1) Xueyan Niu, Theory Laboratory, Central Research Institute, 2012 Laboratories, Huawei Technologies Co., Ltd.;

(2) Bo Bai ([email protected]);

(3) Lei Deng ([email protected]);

(4) Wei Han ([email protected]).


This paper is available on arXiv under the CC BY-NC-ND 4.0 DEED license.