Abstract and 1. Introduction

1.1 Syllogisms composition

1.2 Hardness of long compositions

1.3 Hardness of global reasoning

1.4 Our contributions

  2. Results on the local reasoning barrier

    2.1 Defining locality and auto-regressive locality

    2.2 Transformers require low locality: formal results

    2.3 Agnostic scratchpads cannot break the locality

  3. Scratchpads to break the locality

    3.1 Educated scratchpad

    3.2 Inductive Scratchpads

  4. Conclusion, Acknowledgments, and References

A. Further related literature

B. Additional experiments

C. Experiment and implementation details

D. Proof of Theorem 1

E. Comment on Lemma 1

F. Discussion on circuit complexity connections

G. More experiments with ChatGPT

A.1 Reasoning capabilities of Transformers

Reasoning vs. memorization. The performance of large language models and Transformers has been shown to improve as the model size, amount of data, and training time are increased [34, 3]. Furthermore, it has been shown that Transformers are prone to memorizing the training data [35, 36, 37, 38, 39]. Thus, it is natural to ask whether Transformers mostly rely on memorization or whether they combine memorization with a significant degree of reasoning. Reasoning is also essential for solving more challenging tasks such as planning and mathematics. In recent years, the reasoning power of Transformers has been studied on synthetic reasoning tasks such as PVR [6] and LEGO [40], algorithmic tasks such as CLRS [9], and more natural settings such as solving math problems [4, 5]. Note that reasoning tasks often have a combinatorial nature and thus an exponentially large input space. Moreover, this input space may not lie on a low-dimensional manifold, which makes memorization-based approaches ineffective. For example, in arithmetic, essentially all digit combinations are possible and each changes the result significantly, whereas in text only specific combinations of words are valid and, moreover, changing a single word often does not change the meaning of the text drastically. Another important criterion for reasoning is the ability to generalize to out-of-distribution (OOD) samples, in particular length generalization [41, 42, 43, 11, 44]. It has been observed that simple tasks such as addition, multiplication, and parity are generally hard for Transformers, both in the in-distribution setting and, more notably, in the length generalization setting [11, 45].

Positional embeddings. An important component of length generalization is the positional embedding of the model. It has been shown that various forms of relative positional embedding [46, 47], including T5’s relative positional embedding [48], ALiBi [49], and Rotary [50], can yield better length generalization performance than absolute positional embeddings [51, 52, 53]. In particular, the recent work of Kazemnejad et al. [53] evaluates different positional embeddings for decoder-only Transformers on a benchmark of reasoning tasks and shows that relative positional embeddings perform better than absolute ones. Interestingly, it is also shown that decoder-only Transformers with no positional embedding (NoPE) can still encode positional information and in fact have better length generalization performance than absolute/relative positional embeddings on certain tasks. The inductive scratchpad put forward in this paper also reindexes the positions of the tokens for each new state. Using this technique, the inductive scratchpad circumvents the need for learning new positional information as the number of reasoning steps (i.e., the number of states) increases. Nevertheless, the inductive scratchpad can be used with both absolute and relative positional embeddings. We further discuss the relation between the inductive scratchpad and relative positional embeddings in Appendix C.3.
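
As a rough illustration of this reindexing, the sketch below restarts the position counter at every state marker, so each new state sees the same positional pattern regardless of how many states precede it. The function name and the [STATE] marker are ours and purely illustrative; the actual mechanism used in our experiments is described in Appendix C.3.

```python
def reindexed_positions(tokens, state_marker="[STATE]"):
    """Illustrative sketch (not the exact implementation): restart positions
    at every state marker so that state t+1 is indexed like state t."""
    positions, p = [], 0
    for tok in tokens:
        if tok == state_marker:
            p = 0  # new state: restart the position counter
        positions.append(p)
        p += 1
    return positions

toks = ["Q", "1", "0", "1", "[STATE]", "s1", "x", "[STATE]", "s2", "y"]
print(reindexed_positions(toks))  # [0, 1, 2, 3, 0, 1, 2, 0, 1, 2]
```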

Architectural modifications. Several architectural modifications have also been proposed that can potentially enhance the reasoning ability of Transformers on certain tasks. A line of work focuses on adding recurrent structures to Transformers [54, 55, 56]. In particular, Universal Transformers [54] share the weights between Transformer layers and also have a variable depth (similar to adaptive computation time [57]) implemented using a dynamic halting mechanism. Scratchpads in general, and the inductive scratchpad in particular, also have a recurrent flavor, since each time a scratchpad token is generated, the whole input together with the previously generated scratchpad tokens is fed to the Transformer again. The differences are however quite significant, both in terms of supervision (e.g., scratchpads are usually supervised) and the halting mechanism (e.g., generation of the <EOS> token ends the process for scratchpad models). Another relevant architecture is the neural data router [58], where the weights are shared among Transformer layers and where copying gates and a special attention format, geometric attention, are used. The copying gate allows some tokens not to be transformed and instead simply copied at certain Transformer layers, while geometric attention induces a recency bias. The neural data router is shown to generalize to samples with up to three more operations than what is seen during training on simple arithmetic tasks and the ListOps dataset [59]. We note that our inductive scratchpad can also generalize to samples with a higher number of operations/reasoning steps. Note that the neural data router approach requires the model’s depth to be (at least) equal to the maximum number of operations, while our model does not have such a limitation and can thus be trained and tested on samples with larger numbers of operations than the neural data router approach. Moreover, intermediate steps are supervised through the scratchpad mechanism in our approach, while there is no supervision of intermediate steps in the neural data router.

Scratchpad and chain-of-thought. Nye et al. [32] put forward the idea of using scratchpads by showing that training Transformers on the intermediate steps, in addition to the final answer, can improve their performance on tasks such as executing Python code, addition, and evaluating polynomials. Similarly, in chain-of-thought reasoning (CoT) [60], models are shown step-by-step demonstrations of problem-solving (or models generate chains of thought without any examples, as in zero-shot CoT [61], among other variants). It has further been shown that using explicit explanations and reducing ambiguity can be useful, as in the notion of algorithmic prompting [62]. Lanchantin et al. [63] introduced the concept of self-notes, showing that interleaving the intermediate reasoning steps within the question/context, rather than placing them after the question, can boost the performance of scratchpad/CoT on certain tasks. Goyal et al. [64] introduced pause tokens, which act as dummy tokens that provide models with more compute time and processing before the generation of the true next token. On a related note, the iterative prompting method in [65] involves querying LLMs iteratively to optimize question prompts. Further, the Selection-Inference framework (SI) [66] utilizes pre-trained LLMs as general processing modules, alternating between selection and inference to generate interpretable reasoning steps leading to the final result. In this work, we used the concept of autoregressive locality to explain why scratchpads are helpful and how to design them. Further, we introduced inductive scratchpads that are suitable for a broad range of algorithmic reasoning tasks and can potentially generalize to more complex samples than those appearing in the training distribution.

The RASP-L work [68] considers the addition task and obtains length generalization from 35/40 digits to 50 digits (depending on the random seed) by outputting the result in reverse order and by using ‘index hints’, i.e., special tokens that precede the operands’ digits in a specific order. For instance, an example would look like a5b4 + a3b7 = b1a9. Likewise, for parity targets, they use index hints along with a scratchpad and obtain length generalization from 30/35 bits to 50 bits (depending on the seed).
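
To make the format concrete, the following sketch generates index-hinted addition examples in the style of the example above. The function name, the zero-padding of the operands to a common width, and the fixed hint alphabet are our assumptions for illustration; the actual data pipeline of [68] may differ.

```python
import string

def index_hinted_addition(x: int, y: int) -> str:
    """Illustrative reconstruction of the index-hint format: each operand digit is
    prefixed with a positional hint token (a, b, c, ...), and the answer is written
    least-significant digit first, keeping its hints."""
    width = max(len(str(x)), len(str(y)), len(str(x + y)))
    hints = string.ascii_lowercase[:width]  # 'a' marks the most significant position

    def hint_digits(n: int) -> str:
        digits = str(n).zfill(width)
        return "".join(h + d for h, d in zip(hints, digits))

    answer = hint_digits(x + y)
    pairs = [answer[i:i + 2] for i in range(0, len(answer), 2)]
    reversed_answer = "".join(reversed(pairs))  # least significant (hint, digit) first
    return f"{hint_digits(x)} + {hint_digits(y)} = {reversed_answer}"

print(index_hinted_addition(54, 37))  # a5b4 + a3b7 = b1a9
```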

The work of Shen et al. [17] uses a scratchpad with a recursive format for the addition task to generalize to numbers with 12/13 digits when the training data includes numbers with up to 10 digits. Our special case of the inductive scratchpad based on shifting the numbers is similar to their recursive scratchpad in terms of format; however, [17] does not enforce any inductive structure (as achieved with the attention masking for the inductive scratchpad) and hence obtains a much more limited length generalization.

In this work, we proposed inductive scratchpads for the parity and addition tasks that only require the use of random spaces in the question format. For parity, our scratchpad can easily generalize to 50 or 55 bits (depending on the seed) when trained on samples with up to 30 bits. The inductive scratchpad for addition that requires random spaces in the question can also generalize to numbers with 18/20 digits when trained on numbers with up to 10 digits. We have also provided another scratchpad for the addition task, based on shifting the operands at each step, that can generalize to numbers with up to 26 digits when trained on numbers with up to 4 digits, at the cost of a less natural question format.
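
For concreteness, the sketch below generates a parity training string in an inductive-scratchpad-like format: a question with random spaces followed by a sequence of states, each of which is computable from the previous state (and the question) alone. The state tokens used here ([STATE], rest=, parity=, [ANSWER]) are illustrative placeholders, not the exact tokens used in our experiments.

```python
import random

def parity_inductive_example(bits):
    """Hypothetical sketch of an inductive-scratchpad-style string for parity;
    the exact state format used in the paper may differ."""
    # random spaces in the question, as described above
    question = "".join(b + " " * random.randint(0, 2) for b in bits).rstrip()
    states, parity = [], 0
    remaining = list(bits)
    while remaining:
        parity ^= int(remaining.pop(0))  # consume one bit per state
        states.append(f"[STATE] rest={''.join(remaining)} parity={parity}")
    return f"{question} ? " + " ".join(states) + f" [ANSWER] {parity}"

print(parity_inductive_example(list("1011")))
```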

Reasoning over graphs. Numerous reasoning tasks can be thought of as some form of rule-based logical inference or entailment over implicit or explicit graphs, at whose core sit common graph operations such as connectivity checks (as in the cycle task in the present paper). Here, we review a sample of notable related works on logical reasoning over graphs that share the same primary motivation as our current work.

In recent years, several benchmarks have emerged focusing on logical inference over diverse modalities. Within the context of natural language, LogiQA [69], DROP [70], and the CLUTRR benchmark [71] are worth mentioning. These benchmarks assess logical relational reasoning abilities concerning entities and relations expressed in natural language. Another example, from a different modality, is STAR [72], which focuses on situated logical reasoning over video clips. The work by Ontanon et al. [73] introduced the LogicInference dataset for training and benchmarking sequence-to-sequence models on logical inference tasks. The work by Wang et al. [74] introduces NLGraph, a benchmarking set for measuring LLMs’ capabilities in solving graph-based problems, including the graph connectivity task studied in the current paper. GraphQA, presented in the work by Fatemi et al. [75], explores the impact of graph structure on language model prompting, while separating the impact of graph encoding and question prompting on the final performance. The common denominator of these studies’ results is that while language models excel as shallow reasoners, their performance in logical reasoning deteriorates as the number of required reasoning steps increases [66].

On the modeling front, a group of efforts has primarily focused on baking appropriate inductive biases into models for reasoning and logical inference over graphs. The work of [76] proposes architectural innovations for using both the context and external knowledge sources, and [77] does so for learning compositional logical rules. The neurosymbolic approach presented in [78] performs semantic parsing on the natural language input using neural components, transforming the input into discrete structures suitable for symbolic reasoning by logic theorem provers. Graphormer [79] extends the standard Transformer architecture with a graph reasoning inductive bias. The work of [54] combines Transformers with the inductive bias of recurrent models. Other works take the route of generating more data using external sources such as a knowledge graph [80] or of extending pre-training [81]. The work in [74] evaluates LLMs’ graph reasoning capabilities in the presence of various prompting techniques such as scratchpads/CoT and their variants. This study introduces Build-a-Graph Prompting, an instruction-based prompting method, as well as Algorithmic Prompting, which includes references to relevant graph algorithms in the prompts, as two approaches for enhancing LLMs in solving graph tasks in natural language.

A.2 Learning measures and lower-bounds for GD

Some recent literature has studied complexity measures for (S)GD-trained neural networks. In particular: the noise sensitivity [20, 6, 82, 21], which applies mostly to settings with i.i.d. inputs and is known to be loose for strong learning [83, 28]; the statistical query (SQ) dimension [14, 19] and the cross-predictability [12], which are usually defined for a class of targets/distributions rather than a single distribution (in particular, the full parity is efficiently SQ learnable since it is a single function); the NTK alignment [22, 23], which is limited to the NTK framework; the initial alignment (INAL) [24], which is also related to the noise sensitivity, with the advantage of depending on the network initialization at the cost of being a more implicit measure; and the information exponent [25, 26], generative exponent [27], and leap [28], which measure when fully connected neural networks can strongly learn sparse or single/multi-index target functions on i.i.d. or isotropic input distributions.
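
For reference, the cross-predictability of [12] can be roughly stated as follows (our phrasing and normalization; see [12] for the precise definition): it is the expected squared correlation between two independently drawn targets, and a small cross-predictability yields lower bounds for (S)GD-trained networks.

```latex
% Rough statement (our notation); P_F is a distribution over target functions,
% D is the input distribution. See [12] for the precise definition.
\[
\mathrm{CP}(P_F,\mathcal{D})
\;=\;
\mathbb{E}_{F,F'\sim P_F}\!\left[\Big(\mathbb{E}_{X\sim\mathcal{D}}\big[F(X)\,F'(X)\big]\Big)^{2}\right].
\]
```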

We now discuss the proof techniques for Theorem 1. We prove, by means of a statistical query argument, that the Transformer cannot learn to distinguish between 1 cycle and 3 cycles. We find a group of permutations that preserves the input distribution, take the orbit of the target function under these permutations, and show that a random pair of functions in the orbit is nearly uncorrelated. We then argue that this implies that no fixed function is significantly correlated with a random function in the orbit, so the gradients are uninformative and the network fails to learn anything meaningful. The main complication for this argument is that the input distribution is fairly complicated, in contrast to the orbit arguments used in previously discussed works (e.g., for subset parities): the majority of the symbols in the input are fixed, and the rest have nontrivial dependencies. We get around this by dividing the input into blocks and observing that swapping the non-fixed symbols of any two blocks leaves the overall probability distribution unchanged. This argument would largely carry over to other cases with a sufficiently symmetric input distribution and a high-locality target function.
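
The core quantitative step can be summarized as follows; this is a hedged sketch in our own notation, not the exact statement proved in Appendix D. Assume the orbit functions are bounded by 1 in L2 norm and pairwise correlated by at most epsilon.

```latex
% Sketch (our notation). G: permutations preserving the input distribution D;
% f_sigma(x) := f(sigma(x)); assume \|f_\sigma\|_{L^2(\mathcal{D})} \le 1 and
% |E_{x ~ D}[f_sigma(x) f_{sigma'}(x)]| <= epsilon for all sigma != sigma'.
\[
\text{For any query } g \text{ with } \|g\|_{L^2(\mathcal{D})}\le 1:\quad
\mathbb{E}_{\sigma\sim\mathrm{Unif}(G)}
\Big[\big(\mathbb{E}_{x\sim\mathcal{D}}[\,g(x)\,f_{\sigma}(x)\,]\big)^{2}\Big]
\;\le\; \frac{1}{|G|}+\epsilon,
\]
% so every gradient/statistical query is nearly uninformative about which orbit
% element is the true target.
```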

A.3 On RASP and RASP-L

In [84], the authors define a programming language, RASP, to model the types of computations a fixed-size Transformer can perform. This concerns expressivity more than learnability, but to the degree that a Transformer tends to learn a computation representable by a short RASP program whenever one achieving low loss exists, it does have some implications for learning. For instance, [68] observes that Transformers tend to generalize to different input lengths on tasks that can be represented by short RASP programs but not on other tasks. It also seems plausible that, for tasks that can be solved by short RASP programs, a sufficiently small Transformer would have a nontrivial probability of stumbling on the solution. However, we expect that randomly initialized large Transformers would tend to mix together a large number of computations in a chaotic manner, at least until they find some computation on which focusing more improves their performance. So, we do not expect the existence of a short RASP program solving a problem to imply that Transformers would learn to solve it easily.

Also, RASP-L [68] is a variant of RASP that does not contain the full parity function as a short program. Our locality theory predicts that the full parity can be learned by some regular Transformer (in contrast to subset parities), so this marks a nuance between the two predictions. Incidentally, our reason for expecting that some regular Transformer can learn the full parity is that if the Transformer is set to mostly ignore the positional embeddings, then in the first layer it will essentially average all of the inputs together. The correct label is a function of this average, which takes only n + 1 possible values, so it seems likely that with the right setup the model would be able to memorize the appropriate response to each of them.
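
As a small numerical illustration (ours, not a construction from the paper), a head with uniform attention that ignores positions aggregates the mean of the input bits, and the full parity is a deterministic function of that mean, which takes only n + 1 distinct values.

```python
import numpy as np

def parity_from_mean(bits: np.ndarray) -> int:
    """Illustrative sketch: with uniform attention over all n positions, the
    attended value is the mean of the bits; parity is a function of that mean."""
    n = len(bits)
    uniform_attention = np.full(n, 1.0 / n)  # attention weights ignoring positions
    mean = float(uniform_attention @ bits)   # what one head would aggregate
    ones = round(mean * n)                   # the mean determines the count of ones
    return ones % 2                          # ... and hence the parity

x = np.array([1, 0, 1, 1, 0, 1])
assert parity_from_mean(x) == x.sum() % 2
```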

Authors:

(1) Emmanuel Abbe, Apple and EPFL;

(2) Samy Bengio, Apple;

(3) Aryo Lotfi, EPFL;

(4) Colin Sandon, EPFL;

(5) Omid Saremi, Apple.


This paper is available on arXiv under the CC BY 4.0 license.