
Authors:

(1) Suzanna Sia, Johns Hopkins University;

(2) David Mueller;

(3) Kevin Duh.

Table of Links

5. Inference Efficiency

Speeding up transformer inference is of great interest to the community (Fournier et al., 2023). We highlight the potential for faster inference as a direct consequence of identifying where task recognition occurs in the model and of the redundancy in self-attention processing. Our results indicate that we can achieve significant speedups in inference by removing the processing of context tokens altogether after a certain point in the model, with little to no impact on downstream performance.

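Then, for a model with nℓ layers, the amount of processing in terms of speed and memory saved is approximately (nℓ − r)/nℓ × k/(k + 1). Here r is taken to be the layer after which context tokens are no longer processed, and k the number of prompt examples, so that k/(k + 1) approximates the share of the input occupied by the examples.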

Using the example of LLAMA7B (32 layers), we see from Figure 2 that the model is very close to its ceiling score after processing the examples at layer 14 (ℓ = 14). If we no longer need to process examples after ℓ = 14, under a prompt size of 5 the savings are approximately 45%.
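The estimate above can be reproduced in a few lines of Python. This is a minimal sketch, assuming the formula is read as (nℓ − r)/nℓ × k/(k + 1), that r corresponds to the layer ℓ = 14 after which examples are no longer processed, and that k is the prompt size. The function name `estimated_savings` is illustrative and does not come from the paper.

```python
def estimated_savings(n_layers: int, r: int, k: int) -> float:
    """Approximate fraction of processing saved when context tokens
    are no longer processed after layer r.

    Assumption: k prompt examples, each roughly as long as the test input,
    so the context makes up about k / (k + 1) of the sequence.
    """
    if not 0 <= r <= n_layers:
        raise ValueError("r must lie between 0 and n_layers")
    if k < 0:
        raise ValueError("k must be non-negative")
    layer_fraction = (n_layers - r) / n_layers  # share of layers that skip the context
    context_fraction = k / (k + 1)              # share of tokens that are context
    return layer_fraction * context_fraction


# Values from the text: 32 layers, cutoff at layer 14, prompt size 5.
savings = estimated_savings(n_layers=32, r=14, k=5)
print(round(savings, 2))
```

Under these assumptions the result is roughly 0.47, which is in line with the approximately 45% quoted above.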

For instruction-tuned models, which are typically deployed in production, the savings can be non-trivial even if we assume that no examples are provided, because very long-form instructions are typically given to the model in an attempt to control its behavior (prompt engineering).
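For the instruction-only case, the same reasoning can be sketched in terms of token counts rather than the number of examples. This is an illustration under assumptions not stated in the paper: the factor k/(k + 1) is replaced by the share of context tokens in the full sequence, and the lengths used below are made-up numbers, not measurements. The function name `savings_from_token_counts` is hypothetical.

```python
def savings_from_token_counts(
    n_layers: int, r: int, context_tokens: int, input_tokens: int
) -> float:
    """Approximate savings when the context (instructions and/or examples)
    is dropped after layer r, using token counts for the context share.

    Assumption: cost scales with (layers processed) x (tokens processed).
    """
    total_tokens = context_tokens + input_tokens
    if total_tokens <= 0:
        raise ValueError("the sequence must contain at least one token")
    if not 0 <= r <= n_layers:
        raise ValueError("r must lie between 0 and n_layers")
    return (n_layers - r) / n_layers * context_tokens / total_tokens


# Hypothetical numbers: a 400-token instruction and a 40-token input,
# on a 32-layer model with the context dropped after layer 14.
example = savings_from_token_counts(
    n_layers=32, r=14, context_tokens=400, input_tokens=40
)
print(round(example, 2))
```

With k examples of equal length to the input, this reduces to the k/(k + 1) form used in the formula above.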

This paper is available on arXiv under a CC BY 4.0 DEED license.

Where does In-context Translation Happen in Large Language Models: Inference Efficiency
