This story on HackerNoon has a decentralized backup on Sia.
Transaction ID: EHpQ7HrhOh1KtLwzXFYUTHYr4IXn-KTz8X7BNe-cSV8

Where does In-context Translation Happen in Large Language Models: Inference Efficiency

Written by @computational | Published on 2024/8/30

TL;DR
In this study, researchers attempt to characterize the region where large language models transition from in-context learners to translation models.

Authors:

(1) Suzanna Sia, Johns Hopkins University;

(2) David Mueller;

(3) Kevin Duh.

5. Inference Efficiency

Speeding up transformer inference is of great interest to the community (Fournier et al., 2023). We highlight the potential for faster inference as a direct consequence of identifying where task recognition occurs in the model and of the redundancy of self-attention processing. Our results indicate that we can achieve significant speedups in inference by removing the processing of context tokens altogether after a certain point in the model, with little to no impact on downstream performance.

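Let r be the layer after which the context tokens no longer need to be processed, and let k be the number of prompt examples. Then, for a model with nℓ layers, the amount of processing saved, in terms of both speed and memory, is approximately (nℓ − r)/nℓ × k/(k + 1).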

Using the example of LLAMA7B (32 layers), we see from Figure 2 that the model is very close to its ceiling score after processing the examples at layer 14 (ℓ = 14). If we no longer need to process examples after ℓ = 14, under a prompt size of 5 the savings are approximately 45%.
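The arithmetic behind this estimate can be reproduced with a short Python sketch. It is an illustration rather than code from the paper: the function name `estimated_savings` and its parameters are hypothetical, and the second factor of the formula is read as k/(k + 1), that is, the share of the sequence taken up by k examples when one test input follows them.

```python
def estimated_savings(n_layers: int, r: int, k: int) -> float:
    """Approximate fraction of processing saved when examples are dropped after layer r.

    Computes (n_layers - r) / n_layers * k / (k + 1).
    Assumption: k is the number of prompt examples and r is the last layer
    that still processes them.
    """
    if not 0 <= r <= n_layers:
        raise ValueError("r must lie between 0 and n_layers")
    if k < 0:
        raise ValueError("k must be non-negative")
    layer_fraction = (n_layers - r) / n_layers  # share of layers that skip the examples
    context_fraction = k / (k + 1)              # share of the sequence made up of examples
    return layer_fraction * context_fraction


# Setting from the text: 32 layers, examples dropped after layer 14, prompt size 5.
print(f"{estimated_savings(32, 14, 5):.3f}")  # prints 0.469
```

Under this reading the formula gives roughly 0.47 for the LLAMA7B setting, in the same range as the approximate figure quoted above.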

For instruction-tuned models, which are typically deployed in production, the savings can be non-trivial even if we assume that no examples are provided, because very long-form instructions are typically given to the model in an attempt to control its behavior (prompt engineering).
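One way to make the instruction case concrete is to count tokens instead of examples. The sketch below is an assumption, not a result from the paper: it supposes that cost scales with the number of tokens processed per layer, and that instruction tokens can be dropped after some layer r in the same way as examples. The function name and all numbers (a 32-layer model, r = 14, 400 instruction tokens, 100 input tokens) are hypothetical.

```python
def estimated_savings_tokens(n_layers: int, r: int,
                             context_tokens: int, input_tokens: int) -> float:
    """Token-level variant of the savings estimate (an assumed generalization).

    Replaces the example ratio with the share of tokens that belong to the
    instruction or context: (n_layers - r) / n_layers * context / (context + input).
    """
    if not 0 <= r <= n_layers:
        raise ValueError("r must lie between 0 and n_layers")
    if context_tokens < 0 or input_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = context_tokens + input_tokens
    if total == 0:
        raise ValueError("at least one token is required")
    return (n_layers - r) / n_layers * context_tokens / total


# Hypothetical deployment: a long instruction followed by a short user input.
print(f"{estimated_savings_tokens(32, 14, 400, 100):.2f}")  # prints 0.45
```

The longer the instruction is relative to the input, the closer the second factor gets to 1, so the estimate approaches the share of layers that skip the instruction.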

This paper is available on arXiv under the CC BY 4.0 DEED license.



