Table of Links
3 SUTRA Approach
4 Training Multilingual Tokenizers
5 Multilingual MMLU
5.1 Massive Multitask Language Understanding
5.2 Extending MMLU to Multiple Languages and 5.3 Consistent Performance across Languages
5.4 Comparing with leading models for Multilingual Performance
6 Quantitative Evaluation for Real-Time Queries
7 Discussion and Conclusion, and References
3 SUTRA Approach
3.1 What is SUTRA?
SUTRA is a novel multilingual large language model architecture that decouples concept learning from language learning. The design is inspired by how humans learn: we first understand the world through concepts and only then gradually learn our native language, and once fluent in one language we can pick up new languages without re-learning the common core concepts. Following this intuition, the core LLM capabilities in SUTRA operate within a conceptual or latent space, while the heavy lifting of tokenization and translation is handled by specialized encoders and decoders inspired by Neural Machine Translation. This separation makes training more scalable and makes it easier to extend the model to a large number of languages.
Our training methodology unfolds in three phases, sketched in the code example after the list below: concept learning, language learning, and language alignment.
• Concept Learning: Initially, the core concept model undergoes training to grasp concepts within a small set of languages, setting a solid foundation for understanding basic concepts and skills.
• Language Learning: In parallel, we train specialized Neural Machine Translation (NMT)-based encoders and decoders, alongside a multilingual tokenizer, specifically designed to master multi-language translation and ensure concept consistency across languages.
• Language Alignment: Finally, we perform language alignment, merging concept understanding with linguistic proficiency.
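To make the phasing concrete, here is a minimal PyTorch-style sketch of how the three phases could be scheduled over three trainable modules. The module names (`nmt_encoder`, `concept_model`, `nmt_decoder`), the placeholder layers, and the `set_trainable` helper are illustrative assumptions, not the paper's actual implementation.

```python
import torch.nn as nn

# Illustrative placeholders only: in practice the concept model is a large
# transformer and the encoder/decoder are NMT-style sequence models.
nmt_encoder = nn.Linear(512, 512)
concept_model = nn.Linear(512, 512)
nmt_decoder = nn.Linear(512, 512)

def set_trainable(module: nn.Module, flag: bool) -> None:
    """Freeze or unfreeze every parameter of a module."""
    for p in module.parameters():
        p.requires_grad = flag

# Phase 1 -- concept learning: only the concept model is updated,
# on data from a small set of languages.
set_trainable(concept_model, True)
set_trainable(nmt_encoder, False)
set_trainable(nmt_decoder, False)
# ... language-modeling training of concept_model ...

# Phase 2 -- language learning: the NMT-based encoder/decoder (and the
# multilingual tokenizer) are trained on translation data, independently of
# (and in parallel with) the concept model.
set_trainable(concept_model, False)
set_trainable(nmt_encoder, True)
set_trainable(nmt_decoder, True)
# ... translation training of nmt_encoder / nmt_decoder ...

# Phase 3 -- language alignment: the full stack is tuned together so that the
# encoder/decoder latents line up with the concept model's latent space.
set_trainable(concept_model, True)
set_trainable(nmt_encoder, True)
set_trainable(nmt_decoder, True)
# ... joint alignment fine-tuning ...
```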
In the inference stage, SUTRA follows a structured path: input is processed by the NMT Encoder, then by the Concept Model, and finally by the NMT Decoder to produce the output.
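Below is a minimal, self-contained sketch of that inference path, assuming PyTorch. The component sizes, layer choices, and the `sutra_infer` function name are illustrative assumptions rather than the paper's actual implementation.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the three inference-time components named above;
# dimensions and layer counts are illustrative only.
vocab_size, d_model = 32000, 512
embed = nn.Embedding(vocab_size, d_model)
nmt_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
concept_model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=4)
nmt_decoder = nn.Linear(d_model, vocab_size)

@torch.no_grad()
def sutra_infer(token_ids: torch.Tensor) -> torch.Tensor:
    """Input -> NMT Encoder -> Concept Model -> NMT Decoder -> output tokens."""
    latents = nmt_encoder(embed(token_ids))   # source-language tokens -> latent space
    concepts = concept_model(latents)         # core reasoning in the concept space
    logits = nmt_decoder(concepts)            # concepts -> target-language token logits
    return logits.argmax(dim=-1)              # greedy pick, for illustration only

# Example: a dummy batch of one sequence of 8 token ids.
print(sutra_infer(torch.randint(0, vocab_size, (1, 8))).shape)  # torch.Size([1, 8])
```

Because the concept model only ever operates on latents, extending coverage to a new language in principle only requires extending the encoder/decoder side, not retraining the core model.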
Authors:
(1) Abhijit Bendale, Two Platforms ([email protected]);
(2) Michael Sapienza, Two Platforms ([email protected]);
(3) Steven Ripplinger, Two Platforms ([email protected]);
(4) Simon Gibbs, Two Platforms ([email protected]);
(5) Jaewon Lee, Two Platforms ([email protected]);
(6) Pranav Mistry, Two Platforms ([email protected]).
This paper is available on arxiv.