Last year was all about the race among commercial LLMs. The LLaMA leak and Meta's subsequent open-source moves were significant, and Hugging Face kept flourishing as a platform for open source, but this year starts with a powerful player stepping in with a truly open-source approach.

We are talking, of course, about OLMo: Accelerating the Science of Language Models, released by the Allen Institute for Artificial Intelligence. Newsletters such as AlphaSignal, TheSequence, Data Machina, Smol Talk, and Interconnects (Nathan Lambert is one of the authors of the OLMo paper) explained quite well the difference between almost open-source and truly open-source models.

The gist is that truly open source means releasing not only the model weights and inference code but the whole package: the training data, the training and evaluation code, and a comprehensive framework for studying language modeling.

Who is behind OLMo?

What I found interesting is who stands behind the release. While EleutherAI’s Pythia and BigScience’s BLOOM previously set a precedent for releasing fully open-source models, the distinction with OLMo is its release by a true nonprofit organization: the Allen Institute for AI (AI2). AI2 was founded in 2014 by philanthropist and Microsoft co-founder Paul G. Allen, with a commitment to conducting high-impact research and engineering in artificial intelligence. He was also very interested in teaching machines “common sense.”

And he funded this cause well. I once had a conversation with one of AI2’s top executives, who said that thanks to Paul Allen’s financing structure, AI2 is well funded, is free from the influence of large companies, and is under no pressure to make money.

AI2 is famous for conducting cutting-edge research in AI and aiming to influence the broader AI research community by releasing open-source software, datasets, and research findings. Projects like the Semantic Scholar academic search engine democratize access to information and accelerate scientific breakthroughs.

Why OLMo is special

The OLMo framework includes multiple training checkpoints, logs, the exact datasets used, and a permissive license, establishing a new standard for openness in the field. They also don’t mind the model being used for commercial purposes. Unlike others, the researchers readily embrace openness, believing its benefits outweigh the low risk of misuse, since their models are not designed as chatbots and contribute to science rather than to commercial products.

Furthermore, they released ‘Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research.’ According to Luca Soldaini, ‘the name of the pretraining corpus, “Dolma,” stands for Data to feed OLMo’s Appetite.’

What also surprised me was that the authors highlighted the environmental impact of training large LMs, providing estimates of power consumption and carbon emissions. They advocate for transparency in reporting these impacts and emphasize the potential for open models like OLMo to mitigate future emissions by minimizing redundant model training.

Great start to the year of open-source!

News from The Usual Suspects ©

Meta

“There are several strategic benefits. First, open source software is typically safer and more secure, as well as more compute efficient to operate due to all the ongoing feedback, scrutiny, and development from the community. This is a big deal because safety is one of the most important issues in AI. Efficiency improvements and lowering the compute costs also benefit everyone including us. Second, open source software often becomes an industry standard, and when companies standardize on building with our stack, that then becomes easier to integrate new innovations into our products.

That’s subtle, but the ability to learn and improve quickly is a huge advantage and being an industry standard enables that. Third, open source is hugely popular with developers and researchers. We know that people want to work on open systems that will be widely adopted, so this helps us recruit the best people at Meta, which is a very big deal for leading in any new technology area. And again, we typically have unique data and build unique product integrations anyway, so providing infrastructure like Llama as open source doesn’t reduce our main advantages. This is why our long-standing strategy has been to open source general infrastructure and why I expect it to continue to be the right approach for us going forward.” — Mark Zuckerberg
