sia.hackernoon.com

Recently, Lex Fridman released a five-hour podcast with Dario Amodei, Amanda Askell, and Chris Olah of Anthropic.

https://www.youtube.com/watch?v=ugvHCXCOmm4&t=9578s&embedable=true

Since publishing “Machines of Loving Grace”, his lengthy essay on the vision for AI development, Dario has kept expanding on it. It was also the main focus of his conversation with Lex, with the other Anthropic members elaborating on the same topics.

For those less nerdy than me, I thought it’d be nice to summarise the key ideas that this leading AI team had to share. Since the release of Claude 3.5 Sonnet (New) this autumn, it has been clear that Anthropic’s progress in LLMs is on par with what OpenAI has achieved with its o1-preview model. They are among the leaders in this AI race, which gives them good authority on the topic.

Apart from restating what the Anthropic team has said, I’d also like to fantasize about what each point implies for the practical application of AI on two important timelines: 1 year from now and 5 years from now. I expect my predictions to be wrong (there are simply too many factors at play), but I think it’s a fun piece of mental gymnastics, and it will be interesting to look back at this text once we actually arrive in that future.

1. We don’t know when the scale-up effect will plateau, but NOT YET

One of the key takeaways for me was Dario’s perspective on where continuing down the path of the scaling hypothesis will lead (the hypothesis being that feeding models more and better data, with more compute, makes them correspondingly smarter). Dario seems to hint that simply reusing the old techniques and adding more data may no longer be enough to get a significant boost in AI capability. The main focus of AI labs right now is to understand which part to scale.

In his view, some of the promising avenues are synthetic data generation (applying the AlphaGo approach of trial-and-error training to complex tasks) and adding more guard-railed data, i.e. giving the models examples of good and bad answers for specific domains so that they understand the general rules and apply them a bit better.
- 2025 - The AlphaGo self-training approach will become more common, and models will surpass human ability in additional complex exercises that have a near-immediate feedback loop (maybe trading).
- 2030 - The AlphaGo self-training approach could be generalized so that models self-improve on difficult practical tasks when given sufficient time to practice them.
  
2. The approach to AI safety will develop alongside model development

Autonomy and misuse pose the biggest risks.

Dario claims that his team tests for both risks every time they train a new model, so that they can put safeguards in place before releasing it.

ASL-1 (like a chess bot) - does not pose risks

ASL-2 (current AI models) - does not provide much risky information beyond what can be simply googled.

ASL-3 (can increase capabilities of wrongdoers) - the enablement of cyber, nuclear, and biological weapons via those systems will have to be seriously nerfed before models can be released.

ASL-4+ (smarter than ASL-3 + autonomous) - it is not yet clear how those will be controlled; they will only be de-risked once there are signs of such a model after training.

- 2025 - Dario expects ASL-3 next year. I believe that human misuse of those systems will happen despite the guardrails, as it won’t be possible to catch all bugs before release (think new scams or software viruses).
- 2030 - There will be multiple capable robotic applications of AI, e.g. Tesla Optimus robots. AI will be both embodied and far smarter than an average human in specific domains. It may be difficult to completely prevent misuse of such complex systems, especially in cases where they perform mundane tasks for criminal actors.

3. AGI (or in Dario’s words “powerful AI”) may arrive by 2027
He repeats multiple times that how smart AI becomes will be domain-dependent, and that the blockers to AI development seem to keep falling away. Logically, if human-generated content is used correctly, AI should eventually replicate the human ability to think. By analogy with chess-playing computers and AlphaGo, it is clear that AI can surpass human abilities in specific tasks, and the better documented and more rigid the domain is, the higher the performance should be. So, the worst-case scenario for an eventual AGI is an AI with human-level reasoning that has superb capabilities in the specific fields where we were able to advance its training the most.
Likewise, the actual application of AI will depend on how far a specific industry is from AI developers. Clearly, it is easier for them to test and adapt new models to help write code than to make good use of those models in an agricultural environment. By this logic, the impact of AI should be felt first in IT/coding, then in science, then in big-city business, and only then in the other parts of the economy.
- 2025 - We will begin seeing more impressive/autonomous applications of AI, especially in coding, where non-technical product managers can perform code-based projects without asking for help from a coder.
- 2030 - Every business will integrate AI into its workstream one way or another, and frontier models will have helped with numerous scientific discoveries in fields like biology, physics, and mathematics.
4. Mechanistic interpretability becomes more important for coherent model development
Models are developing quite rapidly, but they remain a black box: it is unclear why they work well and why they work poorly.
Often this means that changing or scaling such models leads to hallucinations, unpredictable actions, or emergent behaviors, which developers would ideally like to understand in advance so that they can make controlled model improvements.

Anthropic dedicates effort to describing what actually happens inside the “mind” of its model, Claude. In theory, this approach should explain why Claude spits out certain answers and how different training methods change the patterns that form within the neural network. On top of that, it is simply fun to explore.
- 2025 - A more comprehensive descriptive interpretation of the Claude model, with new visualizations and detail (published or not depending on how sensitive this info may be for the competitive advantage of Anthropic).
- 2030 - If Anthropic’s approach is successful, every major AI lab may have generated an internal, interpreted map of its AI systems. However, if this approach proves too descriptive, with no real impact on model development, no one will remember mechanistic interpretability in 2030…

Conclusion

Regardless of the predictions, it is going to be fun to observe the next phase of AI. And if no revolution actually happens in 5 years, it will at least be refreshing to reread this article as I finally choose to cancel my then $300/month OpenAI subscription.

What's Next for AI: Interpreting Anthropic CEO's Vision
