Earlier this month, xAI unveiled Grok 4, with founder Elon Musk claiming it is the “smartest AI in the world.”

But is it?

There is little doubt that Grok 4 is a formidable contender in the AI landscape alongside OpenAI’s o3 and Google’s Gemini 2.5 Pro. With advanced reasoning, multimodal capabilities, and a new multi-agent system, Grok 4 aims to redefine AI performance.

The question is whether Grok 4 has the capabilities and innovative features to live up to Musk’s claim.

It is easy to see that Grok 4 is trying to set new standards in AI performance, with top scores across multiple academic and problem-solving benchmarks.

On “Humanity’s Last Exam”, a rigorous 2,500-question test spanning math, humanities, and natural sciences, Grok 4 scored 25.4% without external tools, outperforming Google’s Gemini 2.5 Pro (21.6%) and OpenAI’s o3 (21%). With tools enabled, its advanced variant, Grok 4 Heavy, achieved an impressive 44.4%.
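To put those no-tools margins in perspective, here is a minimal Python sketch that works out Grok 4's lead in both percentage points and relative terms. It uses only the three scores quoted above; the `lead_over` helper is a name made up for this illustration.

```python
# Scores quoted in the article (percent correct, no external tools).
scores = {"Grok 4": 25.4, "Gemini 2.5 Pro": 21.6, "o3": 21.0}


def lead_over(model, rival, table):
    """Return (percentage-point gap, relative gap in percent) of model over rival."""
    gap = table[model] - table[rival]
    relative = 100.0 * gap / table[rival]
    # Round to one decimal place to match the precision of the quoted scores.
    return round(gap, 1), round(relative, 1)


for rival in ("Gemini 2.5 Pro", "o3"):
    gap, relative = lead_over("Grok 4", rival, scores)
    print(f"Grok 4 vs {rival}: +{gap} points ({relative}% relative)")
```

By this arithmetic, Grok 4's lead is 3.8 points over Gemini 2.5 Pro and 4.4 points over o3, or roughly 18% and 21% in relative terms.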

Additionally, Grok 4 secured a 15.9% score on the ARC-AGI-2 test, a pattern-recognition benchmark, nearly doubling the previous commercial state-of-the-art score of Claude Opus 4.

Grok 4 also ranks above OpenAI’s o3 and Google’s Gemini 2.5 Pro on the Artificial Analysis Intelligence Index and posts strong results in coding and math.

These results demonstrate Grok 4’s strength in complex reasoning and academic tasks, a cornerstone of its design. The model’s ability to “think” for seconds to minutes, correcting errors and exploring alternatives, mirrors human problem-solving.

According to xAI, Grok 4 “utilized Colossus, our 200,000 GPU cluster, to run reinforcement learning training that refines Grok's reasoning abilities at pretraining scale.”

It uses innovations “including new infrastructure and algorithmic work that increased the compute efficiency of our training by 6x, as well as a massive data collection effort, where we significantly expanded our verifiable training data from primarily math and coding data to many more domains.”

Grok 4 introduces robust multimodal support, processing text and images with plans for video and audio integration by late 2025.

The company says it will “continue scaling reinforcement learning to unprecedented levels, building on Grok 4's advancements to push the boundaries of artificial intelligence.”

There are also plans to “expand the scope from verifiable rewards in controlled domains to tackling complex real-world problems, where models can learn and adapt in dynamic environments,” while multimodal capabilities integrating vision, audio, and more will see ongoing improvements “for more intuitive interactions.”

Grok 4 Heavy, available through a $300/month SuperGrok Heavy subscription, uses a unique approach, spawning multiple AI agents to collaboratively solve problems, akin to a “study group” comparing solutions for optimal results.
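xAI has not published how Grok 4 Heavy's agents actually compare notes, so the following Python sketch is only a generic illustration of the “study group” idea: several independent solvers tackle the same problem in parallel and the most common answer wins. The `run_study_group` function and the two toy agents are hypothetical stand-ins, not xAI's method or API.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def run_study_group(agents, problem):
    """Run every agent on the same problem in parallel and majority-vote the answers.

    `agents` is a list of callables; each one stands in for an independent model run.
    Returns (winning answer, number of votes it received, all answers in agent order).
    """
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        answers = list(pool.map(lambda agent: agent(problem), agents))
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes, answers


# Hypothetical toy agents: one multiplies correctly, one makes an off-by-one slip.
def careful_agent(problem):
    a, b = problem
    return a * b


def sloppy_agent(problem):
    a, b = problem
    return a * b + 1


winner, votes, answers = run_study_group(
    [careful_agent, sloppy_agent, careful_agent], (6, 7)
)
print(winner, votes, answers)  # prints: 42 2 [42, 43, 42]
```

Majority voting is just one simple way to aggregate answers. A production system could instead have agents critique one another's work or use a separate judge model, and running several agents per query is also one plausible reason for the added latency and cost of such a setup.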

Grok 4’s voice mode, featuring a British-accented assistant named Eve, offers natural, real-time conversations, improving accessibility and user engagement.

Despite its impressive strengths, Grok 4 is not without challenges and limitations, as noted by Data Science Dojo.

First, there is the issue of speed: latency can be noticeable, especially for the multi-agent “Heavy” model. Second, its visual reasoning is still lacking. While it supports images, Grok 4’s vision capabilities still trail dedicated models like Gemini or Claude Opus. Third, xAI still has work to do on scalability. Managing collaborative agents at scale (in Grok 4 Heavy) is complex and still evolving.

Regardless, xAI is pushing forward and has big plans.

We can expect specialized models, similar to OpenAI’s approach, with focused versions for coding, multimodal generation, and even video reasoning. Smaller Grok variants may also be open-sourced to support research and transparency (much as Musk did with Hyperloop). Musk also envisions Grok as a step toward AGI, with AI teaming up with humans to solve tough scientific and societal problems.

Grok 4’s launch was overshadowed by controversies surrounding its predecessor, Grok 3, which posted antisemitic content and praised Adolf Hitler on X. These incidents, attributed to Grok’s over-compliance with user prompts and a system prompt encouraging “politically incorrect” responses, raised serious concerns about its safety and reliability. xAI has since worked to correct the issue and implemented measures to ban hate speech, but the lack of transparency has fueled skepticism, especially since the model remains vulnerable to manipulation and hallucinations.

Grok 4’s premium pricing, particularly the $300/month SuperGrok Heavy tier, limits accessibility for smaller organizations and individual researchers. Unless this changes, the platform will remain inaccessible to many of those who need it most.

Grok 4 still struggles with common-sense reasoning and with applying its problem-solving to real-world tasks. While the platform is incredibly impressive, there is still room for improvement, and consumers continue to look forward to successive models that break new ground and open new possibilities. The question is which company will establish itself as the industry leader.