In my previous benchmarks, I mostly worked with Python: the original SWE Bench and SWE Bench Verified use it, and so do I, along with Go, C#, JavaScript, Bash, and others occasionally. So I was naturally curious: how do LLM results vary across languages? My assumption was that models perform better with more popular languages, given the larger volume of publicly available code. That assumption turned out to be likely correct.
This aligns with findings from
Benchmark
As in my previous benchmarks, I used the Leetcode online judge to verify LLMs' skills at solving algorithmic problems. But this time, I experimented with four languages of differing popularity.
Languages
Leetcode supports about 20 languages for algorithmic problems at the time of writing. It doesn't publish language stats explicitly, but users post their solutions, and the platform provides stats for those posts, so I was able to derive language popularity. Note that it is based on a few random problems, not the whole Leetcode database.
| Language | Published solutions, % |
|---|---|
| C++ | 26.21% |
| Java | 25.60% |
| Python3 | 17.81% |
| Python | 7.99% |
| JavaScript | 6.68% |
| C | 6.45% |
| Go | 2.17% |
| C# | 2.12% |
| TypeScript | 1.44% |
| Swift | 0.86% |
| Kotlin | 0.74% |
| Rust | 0.65% |
| PHP | 0.43% |
| Ruby | 0.36% |
| Dart | 0.25% |
| Scala | 0.16% |
| Elixir | 0.05% |
| Racket | 0.03% |
I picked four languages: Java and Python3, as two of the most popular. Leetcode distinguishes between Python 3 and 2; there are minimal differences between them, and solutions for version 2 will almost always work for version 3. Then I picked Rust, which has 50 times fewer published solutions, but its popularity is rapidly rising among the engineering community, making it an interesting case. And finally, Elixir, a niche language with just a handful of solutions.
The popularity of these four on Leetcode correlates with the TIOBE index:
| Language | TIOBE Rating, % |
|---|---|
| Python | 21.80 |
| Java | 8.12 |
| Rust | 1.32 |
| Elixir | 0.19 |
Additionally, I looked up the number of public GitHub repos for those four:
| Language | GitHub Repos, Millions |
|---|---|
| Python | 26.50 |
| Java | 20.20 |
| Rust | 1.00 |
| Elixir | 0.12 |
In short, Java and Python3 represent the most common programming languages with millions of public projects, and I expected that LLMs would handle them very well. Elixir is on the opposite side of the scale, with orders of magnitude less available code, so LLMs' abilities may diminish with it. Rust is somewhere in the middle — clearly popular, but can LLMs handle it well?
Problem Set
I picked 100 problems, published between Oct 2025 and Feb 2026.
| Easy | Medium | Hard | Total |
|---|---|---|---|
| 15 | 59 | 26 | 100 |
The intention was to get recent problems, likely "unseen" by LLMs. It is known that solutions to older and especially popular problems make their way into the models' training sets.
Models
The models used in the benchmark are listed in the table below, with all non-default parameters specified. Release and knowledge cutoff dates are obtained from the vendor's official documentation and provided for reference.
| Vendor | Model | Release date | Knowledge cutoff date | "Reasoning" | Parameters |
|---|---|---|---|---|---|
| Anthropic | claude-sonnet-4-5-20250929 | Sep 2025 | Jul 2025 | No | temperature = 0.0 |
| Google | gemini-3-flash-preview | Dec 2025 | unknown | Yes | temperature = 0.0 |
| Google | gemini-2.5-flash | Apr 2025 | unknown | Yes | temperature = 0.0 |
| xAI | grok-code-fast-1-0825 | Aug 2025 | unknown | Yes | seed = 42 |
| OpenAI | gpt-5-mini | Aug 2025 | May 2024 | Yes | seed = 42 |
All models, except Gemini 3 Flash (Preview), were released earlier than the oldest problem in the dataset (Oct 2025).
The benchmark aimed to be as deterministic and reproducible as possible; therefore, parameters such as "temperature" or "seed" were used. However, none of the models tested guarantee fully deterministic output. This should be kept in mind when reproducing these results.
All models support "reasoning" or "thinking" modes by default, except for Claude Sonnet 4.5. Other model features (or "tools") like web search were not enabled, even if supported.
Results
A problem is considered "accepted" or "solved" if the solution was accepted by the online judge. All other outcomes, like "wrong answer" or "time limit exceeded," are simply "not accepted" without any differentiation.
| Model | python3 | java | 𝝙 python3 | rust | 𝝙 python3 | elixir | 𝝙 python3 |
|---|---|---|---|---|---|---|---|
| claude-sonnet-4-5-20250929 | 50% | 52% | +2 | 51% | +1 | 35% | -15 |
| gemini-2.5-flash | 82% | 82% | +0 | 77% | -5 | 39% | -43 |
| gemini-3-flash-preview | 84% | 93% | +9 | 78% | -6 | 83% | -1 |
| gpt-5-mini | 93% | 94% | +1 | 80% | -13 | 63% | -30 |
| grok-code-fast-1-0825 | 73% | 65% | -8 | 65% | -8 | 30% | -43 |
The results show a clear drop for Elixir across most models. But are these differences statistically meaningful?
To assess whether differences in pass rates between languages are statistically significant, I used a two-proportion z-test. For two languages each tested on N=100 problems, the minimum detectable difference at p=0.05 is given by 1.96×√(2p̄(1-p̄)/N), where p̄ is the average acceptance rate across the two languages.
Taking Python as a baseline, the Python-Java and Python-Rust gaps are non-significant for all models (thresholds ~11.7pp and ~12.3pp, respectively).
The Python-Elixir gap, however, exceeds its threshold of ~13.4pp for all models except Gemini 3 Flash Preview, indicating that they handle Elixir significantly worse.
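As a sanity check, the threshold formula can be evaluated directly. A minimal sketch in Python, plugging in the average acceptance rates taken from the results table (the function name is mine, not part of any library):

```python
from math import sqrt

def min_detectable_diff(p_bar: float, n: int = 100, z: float = 1.96) -> float:
    """Minimum detectable difference for a two-proportion z-test at p = 0.05:
    z * sqrt(2 * p_bar * (1 - p_bar) / N), where p_bar is the average
    acceptance rate of the two languages and N the number of problems."""
    return z * sqrt(2 * p_bar * (1 - p_bar) / n)

# Average acceptance rates over the five models, from the results table.
python3 = 0.764
for lang, rate in [("java", 0.772), ("rust", 0.702), ("elixir", 0.500)]:
    p_bar = (python3 + rate) / 2
    print(f"python3 vs {lang}: ~{min_detectable_diff(p_bar) * 100:.1f} pp")
# python3 vs java: ~11.7 pp
# python3 vs rust: ~12.3 pp
# python3 vs elixir: ~13.4 pp
```

Note that the threshold grows as the average rate approaches 50%, which is why the Python-Elixir comparison has the largest one.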
Database Problems
Interestingly, this pattern holds for SQL as well. I had a collection of 321 Leetcode database problems, published from 2015 to 2025.
| Easy | Medium | Hard | Total |
|---|---|---|---|
| 114 | 142 | 65 | 321 |
I used the same five LLMs as in the algorithmic benchmark, but with only two languages: MySQL and Oracle SQL. Though the two dialects are largely interchangeable, there are subtle differences.
For Oracle SQL, there are 15 times fewer published solutions on Leetcode than for MySQL. TIOBE and GitHub don't provide any statistics for these two — they are database dialects rather than programming languages.
Given that most problems predate the models' knowledge cut-off dates, contamination is possible and should be kept in mind when interpreting these results.
| Model | MySQL | Oracle SQL | 𝝙 |
|---|---|---|---|
| claude-sonnet-4-5-20250929 | 87.5% | 76.3% | -11.2 |
| gemini-2.5-flash | 86.6% | 67.9% | -18.7 |
| gemini-3-flash-preview | 95.6% | 85.7% | -9.9 |
| gpt-5-mini | 89.1% | 79.4% | -9.7 |
| grok-code-fast-1-0825 | 80.4% | 66.7% | -13.7 |
With N=321 problems and average pass rates around 82%, the significance threshold is approximately 6 percentage points.
That means every tested model shows a significantly higher acceptance rate for MySQL.
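The same back-of-the-envelope check as before, with N = 321 and the ~82% average pass rate read off the table above:

```python
from math import sqrt

# Two-proportion z-test threshold at p = 0.05, with N = 321 problems
# and an average pass rate of ~0.82 (from the SQL results table).
threshold = 1.96 * sqrt(2 * 0.82 * (1 - 0.82) / 321)
print(f"~{threshold * 100:.1f} pp")  # ~5.9 pp
```

Every model's MySQL-Oracle gap in the table, from 9.7 to 18.7 percentage points, clears that bar.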
Conclusion
We can see that LLM performance on coding problems correlates with language popularity. This is perhaps surprising: algorithmic problems are largely language-agnostic, so one might expect the underlying logic to transfer across languages. Yet, the data shows otherwise — the language you write in matters, even when the algorithm itself does not change.
In Python and Java, the most widely used languages, models perform better than in Elixir, a niche language. The same trend holds for SQL problems, where LLMs do better in MySQL than in Oracle SQL.
The most likely explanation is training data density: more popular languages generate more code examples, giving models more material to learn from.
The practical implication is straightforward: if you rely on LLMs for coding assistance, your language choice matters — potentially as much as your model choice. Working with uncommon languages means accepting meaningfully weaker AI support, though Gemini 3 Flash Preview is a notable exception, showing near-uniform results across all tested languages for algorithmic problems.
However, the exact shape of the popularity-performance relationship is unclear: Rust, despite having far fewer public repositories and published Leetcode solutions, showed no statistically significant difference from Python.
Several directions would be worth exploring. First, expanding the problem set would allow the Rust finding to be confirmed or ruled out. Second, testing additional languages such as Scala, Dart, or Racket would help establish the popularity-performance relationship more precisely. And, as LLMs continue to evolve, it will be worth tracking whether the gap for niche languages narrows over time.
Links
Dataset used for this benchmark:
Tool used for prompting and submitting solutions: