In my previous benchmarks, I mostly worked with Python: the original SWE Bench and SWE Bench Verified use it, and so do I, along with Go, C#, JavaScript, Bash, and others occasionally. So I was naturally curious: how do LLM results vary across languages? My assumption was that models perform better with more popular languages, given the larger volume of publicly available code. That assumption turned out to be likely correct.
This aligns with findings from
Benchmark
As in my previous benchmarks, I used the Leetcode online judge to verify LLMs' skills at solving algorithmic problems. But this time, I experimented with four languages of differing popularity.
Languages
Leetcode supports about 20 languages for algorithmic problems at the time of writing. It doesn't publish language stats explicitly, but users post their solutions, and the platform provides stats for those posts, so I was able to derive language popularity. Note that it is based on a few random problems, not the whole Leetcode database.
| Language | Published solutions, % |
|---|---|
| C++ | 26.21% |
| Java | 25.60% |
| Python3 | 17.81% |
| Python | 7.99% |
| JavaScript | 6.68% |
| C | 6.45% |
| Go | 2.17% |
| C# | 2.12% |
| TypeScript | 1.44% |
| Swift | 0.86% |
| Kotlin | 0.74% |
| Rust | 0.65% |
| PHP | 0.43% |
| Ruby | 0.36% |
| Dart | 0.25% |
| Scala | 0.16% |
| Elixir | 0.05% |
| Racket | 0.03% |
I picked four languages: Java and Python3, as two of the most popular. Leetcode distinguishes between Python 3 and 2; there are minimal differences between them, and solutions for version 2 will almost always work for version 3. Then I picked Rust, which has 50 times fewer published solutions, but its popularity is rapidly rising among the engineering community, making it an interesting case. And finally, Elixir, a niche language with just a handful of solutions.
The popularity of these four on Leetcode correlates with the TIOBE index:
| Language | TIOBE Rating, % |
|---|---|
| Python | 21.80 |
| Java | 8.12 |
| Rust | 1.32 |
| Elixir | 0.19 |
Additionally, I looked up the number of public GitHub repos for those four:
| Language | GitHub Repos, Millions |
|---|---|
| Python | 26.50 |
| Java | 20.20 |
| Rust | 1.00 |
| Elixir | 0.12 |
In short, Java and Python3 represent the most common programming languages with millions of public projects, and I expected that LLMs would handle them very well. Elixir is on the opposite side of the scale, with orders of magnitude less available code, so LLMs' abilities may diminish with it. Rust is somewhere in the middle — clearly popular, but can LLMs handle it well?
Problem Set
I picked 100 problems, published between Oct 2025 and Feb 2026.
| Easy | Medium | Hard | Total |
|---|---|---|---|
| 15 | 59 | 26 | 100 |
The intention was to get recent problems, likely "unseen" by LLMs. It is known that solutions to older and especially popular problems make their way into the models' training sets.
Models
The models used in the benchmark are listed in the table below, with all non-default parameters specified. Release and knowledge cutoff dates are obtained from the vendor's official documentation and provided for reference.
| Vendor | Model | Release date | Knowledge cutoff date | "Reasoning" | Parameters |
|---|---|---|---|---|---|
| Anthropic | claude-sonnet-4-5-20250929 | Sep 2025 | Jul 2025 | No | temperature = 0.0 |
| Google | gemini-3-flash-preview | Dec 2025 | unknown | Yes | temperature = 0.0 |
| Google | gemini-2.5-flash | Apr 2025 | unknown | Yes | temperature = 0.0 |
| xAI | grok-code-fast-1-0825 | Aug 2025 | unknown | Yes | seed = 42 |
| OpenAI | gpt-5-mini | Aug 2025 | May 2024 | Yes | seed = 42 |
All models, except Gemini 3 Flash (Preview), were released earlier than the oldest problem in the dataset (Oct 2025).
The benchmark aimed to be as deterministic and reproducible as possible; therefore, parameters such as "temperature" or "seed" were used. However, none of the models tested guarantee fully deterministic output. This should be kept in mind when reproducing these results.
All models support "reasoning" or "thinking" modes by default, except for Claude Sonnet 4.5. Other model features (or "tools") like web search were not enabled, even if supported.
Results
A problem is considered "accepted" or "solved" if the solution was accepted by the online judge. All other outcomes, like "wrong answer" or "time limit exceeded," are simply "not accepted" without any differentiation.
| Model | python3 | java | 𝝙 python3 | rust | 𝝙 python3 | elixir | 𝝙 python3 |
|---|---|---|---|---|---|---|---|
| claude-sonnet-4-5-20250929 | 50% | 52% | +2 | 51% | +1 | 35% | -15 |
| gemini-2.5-flash | 82% | 82% | +0 | 77% | -5 | 39% | -43 |
| gemini-3-flash-preview | 84% | 93% | +9 | 78% | -6 | 83% | -1 |
| gpt-5-mini | 93% | 94% | +1 | 80% | -13 | 63% | -30 |
| grok-code-fast-1-0825 | 73% | 65% | -8 | 65% | -8 | 30% | -43 |
The results show a clear drop for Elixir across most models. But are these differences statistically meaningful?
To assess whether differences in pass rates between languages are statistically significant, I used a two-proportion z-test. For two languages each tested on N=100 problems, the minimum detectable difference at p=0.05 is given by 1.96×√(2p̄(1-p̄)/N), where p̄ is the average acceptance rate across the two languages.
Taking Python as a baseline, the Python-Java and Python-Rust gaps are non-significant for all models (thresholds ~11.7pp and ~12.3pp, respectively).
The Python-Elixir gap, however, exceeds its threshold of ~13.4pp for all models except Gemini 3 Flash Preview, indicating that they handle Elixir significantly worse.
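As a sanity check, the threshold formula can be evaluated directly. A minimal sketch in Python, plugging in the average acceptance rates taken from the results table (the function name is mine, not part of any library):

```python
from math import sqrt

def min_detectable_diff(p_bar: float, n: int = 100, z: float = 1.96) -> float:
    """Minimum detectable difference for a two-proportion z-test at p = 0.05:
    z * sqrt(2 * p_bar * (1 - p_bar) / N), where p_bar is the average
    acceptance rate of the two languages and N the number of problems."""
    return z * sqrt(2 * p_bar * (1 - p_bar) / n)

# Average acceptance rates over the five models, from the results table.
python3 = 0.764
for lang, rate in [("java", 0.772), ("rust", 0.702), ("elixir", 0.500)]:
    p_bar = (python3 + rate) / 2
    print(f"python3 vs {lang}: ~{min_detectable_diff(p_bar) * 100:.1f} pp")
# python3 vs java: ~11.7 pp
# python3 vs rust: ~12.3 pp
# python3 vs elixir: ~13.4 pp
```

Note that the threshold grows as the average rate approaches 50%, which is why the Python-Elixir comparison has the largest one.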
Database Problems
Interestingly, this pattern holds for SQL as well. I had a collection of 321 Leetcode database problems, published from 2015 to 2025.
| Easy | Medium | Hard | Total |
|---|---|---|---|
| 114 | 142 | 65 | 321 |
I used the same five LLMs as in the algorithmic benchmark, but with only two languages: MySQL and Oracle SQL. Though the two dialects are largely interchangeable, there are subtle differences.
For Oracle SQL, there are 15 times fewer published solutions on Leetcode than for MySQL. TIOBE and GitHub don't provide any statistics for these two — they are database dialects rather than programming languages.
Given that most problems predate the models' knowledge cut-off dates, contamination is possible and should be kept in mind when interpreting these results.
| Model | MySQL | Oracle SQL | 𝝙 |
|---|---|---|---|
| claude-sonnet-4-5-20250929 | 87.5% | 76.3% | -11.2 |
| gemini-2.5-flash | 86.6% | 67.9% | -18.7 |
| gemini-3-flash-preview | 95.6% | 85.7% | -9.9 |
| gpt-5-mini | 89.1% | 79.4% | -9.7 |
| grok-code-fast-1-0825 | 80.4% | 66.7% | -13.7 |
With N=321 problems and average pass rates around 82%, the significance threshold is approximately 6 percentage points.
That means every tested model shows a significantly higher acceptance rate for MySQL.
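The same back-of-the-envelope check as before, with N = 321 and the ~82% average pass rate read off the table above:

```python
from math import sqrt

# Two-proportion z-test threshold at p = 0.05, with N = 321 problems
# and an average pass rate of ~0.82 (from the SQL results table).
threshold = 1.96 * sqrt(2 * 0.82 * (1 - 0.82) / 321)
print(f"~{threshold * 100:.1f} pp")  # ~5.9 pp
```

Every model's MySQL-Oracle gap in the table, from 9.7 to 18.7 percentage points, clears that bar.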
Conclusion
We can see that LLM performance on coding problems correlates with language popularity. This is perhaps surprising: algorithmic problems are largely language-agnostic, so one might expect the underlying logic to transfer across languages. Yet, the data shows otherwise — the language you write in matters, even when the algorithm itself does not change.
In Python and Java, the most widely used languages, models perform better than in Elixir, a niche language. The same trend holds for SQL problems, where LLMs do better in MySQL than in Oracle SQL.
The most likely explanation is training data density: more popular languages generate more code examples, giving models more material to learn from.
The practical implication is straightforward: if you rely on LLMs for coding assistance, your language choice matters — potentially as much as your model choice. Working with uncommon languages means accepting meaningfully weaker AI support, though Gemini 3 Flash Preview is a notable exception, showing near-uniform results across all tested languages for algorithmic problems.
However, the exact shape of the popularity-performance relationship is unclear: Rust, despite having far fewer public repositories and published Leetcode solutions, showed no statistically significant difference from Python.
Several directions would be worth exploring. First, expanding the problem set would allow the Rust finding to be confirmed or ruled out. Second, testing additional languages such as Scala, Dart, or Racket would help establish the popularity-performance relationship more precisely. And, as LLMs continue to evolve, it will be worth tracking whether the gap for niche languages narrows over time.
Links
Dataset used for this benchmark:
Tool used for prompting and submitting solutions: