The gap between talking and doing

Large language models excel at discussing programming concepts, explaining terminal commands, and reasoning about file systems. Yet when asked to actually accomplish a task in a terminal, they fail spectacularly. They suggest nonsensical commands, misinterpret output, and give up at the first error. This gap between linguistic capability and practical competence has persisted despite rapid advances in model scale and architecture.

The industry's response has been predictable: build bigger models, with more parameters, more training tokens, more compute. Yet recent work shows that even substantial models like Qwen3-32B achieve only 3.4% on Terminal-Bench 2.0, a standard benchmark for terminal task completion. This suggests the bottleneck isn't model capacity. It's something more fundamental: the training data itself.

A new paper approaches terminal agent capabilities through a different lens. Rather than chasing model scale or architectural innovations, the authors conducted a systematic study of data engineering practices for terminal agents. The conclusion challenges conventional wisdom: a carefully constructed dataset combined with strategic filtering and curriculum learning can teach an 8B parameter model to match the performance of models four to ten times larger trained on standard data.

The unsexy truth about capability

The conventional story about AI progress emphasizes algorithmic breakthroughs and computational scale. What actually happens in practice is less glamorous. For embodied tasks, where models need to execute sequences of actions rather than simply generate text, what you train on matters far more than how much compute you throw at the problem.

This paper introduces three key contributions that make this shift possible. First, Terminal-Task-Gen, a lightweight synthetic task generation pipeline that supports both seed-based and skill-based task construction. Second, a comprehensive analysis of filtering strategies, curriculum learning approaches, and scaling behavior. Third, Terminal-Corpus, a large-scale open-source dataset of terminal interactions that demonstrates these principles work in practice.

The results vindicate this approach. Nemotron-Terminal models, trained on Terminal-Corpus and initialized from Qwen base models, achieve substantial performance jumps: the 8B version improves from 2.5% to 13.0%, the 14B version from 4.0% to 20.2%, and the 32B version from 3.4% to 27.4%. These aren't incremental improvements: each model's score grows to several times its baseline.

Where does high-quality training data come from?

Manually creating thousands of high-quality terminal interactions would be prohibitively expensive. A human expert writing terminal task trajectories might produce a few per day. Building a dataset with enough diversity to teach genuine capability would require months of expert time and substantial cost. So the paper takes a different approach: systematize the process of generating diverse, realistic terminal tasks.

Terminal-Task-Gen operates in two phases. The first phase, Dataset Adaptation, takes existing benchmarks and task descriptions from sources like Terminal-Bench, then reformulates them as interactive terminal interactions. This provides a foundation but is limited in coverage. Few benchmarks exist for terminal tasks, and even those that do capture only a fraction of possible terminal operations.

The second phase, Synthetic Task Generation, is where the real leverage appears. The pipeline defines a Skill Taxonomy, a structured breakdown of terminal operations and concepts. These skills range from basic navigation (moving between directories, listing files) to more complex operations (understanding command output, iterating based on errors, chaining operations together). By combining skills from this taxonomy in different ways, the system generates novel terminal tasks that teach these skills systematically.
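
To make skill-based construction concrete, here is a minimal sketch of how combining taxonomy entries yields task specifications. All skill names and the `sample_task` helper are illustrative assumptions, not the paper's actual taxonomy:

```python
import random

# Hypothetical skill taxonomy; the paper's real taxonomy is not reproduced here.
SKILL_TAXONOMY = {
    "navigation": ["change directory", "list files", "find files by name"],
    "file_ops": ["copy files", "move files", "archive a directory"],
    "text_processing": ["grep a pattern", "pipe output through sort"],
    "error_recovery": ["retry after a failed command", "read stderr and adjust"],
}

def sample_task(num_skills: int = 2, seed: int = 0) -> dict:
    """Combine skills from distinct taxonomy branches into one task spec."""
    rng = random.Random(seed)
    categories = rng.sample(sorted(SKILL_TAXONOMY), num_skills)
    skills = [rng.choice(SKILL_TAXONOMY[c]) for c in categories]
    return {
        "categories": categories,
        "skills": skills,
        # In a real pipeline, this prompt would go to an LLM that
        # synthesizes a concrete task and environment setup.
        "prompt": "Write a terminal task that requires: " + "; ".join(skills),
    }

task = sample_task(num_skills=2, seed=42)
```

Because tasks are sampled across taxonomy branches, coverage of the skill space is systematic rather than incidental.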

Overview of Terminal-Task-Gen combining Dataset Adaptation and Synthetic Task Generation. The pipeline takes benchmark data and a skill taxonomy, producing diverse terminal interaction trajectories.

The output is Terminal-Corpus, a dataset containing thousands of terminal interaction sequences. Unlike static benchmarks, these trajectories capture the dynamic nature of terminal interaction: the user issues a command, observes output, interprets that output, and adjusts their approach accordingly. This mimics how humans actually use terminals, which is critical because models trained on static problem-solution pairs often fail to handle unexpected outputs or errors.
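
The act-observe-interpret-adjust loop described above can be captured in a simple data structure. This is a hypothetical schema for illustration, not the paper's actual trajectory format:

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    """One step of a terminal interaction: act, observe, reflect."""
    command: str    # what the agent typed
    output: str     # what the terminal returned
    reasoning: str  # why the agent chose its next move

@dataclass
class Trajectory:
    task: str
    turns: list[Turn] = field(default_factory=list)

    def recovers_from_error(self) -> bool:
        """True if some non-final turn hit an error, i.e. later turns adapt."""
        return any("No such file" in t.output or "command not found" in t.output
                   for t in self.turns[:-1])

traj = Trajectory(task="count the lines in log.txt")
traj.turns.append(Turn("wc -l logs.txt",
                       "wc: logs.txt: No such file or directory",
                       "Wrong filename; list the directory to find it."))
traj.turns.append(Turn("ls", "log.txt", "Found it; retry with the right name."))
traj.turns.append(Turn("wc -l log.txt", "42 log.txt", "Done."))
```

Trajectories like this one, where a wrong command is observed and corrected, are exactly the error-recovery behavior that static problem-solution pairs fail to teach.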

Curating signal from noise

Not all synthetic data improves model performance. Some generated tasks might be trivially easy, offering no learning signal. Others might be internally inconsistent, teaching the model to hallucinate plausible-sounding but incorrect commands. Still others might be so convoluted that they confuse rather than clarify patterns.

The paper systematically studies filtering strategies to distinguish high-signal examples from low-signal ones. The analysis reveals which filtering criteria actually correlate with downstream performance on Terminal-Bench 2.0. This matters because naive scaling, where you simply generate enormous amounts of data and train on all of it, typically underperforms careful curation.

Some trajectories might be rejected because they contain errors in their reasoning or incorrect command sequences. Others might be excluded because they're too similar to existing examples, offering little diversity. The filtering process is not arbitrary; it's grounded in empirical analysis of what data actually improves model performance.

This represents a fundamental insight about data engineering: curation is as important as generation. A smaller dataset of high-quality examples outperforms a larger, noisier one. The specific filtering strategies are context-dependent, but the principle is universal.
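
As an illustration of the principle, a filtering pass might chain simple rejection rules. The specific criteria below (triviality, terminal failure, command-sequence dedup) are assumptions for this sketch, not the paper's actual filters:

```python
# Illustrative filters; real pipelines would use stronger checks
# (e.g. embedding-based dedup, verified task completion).
ERROR_MARKERS = ("command not found", "Traceback", "Segmentation fault")

def too_trivial(traj: list[dict]) -> bool:
    """Single-turn trajectories offer little interaction signal."""
    return len(traj) <= 1

def ends_in_failure(traj: list[dict]) -> bool:
    """Reject trajectories whose final output still shows an error."""
    return any(m in traj[-1]["output"] for m in ERROR_MARKERS)

def near_duplicate(traj: list[dict], seen: set) -> bool:
    """Crude dedup on the exact command sequence."""
    key = tuple(t["command"] for t in traj)
    if key in seen:
        return True
    seen.add(key)
    return False

def filter_corpus(corpus):
    seen: set = set()
    return [t for t in corpus
            if not too_trivial(t)
            and not ends_in_failure(t)
            and not near_duplicate(t, seen)]

corpus = [
    [{"command": "ls", "output": "a.txt"}],                    # trivial
    [{"command": "make", "output": "gcc main.c"},
     {"command": "./a.out", "output": "Segmentation fault"}],  # fails
    [{"command": "grep -r TODO .", "output": "main.c: TODO"},
     {"command": "wc -l main.c", "output": "120 main.c"}],     # keep
    [{"command": "grep -r TODO .", "output": "main.c: TODO"},
     {"command": "wc -l main.c", "output": "120 main.c"}],     # duplicate
]
kept = filter_corpus(corpus)
```

Of the four candidate trajectories, only one survives: the trivial, failing, and duplicate examples are all rejected before training.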

Structuring the learning process

Once you have filtered, high-quality data, the question of how to present it during training becomes crucial. Not all orderings are equally effective.

Curriculum learning applies a simple principle: harder material is easier to learn when preceded by foundational material. A model learning terminal tasks benefits from first encountering simple interactions, then gradually progressing to more complex ones. This scaffolding makes learning more efficient than random sampling.

For terminal tasks, natural curriculum structures emerge. Basic navigation (changing directories, listing files) can serve as a foundation. File operations (copying, moving, deleting) build on that foundation. Multi-step reasoning tasks that require chaining commands together come later. Understanding command output and error recovery grow more sophisticated across the curriculum.
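
A curriculum over those stages can be implemented as a stable sort on (stage rank, trajectory length). The stage names and record fields here are hypothetical, chosen to mirror the progression described above:

```python
# Hypothetical difficulty tiers, ordered from foundational to complex.
STAGE_ORDER = ["navigation", "file_ops", "multi_step", "error_recovery"]

def curriculum_sort(trajectories):
    """Order training data by stage, then by length within each stage,
    so simpler and shorter interactions are seen first."""
    rank = {stage: i for i, stage in enumerate(STAGE_ORDER)}
    return sorted(trajectories,
                  key=lambda t: (rank[t["stage"]], t["num_turns"]))

data = [
    {"stage": "error_recovery", "num_turns": 9},
    {"stage": "navigation", "num_turns": 2},
    {"stage": "multi_step", "num_turns": 6},
    {"stage": "file_ops", "num_turns": 3},
]
ordered = curriculum_sort(data)
```

In practice a curriculum is usually softer than a strict sort (e.g. stage-weighted sampling that shifts over epochs), but the ordering principle is the same.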

The paper studies how these curriculum principles apply to terminal agent training. Strategic ordering of examples during training improves both convergence speed and final performance compared to random shuffling. This is particularly important because terminal tasks have inherent sequential dependencies. You can't reasonably ask a model to debug a complex pipeline if it hasn't yet learned basic piping syntax.

Understanding scaling behavior

Data engineers face a practical reality: training compute is limited, and every additional trajectory costs compute both to generate and to train on. At some point, marginal improvements from additional data diminish, and that compute would be better spent elsewhere.

The paper includes scaling experiments that reveal how performance improves as training data volume increases. These curves answer a crucial question: have we hit a plateau, or would additional data continue helping?

Impact of training data scale on model performance. Terminal-Bench 2.0 performance increases consistently with training data volume for both Qwen3-8B and Qwen3-14B.

The results show clear improvement patterns for both model sizes. Performance grows consistently with more data, though the growth rate eventually slows. The curves suggest that the models tested haven't yet hit a hard ceiling, but marginal returns are diminishing.
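
One common way to read such curves is to fit score against the logarithm of data volume: a good log-linear fit means each multiplicative increase in data buys a roughly constant absolute gain. The data points below are invented for illustration, not taken from the paper:

```python
import math

# Hypothetical (num_trajectories, benchmark score %) points; not real results.
points = [(1_000, 5.0), (5_000, 9.0), (25_000, 13.0), (125_000, 17.0)]

# Ordinary least squares for score = a + b * ln(n). Under this model,
# every doubling of data adds a constant b * ln(2) points.
xs = [math.log(n) for n, _ in points]
ys = [s for _, s in points]
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
a = my - b * mx

def predicted_score(n_trajectories: int) -> float:
    """Extrapolate the fitted log-linear curve to a new data volume."""
    return a + b * math.log(n_trajectories)
```

Extrapolations like this are useful for budgeting ("how much more data for two more points?") but should be trusted only near the fitted range, since saturation eventually breaks the log-linear trend.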

Understanding the composition of these trajectories helps explain the scaling behavior. The token distribution shows what length trajectories look like, while the turn distribution reveals how many interaction steps typical tasks involve.

Distribution of tokens in generated trajectories. This shows the length characteristics of synthetic terminal tasks.

Distribution of turns in generated trajectories. This reveals how many interaction steps are typical.

These statistics matter because they determine training requirements. If typical trajectories require thousands of tokens, then a dataset of several million trajectories becomes gigabytes of data. Understanding these distributions helps practitioners plan data generation, training infrastructure, and budget allocation.
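
The back-of-envelope arithmetic is simple. Assuming roughly 4 bytes of raw text per token (an approximation that varies by tokenizer), dataset size follows directly from trajectory count and average length; the numbers below are illustrative:

```python
def corpus_size_gb(num_trajectories: int,
                   tokens_per_trajectory: int,
                   bytes_per_token: float = 4.0) -> float:
    """Rough storage estimate for a trajectory corpus.
    bytes_per_token ~= 4 assumes typical English text; adjust per tokenizer."""
    return num_trajectories * tokens_per_trajectory * bytes_per_token / 1e9

# Example: one million trajectories averaging 5,000 tokens each.
size = corpus_size_gb(1_000_000, 5_000)
```

At those illustrative numbers the corpus is about 20 GB of raw text, which is why the token and turn distributions feed directly into storage and training-throughput planning.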

The proof of concept

All of this methodology yields concrete results. An 8B model trained on Terminal-Corpus reaches 13.0% accuracy on Terminal-Bench 2.0, jumping from a baseline of 2.5%. The 14B model reaches 20.2% (from 4.0%), and the 32B model reaches 27.4% (from 3.4%). Scaling the baseline models without better data produces marginal improvements. Scaling the data engineering multiplies performance five- to eightfold.

Most strikingly, the 8B model trained on Terminal-Corpus now matches or exceeds the performance of much larger models trained on standard data. This comparison shifts the entire conversation around terminal agents. You don't need a 70B parameter model to build a capable agent. You need thoughtful data engineering.

Data engineering as a fundamental lever

This work reveals something important about AI capabilities that the industry often overlooks. Sometimes the bottleneck isn't compute, it isn't model architecture, and it isn't algorithmic innovation. It's training data engineering.

For tasks where models need to execute, perceive feedback, and adapt, the quality and structure of training data becomes paramount. A model trained on synthetic trajectories that systematically cover the skill space, filtered for signal, and presented in a curriculum that respects task dependencies outperforms larger models trained haphazardly.

This has practical implications. Unlike model architecture research or compute scaling, data engineering is accessible. It doesn't require the largest clusters or the most specialized hardware. It requires systematic thinking about what signals teach capability, how to generate diverse examples, what examples to exclude, and how to present examples during training.

The open-sourcing of Nemotron-Terminal models and Terminal-Corpus accelerates this direction. Future work can build on this foundation, improving the pipeline further. The bottleneck moves from "how do we build capable terminal agents" to "how do we engineer training data even more effectively."

The broader lesson applies beyond terminal agents. Any task where models must execute actions, perceive outcomes, and adjust strategy benefits from this kind of data engineering thinking. As AI systems move from pure language understanding toward embodied AI, systematic approaches to training data quality become not an optimization, but a fundamental requirement.

This is a Plain English Papers summary of a research paper called On Data Engineering for Scaling LLM Terminal Capabilities. If you like this kind of analysis, join AIModels.fyi or follow us on Twitter.