The signal nobody's using.

Every time you correct an AI assistant, ask it to try again, or describe what went wrong, you're generating training data that today's systems completely ignore. When a user tells a chatbot "that's not what I meant," or a terminal returns an error, or a GUI element appears in an unexpected place, rich feedback flows back to the system. Yet almost no existing reinforcement learning framework for agents captures this signal.


The status quo treats each agent domain separately. A conversational AI gets one training pipeline. A robot gets another. A code-writing agent gets yet another. But they all face the identical core challenge: how to learn from interactions in real time, without stopping to run expensive offline training jobs. The friction is real. Organizations deploying multiple agent types must build a separate training infrastructure for each one. Personal assistants that serve millions of users generate constant feedback that goes unused. Production agents become stale between training cycles.


OpenClaw-RL begins with a simple observation: this wastefulness is unnecessary. Next-state signals (the replies, errors, and state changes that follow each action) are universal. They exist in every agent domain. What if you could extract learning from all of them using the same machinery?

Two kinds of feedback are hidden in replies.

When a user corrects an AI assistant, the correction contains two layers of information that work differently for learning. The surface layer is evaluative: the response was wrong. But underneath is something richer, directional guidance about what should have happened instead and why.


Imagine a coach reviewing game footage. A poor coach says, "That was a bad play." A good coach says, "You should have cut left three steps earlier because the defender overplayed right." The second kind of feedback is more useful because it points toward specific improvements, not just toward a judgment.


The framework extracts both layers simultaneously from the same source material. Evaluative signals indicate how well an action is performed. These get converted into scalar rewards through a Process Reward Model (PRM) judge that scores interactions after the fact. Directive signals indicate how the action should have been different. These get recovered through Hindsight-Guided On-Policy Distillation, a technique that constructs an enhanced teacher context from textual hints in the next state and provides token-level directional advantage supervision.
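To make the token-level side of this concrete, here is a minimal sketch of how a hint-augmented teacher can produce directional advantages. The `token_advantages` function and the toy two-token vocabulary are illustrative assumptions, not the paper's actual implementation; the idea is only that the teacher, conditioned on the textual hint, shifts probability mass toward the tokens the hint points at, and the log-probability gap becomes a per-token signal.

```python
import math

def softmax(logits):
    """Convert a dict of logits into a probability distribution."""
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

def token_advantages(student_logits, teacher_logits):
    """Per-token directional signal: log p_teacher - log p_student."""
    ps, pt = softmax(student_logits), softmax(teacher_logits)
    return {t: math.log(pt[t]) - math.log(ps[t]) for t in ps}

# Toy vocabulary of two candidate tokens. The teacher, whose context is
# augmented with the hint "use the other approach", prefers "other".
student = {"same": 2.0, "other": 0.0}
teacher = {"same": 0.0, "other": 2.0}   # hint-augmented context
adv = token_advantages(student, teacher)
print(adv["other"] > 0 and adv["same"] < 0)  # True
```

Tokens the teacher prefers get positive advantage and tokens it disfavors get negative advantage, which is the shape of supervision the distillation step needs.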


The elegance here is that scalar rewards and directional signals aren't competing alternatives. They're complementary channels of information flowing from the same interaction. When a user says, "That's wrong, I needed you to use the other approach," the system extracts both the negative reward signal and the textual hint about the right approach. The training pipeline can then use both simultaneously, making the learning richer than either alone.
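The two-channel idea above can be sketched in a few lines. The keyword heuristic below is only a stand-in for the learned PRM judge, and the names (`NextStateSignal`, `extract_signals`) are hypothetical; the point is that one reply yields both a scalar reward and a textual hint.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class NextStateSignal:
    """Both feedback channels recovered from a single next state."""
    reward: float                 # evaluative channel: scalar score
    hint: Optional[str] = None    # directive channel: textual guidance

NEGATIVE_MARKERS = ("that's wrong", "that's not what i meant")

def extract_signals(user_reply: str) -> NextStateSignal:
    """Split one user reply into a scalar reward and a textual hint.

    A real PRM judge is a learned model; this keyword check is only a
    stand-in to show the two-channel structure.
    """
    text = user_reply.lower()
    negative = any(m in text for m in NEGATIVE_MARKERS)
    reward = -1.0 if negative else 1.0
    # Anything after the judgment is kept as directional guidance.
    hint = None
    if negative and "," in user_reply:
        hint = user_reply.split(",", 1)[1].strip()
    return NextStateSignal(reward=reward, hint=hint)

signal = extract_signals("That's wrong, I needed you to use the other approach")
print(signal.reward)  # -1.0
print(signal.hint)    # I needed you to use the other approach
```

The reward feeds the scalar channel while the hint feeds the distillation channel, so neither layer of the correction is discarded.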

How to extract learning from what already exists.

The system architecture splits into three parallel tracks that operate independently, with zero coordination overhead between them. This matters because it means the framework works in production, not just in lab conditions.


The agent itself stays online and responsive to live requests. A personal assistant answers questions without latency penalties. A terminal agent executes commands immediately. A GUI agent clicks buttons in real time. There's no slowdown because nothing waits for training to complete.


Meanwhile, a PRM judge processes interactions asynchronously, extracting scalar rewards. Think of it as a critic reviewing a recorded performance, scoring how well actions worked in context. At the same time, a trainer consumes both the scalar reward signal and textual hints from next states, running On-Policy Distillation to create token-level advantage supervision.
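The three-track shape can be sketched with two queues and three threads. Everything here is a toy stand-in under stated assumptions: the agent, judge, and trainer bodies are stubs, and the real system would run these as separate services rather than threads. What the sketch shows is the decoupling itself: no track waits on another.

```python
import queue
import threading

interactions = queue.Queue()   # agent -> judge
scored = queue.Queue()         # judge -> trainer

def agent_loop(requests):
    """Track 1: stay online; log each interaction without waiting."""
    for req in requests:
        response = f"response to {req}"    # stand-in for the live policy
        interactions.put((req, response))
    interactions.put(None)                 # sentinel: stream ended

def judge_loop():
    """Track 2: score interactions asynchronously (stub PRM judge)."""
    while (item := interactions.get()) is not None:
        req, resp = item
        scored.put((req, resp, 1.0))       # stand-in scalar reward
    scored.put(None)

def trainer_loop(updates):
    """Track 3: consume rewards and hints at its own pace."""
    while (item := scored.get()) is not None:
        updates.append(item)               # stand-in for a gradient step

updates = []
threads = [
    threading.Thread(target=agent_loop, args=(["q1", "q2"],)),
    threading.Thread(target=judge_loop),
    threading.Thread(target=trainer_loop, args=(updates,)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(updates))  # 2
```

Because the tracks only share queues, the agent never blocks on the judge and the judge never blocks on the trainer, which is the property the section describes.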




OpenClaw-RL infrastructure. Interaction streams come from personal agents (conversational, single-user, hosted on personal devices) and general agents (terminal, GUI, SWE, tool-call, hosted centrally). Both feed into asynchronous evaluation and training without requiring synchronization.


This asynchronous pipeline is crucial to practical deployment. Traditional reinforcement learning systems require tight orchestration: collect data, stop the agent, train, deploy new weights, repeat. OpenClaw-RL inverts this. The agent generates interactions continuously. Feedback gets processed in the background. Training happens at its own pace. The system never stops serving users.


For personal agents, feedback emerges naturally from user behavior. A re-query signals dissatisfaction. An explicit correction is direct training data. Accepting and using a response is a positive signal. The agent improves passively, recovering these signals from the interaction stream without any special annotation effort.
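Those behavior-to-signal mappings can be written down as a small table. The event labels and reward magnitudes below are illustrative assumptions; a deployed system would detect the events from the interaction stream itself rather than receive them as strings.

```python
# Hypothetical mapping from observed user behavior to scalar rewards.
BEHAVIOR_REWARDS = {
    "re_query": -0.5,      # user asked again: mild dissatisfaction
    "correction": -1.0,    # explicit correction: strong negative (plus a hint)
    "acceptance": +1.0,    # user accepted and used the response
}

def passive_reward(event: str) -> float:
    """Map an observed behavior to a scalar reward (0.0 if unrecognized)."""
    return BEHAVIOR_REWARDS.get(event, 0.0)

print(passive_reward("correction"))  # -1.0
print(passive_reward("acceptance"))  # 1.0
```

No annotation step appears anywhere in this mapping; the signals are read off behavior the user was going to exhibit anyway.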


For general agents in more structured domains (terminal, GUI, SWE, tool-calling), the framework additionally leverages process rewards that judge intermediate steps, not just final outcomes. This matters when the final state is ambiguous or when you need to guide step-by-step behavior toward better trajectories.
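A process reward over a trajectory might look like the following sketch. The `step_judge` stub stands in for the learned PRM, and the discounting choice is an assumption for illustration; the point is that each intermediate step is scored, not only the final outcome.

```python
def step_judge(step: dict) -> float:
    """Stub PRM: reward steps that succeeded, penalize errors."""
    return 1.0 if step.get("ok") else -1.0

def process_return(trajectory, gamma=0.9):
    """Discounted sum of per-step process rewards."""
    return sum(gamma**t * step_judge(s) for t, s in enumerate(trajectory))

# A toy terminal trajectory with one failed intermediate command.
traj = [
    {"action": "ls", "ok": True},
    {"action": "cat missing.txt", "ok": False},  # stderr: file not found
    {"action": "cat notes.txt", "ok": True},
]
print(round(process_return(traj), 3))  # 1.0 - 0.9 + 0.81 = 0.91
```

Even though the trajectory ends in success, the failed middle step lowers the return, which is exactly the step-level guidance a purely outcome-based reward cannot provide.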

One Framework, Many Agents

The deepest insight is that the same learning infrastructure works across completely different agent types. A personal conversational assistant, a terminal command executor, a GUI automation agent, a software engineer agent, and a tool-calling system aren't separate training problems anymore. They're all agents learning from next-state signals.


This isn't obvious. Terminal interactions look completely different from conversational exchanges. GUI states look nothing like tool outputs. The natural assumption would be that each domain needs its own specialized pipeline. But because the feedback structure is universal, the same framework handles all of them.



OpenClaw-RL supports scalable RL across terminal, GUI, SWE, and tool-call settings. The same infrastructure enables learning in radically different agent domains.


The practical benefit compounds at scale. Organizations deploying multiple agent types use the same codebase, the same evaluation pipeline, and the same training machinery. Adding a new agent domain doesn't mean building new infrastructure; it means pointing the existing pipeline at new interaction streams.


Work on SpeakRL showed that reasoning and action can be tightly integrated in language models. OpenClaw-RL extends this insight by showing that a single learning framework can simultaneously optimize for multiple agent behaviors without compartmentalization.

Agents That Improve Themselves

The most elegant consequence of this design is that improvement happens passively. An agent improves simply by existing and being used. There's no separate annotation phase. No manual labeling of trajectories. No orchestrated training campaigns.


For personal agents, the user experience is frictionless. People interact naturally, providing corrections and explicit feedback without being asked to fill surveys or rate responses. The system learns from these natural interaction patterns continuously.




Optimize your agent simply by using it. This simulation demonstrates how agent performance increases as usage accumulates, because next-state signals generate training data passively.


For general agents deployed at scale, the framework enables continuous learning from production interactions without manual intervention. Each terminal command that fails generates feedback. Each GUI state that surprises the user contains a corrective signal. Each SWE task that requires iteration produces training material.


The mechanism is straightforward: interactions flow in, the PRM judge extracts evaluations, the trainer processes both rewards and directional hints, and the policy converges toward better behavior over time. This isn't magic; it's the natural consequence of treating every interaction as a teaching opportunity.
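The loop just described can be demonstrated end to end on a toy policy. Everything here is an illustrative assumption: a two-arm "policy", a stub judge that stands in for the PRM, and a preference update in place of a gradient step. The sketch only shows that feedback flowing through the loop is enough for the policy to converge.

```python
import random

random.seed(0)
prefs = {"approach_a": 0.0, "approach_b": 0.0}  # toy policy preferences
LR = 0.5

def act():
    """Pick the higher-preference arm, exploring 10% of the time."""
    if random.random() < 0.1:
        return random.choice(list(prefs))
    return max(prefs, key=prefs.get)

def judge(action):
    """Stub PRM: users in this toy world always want approach_b."""
    return 1.0 if action == "approach_b" else -1.0

for _ in range(50):
    a = act()          # interaction flows in
    r = judge(a)       # judge extracts the evaluation
    prefs[a] += LR * r # trainer nudges the policy toward reward

print(max(prefs, key=prefs.get))  # approach_b
```

After a handful of interactions the rewarded arm dominates: no annotation phase, no offline retraining, just the feedback loop running.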


Related work on learning to simulate human dialogue explored how dialogue systems can improve through interaction. OpenClaw-RL generalizes this pattern across all agent domains.


There's no retraining phase. The agent doesn't go offline to learn. It learns in the background while serving live requests. That makes deployment radically simpler than traditional RL workflows. You don't need to schedule training windows or manage versioning between online and offline models.


The framework is open-sourced, so practitioners can adopt it immediately in production systems. This isn't theoretical infrastructure; it's ready to integrate into deployed agents.

The Unified Picture

OpenClaw-RL collapses a source of complexity in AI systems. For years, different agent types required different learning setups. Conversational agents, robotic agents, coding agents, and UI automation systems all solved the same learning problem in isolation, each with its own infrastructure.


The insight that next-state signals are universal and sufficient for learning unlocks a simpler architecture. Every interaction is already a teacher trying to help the agent improve. You don't need to invent new annotation schemes or manufacture synthetic data. You extract both evaluative and directional signals from feedback that already exists.


Work on natural language actor-critic explored using language models as critics in RL. OpenClaw-RL takes this further by showing that language itself, as it appears in natural feedback, contains the training signal needed.


The result is simpler than today's alternatives, faster than offline training workflows, and more practical because it learns from signals inherent in any agent interaction. An organization deploying personal assistants, terminal agents, GUI automators, and coding systems uses one framework instead of four. Each agent improves the more it's used. Training happens asynchronously in the background. Deployment complexity drops.


That's the power of recognizing a universal pattern hiding in plain sight.


This is a Plain English Papers summary of a research paper called OpenClaw-RL: Train Any Agent Simply by Talking. If you like these kinds of analyses, join AIModels.fyi or follow us on Twitter.