The Day Your LLM Stops Talking and Starts Doing

There’s a moment in every LLM project where you realize the “chat” part is the easy bit.

The hard part is everything that happens between the user request and the final output: deciding which tool to call, keeping track of intermediate state, looping until the task is actually done, and knowing when to stop.

That’s the moment you’re no longer building “an LLM app.”

You’re building an agent.

In software terms, an agent is not a magical model upgrade. It’s a system design pattern:

Agent = LLM + tools + a loop + state

Once you see it this way, “memory” and “planning” stop being buzzwords and become engineering decisions you can reason about, test, and improve.

Let’s break down how it works.

1) What Is an LLM Agent, Actually?

A classic LLM app looks like this:

user_input -> prompt -> model -> answer

An agent adds a control loop:

user_input
  -> (state) -> model -> action
  -> tool/environment -> observation
  -> (state update) -> model -> action
  -> ... repeat ...
  -> final answer

The difference is subtle but massive:

The model is the policy engine; the loop is the runtime.

This means agents are fundamentally about systems: orchestration, state, observability, guardrails, and evaluation.
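
To make that concrete, here is the skeleton in a dozen or so lines. This is a minimal sketch: Action, llm_decide, and the tools dict are illustrative stand-ins, not any particular framework’s API.

from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # "tool" or "final"
    name: str = ""     # tool name when kind == "tool"
    payload: str = ""  # tool input, or the final answer text

def llm_decide(state: list) -> Action:
    # Stand-in for the model call: inspect state, return the next action.
    if not any(s.startswith("observation:") for s in state):
        return Action(kind="tool", name="search", payload=state[0])
    return Action(kind="final", payload="answer based on observations")

def run(user_input: str, tools: dict, max_steps: int = 5) -> str:
    state = [user_input]                      # the agent's working state
    for _ in range(max_steps):                # the loop is the runtime
        action = llm_decide(state)            # the model is the policy engine
        if action.kind == "final":
            return action.payload
        observation = tools[action.name](action.payload)
        state.append(f"observation: {observation}")
    return "stopped: step budget exhausted"

print(run("what is an agent?", {"search": lambda q: f"results for {q}"}))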

2) Memory: The Two Buckets You Can’t Avoid

Human-like “memory” in agents usually becomes two concrete buckets: short-term (working) memory and long-term (persistent) memory.

2.1 Short-Term Memory (Working Memory)

Short-term memory is whatever you stuff into the model’s current context: the conversation so far, the current plan, recent tool observations, and any scratchpad notes the agent has made.

Engineering reality check: short-term memory is limited by your context window and by model behavior.

Two classic failure modes show up in production:

  1. Context trimming: you cut earlier messages to save tokens → the agent “forgets” key constraints.
  2. Recency bias: even with long contexts, models over-weight what’s near the end → old-but-important details get ignored.

If you’ve ever watched an agent re-ask for information it already has, you’ve seen both.
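
One way to blunt both failure modes is a trimming policy that pins hard constraints and fills the remaining budget from the newest messages, instead of cutting blindly from the top. A minimal sketch, where the message format and the character-based token estimate are simplifications:

def trim_context(messages, max_tokens=4000):
    """Keep pinned constraints verbatim, then fill the rest from the newest messages."""
    def tokens(msg):
        return len(msg["content"]) // 4  # rough character-based estimate

    pinned = [m for m in messages if m.get("pinned")]   # constraints, goals
    rest = [m for m in messages if not m.get("pinned")]
    budget = max_tokens - sum(tokens(m) for m in pinned)

    kept = []
    for msg in reversed(rest):                          # newest first
        if tokens(msg) > budget:
            break
        kept.append(msg)
        budget -= tokens(msg)
    return pinned + list(reversed(kept))

history = [
    {"role": "system", "content": "Never exceed the $500 budget.", "pinned": True},
    {"role": "user", "content": "Plan a 3-day trip to Lisbon."},
    {"role": "assistant", "content": "Here are some options..."},
]
print(trim_context(history, max_tokens=60))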

2.2 Long-Term Memory (Persistent Memory)

Long-term memory is stored outside the model: in a vector store, a database, or plain files, and pulled back in only when it’s relevant.

The mainstream pattern is: retrieve → inject → reason.

If that sounds like RAG (Retrieval-Augmented Generation), that’s because it is. Agents just make RAG operational: retrieval isn’t only for answering questions—it’s for deciding what to do next.
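
Here is the retrieve → inject → reason pattern as a sketch, with retrieval and the model call stubbed out (retrieve is a stand-in for your vector store, and the final prompt is what you would send to the model):

def retrieve(query: str, store: list, k: int = 2) -> list:
    # Stand-in for vector search: naive keyword-overlap scoring.
    def overlap(doc):
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(store, key=overlap, reverse=True)[:k]

def build_prompt(task: str, memories: list) -> str:
    context = "\n".join(f"- {m}" for m in memories)
    return f"Known facts:\n{context}\n\nTask: {task}\nDecide the next action."

store = [
    "User prefers concise bullet answers.",
    "User's deployment target is AWS Lambda.",
    "Unrelated note about office plants.",
]
memories = retrieve("deploy the agent to AWS", store)   # retrieve
prompt = build_prompt("deploy the agent", memories)      # inject
print(prompt)                                            # reason: send this to the model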

The part people miss: memory needs structure

A pile of vector chunks is not “memory.” It’s a landfill.

Practical long-term memory works best when you store structured facts with provenance, timestamps, and some notion of confidence or expiry, not just embedded text chunks.

If you don’t design write/read policies, you’ll build an agent that remembers the wrong things forever.
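
What “structure” can look like in practice: typed records with provenance, confidence, and expiry, plus an explicit write policy. A sketch with illustrative field names, not a prescribed schema:

from dataclasses import dataclass
import time

@dataclass
class MemoryRecord:
    kind: str          # "preference" | "fact" | "decision"
    text: str
    source: str        # where it came from (message id, tool name)
    confidence: float  # how much you trust it
    expires_at: float  # explicit expiry instead of "forever"

def should_write(record: MemoryRecord) -> bool:
    """Write policy: don't persist low-confidence or short-lived chatter."""
    return record.confidence >= 0.7 and record.kind in {"preference", "fact", "decision"}

rec = MemoryRecord(
    kind="preference",
    text="User prefers concise outputs with clear bullets.",
    source="msg_0042",
    confidence=0.9,
    expires_at=time.time() + 90 * 24 * 3600,  # revisit in 90 days
)
print(should_write(rec))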

3) Planning: Choosing the Next Action

Planning sounds philosophical, but it maps to one question:

How does the agent choose the next action?

In real tasks, “next action” is rarely obvious. That’s why we plan: to reduce a big problem into smaller moves with checkpoints.

3.1 Task Decomposition: Why It’s Not Optional

When you ask an agent to “plan,” you’re buying smaller, checkable steps, checkpoints where you can verify progress, and the ability to course-correct before the whole task goes sideways.

But planning can be cheap or expensive depending on the technique.
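
The cheap end of that spectrum is a single model call that decomposes the task into steps with an explicit “done when” check for each. A sketch with the model call stubbed out (the output format here is an assumption, not a standard):

def decompose(task: str) -> list:
    # Stand-in for one LLM call that returns steps plus a check for each.
    return [
        {"step": f"Gather requirements for: {task}", "check": "requirements listed"},
        {"step": "Draft an outline", "check": "outline has 3-5 sections"},
        {"step": "Write and verify each section", "check": "every claim has a source"},
    ]

for i, item in enumerate(decompose("a short guide on LLM agents"), start=1):
    print(f"{i}. {item['step']}  [done when: {item['check']}]")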

3.2 CoT: Linear Reasoning as a Control Interface

Chain-of-Thought style prompting nudges the model to produce intermediate reasoning before the final output.

From an engineering perspective, the key benefit is not “the model becomes smarter.” It’s that the model becomes more steerable: intermediate steps can be inspected, constrained, and cut short when they drift.

CoT ≠ show-the-user-everything

In production, you often want the opposite: use structured reasoning internally, then output a crisp answer.

This is both a UX decision (nobody wants a wall of text) and a safety decision (you don’t want to leak internal deliberations, secrets, or tool inputs).
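
One common way to get that separation is to ask the model for a structured object and render only the answer field, keeping the reasoning for your logs. A minimal sketch; the JSON shape is an illustrative convention, not a standard:

import json

# Stand-in for the model's raw output when prompted to reason in JSON.
raw_model_output = json.dumps({
    "reasoning": "The user asked for X; constraint Y applies; therefore Z.",
    "answer": "Use Z. It satisfies constraint Y.",
})

def present(raw: str, trace: list) -> str:
    parsed = json.loads(raw)
    trace.append(parsed.get("reasoning", ""))  # keep for debugging, never render to the user
    return parsed["answer"]

trace = []
print(present(raw_model_output, trace))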

Linear reasoning fails when the first path is wrong: a single chain has no way to explore alternatives or back out of a dead end.

3.3 ToT: Reasoning as Search

Tree-of-Thought style reasoning turns “thinking” into search: propose several candidate reasoning branches, evaluate them, and expand only the promising ones.

If CoT is “one good route,” ToT is “try a few routes, keep the ones that look promising.”

The cost: token burn

Search is expensive. If you expand branches without discipline, cost grows fast.

So ToT tends to shine in high-stakes tasks where correctness beats latency and cost, especially when candidate branches can be checked programmatically.
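
Here is what disciplined expansion can look like: a small beam and a hard depth limit, so the number of model calls stays bounded. A sketch where expand and score stand in for LLM calls:

def expand(thought: str) -> list:
    # Stand-in for an LLM call proposing continuations of a partial solution.
    return [f"{thought} -> option A", f"{thought} -> option B"]

def score(thought: str) -> float:
    # Stand-in for an LLM or programmatic evaluation of a partial solution.
    return -len(thought)  # toy heuristic: prefer shorter chains

def tree_search(root: str, beam_width: int = 2, max_depth: int = 3) -> str:
    frontier = [root]
    calls = 0
    for _ in range(max_depth):
        candidates = []
        for thought in frontier:
            candidates.extend(expand(thought))
            calls += 1
        # Keep only the most promising branches: this is where cost is controlled.
        frontier = sorted(candidates, key=score, reverse=True)[:beam_width]
    best = max(frontier, key=score)
    return f"{best}  (expansion calls: {calls})"

print(tree_search("solve the scheduling puzzle"))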

3.4 GoT: The Engineering Upgrade (Reuse, Merge, Backtrack)

Tree search wastes work when branches overlap.

Graph-of-Thoughts takes a practical step: let branches merge, reuse intermediate results, and backtrack without throwing away work that is already done.

If ToT is a tree, GoT is a graph with memory: you don’t re-derive what you already know.

This matters in production where repeated tool calls and repeated reasoning are the real cost drivers.
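
The “don’t re-derive” part often reduces to memoising sub-results keyed by a normalised subproblem, so overlapping branches share work. A toy sketch (solve_subproblem stands in for a model or tool call):

cache = {}
calls = 0

def solve_subproblem(subproblem: str) -> str:
    """Reuse results across branches instead of re-deriving them."""
    global calls
    key = " ".join(subproblem.lower().split())   # normalise so equal work merges
    if key not in cache:
        calls += 1                               # only pay for genuinely new work
        cache[key] = f"result({key})"
    return cache[key]

# Two branches that overlap on the same subproblem:
branch_a = [solve_subproblem("parse the invoice"), solve_subproblem("extract totals")]
branch_b = [solve_subproblem("Parse the  invoice"), solve_subproblem("check currency")]
print(calls)  # 3, not 4: the shared subproblem was solved once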

3.5 XoT: “Everything of Thoughts” as Research Direction

XoT-style approaches try to unify thought paradigms and inject external knowledge and search methods (think: MCTS-style exploration + domain guidance).

It’s promising, but the engineering bar is high: you need search infrastructure, a reliable way to score partial solutions, and domain knowledge to guide the exploration.

In practice, many teams implement a lightweight ToT/GoT hybrid without the full research stack.

4) ReAct: The Loop That Makes Agents Feel Real

Planning is what the agent intends to do.

ReAct is what the agent actually does:

  1. Reason about what’s missing / what to do next
  2. Act by calling a tool
  3. Observe the result
  4. Reflect and adjust

Repeat until done.

This solves three real problems: grounding (claims are checked against tool output instead of invented), missing information (the agent fetches what it doesn’t know), and error recovery (a bad result is observed and corrected rather than compounded).

If you’ve ever debugged a hallucination, you already know why this matters: a believable explanation isn’t the same thing as a correct answer.
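
Here is one ReAct turn as a sketch: the model emits a thought plus an action, the runtime executes the action, and the observation is appended before the next turn. The Thought/Action/Observation line format is one common convention, not a requirement:

import re

def llm_step(transcript: str) -> str:
    # Stand-in for the model: reason, then name an action.
    if "Observation:" not in transcript:
        return 'Thought: I need current data.\nAction: search("LLM agent memory")'
    return "Thought: I have enough to answer.\nFinal: Agents pair memory with a control loop."

def react(question: str, tools: dict, max_turns: int = 4) -> str:
    transcript = f"Question: {question}"
    for _ in range(max_turns):
        step = llm_step(transcript)
        transcript += "\n" + step
        if "Final:" in step:
            return step.split("Final:", 1)[1].strip()
        match = re.search(r'Action:\s*(\w+)\("(.*)"\)', step)
        if match:
            tool_name, tool_input = match.groups()
            observation = tools[tool_name](tool_input)     # act
            transcript += f"\nObservation: {observation}"  # observe, then reflect next turn
    return "stopped: turn budget exhausted"

print(react("How do agents remember?", {"search": lambda q: f"notes about {q}"}))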

5) A Minimal Agent With Memory + Planning (Practical Version)

Below is a deliberately “boring” agent loop. That’s the point.

Most production agents are not sci-fi. They’re well-instrumented control loops with strict budgets.

from dataclasses import dataclass, field
from typing import Any, Dict, List
import time

# --- Tools (stubs) ---------------------------------------------------------

def web_search(query: str) -> str:
    # Replace with your search API call + caching.
    return f"[search-results for: {query}]"

def calc(expression: str) -> str:
    # Replace with a safe evaluator.
    return str(eval(expression, {"__builtins__": {}}, {}))

# --- Memory ----------------------------------------------------------------

@dataclass
class MemoryItem:
    text: str
    ts: float = field(default_factory=lambda: time.time())
    meta: Dict[str, Any] = field(default_factory=dict)

@dataclass
class MemoryStore:
    short_term: List[MemoryItem] = field(default_factory=list)
    long_term: List[MemoryItem] = field(default_factory=list)  # stand-in for vector DB

    def remember_short(self, text: str, **meta):
        self.short_term.append(MemoryItem(text=text, meta=meta))

    def remember_long(self, text: str, **meta):
        self.long_term.append(MemoryItem(text=text, meta=meta))

    def retrieve_long(self, hint: str, k: int = 3) -> List[MemoryItem]:
        # Dummy retrieval: filter by substring.
        hits = [m for m in self.long_term if hint.lower() in m.text.lower()]
        return sorted(hits, key=lambda m: m.ts, reverse=True)[:k]

# --- Planner (very small ToT-ish idea) ------------------------------------

def propose_plans(task: str) -> List[str]:
    # In reality: this is an LLM call producing multiple plan candidates.
    return [
        f"Search key facts about: {task}",
        f"Break task into steps, then execute step-by-step: {task}",
        f"Ask a clarifying question if constraints are missing: {task}",
    ]

def score_plan(plan: str) -> int:
    # Heuristic scoring: prefer plans that verify facts.
    if "Search" in plan:
        return 3
    if "Break task" in plan:
        return 2
    return 1

# --- Agent Loop ------------------------------------------------------------

def run_agent(task: str, memory: MemoryStore, max_steps: int = 6) -> str:
    # 1) Retrieve long-term memory if relevant.
    recalled = memory.retrieve_long(hint=task)
    for item in recalled:
        memory.remember_short(f"Recalled: {item.text}", source="long_term")

    # 2) Plan (cheap multi-candidate selection).
    plans = propose_plans(task)
    plan = max(plans, key=score_plan)
    memory.remember_short(f"Chosen plan: {plan}")

    # 3) Execute loop.
    for step in range(max_steps):
        # In reality: this is an LLM call that decides "next tool" based on state.
        if "Search" in plan and step == 0:
            obs = web_search(task)
            memory.remember_short(f"Observation: {obs}", tool="web_search")
            continue

        # Example: do a small computation if the task contains a calc hint.
        if "calculate" in task.lower() and step == 1:
            obs = calc("6 * 7")
            memory.remember_short(f"Observation: {obs}", tool="calc")
            continue

        # Stop condition (simplified).
        if step >= 2:
            break

    # 4) Final answer: summarise short-term state.
    notes = "\n".join([f"- {m.text}" for m in memory.short_term[-8:]])
    return f"Task: {task}\n\nWhat I did:\n{notes}\n\nFinal: (produce a user-facing answer here.)"

# Demo usage:
mem = MemoryStore()
mem.remember_long("User prefers concise outputs with clear bullets.", tag="preference")
print(run_agent("Write a short guide on LLM agents with memory and planning", mem))

What this toy example demonstrates (and why it matters)

Memory is consulted before planning, plans are scored before execution, the loop runs under a hard step budget, and every action leaves a note you can inspect afterwards. None of that is clever, and that’s the point: it’s what makes the agent debuggable.

6) Production Notes: Where Agents Actually Fail

If you want this to work outside demos, you’ll spend most of your time on these five areas.

6.1 Tool reliability beats prompt cleverness

Tools fail. Time out. Rate limit. Return weird formats.

Your agent loop needs timeouts, retries with backoff, schema validation on tool outputs, and a sane fallback when a tool is simply down.

A “smart” agent without robust I/O is just a creative writer with API keys.
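
The fix is usually a wrapper around every tool call: retries with backoff, a timeout, and validation of the output before it ever reaches the model. A sketch; reliable_call and stub_search are placeholders, and real timeout enforcement depends on your client:

import time

class ToolError(Exception):
    pass

def reliable_call(tool, payload, retries=3, timeout_s=10, validate=lambda r: True):
    """Wrap a raw tool call with retries, backoff, and output validation."""
    last_error = None
    for attempt in range(retries):
        try:
            result = tool(payload, timeout=timeout_s)
            if not validate(result):
                raise ToolError(f"invalid tool output: {result!r}")
            return result
        except Exception as exc:              # timeouts, rate limits, bad payloads
            last_error = exc
            time.sleep(2 ** attempt)          # exponential backoff: 1s, 2s, 4s
    raise ToolError(f"tool failed after {retries} attempts: {last_error}")

# Usage with a stand-in tool:
def stub_search(query, timeout=10):
    return {"results": [f"hit for {query}"]}

print(reliable_call(stub_search, "agent memory", validate=lambda r: "results" in r))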

6.2 Memory needs permissions and hygiene

If you store user data, you need explicit write policies, retention limits, deletion on request, and access controls on who (and which agent) can read a given memory.

In regulated environments, long-term memory is often the highest-risk component.
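
Read-side hygiene can be as simple as ownership and expiry checks applied at retrieval time, so stale or out-of-scope records never reach the prompt. A sketch with illustrative record fields:

import time

def readable(record: dict, requesting_user: str, now=None) -> bool:
    """Read policy: enforce ownership and expiry before a memory enters the context."""
    now = now or time.time()
    if record["owner"] != requesting_user:   # no cross-user leakage
        return False
    if record["expires_at"] < now:           # stale memories are as bad as wrong ones
        return False
    return True

records = [
    {"owner": "user_1", "text": "Prefers EU data residency.", "expires_at": time.time() + 86400},
    {"owner": "user_2", "text": "Billing account 4421.", "expires_at": time.time() + 86400},
]
visible = [r["text"] for r in records if readable(r, "user_1")]
print(visible)  # only user_1's record survives the read policy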

6.3 Planning needs evaluation signals

Search-based planning is only as good as its scoring.

You’ll likely need verifiable checkpoints inside plans, task-level success metrics, and offline evals that check whether high-scoring plans actually led to good outcomes.

6.4 Observability is not optional

If you can’t trace which prompt produced which action, which tool call returned what, and how state changed at each step,

you can’t debug. You also can’t measure improvements.

Log everything. Then decide what to retain.
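
In practice “log everything” usually means one structured record per loop step, grouped by a trace id, so a failed run can be replayed. A minimal sketch; the field set is illustrative:

import json, time, uuid

def log_step(trace_id: str, step: int, prompt: str, action: str, observation: str) -> str:
    """One structured record per agent step; emit to stdout or your log pipeline."""
    record = {
        "trace_id": trace_id,              # groups all steps of one run
        "step": step,
        "ts": time.time(),
        "prompt_chars": len(prompt),       # sizes are cheap to keep even if bodies are redacted
        "action": action,
        "observation": observation[:200],  # truncate large tool outputs
    }
    line = json.dumps(record)
    print(line)
    return line

trace_id = str(uuid.uuid4())
log_step(trace_id, 0, "prompt text...", 'web_search("agents")', "[search-results ...]")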

6.5 Security: agents amplify blast radius

When a model can take actions, mistakes become incidents.

Guardrails look like allowlisted tools, tightly scoped credentials, human approval for destructive actions, rate limits, and sandboxed execution environments.
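
A sketch of two of those: an allowlist plus a human-approval gate for destructive actions (the tool sets are illustrative):

ALLOWED_TOOLS = {"web_search", "calc", "read_ticket"}    # explicit allowlist
DESTRUCTIVE = {"delete_record", "send_email", "deploy"}  # require a human in the loop

def authorize(tool_name: str, approved_by_human: bool = False) -> bool:
    if tool_name in DESTRUCTIVE:
        return approved_by_human          # the agent can propose, a person disposes
    return tool_name in ALLOWED_TOOLS     # everything else must be on the allowlist

print(authorize("web_search"))                       # True
print(authorize("deploy"))                           # False until approved
print(authorize("deploy", approved_by_human=True))   # True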

7) The Real “Agent Upgrade”: A Better Mental Model

If you remember one thing, make it this:

An agent is an LLM inside a state machine.

Once you build agents this way, you stop chasing “the perfect prompt” and start shipping systems that can survive reality.

And reality is the only benchmark that matters.