The Day Your LLM Stops Talking and Starts Doing

There’s a moment in every LLM project where you realize the “chat” part is the easy bit.

The hard part is everything that happens between the user request and the final output: deciding which tool to call, keeping track of intermediate state, looping until the task is actually done, and knowing when to stop.

That’s the moment you’re no longer building “an LLM app.”

You’re building an agent.

In software terms, an agent is not a magical model upgrade. It’s a system design pattern:

Agent = LLM + tools + a loop + state

Once you see it this way, “memory” and “planning” stop being buzzwords and become engineering decisions you can reason about, test, and improve.

Let’s break down how it works.

1) What Is an LLM Agent, Actually?

A classic LLM app looks like this:

user_input -> prompt -> model -> answer

An agent adds a control loop:

user_input
  -> (state) -> model -> action
  -> tool/environment -> observation
  -> (state update) -> model -> action
  -> ... repeat ...
  -> final answer

The difference is subtle but massive:

The model is the policy engine; the loop is the runtime.

This means agents are fundamentally about systems: orchestration, state, observability, guardrails, and evaluation.
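
To make that concrete, here is the skeleton in a dozen or so lines. This is a minimal sketch: Action, llm_decide, and the tools dict are illustrative stand-ins, not any particular framework’s API.

from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # "tool" or "final"
    name: str = ""     # tool name when kind == "tool"
    payload: str = ""  # tool input, or the final answer text

def llm_decide(state: list) -> Action:
    # Stand-in for the model call: inspect state, return the next action.
    if not any(s.startswith("observation:") for s in state):
        return Action(kind="tool", name="search", payload=state[0])
    return Action(kind="final", payload="answer based on observations")

def run(user_input: str, tools: dict, max_steps: int = 5) -> str:
    state = [user_input]                      # the agent's working state
    for _ in range(max_steps):                # the loop is the runtime
        action = llm_decide(state)            # the model is the policy engine
        if action.kind == "final":
            return action.payload
        observation = tools[action.name](action.payload)
        state.append(f"observation: {observation}")
    return "stopped: step budget exhausted"

print(run("what is an agent?", {"search": lambda q: f"results for {q}"}))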

2) Memory: The Two Buckets You Can’t Avoid

Human-like “memory” in agents usually becomes two concrete buckets: short-term (working) memory and long-term (persistent) memory.

2.1 Short-Term Memory (Working Memory)

Short-term memory is whatever you stuff into the model’s current context: the conversation so far, the current plan, recent tool observations, and any scratchpad notes the agent has made.

Engineering reality check: short-term memory is limited by your context window and by model behavior.

Two classic failure modes show up in production:

  1. Context trimming: you cut earlier messages to save tokens → the agent “forgets” key constraints.
  2. Recency bias: even with long contexts, models over-weight what’s near the end → old-but-important details get ignored.

If you’ve ever watched an agent re-ask for information it already has, you’ve seen both.
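
One way to blunt both failure modes is a trimming policy that pins hard constraints and fills the remaining budget from the newest messages, instead of cutting blindly from the top. A minimal sketch, where the message format and the character-based token estimate are simplifications:

def trim_context(messages, max_tokens=4000):
    """Keep pinned constraints verbatim, then fill the rest from the newest messages."""
    def tokens(msg):
        return len(msg["content"]) // 4  # rough character-based estimate

    pinned = [m for m in messages if m.get("pinned")]   # constraints, goals
    rest = [m for m in messages if not m.get("pinned")]
    budget = max_tokens - sum(tokens(m) for m in pinned)

    kept = []
    for msg in reversed(rest):                          # newest first
        if tokens(msg) > budget:
            break
        kept.append(msg)
        budget -= tokens(msg)
    return pinned + list(reversed(kept))

history = [
    {"role": "system", "content": "Never exceed the $500 budget.", "pinned": True},
    {"role": "user", "content": "Plan a 3-day trip to Lisbon."},
    {"role": "assistant", "content": "Here are some options..."},
]
print(trim_context(history, max_tokens=60))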

2.2 Long-Term Memory (Persistent Memory)

Long-term memory is stored outside the model: in a vector store, a database, or plain files, and pulled back in only when it’s relevant.

The mainstream pattern is: retrieve → inject → reason.

If that sounds like RAG (Retrieval-Augmented Generation), that’s because it is. Agents just make RAG operational: retrieval isn’t only for answering questions—it’s for deciding what to do next.
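
Here is the retrieve → inject → reason pattern as a sketch, with retrieval and the model call stubbed out (retrieve is a stand-in for your vector store, and the final prompt is what you would send to the model):

def retrieve(query: str, store: list, k: int = 2) -> list:
    # Stand-in for vector search: naive keyword-overlap scoring.
    def overlap(doc):
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(store, key=overlap, reverse=True)[:k]

def build_prompt(task: str, memories: list) -> str:
    context = "\n".join(f"- {m}" for m in memories)
    return f"Known facts:\n{context}\n\nTask: {task}\nDecide the next action."

store = [
    "User prefers concise bullet answers.",
    "User's deployment target is AWS Lambda.",
    "Unrelated note about office plants.",
]
memories = retrieve("deploy the agent to AWS", store)   # retrieve
prompt = build_prompt("deploy the agent", memories)      # inject
print(prompt)                                            # reason: send this to the model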

The part people miss: memory needs structure

A pile of vector chunks is not “memory.” It’s a landfill.

Practical long-term memory works best when you store structured facts with provenance, timestamps, and some notion of confidence or expiry, not just embedded text chunks.

If you don’t design write/read policies, you’ll build an agent that remembers the wrong things forever.
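
What “structure” can look like in practice: typed records with provenance, confidence, and expiry, plus an explicit write policy. A sketch with illustrative field names, not a prescribed schema:

from dataclasses import dataclass
import time

@dataclass
class MemoryRecord:
    kind: str          # "preference" | "fact" | "decision"
    text: str
    source: str        # where it came from (message id, tool name)
    confidence: float  # how much you trust it
    expires_at: float  # explicit expiry instead of "forever"

def should_write(record: MemoryRecord) -> bool:
    """Write policy: don't persist low-confidence or short-lived chatter."""
    return record.confidence >= 0.7 and record.kind in {"preference", "fact", "decision"}

rec = MemoryRecord(
    kind="preference",
    text="User prefers concise outputs with clear bullets.",
    source="msg_0042",
    confidence=0.9,
    expires_at=time.time() + 90 * 24 * 3600,  # revisit in 90 days
)
print(should_write(rec))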

3) Planning: Choosing the Next Action

Planning sounds philosophical, but it maps to one question:

How does the agent choose the next action?

In real tasks, “next action” is rarely obvious. That’s why we plan: to reduce a big problem into smaller moves with checkpoints.

3.1 Task Decomposition: Why It’s Not Optional

When you ask an agent to “plan,” you’re buying smaller, checkable steps, checkpoints where you can verify progress, and the ability to course-correct before the whole task goes sideways.

But planning can be cheap or expensive depending on the technique.
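
The cheap end of that spectrum is a single model call that decomposes the task into steps with an explicit “done when” check for each. A sketch with the model call stubbed out (the output format here is an assumption, not a standard):

def decompose(task: str) -> list:
    # Stand-in for one LLM call that returns steps plus a check for each.
    return [
        {"step": f"Gather requirements for: {task}", "check": "requirements listed"},
        {"step": "Draft an outline", "check": "outline has 3-5 sections"},
        {"step": "Write and verify each section", "check": "every claim has a source"},
    ]

for i, item in enumerate(decompose("a short guide on LLM agents"), start=1):
    print(f"{i}. {item['step']}  [done when: {item['check']}]")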

3.2 CoT: Linear Reasoning as a Control Interface

Chain-of-Thought style prompting nudges the model to produce intermediate reasoning before the final output.

From an engineering perspective, the key benefit is not “the model becomes smarter.” It’s that the model becomes more steerable: intermediate steps can be inspected, constrained, and cut short when they drift.

CoT ≠ show-the-user-everything

In production, you often want the opposite: use structured reasoning internally, then output a crisp answer.

This is both a UX decision (nobody wants a wall of text) and a safety decision (you don’t want to leak internal deliberations, secrets, or tool inputs).
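
One common way to get that separation is to ask the model for a structured object and render only the answer field, keeping the reasoning for your logs. A minimal sketch; the JSON shape is an illustrative convention, not a standard:

import json

# Stand-in for the model's raw output when prompted to reason in JSON.
raw_model_output = json.dumps({
    "reasoning": "The user asked for X; constraint Y applies; therefore Z.",
    "answer": "Use Z. It satisfies constraint Y.",
})

def present(raw: str, trace: list) -> str:
    parsed = json.loads(raw)
    trace.append(parsed.get("reasoning", ""))  # keep for debugging, never render to the user
    return parsed["answer"]

trace = []
print(present(raw_model_output, trace))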

Linear reasoning fails when the first path is wrong: a single chain has no way to explore alternatives or back out of a dead end.

3.3 ToT: Reasoning as Search

Tree-of-Thought style reasoning turns “thinking” into search: propose several candidate reasoning branches, evaluate them, and expand only the promising ones.

If CoT is “one good route,” ToT is “try a few routes, keep the ones that look promising.”

The cost: token burn

Search is expensive. If you expand branches without discipline, cost grows fast.

So ToT tends to shine in high-stakes tasks where correctness beats latency and cost, especially when candidate branches can be checked programmatically.
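
Here is what disciplined expansion can look like: a small beam and a hard depth limit, so the number of model calls stays bounded. A sketch where expand and score stand in for LLM calls:

def expand(thought: str) -> list:
    # Stand-in for an LLM call proposing continuations of a partial solution.
    return [f"{thought} -> option A", f"{thought} -> option B"]

def score(thought: str) -> float:
    # Stand-in for an LLM or programmatic evaluation of a partial solution.
    return -len(thought)  # toy heuristic: prefer shorter chains

def tree_search(root: str, beam_width: int = 2, max_depth: int = 3) -> str:
    frontier = [root]
    calls = 0
    for _ in range(max_depth):
        candidates = []
        for thought in frontier:
            candidates.extend(expand(thought))
            calls += 1
        # Keep only the most promising branches: this is where cost is controlled.
        frontier = sorted(candidates, key=score, reverse=True)[:beam_width]
    best = max(frontier, key=score)
    return f"{best}  (expansion calls: {calls})"

print(tree_search("solve the scheduling puzzle"))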

3.4 GoT: The Engineering Upgrade (Reuse, Merge, Backtrack)

Tree search wastes work when branches overlap.

Graph-of-Thoughts takes a practical step: let branches merge, reuse intermediate results, and backtrack without throwing away work that is already done.

If ToT is a tree, GoT is a graph with memory: you don’t re-derive what you already know.

This matters in production where repeated tool calls and repeated reasoning are the real cost drivers.
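
The “don’t re-derive” part often reduces to memoising sub-results keyed by a normalised subproblem, so overlapping branches share work. A toy sketch (solve_subproblem stands in for a model or tool call):

cache = {}
calls = 0

def solve_subproblem(subproblem: str) -> str:
    """Reuse results across branches instead of re-deriving them."""
    global calls
    key = " ".join(subproblem.lower().split())   # normalise so equal work merges
    if key not in cache:
        calls += 1                               # only pay for genuinely new work
        cache[key] = f"result({key})"
    return cache[key]

# Two branches that overlap on the same subproblem:
branch_a = [solve_subproblem("parse the invoice"), solve_subproblem("extract totals")]
branch_b = [solve_subproblem("Parse the  invoice"), solve_subproblem("check currency")]
print(calls)  # 3, not 4: the shared subproblem was solved once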

3.5 XoT: “Everything of Thoughts” as Research Direction

XoT-style approaches try to unify thought paradigms and inject external knowledge and search methods (think: MCTS-style exploration + domain guidance).

It’s promising, but the engineering bar is high: you need search infrastructure, a reliable way to score partial solutions, and domain knowledge to guide the exploration.

In practice, many teams implement a lightweight ToT/GoT hybrid without the full research stack.

4) ReAct: The Loop That Makes Agents Feel Real

Planning is what the agent intends to do.

ReAct is what the agent actually does:

  1. Reason about what’s missing / what to do next
  2. Act by calling a tool
  3. Observe the result
  4. Reflect and adjust

Repeat until done.

This solves three real problems: grounding (claims are checked against tool output instead of invented), missing information (the agent fetches what it doesn’t know), and error recovery (a bad result is observed and corrected rather than compounded).

If you’ve ever debugged a hallucination, you already know why this matters: a believable explanation isn’t the same thing as a correct answer.
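
Here is one ReAct turn as a sketch: the model emits a thought plus an action, the runtime executes the action, and the observation is appended before the next turn. The Thought/Action/Observation line format is one common convention, not a requirement:

import re

def llm_step(transcript: str) -> str:
    # Stand-in for the model: reason, then name an action.
    if "Observation:" not in transcript:
        return 'Thought: I need current data.\nAction: search("LLM agent memory")'
    return "Thought: I have enough to answer.\nFinal: Agents pair memory with a control loop."

def react(question: str, tools: dict, max_turns: int = 4) -> str:
    transcript = f"Question: {question}"
    for _ in range(max_turns):
        step = llm_step(transcript)
        transcript += "\n" + step
        if "Final:" in step:
            return step.split("Final:", 1)[1].strip()
        match = re.search(r'Action:\s*(\w+)\("(.*)"\)', step)
        if match:
            tool_name, tool_input = match.groups()
            observation = tools[tool_name](tool_input)     # act
            transcript += f"\nObservation: {observation}"  # observe, then reflect next turn
    return "stopped: turn budget exhausted"

print(react("How do agents remember?", {"search": lambda q: f"notes about {q}"}))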

5) A Minimal Agent With Memory + Planning (Practical Version)

Below is a deliberately “boring” agent loop. That’s the point.

Most production agents are not sci-fi. They’re well-instrumented control loops with strict budgets.

from dataclasses import dataclass, field
from typing import Any, Dict, List
import time

# --- Tools (stubs) ---------------------------------------------------------

def web_search(query: str) -> str:
    # Replace with your search API call + caching.
    return f"[search-results for: {query}]"

def calc(expression: str) -> str:
    # Replace with a safe evaluator.
    return str(eval(expression, {"__builtins__": {}}, {}))

# --- Memory ----------------------------------------------------------------

@dataclass
class MemoryItem:
    text: str
    ts: float = field(default_factory=lambda: time.time())
    meta: Dict[str, Any] = field(default_factory=dict)

@dataclass
class MemoryStore:
    short_term: List[MemoryItem] = field(default_factory=list)
    long_term: List[MemoryItem] = field(default_factory=list)  # stand-in for vector DB

    def remember_short(self, text: str, **meta):
        self.short_term.append(MemoryItem(text=text, meta=meta))

    def remember_long(self, text: str, **meta):
        self.long_term.append(MemoryItem(text=text, meta=meta))

    def retrieve_long(self, hint: str, k: int = 3) -> List[MemoryItem]:
        # Dummy retrieval: filter by substring.
        hits = [m for m in self.long_term if hint.lower() in m.text.lower()]
        return sorted(hits, key=lambda m: m.ts, reverse=True)[:k]

# --- Planner (very small ToT-ish idea) ------------------------------------

def propose_plans(task: str) -> List[str]:
    # In reality: this is an LLM call producing multiple plan candidates.
    return [
        f"Search key facts about: {task}",
        f"Break task into steps, then execute step-by-step: {task}",
        f"Ask a clarifying question if constraints are missing: {task}",
    ]

def score_plan(plan: str) -> int:
    # Heuristic scoring: prefer plans that verify facts.
    if "Search" in plan:
        return 3
    if "Break task" in plan:
        return 2
    return 1

# --- Agent Loop ------------------------------------------------------------

def run_agent(task: str, memory: MemoryStore, max_steps: int = 6) -> str:
    # 1) Retrieve long-term memory if relevant.
    recalled = memory.retrieve_long(hint=task)
    for item in recalled:
        memory.remember_short(f"Recalled: {item.text}", source="long_term")

    # 2) Plan (cheap multi-candidate selection).
    plans = propose_plans(task)
    plan = max(plans, key=score_plan)
    memory.remember_short(f"Chosen plan: {plan}")

    # 3) Execute loop.
    for step in range(max_steps):
        # In reality: this is an LLM call that decides "next tool" based on state.
        if "Search" in plan and step == 0:
            obs = web_search(task)
            memory.remember_short(f"Observation: {obs}", tool="web_search")
            continue

        # Example: do a small computation if the task contains a calc hint.
        if "calculate" in task.lower() and step == 1:
            obs = calc("6 * 7")
            memory.remember_short(f"Observation: {obs}", tool="calc")
            continue

        # Stop condition (simplified).
        if step >= 2:
            break

    # 4) Final answer: summarise short-term state.
    notes = "\n".join([f"- {m.text}" for m in memory.short_term[-8:]])
    return f"Task: {task}\n\nWhat I did:\n{notes}\n\nFinal: (produce a user-facing answer here.)"

# Demo usage:
mem = MemoryStore()
mem.remember_long("User prefers concise outputs with clear bullets.", tag="preference")
print(run_agent("Write a short guide on LLM agents with memory and planning", mem))

What this toy example demonstrates (and why it matters)

Memory is consulted before planning, plans are scored before execution, the loop runs under a hard step budget, and every action leaves a note you can inspect afterwards. None of that is clever, and that’s the point: it’s what makes the agent debuggable.

6) Production Notes: Where Agents Actually Fail

If you want this to work outside demos, you’ll spend most of your time on these five areas.

6.1 Tool reliability beats prompt cleverness

Tools fail. Time out. Rate limit. Return weird formats.

Your agent loop needs timeouts, retries with backoff, schema validation on tool outputs, and a sane fallback when a tool is simply down.

A “smart” agent without robust I/O is just a creative writer with API keys.
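
The fix is usually a wrapper around every tool call: retries with backoff, a timeout, and validation of the output before it ever reaches the model. A sketch; reliable_call and stub_search are placeholders, and real timeout enforcement depends on your client:

import time

class ToolError(Exception):
    pass

def reliable_call(tool, payload, retries=3, timeout_s=10, validate=lambda r: True):
    """Wrap a raw tool call with retries, backoff, and output validation."""
    last_error = None
    for attempt in range(retries):
        try:
            result = tool(payload, timeout=timeout_s)
            if not validate(result):
                raise ToolError(f"invalid tool output: {result!r}")
            return result
        except Exception as exc:              # timeouts, rate limits, bad payloads
            last_error = exc
            time.sleep(2 ** attempt)          # exponential backoff: 1s, 2s, 4s
    raise ToolError(f"tool failed after {retries} attempts: {last_error}")

# Usage with a stand-in tool:
def stub_search(query, timeout=10):
    return {"results": [f"hit for {query}"]}

print(reliable_call(stub_search, "agent memory", validate=lambda r: "results" in r))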

6.2 Memory needs permissions and hygiene

If you store user data, you need explicit write policies, retention limits, deletion on request, and access controls on who (and which agent) can read a given memory.

In regulated environments, long-term memory is often the highest-risk component.
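
Read-side hygiene can be as simple as ownership and expiry checks applied at retrieval time, so stale or out-of-scope records never reach the prompt. A sketch with illustrative record fields:

import time

def readable(record: dict, requesting_user: str, now=None) -> bool:
    """Read policy: enforce ownership and expiry before a memory enters the context."""
    now = now or time.time()
    if record["owner"] != requesting_user:   # no cross-user leakage
        return False
    if record["expires_at"] < now:           # stale memories are as bad as wrong ones
        return False
    return True

records = [
    {"owner": "user_1", "text": "Prefers EU data residency.", "expires_at": time.time() + 86400},
    {"owner": "user_2", "text": "Billing account 4421.", "expires_at": time.time() + 86400},
]
visible = [r["text"] for r in records if readable(r, "user_1")]
print(visible)  # only user_1's record survives the read policy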

6.3 Planning needs evaluation signals

Search-based planning is only as good as its scoring.

You’ll likely need verifiable checkpoints inside plans, task-level success metrics, and offline evals that check whether high-scoring plans actually led to good outcomes.

6.4 Observability is not optional

If you can’t trace which prompt produced which action, which tool call returned what, and how state changed at each step,

you can’t debug. You also can’t measure improvements.

Log everything. Then decide what to retain.
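
In practice “log everything” usually means one structured record per loop step, grouped by a trace id, so a failed run can be replayed. A minimal sketch; the field set is illustrative:

import json, time, uuid

def log_step(trace_id: str, step: int, prompt: str, action: str, observation: str) -> str:
    """One structured record per agent step; emit to stdout or your log pipeline."""
    record = {
        "trace_id": trace_id,              # groups all steps of one run
        "step": step,
        "ts": time.time(),
        "prompt_chars": len(prompt),       # sizes are cheap to keep even if bodies are redacted
        "action": action,
        "observation": observation[:200],  # truncate large tool outputs
    }
    line = json.dumps(record)
    print(line)
    return line

trace_id = str(uuid.uuid4())
log_step(trace_id, 0, "prompt text...", 'web_search("agents")', "[search-results ...]")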

6.5 Security: agents amplify blast radius

When a model can take actions, mistakes become incidents.

Guardrails look like allowlisted tools, tightly scoped credentials, human approval for destructive actions, rate limits, and sandboxed execution environments.
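
A sketch of two of those: an allowlist plus a human-approval gate for destructive actions (the tool sets are illustrative):

ALLOWED_TOOLS = {"web_search", "calc", "read_ticket"}    # explicit allowlist
DESTRUCTIVE = {"delete_record", "send_email", "deploy"}  # require a human in the loop

def authorize(tool_name: str, approved_by_human: bool = False) -> bool:
    if tool_name in DESTRUCTIVE:
        return approved_by_human          # the agent can propose, a person disposes
    return tool_name in ALLOWED_TOOLS     # everything else must be on the allowlist

print(authorize("web_search"))                       # True
print(authorize("deploy"))                           # False until approved
print(authorize("deploy", approved_by_human=True))   # True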

7) The Real “Agent Upgrade”: A Better Mental Model

If you remember one thing, make it this:

An agent is an LLM inside a state machine.

Once you build agents this way, you stop chasing “the perfect prompt” and start shipping systems that can survive reality.

And reality is the only benchmark that matters.