Retrieval-Augmented Generation (RAG) enhances LLMs by retrieving relevant document snippets and using them to improve responses. With rlama, you can build a fully local, offline RAG system: no cloud services, no external dependencies beyond Ollama, and complete data privacy. While rlama supports both large and small LLMs, it is especially optimized for smaller models without sacrificing flexibility for larger ones.

Introduction to RAG and rlama

In RAG, a knowledge store is queried for pertinent documents, which are then added to the LLM prompt. This grounds the model's output in factual, up-to-date data. Traditional RAG setups require multiple components (document loaders, text splitters, vector databases, and so on), but rlama streamlines the entire process into a single CLI tool.

It handles:

  - Document ingestion and chunking
  - Embedding generation via your local Ollama models
  - Storage in a hybrid vector store
  - Retrieval and prompt construction for the LLM

This local-first approach ensures privacy, speed, and ease of management.

Step-by-Step Guide to Implementing RAG with rlama

1. Installation

Ensure you have Ollama installed. Then, run:

curl -fsSL https://raw.githubusercontent.com/dontizi/rlama/main/install.sh | sh

Verify the installation:

rlama --version
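
Since rlama relies on Ollama for both embeddings and generation, also confirm that the Ollama daemon is running and your model is available:

ollama list   # should show your locally pulled models, e.g. deepseek-r1:8b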

2. Creating a RAG System

Index your documents by creating a RAG store (hybrid vector store):

rlama rag <model> <rag-name> <folder-path>

For example, using a model like deepseek-r1:8b:

rlama rag deepseek-r1:8b mydocs ./docs

This command:

  1. Scans ./docs and splits each file into chunks.
  2. Generates an embedding for every chunk using the specified model via Ollama.
  3. Stores the chunks and their embeddings in a hybrid vector store named mydocs.

3. Managing Documents

Keep your index updated:
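
For example (the subcommand names here follow the rlama project README; verify them with rlama --help for your version):

rlama add-docs mydocs ./new-docs   # index additional documents into an existing RAG
rlama list-docs mydocs             # list the documents currently indexed
rlama delete mydocs                # remove a RAG store entirely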

4. Configuring Chunking and Retrieval

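Chunking is set at indexing time and retrieval depth at query time. As a sketch (the --chunk-size and --chunk-overlap flags below are assumptions based on the rlama README; confirm the exact names with rlama rag --help):

rlama rag deepseek-r1:8b mydocs ./docs --chunk-size=1000 --chunk-overlap=200   # characters per chunk and overlap (assumed flags)
rlama run mydocs --context-size=20   # number of chunks retrieved per query

Smaller chunks yield more precise matches, while a larger --context-size feeds the model more supporting passages at the cost of speed.
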
5. Running Queries

Launch an interactive session:

rlama run mydocs --context-size=20

In the session, type your question:

> How do I install the project?

rlama then:

  1. Converts your question into an embedding.
  2. Retrieves the top matching chunks from the hybrid store.
  3. Uses the local LLM (via Ollama) to generate an answer using the retrieved context.

You can exit the session by typing exit.

6. Using the rlama API

Start the API server for programmatic access:

rlama api --port 11249

Send HTTP queries:

curl -X POST http://localhost:11249/rag \
  -H "Content-Type: application/json" \
  -d '{
        "rag_name": "mydocs",
        "prompt": "How do I install the project?",
        "context_size": 20
      }'

The API returns a JSON response with the generated answer and diagnostic details.
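
If you only need the generated answer in a shell pipeline, you can extract it with jq (the response field name, answer, is an assumption; inspect the raw JSON your rlama version returns to confirm):

curl -s -X POST http://localhost:11249/rag \
  -H "Content-Type: application/json" \
  -d '{"rag_name": "mydocs", "prompt": "How do I install the project?", "context_size": 20}' \
  | jq -r '.answer'   # "answer" is assumed; adjust to the actual field name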

Recent Enhancements and Tests

EnhancedHybridStore

The RAG store created in step 2 is now backed by an EnhancedHybridStore, the hybrid vector store that serves the top matching chunks at query time.

Document Struct Update

The Document struct now carries richer metadata, improving how retrieved chunks are attributed and presented.

RagSystem Upgrade

The upgraded RagSystem ties the enhanced store and document metadata together when assembling the retrieval context for the LLM.

Router Retrieval Testing

I compared the new version against v0.1.25, using deepseek-r1:8b with the prompt:

"list me all the routers in the code" (kept as simple and general as possible to verify accurate retrieval)

Optimizations and Performance Tuning

Retrieval Speed: lower --context-size values put fewer chunks into the prompt, which shortens generation time (see the example after this list).

Retrieval Accuracy: chunk size and overlap control how precisely matches align with your question; smaller chunks match more tightly, while a larger context size gives the model more supporting passages.

Local Performance: everything runs locally through Ollama, so speed scales with your hardware; rlama is optimized for smaller models, so pick the smallest model that still answers well.
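
For instance, trading breadth for speed is mostly a matter of context size:

rlama run mydocs --context-size=5    # fewer retrieved chunks: faster, narrower answers
rlama run mydocs --context-size=30   # more retrieved chunks: slower, broader context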

Next Steps

From here, try indexing your own document folders, experiment with different Ollama models, and wire the HTTP API into your own tools.

Conclusion

rlama simplifies building local RAG systems with a focus on confidentiality, performance, and ease of use. Whether you’re using a small LLM for quick responses or a larger one for in-depth analysis, rlama offers a powerful, flexible solution. With its enhanced hybrid store, improved document metadata, and upgraded RagSystem, it’s now even better at retrieving and presenting accurate answers from your data. Happy indexing and querying!