Keeping your sensitive documents safe while leveraging the power of Large Language Models: A complete guide to building a private, offline RAG system.


As the adoption of Artificial Intelligence (AI) increases across industries, the critical need for control, safety, and privacy has never been more apparent. In this post, I explore how to build a complete Retrieval-Augmented Generation (RAG) system that runs entirely offline with Ollama, retaining the full power of Large Language Models (LLMs) while keeping sensitive information exactly where it belongs: on your machine.


Cloud-based AI services come with convenience, but there's a fundamental trade-off: your data has to leave your control. Why does this matter? Consider the reality many professionals face: legal documents requiring confidentiality, medical records protected by privacy laws, proprietary research that cannot be shared, or internal company communications that must stay internal. You put them at risk whenever you upload these documents to online RAG systems. Every API call, every embedding request, and every query sends your sensitive content to servers you can't verify for storage, access rights, or actual deletion.


A RAG system answers questions by retrieving relevant information from your documents and using that context to generate accurate responses. The traditional approach relies on cloud services for three critical operations: converting documents to embeddings, storing those embeddings in vector databases, and generating answers through API calls. Each step exposes your data. This hands-on tutorial demonstrates that none of these compromises is necessary.


1.0 What is RAG?

Retrieval-Augmented Generation combines three essential processes. Retrieval finds relevant information in your documents through semantic search. Augmentation means using that retrieved information to extend the AI's knowledge beyond its training data. Generation creates accurate, cited answers based specifically on your documents rather than the model's general knowledge. This is what makes RAG a breakthrough for accuracy and trustworthiness: unlike standard chatbots that rely solely on training data, a RAG system retrieves relevant information from your specific documents before generating an answer, so responses are grounded in your actual content.


Most implementations depend on online services for document embedding, model inference, and vector storage, which introduces privacy risks, dependency on external APIs, and higher costs. This post eliminates all external dependencies. Using Ollama for both embeddings and language generation, combined with FAISS for vector storage, the system ensures your documents never leave your computer. After the initial setup, which requires an internet connection to download the models, the entire system operates offline. No API keys, no usage costs, no privacy concerns, and no internet required. You maintain complete control while preserving the full capability of modern LLMs.


1.1 How does it work?


Traditional LLMs answer from memory based on training data, while RAG systems retrieve information from documents first, then generate answers based on that retrieved content. The process has two phases: first, documents are loaded, split into chunks, converted into numerical vectors (embeddings), and stored in a database; second, when you ask a question, it gets converted into a vector, the system finds chunks with similar vectors, and the LLM uses those chunks to generate an answer with citations.


Offline Approach

The system uses five components that run entirely on your local machine. The Document Loader extracts text from files while preserving page numbers for citations. The Text Chunker splits documents into overlapping segments to maintain context. The Embedder uses Ollama with nomic-embed-text to convert text into vectors that capture semantic meaning. The Vector Database uses FAISS with cosine similarity to store vectors and find semantically similar chunks in milliseconds. Finally, the LLM uses Llama 3.2 to read the retrieved chunks and generate factual answers.


1.2 Prerequisites


System Requirements:

Python 3.8 or higher

8GB RAM minimum (16GB recommended)

10GB free disk space (for models)

Windows, macOS, or Linux


Ollama Requirements:

Download package size: 1GB

Windows: Windows 10 or later

macOS: macOS 14 Sonoma or later


Installation

Step 1: Install Ollama

Download from https://ollama.com/download

Or for Linux, run:


curl -fsSL https://ollama.com/install.sh | sh


Verify installation:

ollama --version


Step 2: Download Models

I will be using Llama 3.2 for this tutorial because the model is only about 2GB, making it suitable for a typical local machine. Other models can also be used.

Download the LLM (takes approximately 4 minutes, 2GB):

ollama pull llama3.2

Download the embedding model (takes about 2 minutes, 274MB):

ollama pull nomic-embed-text




Step 3: Prepare Documents


Create a `documents` folder and add your PDF, Markdown, or HTML files.


For this tutorial, I will be using "FLoRA: Fused forward-backward adapters for parameter-efficient fine-tuning and reducing inference-time latencies of LLMs", a 10-page paper that addresses parameter-efficient fine-tuning of large language models (LLMs) by proposing a family of fused forward-backward adapters (FFBA). The PDF file can be downloaded from https://arxiv.org/pdf/2511.00050


Step 4: Install Python Packages

Create a requirements.txt file with the following packages:

faiss-cpu 
numpy 
PyPDF2 
beautifulsoup4 
markdown

Why these packages? `faiss-cpu` is a fast vector search library that works offline, `numpy` is the industry standard for array operations, `PyPDF2` handles PDF text extraction, `beautifulsoup4` parses HTML, and `markdown` converts Markdown to text.

Install all packages:


pip install -r requirements.txt


Once the prerequisites have been completed, internet connectivity is no longer required. The system operates entirely offline and locally on your machine. All components run without any external network calls or cloud services.


2.0 Implementation


This section provides the complete step-by-step guide to building the offline RAG system from scratch, from configuring Ollama with the necessary models to a full code walkthrough of each component. Each code section includes explanations of design decisions, parameter choices, and the reasoning behind specific implementations that emerged from testing with the FLoRA research paper. The guide is structured to be followed sequentially; by the end, you will have a fully functional RAG system capable of answering questions from your documents with accurate citations, all running completely offline on your local machine.


2.1 Import Libraries

import os
import json
import subprocess
import re
from pathlib import Path
from typing import List, Dict, Tuple, Optional
from dataclasses import dataclass
#for document processing
import PyPDF2
from bs4 import BeautifulSoup
import markdown
#for vector operations
import numpy as np
import faiss
print("all libraries successfully imported")


2.2 Dataclass setup

The Chunk dataclass is a structured container that holds everything needed to represent a piece of document in the RAG system. Each chunk consists of four essential components: an id for unique identification (like "flora.pdf_0" for the first chunk), the actual text content extracted from the document, a vector that stores the 768-dimensional numerical embedding (initially None until processed by the embedder), and metadata containing source information such as filename, page number, and chunk index for accurate citations.

@dataclass
class Chunk:
    """Text chunk with metadata and embedding."""
    id: str
    text: str
    vector: Optional[np.ndarray]
    metadata: Dict
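
For illustration, a chunk from the FLoRA paper before embedding might look like the following (a hypothetical example instance; the field values follow the conventions used by the loader and chunker below):

# hypothetical illustration, not part of the system code
example_chunk = Chunk(
    id="flora.pdf_0",
    text="FLoRA: Fused forward-backward adapters for parameter-efficient fine-tuning...",
    vector=None,  # becomes a 768-dimensional np.ndarray after embedding
    metadata={'source': 'flora.pdf', 'page': 1, 'type': 'pdf', 'chunk_index': 0}
)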


2.3 Document loading

The DocumentLoader class handles reading multiple file formats and extracting text while preserving source information, including page numbers, to enable precise citations. It uses static methods because document loading does not require instance state, making the class a clean organizational tool for related functions. The class supports three formats: PDF files are processed page by page using PyPDF2 to enable page-level citations; Markdown files are converted to HTML and then to plain text using the markdown and BeautifulSoup libraries to cleanly remove formatting syntax; and HTML files have scripts and styles stripped with BeautifulSoup before text extraction. Each loader method returns a list of dictionaries containing the extracted text and metadata (source filename, page number, and document type), and the main load_documents method automatically detects file types by extension and routes them to the appropriate loader.


class DocumentLoader:
    """load PDF, Markdown, and HTML documents."""
    @staticmethod
    def load_pdf(file_path: str) -> List[Dict]:
        """extract text from PDF,  page by page for citations."""
        chunks = []
        try:
            with open(file_path, 'rb') as file:
                pdf_reader = PyPDF2.PdfReader(file)
                for page_num, page in enumerate(pdf_reader.pages):
                    text = page.extract_text()
                    if text.strip():
                        chunks.append({
                            'text': text,
                            'metadata': {
                                'source': os.path.basename(file_path),
                                'page': page_num + 1,
                                'type': 'pdf'
                            }
                        })
        except Exception as e:
            print(f"error loading PDF {file_path}: {e}")
        return chunks
    
    @staticmethod
    def load_markdown(file_path: str) -> List[Dict]:
        """convert markdown to text via HTML."""
        try:
            with open(file_path, 'r', encoding='utf-8') as file:
                md_content = file.read()
                html = markdown.markdown(md_content)
                soup = BeautifulSoup(html, 'html.parser')
                text = soup.get_text()
                
                return [{
                    'text': text,
                    'metadata': {
                        'source': os.path.basename(file_path),
                        'page': 1,
                        'type': 'markdown'
                    }
                }]
        except Exception as e:
            print(f"error loading markdown {file_path}: {e}")
            return []
    
    @staticmethod
    def load_html(file_path: str) -> List[Dict]:
        """extract text from HTML, removing scripts and styles."""
        try:
            with open(file_path, 'r', encoding='utf-8') as file:
                soup = BeautifulSoup(file.read(), 'html.parser')
                for script in soup(["script", "style"]):
                    script.decompose()
                text = soup.get_text()
                
                return [{
                    'text': text,
                    'metadata': {
                        'source': os.path.basename(file_path),
                        'page': 1,
                        'type': 'html'
                    }
                }]
        except Exception as e:
            print(f"error loading HTML {file_path}: {e}")
            return []
    
    @staticmethod
    def load_documents(directory: str) -> List[Dict]:
        """ load all supported documents from a directory."""
        documents = []
        doc_dir = Path(directory)
        
        if not doc_dir.exists():
            print(f"Creating {directory}...")
            doc_dir.mkdir(parents=True)
            print(f"add documents to {directory} and run again.")
            return documents
        
        for file_path in doc_dir.rglob('*'):
            if file_path.is_file():
                ext = file_path.suffix.lower()
                
                if ext == '.pdf':
                    documents.extend(DocumentLoader.load_pdf(str(file_path)))
                elif ext in ['.md', '.markdown']:
                    documents.extend(DocumentLoader.load_markdown(str(file_path)))
                elif ext in ['.html', '.htm']:
                    documents.extend(DocumentLoader.load_html(str(file_path)))       
        print(f"loaded {len(documents)} document sections")
        return documents
print("document loader ready!")


2.4 Text Chunking 

Chunking is the process of splitting text into smaller segments. It is required because embedding models have token limits and cannot process an entire document at once; splitting carelessly, however, can lose context and fail to capture complete concepts, which is why the chunking strategy matters.


Overlapping prevents information loss at boundaries. For example, consider the text "…safety protocol. First, wear PPE…". Without overlap, "First, wear PPE" loses its context; with overlap, the next chunk also carries "safety protocol" from the previous one.


Sentence boundary detection breaks text at sentence endings rather than mid-sentence, which helps the LLM comprehend each chunk. For this solution, given the length of the test document, the chunk size was set to 750 characters and the overlap to 100. One hundred characters represent approximately 15-20 words, enough to capture sentence endings and beginnings; this prevents information loss at boundaries while limiting redundancy.


Note that chunk size and overlap can be modified depending on the task at hand: smaller chunks work better for fact-finding, while larger chunks suit broader comprehension. The short sketch below illustrates how the chunk windows advance.
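
To make the stride concrete, here is a small standalone sketch (not part of the system) that prints the character windows produced by a 750-character chunk size and a 100-character overlap, ignoring the sentence-boundary adjustment described next:

# standalone illustration: each new chunk starts chunk_size - overlap = 650
# characters after the previous one, so neighbouring chunks share 100 characters
chunk_size, overlap = 750, 100
text_length = 3000  # roughly one page of extracted text, as an example

start = 0
while start < text_length:
    end = min(start + chunk_size, text_length)
    print(f"chunk covers characters {start}-{end}")
    start = end - overlap
    if start >= text_length - overlap:
        break
# chunk covers characters 0-750
# chunk covers characters 650-1400
# chunk covers characters 1300-2050
# chunk covers characters 1950-2700
# chunk covers characters 2600-3000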


The `TextChunker` class uses static methods for stateless text processing. The `clean_text` method normalizes the input by collapsing multiple spaces into single spaces and removing special characters while preserving the punctuation needed for sentence detection. The `chunk_text` method implements the core algorithm: it starts at position 0 and calculates a tentative end position by adding `chunk_size`; it then searches the last 20 percent of the chunk for sentence endings (periods, exclamation points, question marks) using `rfind` to avoid mid-sentence breaks and, if one is found, moves the end position to that boundary. The chunk text is extracted into a Chunk object with a unique ID combining the source filename and index, and the start position then advances to the chunk's end minus the overlap to begin the next chunk.


The algorithm continues until it reaches the end of the text, with each iteration recording chunk metadata, including the index for tracking and the source information for citations. For the FLoRA paper, this produced an average of 6.1 chunks per page, with each chunk keeping its connection to the original page number for precise citations such as "Source 5, Page 1" in query responses.


class TextChunker:
    """text chunking with overlap and sentence boundaries."""   
    @staticmethod
    def clean_text(text: str) -> str:
        """normalize whitespace and remove special characters."""
        text = re.sub(r'\s+', ' ', text)  #multiple spaces to single space
        text = re.sub(r'[^\w\s\.\,\!\?\-\:\;]', '', text)  #keep punctuation
        return text.strip()
    
    @staticmethod
    def chunk_text(
        text: str,
        chunk_size: int = 750,
        overlap: int = 100,
        metadata: Dict = None
    ) -> List[Chunk]:
        """split text into overlapping chunks at sentence boundaries.      
        Args:
            chunk_size: target size (≈150 tokens for embeddings)
            overlap: overlap size to preserve context
            metadata: source info for citations
        """
        text = TextChunker.clean_text(text)
        chunks = []
        
        if not text:
            return chunks
        
        start = 0
        chunk_index = 0
        
        while start < len(text):
            end = start + chunk_size
            
            #break at sentence boundary(last 20% of chunk)
            if end < len(text):
                search_start = end - int(chunk_size * 0.2)
                sentence_end = max(
                    text.rfind('.', search_start, end),
                    text.rfind('!', search_start, end),
                    text.rfind('?', search_start, end)
                )
                
                if sentence_end != -1 and sentence_end > start:
                    end = sentence_end + 1
            
            chunk_text = text[start:end].strip()
            
            if chunk_text:
                chunk_metadata = metadata.copy() if metadata else {}
                chunk_metadata['chunk_index'] = chunk_index
                chunk_id = f"{chunk_metadata.get('source', 'unknown')}_{chunk_index}"
                
                chunks.append(Chunk(
                    id=chunk_id,
                    text=chunk_text,
                    vector=None,
                    metadata=chunk_metadata
                ))
                
                chunk_index += 1            
            start = end - overlap  #move with overlap          
            if start >= len(text) - overlap:
                break       
        return chunks
print("text chunker ready!")



2.5 Embedding

Embeddings convert text into a list of 768 numbers that capture the meaning of the text, allowing the computer to mathematically compare how similar different pieces of text are. For example, "parameter efficient fine-tuning" and "PEFT methods for LLMs" would produce similar number patterns even though the words are different, because they mean similar things. The system uses Ollama with the `nomic-embed-text` model, which runs completely on your computer without needing internet or cloud services, ensuring your documents stay private. The model is lightweight at 274MB and provides good accuracy for general text, taking about 2-3 minutes to process 61 chunks on a regular CPU. The embedder prints progress every 10 chunks so you can see it is working during the initial setup.


Code: 


class OllamaEmbedder:
    """Generate embeddings using Ollama's embedding model."""  
    def __init__(self, model_name: str = "nomic-embed-text"):
        self.model_name = model_name
        self._verify_model()
    
    def _verify_model(self):
        """Check if model is available locally."""
        try:
            result = subprocess.run(
                ['ollama', 'list'],
                capture_output=True,
                text=True,
                check=True
            )
            if self.model_name not in result.stdout:
                raise RuntimeError(
                    f"Model '{self.model_name}' not found locally.\n"
                    f"Please download it first using:\n"
                    f"  ollama pull {self.model_name}\n"
                    f"This is a one-time setup step that requires internet connection."
                )
            print(f"Found embedding model: {self.model_name}")
        except subprocess.CalledProcessError as e:
            raise RuntimeError(
                f"Cannot connect to Ollama service.\n"
                f"Please ensure Ollama is installed and running.\n"
                f"Error: {e}"
            )
        except FileNotFoundError:
            raise RuntimeError(
                "Ollama not found on your system.\n"
                "Please install Ollama from: https://ollama.com/download\n"
                "This is a one-time setup step."
            )
    
    def embed_text(self, text: str) -> np.ndarray:
        """generate embedding vector for text using HTTP API."""
        try:
            import http.client         
            conn = http.client.HTTPConnection("localhost", 11434, timeout=30)
            headers = {'Content-Type': 'application/json'}          
            payload = json.dumps({
                "model": self.model_name,
                "prompt": text
            })
            
            conn.request("POST", "/api/embeddings", payload, headers)
            response = conn.getresponse()
            data = json.loads(response.read().decode())           
            return np.array(data['embedding'], dtype=np.float32)
            
        except Exception as e:
            print(f"Embedding error: {e}")
            return np.zeros(768, dtype=np.float32)  # Fallback
    
    def embed_chunks(self, chunks: List[Chunk]) -> List[Chunk]:
        """generate embeddings for all chunks with progress."""
        print(f"Generating embeddings for {len(chunks)} chunks...")        
        for i, chunk in enumerate(chunks):
            if i % 10 == 0 and i > 0:
                print(f" progress: {i}/{len(chunks)}")
            chunk.vector = self.embed_text(chunk.text)
        
        print("embeddings complete!")
        return chunks


The `OllamaEmbedder` class manages the embedding process through three main methods. When initialized, the `_verify_model` method checks whether `nomic-embed-text` is installed by running the `ollama list` command and raises a clear error with download instructions if it is missing. The `embed_text` method is the core function that converts text to numbers: it sends an HTTP request to localhost port 11434, where Ollama runs as a background service, and receives back a list of 768 numbers (the embedding) that represent the text's meaning, with a fallback that returns zeros if something goes wrong. The `embed_chunks` method processes all document chunks by calling `embed_text` for each one, displaying progress every 10 chunks and storing the result in each chunk's vector field.
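
As a quick sanity check of the semantic-similarity claim above, the short sketch below (assuming Ollama is running and `nomic-embed-text` has been pulled) embeds two related phrases and one unrelated phrase and compares their cosine similarities. The exact scores will vary, but the related pair should score noticeably higher.

# uses the OllamaEmbedder class defined above; numpy is already imported as np
embedder = OllamaEmbedder()

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = embedder.embed_text("parameter efficient fine-tuning")
v2 = embedder.embed_text("PEFT methods for LLMs")
v3 = embedder.embed_text("chocolate cake recipe")

print(f"related pair:   {cosine_similarity(v1, v2):.3f}")  # expected to be higher
print(f"unrelated pair: {cosine_similarity(v1, v3):.3f}")  # expected to be lower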


2.6 Vector Database setup

A vector database stores the numerical embeddings and enables fast searching to find similar chunks when you ask a question. The system uses FAISS (Facebook AI Similarity Search), a specialized library designed to search quickly through millions of vectors. The critical design choice was using cosine similarity instead of Euclidean distance, since cosine similarity compares the direction of vectors rather than their magnitude and is therefore less sensitive to differences in text length. The database stores vectors in a binary index file for speed and chunk metadata in a JSON file for readability, and it can be saved to disk after the initial setup so subsequent runs load in seconds instead of minutes. For the FLoRA paper, the database stored 61 chunk vectors and successfully retrieved the 5 most relevant chunks when queried, with a distance threshold of 0.6 acting as a quality filter to exclude irrelevant content.

class VectorDatabase:
    """FAISS-based vector storage and retrieval with Cosine Similarity."""
    def __init__(self, dimension: int = 768):
        self.dimension = dimension
        #use IndexFlatIP for cosine similarity
        self.index = faiss.IndexFlatIP(dimension) 
        self.chunks: List[Chunk] = []
    
    def add_chunks(self, chunks: List[Chunk]):
        """add chunk embeddings to the index."""
        vectors = np.array([chunk.vector for chunk in chunks], dtype=np.float32)
        
        #normalize vectors for cosine similarity
        faiss.normalize_L2(vectors)
        
        self.index.add(vectors)
        self.chunks.extend(chunks)
        print(f"added {len(chunks)} chunks (total: {len(self.chunks)})")
    
    def search(self, query_vector: np.ndarray, top_k: int = 5) -> List[Tuple[Chunk, float]]:
        """find top-k most similar chunks using cosine similarity.
        
        Returns:
            List of (chunk, distance) tuples
            Distance is (1 - cosine_similarity), so lower = more similar
        """
        query_vector = query_vector.reshape(1, -1).astype(np.float32)
        
        #normalize query vector for cosine similarity
        faiss.normalize_L2(query_vector)
        
        #search (returns similarity scores, not distances)
        similarities, indices = self.index.search(query_vector, top_k)
        
        results = []
        for idx, similarity in zip(indices[0], similarities[0]):
            if idx < len(self.chunks):
                #convert similarity to distance: distance = 1 - similarity
                distance = 1 - similarity
                results.append((self.chunks[idx], float(distance)))
        
        return results
    
    def save(self, directory: str):
        """persist database to disk."""
        os.makedirs(directory, exist_ok=True)
        
        #save FAISS index
        faiss.write_index(self.index, os.path.join(directory, 'faiss.index'))
        
        #save chunks metadata (JSON)
        chunks_data = [{'id': chunk.id,
            'text': chunk.text,
            'metadata': chunk.metadata
        } for chunk in self.chunks]
        
        with open(os.path.join(directory, 'chunks.json'), 'w', encoding='utf-8') as f:
            json.dump(chunks_data, f, indent=2)
        
        print(f"database saved to {directory}")
    
    def load(self, directory: str, embedder) -> bool:
        """load database from disk."""
        index_path = os.path.join(directory, 'faiss.index')
        chunks_path = os.path.join(directory, 'chunks.json')
        
        if not os.path.exists(index_path) or not os.path.exists(chunks_path):
            print(f"no database found in {directory}")
            return False
        
        #load FAISS index
        self.index = faiss.read_index(index_path)
        
        #load chunks
        with open(chunks_path, 'r', encoding='utf-8') as f:
            chunks_data = json.load(f)
        
        #reconstruct chunks (re-embed for consistency)
        print("reconstructing chunk vectors...")
        self.chunks = []
        for data in chunks_data:
            chunk = Chunk(
                id=data['id'],
                text=data['text'],
                vector=embedder.embed_text(data['text']),
                metadata=data['metadata']
            )
            self.chunks.append(chunk)
        
        print(f"database loaded: {len(self.chunks)} chunks")
        return True


The `VectorDatabase` class initializes with the dimension set to 768 to match the embedding size and creates a FAISS IndexFlatIP index, which computes inner products that equal cosine similarity once vectors are normalized. The `add_chunks` method extracts the chunk vectors into a numpy array, normalizes them with `faiss.normalize_L2` (scaling each vector to unit length), adds them to the index, and stores the chunks for later retrieval. The `search` method reshapes and normalizes the query vector in the same way, calls `index.search` to find the top-k most similar vectors, converts the returned similarity scores to distances by subtracting them from 1 so that lower numbers mean better matches, and pairs each matching chunk with its distance. The `save` method writes the FAISS index to a binary file and the chunk text and metadata to a JSON file for persistence, while the `load` method reads these files back and reconstructs the chunks by re-embedding their text to ensure vector consistency.
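
As a minimal usage sketch (assuming the Chunk, OllamaEmbedder, and VectorDatabase classes above and a running Ollama service), the following adds three toy chunks and runs a search; the FLoRA sentence should come back with the smallest distance.

# toy example: build a tiny database and query it
embedder = OllamaEmbedder()
db = VectorDatabase(dimension=768)

texts = [
    "FLoRA proposes fused forward-backward adapters for efficient fine-tuning.",
    "The chunk size was set to 750 characters with a 100-character overlap.",
    "FAISS stores embeddings and retrieves similar chunks in milliseconds.",
]
toy_chunks = [
    Chunk(id=f"demo_{i}", text=t, vector=embedder.embed_text(t),
          metadata={'source': 'demo.txt', 'page': 1, 'chunk_index': i})
    for i, t in enumerate(texts)
]
db.add_chunks(toy_chunks)

query_vector = embedder.embed_text("What does FLoRA propose?")
for chunk, distance in db.search(query_vector, top_k=2):
    print(f"{distance:.3f}  {chunk.text}")  # lower distance = closer match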


2.7 The Large Language Model

The LLM is the component that reads the retrieved chunks and generates natural-language answers to questions. The system uses Llama 3.2, a 2GB model that works well on regular computers without requiring a GPU. The implementation uses command-line execution through subprocess instead of the HTTP API because this proved more reliable on CPU systems, where HTTP requests would time out. The temperature setting of 0.3 controls how creative or focused the answers are; lower values produce more factual, deterministic responses that stay close to the retrieved context, which is ideal for question answering. The 5-minute timeout accommodates slow CPU processing, where the first query takes 30-90 seconds to load the model into memory, but subsequent queries complete in 3-10 seconds. For the FLoRA paper, the model generated 416 characters when asked about the main topic and 283 characters for the specific problem question, demonstrating that it provides substantive, well-explained answers while remaining concise.


Code:

class OllamaLLM:
    """LLM interface using Ollama CLI (more reliable on CPU)."""
    
    def __init__(self, model_name: str = "llama3.2"):
        self.model_name = model_name
        self._verify_model()
    
    def _verify_model(self):
        """Check if model is available locally."""
        try:
            result = subprocess.run(
                ['ollama', 'list'],
                capture_output=True,
                text=True,
                check=True
            )
            if self.model_name not in result.stdout:
                raise RuntimeError(
                    f"Model '{self.model_name}' not found locally.\n"
                    f"Please download it first using:\n"
                    f"  ollama pull {self.model_name}\n"
                    f"This is a one-time setup step that requires internet connection."
                )
            print(f"Found LLM model: {self.model_name}")
        except subprocess.CalledProcessError as e:
            raise RuntimeError(
                f"Cannot connect to Ollama service.\n"
                f"Please ensure Ollama is installed and running.\n"
                f"Error: {e}"
            )
        except FileNotFoundError:
            raise RuntimeError(
                "Ollama not found on your system.\n"
                "Please install Ollama from: https://ollama.com/download\n"
                "This is a one-time setup step."
            )
    
    def generate(self, prompt: str, temperature: float = 0.3) -> str:
        """generate response using Ollama CLI (more reliable on CPU).        
        Args:
            prompt: Complete prompt with context and question
            temperature: Creativity (0.0=deterministic, 1.0=creative)
        """
        try:
            print(f"  Generating with {self.model_name} ...")
            
            # Use subprocess with CLI - more reliable than HTTP on CPU
            result = subprocess.run(
                ['ollama', 'run', self.model_name],
                input=prompt,
                capture_output=True,
                text=True,
                timeout=300,  # 5 minutes timeout
                encoding='utf-8'
            )
            
            if result.returncode != 0:
                error_msg = result.stderr or "Unknown error"
                print(f" Ollama error: {error_msg}")
                return f"Error: {error_msg}"
            
            answer = result.stdout.strip()
            
            if not answer:
                print(f" Empty response")
                return "Error: Empty response from LLM"
            
            print(f" Generated {len(answer)} characters")
            return answer
            
        except subprocess.TimeoutExpired:
            print(f"  Timeout after 5 minutes")
            return "Error: Generation timed out. Try a simpler question or smaller context."
        except Exception as e:
            error_msg = f"Error: {str(e)}"
            print(f"  {error_msg}")
            return error_msg


The OllamaLLM class initializes with the model name (defaulting to llama3.2) and immediately verifies the model is installed. The _verify_model method runs `ollama list` to check if the model exists locally and raises a clear error with download instructions if it is missing. The generate method is the core function: it takes a prompt containing the context and question, then uses subprocess.run to execute `ollama run llama3.2` with the prompt sent through standard input, capturing both output and errors with a 5-minute timeout to handle slow CPU inference. It checks whether the command succeeded (return code 0), strips whitespace from the standard output, returns descriptive error messages on failure, and prints the character count of the generated text for monitoring. For the FLoRA paper queries, when given a prompt like "Answer based on context: [5 chunks about PEFT] Question: What problem does FLoRA address?", the model read through the provided chunks and generated a focused 283-character answer citing Source 5 from page 1, demonstrating that it grounded its response in the retrieved context rather than guessing from general knowledge.
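
A minimal usage sketch (assuming llama3.2 has been pulled and Ollama is running); the prompt here is a toy stand-in for the structured prompt the RAG system builds in the next section.

# quick check that local generation works end to end
llm = OllamaLLM("llama3.2")

demo_prompt = (
    "CONTEXT:\n"
    "[Source 1: flora.pdf, Page 1] FLoRA proposes a family of fused "
    "forward-backward adapters (FFBA) for parameter-efficient fine-tuning.\n\n"
    "QUESTION: What does FLoRA propose?\n\n"
    "INSTRUCTIONS: Answer only from the context above and cite the source.\n\n"
    "ANSWER:"
)

response = llm.generate(demo_prompt, temperature=0.3)
print(response)  # expect a short answer citing Source 1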


2.8 Main RAG System Architecture





The complete RAG system orchestrates all components through three main phases. The Ingest phase loads documents, splits them into chunks, converts the chunks to embeddings, and stores them in the vector database (done once during setup). The Query phase takes a question, converts it to an embedding, searches the vector database for similar chunks, and retrieves the most relevant ones. The Generate phase builds a structured prompt containing the retrieved chunks as context along with your question and strict instructions, sends it to the LLM, and returns the answer with source citations. The system uses a distance threshold of 0.6 to filter out irrelevant chunks: lower values are stricter and may miss some relevant information, while higher values are more lenient but may include noise. The prompt engineering uses a structured format of CONTEXT, then QUESTION, then INSTRUCTIONS, with clear directives like "Answer only from context", forcing the model to cite sources and refuse to guess when the context is insufficient, ensuring answers are grounded in the actual documents.

Code:


class RAGSystem:
    """complete RAG orchestration."""
    def __init__(
        self,
        documents_dir: str = "documents",
        db_dir: str = "vector_db",
        llm_model: str = "llama3.2",
        embedding_model: str = "nomic-embed-text"
    ):
        self.documents_dir = documents_dir
        self.db_dir = db_dir
        
        print("initializing RAG System...")
        self.embedder = OllamaEmbedder(embedding_model)
        self.llm = OllamaLLM(llm_model)
        self.vector_db = VectorDatabase()
        print("RAG System initialized!")
    
    def ingest_documents(
        self,
        chunk_size: int = 750,
        overlap: int = 100,
        force_rebuild: bool = False
    ):
        """Build or load vector database."""    
        #try loading existing database
        if not force_rebuild and os.path.exists(self.db_dir):
            print("loading existing database...")
            if self.vector_db.load(self.db_dir, self.embedder):
                return
        
        print(" Building new database...")
        
        #load documents
        documents = DocumentLoader.load_documents(self.documents_dir)
        if not documents:
            print("no documents found!")
            return
        
        #chunk documents
        all_chunks = []
        for doc in documents:
            chunks = TextChunker.chunk_text(
                doc['text'],
                chunk_size=chunk_size,
                overlap=overlap,
                metadata=doc['metadata']
            )
            all_chunks.extend(chunks)
        
        print(f"created {len(all_chunks)} chunks")       
        #generate embeddings
        all_chunks = self.embedder.embed_chunks(all_chunks)       
        #store in vector DB
        self.vector_db.add_chunks(all_chunks)
        #save for future use
        self.vector_db.save(self.db_dir)
    
    def query(
        self,
        question: str,
        top_k: int = 5,
        distance_threshold: float = 1.5
    ) -> Dict:
        """Answer question using RAG.
       Returns:
            {
                'answer': Generated answer,
                'sources': List of source chunks,
                'confidence': 'high'|'medium'|'low'
            }
        """
        print(f"\n Question: {question}")
        
        #embed query
        query_vector = self.embedder.embed_text(question)
        
        #search vector DB
        results = self.vector_db.search(query_vector, top_k=top_k)
        
        #filter by threshold
        filtered_results = [
            (chunk, dist) for chunk, dist in results
            if dist < distance_threshold
        ]
        
        if not filtered_results:
            return {
                'answer': "Insufficient context to answer this question.",
                'sources': [],
                'confidence': 'low'
            }
        
        #build context from chunks
        context_parts = []
        sources = []
        
        for i, (chunk, distance) in enumerate(filtered_results):
            context_parts.append(
                f"[Source {i+1}: {chunk.metadata['source']}, "
                f"Page {chunk.metadata.get('page', 'N/A')}]\n{chunk.text}\n"
            )
            sources.append({
                'id': chunk.id,
                'source': chunk.metadata['source'],
                'page': chunk.metadata.get('page', 'N/A'),
                'distance': distance
            })
        
        context = "\n".join(context_parts)
        
        #build prompt
        prompt = f"""You are a helpful AI assistant. Answer the question based ONLY on the provided context.

CONTEXT:
{context}

QUESTION: {question}

INSTRUCTIONS:
1. Answer based only on the context above
2. Cite source numbers (e.g., "According to Source 1...")
3. If context is insufficient, state that clearly
4. Be concise but thorough

ANSWER:"""       
        #generate answer
        print("Generating answer...")
        answer = self.llm.generate(prompt, temperature=0.3)
        
        return {
            'answer': answer,
            'sources': sources,
            'confidence': 'high' if len(filtered_results) >= 3 else 'medium'
        }

print("RAG System class ready!")


The RAGSystem class brings everything together by initializing the embedder, LLM, and vector database when created. The ingest_documents method first checks whether a saved database exists and loads it rather than rebuilding from scratch; otherwise it builds a new one by calling DocumentLoader to read files, TextChunker to split them (creating 61 chunks for the FLoRA paper), and the embedder to convert chunks to vectors (taking 2-3 minutes), then adds them to the vector database and saves everything to disk for next time. The query method implements the complete retrieval and generation flow: it converts the question to a vector, searches the vector database for the top-k most similar chunks (5 by default), and filters the results to keep only chunks below the distance threshold (0.6 in the example, keeping chunks with distances like 0.21-0.56 while rejecting higher values). If no chunks pass the threshold, it returns "insufficient context". Otherwise it builds a formatted context string with source labels like "[Source 1: flora.pdf, Page 1]" followed by the chunk text, constructs a detailed prompt with the context, question, and instructions to cite sources, sends the prompt to the LLM with temperature 0.3 for focused answers, and returns a dictionary containing the answer, the source details with distances, and a confidence level (high if 3 or more sources, medium otherwise).


2.9 Initialization and testing

The RAG system requires several parameters to be configured before use.


The documents_dir parameter specifies where your PDF, Markdown, or HTML files are stored (defaults to the "documents" folder). The db_dir parameter sets where the vector database will be saved for fast loading in future sessions (defaults to the "vector_db" folder). The chunk_size determines how many characters each text segment contains; 750 provides balanced context for this document. The overlap should be 10-20% of chunk_size to prevent information loss at boundaries, so 100 characters works well with 750-character chunks. The force_rebuild flag controls whether to rebuild the database from scratch (True) or load the existing saved database if available (False, recommended after the first run for speed).


- `documents_dir`: Where your documents are

- `db_dir`: Where vector database is saved

- `chunk_size`: 500-1000 (750 is balanced)

- `overlap`: 10-20% of chunk_size

- `force_rebuild`: Set True to rebuild from scratch


Code:

# Initialize RAG system
rag = RAGSystem(
    documents_dir="documents",
    db_dir="vector_db",
    llm_model="llama3.2",
    embedding_model="nomic-embed-text"
)

# Build/load database
rag.ingest_documents(
    chunk_size=750,
    overlap=100,
    force_rebuild=True  
)


If you have followed along to this point, your output should be similar to the output shown in the image below.





2.9.1 Testing with an example


Parameters used: `top_k` controls how many chunks are retrieved; a larger value means more context, which slows answer generation. Five chunks provide diverse perspectives without overwhelming the context window. For the FLoRA paper, this retrieved approximately 3750 characters of context (750 per chunk). Testing with 3 chunks sometimes missed nuanced information, while 7 or more introduced redundancy and slowed generation. The system assigning high confidence when retrieving 5 sources validates this choice, and multiple sources strengthen answers through cross-referencing.


`distance_threshold` controls how strict the matching is; a lower value means stricter matching. Because distance is 1 minus cosine similarity, a distance of 0 means identical meaning and a distance of 1 means unrelated content. The 0.6 threshold emerged from observing query results: relevant chunks scored 0.2-0.56, while irrelevant content exceeded 0.7, so the 0.6 cutoff effectively separates signal from noise.


For unanswerable questions, all chunks exceeded this threshold, correctly triggering the insufficient context response rather than forcing an answer from marginal matches.
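
To reproduce this behaviour, you can ask a question the paper cannot answer (a short sketch assuming the `rag` object initialized above); every retrieved chunk should exceed the 0.6 cutoff, so the system declines rather than guessing.

# off-topic question: all retrieved chunks should exceed the distance threshold,
# so no context is passed to the LLM and the system declines to answer
off_topic = rag.query(
    question="What is the capital of France?",
    top_k=5,
    distance_threshold=0.6
)
print(off_topic['answer'])       # expected: "Insufficient context to answer this question."
print(off_topic['confidence'])   # expected: low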


#example: Ask a question
question = "What problem does FLoRA aim to address in the context of parameter-efficient fine-tuning (PEFT) for large language models?"

result = rag.query(
    question=question,
    top_k=5,
    distance_threshold=0.6
)

# Display results
print("\n" + "="*60)
print("ANSWER:")
print("="*60)
print(result['answer'])
print("\n" + "="*60)
print(f"CONFIDENCE: {result['confidence'].upper()}")
print("="*60)
print("\nSOURCES:")
for i, source in enumerate(result['sources'], 1):
    print(f"  {i}. {source['source']} (Page {source['page']}) - Distance: {source['distance']:.4f}")
print("="*60)


Output:


Question: What problem does FLoRA aim to address in the context of parameter-efficient fine-tuning (PEFT) for large language models?

Generating answer...

Generating with llama3.2 ...

Generated 649 characters


============================================================

ANSWER:

============================================================

According to Source 3 [Source 3: flora.pdf, Page 1], FLoRA aims to address the problem of parameter-efficient fine-tuning for large language models (LLMs), with a focus on reducing inference-time latencies. The authors highlight that despite the emergence of various parameter-efficient fine-tuning methods (PEFT) such as LoRA and parallel adapters, there is still a significant degree of unexplored subject matter.


Furthermore, according to Source 4 [Source 4: flora.pdf, Page 1], FLoRA proposes a family of fused forward-backward adapters (FFBA) that combine ideas from popular PEFT methods to improve fine-tuning accuracies and minimize latency.


============================================================

CONFIDENCE: HIGH

============================================================


SOURCES:

  1. flora.pdf (Page 9) - Distance: 0.2233

  2. flora.pdf (Page 9) - Distance: 0.2614

  3. flora.pdf (Page 1) - Distance: 0.2660

  4. flora.pdf (Page 1) - Distance: 0.2805

  5. flora.pdf (Page 10) - Distance: 0.2863

============================================================


In the example and output above, the system processed the question as follows:

- Converted the question to a vector

- Found the 5 most similar chunks (distances 0.22-0.28), drawn from pages 1, 9, and 10

- The LLM read the chunks and answered: "FLoRA proposes a family of fused forward-backward adapters (FFBA) that combine ideas from popular PEFT methods to improve fine-tuning accuracies and minimize latency"


3.0 Conclusion

This post shows how to build a complete RAG system that runs entirely on your local machine, with no internet connectivity required for operation after the initial setup. Using Ollama for embeddings and language generation combined with FAISS for vector storage, I explored how to build a system where documents never have to leave your computer, eliminating the privacy risks of cloud services while retaining the full intelligence of modern large language models. As shown in the FLoRA paper case study, this approach is effective, returning accurate retrieval with distance scores ranging from 0.22 to 0.28, properly cited sources, and substantive answers that directly address technical questions. Medical professionals can ask questions about patient records without HIPAA violations, legal teams can work with confidential case files with complete privilege protection, researchers can use proprietary datasets without risking intellectual property, and users in bandwidth-constrained environments can access sophisticated AI capabilities without stable internet access. This system proves that privacy, performance, and intelligence are not competing priorities but complementary aspects of well-designed AI solutions, providing a blueprint for anyone looking to leverage advanced AI capabilities while maintaining complete data sovereignty and control.


4.0 References

- FLoRA Paper: Gowda, D., Song, S., Lee, J., & Goka, H. (2025). FLoRA: Fused forward-backward adapters for parameter-efficient fine-tuning and reducing inference-time latencies of LLMs. arXiv:2511.00050. Available at: https://arxiv.org/pdf/2511.00050


- Ollama Documentation: https://docs.ollama.com


- FAISS: Facebook AI Similarity Search. https://github.com/facebookresearch/faiss


- Meta AI: Llama 3.2 Model Card. https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/


- Nomic AI: nomic-embed-text Embedding Model. https://www.nomic.ai/blog/nomic-embed-text-v1


The full notebook can be found here: https://github.com/teedonk/Offline-RAG-system