One recurring problem in software teams is onboarding.

You hire a new developer, and suddenly you realize how much knowledge is scattered across code, documentation, and old chat threads.

Even when everything is documented, new developers still ask the same questions.

I wanted to solve this problem for my project OpenSCADA Lite, so I decided to build something interesting: a local AI assistant trained on the entire project.

Not using external APIs.
Not sending code outside the company.
Just a local Retrieval-Augmented Generation (RAG) pipeline.

After some tweaks, it worked even on very modest hardware.


My Main Goals

Instead of telling new developers, "Read these 30 documents and ask me if you have questions," they can simply ask: "How do I create a new module in this system?"

And the AI answers using our own codebase and documentation.

The Data I Used to Train the Assistant

The system indexes three main sources:

1. The Entire Codebase

Modules, classes, and architecture from the project.

2. Documentation

README, notes, and configuration explanations.

3. Development Conversations

All the ChatGPT conversations I had while building the project.

This is actually extremely valuable because it captures the questions, the reasoning, and the decisions behind the design.

So instead of losing that knowledge, the AI can use it.

Architecture

The system is a classic RAG pipeline: chunk the sources, embed each chunk, store the embeddings in a FAISS index, then retrieve the most relevant chunks and hand them to a local LLM together with the question.

Step 1 — Chunking the Information (The Most Important Part)

The biggest mistake people make with RAG systems is bad chunking.

Good chunks = good answers.

I split the project into chunks, one per logical unit.

Example of how ChatGPT conversations were stored:

## Prompt:
My question is: what do we use as rule engine?

## Response:
You're asking which technology or library to use for a rule engine in Python for SCADA systems.

Option A: Custom Lightweight Rule Engine
Why:
- Full control
- Async friendly
- Easy integration with DTOs

How:
Store rules in JSON/YAML and evaluate conditions safely.

This formatting preserves question → reasoning → decision, which is gold for an AI assistant.
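As an illustration, splitting an exported transcript on its `## Prompt:` headers could look like the minimal sketch below. The `split_conversation` helper is my own; it only assumes the storage format shown above.

```python
def split_conversation(text):
    """Split an exported chat transcript into one chunk per exchange.

    Assumes each exchange starts with a '## Prompt:' header, as in the
    storage format shown above, so every chunk keeps its question and
    answer together.
    """
    chunks = []
    for block in text.split("## Prompt:"):
        block = block.strip()
        if block:
            # Re-attach the header so the chunk is self-describing
            chunks.append("## Prompt:\n" + block)
    return chunks


transcript = """## Prompt:
What do we use as a rule engine?

## Response:
A custom lightweight rule engine.

## Prompt:
Why JSON for the rules?

## Response:
Easy to validate and version-control.
"""

chunks = split_conversation(transcript)
print(len(chunks))  # one retrievable chunk per prompt/response pair
```

Keeping each prompt with its response is the point: a chunk that contains only an answer, with no question, retrieves poorly.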

Step 2 — Generating Embeddings

Each chunk is converted into a vector using:

multi-qa-MiniLM-L6-cos-v1

This produces compact 384-dimensional embeddings, quickly, even on CPU.

This step transforms the project knowledge into something the AI can search.

Step 3 — Building the FAISS Index

All embeddings are stored in a FAISS index.

In my case, after several tests, this setup retrieved reliably.

When someone asks a question, the system retrieves the most relevant chunks from this index.
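Conceptually, retrieval is just a nearest-neighbor search over those vectors; FAISS does it efficiently at scale (for example with an inner-product index such as `IndexFlatIP` over normalized embeddings). The toy sketch below shows the same idea in plain Python, with made-up 3-dimensional vectors and hypothetical chunk names standing in for real 384-dimensional embeddings:

```python
def top_k(query_vec, index, k=2):
    """Return the k chunk ids whose vectors have the highest dot product
    with the query vector (equivalent to cosine similarity when all
    vectors are normalized)."""
    scored = []
    for chunk_id, vec in index:
        score = sum(q * v for q, v in zip(query_vec, vec))
        scored.append((score, chunk_id))
    scored.sort(reverse=True)
    return [chunk_id for _, chunk_id in scored[:k]]


# Toy index of (chunk id, embedding) pairs -- real embeddings from
# multi-qa-MiniLM-L6-cos-v1 have 384 dimensions, not 3.
index = [
    ("docs/docker.md", [0.9, 0.1, 0.0]),
    ("src/drivers.py", [0.1, 0.9, 0.1]),
    ("chat/rules.md",  [0.0, 0.2, 0.9]),
]

print(top_k([0.8, 0.2, 0.0], index, k=2))
# The Docker chunk scores highest for this query vector.
```

The retrieved chunk ids map back to the original text, which becomes the context the LLM answers from.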

Step 4 — Choosing an LLM That Actually Runs on My Hardware

Here is where things got interesting.

My setup is not exactly cutting edge:

CPU: i7-2600
RAM: 32 GB
GPU: GTX 1050 Ti (compute capability 6.1)

Modern AI stacks don’t like this GPU anymore.

PyTorch dropped support for this architecture in newer CUDA builds.

So I had two problems:

  1. Find a model good with code
  2. Make it run on old hardware

First Attempt: CodeLlama

I started with Code Llama GGUF models.

They were promising, but not good enough in practice on this hardware.
So I kept experimenting.

The Model That Finally Worked

The one that ended up working best was:

DeepSeek Coder 6.7B Instruct (Q5_K_M quantization)

Model file:

deepseek-coder-6.7b-instruct-q5_k_m.gguf

Loaded with:

llama.cpp

This was the key.

Why this worked: DeepSeek Coder is trained specifically for code, the Q5_K_M quantization fits comfortably in 32 GB of RAM, and llama.cpp runs entirely on the CPU, bypassing the unsupported GPU.
This combination finally made the system stable.
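The final step is stitching the retrieved chunks and the user's question into one prompt for the model. Here is a minimal sketch; the template wording is my own, not the project's, and in the real pipeline the resulting string would be passed to the GGUF model via llama.cpp (for example through the `llama-cpp-python` bindings):

```python
def build_prompt(question, retrieved_chunks):
    """Assemble a grounded prompt: retrieved context first, then the question."""
    context = "\n\n---\n\n".join(retrieved_chunks)
    return (
        "You are an assistant for the OpenSCADA-Lite project.\n"
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_prompt(
    "Can I use Docker?",
    ["## docs/docker.md\nThe project ships a Dockerfile for containerized deployment."],
)
print(prompt)
```

Instructing the model to answer only from the supplied context is what keeps responses grounded in the project instead of the model's general training data.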

Performance Reality

Is it fast?

No.

But it works.

Query time:

5–10 minutes per question

On this machine.

But the answers are accurate and grounded in the project's own code and documentation.

Examples of Questions the Model Can Answer

Basic Question

Question

What is the name of the project?

Answer

OpenSCADA-Lite

Simple but correct.

Installation Question

Question

Can I use Docker?

Answer

Yes, Docker can be used to containerize the project and run it consistently across systems.

(The model then explains how Docker works and how to run it.)

Real Developer Question

This is where it becomes powerful.

Question

How do I create a new OPC UA driver?

Answer

The model explains how the driver fits into the project's structure, and it produces something like this:

from asyncua import Client
from openscada_lite.modules.communication.drivers.server_protocol import ServerProtocol

class OPCUAClientDriver(ServerProtocol):
    def __init__(self, server_url, **kwargs):
        self.server_url = server_url
        self.client = None  # connected lazily in start()

    async def start(self):
        # Open the OPC UA session when the driver starts
        self.client = Client(self.server_url)
        await self.client.connect()

This is knowledge extracted directly from the project structure.

What This Means for Engineering Teams

This approach changes onboarding.

Instead of:

Weeks of KT sessions.

You get:

An AI that knows your architecture.

Developers can ask real questions, and the system answers using your code.

Lessons Learned Building This

Several things surprised me.

Chunking matters more than the model

Bad chunks = bad AI.

Hardware still matters

Modern AI tooling assumes newer GPUs.

Older GPUs require alternative stacks like llama.cpp.

Code-focused models make a huge difference

General LLMs perform worse than models trained for code.

You don’t need a data center to build useful AI

This entire system runs locally.

If you want to try it, I published the full code here: https://github.com/boadadf/rag_scripts