Every AI coding assistant faces an inconvenient truth: it doesn't understand your codebase. It searches.

When you ask Claude Code, Cursor, or Windsurf "how does authentication work in this project?", here's what actually happens behind the scenes:

$ grep -r "authentication" src/
src/auth/login.py:42:def verify_user(username, password):
src/models.py:10:user_email = "[email protected]"
src/config.py:5:# authentication settings
src/utils.py:150:verify_user_input()
... 30+ more results, mostly noise

The agent then reads entire files to understand context. For a 10,000-file codebase, this means burning thousands of tokens of context per query, tokens that could be going toward your actual question.

I built CodeGrok MCP to fix this.

What CodeGrok Actually Does

CodeGrok MCP takes a fundamentally different approach: AST-based semantic indexing that runs entirely on your machine. No cloud. No API calls. Your code never leaves your device.

Instead of searching text, CodeGrok parses code into Abstract Syntax Trees using Tree-sitter. It extracts semantic symbols (functions, classes, methods, variables) across 9 languages and 30+ file extensions.

Each symbol becomes a single chunk with rich metadata. Not arbitrary line splits. Not entire files. Just the code you need.
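
To make "one chunk per symbol" concrete, here is a minimal sketch. It uses Python's built-in ast module as a stand-in for Tree-sitter, so it only handles Python source and skips variables for brevity. The Symbol dataclass and extract_symbols helper are hypothetical names for illustration, not CodeGrok's API.

```python
import ast
from dataclasses import dataclass


@dataclass
class Symbol:
    """One indexed chunk: a single function, method, or class."""
    name: str
    kind: str    # "function", "method", or "class"
    line: int    # 1-based line where the definition starts
    source: str  # exact source text of the definition


def extract_symbols(code: str) -> list[Symbol]:
    """Walk the syntax tree and emit one Symbol per definition."""
    symbols: list[Symbol] = []

    def visit(node: ast.AST, inside_class: bool) -> None:
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.ClassDef):
                text = ast.get_source_segment(code, child) or ""
                symbols.append(Symbol(child.name, "class", child.lineno, text))
                visit(child, inside_class=True)
            elif isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                kind = "method" if inside_class else "function"
                text = ast.get_source_segment(code, child) or ""
                symbols.append(Symbol(child.name, kind, child.lineno, text))
                visit(child, inside_class=False)
            else:
                # Keep descending so definitions inside if/try blocks are found.
                visit(child, inside_class)

    visit(ast.parse(code), inside_class=False)
    return symbols


SAMPLE = '''
# authentication settings
TIMEOUT = 30

def verify_user(username, password):
    return True

class Session:
    def refresh(self):
        pass
'''

symbols = extract_symbols(SAMPLE)
for s in symbols:
    print(f"{s.kind:8} {s.name:12} line {s.line}")
```

Note that the comment mentioning "authentication" never becomes a chunk in this sketch: only definitions do, which is the property that keeps comment and string noise out of the index.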

The Embedding Pipeline

Here's where it gets interesting. CodeGrok uses nomic-ai/CodeRankEmbed, a model specifically trained for code retrieval, to generate 768-dimensional vectors for each symbol:

'coderankembed': {
    'hf_name': 'nomic-ai/CodeRankEmbed',
    'dimensions': 768,
    'max_seq_length': 8192,
    'query_prefix': 'Represent this query for searching relevant code: ',
}
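
The query_prefix entry hints at how retrieval models like this are typically used: the prefix goes on the search query only, while code chunks are embedded as-is. Here is a rough sketch of that step plus a similarity ranking. It assumes cosine similarity, which is a common choice but not something stated above, and the 3-dimensional vectors are made-up stand-ins for real 768-dimensional embeddings. The helper names are hypothetical.

```python
import math

# Copied from the config above; everything else in this sketch is illustrative.
MODEL_CONFIG = {
    "hf_name": "nomic-ai/CodeRankEmbed",
    "dimensions": 768,
    "max_seq_length": 8192,
    "query_prefix": "Represent this query for searching relevant code: ",
}


def build_query_text(question: str, config: dict) -> str:
    """Prepend the model's query prefix to a natural language question."""
    return config["query_prefix"] + question


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


query_text = build_query_text("How does user authentication work?", MODEL_CONFIG)
print(query_text)

# Toy vectors standing in for the embeddings of the query and two chunks.
query_vec = [1.0, 0.0, 1.0]
chunk_vecs = {
    "verify_user": [0.9, 0.1, 0.8],
    "format_date": [0.0, 1.0, 0.1],
}
ranked = sorted(
    chunk_vecs,
    key=lambda name: cosine_similarity(query_vec, chunk_vecs[name]),
    reverse=True,
)
print(ranked)
```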

Each symbol gets formatted with everything an AI agent needs:

# src/auth/login.py:42
function: verify_user

def verify_user(username: str, password: str) -> bool:

Verifies user credentials against the database.

def verify_user(username: str, password: str) -> bool:
    user = db.query(User).filter_by(username=username).first()
    return check_password(password, user.password_hash)

Imports: db, check_password
Calls: db.query, check_password

File location, symbol type, signature, docstring, implementation, and dependencies: all in one indexed chunk.
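
A chunk like the one above could be assembled along these lines. This is a sketch under assumptions: the SymbolChunk fields and the format_chunk helper are hypothetical names, and the exact layout CodeGrok writes may differ.

```python
from dataclasses import dataclass, field


@dataclass
class SymbolChunk:
    """Hypothetical container for the fields shown in the example chunk."""
    path: str
    line: int
    kind: str
    name: str
    signature: str
    docstring: str
    source: str
    imports: list[str] = field(default_factory=list)
    calls: list[str] = field(default_factory=list)


def format_chunk(chunk: SymbolChunk) -> str:
    """Join the fields into one text block, skipping sections that are empty."""
    header = f"# {chunk.path}:{chunk.line}\n{chunk.kind}: {chunk.name}"
    deps = []
    if chunk.imports:
        deps.append("Imports: " + ", ".join(chunk.imports))
    if chunk.calls:
        deps.append("Calls: " + ", ".join(chunk.calls))
    sections = [header, chunk.signature, chunk.docstring, chunk.source, "\n".join(deps)]
    return "\n\n".join(section for section in sections if section)


chunk = SymbolChunk(
    path="src/auth/login.py",
    line=42,
    kind="function",
    name="verify_user",
    signature="def verify_user(username: str, password: str) -> bool:",
    docstring="Verifies user credentials against the database.",
    source=(
        "def verify_user(username: str, password: str) -> bool:\n"
        "    user = db.query(User).filter_by(username=username).first()\n"
        "    return check_password(password, user.password_hash)"
    ),
    imports=["db", "check_password"],
    calls=["db.query", "check_password"],
)

text = format_chunk(chunk)
print(text)
```

Putting the location and symbol type at the top means the text that gets embedded carries its own context, so a hit can be cited by file and line without reading the file again.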

How AI Agents Connect

CodeGrok exposes semantic search through the Model Context Protocol (MCP). If you're using Claude Desktop, Cursor, or any MCP-compatible client, integration is straightforward.

Four tools handle everything:

| Tool | Purpose |
| --- | --- |
| learn | Index a codebase (auto/full/load_only modes) |
| get_sources | Semantic search with language/symbol filters |
| get_stats | Return index statistics |
| list_supported_languages | List supported languages |

The get_sources tool is where the magic happens:

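from typing import Any, Dict, Optional

@mcp.tool(name="get_sources")  # `mcp` is the MCP server instance, defined elsewhere
def get_sources(
    question: str,                      # "How does user authentication work?"
    n_results: int = 10,                # Top-k results
    language: Optional[str] = None,     # Filter: "python", "javascript"
    symbol_type: Optional[str] = None,  # Filter: "function", "class", "method"
) -> Dict[str, Any]:
    ...  # body omitted in this excerpt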

Query "How does authentication work?" and the results look nothing like the grep output above.

No comment matches. No string literals. No config files mentioning the word "authentication." Just the functions that actually handle authentication.
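
To show what the n_results, language, and symbol_type parameters do to a result set, here is a small sketch of filter-then-rank over hits that already carry a similarity score. The hit records, their scores, and the filter_and_rank helper are all hypothetical; only verify_user and its location come from the example earlier in this post.

```python
from typing import Any, Dict, List, Optional

# Hypothetical pre-scored hits. In a real index the score would come from
# comparing the query embedding with each chunk embedding.
HITS: List[Dict[str, Any]] = [
    {"name": "format_date", "file": "src/utils.py", "line": 12,
     "language": "python", "symbol_type": "function", "score": 0.12},
    {"name": "verify_user", "file": "src/auth/login.py", "line": 42,
     "language": "python", "symbol_type": "function", "score": 0.91},
    {"name": "LoginForm", "file": "web/login.js", "line": 8,
     "language": "javascript", "symbol_type": "class", "score": 0.84},
    {"name": "SessionStore", "file": "src/auth/session.py", "line": 20,
     "language": "python", "symbol_type": "class", "score": 0.77},
]


def filter_and_rank(
    hits: List[Dict[str, Any]],
    n_results: int = 10,
    language: Optional[str] = None,
    symbol_type: Optional[str] = None,
) -> List[Dict[str, Any]]:
    """Apply the optional metadata filters, then keep the top-k by score."""
    selected = [
        hit for hit in hits
        if (language is None or hit["language"] == language)
        and (symbol_type is None or hit["symbol_type"] == symbol_type)
    ]
    selected.sort(key=lambda hit: hit["score"], reverse=True)
    return selected[:n_results]


top_any = filter_and_rank(HITS, n_results=3)
top_python_functions = filter_and_rank(
    HITS, n_results=2, language="python", symbol_type="function"
)

for hit in top_any:
    print(f'{hit["score"]:.2f}  {hit["file"]}:{hit["line"]}  {hit["name"]}')
```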

The Numbers That Matter

| Aspect | Grep | CodeGrok MCP |
| --- | --- | --- |
| Matching | Keyword/regex | Semantic similarity |
| False positives | High | Very low |
| Synonyms | ❌ "authenticate" ≠ "verify" | ✅ Understands intent |
| Metadata | None | Line #, signature, type, language |
| Token usage | Read entire files | Returns exact functions |
| Persistence | Scan every time | Pre-indexed, instant search |

For enterprises, this means code stays on-premises. For solo developers, it means no API keys, no subscriptions, and it works offline after the initial model download.

Getting Started

pip install codegrok-mcp
codegrok-mcp  # Starts MCP server on stdio

Configure your MCP client to connect. Then:

  1. learn your codebase
  2. get_sources with natural language queries
  3. Get precise code references instead of grep noise
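
For the client configuration step, many MCP clients (Claude Desktop, for example) read a JSON file containing an mcpServers map. The sketch below builds such an entry in Python and prints the JSON to paste in. The "codegrok" key is an arbitrary label I picked for this example, and where the file lives depends on your client, so check its documentation.

```python
import json

# "codegrok" is an arbitrary label; the command is the one from the install
# step above, which starts the server on stdio.
config = {
    "mcpServers": {
        "codegrok": {
            "command": "codegrok-mcp",
            "args": [],
        }
    }
}

config_text = json.dumps(config, indent=2)
print(config_text)
```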

Embeddings persist in .codegrok/ within your project directory. Subsequent indexing runs are near-instant because only changed files get re-processed.
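
One way to re-process only changed files is to keep a manifest of content hashes and compare against it on each run. This is a sketch of that general technique, not a description of CodeGrok's internals: the manifest shape and the find_changed helper are assumptions, and the sketch ignores deleted files.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of the file's bytes, used as a cheap change fingerprint."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def find_changed(root: Path, manifest: dict[str, str], pattern: str = "*.py") -> list[str]:
    """Return files that are new or modified, updating the manifest in place."""
    changed = []
    for path in sorted(root.rglob(pattern)):
        rel = path.relative_to(root).as_posix()
        digest = file_digest(path)
        if manifest.get(rel) != digest:
            changed.append(rel)
            manifest[rel] = digest
    return changed


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "login.py").write_text("def verify_user(): ...\n")
    (root / "utils.py").write_text("def helper(): ...\n")

    manifest: dict[str, str] = {}
    first_run = find_changed(root, manifest)   # every file is new
    second_run = find_changed(root, manifest)  # nothing changed since

    (root / "utils.py").write_text("def helper(): return 1\n")
    third_run = find_changed(root, manifest)   # only the edited file

print(first_run, second_run, third_run)
```

Only the paths returned by find_changed would need re-parsing and re-embedding, which is why a second run over an untouched project has almost nothing to do.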

GitHub: github.com/dondetir/CodeGrok_mcp


I'm an engineer who builds open-source AI tools through DS APPS Inc. CodeGrok MCP came from the frustration of watching AI agents burn context windows on irrelevant grep results. The source is MIT licensed, and contributions are welcome.