Everyone is really into RAG (Retrieval-Augmented Generation) right now: vector databases, embedding models, semantic similarity scores, the whole toolkit. But here's the surprising truth: if you're building AI tools that generate content, you probably don't need RAG at all.
I spent months processing thousands of documents (PDFs, Word files, research papers) for an AI content generation service. The real problem was never finding the right document. The real problem was how to feed a 50-page document to a language model that can only read a limited amount of text at once, without overspending on API calls or losing the parts that matter.
The solutions I found online were either:
- Too complex (full RAG pipelines with vector stores)
- Too naive (just truncate at N tokens)
- Too academic (semantic embeddings for every sentence)
What I needed was something in between. Something practical.
Why Not Just Use RAG?
RAG is designed to answer questions about documents. You embed chunks, store them in a vector database, then retrieve the most relevant chunks based on a query.
But content generation is different. You're not answering a question; you're transforming the entire document into something new. You need:
- The document's structure (headings, sections)
- The key points (not just query-relevant snippets)
- The statistics and data (numbers are gold for presentations)
- The conclusions (often buried at the end)
A semantic similarity search won't find "the third paragraph that contains a compelling statistic" because there's no query to match against.
The Simple Solution: Position-Based Scoring with Overlapping Windows
Here's the approach that worked for me. It's embarrassingly simple, but it reduced token usage by 70-90% while preserving document quality.
Step 1: Split Into Overlapping Chunks
Don't split documents at arbitrary token boundaries. Use overlapping windows:
```python
def split_into_chunks(words, chunk_size=1000, overlap=200):
    """Split a list of words into overlapping chunks tagged with their relative position."""
    total_words = len(words)
    chunks = []
    position = 0
    while position < total_words:
        end = min(position + chunk_size, total_words)
        chunks.append({
            'text': ' '.join(words[position:end]),
            'position': position / total_words,  # 0.0 to 1.0
        })
        if end == total_words:
            break  # last chunk reached; stepping back by the overlap would loop forever
        position = end - overlap  # overlap with the previous chunk
    return chunks
```
The overlap ensures you don't lose context at chunk boundaries. If an important sentence gets split, it appears in both chunks.
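As a quick sanity check, you can verify that each chunk's tail reappears at the head of the next one (a sketch; the file path is just an illustration):

```python
# Illustrative check: every adjacent pair of chunks shares its 200-word overlap
words = open('sample_report.txt', encoding='utf-8').read().split()  # any long document
chunks = split_into_chunks(words)
for first, second in zip(chunks, chunks[1:]):
    assert first['text'].split()[-200:] == second['text'].split()[:200]
print(f"{len(chunks)} chunks built from {len(words)} words")
```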
Step 2: Score Chunks by Position
Here's the insight that changed everything: document position is a powerful signal.
Research papers, reports, and business documents follow predictable patterns:
- First 15%: Introduction, abstract, problem statement
- Last 15%: Conclusions, recommendations, key takeaways
- Early middle (15-35%): Context, background, methodology
```python
def score_by_position(position):
    score = 0.0
    if position < 0.15:
        score += 2.0  # Introduction - usually important
    elif position > 0.85:
        score += 2.0  # Conclusion - usually important
    elif 0.15 <= position <= 0.35:
        score += 1.0  # Early middle - often key context
    return score
```
This simple heuristic beats random sampling every time. You're not doing semantic analysis; you're exploiting document structure that authors follow unconsciously.
Step 3: Score by Content Quality
Position isn't everything. Add content-based signals:
Statistics are gold:
```python
import re

# Chunks with concrete numbers (percentages, decimals, large figures) score higher
if re.search(r'\d+%|\d+\.\d+|\d{1,3}(?:,\d{3})+', chunk_text):
    score += 1.5
```
Numbers like "73% increase" or "$2.5 million" are presentation gold. They're specific, memorable, and lend credibility.
Comparison language indicates analysis:
```python
# Comparison language usually signals an analytical point worth keeping
words_in_chunk = set(chunk_text.lower().split())
if words_in_chunk & {'compared', 'versus', 'than', 'outperforms'}:
    score += 0.5
```
When authors compare things, they're usually making a point worth keeping.
Paragraph density matters:
```python
# Reward substantial paragraphs; fragments and walls of text get no bonus
word_count = len(chunk_text.split())
if 100 <= word_count <= 800:
    score += 1.0
```
Very short chunks are usually headers or fragments. Very long chunks are often boilerplate or verbose sections.
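Putting the position heuristic and these content signals together, the scorer boils down to something like this. A sketch: the weights match the snippets above, but the function name score_chunk and the constant names are my own.

```python
import re

NUMBER_PATTERN = r'\d+%|\d+\.\d+|\d{1,3}(?:,\d{3})+'
COMPARISON_WORDS = {'compared', 'versus', 'than', 'outperforms'}

def score_chunk(chunk):
    """Combine the position heuristic with the content-quality signals above."""
    text = chunk['text']
    score = score_by_position(chunk['position'])

    # Statistics: percentages, decimals, large comma-separated figures
    if re.search(NUMBER_PATTERN, text):
        score += 1.5

    # Comparison language usually marks an analytical point
    if COMPARISON_WORDS & set(text.lower().split()):
        score += 0.5

    # Reward substantial paragraphs; fragments and walls of text get no bonus
    word_count = len(text.split())
    if 100 <= word_count <= 800:
        score += 1.0

    return score
```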
Step 4: Filter Out Noise
Academic documents are full of noise that looks like content:
```python
NOISE_PATTERNS = [
    r'\[\d+\]',         # Citation markers [1], [23]
    r'\bet\s+al\.',     # "et al." citations
    r'doi:\s*\d',       # DOI references
    r'https?://',       # URLs (often in references)
    r'ISBN\s*[\d-]+',   # ISBN numbers
    r'references\s*$',  # References section header
]

noise_count = sum(len(re.findall(pattern, chunk_text, re.IGNORECASE | re.MULTILINE))
                  for pattern in NOISE_PATTERNS)
if noise_count > 3:
    score -= 2.0  # Heavy noise - likely a references section
```
The references section of a research paper can be 10-20% of the document. It's useless for content generation but looks like legitimate text to naive tokenizers.
Step 5: Merge and Deduplicate
Because chunks overlap, you'll extract duplicate content. Handle it with a simple substring and word-overlap check:
```python
import re

def normalize(text):
    # Lowercase, strip punctuation, collapse whitespace
    return re.sub(r'\s+', ' ', re.sub(r'[^\w\s]', '', text.lower())).strip()

def is_similar(text_a, text_b, threshold=0.8):
    a = normalize(text_a)
    b = normalize(text_b)
    # Substring check: one chunk fully contained in the other
    if a in b or b in a:
        return True
    # Word-overlap check: near-duplicates created by the overlapping windows
    words_a = set(a.split())
    words_b = set(b.split())
    if not words_a or not words_b:
        return False
    overlap = len(words_a & words_b)
    min_len = min(len(words_a), len(words_b))
    return overlap / min_len > threshold
```
This isn't perfect, but it catches 90% of duplicates without the overhead of embedding-based similarity.
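To actually merge, I rank the scored chunks and drop anything too similar to a chunk already kept. A minimal sketch: it assumes each chunk dict carries a 'score' key, which the full pipeline below attaches.

```python
def deduplicate(scored_chunks):
    """Keep the best-scoring copy of each piece of content, in document order."""
    # Highest score first, so the better duplicate survives
    ranked = sorted(scored_chunks, key=lambda c: c['score'], reverse=True)
    kept = []
    for chunk in ranked:
        if not any(is_similar(chunk['text'], existing['text']) for existing in kept):
            kept.append(chunk)
    # Restore document order so the compressed text still reads top to bottom
    return sorted(kept, key=lambda c: c['position'])
```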
The Full Pipeline
Here's how it all fits together:
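A minimal end-to-end sketch built from the pieces above; the function name compress_document, the 2,000-word budget, and the noise threshold are illustrative choices, not tuned values.

```python
import re

def compress_document(text, keep_words=2000):
    """Chunk, score, penalize noise, deduplicate, then trim to a word budget."""
    chunks = split_into_chunks(text.split())

    scored = []
    for chunk in chunks:
        score = score_chunk(chunk)  # position + content signals
        noise = sum(len(re.findall(p, chunk['text'], re.IGNORECASE | re.MULTILINE))
                    for p in NOISE_PATTERNS)
        if noise > 3:
            score -= 2.0  # likely a references section or similar boilerplate
        scored.append({**chunk, 'score': score})

    # Drop clearly bad chunks, remove duplicates, then spend the word budget on the best ones
    candidates = deduplicate([c for c in scored if c['score'] > 0])
    kept, used = [], 0
    for chunk in sorted(candidates, key=lambda c: c['score'], reverse=True):
        n = len(chunk['text'].split())
        if used + n <= keep_words:
            kept.append(chunk)
            used += n

    # Reassemble in document order so the result still reads like the original
    kept.sort(key=lambda c: c['position'])
    return '\n\n'.join(c['text'] for c in kept)
```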
Real-World Results
After processing thousands of documents with this approach:
| Document Type | Original | After Processing | Reduction |
|---|---|---|---|
| Research Paper (30 pages) | ~12,000 words | ~1,800 words | 85% |
| Business Report (50 pages) | ~18,000 words | ~2,200 words | 88% |
| Technical Manual (100 pages) | ~40,000 words | ~3,500 words | 91% |
The magic isn't just the compression; it's that the compressed version often produces better AI output because it's focused on the important content.
When This Approach Works (and When It Doesn't)
Works well for:
- Structured documents (reports, papers, presentations)
- Content generation tasks (summaries, slides, blog posts)
- Documents under 100 pages
- English-language documents with standard formatting
Doesn't work well for:
- Q&A systems (use RAG instead)
- Highly technical documents where everything matters
- Documents without clear structure (transcripts, chat logs)
- Multi-language documents with mixed formatting
Why Simpler Is Often Better
I could have built a complex pipeline with:
- Sentence embeddings using transformer models
- Vector similarity scoring
- LLM-based chunk importance classification
- Multiple passes with different extraction strategies
But here's what I learned: complexity has a cost beyond compute.
Complex systems are harder to debug. When your AI generates garbage, was it the embedding model? The similarity threshold? The chunk size? The retrieval query?
With position-based scoring, debugging is simple: print the chunk scores, see what got selected, adjust the weights. The logic is transparent and the failure modes are predictable.
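In practice that debugging step is a handful of print statements over the scored chunks. A sketch: explain_selection is my name for it, and it assumes chunk dicts shaped like the ones in the pipeline sketch above.

```python
def explain_selection(scored_chunks, top_n=10):
    """Print the highest-scoring chunks with their position and a short preview."""
    ranked = sorted(scored_chunks, key=lambda c: c['score'], reverse=True)
    for chunk in ranked[:top_n]:
        preview = chunk['text'][:80].replace('\n', ' ')
        print(f"pos={chunk['position']:.2f}  score={chunk['score']:+.1f}  {preview}...")
```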
Key Takeaways
- Position is a powerful signal. Documents have structure. Exploit it.
- Not all text is equal. Statistics, comparisons, and conclusions carry more weight than filler paragraphs.
- Noise detection is crucial. Citations and references can be 10-20% of academic documents.
- Overlap prevents information loss. A 20% overlap catches content that straddles chunk boundaries.
- Simple deduplication works. Substring matching and word overlap catch most duplicates.
- RAG is for retrieval, not generation. Different problems need different solutions.
The best system isn't the most sophisticated one; it's the one that solves your specific problem with the minimum necessary complexity.
I've been processing documents for AI applications for the past year, learning what works through trial and error. If you're building something similar, I'd love to hear about your approach, especially if you've found something simpler that works even better.