RAG · LLM · Internal Tools

RAG Knowledge Base Assistant

An internal assistant that answers policy and technical questions with citations

  • 45% ticket deflection: employees get instant answers
  • −60% average response time: from 4 hours to 1.5 hours

The Problem

A 200-person company was drowning in internal support requests. IT and HR teams received 300+ tickets monthly asking about policies, procedures, and technical configurations. Knowledge was scattered across Notion, Google Docs, and tribal knowledge.

Response times averaged 4 hours, and 60% of questions were repetitive. The team needed a self-service solution that actually worked.

Technical Approach

Built a Retrieval-Augmented Generation (RAG) system that acts as an AI-powered knowledge base assistant.

System Architecture

┌─────────────┐
│ Data Sources│
│ (Notion,    │
│  Docs, PDFs)│
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  Ingestion  │
│  Pipeline   │ ← Playwright scraper + PDF parser
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  Chunking + │
│  Embeddings │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  Weaviate   │
│Vector Store │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   Hybrid    │
│   Search    │ ← BM25 + Vector + Reranking
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   GPT-4     │
│  Generator  │ ← With citations
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   Answer    │
│ + Sources   │
└─────────────┘
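
To make the query-time path concrete, here is a minimal hybrid-search sketch against the Weaviate vector store using the v4 Python client's built-in hybrid query. The collection name KnowledgeChunk, the alpha value, and the example question are illustrative assumptions; the retrieval pipeline described below fuses separate BM25 and vector searches with RRF rather than relying on this built-in fusion.

# Hybrid-search sketch (Weaviate v4 Python client). Collection name, alpha,
# and the example question are illustrative, not the production values.
import weaviate
from weaviate.classes.query import MetadataQuery

client = weaviate.connect_to_local()  # or weaviate.connect_to_weaviate_cloud(...)
try:
    chunks = client.collections.get("KnowledgeChunk")
    result = chunks.query.hybrid(
        query="How do I request a new laptop?",
        alpha=0.5,                      # 0 = pure BM25, 1 = pure vector search
        limit=10,                       # candidates handed to the reranker
        return_metadata=MetadataQuery(score=True),
    )
    for obj in result.objects:
        print(obj.metadata.score, obj.properties.get("title"))
finally:
    client.close()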

Key Features

  1. Automated Ingestion: Playwright-based scraper pulls from Notion, Google Docs, and PDFs
  2. Hybrid Search: Combines keyword (BM25) and semantic (vector) search for better recall
  3. Answer Reranking: Cross-encoder model reranks results before generation
  4. Citations: Every answer includes source documents with page numbers
  5. Guardrails: Refuses to answer when confidence is low or question is out of scope
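
To illustrate the citation and guardrail features above, here is a hedged sketch of the generation step: the model is instructed to answer only from the retrieved chunks and to cite them, and the service refuses outright when retrieval confidence is too low. The score threshold, prompt wording, and chunk fields are assumptions rather than the production values.

# Citation-aware generation with a low-confidence guardrail (sketch).
# MIN_RETRIEVAL_SCORE, prompt wording, and chunk fields are assumptions.
from openai import OpenAI

client = OpenAI()
MIN_RETRIEVAL_SCORE = 0.35  # assumed threshold for "confident enough to answer"

def answer_with_citations(question: str, chunks: list[dict]) -> str:
    # Guardrail: refuse when nothing retrieved clears the confidence bar.
    if not chunks or max(c["score"] for c in chunks) < MIN_RETRIEVAL_SCORE:
        return "I couldn't find this in the knowledge base. Please open a ticket."

    # Number the sources so the model can cite them as [1], [2], ...
    context = "\n\n".join(
        f"[{i}] {c['title']} (p. {c.get('page', '?')}): {c['text']}"
        for i, c in enumerate(chunks, start=1)
    )
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {"role": "system", "content": (
                "Answer using ONLY the numbered sources provided. "
                "Cite sources inline like [1]. If the sources do not answer "
                "the question, say you don't know."
            )},
            {"role": "user", "content": f"Sources:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content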

Implementation Details

Chunking Strategy

  • Semantic chunking based on document structure (headings, sections)
  • 500-800 token chunks with 100 token overlap
  • Metadata preservation (title, last updated, author, department)
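
A simplified sketch of this chunking step: documents are split on headings first, long sections are then windowed to the token budget with overlap, and source metadata travels with every chunk. The tokenizer choice and splitting heuristics are illustrative assumptions.

# Heading-aware chunking with a token budget and overlap (simplified sketch).
# Tokenizer choice and the splitting regex are illustrative assumptions.
import re
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
MAX_TOKENS, OVERLAP = 800, 100

def chunk_document(markdown: str, metadata: dict) -> list[dict]:
    # Split on headings so chunks follow the document's own structure.
    sections = re.split(r"(?m)^(?=#{1,3} )", markdown)
    chunks = []
    for section in sections:
        tokens = enc.encode(section)
        start = 0
        while start < len(tokens):
            # Window long sections, keeping a 100-token overlap for context.
            window = tokens[start:start + MAX_TOKENS]
            chunks.append({"text": enc.decode(window), **metadata})
            if start + MAX_TOKENS >= len(tokens):
                break
            start += MAX_TOKENS - OVERLAP
    return chunks

# Metadata preserved on each chunk supports filtering and citations later.
chunks = chunk_document(
    "# Travel policy\n...",
    {"title": "Travel policy", "department": "HR", "last_updated": "2024-01-05"},
)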

Retrieval Pipeline

  1. Query understanding and expansion
  2. Parallel BM25 and vector search
  3. Reciprocal Rank Fusion (RRF) for result merging
  4. Cross-encoder reranking of top 10 candidates
  5. Top 3-5 chunks sent to LLM
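
Step 3 is worth spelling out: Reciprocal Rank Fusion scores each document as the sum of 1/(k + rank) over the BM25 and vector result lists, so documents that rank well in both lists float to the top without any score calibration. A minimal sketch follows (k = 60 is the commonly used constant; the document IDs are made up):

# Reciprocal Rank Fusion: merge ranked lists using ranks only, no raw scores.
# score(doc) = sum over lists of 1 / (k + rank_in_list)
from collections import defaultdict

def rrf_merge(result_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: IDs returned by BM25 and vector search for the same query.
bm25_hits = ["doc-7", "doc-2", "doc-9"]
vector_hits = ["doc-2", "doc-4", "doc-7"]
print(rrf_merge([bm25_hits, vector_hits]))  # doc-2 and doc-7 rank highest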

Quality Assurance

  • Eval set: 50 golden questions with expert answers
  • Metrics: Answer relevance, citation accuracy, hallucination rate
  • A/B testing: New prompt/retrieval strategies tested against baseline
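
Below is a sketch of how the golden-question eval might be wired up. The file layout, the shape of the assistant's response, and the citation-accuracy definition (the expected source appears among the cited sources) are assumptions for illustration.

# Golden-question eval loop (sketch). File format, response shape, and the
# citation-accuracy definition are assumptions, not the production harness.
import json

def run_eval(ask, eval_path: str = "golden_questions.jsonl") -> dict:
    total, correct_citations = 0, 0
    with open(eval_path) as f:
        for line in f:
            case = json.loads(line)          # {"question": ..., "expected_source": ...}
            result = ask(case["question"])   # assumed to return {"answer", "sources"}
            total += 1
            if case["expected_source"] in result["sources"]:
                correct_citations += 1
    return {"citation_accuracy": correct_citations / total, "questions": total}

# A/B testing: run the same eval set against baseline and candidate pipelines,
# e.g. run_eval(ask_baseline) vs. run_eval(ask_new_retrieval).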

Business Impact

After 6 months:

  • 45% ticket deflection - employees find answers instantly
  • 60% faster resolution - remaining tickets answered faster with better context
  • $85K annual savings - reduced support team workload
  • 98% accuracy on eval set - maintains high quality

Lessons Learned

What Worked

  • Hybrid search outperformed vector-only by 23% on our eval set
  • Citations build trust - users validate answers against sources
  • Semantic chunking better than fixed-size chunks for our docs

What Didn’t

  • Notion API rate limits - had to implement caching and incremental updates
  • PDF tables - required custom parsing logic for structured data
  • Generic embeddings - fine-tuned embeddings improved relevance by 15%

Future Improvements

  • Add conversational follow-up questions
  • Implement feedback loop for continuous learning
  • Expand to external customer support
  • Multi-modal support for images and diagrams

This system transformed internal knowledge management from a bottleneck into a competitive advantage.

Technical Architecture

  • Ingestion pipeline (web, PDFs)
  • Hybrid search (BM25 + vector)
  • Answer re-ranking
  • JSON logging

Technology Stack

Python · FastAPI · OpenAI · Weaviate · Playwright