RAG · LLM · Internal Tools

RAG Knowledge Base Assistant

An internal assistant that answers policy and technical questions with citations

  • 45% ticket deflection: employees get instant answers
  • −60% average response time: from 4 hours to 1.5 hours

The Problem

A 200-person company was drowning in internal support requests. IT and HR teams received 300+ tickets monthly asking about policies, procedures, and technical configurations. Knowledge was scattered across Notion, Google Docs, and tribal knowledge.

Response times averaged 4 hours, and 60% of questions were repetitive. The team needed a self-service solution that actually worked.

Technical Approach

Built a Retrieval-Augmented Generation (RAG) system that acts as an AI-powered knowledge base assistant.

System Architecture

┌─────────────┐
│ Data Sources│
│ (Notion,    │
│  Docs, PDFs)│
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  Ingestion  │
│  Pipeline   │ ← Playwright scraper + PDF parser
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  Chunking + │
│  Embeddings │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│  Weaviate   │
│Vector Store │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   Hybrid    │
│   Search    │ ← BM25 + Vector + Reranking
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   GPT-4     │
│  Generator  │ ← With citations
└──────┬──────┘
       │
       ▼
┌─────────────┐
│   Answer    │
│ + Sources   │
└─────────────┘
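
To make the query-time path concrete, here is a minimal hybrid-search sketch against the Weaviate vector store using the v4 Python client's built-in hybrid query. The collection name KnowledgeChunk, the alpha value, and the example question are illustrative assumptions; the retrieval pipeline described below fuses separate BM25 and vector searches with RRF rather than relying on this built-in fusion.

# Hybrid-search sketch (Weaviate v4 Python client). Collection name, alpha,
# and the example question are illustrative, not the production values.
import weaviate
from weaviate.classes.query import MetadataQuery

client = weaviate.connect_to_local()  # or weaviate.connect_to_weaviate_cloud(...)
try:
    chunks = client.collections.get("KnowledgeChunk")
    result = chunks.query.hybrid(
        query="How do I request a new laptop?",
        alpha=0.5,                      # 0 = pure BM25, 1 = pure vector search
        limit=10,                       # candidates handed to the reranker
        return_metadata=MetadataQuery(score=True),
    )
    for obj in result.objects:
        print(obj.metadata.score, obj.properties.get("title"))
finally:
    client.close()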

Key Features

  1. Automated Ingestion: Playwright-based scraper pulls from Notion, Google Docs, and PDFs
  2. Hybrid Search: Combines keyword (BM25) and semantic (vector) search for better recall
  3. Answer Reranking: Cross-encoder model reranks results before generation
  4. Citations: Every answer includes source documents with page numbers
  5. Guardrails: Refuses to answer when confidence is low or question is out of scope
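
To illustrate the citation and guardrail features above, here is a hedged sketch of the generation step: the model is instructed to answer only from the retrieved chunks and to cite them, and the service refuses outright when retrieval confidence is too low. The score threshold, prompt wording, and chunk fields are assumptions rather than the production values.

# Citation-aware generation with a low-confidence guardrail (sketch).
# MIN_RETRIEVAL_SCORE, prompt wording, and chunk fields are assumptions.
from openai import OpenAI

client = OpenAI()
MIN_RETRIEVAL_SCORE = 0.35  # assumed threshold for "confident enough to answer"

def answer_with_citations(question: str, chunks: list[dict]) -> str:
    # Guardrail: refuse when nothing retrieved clears the confidence bar.
    if not chunks or max(c["score"] for c in chunks) < MIN_RETRIEVAL_SCORE:
        return "I couldn't find this in the knowledge base. Please open a ticket."

    # Number the sources so the model can cite them as [1], [2], ...
    context = "\n\n".join(
        f"[{i}] {c['title']} (p. {c.get('page', '?')}): {c['text']}"
        for i, c in enumerate(chunks, start=1)
    )
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {"role": "system", "content": (
                "Answer using ONLY the numbered sources provided. "
                "Cite sources inline like [1]. If the sources do not answer "
                "the question, say you don't know."
            )},
            {"role": "user", "content": f"Sources:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content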

Implementation Details

Chunking Strategy

  • Semantic chunking based on document structure (headings, sections)
  • 500-800 token chunks with 100 token overlap
  • Metadata preservation (title, last updated, author, department)
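
A simplified sketch of this chunking step: documents are split on headings first, long sections are then windowed to the token budget with overlap, and source metadata travels with every chunk. The tokenizer choice and splitting heuristics are illustrative assumptions.

# Heading-aware chunking with a token budget and overlap (simplified sketch).
# Tokenizer choice and the splitting regex are illustrative assumptions.
import re
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
MAX_TOKENS, OVERLAP = 800, 100

def chunk_document(markdown: str, metadata: dict) -> list[dict]:
    # Split on headings so chunks follow the document's own structure.
    sections = re.split(r"(?m)^(?=#{1,3} )", markdown)
    chunks = []
    for section in sections:
        tokens = enc.encode(section)
        start = 0
        while start < len(tokens):
            # Window long sections, keeping a 100-token overlap for context.
            window = tokens[start:start + MAX_TOKENS]
            chunks.append({"text": enc.decode(window), **metadata})
            if start + MAX_TOKENS >= len(tokens):
                break
            start += MAX_TOKENS - OVERLAP
    return chunks

# Metadata preserved on each chunk supports filtering and citations later.
chunks = chunk_document(
    "# Travel policy\n...",
    {"title": "Travel policy", "department": "HR", "last_updated": "2024-01-05"},
)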

Retrieval Pipeline

  1. Query understanding and expansion
  2. Parallel BM25 and vector search
  3. Reciprocal Rank Fusion (RRF) for result merging
  4. Cross-encoder reranking of top 10 candidates
  5. Top 3-5 chunks sent to LLM
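
Step 3 is worth spelling out: Reciprocal Rank Fusion scores each document as the sum of 1/(k + rank) over the BM25 and vector result lists, so documents that rank well in both lists float to the top without any score calibration. A minimal sketch follows (k = 60 is the commonly used constant; the document IDs are made up):

# Reciprocal Rank Fusion: merge ranked lists using ranks only, no raw scores.
# score(doc) = sum over lists of 1 / (k + rank_in_list)
from collections import defaultdict

def rrf_merge(result_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: IDs returned by BM25 and vector search for the same query.
bm25_hits = ["doc-7", "doc-2", "doc-9"]
vector_hits = ["doc-2", "doc-4", "doc-7"]
print(rrf_merge([bm25_hits, vector_hits]))  # doc-2 and doc-7 rank highest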

Quality Assurance

  • Eval set: 50 golden questions with expert answers
  • Metrics: Answer relevance, citation accuracy, hallucination rate
  • A/B testing: New prompt/retrieval strategies tested against baseline
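
Below is a sketch of how the golden-question eval might be wired up. The file layout, the shape of the assistant's response, and the citation-accuracy definition (the expected source appears among the cited sources) are assumptions for illustration.

# Golden-question eval loop (sketch). File format, response shape, and the
# citation-accuracy definition are assumptions, not the production harness.
import json

def run_eval(ask, eval_path: str = "golden_questions.jsonl") -> dict:
    total, correct_citations = 0, 0
    with open(eval_path) as f:
        for line in f:
            case = json.loads(line)          # {"question": ..., "expected_source": ...}
            result = ask(case["question"])   # assumed to return {"answer", "sources"}
            total += 1
            if case["expected_source"] in result["sources"]:
                correct_citations += 1
    return {"citation_accuracy": correct_citations / total, "questions": total}

# A/B testing: run the same eval set against baseline and candidate pipelines,
# e.g. run_eval(ask_baseline) vs. run_eval(ask_new_retrieval).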

Business Impact

After 6 months:

  • 45% ticket deflection - employees find answers instantly
  • 60% faster resolution - remaining tickets answered faster with better context
  • $85K annual savings - reduced support team workload
  • 98% accuracy on eval set - maintains high quality

Lessons Learned

What Worked

  • Hybrid search outperformed vector-only by 23% on our eval set
  • Citations build trust - users validate answers against sources
  • Semantic chunking better than fixed-size chunks for our docs

What Didn’t

  • Notion API rate limits - had to implement caching and incremental updates
  • PDF tables - required custom parsing logic for structured data
  • Generic embeddings - fine-tuned embeddings improved relevance by 15%

Future Improvements

  • Add conversational follow-up questions
  • Implement feedback loop for continuous learning
  • Expand to external customer support
  • Multi-modal support for images and diagrams

This system transformed internal knowledge management from a bottleneck into a competitive advantage.

Technical Architecture

  • Ingestion pipeline (web, PDFs)
  • Hybrid search (BM25 + vector)
  • Answer re-ranking
  • JSON logging

Technology Stack

Python · FastAPI · OpenAI · Weaviate · Playwright