Introduction
Large Language Models (LLMs) like GPT-4 and Claude are excellent at conversation, but they have a critical limitation: they don't know about your specific data. Ask ChatGPT about your company's product specifications and it will confidently hallucinate wrong answers.
Retrieval-Augmented Generation (RAG) solves this by combining:
- A knowledge base (your documents)
- Semantic search (finding relevant context)
- An LLM (generating answers from that context)
In this tutorial, I'll walk through building a production-ready RAG chatbot using:
- Groq API for ultra-fast LLM inference (600+ tokens/sec)
- Vector embeddings for semantic search
- Next.js 14 for the frontend
- Streaming responses for better UX
Live Demo: Try the chatbot
Code: GitHub Repository
What is RAG? A Visual Explanation
Traditional LLM (Limited to Training Data)
User: "What are Nicolas's skills?"
↓
[GPT-4 Model]
↓
Answer: "I don't have information about Nicolas" ❌
RAG System (Augmented with Custom Knowledge)
User: "What are Nicolas's skills?"
↓
[Semantic Search Engine]
↓
Retrieves: "Nicolas has expertise in:
- Python (5 years)
- Power BI (advanced)
- Machine Learning (TensorFlow, scikit-learn)"
↓
[Context + Query] → [Groq LLM]
↓
Answer: "Nicolas has strong skills in Python (5 years),
Power BI at an advanced level, and Machine Learning
using frameworks like TensorFlow and scikit-learn." ✅
Key Insight: The LLM doesn't memorize your data—it reads relevant excerpts in real-time.
Architecture Overview
High-Level Flow
// 1. Preprocessing: Convert documents to embeddings
Documents → Chunking → Embeddings → Vector Store
// 2. Runtime: User query
User Query → Query Embedding → Semantic Search → Top K Chunks
// 3. Generation: LLM with context
Context + Query + System Prompt → Groq API → Streaming Response
Tech Stack
{
"dependencies": {
"next": "14.2.x", // Frontend framework
"groq-sdk": "^0.3.0", // Groq API client
"@xenova/transformers": "^2.10.0", // Browser-based embeddings
"ai": "^3.0.0", // Vercel AI SDK for streaming
"react-markdown": "^9.0.0" // Render formatted responses
}
}
Why Groq?
- Speed: 600 tokens/sec vs. OpenAI's 40 tokens/sec (15x faster)
- Cost: $0.27/1M tokens vs. OpenAI's $3/1M (11x cheaper)
- Streaming: Native support for real-time responses
Trade-off: Groq uses quantized models (slightly lower quality) and has smaller context windows (8K-32K tokens depending on the model, vs. GPT-4 Turbo's 128K). For chatbots, speed > perfection.
Part 1: Building the Knowledge Base
Step 1: Document Chunking
Large documents must be split into chunks that fit in the LLM's context window.
interface DocumentChunk {
id: string;
text: string;
metadata: {
source: string;
category: string;
};
}
function chunkDocument(text: string, chunkSize = 500, overlap = 50): string[] {
const chunks: string[] = [];
let startIndex = 0;
while (startIndex < text.length) {
const endIndex = Math.min(startIndex + chunkSize, text.length);
chunks.push(text.slice(startIndex, endIndex));
startIndex += chunkSize - overlap; // Overlap prevents context loss
}
return chunks;
}
Example:
const resume = `
Nicolas Avril is a Data Scientist with 2 years of experience...
[~3,000 characters total]
`;
const chunks = chunkDocument(resume, 500, 50);
// Result: 7 chunks of ~500 characters each
Why overlap? If a sentence is split mid-thought, the LLM loses context. Overlap ensures each chunk is coherent.
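To see the overlap concretely, compare the tail of one chunk with the head of the next (reusing the chunks array from the example above):
// The start index advances by chunkSize - overlap (500 - 50 = 450), so the
// last 50 characters of chunks[0] reappear at the start of chunks[1].
console.log(chunks[0].slice(-50) === chunks[1].slice(0, 50)); // true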
Step 2: Generate Embeddings
Embeddings convert text into vectors (arrays of numbers) that capture semantic meaning.
import { pipeline } from '@xenova/transformers';
// Initialize embedding model (runs in browser with WASM)
const embedder = await pipeline(
'feature-extraction',
'Xenova/all-MiniLM-L6-v2' // Lightweight model (80MB)
);
async function generateEmbedding(text: string): Promise<number[]> {
const output = await embedder(text, {
pooling: 'mean',
normalize: true,
});
return Array.from(output.data);
}
// Example
const query = "What is Nicolas's experience?";
const embedding = await generateEmbedding(query);
// Result: [0.023, -0.145, 0.089, ...] (384 dimensions)
Model Choice:
- all-MiniLM-L6-v2: Fast, small (80MB), good for general queries
- bge-small-en-v1.5: Better accuracy, larger (120MB)
- OpenAI text-embedding-3-small: Best quality, requires API calls ($)
Trade-off: I chose all-MiniLM-L6-v2 for client-side privacy—no data sent to third parties.
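Swapping embedding models is a one-line change. For example, to trade download size for accuracy (assuming the same Transformers.js feature-extraction pipeline, which the Xenova conversions support):
// Larger but more accurate alternative to all-MiniLM-L6-v2
const embedder = await pipeline(
  'feature-extraction',
  'Xenova/bge-small-en-v1.5'
);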
Step 3: Build Vector Store
interface VectorStore {
chunks: DocumentChunk[];
embeddings: number[][];
}
async function buildVectorStore(documents: string[]): Promise<VectorStore> {
const chunks: DocumentChunk[] = [];
const embeddings: number[][] = [];
for (const doc of documents) {
const docChunks = chunkDocument(doc);
for (const chunk of docChunks) {
const embedding = await generateEmbedding(chunk);
chunks.push({
id: crypto.randomUUID(),
text: chunk,
metadata: { source: 'resume', category: 'experience' },
});
embeddings.push(embedding);
}
}
return { chunks, embeddings };
}
Optimization: For production, use a vector database:
- Pinecone: Managed, scales to billions of vectors
- Weaviate: Open-source, self-hosted
- Chroma: Lightweight, embedded in Python
For this demo, in-memory storage works (< 100 chunks).
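Even with in-memory storage, it's worth building the store once and persisting it so embeddings aren't recomputed on every cold start. A minimal sketch, assuming a Node build script and a hypothetical data/vector-store.json file:
import { promises as fs } from 'fs';

// Run at build time: embed once, write the result to disk
async function saveVectorStore(store: VectorStore, path = 'data/vector-store.json') {
  await fs.writeFile(path, JSON.stringify(store));
}

// Run at startup: load the precomputed chunks and embeddings
async function loadVectorStore(path = 'data/vector-store.json'): Promise<VectorStore> {
  return JSON.parse(await fs.readFile(path, 'utf-8'));
}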
Part 2: Semantic Search Engine
Cosine Similarity
To find relevant chunks, we compare the query embedding to chunk embeddings using cosine similarity.
function cosineSimilarity(vecA: number[], vecB: number[]): number {
const dotProduct = vecA.reduce((sum, a, i) => sum + a * vecB[i], 0);
const magnitudeA = Math.sqrt(vecA.reduce((sum, a) => sum + a * a, 0));
const magnitudeB = Math.sqrt(vecB.reduce((sum, b) => sum + b * b, 0));
return dotProduct / (magnitudeA * magnitudeB);
}
Intuition: Vectors pointing in similar directions (high cosine similarity) represent semantically similar text.
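A quick numeric check of the function above builds intuition:
cosineSimilarity([1, 0], [2, 0]); // 1.0  (same direction, same meaning)
cosineSimilarity([1, 0], [0, 1]); // 0.0  (orthogonal, unrelated)
cosineSimilarity([1, 1], [1, 0]); // ~0.71 (partially related)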
Retrieve Top-K Chunks
async function retrieveContext(
query: string,
vectorStore: VectorStore,
topK = 3
): Promise<string[]> {
const queryEmbedding = await generateEmbedding(query);
// Calculate similarities
const scores = vectorStore.embeddings.map((chunkEmbedding) =>
cosineSimilarity(queryEmbedding, chunkEmbedding)
);
// Get top K indices
const topIndices = scores
.map((score, index) => ({ score, index }))
.sort((a, b) => b.score - a.score)
.slice(0, topK)
.map((item) => item.index);
// Return corresponding chunks
return topIndices.map((i) => vectorStore.chunks[i].text);
}
Example:
const query = "What programming languages does Nicolas know?";
const context = await retrieveContext(query, vectorStore, 3);
// Result:
[
"Nicolas has 5 years of experience with Python...",
"Programming skills: Python (advanced), Java, C++...",
"Developed Python scripts that improved efficiency by 35%..."
]
Part 3: LLM Integration with Groq
Prompt Engineering
The prompt structure is critical for quality responses.
function buildPrompt(query: string, context: string[]): string {
return `You are a helpful assistant answering questions about Nicolas Avril, a Data Scientist and AI Engineer.
Context (retrieved from Nicolas's resume and portfolio):
${context.map((chunk, i) => `[${i + 1}] ${chunk}`).join('\n\n')}
User Question: ${query}
Instructions:
- Answer based ONLY on the provided context
- If the context doesn't contain the answer, say "I don't have that information"
- Be concise but complete
- Use specific examples from the context
Answer:`;
}
Key Techniques:
- System role definition: "You are a helpful assistant..."
- Explicit context boundaries: Number each chunk ([1], [2], ...)
- Grounding instructions: "Answer based ONLY on the provided context"
- Fallback behavior: "If the context doesn't contain..."
Groq API Integration
import Groq from 'groq-sdk';
const groq = new Groq({
apiKey: process.env.GROQ_API_KEY,
});
async function generateResponse(prompt: string): Promise<string> {
const completion = await groq.chat.completions.create({
model: 'mixtral-8x7b-32768', // Fast, good reasoning
messages: [
{
role: 'system',
content: 'You are a helpful assistant for Nicolas Avril\'s portfolio.',
},
{
role: 'user',
content: prompt,
},
],
temperature: 0.3, // Lower = more deterministic
max_tokens: 500,
});
return completion.choices[0]?.message?.content || '';
}
Model Selection:
| Model | Speed (tokens/sec) | Context Window | Best For |
|---|---|---|---|
| llama3-8b-8192 | 800 | 8K | Simple Q&A |
| mixtral-8x7b-32768 | 600 | 32K | Complex reasoning |
| llama3-70b-8192 | 300 | 8K | Highest quality |
I use Mixtral for the best balance of speed and reasoning.
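If you want to compare models without redeploying, one option (an assumption, not part of the original code) is to read the model id from an environment variable:
// Hypothetical GROQ_MODEL variable; falls back to the Mixtral model used in this post
const GROQ_MODEL = process.env.GROQ_MODEL ?? 'mixtral-8x7b-32768';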
Part 4: Streaming Responses
Users expect real-time feedback, not 10-second waits.
Vercel AI SDK for Streaming
import { OpenAIStream, StreamingTextResponse } from 'ai';
export async function POST(req: Request) {
const { query } = await req.json();
// 1. Retrieve context
const context = await retrieveContext(query, vectorStore, 3);
// 2. Build prompt
const prompt = buildPrompt(query, context);
// 3. Stream response from Groq
const response = await groq.chat.completions.create({
model: 'mixtral-8x7b-32768',
messages: [{ role: 'user', content: prompt }],
stream: true, // Enable streaming
});
// 4. Convert to readable stream
const stream = OpenAIStream(response);
return new StreamingTextResponse(stream);
}
Frontend: Consuming the Stream
'use client';
import { useChat } from 'ai/react';
import ReactMarkdown from 'react-markdown';
export default function Chatbot() {
const { messages, input, handleInputChange, handleSubmit, isLoading } = useChat({
api: '/api/chat',
});
return (
<div className="flex flex-col h-screen">
<div className="flex-1 overflow-y-auto p-4 space-y-4">
{messages.map((message) => (
<div
key={message.id}
className={`p-4 rounded-lg ${
message.role === 'user'
? 'bg-blue-100 ml-auto max-w-[80%]'
: 'bg-slate-100 mr-auto max-w-[80%]'
}`}
>
<ReactMarkdown>{message.content}</ReactMarkdown>
</div>
))}
{isLoading && <div className="animate-pulse">Thinking...</div>}
</div>
<form onSubmit={handleSubmit} className="p-4 border-t">
<input
value={input}
onChange={handleInputChange}
placeholder="Ask about Nicolas's experience..."
className="w-full p-3 border rounded-lg"
/>
</form>
</div>
);
}
Magic: useChat handles all streaming logic—tokens appear word-by-word automatically.
Part 5: Advanced Optimizations
1. Context Window Management
Groq's Mixtral has a 32K token context limit. With the top-3 chunks (≈500 characters each) plus the prompt template, a single exchange uses well under 1K tokens, which is safe. But what if a user asks 10 questions in a row?
Solution: Sliding window context
function manageConversationHistory(
messages: Message[],
maxTokens = 4000
): Message[] {
let tokenCount = 0;
const recentMessages: Message[] = [];
// Keep most recent messages within token budget
for (let i = messages.length - 1; i >= 0; i--) {
const message = messages[i];
const estimatedTokens = message.content.length / 4; // Rough estimate
if (tokenCount + estimatedTokens > maxTokens) break;
recentMessages.unshift(message);
tokenCount += estimatedTokens;
}
return recentMessages;
}
2. Caching Embeddings
Generating embeddings is slow (200ms per chunk). Cache them:
const embeddingCache = new Map<string, number[]>();
async function getCachedEmbedding(text: string): Promise<number[]> {
if (embeddingCache.has(text)) {
return embeddingCache.get(text)!;
}
const embedding = await generateEmbedding(text);
embeddingCache.set(text, embedding);
return embedding;
}
Impact: Repeated queries 10x faster (20ms vs. 200ms).
3. Hybrid Search (Keyword + Semantic)
Sometimes exact keyword matches outperform semantic search.
async function hybridSearch(
  query: string,
  vectorStore: VectorStore,
  topK = 3
): Promise<string[]> {
  // Semantic search (retrieveContext is async, so await it)
  const semanticResults = await retrieveContext(query, vectorStore, topK);
// Keyword search (BM25-like)
const keywords = query.toLowerCase().split(' ');
const keywordResults = vectorStore.chunks
.filter((chunk) =>
keywords.some((kw) => chunk.text.toLowerCase().includes(kw))
)
.slice(0, topK);
// Merge and deduplicate
const combined = [...semanticResults, ...keywordResults.map((c) => c.text)];
return Array.from(new Set(combined)).slice(0, topK);
}
4. Answer Quality Monitoring
Track when the chatbot says "I don't have that information."
// In API route
if (response.includes("I don't have that information")) {
await logMissingInfo(query); // Store for knowledge base improvements
}
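logMissingInfo isn't shown above. A minimal sketch, file-based for local development (in a serverless deployment you would write to a database or analytics service instead):
import { promises as fs } from 'fs';

// Hypothetical helper: append unanswered queries for later knowledge-base updates
async function logMissingInfo(query: string) {
  const entry = JSON.stringify({ query, timestamp: new Date().toISOString() });
  await fs.appendFile('missing-info.log', entry + '\n');
}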
Production Considerations
Security
- Rate Limiting: Prevent abuse
import { Ratelimit } from '@upstash/ratelimit';
import { Redis } from '@upstash/redis';
const ratelimit = new Ratelimit({
redis: Redis.fromEnv(),
limiter: Ratelimit.slidingWindow(10, '10 s'),
});
// In API route
const { success } = await ratelimit.limit(userId);
if (!success) return new Response('Rate limit exceeded', { status: 429 });
- Input Sanitization: A basic defense against prompt injection and oversized inputs
function sanitizeQuery(query: string): string {
return query
.replace(/[<>]/g, '') // Remove potential HTML
.slice(0, 500); // Limit length
}
Cost Optimization
Groq Pricing: $0.27 per 1M tokens
Average conversation:
- Query: 20 tokens
- Context (3 chunks): 400 tokens
- Response: 150 tokens
- Total: 570 tokens per message
Cost per 1000 messages: $0.15 🎉
Comparison:
- OpenAI GPT-4 Turbo: $4.50 per 1000 messages (30x more expensive)
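The arithmetic is simple enough to encode as a helper if you want to sanity-check your own volume (prices are the ones quoted above and will change over time):
// Estimated cost in dollars for a given message volume
function estimateCost(tokensPerMessage: number, messages: number, pricePerMTokens = 0.27): number {
  return (tokensPerMessage * messages * pricePerMTokens) / 1_000_000;
}

estimateCost(570, 1000); // ≈ $0.15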
Monitoring
Track these metrics:
interface ChatMetrics {
queryLatency: number; // Time to retrieve context
llmLatency: number; // Time for LLM response
tokensUsed: number;
userSatisfaction: boolean; // Thumbs up/down
}
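A minimal way to capture the latency fields is to timestamp around the retrieval and generation calls in the API route (sendMetrics is a placeholder for whatever logging or analytics backend you use):
const t0 = Date.now();
const context = await retrieveContext(query, vectorStore, 3);
const t1 = Date.now();
const prompt = buildPrompt(query, context);
const answer = await generateResponse(prompt);
const t2 = Date.now();

const metrics: ChatMetrics = {
  queryLatency: t1 - t0,   // retrieval time
  llmLatency: t2 - t1,     // generation time
  tokensUsed: Math.round((prompt.length + answer.length) / 4), // rough 4-chars-per-token estimate
  userSatisfaction: false, // overwritten later by the thumbs up/down handler
};
await sendMetrics(metrics); // hypothetical sink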
Lessons Learned
What Worked
- Groq's speed is a game-changer: Users notice the difference
- Client-side embeddings protect privacy: No data leaks to OpenAI
- Streaming UX matters: 80% of users prefer progressive responses
What Was Challenging
- Chunking strategy: Too small = loss of context; too large = irrelevant info included
- Prompt engineering: Took 12 iterations to prevent hallucinations
- Edge cases: "Tell me a joke" (not in knowledge base) needs graceful handling
Future Improvements
- Multi-modal RAG: Include images from projects
- Conversation summarization: Remember previous questions
- Fine-tuned embeddings: Train on domain-specific data for better retrieval
Try It Yourself
Starter Template: GitHub - RAG Chatbot Starter
git clone https://github.com/nicolasavril/rag-chatbot.git
cd rag-chatbot
npm install
echo "GROQ_API_KEY=your_key_here" > .env.local
npm run dev
Customize for your use case:
- Replace resume.txt with your documents
- Adjust chunkSize and topK parameters
- Modify the system prompt for your domain
Conclusion
RAG transforms LLMs from "knows everything but nothing specific" to "expert in your domain." By combining:
- Semantic search (embeddings + cosine similarity)
- Context injection (retrieval before generation)
- Fast inference (Groq's optimized models)
...we built a chatbot that answers domain-specific questions in <2 seconds while costing pennies per 1000 interactions.
Key Takeaway: RAG is not just for chatbots—apply it to:
- Customer support (search docs + generate answers)
- Legal research (case law retrieval + summarization)
- Internal knowledge bases (company wiki + Q&A)
Resources
Code & Demo:
Further Reading:
About the Author
Nicolas Avril is a Data Scientist & AI Engineer specializing in NLP, Machine Learning, and Business Intelligence. He builds production-ready AI tools with a focus on performance, cost-efficiency, and user experience.
Connect: LinkedIn | Portfolio | GitHub
Questions about RAG implementation? Drop a comment below or reach out on LinkedIn. Follow for more deep dives into practical AI engineering.