Introduction
Large Language Models (LLMs) like GPT-4 and Claude are excellent at conversation, but they have a critical limitation: they don't know about your specific data. Ask ChatGPT about your company's product specifications and it will confidently hallucinate wrong answers.
Retrieval-Augmented Generation (RAG) solves this by combining:
- A knowledge base (your documents)
- Semantic search (finding relevant context)
- An LLM (generating answers from that context)
In this tutorial, I'll walk through building a production-ready RAG chatbot using:
- Groq API for ultra-fast LLM inference (600+ tokens/sec)
- Vector embeddings for semantic search
- Next.js 14 for the frontend
- Streaming responses for better UX
Live Demo: Try the chatbot
Code: GitHub Repository
What is RAG? A Visual Explanation
Traditional LLM (Limited to Training Data)
User: "What are Nicolas's skills?"
↓
[GPT-4 Model]
↓
Answer: "I don't have information about Nicolas" ❌
RAG System (Augmented with Custom Knowledge)
User: "What are Nicolas's skills?"
↓
[Semantic Search Engine]
↓
Retrieves: "Nicolas has expertise in:
- Python (5 years)
- Power BI (advanced)
- Machine Learning (TensorFlow, scikit-learn)"
↓
[Context + Query] → [Groq LLM]
↓
Answer: "Nicolas has strong skills in Python (5 years),
Power BI at an advanced level, and Machine Learning
using frameworks like TensorFlow and scikit-learn." ✅
Key Insight: The LLM doesn't memorize your data—it reads relevant excerpts in real-time.
Architecture Overview
High-Level Flow
// 1. Preprocessing: Convert documents to embeddings
Documents → Chunking → Embeddings → Vector Store
// 2. Runtime: User query
User Query → Query Embedding → Semantic Search → Top K Chunks
// 3. Generation: LLM with context
Context + Query + System Prompt → Groq API → Streaming Response
Tech Stack
{
"dependencies": {
"next": "14.2.x", // Frontend framework
"groq-sdk": "^0.3.0", // Groq API client
"@xenova/transformers": "^2.10.0", // Browser-based embeddings
"ai": "^3.0.0", // Vercel AI SDK for streaming
"react-markdown": "^9.0.0" // Render formatted responses
}
}
Why Groq?
- Speed: 600 tokens/sec vs. OpenAI's 40 tokens/sec (15x faster)
- Cost: $0.27/1M tokens vs. OpenAI's $3/1M (11x cheaper)
- Streaming: Native support for real-time responses
Trade-off: Groq uses quantized models (slightly lower quality) and has smaller context windows (8K-32K tokens depending on the model, vs. GPT-4 Turbo's 128K). For chatbots, speed > perfection.
Part 1: Building the Knowledge Base
Step 1: Document Chunking
Large documents must be split into chunks that fit in the LLM's context window.
interface DocumentChunk {
id: string;
text: string;
metadata: {
source: string;
category: string;
};
}
function chunkDocument(text: string, chunkSize = 500, overlap = 50): string[] {
const chunks: string[] = [];
let startIndex = 0;
while (startIndex < text.length) {
const endIndex = Math.min(startIndex + chunkSize, text.length);
chunks.push(text.slice(startIndex, endIndex));
startIndex += chunkSize - overlap; // Overlap prevents context loss
}
return chunks;
}
Example:
const resume = `
Nicolas Avril is a Data Scientist with 2 years of experience...
[~3,000 characters total]
`;
const chunks = chunkDocument(resume, 500, 50);
// Result: 7 chunks of ~500 characters each
Why overlap? If a sentence is split mid-thought, the LLM loses context. Overlap ensures each chunk is coherent.
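To see the overlap concretely, compare the tail of one chunk with the head of the next (reusing the chunks array from the example above):
// The start index advances by chunkSize - overlap (500 - 50 = 450), so the
// last 50 characters of chunks[0] reappear at the start of chunks[1].
console.log(chunks[0].slice(-50) === chunks[1].slice(0, 50)); // true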
Step 2: Generate Embeddings
Embeddings convert text into vectors (arrays of numbers) that capture semantic meaning.
import { pipeline } from '@xenova/transformers';
// Initialize embedding model (runs in browser with WASM)
const embedder = await pipeline(
'feature-extraction',
'Xenova/all-MiniLM-L6-v2' // Lightweight model (80MB)
);
async function generateEmbedding(text: string): Promise<number[]> {
const output = await embedder(text, {
pooling: 'mean',
normalize: true,
});
return Array.from(output.data);
}
// Example
const query = "What is Nicolas's experience?";
const embedding = await generateEmbedding(query);
// Result: [0.023, -0.145, 0.089, ...] (384 dimensions)
Model Choice:
- all-MiniLM-L6-v2: Fast, small (80MB), good for general queries
- bge-small-en-v1.5: Better accuracy, larger (120MB)
- OpenAI text-embedding-3-small: Best quality, requires API calls ($)
Trade-off: I chose all-MiniLM-L6-v2 for client-side privacy—no data sent to third parties.
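Swapping embedding models is a one-line change. For example, to trade download size for accuracy (assuming the same Transformers.js feature-extraction pipeline, which the Xenova conversions support):
// Larger but more accurate alternative to all-MiniLM-L6-v2
const embedder = await pipeline(
  'feature-extraction',
  'Xenova/bge-small-en-v1.5'
);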
Step 3: Build Vector Store
interface VectorStore {
chunks: DocumentChunk[];
embeddings: number[][];
}
async function buildVectorStore(documents: string[]): Promise<VectorStore> {
const chunks: DocumentChunk[] = [];
const embeddings: number[][] = [];
for (const doc of documents) {
const docChunks = chunkDocument(doc);
for (const chunk of docChunks) {
const embedding = await generateEmbedding(chunk);
chunks.push({
id: crypto.randomUUID(),
text: chunk,
metadata: { source: 'resume', category: 'experience' },
});
embeddings.push(embedding);
}
}
return { chunks, embeddings };
}
Optimization: For production, use a vector database:
- Pinecone: Managed, scales to billions of vectors
- Weaviate: Open-source, self-hosted
- Chroma: Lightweight, embedded in Python
For this demo, in-memory storage works (< 100 chunks).
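Even with in-memory storage, it's worth building the store once and persisting it so embeddings aren't recomputed on every cold start. A minimal sketch, assuming a Node build script and a hypothetical data/vector-store.json file:
import { promises as fs } from 'fs';

// Run at build time: embed once, write the result to disk
async function saveVectorStore(store: VectorStore, path = 'data/vector-store.json') {
  await fs.writeFile(path, JSON.stringify(store));
}

// Run at startup: load the precomputed chunks and embeddings
async function loadVectorStore(path = 'data/vector-store.json'): Promise<VectorStore> {
  return JSON.parse(await fs.readFile(path, 'utf-8'));
}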
Part 2: Semantic Search Engine
Cosine Similarity
To find relevant chunks, we compare the query embedding to chunk embeddings using cosine similarity.
function cosineSimilarity(vecA: number[], vecB: number[]): number {
const dotProduct = vecA.reduce((sum, a, i) => sum + a * vecB[i], 0);
const magnitudeA = Math.sqrt(vecA.reduce((sum, a) => sum + a * a, 0));
const magnitudeB = Math.sqrt(vecB.reduce((sum, b) => sum + b * b, 0));
return dotProduct / (magnitudeA * magnitudeB);
}
Intuition: Vectors pointing in similar directions (high cosine similarity) represent semantically similar text.
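A quick numeric check of the function above builds intuition:
cosineSimilarity([1, 0], [2, 0]); // 1.0  (same direction, same meaning)
cosineSimilarity([1, 0], [0, 1]); // 0.0  (orthogonal, unrelated)
cosineSimilarity([1, 1], [1, 0]); // ~0.71 (partially related)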
Retrieve Top-K Chunks
async function retrieveContext(
query: string,
vectorStore: VectorStore,
topK = 3
): Promise<string[]> {
const queryEmbedding = await generateEmbedding(query);
// Calculate similarities
const scores = vectorStore.embeddings.map((chunkEmbedding) =>
cosineSimilarity(queryEmbedding, chunkEmbedding)
);
// Get top K indices
const topIndices = scores
.map((score, index) => ({ score, index }))
.sort((a, b) => b.score - a.score)
.slice(0, topK)
.map((item) => item.index);
// Return corresponding chunks
return topIndices.map((i) => vectorStore.chunks[i].text);
}
Example:
const query = "What programming languages does Nicolas know?";
const context = await retrieveContext(query, vectorStore, 3);
// Result:
[
"Nicolas has 5 years of experience with Python...",
"Programming skills: Python (advanced), Java, C++...",
"Developed Python scripts that improved efficiency by 35%..."
]
Part 3: LLM Integration with Groq
Prompt Engineering
The prompt structure is critical for quality responses.
function buildPrompt(query: string, context: string[]): string {
return `You are a helpful assistant answering questions about Nicolas Avril, a Data Scientist and AI Engineer.
Context (retrieved from Nicolas's resume and portfolio):
${context.map((chunk, i) => `[${i + 1}] ${chunk}`).join('\n\n')}
User Question: ${query}
Instructions:
- Answer based ONLY on the provided context
- If the context doesn't contain the answer, say "I don't have that information"
- Be concise but complete
- Use specific examples from the context
Answer:`;
}
Key Techniques:
- System role definition: "You are a helpful assistant..."
- Explicit context boundaries: Number each chunk ([1], [2], ...)
- Grounding instructions: "Answer based ONLY on the provided context"
- Fallback behavior: "If the context doesn't contain..."
Groq API Integration
import Groq from 'groq-sdk';
const groq = new Groq({
apiKey: process.env.GROQ_API_KEY,
});
async function generateResponse(prompt: string): Promise<string> {
const completion = await groq.chat.completions.create({
model: 'mixtral-8x7b-32768', // Fast, good reasoning
messages: [
{
role: 'system',
content: 'You are a helpful assistant for Nicolas Avril\'s portfolio.',
},
{
role: 'user',
content: prompt,
},
],
temperature: 0.3, // Lower = more deterministic
max_tokens: 500,
});
return completion.choices[0]?.message?.content || '';
}
Model Selection:
| Model | Speed (tokens/sec) | Context Window | Best For |
|---|---|---|---|
| llama3-8b-8192 | 800 | 8K | Simple Q&A |
| mixtral-8x7b-32768 | 600 | 32K | Complex reasoning |
| llama3-70b-8192 | 300 | 8K | Highest quality |
I use Mixtral for the best balance of speed and reasoning.
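If you want to compare models without redeploying, one option (an assumption, not part of the original code) is to read the model id from an environment variable:
// Hypothetical GROQ_MODEL variable; falls back to the Mixtral model used in this post
const GROQ_MODEL = process.env.GROQ_MODEL ?? 'mixtral-8x7b-32768';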
Part 4: Streaming Responses
Users expect real-time feedback, not 10-second waits.
Vercel AI SDK for Streaming
import { OpenAIStream, StreamingTextResponse } from 'ai';
export async function POST(req: Request) {
const { query } = await req.json();
// 1. Retrieve context
const context = await retrieveContext(query, vectorStore, 3);
// 2. Build prompt
const prompt = buildPrompt(query, context);
// 3. Stream response from Groq
const response = await groq.chat.completions.create({
model: 'mixtral-8x7b-32768',
messages: [{ role: 'user', content: prompt }],
stream: true, // Enable streaming
});
// 4. Convert to readable stream
const stream = OpenAIStream(response);
return new StreamingTextResponse(stream);
}
Frontend: Consuming the Stream
'use client';
import { useChat } from 'ai/react';
import ReactMarkdown from 'react-markdown';
export default function Chatbot() {
const { messages, input, handleInputChange, handleSubmit, isLoading } = useChat({
api: '/api/chat',
});
return (
<div className="flex flex-col h-screen">
<div className="flex-1 overflow-y-auto p-4 space-y-4">
{messages.map((message) => (
<div
key={message.id}
className={`p-4 rounded-lg ${
message.role === 'user'
? 'bg-blue-100 ml-auto max-w-[80%]'
: 'bg-slate-100 mr-auto max-w-[80%]'
}`}
>
<ReactMarkdown>{message.content}</ReactMarkdown>
</div>
))}
{isLoading && <div className="animate-pulse">Thinking...</div>}
</div>
<form onSubmit={handleSubmit} className="p-4 border-t">
<input
value={input}
onChange={handleInputChange}
placeholder="Ask about Nicolas's experience..."
className="w-full p-3 border rounded-lg"
/>
</form>
</div>
);
}
Magic: useChat handles all streaming logic—tokens appear word-by-word automatically.
Part 5: Advanced Optimizations
1. Context Window Management
Groq's Mixtral has a 32K token context limit. With the top-3 chunks (≈500 characters each) plus the prompt template, a single exchange uses well under 1K tokens, which is safe. But what if a user asks 10 questions in a row?
Solution: Sliding window context
function manageConversationHistory(
messages: Message[],
maxTokens = 4000
): Message[] {
let tokenCount = 0;
const recentMessages: Message[] = [];
// Keep most recent messages within token budget
for (let i = messages.length - 1; i >= 0; i--) {
const message = messages[i];
const estimatedTokens = message.content.length / 4; // Rough estimate
if (tokenCount + estimatedTokens > maxTokens) break;
recentMessages.unshift(message);
tokenCount += estimatedTokens;
}
return recentMessages;
}
2. Caching Embeddings
Generating embeddings is slow (200ms per chunk). Cache them:
const embeddingCache = new Map<string, number[]>();
async function getCachedEmbedding(text: string): Promise<number[]> {
if (embeddingCache.has(text)) {
return embeddingCache.get(text)!;
}
const embedding = await generateEmbedding(text);
embeddingCache.set(text, embedding);
return embedding;
}
Impact: Repeated queries 10x faster (20ms vs. 200ms).
3. Hybrid Search (Keyword + Semantic)
Sometimes exact keyword matches outperform semantic search.
async function hybridSearch(
  query: string,
  vectorStore: VectorStore,
  topK = 3
): Promise<string[]> {
  // Semantic search (retrieveContext is async, so await it)
  const semanticResults = await retrieveContext(query, vectorStore, topK);
// Keyword search (BM25-like)
const keywords = query.toLowerCase().split(' ');
const keywordResults = vectorStore.chunks
.filter((chunk) =>
keywords.some((kw) => chunk.text.toLowerCase().includes(kw))
)
.slice(0, topK);
// Merge and deduplicate
const combined = [...semanticResults, ...keywordResults.map((c) => c.text)];
return Array.from(new Set(combined)).slice(0, topK);
}
4. Answer Quality Monitoring
Track when the chatbot says "I don't have that information."
// In API route
if (response.includes("I don't have that information")) {
await logMissingInfo(query); // Store for knowledge base improvements
}
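logMissingInfo isn't shown above. A minimal sketch, file-based for local development (in a serverless deployment you would write to a database or analytics service instead):
import { promises as fs } from 'fs';

// Hypothetical helper: append unanswered queries for later knowledge-base updates
async function logMissingInfo(query: string) {
  const entry = JSON.stringify({ query, timestamp: new Date().toISOString() });
  await fs.appendFile('missing-info.log', entry + '\n');
}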
Production Considerations
Security
- Rate Limiting: Prevent abuse
import { Ratelimit } from '@upstash/ratelimit';
import { Redis } from '@upstash/redis';
const ratelimit = new Ratelimit({
redis: Redis.fromEnv(),
limiter: Ratelimit.slidingWindow(10, '10 s'),
});
// In API route
const { success } = await ratelimit.limit(userId);
if (!success) return new Response('Rate limit exceeded', { status: 429 });
- Input Sanitization: A basic defense against prompt injection and oversized inputs
function sanitizeQuery(query: string): string {
return query
.replace(/[<>]/g, '') // Remove potential HTML
.slice(0, 500); // Limit length
}
Cost Optimization
Groq Pricing: $0.27 per 1M tokens
Average conversation:
- Query: 20 tokens
- Context (3 chunks): 400 tokens
- Response: 150 tokens
- Total: 570 tokens per message
Cost per 1000 messages: $0.15 🎉
Comparison:
- OpenAI GPT-4 Turbo: $4.50 per 1000 messages (30x more expensive)
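The arithmetic is simple enough to encode as a helper if you want to sanity-check your own volume (prices are the ones quoted above and will change over time):
// Estimated cost in dollars for a given message volume
function estimateCost(tokensPerMessage: number, messages: number, pricePerMTokens = 0.27): number {
  return (tokensPerMessage * messages * pricePerMTokens) / 1_000_000;
}

estimateCost(570, 1000); // ≈ $0.15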
Monitoring
Track these metrics:
interface ChatMetrics {
queryLatency: number; // Time to retrieve context
llmLatency: number; // Time for LLM response
tokensUsed: number;
userSatisfaction: boolean; // Thumbs up/down
}
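A minimal way to capture the latency fields is to timestamp around the retrieval and generation calls in the API route (sendMetrics is a placeholder for whatever logging or analytics backend you use):
const t0 = Date.now();
const context = await retrieveContext(query, vectorStore, 3);
const t1 = Date.now();
const prompt = buildPrompt(query, context);
const answer = await generateResponse(prompt);
const t2 = Date.now();

const metrics: ChatMetrics = {
  queryLatency: t1 - t0,   // retrieval time
  llmLatency: t2 - t1,     // generation time
  tokensUsed: Math.round((prompt.length + answer.length) / 4), // rough 4-chars-per-token estimate
  userSatisfaction: false, // overwritten later by the thumbs up/down handler
};
await sendMetrics(metrics); // hypothetical sink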
Lessons Learned
What Worked
- Groq's speed is a game-changer: Users notice the difference
- Client-side embeddings protect privacy: No data leaks to OpenAI
- Streaming UX matters: 80% of users prefer progressive responses
What Was Challenging
- Chunking strategy: Too small = loss of context; too large = irrelevant info included
- Prompt engineering: Took 12 iterations to prevent hallucinations
- Edge cases: "Tell me a joke" (not in knowledge base) needs graceful handling
Future Improvements
- Multi-modal RAG: Include images from projects
- Conversation summarization: Remember previous questions
- Fine-tuned embeddings: Train on domain-specific data for better retrieval
Try It Yourself
Starter Template: GitHub - RAG Chatbot Starter
git clone https://github.com/nicolasavril/rag-chatbot.git
cd rag-chatbot
npm install
echo "GROQ_API_KEY=your_key_here" > .env.local
npm run dev
Customize for your use case:
- Replace resume.txt with your documents
- Adjust chunkSize and topK parameters
- Modify the system prompt for your domain
Conclusion
RAG transforms LLMs from "knows everything but nothing specific" to "expert in your domain." By combining:
- Semantic search (embeddings + cosine similarity)
- Context injection (retrieval before generation)
- Fast inference (Groq's optimized models)
...we built a chatbot that answers domain-specific questions in <2 seconds while costing pennies per 1000 interactions.
Key Takeaway: RAG is not just for chatbots—apply it to:
- Customer support (search docs + generate answers)
- Legal research (case law retrieval + summarization)
- Internal knowledge bases (company wiki + Q&A)
Resources
Code & Demo:
Further Reading:
About the Author
Nicolas Avril is a Data Scientist & AI Engineer specializing in NLP, Machine Learning, and Business Intelligence. He builds production-ready AI tools with a focus on performance, cost-efficiency, and user experience.
Connect: LinkedIn | Portfolio | GitHub
Questions about RAG implementation? Drop a comment below or reach out on LinkedIn. Follow for more deep dives into practical AI engineering.