Document Chunking & Embedding
Learn to split documents into well-sized chunks and generate vector embeddings, the foundation of any RAG workflow.
Project Overview
Advanced, 15 min
RAG Pipeline Phase 1: Chunking & Embedding
Large language models have a limited context window and no built-in access to your private data. Retrieval-augmented generation (RAG) addresses both problems: it retrieves the most relevant chunks of your documents at query time and feeds them into the LLM as context. The result is answers that are grounded in your own documents and can cite their sources.
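To make "feeds them into the LLM as context" concrete, here is a minimal sketch of how retrieved chunks might be assembled into a prompt. The function name `build_context_prompt`, the instruction wording, and the sample chunks are all assumptions made for illustration; they are not taken from any library.

```python
def build_context_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Number each retrieved chunk so the model can cite it as [1], [2], ..."""
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "Cite the chunks you rely on by their bracketed number.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Made-up chunks standing in for real retrieval results.
prompt = build_context_prompt(
    "How many vacation days do new employees get?",
    ["New employees accrue 15 vacation days per year.", "Unused days roll over once."],
)
print(prompt)
```

Numbering the chunks is one simple way to let the model point back to its sources; the retrieval step that produces `retrieved_chunks` is what the rest of this project builds toward.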
Step 1: Document Chunking
Chunking strategy is often one of the largest factors in RAG quality. Chunks that are too small lose context; chunks that are too large dilute relevance. A reasonable starting point for most use cases is 400-600 tokens with a 50-100 token overlap, so that content near a chunk boundary appears in both neighboring chunks.
from langchain_text_splitters import RecursiveCharacterTextSplitter
import tiktoken

def chunk_document(text: str, chunk_size: int = 500, overlap: int = 80) -> list[str]:
    """Split a document into overlapping chunks optimized for embedding.

    Uses tiktoken to count tokens accurately for OpenAI models.
    RecursiveCharacterTextSplitter tries to split on paragraphs first,
    then sentences, then words, to preserve semantic coherence.
    """
    enc = tiktoken.encoding_for_model("text-embedding-3-small")
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap,
        # Measure length in tokens rather than characters
        length_function=lambda t: len(enc.encode(t)),
        separators=["\n\n", "\n", ". ", " ", ""],
    )
    chunks = splitter.split_text(text)
    print(f"Split into {len(chunks)} chunks (target: {chunk_size} tokens, overlap: {overlap})")
    return chunks

# Example usage
with open("company_handbook.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

chunks = chunk_document(raw_text)
print(f"First chunk preview: {chunks[0][:200]}...")

- RecursiveCharacterTextSplitter cascades through separator types, keeping paragraphs intact when possible.
- Overlap prevents information loss at chunk boundaries — critical for questions that span two paragraphs.
- Use tiktoken (not len()) to count tokens, since character count does not map linearly to token count.
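To see what the overlap parameter does in isolation, here is a minimal sliding-window sketch. It is not how RecursiveCharacterTextSplitter works internally: it ignores separators entirely, and it treats list items as stand-ins for tokens so that it runs without tiktoken. The function name `sliding_window_chunks` is made up for this illustration.

```python
def sliding_window_chunks(
    tokens: list[str], chunk_size: int = 500, overlap: int = 80
) -> list[list[str]]:
    """Cut a token list into fixed-size windows that share `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    stride = chunk_size - overlap  # how far each window advances
    windows = []
    for start in range(0, len(tokens), stride):
        windows.append(tokens[start : start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return windows


# Toy input: 12 placeholder tokens, windows of 5 with 2 shared between neighbors.
words = [f"w{i}" for i in range(12)]
windows = sliding_window_chunks(words, chunk_size=5, overlap=2)
for window in windows:
    print(window)
```

Each window repeats the last two items of the one before it, so a sentence that straddles a boundary is still seen whole by at least one chunk. The cost is some duplicated text to embed and store.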
Step 2: Generate Embeddings
Embeddings convert text chunks into high-dimensional vectors that capture semantic meaning. Similar chunks end up close together in vector space, which is what makes retrieval work. OpenAI's text-embedding-3-small model is a reasonable default that balances cost, speed, and quality for most RAG applications.
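"Close together in vector space" is usually measured with cosine similarity. The sketch below computes it in plain Python on hand-made 3-dimensional vectors. These vectors are invented for illustration and are not real model output; real embeddings have many more dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hand-made toy vectors, NOT real embeddings.
vacation_policy = [0.9, 0.1, 0.0]
paid_time_off = [0.8, 0.2, 0.1]
server_setup = [0.0, 0.2, 0.9]

sim_related = cosine_similarity(vacation_policy, paid_time_off)
sim_unrelated = cosine_similarity(vacation_policy, server_setup)
print(f"related: {sim_related:.3f}, unrelated: {sim_unrelated:.3f}")
```

The two vectors that point in nearly the same direction score close to 1, while the unrelated pair scores close to 0. Retrieval relies on the same comparison, applied between a query embedding and every stored chunk embedding.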
from openai import OpenAI

client = OpenAI()  # uses OPENAI_API_KEY env var

def embed_chunks(chunks: list[str], model: str = "text-embedding-3-small") -> list[list[float]]:
    """Generate embeddings for a list of text chunks.

    Batches requests for efficiency. OpenAI allows up to 2048 inputs
    per request; a smaller batch also keeps each request's total token
    count down.
    """
    batch_size = 512
    all_embeddings = []
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i : i + batch_size]
        response = client.embeddings.create(input=batch, model=model)
        batch_embeddings = [item.embedding for item in response.data]
        all_embeddings.extend(batch_embeddings)
        print(f"Embedded batch {i // batch_size + 1}: {len(batch)} chunks")
    # Guard against an empty input list before reading the first vector
    dimensions = len(all_embeddings[0]) if all_embeddings else 0
    print(f"Total embeddings: {len(all_embeddings)}, dimensions: {dimensions}")
    return all_embeddings

embeddings = embed_chunks(chunks)

Step 3: Store in a Vector Database
ChromaDB is an open-source, lightweight vector database that runs in-process, with no server required. It handles storage, indexing, and similarity search. For production, you can swap in Pinecone, Weaviate, or pgvector by replacing only this storage step; chunking and embedding stay the same.
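To show what "storage and similarity search" amount to, here is a hypothetical in-memory store that scans every vector on each query. `TinyVectorStore` and `_normalize` are names invented for this sketch, and the vectors are toy values. A real vector database adds persistence and an approximate index (such as HNSW) so it does not have to compare against every row.

```python
import math


def _normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


class TinyVectorStore:
    """Illustrative brute-force store: a list of rows, scanned in full per query."""

    def __init__(self) -> None:
        self._rows: list[tuple[str, str, list[float]]] = []

    def add(self, ids: list[str], documents: list[str], embeddings: list[list[float]]) -> None:
        if not (len(ids) == len(documents) == len(embeddings)):
            raise ValueError("ids, documents, and embeddings must have equal length")
        for doc_id, doc, vec in zip(ids, documents, embeddings):
            self._rows.append((doc_id, doc, _normalize(vec)))

    def query(self, embedding: list[float], k: int = 3) -> list[tuple[str, str, float]]:
        """Return the k nearest rows as (id, document, cosine_distance), closest first."""
        q = _normalize(embedding)
        scored = [
            (doc_id, doc, 1.0 - sum(x * y for x, y in zip(q, vec)))
            for doc_id, doc, vec in self._rows
        ]
        scored.sort(key=lambda row: row[2])  # smaller distance = more similar
        return scored[:k]


# Toy data: hand-made vectors, not real embeddings.
store = TinyVectorStore()
store.add(
    ids=["a", "b", "c"],
    documents=["vacation policy", "expense reports", "server setup"],
    embeddings=[[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.2, 0.9]],
)
results = store.query([0.8, 0.2, 0.1], k=2)
for doc_id, doc, distance in results:
    print(doc_id, doc, round(distance, 3))
```

The score here is cosine distance (1 minus cosine similarity), so lower means closer. Scanning every row is fine for a few thousand chunks and becomes slow well before production scale, which is the gap an indexed vector database fills.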
import uuid

import chromadb
import tiktoken

def create_vector_store(
    chunks: list[str],
    embeddings: list[list[float]],
    collection_name: str = "documents",
    metadata_source: str = "company_handbook.txt",
) -> chromadb.Collection:
    """Store chunks and embeddings in ChromaDB with metadata."""
    client = chromadb.PersistentClient(path="./chroma_db")

    # Delete existing collection if re-indexing. The exception raised for a
    # missing collection differs between ChromaDB releases (older ones raise
    # ValueError), so this catches broadly.
    try:
        client.delete_collection(collection_name)
    except Exception:
        pass

    collection = client.create_collection(
        name=collection_name,
        metadata={"hnsw:space": "cosine"},  # cosine similarity
    )

    # Add chunks with metadata; count real tokens, not whitespace-separated words
    enc = tiktoken.encoding_for_model("text-embedding-3-small")
    ids = [str(uuid.uuid4()) for _ in chunks]
    metadatas = [
        {"source": metadata_source, "chunk_index": i, "token_count": len(enc.encode(chunk))}
        for i, chunk in enumerate(chunks)
    ]
    collection.add(
        ids=ids,
        documents=chunks,
        embeddings=embeddings,
        metadatas=metadatas,
    )
    print(f"Stored {len(chunks)} chunks in collection '{collection_name}'")
    return collection

collection = create_vector_store(chunks, embeddings)
Key Takeaways
- Chunk size is one of the most impactful RAG parameters. Start with 400-600 tokens and 50-100 tokens of overlap.
- Use RecursiveCharacterTextSplitter to preserve paragraph and sentence boundaries.
- Batch embedding requests for efficiency, and count tokens with tiktoken rather than by characters or words.
- ChromaDB provides a zero-config local vector store suitable for development and small-scale production.
Continue Learning
Retrieval & Answer Synthesis
Complete the RAG pipeline: retrieve relevant chunks, craft synthesis prompts, and generate grounded answers with citations.