Why is chunk overlap important?

When information spans two adjacent chunks, overlap ensures that the boundary region appears in both chunks, so retrieval can find it regardless of which chunk is matched.

Why should you use tiktoken instead of len() to measure chunk size?

LLMs process tokens, not characters. A 500-character chunk might be 100 tokens or 200 tokens depending on the text and the tokenizer's vocabulary. tiktoken gives you the actual token count for the target OpenAI model.

Module 6 · Lesson 1

Document Chunking & Embedding

Learn to split documents into well-sized chunks and generate vector embeddings, the foundation of any RAG workflow.

15 min read

3 quiz questions

Project Overview

Project

Advanced · 15 min

RAG Pipeline — Phase 1: Chunking & Embedding

Build a fully functional Retrieval-Augmented Generation (RAG) pipeline in Python. Phase 1 covers document ingestion: loading files, splitting them into semantically coherent chunks, generating embeddings via OpenAI, and storing them in a vector database.

Python · OpenAI API · ChromaDB

Large language models have a limited context window and no built-in access to your private data. RAG addresses both problems: it retrieves the most relevant chunks of your documents at query time and feeds them into the LLM as context. The result is answers grounded in your documents, which can cite the chunks they came from.

This project requires Python 3.10+, an OpenAI API key, and the packages: openai, chromadb, tiktoken, and langchain-text-splitters. Install them with: pip install openai chromadb tiktoken langchain-text-splitters

Step 1: Document Chunking

Chunking strategy is one of the biggest factors in RAG quality. Chunks that are too small lose context; chunks that are too large dilute relevance. A reasonable starting point for most use cases is 400-600 tokens per chunk with a 50-100 token overlap, so that text near a chunk boundary appears in both neighboring chunks.
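To make the overlap idea concrete, here is a minimal sliding-window sketch. It is an illustration only, not how RecursiveCharacterTextSplitter works internally: the function name `sliding_window_chunks` is hypothetical, and whitespace-separated words stand in for real tokens to keep the example readable.

```python
def sliding_window_chunks(tokens: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start : start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


# Assumption: words stand in for tokens; a real pipeline would count with tiktoken.
words = "refunds are issued within 30 days if the item is returned unused".split()
chunks = sliding_window_chunks(words, chunk_size=6, overlap=2)
for chunk in chunks:
    print(" ".join(chunk))
# refunds are issued within 30 days
# 30 days if the item is
# item is returned unused
```

The phrase "30 days" sits on the first boundary and appears in both the first and second chunk, so a query about the refund window can match either one.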

from langchain_text_splitters import RecursiveCharacterTextSplitter
import tiktoken

def chunk_document(text: str, chunk_size: int = 500, overlap: int = 80) -> list[str]:
    """Split a document into overlapping chunks optimized for embedding.
    
    Uses tiktoken to count tokens accurately for OpenAI models.
    RecursiveCharacterTextSplitter tries to split on paragraphs first,
    then sentences, then words — preserving semantic coherence.
    """
    enc = tiktoken.encoding_for_model("text-embedding-3-small")
    
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap,
        length_function=lambda t: len(enc.encode(t)),
        separators=["\n\n", "\n", ". ", " ", ""],
    )
    
    chunks = splitter.split_text(text)
    print(f"Split into {len(chunks)} chunks (target: {chunk_size} tokens, overlap: {overlap})")
    return chunks


# Example usage
with open("company_handbook.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

chunks = chunk_document(raw_text)
if chunks:  # an empty file produces no chunks
    print(f"First chunk preview: {chunks[0][:200]}...")

RecursiveCharacterTextSplitter cascades through separator types, keeping paragraphs intact when possible.
Overlap prevents information loss at chunk boundaries — critical for questions that span two paragraphs.
Use tiktoken (not len()) to count tokens, since character count does not map linearly to token count.
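The separator cascade can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not LangChain's implementation: the hypothetical `recursive_split` measures length in characters, drops the separators it splits on, and skips the merging and overlap steps that the real splitter performs.

```python
def recursive_split(text: str, max_len: int, separators: list[str]) -> list[str]:
    """Split on the coarsest separator first; recurse only into oversized pieces."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard cut every max_len characters.
        return [text[i : i + max_len] for i in range(0, len(text), max_len)]
    head, *rest = separators
    pieces = []
    for part in text.split(head):
        # Parts that already fit are kept whole; larger ones try the next separator.
        pieces.extend(recursive_split(part, max_len, rest))
    return pieces


text = "Refund policy.\n\nItems can be returned within 30 days. Opened items are not eligible."
pieces = recursive_split(text, max_len=40, separators=["\n\n", ". ", " "])
for piece in pieces:
    print(repr(piece))
```

The short first paragraph survives intact, while the longer second paragraph is split at the sentence boundary rather than mid-word, which is the behavior the cascade is designed to encourage.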

Step 2: Generate Embeddings

Embeddings convert text chunks into high-dimensional vectors that capture semantic meaning. Similar chunks end up close together in vector space, which is what makes retrieval work. OpenAI's text-embedding-3-small model offers a good balance of cost, speed, and quality for many RAG applications.
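As a rough illustration of "close together in vector space", the sketch below ranks a few items by cosine similarity to a query. The three-dimensional vectors are invented for the example (real embeddings from text-embedding-3-small have far more dimensions), and the labels are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors made up for illustration; they are not real model output.
toy_vectors = {
    "vacation policy": [0.9, 0.1, 0.0],
    "paid time off": [0.8, 0.2, 0.1],
    "server maintenance": [0.0, 0.2, 0.9],
}
query = [0.85, 0.15, 0.05]  # pretend embedding of "how many days off do I get?"

# Sort labels from most to least similar to the query.
ranked = sorted(
    toy_vectors,
    key=lambda name: cosine_similarity(query, toy_vectors[name]),
    reverse=True,
)
print(ranked)
```

The two leave-related items score close to 1.0 and rank ahead of the unrelated one, which is the property a vector database exploits at query time.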

from openai import OpenAI

client = OpenAI()  # uses OPENAI_API_KEY env var

def embed_chunks(chunks: list[str], model: str = "text-embedding-3-small") -> list[list[float]]:
    """Generate embeddings for a list of text chunks.
    
    Batches requests for efficiency. The OpenAI API accepts at most 2048
    inputs per request, so keep batch_size at or below that.
    """
    batch_size = 512
    all_embeddings = []
    
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i : i + batch_size]
        response = client.embeddings.create(input=batch, model=model)
        batch_embeddings = [item.embedding for item in response.data]
        all_embeddings.extend(batch_embeddings)
        print(f"Embedded batch {i // batch_size + 1}: {len(batch)} chunks")
    
    dims = len(all_embeddings[0]) if all_embeddings else 0  # avoid IndexError on empty input
    print(f"Total embeddings: {len(all_embeddings)}, dimensions: {dims}")
    return all_embeddings


embeddings = embed_chunks(chunks)
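A fixed batch size works when chunks are uniform, but a request can also be limited by the total number of tokens it carries. The sketch below groups chunks under both an input-count cap and a token budget. The helper name `batch_by_budget` is hypothetical, and the default limits are assumptions to verify against the current OpenAI API reference. The token counter is passed in as a parameter; in this pipeline it could be `lambda t: len(enc.encode(t))`.

```python
from typing import Callable, Iterable, Iterator


def batch_by_budget(
    chunks: Iterable[str],
    count_tokens: Callable[[str], int],
    max_inputs: int = 2048,      # assumed per-request input cap
    max_tokens: int = 300_000,   # assumed per-request token cap; check the API docs
) -> Iterator[list[str]]:
    """Yield batches that respect both an input-count cap and a token budget."""
    batch: list[str] = []
    batch_tokens = 0
    for chunk in chunks:
        n = count_tokens(chunk)
        # Start a new batch if adding this chunk would break either limit.
        if batch and (len(batch) >= max_inputs or batch_tokens + n > max_tokens):
            yield batch
            batch, batch_tokens = [], 0
        batch.append(chunk)
        batch_tokens += n
    if batch:
        yield batch


# Tiny limits and a word-count stand-in for the tokenizer, just to show the grouping.
demo = ["a b c", "d e", "f g h i", "j"]
batches = list(batch_by_budget(demo, lambda t: len(t.split()), max_inputs=2, max_tokens=5))
print(batches)  # [['a b c', 'd e'], ['f g h i', 'j']]
```

Each yielded batch could then be passed as the `input` argument of `client.embeddings.create`, replacing the fixed-size slicing in `embed_chunks`.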

Step 3: Store in a Vector Database

ChromaDB is an open-source, lightweight vector database that runs in-process, with no server required. It handles storage, indexing, and similarity search. For production, you can swap in Pinecone, Weaviate, or pgvector by replacing only this storage step; chunking and embedding stay the same.

import chromadb
import uuid

def create_vector_store(
    chunks: list[str],
    embeddings: list[list[float]],
    collection_name: str = "documents",
    metadata_source: str = "company_handbook.txt",
) -> chromadb.Collection:
    """Store chunks and embeddings in ChromaDB with metadata."""
    client = chromadb.PersistentClient(path="./chroma_db")
    
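    # Delete existing collection if re-indexing. The exception raised for a
    # missing collection differs between ChromaDB versions, so catch broadly.
    try:
        client.delete_collection(collection_name)
    except Exception:
        pass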
    
    collection = client.create_collection(
        name=collection_name,
        metadata={"hnsw:space": "cosine"},  # cosine similarity
    )
    
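    # Add chunks with metadata. token_count uses tiktoken (imported in Step 1);
    # counting whitespace-separated words would not give a real token count.
    enc = tiktoken.get_encoding("cl100k_base")  # encoding used by text-embedding-3-small
    ids = [str(uuid.uuid4()) for _ in chunks]
    metadatas = [
        {"source": metadata_source, "chunk_index": i, "token_count": len(enc.encode(chunk))}
        for i, chunk in enumerate(chunks)
    ]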
    
    collection.add(
        ids=ids,
        documents=chunks,
        embeddings=embeddings,
        metadatas=metadatas,
    )
    
    print(f"Stored {len(chunks)} chunks in collection '{collection_name}'")
    return collection


collection = create_vector_store(chunks, embeddings)

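Set hnsw:space to "cosine" for OpenAI embeddings. Chroma's default is squared L2 (Euclidean) distance. Because OpenAI embeddings are normalized to unit length, both metrics rank results in the same order, but cosine is the conventional choice and its scores are easier to read: Chroma reports cosine distance (1 minus cosine similarity), so 0 means the same direction and larger values mean less similar.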
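The relationship between the two metrics can be checked numerically. For unit-length vectors, squared L2 distance equals exactly twice the cosine distance, so either metric produces the same ranking. The sketch below verifies this on small made-up vectors; the helper names are hypothetical.

```python
import math


def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity; the dot product suffices for unit vectors."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Made-up 2-D vectors, normalized the way OpenAI embeddings are.
a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])
c = normalize([-1.0, 2.0])

for other in (b, c):
    # For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b) = 2 * cosine_distance
    print(squared_l2(a, other), 2 * cosine_distance(a, other))
```

Each printed pair matches up to floating-point rounding, which is why the choice of metric changes the scale of the scores here but not the order of the results.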

Test Your Knowledge

What is the recommended token size range for chunks in most RAG applications?

Key Takeaways

✓ Chunk size is one of the most impactful RAG parameters: start with 400-600 tokens and a 50-100 token overlap.
✓ Use RecursiveCharacterTextSplitter to preserve paragraph and sentence boundaries.
✓ Batch embedding requests for efficiency and always use the correct token counter (tiktoken).
✓ ChromaDB provides a zero-config local vector store suitable for development and small-scale production.
