Document Chunking & Embedding

Learn to split documents into well-sized chunks and generate vector embeddings, the foundation of any RAG workflow.

15 min read
3 quiz questions

Project Overview

Advanced · 15 min

RAG Pipeline — Phase 1: Chunking & Embedding

Build a fully functional Retrieval-Augmented Generation (RAG) pipeline in Python. Phase 1 covers document ingestion: loading files, splitting them into semantically coherent chunks, generating embeddings via OpenAI, and storing them in a vector database.
Python · OpenAI API · ChromaDB

Large language models have a limited context window and no built-in access to your private data. RAG addresses both problems: it retrieves the most relevant chunks of your documents at query time and feeds them into the LLM as context. The result is answers grounded in your own documents, which can cite the chunks they came from.

This project requires Python 3.10+, an OpenAI API key, and four packages: openai, chromadb, tiktoken, and langchain-text-splitters. Install them with:

pip install openai chromadb tiktoken langchain-text-splitters

Step 1: Document Chunking

Chunking strategy often has the largest impact on RAG quality. Chunks that are too small lose context; chunks that are too large dilute relevance. A good starting point for most use cases is 400-600 tokens with a 50-100 token overlap, so that content near a chunk boundary appears in both neighboring chunks.
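To make chunk size and overlap concrete, here is a minimal sliding-window sketch. It is not the splitter used in this project: the function name `sliding_window_chunks` is hypothetical, and whitespace-separated words stand in for real tokens to keep the example dependency-free.

```python
def sliding_window_chunks(tokens: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each window advances
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start : start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return windows


# Words stand in for tokens here (an assumption made for brevity).
words = "a b c d e f g h i j".split()
for window in sliding_window_chunks(words, chunk_size=4, overlap=1):
    print(window)
# ['a', 'b', 'c', 'd']
# ['d', 'e', 'f', 'g']
# ['g', 'h', 'i', 'j']
```

Each window repeats the last token of the previous one, which is the effect the overlap setting has at real chunk boundaries. The implementation for this project, which also respects paragraph and sentence separators, follows.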

from langchain_text_splitters import RecursiveCharacterTextSplitter
import tiktoken

def chunk_document(text: str, chunk_size: int = 500, overlap: int = 80) -> list[str]:
    """Split a document into overlapping chunks optimized for embedding.
    
    Uses tiktoken to count tokens accurately for OpenAI models.
    RecursiveCharacterTextSplitter tries to split on paragraphs first,
    then sentences, then words — preserving semantic coherence.
    """
    enc = tiktoken.encoding_for_model("text-embedding-3-small")
    
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap,
        length_function=lambda t: len(enc.encode(t)),
        separators=["\n\n", "\n", ". ", " ", ""],
    )
    
    chunks = splitter.split_text(text)
    print(f"Split into {len(chunks)} chunks (target: {chunk_size} tokens, overlap: {overlap})")
    return chunks


# Example usage
with open("company_handbook.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

chunks = chunk_document(raw_text)
print(f"First chunk preview: {chunks[0][:200]}...")
  • RecursiveCharacterTextSplitter cascades through separator types, keeping paragraphs intact when possible.
  • Overlap prevents information loss at chunk boundaries — critical for questions that span two paragraphs.
  • Use tiktoken (not len()) to count tokens, since character count does not map linearly to token count.
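After splitting, it can help to confirm that the chunks actually land near the target size. The sketch below is one way to do that; `chunk_report` is a hypothetical helper, and the whitespace word counter is only a stand-in so the example runs without extra packages.

```python
from typing import Callable


def chunk_report(chunks: list[str], count_tokens: Callable[[str], int], limit: int) -> dict:
    """Summarize chunk lengths and list the indexes of chunks above `limit`."""
    counts = [count_tokens(chunk) for chunk in chunks]
    if not counts:
        return {"chunks": 0, "min": 0, "max": 0, "mean": 0.0, "over_limit": []}
    return {
        "chunks": len(counts),
        "min": min(counts),
        "max": max(counts),
        "mean": sum(counts) / len(counts),
        "over_limit": [i for i, n in enumerate(counts) if n > limit],
    }


# Stand-in counter: whitespace words. In the pipeline you would pass a
# tiktoken-based counter instead, e.g. lambda t: len(enc.encode(t)).
sample = ["one two three", "four five", "six seven eight nine ten"]
report = chunk_report(sample, count_tokens=lambda t: len(t.split()), limit=4)
print(report)
```

Passing the counter as an argument keeps the report consistent with whatever length function the splitter was configured with.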

Step 2: Generate Embeddings

Embeddings convert text chunks into high-dimensional vectors that capture semantic meaning. Similar chunks end up close together in vector space, which is what makes retrieval work. OpenAI's text-embedding-3-small model is a reasonable default for most RAG applications, balancing cost, speed, and quality.
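As an illustration of "close together in vector space", the sketch below ranks three texts by cosine similarity to a query. The 3-dimensional vectors are made up for the example; real embeddings have far more dimensions and come from the model, not from hand-written numbers.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, chosen so the first two point in similar directions.
vectors = {
    "refund policy": [0.9, 0.1, 0.0],
    "return an item": [0.8, 0.2, 0.1],
    "office parking": [0.0, 0.2, 0.9],
}

query = vectors["refund policy"]
ranked = sorted(vectors, key=lambda name: cosine_similarity(query, vectors[name]), reverse=True)
print(ranked)  # ['refund policy', 'return an item', 'office parking']
```

A vector database performs the same kind of ranking, but with an index so it does not have to compare the query against every stored vector.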

from openai import OpenAI

client = OpenAI()  # uses OPENAI_API_KEY env var

def embed_chunks(chunks: list[str], model: str = "text-embedding-3-small") -> list[list[float]]:
    """Generate embeddings for a list of text chunks.
    
    Batches requests for efficiency. The embeddings endpoint caps the
    number of inputs per request (2048 at the time of writing), so the
    batch size below stays well under that limit.
    """
    batch_size = 512
    all_embeddings = []
    
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i : i + batch_size]
        response = client.embeddings.create(input=batch, model=model)
        batch_embeddings = [item.embedding for item in response.data]
        all_embeddings.extend(batch_embeddings)
        print(f"Embedded batch {i // batch_size + 1}: {len(batch)} chunks")
    
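    # Guard against an empty chunk list, where all_embeddings[0] would raise IndexError
    dims = len(all_embeddings[0]) if all_embeddings else 0
    print(f"Total embeddings: {len(all_embeddings)}, dimensions: {dims}")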
    return all_embeddings


embeddings = embed_chunks(chunks)
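The function above batches by item count only. If the endpoint also limits the total number of tokens per request (worth checking against the current OpenAI documentation), a batcher can respect both limits at once. The sketch below is one possible approach; `batch_by_budget` and the limits used in the example are hypothetical, and the word counter is a stand-in for a tiktoken-based one.

```python
from typing import Callable, Iterable, Iterator


def batch_by_budget(
    items: Iterable[str],
    max_items: int,
    max_tokens: int,
    count_tokens: Callable[[str], int],
) -> Iterator[list[str]]:
    """Yield batches that stay within both an item cap and a token cap."""
    if max_items <= 0 or max_tokens <= 0:
        raise ValueError("max_items and max_tokens must be positive")

    batch: list[str] = []
    used = 0  # tokens already in the current batch
    for item in items:
        n = count_tokens(item)
        if n > max_tokens:
            raise ValueError("a single item exceeds the token budget")
        # Close the current batch if adding this item would break either cap.
        if batch and (len(batch) >= max_items or used + n > max_tokens):
            yield batch
            batch, used = [], 0
        batch.append(item)
        used += n
    if batch:
        yield batch


# Placeholder limits and a whitespace word counter, for illustration only.
texts = ["a b", "c d e", "f", "g h i j"]
batches = list(batch_by_budget(texts, max_items=2, max_tokens=4, count_tokens=lambda t: len(t.split())))
print(batches)  # [['a b'], ['c d e', 'f'], ['g h i j']]
```

Each yielded batch could then be passed as the `input` argument of one embeddings request, in place of the fixed-size slices.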

Step 3: Store in a Vector Database

ChromaDB is an open-source, lightweight vector database that runs in-process — no server required. It handles storage, indexing, and similarity search. For production, you can swap in Pinecone, Weaviate, or pgvector without changing the rest of the pipeline.
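Swapping the database later is easier if the rest of the pipeline depends on a small interface rather than on one client library. The sketch below shows one way to express that; the `VectorStore` protocol, `InMemoryStore`, and `index_documents` are hypothetical names, not part of ChromaDB or any other library, and the in-memory store uses brute-force search that only suits tests.

```python
import math
from typing import Protocol


class VectorStore(Protocol):
    """The minimal operations the pipeline needs from any backend."""

    def add(self, ids: list[str], documents: list[str], embeddings: list[list[float]]) -> None: ...

    def query(self, embedding: list[float], k: int) -> list[str]: ...


class InMemoryStore:
    """Brute-force reference implementation, useful for unit tests."""

    def __init__(self) -> None:
        self._rows: list[tuple[str, str, list[float]]] = []

    def add(self, ids: list[str], documents: list[str], embeddings: list[list[float]]) -> None:
        if not (len(ids) == len(documents) == len(embeddings)):
            raise ValueError("ids, documents, and embeddings must have equal length")
        self._rows.extend(zip(ids, documents, embeddings))

    def query(self, embedding: list[float], k: int) -> list[str]:
        def score(vec: list[float]) -> float:
            dot = sum(x * y for x, y in zip(embedding, vec))
            norm = math.sqrt(sum(x * x for x in embedding)) * math.sqrt(sum(y * y for y in vec))
            return dot / norm if norm else 0.0

        ranked = sorted(self._rows, key=lambda row: score(row[2]), reverse=True)
        return [doc for _, doc, _ in ranked[:k]]


def index_documents(store: VectorStore, documents: list[str], embeddings: list[list[float]]) -> None:
    """Pipeline code that only knows about the VectorStore interface."""
    ids = [f"chunk-{i}" for i in range(len(documents))]
    store.add(ids, documents, embeddings)


store = InMemoryStore()
index_documents(store, ["vacation policy", "expense reports"], [[1.0, 0.0], [0.0, 1.0]])
print(store.query([0.9, 0.1], k=1))  # ['vacation policy']
```

A ChromaDB-backed class with the same two methods could replace `InMemoryStore` without touching `index_documents`. The direct ChromaDB version used in this project follows.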

import chromadb
import uuid

def create_vector_store(
    chunks: list[str],
    embeddings: list[list[float]],
    collection_name: str = "documents",
    metadata_source: str = "company_handbook.txt",
) -> chromadb.Collection:
    """Store chunks and embeddings in ChromaDB with metadata."""
    client = chromadb.PersistentClient(path="./chroma_db")
    
    # Delete existing collection if re-indexing
    try:
        client.delete_collection(collection_name)
    except Exception:  # collection does not exist yet; the exception type varies across ChromaDB versions
        pass
    
    collection = client.create_collection(
        name=collection_name,
        metadata={"hnsw:space": "cosine"},  # cosine similarity
    )
    
    # Add chunks with metadata
    ids = [str(uuid.uuid4()) for _ in chunks]
    metadatas = [
        # len(chunk.split()) counts whitespace-separated words, not model tokens
        {"source": metadata_source, "chunk_index": i, "word_count": len(chunk.split())}
        for i, chunk in enumerate(chunks)
    ]
    
    collection.add(
        ids=ids,
        documents=chunks,
        embeddings=embeddings,
        metadatas=metadatas,
    )
    
    print(f"Stored {len(chunks)} chunks in collection '{collection_name}'")
    return collection


collection = create_vector_store(chunks, embeddings)
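Set hnsw:space to "cosine" for OpenAI embeddings. ChromaDB defaults to squared L2 (Euclidean) distance, which also works: OpenAI embeddings are normalized to unit length, so both metrics rank results identically. Cosine distance (1 minus cosine similarity) is simply easier to interpret, with 0 meaning the two vectors point in the same direction.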
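To see how the two distance settings relate for unit-length vectors, here is a small numeric check. It relies on the identity ||a - b||^2 = 2 - 2 cos(a, b), which holds only when both vectors have length 1; the vectors themselves are arbitrary examples.

```python
import math


def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity; assumes a and b are already unit length."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


query = normalize([1.0, 2.0, 0.5])
candidates = [normalize(v) for v in ([1.0, 1.8, 0.7], [0.2, 0.1, 3.0], [2.0, 0.5, 0.5])]

# For unit vectors, squared L2 equals exactly twice the cosine distance.
for c in candidates:
    print(round(squared_l2(query, c), 6), round(2 * cosine_distance(query, c), 6))

order = range(len(candidates))
by_cosine = sorted(order, key=lambda i: cosine_distance(query, candidates[i]))
by_l2 = sorted(order, key=lambda i: squared_l2(query, candidates[i]))
print(by_cosine == by_l2)  # True: both metrics produce the same ranking
```

The choice therefore affects the scale of the reported distances rather than which chunks are retrieved, as long as the stored vectors are normalized.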

Test Your Knowledge

What is the recommended token size range for chunks in most RAG applications?

Key Takeaways

  • Chunk size is often the most impactful RAG parameter; start with 400-600 tokens and a 50-100 token overlap.
  • Use RecursiveCharacterTextSplitter to preserve paragraph and sentence boundaries.
  • Batch embedding requests for efficiency and always use the correct token counter (tiktoken).
  • ChromaDB provides a zero-config local vector store suitable for development and small-scale production.