Build an AI Document Summarizer: Condense 100 Pages Into 5 Minutes of Reading

Research papers average 8,000 words. Business reports stretch to 50 pages. Legal contracts run to hundreds of pages. At an average reading speed of 250 words per minute, a single 8,000-word paper takes about 32 minutes. The math doesn’t work: there’s more to read than time to read it.

AI document summarization compresses hours of reading into minutes without losing the essential information. But simple “summarize this” approaches miss nuance, lose important details, and produce generic summaries that could apply to any document. This tutorial builds a sophisticated summarization system that handles multiple formats, offers multiple summary levels, and maintains citation tracking so you can verify every claim.

What We’re Building

A document summarization tool that:

Handles PDFs, Word documents, web articles, and plain text
Generates multi-level summaries (TL;DR, executive summary, detailed summary)
Extracts key findings, arguments, and data points
Maintains citations linking summary claims to source paragraphs
Supports batch processing for multiple documents
Compares and synthesizes across multiple documents

Tech Stack

Python 3.11+
PyMuPDF (fitz) for PDF processing
Claude API for summarization
Streamlit for the web interface
python-docx for Word document processing

Step 1: Document Ingestion

import fitz  # PyMuPDF
from docx import Document
import requests
from bs4 import BeautifulSoup
from dataclasses import dataclass

@dataclass
class DocumentContent:
    title: str
    text: str
    sections: list
    page_count: int
    word_count: int
    source: str

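def extract_from_pdf(filepath: str) -> DocumentContent:
    sections = []
    full_text = []

    # The context manager closes the file even if extraction fails
    with fitz.open(filepath) as doc:
        for page_num, page in enumerate(doc):
            text = page.get_text()
            full_text.append(text)
            # Keep per-page text so citations can point back to a page number
            sections.append({
                "page": page_num + 1,
                "text": text
            })
        # A missing title comes back as an empty value rather than a missing
        # key, so fall back to the file path with "or"
        title = (doc.metadata or {}).get("title") or filepath
        page_count = len(doc)

    text = "\n".join(full_text)
    return DocumentContent(
        title=title,
        text=text,
        sections=sections,
        page_count=page_count,
        word_count=len(text.split()),
        source=filepath
    )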

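def extract_from_url(url: str) -> DocumentContent:
    # A timeout keeps one slow server from hanging the whole pipeline
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
    # Fail on 4xx/5xx responses instead of summarizing an error page
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    # Remove scripts, styles, and page chrome (navigation, header, footer)
    for tag in soup(["script", "style", "nav", "footer", "header"]):
        tag.decompose()

    # soup.title.string is None when <title> is empty or contains nested tags
    title = soup.title.string.strip() if soup.title and soup.title.string else url
    text = soup.get_text(separator="\n", strip=True)

    return DocumentContent(
        title=title, text=text, sections=[],
        page_count=1, word_count=len(text.split()), source=url
    )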

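def extract_from_docx(filepath: str) -> DocumentContent:
    doc = Document(filepath)
    # Skip empty paragraphs, which Word uses for vertical spacing
    paragraphs = [p.text for p in doc.paragraphs if p.text.strip()]
    text = "\n".join(paragraphs)

    # Use the document's own title property when it is set.
    # python-docx does not expose a rendered page count, so 1 is a placeholder.
    return DocumentContent(
        title=doc.core_properties.title or filepath, text=text, sections=[],
        page_count=1, word_count=len(text.split()), source=filepath
    )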
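
The three extractors above still need something to choose between them. Here is a minimal sketch of that routing step, assuming the source type can be inferred from the URL scheme or the file extension. `detect_source_type` is a hypothetical helper, not part of the original code:

```python
from pathlib import Path
from urllib.parse import urlparse


def detect_source_type(source: str) -> str:
    """Classify a source string as 'url', 'pdf', 'docx', or 'text'.

    Hypothetical helper: it looks only at the URL scheme and the file
    extension, so a mislabeled file is routed to the wrong extractor.
    """
    if urlparse(source).scheme in ("http", "https"):
        return "url"
    suffix = Path(source).suffix.lower()
    if suffix == ".pdf":
        return "pdf"
    if suffix == ".docx":
        return "docx"
    if suffix in (".txt", ".md", ""):
        return "text"
    raise ValueError(f"Unsupported document type: {suffix}")
```

A dispatcher could then map "pdf" to extract_from_pdf, "docx" to extract_from_docx, and "url" to extract_from_url, and read "text" sources directly from disk.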

Step 2: Chunked Summarization for Long Documents

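For documents too long to summarize well in a single request (the code below uses a 4,000-word cutoff), use a hierarchical approach: summarize each chunk separately, then synthesize the chunk summaries into one result: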

import json

from anthropic import Anthropic

client = Anthropic()

def chunk_text(text: str, max_chunk_words: int = 3000) -> list:
    words = text.split()
    chunks = []
    for i in range(0, len(words), max_chunk_words):
        chunk = " ".join(words[i:i + max_chunk_words])
        chunks.append(chunk)
    return chunks

def summarize_long_document(doc: DocumentContent) -> dict:
    # Short documents fit in one request. summarize_single() and
    # synthesize_summaries() must be defined elsewhere in the module.
    if doc.word_count <= 4000:
        return summarize_single(doc.text)

    # Hierarchical summarization: summarize each chunk, then merge the results
    chunks = chunk_text(doc.text)
    chunk_summaries = []

    for i, chunk in enumerate(chunks):
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            system="""Summarize this section of a document. Preserve key facts,
arguments, data points, and quotes. Include section/page references where visible.
Return structured JSON: {"summary": "...", "key_points": [...], "data_points": [...]}""",
            messages=[{"role": "user",
                      "content": f"Section {i+1}/{len(chunks)}:\n\n{chunk}"}]
        )
        # Keep the raw reply text; it is only used as input to the synthesis step
        chunk_summaries.append(response.content[0].text)

    # Synthesize chunk summaries into final summary
    combined = "\n\n---\n\n".join(chunk_summaries)
    return synthesize_summaries(combined, doc.title)
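
summarize_long_document calls summarize_single and synthesize_summaries, which the code above does not define. One possible shape for them is sketched here. The prompts, the `_ask` helper, the `client` parameter, and the returned dict layout are all assumptions, not the author's implementation. The client is passed in explicitly so the functions can be exercised with a stub:

```python
DEFAULT_MODEL = "claude-sonnet-4-20250514"  # same model ID used elsewhere in this tutorial


def _ask(client, system: str, user: str, model: str = DEFAULT_MODEL,
         max_tokens: int = 2048) -> str:
    """Send one system prompt plus one user message; return the reply text."""
    response = client.messages.create(
        model=model,
        max_tokens=max_tokens,
        system=system,
        messages=[{"role": "user", "content": user}],
    )
    return response.content[0].text


def summarize_single(text: str, client) -> dict:
    """Hypothetical: summarize a short document in a single request."""
    system = "Summarize this document. Preserve key facts, arguments, and data points."
    return {"summary": _ask(client, system, text)}


def synthesize_summaries(combined: str, title: str, client) -> dict:
    """Hypothetical: merge per-chunk summaries into one document-level summary."""
    system = (
        "These are summaries of consecutive sections of one document. "
        "Merge them into a single coherent summary without repeating points."
    )
    user = f"Document title: {title}\n\n{combined}"
    return {"title": title, "summary": _ask(client, system, user)}
```

To match the call sites in summarize_long_document exactly, either drop the `client` parameter and use the module-level client, or bind it once with functools.partial.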

Step 3: Multi-Level Summaries

def generate_multi_level_summary(text: str) -> dict:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=3000,
        system="""Generate three levels of summary for this document.

Return JSON:
{
    "tldr": "1-2 sentence summary (max 50 words)",
    "executive_summary": "3-5 paragraph summary covering main points (200-300 words)",
    "detailed_summary": "Comprehensive summary preserving key arguments, evidence, and conclusions (500-800 words)",
    "key_findings": ["list of 5-10 most important findings or arguments"],
    "key_data": ["any specific numbers, statistics, or data points mentioned"],
    "methodology": "how the research/analysis was conducted (if applicable)",
    "limitations": "noted limitations or caveats",
    "citations_needed": ["claims that should be verified against the source"]
}""",
        messages=[{"role": "user", "content": text}]
    )

    text_resp = response.content[0].text
    if "```json" in text_resp:
        text_resp = text_resp.split("```json")[1].split("```")[0]
    return json.loads(text_resp.strip())
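
The fence-stripping logic above only handles replies wrapped in a fence tagged json. Models sometimes return a bare fence or add a sentence before the object, and then json.loads fails. A more forgiving parser might look like the sketch below. `parse_json_response` is a hypothetical helper, and the fallback of taking the outermost brace-delimited span is a heuristic, not a guarantee:

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence, built this way to avoid a literal fence here


def parse_json_response(raw: str) -> dict:
    """Parse a JSON object from a model reply that may wrap it in a fence or prose."""
    # Prefer the contents of a fenced block, with or without a "json" tag
    fenced = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, raw, re.DOTALL)
    candidate = fenced.group(1) if fenced else raw
    try:
        return json.loads(candidate.strip())
    except json.JSONDecodeError:
        # Fall back to the outermost {...} span, which tolerates prose around the object
        start, end = candidate.find("{"), candidate.rfind("}")
        if start == -1 or end <= start:
            raise ValueError("No JSON object found in model response")
        return json.loads(candidate[start:end + 1])
```

The same helper could replace the repeated split-and-parse lines in the later steps, so a parsing fix only has to be made in one place.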

Step 4: Citation Tracking

def add_citations(summary: str, source_text: str) -> dict:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        system="""For each claim in the summary, find the corresponding passage
in the source text and add a citation marker. Return the summary with [1], [2], etc.
markers and a list of source passages.

Return JSON:
{"annotated_summary": "summary with [1] [2] markers",
 "citations": [{"id": 1, "claim": "...", "source_passage": "exact quote from source"}]}""",
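        # Only the first 8,000 characters of the source are sent, so claims drawn
        # from later in the document cannot be matched to a passage
        messages=[{"role": "user",
                  "content": f"Summary:\n{summary}\n\nSource:\n{source_text[:8000]}"}]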
    )
    text = response.content[0].text
    if "```json" in text:
        text = text.split("```json")[1].split("```")[0]
    return json.loads(text.strip())
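
The prompt asks for an "exact quote from source", but nothing above checks that the model complied. A simple post-check is to confirm that each quoted passage really occurs in the source text. The sketch below assumes the citations list has the shape requested in the prompt. `verify_citations` and the added `verified` field are hypothetical names:

```python
import re


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line wraps don't cause false mismatches."""
    return re.sub(r"\s+", " ", text).strip().lower()


def verify_citations(citations: list[dict], source_text: str) -> list[dict]:
    """Return a copy of each citation with a 'verified' flag added.

    A citation is verified only when its quoted passage appears in the
    source after whitespace and case normalization.
    """
    haystack = _normalize(source_text)
    checked = []
    for citation in citations:
        passage = _normalize(citation.get("source_passage", ""))
        checked.append({**citation, "verified": bool(passage) and passage in haystack})
    return checked
```

Unverified citations are good candidates to flag in the interface, since they mark claims where the quoted evidence could not be found verbatim.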

Step 5: Multi-Document Synthesis

def synthesize_documents(summaries: list[dict]) -> dict:
    combined = "\n\n===\n\n".join([
        f"Document: {s['title']}\nSummary: {s['summary']}"
        for s in summaries
    ])

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        system="""Synthesize multiple document summaries into a coherent analysis.
Identify:
1. Common themes across documents
2. Contradictions or disagreements
3. Gaps in coverage
4. Overall narrative

Return JSON with synthesis, themes, contradictions, and recommendations.""",
        messages=[{"role": "user", "content": combined}]
    )
    text = response.content[0].text
    if "```json" in text:
        text = text.split("```json")[1].split("```")[0]
    return json.loads(text.strip())
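
synthesize_documents expects each entry to carry a 'title' and a 'summary', while generate_multi_level_summary returns keys such as 'executive_summary'. A small adapter can bridge the two. This is a sketch only: `to_synthesis_input` is a hypothetical helper, and preferring the executive summary is an assumption about which level works best for synthesis:

```python
def to_synthesis_input(title: str, multi_level: dict) -> dict:
    """Map a multi-level summary to the {'title', 'summary'} shape used for synthesis."""
    # Prefer the executive summary; fall back to another level if it is absent
    summary = (
        multi_level.get("executive_summary")
        or multi_level.get("detailed_summary")
        or multi_level.get("tldr")
        or ""
    )
    return {"title": title, "summary": summary}
```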

Step 6: Streamlit Interface

Build a clean interface with:

File upload (PDF, DOCX, TXT) or URL input
Tab-based display for TL;DR, Executive Summary, and Detailed Summary
Collapsible citation references
Key findings highlighted in a sidebar
Multi-document mode with synthesis tab
Download summary as formatted PDF or Markdown
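
The Markdown download in the list above needs the summary dict rendered as text first. One way to do that is sketched below, assuming the dict uses the keys requested in Step 3. `summary_to_markdown` is a hypothetical helper, and the section order is a design choice rather than something the original specifies:

```python
def summary_to_markdown(title: str, result: dict) -> str:
    """Render a multi-level summary dict as a Markdown document."""
    lines = [f"# {title}", ""]
    sections = [
        ("TL;DR", "tldr"),
        ("Executive Summary", "executive_summary"),
        ("Detailed Summary", "detailed_summary"),
    ]
    # Emit only the levels that are present and non-empty
    for heading, key in sections:
        if result.get(key):
            lines += [f"## {heading}", "", result[key].strip(), ""]
    if result.get("key_findings"):
        lines += ["## Key Findings", ""]
        lines += [f"- {item}" for item in result["key_findings"]]
        lines.append("")
    # End with exactly one trailing newline
    return "\n".join(lines).rstrip() + "\n"
```

In Streamlit, the returned string can be passed as the data argument of st.download_button, with a file name such as summary.md and the MIME type text/markdown.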

Step 7: Batch Processing

For processing multiple documents:

Upload a ZIP file or folder of documents
Process in parallel with progress tracking
Generate individual summaries plus a cross-document synthesis
Export results as a single report
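
The parallel step with progress tracking can be sketched with a thread pool, which suits this workload because most of the time is spent waiting on API calls. `summarize_batch`, its parameters, and the result layout are hypothetical; `summarize_one` stands in for whatever function loads and summarizes a single source:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Optional


def summarize_batch(
    sources: list[str],
    summarize_one: Callable[[str], dict],
    max_workers: int = 4,
    on_progress: Optional[Callable[[int, int], None]] = None,
) -> dict:
    """Summarize many sources concurrently, collecting results and errors separately."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its source so results can be labeled
        futures = {pool.submit(summarize_one, src): src for src in sources}
        for done, future in enumerate(as_completed(futures), start=1):
            src = futures[future]
            try:
                results[src] = future.result()
            except Exception as exc:  # one bad file should not abort the batch
                errors[src] = str(exc)
            if on_progress:
                on_progress(done, len(sources))
    return {"results": results, "errors": errors}
```

The on_progress callback receives the number of finished documents and the total, which is enough to drive a progress bar. Keeping max_workers small also helps stay under API rate limits.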

Performance and Cost Optimization

Use Claude Haiku for chunk-level summaries (cheaper, faster) and Sonnet for final synthesis (better quality)
Cache summaries for previously processed documents
Implement smart chunking that respects section boundaries instead of splitting mid-paragraph
For very long documents (100+ pages), use progressive summarization with user-guided focus areas
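
Two of the ideas above, boundary-aware chunking and caching, can be sketched without any API calls. Both functions are hypothetical. `chunk_by_paragraph` assumes paragraphs are separated by blank lines, which is not true of text joined with single newlines as in Step 1, so such text would need to be joined with blank lines first. `cache_key` assumes a summary is reusable whenever the text and model are unchanged:

```python
import hashlib
import re


def chunk_by_paragraph(text: str, max_chunk_words: int = 3000) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_chunk_words words.

    A single paragraph longer than the limit becomes its own oversized
    chunk rather than being split mid-paragraph.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current, current_words = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        # Start a new chunk when adding this paragraph would exceed the limit
        if current and current_words + n > max_chunk_words:
            chunks.append("\n\n".join(current))
            current, current_words = [], 0
        current.append(para)
        current_words += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def cache_key(text: str, model: str) -> str:
    """Stable key for a summary cache: the same text and model give the same key."""
    return hashlib.sha256(f"{model}\n{text}".encode("utf-8")).hexdigest()
```

The key can index a dict, a file on disk, or a database row. Including the model name means that switching from Haiku to Sonnet does not return a stale summary.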

The Bottom Line

An AI document summarizer can save hours per document and lets knowledge workers keep up with far more material than they could read in full. The multi-level approach lets readers go as deep as they need, from a 10-second TL;DR to a thorough detailed summary.

Build time: 3-4 hours. Cost: $0.05-0.50 per document depending on length. Impact: reading 10x more material in the same amount of time, with better retention of key points.
