Build an AI Document Summarizer: Condense 100 Pages Into 5 Minutes of Reading

Build a document summarization tool that handles PDFs, articles, and research papers. Multi-level summaries with citation tracking.

By EgoistAI

Research papers average 8,000 words. Business reports stretch to 50 pages. Legal contracts run to hundreds of pages. The average person reads about 250 words per minute, so a single 8,000-word paper takes roughly half an hour. The math doesn’t work: there’s more to read than time to read it.

AI document summarization compresses hours of reading into minutes while keeping the essential information. But simple “summarize this” prompts miss nuance, drop important details, and produce generic summaries that could apply to any document. This tutorial builds a summarization system that handles multiple formats, produces summaries at several levels of detail, and tracks citations so you can check each claim against the source.

What We’re Building

A document summarization tool that:

  1. Handles PDFs, Word documents, web articles, and plain text
  2. Generates multi-level summaries (TL;DR, executive summary, detailed summary)
  3. Extracts key findings, arguments, and data points
  4. Maintains citations linking summary claims to source paragraphs
  5. Supports batch processing for multiple documents
  6. Compares and synthesizes across multiple documents

Tech Stack

  • Python 3.11+
  • PyMuPDF (fitz) for PDF processing
  • python-docx for Word document processing
  • requests and BeautifulSoup for fetching and parsing web articles
  • Claude API (via the anthropic Python SDK) for summarization
  • Streamlit for the web interface

Step 1: Document Ingestion

import fitz  # PyMuPDF
from docx import Document
import requests
from bs4 import BeautifulSoup
from dataclasses import dataclass

@dataclass
class DocumentContent:
    title: str
    text: str
    sections: list
    page_count: int
    word_count: int
    source: str

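def extract_from_pdf(filepath: str) -> DocumentContent:
    sections = []
    full_text = []

    # The context manager closes the file handle even if extraction fails
    with fitz.open(filepath) as doc:
        for page_num, page in enumerate(doc):
            text = page.get_text()
            full_text.append(text)
            # Keep per-page text so a claim can be traced back to a page number
            sections.append({
                "page": page_num + 1,
                "text": text
            })
        # PDF metadata often carries an empty title, so fall back to the path
        title = (doc.metadata or {}).get("title") or filepath
        page_count = len(doc)

    text = "\n".join(full_text)
    return DocumentContent(
        title=title,
        text=text,
        sections=sections,
        page_count=page_count,
        word_count=len(text.split()),
        source=filepath
    )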

def extract_from_url(url: str) -> DocumentContent:
    # A timeout keeps one slow server from hanging the whole pipeline
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    # Remove scripts, styles, and page chrome (navigation, header, footer)
    for tag in soup(["script", "style", "nav", "footer", "header"]):
        tag.decompose()

    # soup.title.string is None when <title> is empty or contains nested tags
    title = soup.title.string.strip() if soup.title and soup.title.string else url
    text = soup.get_text(separator="\n", strip=True)

    return DocumentContent(
        title=title, text=text, sections=[],
        page_count=1, word_count=len(text.split()), source=url
    )

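def extract_from_docx(filepath: str) -> DocumentContent:
    doc = Document(filepath)
    # Skip empty paragraphs; text inside tables is not included here
    paragraphs = [p.text for p in doc.paragraphs if p.text.strip()]
    text = "\n".join(paragraphs)

    # python-docx does not expose a page count, so 1 is a placeholder
    return DocumentContent(
        title=doc.core_properties.title or filepath, text=text, sections=[],
        page_count=1, word_count=len(text.split()), source=filepath
    )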
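The feature list includes plain text, but the loaders above cover only PDF, URL, and DOCX input. A minimal sketch of how input routing and a plain-text loader might look is shown here. Both `detect_source_kind` and `read_plain_text` are hypothetical helpers that are not part of the original code, and routing by file extension is an assumption (a URL that points at a PDF is still classified as a URL).

```python
from pathlib import Path
from urllib.parse import urlparse


def detect_source_kind(source: str) -> str:
    """Classify an input as 'url', 'pdf', 'docx', or 'text' (hypothetical helper)."""
    scheme = urlparse(source).scheme.lower()
    if scheme in ("http", "https"):
        return "url"
    suffix = Path(source).suffix.lower()
    if suffix == ".pdf":
        return "pdf"
    if suffix == ".docx":
        return "docx"
    # Anything else (.txt, .md, no extension) is treated as plain text
    return "text"


def read_plain_text(filepath: str) -> dict:
    """Load a plain-text file into the same fields DocumentContent uses."""
    # errors="replace" avoids crashing on files with stray non-UTF-8 bytes
    text = Path(filepath).read_text(encoding="utf-8", errors="replace")
    return {
        "title": Path(filepath).name,
        "text": text,
        "sections": [],
        "page_count": 1,  # placeholder: plain text has no pages
        "word_count": len(text.split()),
        "source": filepath,
    }
```

Because the dictionary keys match the dataclass fields, the result can be turned into a `DocumentContent` with `DocumentContent(**read_plain_text(path))`.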

Step 2: Chunked Summarization for Long Documents

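For documents too long to summarize well in a single call, use a hierarchical approach: split the text into chunks, summarize each chunk, then synthesize the chunk summaries into one result: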

import json

from anthropic import Anthropic

# Reads the API key from the ANTHROPIC_API_KEY environment variable
client = Anthropic()

def chunk_text(text: str, max_chunk_words: int = 3000) -> list:
    words = text.split()
    chunks = []
    for i in range(0, len(words), max_chunk_words):
        chunk = " ".join(words[i:i + max_chunk_words])
        chunks.append(chunk)
    return chunks

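def summarize_long_document(doc: DocumentContent) -> dict:
    # Short documents go through a single call. summarize_single and
    # synthesize_summaries are helper functions not listed in this tutorial.
    if doc.word_count <= 4000:
        return summarize_single(doc.text)

    # Hierarchical summarization: summarize each chunk, then merge the results
    chunks = chunk_text(doc.text)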
    chunk_summaries = []

    for i, chunk in enumerate(chunks):
        summary = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            system="""Summarize this section of a document. Preserve key facts,
arguments, data points, and quotes. Include section/page references where visible.
Return structured JSON: {"summary": "...", "key_points": [...], "data_points": [...]}""",
            messages=[{"role": "user",
                      "content": f"Section {i+1}/{len(chunks)}:\n\n{chunk}"}]
        )
        chunk_summaries.append(summary.content[0].text)

    # Synthesize chunk summaries into final summary
    combined = "\n\n---\n\n".join(chunk_summaries)
    return synthesize_summaries(combined, doc.title)
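The two helpers that `summarize_long_document` relies on could take roughly the following shape. This is a sketch under stated assumptions, not the author's implementation: the prompts, the return dictionaries, the injectable `complete` parameter, and the `_default_complete` wrapper are all invented for illustration. Only the model name and the `messages.create` call pattern come from the code above.

```python
from typing import Callable, Optional

# (system_prompt, user_content) -> model text; injectable so tests need no API key
CompleteFn = Callable[[str, str], str]


def _default_complete(system: str, user: str) -> str:
    """Call Claude through the anthropic SDK (imported lazily)."""
    from anthropic import Anthropic

    response = Anthropic().messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        system=system,
        messages=[{"role": "user", "content": user}],
    )
    return response.content[0].text


def summarize_single(text: str, complete: Optional[CompleteFn] = None) -> dict:
    """Summarize a short document in one call (assumed prompt)."""
    complete = complete or _default_complete
    summary = complete(
        "Summarize this document. Preserve key facts, arguments, and data points.",
        text,
    )
    return {"summary": summary, "chunks": 1}


def synthesize_summaries(combined: str, title: str,
                         complete: Optional[CompleteFn] = None) -> dict:
    """Merge per-chunk summaries into one document-level summary."""
    complete = complete or _default_complete
    summary = complete(
        "These are summaries of consecutive sections of one document. "
        "Combine them into a single coherent summary without repeating points.",
        f"Document title: {title}\n\n{combined}",
    )
    # Chunk summaries arrive joined by a '---' separator line
    return {"title": title, "summary": summary,
            "chunks": combined.count("\n\n---\n\n") + 1}
```

Passing the model call in as a function keeps both helpers testable without network access; in normal use the `complete` argument is simply omitted.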

Step 3: Multi-Level Summaries

def generate_multi_level_summary(text: str) -> dict:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=3000,
        system="""Generate three levels of summary for this document.

Return JSON:
{
    "tldr": "1-2 sentence summary (max 50 words)",
    "executive_summary": "3-5 paragraph summary covering main points (200-300 words)",
    "detailed_summary": "Comprehensive summary preserving key arguments, evidence, and conclusions (500-800 words)",
    "key_findings": ["list of 5-10 most important findings or arguments"],
    "key_data": ["any specific numbers, statistics, or data points mentioned"],
    "methodology": "how the research/analysis was conducted (if applicable)",
    "limitations": "noted limitations or caveats",
    "citations_needed": ["claims that should be verified against the source"]
}""",
        messages=[{"role": "user", "content": text}]
    )

    text_resp = response.content[0].text
    if "```json" in text_resp:
        text_resp = text_resp.split("```json")[1].split("```")[0]
    return json.loads(text_resp.strip())
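The fence-stripping check above only handles a reply wrapped in a code fence labelled `json`. Models sometimes use an unlabelled fence or add a sentence before the JSON, so a slightly more forgiving parser may be worth having. The sketch below is one way to do it; `parse_json_response` is a hypothetical helper name, and the fallback rules are assumptions about how the output might be wrapped.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built programmatically

# Match an optional "json" label, then capture everything up to the closing fence
_FENCED = re.compile(
    re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE),
    re.DOTALL | re.IGNORECASE,
)


def parse_json_response(text: str) -> dict:
    """Extract a JSON object from model output (hypothetical helper)."""
    match = _FENCED.search(text)
    if match:
        candidate = match.group(1)
    else:
        # No fence: fall back to the outermost braces, skipping any preamble
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end < start:
            raise ValueError("No JSON object found in model response")
        candidate = text[start:end + 1]
    return json.loads(candidate.strip())
```

Malformed JSON still raises `json.JSONDecodeError`, which is a subclass of `ValueError`, so callers can catch one exception type and retry or report the failure.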

Step 4: Citation Tracking

def add_citations(summary: str, source_text: str) -> dict:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        system="""For each claim in the summary, find the corresponding passage
in the source text and add a citation marker. Return the summary with [1], [2], etc.
markers and a list of source passages.

Return JSON:
{"annotated_summary": "summary with [1] [2] markers",
 "citations": [{"id": 1, "claim": "...", "source_passage": "exact quote from source"}]}""",
        # Only the first 8,000 characters of the source are sent, so claims
        # drawn from later passages cannot be matched to a quote
        messages=[{"role": "user",
                  "content": f"Summary:\n{summary}\n\nSource:\n{source_text[:8000]}"}]
    )
    text = response.content[0].text
    if "```json" in text:
        text = text.split("```json")[1].split("```")[0]
    return json.loads(text.strip())
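Because the prompt asks for an exact quote from the source, each returned `source_passage` can be checked mechanically rather than taken on trust. The sketch below flags citations whose quote does not occur in the source text. `verify_citations` and the whitespace and case normalization are assumptions added for illustration; they are not part of the original code.

```python
import re


def _normalize(text: str) -> str:
    """Collapse whitespace and lowercase so line breaks don't cause false misses."""
    return re.sub(r"\s+", " ", text).strip().lower()


def verify_citations(citations: list, source_text: str) -> list:
    """Return a copy of each citation with a 'verified' flag (hypothetical helper)."""
    haystack = _normalize(source_text)
    checked = []
    for citation in citations:
        passage = _normalize(citation.get("source_passage", ""))
        # An empty passage can never count as evidence
        found = bool(passage) and passage in haystack
        checked.append({**citation, "verified": found})
    return checked
```

A typical call would be `verify_citations(result["citations"], doc.text)`, using the full document text rather than the truncated slice sent to the model. Citations with `verified` set to `False` are candidates for manual review.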

Step 5: Multi-Document Synthesis

def synthesize_documents(summaries: list[dict]) -> dict:
    # Each entry must provide "title" and "summary" keys
    combined = "\n\n===\n\n".join([
        f"Document: {s['title']}\nSummary: {s['summary']}"
        for s in summaries
    ])

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        system="""Synthesize multiple document summaries into a coherent analysis.
Identify:
1. Common themes across documents
2. Contradictions or disagreements
3. Gaps in coverage
4. Overall narrative

Return JSON with synthesis, themes, contradictions, and recommendations.""",
        messages=[{"role": "user", "content": combined}]
    )
    text = response.content[0].text
    if "```json" in text:
        text = text.split("```json")[1].split("```")[0]
    return json.loads(text.strip())

Step 6: Streamlit Interface

Build a clean interface with:

  • File upload (PDF, DOCX, TXT) or URL input
  • Tab-based display for TL;DR, Executive Summary, and Detailed Summary
  • Collapsible citation references
  • Key findings highlighted in a sidebar
  • Multi-document mode with synthesis tab
  • Download summary as formatted PDF or Markdown
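For the Markdown download option in the list above, the dictionary returned by `generate_multi_level_summary` can be rendered with a small formatter. This is a sketch: `render_markdown` is a hypothetical helper, and it assumes the JSON keys requested in the Step 3 prompt. In a Streamlit app the resulting string could be handed to `st.download_button`.

```python
def render_markdown(title: str, summary: dict) -> str:
    """Format a multi-level summary as Markdown (hypothetical helper)."""
    lines = [f"# {title}", ""]
    # Prose sections, ordered from shortest to most detailed
    for key, heading in [("tldr", "TL;DR"),
                         ("executive_summary", "Executive Summary"),
                         ("detailed_summary", "Detailed Summary")]:
        if summary.get(key):
            lines += [f"## {heading}", "", summary[key].strip(), ""]
    # List sections become bullet lists; missing or empty keys are skipped
    for key, heading in [("key_findings", "Key Findings"),
                         ("key_data", "Key Data")]:
        items = summary.get(key) or []
        if items:
            lines += [f"## {heading}", ""]
            lines += [f"- {item}" for item in items]
            lines.append("")
    return "\n".join(lines).rstrip() + "\n"
```

Skipping absent keys keeps the export usable even when the model omits a field such as `key_data`.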

Step 7: Batch Processing

For processing multiple documents:

  • Upload a ZIP file or folder of documents
  • Process in parallel with progress tracking
  • Generate individual summaries plus a cross-document synthesis
  • Export results as a single report
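Parallel processing with progress tracking can be sketched with the standard library's thread pool, which suits this workload because most of the time is spent waiting on network calls. `process_batch`, `summarize_one`, and `on_progress` are hypothetical names; `summarize_one` stands in for whatever function loads and summarizes a single document.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Optional


def process_batch(
    sources: list,
    summarize_one: Callable[[str], dict],
    max_workers: int = 4,
    on_progress: Optional[Callable[[int, int], None]] = None,
) -> dict:
    """Summarize many documents concurrently, collecting failures separately."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(summarize_one, src): src for src in sources}
        # as_completed yields each future as soon as its document is finished
        for done, future in enumerate(as_completed(futures), start=1):
            src = futures[future]
            try:
                results[src] = future.result()
            except Exception as exc:  # one bad file should not stop the batch
                errors[src] = str(exc)
            if on_progress:
                on_progress(done, len(futures))
    return {"results": results, "errors": errors}
```

If each result dictionary carries `title` and `summary` keys, `list(batch["results"].values())` can be passed straight to `synthesize_documents` for the cross-document report. A small `max_workers` value is a cautious default given API rate limits.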

Performance and Cost Optimization

  • Use Claude Haiku for chunk-level summaries (cheaper, faster) and Sonnet for final synthesis (better quality)
  • Cache summaries for previously processed documents
  • Implement smart chunking that respects section boundaries instead of splitting mid-paragraph
  • For very long documents (100+ pages), use progressive summarization with user-guided focus areas
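The smart-chunking point in the list above can be sketched as a drop-in alternative to `chunk_text` that packs whole paragraphs into each chunk. `chunk_by_paragraph` is a hypothetical name, and treating blank lines as paragraph boundaries is an assumption that may not hold for every PDF extraction.

```python
import re


def chunk_by_paragraph(text: str, max_chunk_words: int = 3000) -> list:
    """Split text into chunks of whole paragraphs, each at most max_chunk_words."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current, current_words = [], [], 0

    for para in paragraphs:
        words = para.split()
        # A single oversized paragraph is split by word count as a last resort
        if len(words) > max_chunk_words:
            if current:
                chunks.append("\n\n".join(current))
                current, current_words = [], 0
            for i in range(0, len(words), max_chunk_words):
                chunks.append(" ".join(words[i:i + max_chunk_words]))
            continue
        # Start a new chunk when adding this paragraph would exceed the limit
        if current and current_words + len(words) > max_chunk_words:
            chunks.append("\n\n".join(current))
            current, current_words = [], 0
        current.append(para)
        current_words += len(words)

    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Unlike the fixed-width split, this version keeps paragraph breaks inside each chunk, which gives the model more structure to work with when it looks for section references.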

The Bottom Line

An AI document summarizer saves hours per document and lets knowledge workers keep up with far more material than they could read in full. The multi-level approach lets readers go as deep as they need, from a 10-second TL;DR to a thorough detailed summary.

Build time: 3-4 hours. Cost: $0.05-0.50 per document depending on length. Impact: reading 10x more material in the same amount of time, with better retention of key points.
