Build an AI Document Summarizer: Condense 100 Pages Into 5 Minutes of Reading
Build a document summarization tool that handles PDFs, articles, and research papers. Multi-level summaries with citation tracking.
Research papers average 8,000 words. Business reports stretch to 50 pages. Legal contracts run to hundreds of pages. And the average person reads 250 words per minute. The math doesn’t work — there’s more to read than time to read it.
AI document summarization compresses hours of reading into minutes without losing the essential information. But simple “summarize this” approaches miss nuance, lose important details, and produce generic summaries that could apply to any document. This tutorial builds a sophisticated summarization system that handles multiple formats, offers multiple summary levels, and maintains citation tracking so you can verify every claim.
What We’re Building

A document summarization tool that:
- Handles PDFs, Word documents, web articles, and plain text
- Generates multi-level summaries (TL;DR, executive summary, detailed summary)
- Extracts key findings, arguments, and data points
- Maintains citations linking summary claims to source paragraphs
- Supports batch processing for multiple documents
- Compares and synthesizes across multiple documents
Tech Stack
- Python 3.11+
- PyMuPDF (fitz) for PDF processing
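- Claude API (via the anthropic Python SDK) for summarization
- Streamlit for the web interface
- python-docx for Word document processing
- requests and BeautifulSoup (beautifulsoup4) for fetching and parsing web articles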
Step 1: Document Ingestion

import fitz  # PyMuPDF
from docx import Document
import requests
from bs4 import BeautifulSoup
from dataclasses import dataclass


@dataclass
class DocumentContent:
    title: str
    text: str
    sections: list  # per-page dicts {"page": int, "text": str}; empty for URL and DOCX sources
    page_count: int
    word_count: int
    source: str
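def extract_from_pdf(filepath: str) -> DocumentContent:
    sections = []
    full_text = []
    # The context manager closes the file once extraction is done
    with fitz.open(filepath) as doc:
        for page_num, page in enumerate(doc):
            page_text = page.get_text()
            full_text.append(page_text)
            # Keep per-page text so summaries can point back to page numbers
            sections.append({
                "page": page_num + 1,
                "text": page_text
            })
        # The metadata title is often an empty string, so fall back to the path
        title = (doc.metadata or {}).get("title") or filepath
        page_count = len(doc)
    text = "\n".join(full_text)
    return DocumentContent(
        title=title,
        text=text,
        sections=sections,
        page_count=page_count,
        word_count=len(text.split()),
        source=filepath
    )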
def extract_from_url(url: str) -> DocumentContent:
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
    resp.raise_for_status()  # fail early on 4xx/5xx instead of summarizing an error page
    soup = BeautifulSoup(resp.text, "html.parser")
    # Remove scripts, styles, and page chrome (nav, header, footer)
    for tag in soup(["script", "style", "nav", "footer", "header"]):
        tag.decompose()
    # soup.title.string is None when <title> is empty or contains nested tags
    title = soup.title.string.strip() if soup.title and soup.title.string else url
    text = soup.get_text(separator="\n", strip=True)
    return DocumentContent(
        title=title, text=text, sections=[],
        page_count=1, word_count=len(text.split()), source=url
    )
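def extract_from_docx(filepath: str) -> DocumentContent:
    doc = Document(filepath)
    # Skip empty paragraphs; text inside tables is not part of doc.paragraphs
    paragraphs = [p.text for p in doc.paragraphs if p.text.strip()]
    text = "\n".join(paragraphs)
    # python-docx does not expose a page count, so page_count=1 is a placeholder
    return DocumentContent(
        title=filepath, text=text, sections=[],
        page_count=1, word_count=len(text.split()), source=filepath
    )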
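The feature list mentions plain text alongside PDFs, Word documents, and web articles, but the extractors above need something to route each input to the right function. The following is a minimal sketch of such a router; `detect_source_type` is a hypothetical helper that is not part of the original code, and the set of accepted extensions is an assumption.

```python
from pathlib import Path
from urllib.parse import urlparse


def detect_source_type(source: str) -> str:
    """Classify a source string as 'url', 'pdf', 'docx', or 'text'.

    Hypothetical helper: the extension list below is an assumption.
    """
    parsed = urlparse(source)
    # Only http(s) URLs with a host count as web articles
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"
    suffix = Path(source).suffix.lower()
    if suffix == ".pdf":
        return "pdf"
    if suffix == ".docx":
        return "docx"
    if suffix in (".txt", ".md"):
        return "text"
    raise ValueError(f"Unsupported source: {source!r}")
```

The returned label could then select `extract_from_url`, `extract_from_pdf`, or `extract_from_docx`, with plain text files read directly from disk.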
Step 2: Chunked Summarization for Long Documents

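For documents too long to summarize well in a single request (the code below uses a 4,000-word threshold), use hierarchical summarization: summarize each chunk separately, then synthesize the chunk summaries into one result: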
import json
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def chunk_text(text: str, max_chunk_words: int = 3000) -> list:
    # Fixed-size split on whitespace; a chunk may start or end mid-sentence
    words = text.split()
    chunks = []
    for i in range(0, len(words), max_chunk_words):
        chunk = " ".join(words[i:i + max_chunk_words])
        chunks.append(chunk)
    return chunks
def summarize_long_document(doc: DocumentContent) -> dict:
    # summarize_single() and synthesize_summaries() are not shown in this
    # tutorial and must be defined separately before this function will run.
    if doc.word_count <= 4000:
        return summarize_single(doc.text)
    # Hierarchical summarization: summarize each chunk, then combine
    chunks = chunk_text(doc.text)
    chunk_summaries = []
    for i, chunk in enumerate(chunks):
        summary = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            system="""Summarize this section of a document. Preserve key facts,
            arguments, data points, and quotes. Include section/page references where visible.
            Return structured JSON: {"summary": "...", "key_points": [...], "data_points": [...]}""",
            messages=[{"role": "user",
                       "content": f"Section {i+1}/{len(chunks)}:\n\n{chunk}"}]
        )
        chunk_summaries.append(summary.content[0].text)
    # Synthesize chunk summaries into final summary
    combined = "\n\n---\n\n".join(chunk_summaries)
    return synthesize_summaries(combined, doc.title)
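Because `chunk_text` splits on a fixed word count, a chunk boundary can land in the middle of a paragraph or sentence. The sketch below shows one possible alternative that keeps whole paragraphs together. It assumes paragraphs are separated by blank lines, which may not hold for text extracted from every PDF; `chunk_by_paragraph` is a hypothetical name, not part of the original code.

```python
import re


def chunk_by_paragraph(text: str, max_chunk_words: int = 3000) -> list[str]:
    """Group whole paragraphs into chunks of at most max_chunk_words words.

    Assumes blank lines separate paragraphs. A single paragraph longer
    than the limit is split on word boundaries as a fallback.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_words = 0

    for para in paragraphs:
        words = para.split()
        # Oversized paragraph: flush the pending chunk, then hard-split it
        if len(words) > max_chunk_words:
            if current:
                chunks.append("\n\n".join(current))
                current, current_words = [], 0
            for i in range(0, len(words), max_chunk_words):
                chunks.append(" ".join(words[i:i + max_chunk_words]))
            continue
        # Start a new chunk when this paragraph would push past the limit
        if current and current_words + len(words) > max_chunk_words:
            chunks.append("\n\n".join(current))
            current, current_words = [], 0
        current.append(para)
        current_words += len(words)

    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Chunks produced this way vary in size, so the number of API calls per document becomes slightly less predictable than with a fixed split.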
Step 3: Multi-Level Summaries

def generate_multi_level_summary(text: str) -> dict:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=3000,
        system="""Generate three levels of summary for this document.
        Return JSON:
        {
            "tldr": "1-2 sentence summary (max 50 words)",
            "executive_summary": "3-5 paragraph summary covering main points (200-300 words)",
            "detailed_summary": "Comprehensive summary preserving key arguments, evidence, and conclusions (500-800 words)",
            "key_findings": ["list of 5-10 most important findings or arguments"],
            "key_data": ["any specific numbers, statistics, or data points mentioned"],
            "methodology": "how the research/analysis was conducted (if applicable)",
            "limitations": "noted limitations or caveats",
            "citations_needed": ["claims that should be verified against the source"]
        }""",
        messages=[{"role": "user", "content": text}]
    )
    text_resp = response.content[0].text
    # Strip a Markdown json fence if the model wrapped its answer in one
    if "```json" in text_resp:
        text_resp = text_resp.split("```json")[1].split("```")[0]
    # Raises json.JSONDecodeError if the reply is not valid JSON
    return json.loads(text_resp.strip())
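The fence-stripping logic above assumes the model either wraps its JSON in a json-tagged fence or returns bare JSON. A somewhat more tolerant parser might look like the sketch below, which also handles untagged fences and prose around the JSON object. `extract_json` is a hypothetical helper, not part of the original code.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way to keep this listing readable
_FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def extract_json(reply: str) -> dict:
    """Pull the first JSON object out of a model reply.

    Handles fenced blocks (with or without a json tag) and replies
    that surround the object with explanatory prose.
    """
    match = _FENCE_RE.search(reply)
    candidate = match.group(1) if match else reply
    # Narrow to the outermost {...} span to drop any surrounding prose
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end == -1 or end < start:
        raise ValueError("No JSON object found in model reply")
    return json.loads(candidate[start:end + 1])
```

A helper like this could replace the repeated split-and-parse lines in each step, so a malformed reply fails in one place with one error type to handle.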
Step 4: Citation Tracking

def add_citations(summary: str, source_text: str) -> dict:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        system="""For each claim in the summary, find the corresponding passage
        in the source text and add a citation marker. Return the summary with [1], [2], etc.
        markers and a list of source passages.
        Return JSON:
        {"annotated_summary": "summary with [1] [2] markers",
        "citations": [{"id": 1, "claim": "...", "source_passage": "exact quote from source"}]}""",
        # Only the first 8,000 characters of the source are sent, so claims
        # drawn from later in the document cannot be matched to a passage
        messages=[{"role": "user",
                   "content": f"Summary:\n{summary}\n\nSource:\n{source_text[:8000]}"}]
    )
    text = response.content[0].text
    # Strip a Markdown json fence if the model wrapped its answer in one
    if "```json" in text:
        text = text.split("```json")[1].split("```")[0]
    return json.loads(text.strip())
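The prompt asks for an "exact quote from source", but nothing in the code confirms the model complied. One way to check is to search the source for each returned passage, as in the sketch below. `verify_citations` is a hypothetical helper, and matching after whitespace and case normalization is an assumption about what should count as "exact".

```python
import re


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks don't block a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def verify_citations(citations: list[dict], source_text: str) -> list[dict]:
    """Return a copy of each citation with a boolean 'verified' flag.

    A citation is verified when its 'source_passage' occurs in the
    source text after whitespace and case normalization.
    """
    haystack = _normalize(source_text)
    checked = []
    for citation in citations:
        passage = _normalize(citation.get("source_passage", ""))
        # An empty passage never counts as verified
        checked.append({**citation, "verified": bool(passage) and passage in haystack})
    return checked
```

Unverified citations can then be flagged in the interface rather than silently trusted.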
Step 5: Multi-Document Synthesis

def synthesize_documents(summaries: list[dict]) -> dict:
    # Each dict must provide "title" and "summary" keys
    combined = "\n\n===\n\n".join([
        f"Document: {s['title']}\nSummary: {s['summary']}"
        for s in summaries
    ])
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        system="""Synthesize multiple document summaries into a coherent analysis.
        Identify:
        1. Common themes across documents
        2. Contradictions or disagreements
        3. Gaps in coverage
        4. Overall narrative
        Return JSON with synthesis, themes, contradictions, and recommendations.""",
        messages=[{"role": "user", "content": combined}]
    )
    text = response.content[0].text
    # Strip a Markdown json fence if the model wrapped its answer in one
    if "```json" in text:
        text = text.split("```json")[1].split("```")[0]
    return json.loads(text.strip())
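`synthesize_documents` reads `title` and `summary` keys, while the multi-level summary from Step 3 uses keys such as `tldr`, `executive_summary`, and `key_findings`. A small adapter can bridge the two shapes. The sketch below is one possible mapping; `to_synthesis_input` is a hypothetical helper, and choosing the executive summary plus key findings as the synthesis input is an assumption.

```python
def to_synthesis_input(title: str, multi_level: dict) -> dict:
    """Reduce a Step 3 multi-level summary to the {'title', 'summary'} shape."""
    # Prefer the executive summary; fall back to the TL;DR if it is missing
    summary = multi_level.get("executive_summary") or multi_level.get("tldr", "")
    findings = multi_level.get("key_findings", [])
    if findings:
        summary += "\nKey findings:\n" + "\n".join(f"- {item}" for item in findings)
    return {"title": title, "summary": summary}
```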
Step 6: Streamlit Interface

Build a clean interface with:
- File upload (PDF, DOCX, TXT) or URL input
- Tab-based display for TL;DR, Executive Summary, and Detailed Summary
- Collapsible citation references
- Key findings highlighted in a sidebar
- Multi-document mode with synthesis tab
- Download summary as formatted PDF or Markdown
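For the Markdown download option, the summary dictionary from Step 3 has to be turned into a document first. The sketch below shows one way to do that with no UI dependency; `render_markdown` is a hypothetical helper, and the heading layout is an assumption.

```python
def render_markdown(title: str, summary: dict) -> str:
    """Render a Step 3 multi-level summary dict as a Markdown document."""
    lines = [f"# {title}", ""]
    sections = [
        ("TL;DR", "tldr"),
        ("Executive Summary", "executive_summary"),
        ("Detailed Summary", "detailed_summary"),
    ]
    # Emit only the summary levels that are present and non-empty
    for heading, key in sections:
        if summary.get(key):
            lines += [f"## {heading}", "", summary[key], ""]
    if summary.get("key_findings"):
        lines += ["## Key Findings", ""]
        lines += [f"- {item}" for item in summary["key_findings"]]
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"
```

In a Streamlit app, the returned string could be passed as the `data` argument of `st.download_button` with a `.md` file name.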
Step 7: Batch Processing

For processing multiple documents:
- Upload a ZIP file or folder of documents
- Process in parallel with progress tracking
- Generate individual summaries plus a cross-document synthesis
- Export results as a single report
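Parallel processing with progress tracking can be sketched with the standard library's thread pool, since each summary spends most of its time waiting on network calls. In the sketch below, `summarize_batch` is a hypothetical helper; the `summarize` callable stands in for whatever function turns one source into a summary, and the default of four workers is an assumption that should be tuned to your API rate limits.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Optional


def summarize_batch(
    sources: list[str],
    summarize: Callable[[str], dict],
    max_workers: int = 4,
    on_progress: Optional[Callable[[int, int], None]] = None,
) -> dict[str, dict]:
    """Summarize sources concurrently; one failure does not stop the batch."""
    results: dict[str, dict] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(summarize, src): src for src in sources}
        for done, future in enumerate(as_completed(futures), start=1):
            src = futures[future]
            try:
                results[src] = {"ok": True, "summary": future.result()}
            except Exception as exc:
                # Record the error and keep going with the other documents
                results[src] = {"ok": False, "error": str(exc)}
            if on_progress:
                on_progress(done, len(sources))
    return results
```

The `on_progress` callback receives the number of completed documents and the total, which is enough to drive a progress bar.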
Performance and Cost Optimization

- Use Claude Haiku for chunk-level summaries (cheaper, faster) and Sonnet for final synthesis (better quality)
- Cache summaries for previously processed documents
- Implement smart chunking that respects section boundaries instead of splitting mid-paragraph
- For very long documents (100+ pages), use progressive summarization with user-guided focus areas
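Caching summaries for previously processed documents can be as simple as keying results by a hash of the document text. The sketch below stores each result as a JSON file on disk. `cached_summary` is a hypothetical helper, the `.summary_cache` directory name is an assumption, and the `summarize` callable stands in for any function that returns a JSON-serializable summary.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def cached_summary(
    text: str,
    summarize: Callable[[str], dict],
    cache_dir: str = ".summary_cache",
) -> dict:
    """Return a cached summary for this exact text, computing it on a miss."""
    # Identical text always hashes to the same key, whatever the file name
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    path = Path(cache_dir) / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    result = summarize(text)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result), encoding="utf-8")
    return result
```

Because the key depends only on the text, changing the prompt or model does not invalidate old entries; including those in the hashed input would be one way to address that.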
The Bottom Line
An AI document summarizer can save hours per document and lets knowledge workers keep up with far more material than they could read in full. The multi-level approach lets readers go as deep as they need, from a 10-second TL;DR to a thorough detailed summary.
Build time: 3-4 hours. Cost: $0.05-0.50 per document depending on length. Impact: reading 10x more material in the same amount of time, with better retention of key points.