Create an AI Content Moderator: Automate Trust and Safety at Scale
Build a content moderation system that classifies text, images, and user reports with AI. Production patterns for trust and safety.
Every platform with user-generated content faces the same problem: moderate too aggressively and you kill engagement; moderate too lightly and you become a cesspool. Manual moderation doesn’t scale. Simple keyword filters catch legitimate content while missing creative workarounds. And the psychological toll on human moderators is well-documented.
AI content moderation offers a middle path. It scales with traffic far more easily than a human team, handles nuance better than keyword filters, and reserves human review for the cases that actually need human judgment. This tutorial builds a production-grade content moderation system that classifies text, handles edge cases, and integrates with your existing platform.
What We’re Building

A content moderation system that:
- Classifies text content across multiple policy categories
- Assigns confidence scores and severity levels
- Routes low-confidence decisions to human review
- Handles appeals and feedback loops
- Provides moderation analytics and reporting
- Integrates via REST API
Tech Stack
- Python 3.11+ with FastAPI
- Claude API for content classification
- PostgreSQL for moderation logs
- Redis for rate limiting and caching
- Streamlit for the moderation dashboard
Step 1: Define Your Content Policy

Before writing any code, define your content policy categories. Here’s a common set:
POLICY_CATEGORIES = {
    "harassment": "Targeted harassment, bullying, or intimidation of individuals",
    "hate_speech": "Content promoting hatred against protected groups",
    "violence": "Graphic violence, threats, or incitement",
    "sexual_content": "Sexually explicit material or solicitation",
    "spam": "Unsolicited commercial content, repetitive posting, or manipulation",
    "misinformation": "Demonstrably false claims about health, safety, or elections",
    "self_harm": "Content promoting self-harm or suicide",
    "illegal_activity": "Content promoting illegal activities",
    "personal_info": "Sharing others' private information without consent",
    "clean": "Content that doesn't violate any policies",
}
Step 2: AI Classification Engine

import json

from anthropic import Anthropic

# The client reads ANTHROPIC_API_KEY from the environment.
client = Anthropic()

# Doubled braces ({{ }}) survive str.format(); {categories} is filled in per call.
MODERATION_PROMPT = """You are a content moderation system. Analyze the provided text and classify it.
Content Policy Categories:
{categories}
For the given text, return JSON:
{{
"primary_category": "category_name or clean",
"confidence": 0.0 to 1.0,
"severity": "none|low|medium|high|critical",
"explanation": "brief explanation of classification",
"action": "approve|flag_review|remove",
"secondary_categories": ["any additional relevant categories"]
}}
Rules:
- Consider context and intent, not just surface-level keywords
- Sarcasm, humor, and educational content should not be flagged unless genuinely harmful
- Confidence below 0.7 should recommend flag_review
- Be specific in explanations to help human reviewers
"""


def moderate_content(text: str, context: dict | None = None) -> dict:
    """Classify one piece of text against POLICY_CATEGORIES."""
    categories_text = "\n".join(
        f"- {k}: {v}" for k, v in POLICY_CATEGORIES.items()
    )
    user_content = f"Text to moderate:\n\n{text}"
    if context:
        user_content += f"\n\nContext: {json.dumps(context)}"

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=512,
        system=MODERATION_PROMPT.format(categories=categories_text),
        messages=[{"role": "user", "content": user_content}],
    )

    result_text = response.content[0].text
    # The model sometimes wraps its JSON in a Markdown fence; unwrap it.
    if "```json" in result_text:
        result_text = result_text.split("```json")[1].split("```")[0]
    # Raises json.JSONDecodeError if the reply is not valid JSON.
    return json.loads(result_text.strip())
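The prompt asks the model to recommend flag_review below 0.7 confidence, but a prompt rule is a request, not a guarantee. One option is to re-apply the rule in code after parsing. The sketch below is an assumption layered on top of the tutorial: the `enforce_policy` helper, the `REVIEW_THRESHOLD` constant, and the choice to send malformed output to review are all illustrative names and decisions, not part of the original system.

```python
VALID_ACTIONS = {"approve", "flag_review", "remove"}
VALID_SEVERITIES = {"none", "low", "medium", "high", "critical"}
REVIEW_THRESHOLD = 0.7  # same cutoff the prompt asks the model to apply


def enforce_policy(result: dict, known_categories: set[str]) -> dict:
    """Return a copy of the model output with routing rules applied in code.

    Hypothetical helper: anything malformed or low-confidence is routed to
    human review instead of being trusted.
    """
    checked = dict(result)

    # Coerce confidence to a float in [0, 1]; unparseable values become 0.0.
    try:
        confidence = float(checked.get("confidence"))
    except (TypeError, ValueError):
        confidence = 0.0
    checked["confidence"] = max(0.0, min(1.0, confidence))

    malformed = (
        checked.get("primary_category") not in known_categories
        or checked.get("action") not in VALID_ACTIONS
        or checked.get("severity") not in VALID_SEVERITIES
    )
    if malformed or checked["confidence"] < REVIEW_THRESHOLD:
        checked["action"] = "flag_review"
    return checked
```

In the tutorial's flow this would wrap the return value of `moderate_content`, with `set(POLICY_CATEGORIES)` passed as `known_categories`.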
Step 3: REST API

import uuid

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI(title="AI Content Moderator")


class ModerationRequest(BaseModel):
    text: str
    user_id: str | None = None
    content_type: str = "comment"
    context: dict | None = None


class ModerationResponse(BaseModel):
    decision: str
    category: str
    confidence: float
    severity: str
    explanation: str
    moderation_id: str


# Declared with plain `def` so FastAPI runs it in a worker thread:
# moderate_content makes a blocking API call that would stall the event loop.
@app.post("/moderate", response_model=ModerationResponse)
def moderate(request: ModerationRequest):
    if len(request.text) > 10000:
        raise HTTPException(400, "Text exceeds maximum length")
    result = moderate_content(request.text, request.context)
    # Random IDs avoid collisions between requests in the same millisecond.
    moderation_id = f"mod_{uuid.uuid4().hex}"
    # Log the decision so it can be audited and reviewed later
    log_moderation(moderation_id, request, result)
    return ModerationResponse(
        decision=result["action"],
        category=result["primary_category"],
        confidence=result["confidence"],
        severity=result["severity"],
        explanation=result["explanation"],
        moderation_id=moderation_id,
    )
Step 4: Human Review Queue

When AI confidence is below the threshold, items go to a human review queue. Build a Streamlit interface that shows flagged content with the AI’s classification, confidence, and explanation. Human moderators can approve, remove, or escalate, and their decisions feed back into the system to improve future classifications.
def get_review_queue(limit: int = 50):
    """Get items pending human review, sorted by severity."""
    conn = get_db()
    # The `?` placeholder is sqlite3 style; PostgreSQL drivers such as
    # psycopg expect `%s` instead.
    return conn.execute("""
        SELECT moderation_id, text, category, confidence, severity, explanation
        FROM moderation_log
        WHERE action = 'flag_review' AND human_decision IS NULL
        ORDER BY
            CASE severity
                WHEN 'critical' THEN 1
                WHEN 'high' THEN 2
                WHEN 'medium' THEN 3
                WHEN 'low' THEN 4
                ELSE 5  -- 'none' or unexpected values sort last
            END,
            created_at DESC
        LIMIT ?
    """, (limit,)).fetchall()
Step 5: Feedback Loop

Track where AI and human decisions diverge. This data is invaluable for identifying policy gaps, edge cases, and areas where the AI prompt needs refinement.
def analyze_disagreements():
    """Find patterns where AI and humans disagree."""
    conn = get_db()
    disagreements = conn.execute("""
        SELECT category, action, human_decision, COUNT(*) as count
        FROM moderation_log
        WHERE human_decision IS NOT NULL
          AND action != human_decision
        GROUP BY category, action, human_decision
        ORDER BY count DESC
    """).fetchall()
    return disagreements
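The disagreement query only works if `human_decision` gets filled in when a moderator acts on a queued item. A minimal sketch of that write follows. The `record_human_decision` name is hypothetical, the allowed values come from the approve/remove/escalate options described in Step 4, and `sqlite3` placeholders are assumed for portability.

```python
import sqlite3

HUMAN_DECISIONS = {"approve", "remove", "escalate"}


def record_human_decision(
    conn: sqlite3.Connection, moderation_id: str, decision: str
) -> bool:
    """Store a moderator's decision once; return False if nothing was updated.

    The `human_decision IS NULL` guard keeps a second reviewer from silently
    overwriting a decision that was already recorded.
    """
    if decision not in HUMAN_DECISIONS:
        raise ValueError(f"Unknown decision: {decision!r}")
    cursor = conn.execute(
        """
        UPDATE moderation_log
        SET human_decision = ?
        WHERE moderation_id = ? AND human_decision IS NULL
        """,
        (decision, moderation_id),
    )
    conn.commit()
    return cursor.rowcount == 1
```

The dashboard's approve, remove, and escalate buttons could each call this function with the row's `moderation_id`.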
Step 6: Performance Optimization

Caching
Cache moderation results for identical or near-identical content. Use a hash of the content as the cache key.
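As a rough illustration of the hashing idea, the sketch below normalizes text before hashing so trivially different copies share a key. The `cache_key` and `ModerationCache` names are hypothetical, and a plain dict stands in for Redis; in production you would replace the dict with Redis get/set calls and an expiry.

```python
import hashlib
import json
import unicodedata
from typing import Callable


def cache_key(text: str) -> str:
    """Hash a normalized form of the text so near-identical copies collide."""
    normalized = unicodedata.normalize("NFKC", text)
    # Lowercase and collapse whitespace runs into single spaces.
    normalized = " ".join(normalized.lower().split())
    return "mod:" + hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class ModerationCache:
    """Wrap a classifier function with a lookup keyed on the content hash."""

    def __init__(self, classify: Callable[[str], dict]):
        self._classify = classify
        self._store: dict[str, str] = {}  # stand-in for Redis

    def moderate(self, text: str) -> dict:
        key = cache_key(text)
        cached = self._store.get(key)
        if cached is not None:
            return json.loads(cached)
        result = self._classify(text)
        # Store JSON strings, as a Redis-backed version would.
        self._store[key] = json.dumps(result)
        return result
```

Normalization is a trade-off: lowercasing merges "FREE MONEY" with "free money", which is usually what you want for spam but may hide tone differences. Results that depend on per-request context should not share a key with context-free ones.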
Batching
For bulk moderation (importing historical content, processing queued posts), batch multiple items per API call:
def moderate_batch(items: list[str]) -> list[dict]:
    # Number each item so results can be matched back to their inputs.
    numbered = "\n---\n".join([f"[{i}] {text}" for i, text in enumerate(items)])
    # Single API call for multiple items
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        system="Classify each numbered text item...",
        messages=[{"role": "user", "content": numbered}],
    )
    # parse_batch_response (not shown) maps the reply back to the input order.
    return parse_batch_response(response)
Pre-filtering
Use fast, local checks before AI classification:
- URL/link density (high link count = likely spam)
- Banned word lists (obvious violations don’t need AI)
- Rate limiting (too many posts = likely spam)
- Length checks (single characters, overly long posts)
Step 7: Moderation Dashboard

Build analytics showing:
- Moderation volume over time
- Category distribution
- AI accuracy vs. human decisions
- Average response time
- Top flagged users
- False positive/negative rates
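For the accuracy and error-rate panels, here is one way the numbers might be computed from pairs of (AI action, human decision). It assumes human decisions exist for some items the AI approved or removed on its own, for example through appeals, and it treats the human decision as ground truth. The `moderation_metrics` name and the definitions of false positive and false negative are illustrative choices.

```python
FINAL_ACTIONS = {"approve", "remove"}


def moderation_metrics(pairs: list[tuple[str, str]]) -> dict:
    """Compare AI actions with human decisions.

    False positive: AI removed content a human approved.
    False negative: AI approved content a human removed.
    Pairs involving flag_review or escalate are skipped, since neither side
    made a final call in those cases.
    """
    decided = [
        (ai, human)
        for ai, human in pairs
        if ai in FINAL_ACTIONS and human in FINAL_ACTIONS
    ]
    if not decided:
        return {
            "agreement": None,
            "false_positive_rate": None,
            "false_negative_rate": None,
        }

    agree = sum(1 for ai, human in decided if ai == human)
    false_pos = sum(1 for ai, human in decided if ai == "remove" and human == "approve")
    false_neg = sum(1 for ai, human in decided if ai == "approve" and human == "remove")
    human_clean = sum(1 for _, human in decided if human == "approve")
    human_violating = sum(1 for _, human in decided if human == "remove")

    return {
        "agreement": agree / len(decided),
        "false_positive_rate": false_pos / human_clean if human_clean else None,
        "false_negative_rate": false_neg / human_violating if human_violating else None,
    }
```

The pairs would typically come from a query selecting `action` and `human_decision` from the moderation log where `human_decision` is set.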
Step 8: Production Considerations

Latency Requirements
Most platforms need moderation decisions in under 2 seconds. Claude Sonnet typically responds in 500-1500ms for classification tasks. Pre-filtering catches 30-50% of obvious cases instantly.
Cost Management
At approximately $0.003 per moderation (Sonnet pricing for typical classification), moderating 100,000 items/day costs about $300/day, or roughly $9,000/month. Use Haiku ($0.0003 per moderation) for initial screening and Sonnet only for ambiguous cases.
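Monthly spend is the per-item price times daily volume times days in the month, so it is worth running the numbers for your own traffic. The sketch below uses the per-item figures quoted above; the function names, the 30-day month, and the 20% escalation rate (borrowed from the ambiguous share mentioned in the conclusion) are assumptions.

```python
SONNET_COST_PER_ITEM = 0.003   # per-item figure quoted in this section
HAIKU_COST_PER_ITEM = 0.0003   # per-item figure quoted in this section


def monthly_cost(items_per_day: int, cost_per_item: float, days: int = 30) -> float:
    """Cost of sending every item to a single model."""
    return items_per_day * cost_per_item * days


def tiered_monthly_cost(
    items_per_day: int, escalation_rate: float, days: int = 30
) -> float:
    """Cost when Haiku screens everything and Sonnet re-checks a fraction."""
    screening = items_per_day * HAIKU_COST_PER_ITEM
    escalated = items_per_day * escalation_rate * SONNET_COST_PER_ITEM
    return (screening + escalated) * days


print(f"Sonnet only: ${monthly_cost(100_000, SONNET_COST_PER_ITEM):,.0f}/month")
# prints: Sonnet only: $9,000/month
print(f"Tiered (20% escalated): ${tiered_monthly_cost(100_000, 0.20):,.0f}/month")
# prints: Tiered (20% escalated): $2,700/month
```

Caching and pre-filtering reduce `items_per_day` before either model is called, so they compound with the tiered approach.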
Legal Compliance
Different jurisdictions have different content moderation requirements (EU DSA, US Section 230). Log all moderation decisions, provide appeal mechanisms, and maintain transparency about your moderation policies.
The Bottom Line
AI content moderation isn’t just faster than manual moderation — it’s more consistent, more scalable, and frees human moderators to focus on genuinely difficult decisions. The system in this tutorial handles the 80% of cases that are clearly clean or clearly violating, routing only the ambiguous 20% to human review.
Build time: 5-6 hours. Cost: $100-300/month for a medium-traffic platform (about 1,100-3,300 items per day at $0.003 per moderation). The alternative, hiring a team of human moderators, costs 10-100x more.