Percy Liang: The Stanford Professor Holding AI Companies Accountable with Data

Percy Liang built HELM, the benchmark framework that shows how AI models actually perform — not how companies claim they do. Big Tech isn't thrilled.

By EgoistAI
In an industry where every company claims its AI model is “state of the art” and every benchmark is cherry-picked to show favorable results, Percy Liang does something radical: he measures everything, fairly, and publishes the results for anyone to see.

Liang is a computer science professor at Stanford University, the director of the Center for Research on Foundation Models (CRFM), and the creator of HELM (Holistic Evaluation of Language Models) — the most comprehensive and rigorous framework for evaluating AI models in existence. His work has fundamentally changed how the AI industry thinks about model evaluation, transparency, and accountability.

He’s also, somewhat predictably, not everyone’s favorite person in Silicon Valley.

The Problem HELM Solves

Before HELM, model evaluation was a mess. Companies would:

  1. Pick favorable benchmarks. OpenAI would highlight benchmarks where GPT-4 excelled. Anthropic would highlight benchmarks where Claude excelled. Neither would publish results on benchmarks where they performed poorly.

  2. Use inconsistent testing. Different evaluation settings (temperature, prompt format, few-shot examples) made comparisons meaningless. Model A’s “85% accuracy” and Model B’s “87% accuracy” might not be comparable at all.

  3. Ignore important dimensions. Most benchmarks measured accuracy only. They didn’t measure toxicity, bias, robustness to adversarial inputs, efficiency, or calibration. A model could ace the accuracy test while being dangerously biased.

  4. Self-report results. Companies would evaluate their own models and publish results in press releases. No independent verification. No reproducibility requirements.

HELM changed this by providing a standardized, independent, multi-dimensional evaluation framework.

How HELM Works

HELM evaluates language models across:

  • 42+ scenarios covering core language tasks (question answering, summarization, reasoning, coding, etc.)
  • 7 metrics per scenario: accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency
  • Standardized prompting: Every model receives the same prompts in the same format
  • Independent evaluation: Models are tested by Liang’s team, not by the companies that built them

The result is a comprehensive “nutrition label” for each AI model — showing not just how accurate it is, but how fair, how robust, how toxic, and how efficient.
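The idea of standardized, multi-metric evaluation can be illustrated with a minimal sketch. This is not HELM's actual code or API: the `Scenario` class, the `perturb` function, and the two toy models are hypothetical stand-ins, and robustness is approximated here simply as accuracy on upper-cased prompts.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# A "model" is any function from prompt text to answer text (toy stand-in).
Model = Callable[[str], str]


@dataclass(frozen=True)
class Scenario:
    name: str
    examples: Tuple[Tuple[str, str], ...]  # (prompt, expected answer) pairs


def perturb(prompt: str) -> str:
    """Toy perturbation (an assumption): upper-case the prompt to mimic noisy input."""
    return prompt.upper()


def evaluate(model: Model, scenario: Scenario) -> Dict[str, float]:
    """Score one model on one scenario; every model sees the same prompts."""
    n = len(scenario.examples)
    clean = sum(model(p).strip() == a for p, a in scenario.examples)
    noisy = sum(model(perturb(p)).strip() == a for p, a in scenario.examples)
    return {"accuracy": clean / n, "robustness": noisy / n}


def run_all(
    models: Dict[str, Model], scenarios: List[Scenario]
) -> Dict[str, Dict[str, Dict[str, float]]]:
    """Evaluate every model on every scenario with identical settings."""
    return {
        model_name: {s.name: evaluate(model, s) for s in scenarios}
        for model_name, model in models.items()
    }


# Hypothetical toy models: one matches prompts exactly, one ignores case.
ANSWERS = {"two plus two?": "4", "capital of france?": "Paris"}


def brittle_model(prompt: str) -> str:
    return ANSWERS.get(prompt, "")


def sturdy_model(prompt: str) -> str:
    return ANSWERS.get(prompt.lower(), "")


qa = Scenario("qa", (("two plus two?", "4"), ("capital of france?", "Paris")))
results = run_all({"brittle": brittle_model, "sturdy": sturdy_model}, [qa])
for name, per_scenario in results.items():
    print(name, per_scenario)
```

Both toy models reach the same accuracy on clean prompts, but only one survives the perturbation. A single accuracy number would hide that difference.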

HELM Evaluation Example (simplified; scores on a 0-1 scale, higher is better):

Model: GPT-4o
Scenario: Legal Contract Analysis

Metrics:
- Accuracy:    0.89 (good)
- Calibration: 0.82 (the model knows when it's uncertain)
- Robustness:  0.71 (performance drops on adversarial inputs)
- Fairness:    0.85 (minimal performance difference across demographics)
- Bias:        0.78 (some gender bias in legal role assumptions)
- Toxicity:    0.96 (very low toxic output)
- Efficiency:  0.65 (slow, high token count)

This multi-dimensional view reveals things that single-metric benchmarks hide. A model with 0.89 accuracy but 0.71 robustness might be dangerous in production: it works well on clean inputs but fails on slightly unusual ones.
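As a rough sketch of how such a profile might be screened, the snippet below takes the scores from the simplified example and lists every dimension that falls under a cutoff. The 0.75 threshold and the function name `weak_dimensions` are assumptions chosen for illustration, not part of HELM.

```python
from typing import Dict, List

# Scores copied from the simplified example (0-1 scale, higher is better).
scores: Dict[str, float] = {
    "accuracy": 0.89,
    "calibration": 0.82,
    "robustness": 0.71,
    "fairness": 0.85,
    "bias": 0.78,
    "toxicity": 0.96,
    "efficiency": 0.65,
}


def weak_dimensions(profile: Dict[str, float], threshold: float = 0.75) -> List[str]:
    """Return the metrics scoring below the (assumed) threshold, in input order."""
    return [name for name, value in profile.items() if value < threshold]


weak = weak_dimensions(scores)
print(f"accuracy={scores['accuracy']:.2f}, weak dimensions: {weak}")
```

With this cutoff the model looks strong on accuracy while robustness and efficiency are flagged, which is the kind of gap a headline accuracy figure leaves out.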

The Foundation Model Transparency Index

In 2023, Liang and his team launched the Foundation Model Transparency Index (FMTI) — a framework that evaluates not the models themselves, but the companies behind them.

The FMTI scores companies on 100 indicators across three categories:

  1. Upstream transparency: How much do they disclose about training data, compute resources, labor practices, and development process?
  2. Model transparency: How much do they disclose about model architecture, capabilities, limitations, and evaluation results?
  3. Downstream transparency: How much do they disclose about deployment practices, usage policies, user data handling, and impact assessments?
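The scoring scheme described above can be sketched in a few lines. This assumes each indicator is simply marked as satisfied or not and that the score is the count of satisfied indicators; the indicator names below are hypothetical examples, not the index's real wording, and only six are shown instead of 100.

```python
from dataclasses import dataclass
from typing import Dict, List

CATEGORIES = ("upstream", "model", "downstream")


@dataclass(frozen=True)
class Indicator:
    category: str    # one of CATEGORIES
    name: str        # hypothetical label, for illustration only
    satisfied: bool  # assumed binary: disclosed or not


def transparency_score(indicators: List[Indicator]) -> Dict[str, int]:
    """Count satisfied indicators per category and overall."""
    totals = {category: 0 for category in CATEGORIES}
    for indicator in indicators:
        if indicator.category not in totals:
            raise ValueError(f"unknown category: {indicator.category}")
        totals[indicator.category] += int(indicator.satisfied)
    totals["total"] = sum(totals[c] for c in CATEGORIES)
    return totals


example = [
    Indicator("upstream", "training data sources disclosed", True),
    Indicator("upstream", "compute usage disclosed", False),
    Indicator("model", "architecture described", True),
    Indicator("model", "limitations documented", True),
    Indicator("downstream", "usage policy published", True),
    Indicator("downstream", "impact assessment published", False),
]
fmti_like = transparency_score(example)
print(fmti_like)
```

With 100 indicators, the total reads directly as a score out of 100, and the per-category subtotals show where a company's disclosures are thinnest.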

FMTI Results (2025 Edition)

Company              Score (out of 100)   Grade
Meta (Llama)         67                   B
Hugging Face         64                   B
Anthropic (Claude)   58                   C+
OpenAI (GPT)         48                   C
Google (Gemini)      46                   C
Amazon (Titan)       35                   D
Apple (Foundation)   22                   F

No company scored above 70. The industry average was 47 out of 100. Liang’s assessment: the AI industry’s transparency is “woefully inadequate” given the technology’s impact on society.

The FMTI became influential because it gave researchers, journalists, and policymakers a concrete, data-driven way to hold companies accountable. When a company claims to be “committed to transparency,” the FMTI provides the receipts.

The Impact

On Industry Practices

After the first FMTI was published, several companies improved their transparency scores:

  • Meta published more detailed model cards for Llama models
  • Anthropic released more comprehensive system documentation
  • OpenAI published a more detailed “system card” for GPT-4o

Whether these improvements were driven by the FMTI specifically or by broader transparency trends is debatable. But the correlation is notable — and Liang’s team documented the improvements in subsequent editions.

On Policy

The EU AI Act’s transparency requirements for foundation models draw directly on HELM and FMTI research. Liang testified before the European Parliament, providing technical guidance on what transparency disclosures are feasible and which metrics should be standardized.

In the US, the White House Executive Order on AI (2023) cited Stanford CRFM’s work on model evaluation. The NIST AI Risk Management Framework incorporates evaluation methodologies developed by Liang’s team.

On Research

HELM’s standardized evaluation methodology has been adopted or adapted by dozens of research labs worldwide. Before HELM, every paper used different benchmarks and evaluation procedures. After HELM, there’s at least a common framework that enables meaningful comparison.

The open-source community has been particularly enthusiastic. The Open LLM Leaderboard on Hugging Face, while not directly based on HELM, was inspired by the same principle: independent, standardized, reproducible evaluation.

Who Is Percy Liang?

Liang’s academic career is distinguished but not flashy. He studied at MIT as an undergraduate and master’s student, earned his PhD at UC Berkeley, joined Stanford’s faculty in 2012, and has published prolifically on natural language processing, machine learning robustness, and AI evaluation. His research has been cited over 40,000 times.

What sets Liang apart from other prominent AI researchers is his focus on infrastructure rather than models. He doesn’t build the next GPT — he builds the tools to evaluate whether the next GPT is actually good, safe, and fair.

In interviews, Liang is measured and precise — the opposite of the loud, confrontational style common in AI Twitter discourse. He lets the data speak, which in some ways makes his critiques more devastating. When he says a model is biased or a company is opaque, he backs it up with 100 pages of methodology and results.

Colleagues describe him as rigorous to the point of stubbornness — he reportedly once held up a paper’s publication for three months because a single evaluation metric wasn’t computed correctly. “Getting it right matters more than getting it published,” he told a graduate student, according to Stanford CS lore.

The Tension with Industry

Liang’s work creates an inherent tension: companies fund academic research, including at Stanford, and HELM results can damage those companies’ reputations. Liang has publicly acknowledged this tension and addressed it by:

  1. Making all HELM code and data open source — anyone can reproduce the results
  2. Funding HELM through multiple sources, including NSF and other government grants and philanthropic donations, rather than exclusively through industry sponsorship
  3. Giving companies advance notice of results before publication, but not editorial control

This approach hasn’t made everyone happy. Some companies have privately complained that HELM results are “unfairly harsh” or “don’t reflect real-world usage.” Liang’s response, paraphrased from a public talk: “If you think our evaluation doesn’t reflect reality, help us design better evaluations. Don’t ask us to change the results.”

The Bigger Picture

Percy Liang represents a role that the AI ecosystem desperately needs but doesn’t always appreciate: the independent evaluator. In an industry where companies grade their own homework and hype cycles replace measured assessment, HELM and FMTI provide something rare: objective, reproducible, comprehensive data.

The irony is that the companies that rank lowest on the transparency index are often the ones with the most to gain from transparency. OpenAI’s relatively low transparency score fuels distrust from researchers, regulators, and the public. Meta’s higher score, achieved largely through open-sourcing Llama models, has earned it goodwill that translates into developer adoption and regulatory credibility.

Liang’s bet is that transparency isn’t just ethically correct — it’s strategically smart. Companies that are transparent about their models’ capabilities and limitations build trust, attract scrutiny that improves their products, and avoid the backlash that comes when hidden flaws are eventually discovered.

The AI industry is building one of the most powerful technologies in human history. Percy Liang is building the systems to verify whether it works as promised. Both are essential. Only one gets the headlines.