Moral Theory Benchmark Cross-Model Comparison

SFOM v1 assessed against 14 competing moral theories across 3 independent LLM evaluations (2025–2026)

SFOM (Subjective-Frame Objective Morality) is my theory. — suntzugi

SFOM Rank: #1 (across all 3 evaluations)
Models Tested: 2 (Grok 3, 2025 · Claude Opus 4.6, 2026)
SFOM Best Score: 27/31 (87%, achieved twice)
Lead Over #2: +3 to +5 points (consistent gap across all runs)
SFOM Stability: rank std dev 0 (0 = perfectly stable; SFOM held rank #1 in every run)

Methodology

How the benchmark works: Two files are uploaded to an LLM — (1) an assessment framework defining 31 evaluation criteria, and (2) a catalogue of 15 moral theories written in comparable depth. The LLM is prompted to evaluate every theory against every criterion and produce a scored ranking. The process is then repeated with different models to test for consistency.
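The repository does not yet ship an automation script, but the loop above is small enough to sketch. The Python below is a minimal, hypothetical version of that pipeline: call_llm, the model IDs, and the score-parsing pattern are placeholder assumptions, not part of the current benchmark; only the prompt and the input file names come from this page.

```python
import re

PROMPT = ("Answer the question in the theory list document using the "
          "framework in the framework document, in detail.")

def call_llm(model: str, prompt: str, documents: list[str]) -> str:
    """Hypothetical wrapper around whichever provider SDK you use."""
    raise NotImplementedError("plug in your provider's API here")

def run_benchmark(model: str, framework_path: str, catalogue_path: str) -> dict[str, int]:
    # Upload both files and ask for the full evaluation.
    docs = [open(framework_path).read(), open(catalogue_path).read()]
    output = call_llm(model, PROMPT, docs)
    # Placeholder parser: assumes ranking lines like "Theory Name: 27/31".
    return {m.group(1).strip(): int(m.group(2))
            for m in re.finditer(r"^(.+?):\s*(\d+)/31\s*$", output, re.M)}

if __name__ == "__main__":
    for model in ["grok-3", "claude-opus-4-6"]:   # hypothetical model IDs
        scores = run_benchmark(model,
                               "moral-theory-assessment-framework-v.1.md",
                               "moral-theories-norm-v1.txt")
        for theory, score in sorted(scores.items(), key=lambda kv: -kv[1]):
            print(f"{model}: {theory} {score}/31")
```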

Anonymization: SFOM was submitted without any author name or attribution attached, as can be verified by inspecting the theory catalogue files directly. It appears as just another numbered theory ("Subjective-Frame Objective Morality Model") alongside the other 14. This removes the sycophancy factor: the model has no way to tell which entry is the submitter's, so it has no reason to favor it over any other.

Why this matters (and its limits): LLM-based evaluation is an imperfect but useful signal. These models can't "do philosophy" the way a domain expert can — they can't truly judge originality or depth of argument. But they can assess structural properties: internal consistency, scope of applicability, how well a theory addresses known edge cases, compatibility with empirical findings. When the same theory wins across different models (Grok 3, Claude Opus 4.6), different conditions (normal, steelmanned), and different time periods (2025, 2026) — all without the theory being updated — that's a meaningful signal, even if it's not a substitute for peer review.

Three evaluation runs:
Grok 3 Normal (Feb 2025): Base theory summaries assessed by Grok 3.
Grok 3 Steelman (Feb 2025): LLM-steelmanned versions of each theory (except SKL, used as control — its text stayed identical).
Claude Opus 4.6 (Mar 2026): Same v1 files, one year later, different model with Extended Thinking enabled. No theory updates.

All theories were assessed on equal footing using the same v1 benchmark framework. SFOM received no special treatment — it competed under the same rules as all 14 other theories.

Important context: The SFOM entry in the benchmark is only a condensed summary of my full moral theory, last updated in Feb 2025. The actual theory is significantly more detailed and has improved substantially since then. I plan to input more comprehensive versions of SFOM and the competing theories, improve the assessment framework, expand the criteria, add more LLMs, and ultimately automate the entire pipeline so anyone can reproduce and extend these results. Even in its current state, a year-old summary competing against 14 other theories, SFOM still wins consistently.

Known Limitations & Open Questions

This benchmark is a preliminary signal, not a settled result. If SFOM can't survive honest scrutiny, it doesn't deserve the top spot.

I wrote SFOM from scratch; the other 14 were outlined by me and expanded by LLMs. A purpose-built entry will naturally read more coherently than an LLM summary of someone else's life work. I used LLMs over domain experts for speed and cost; funding would fix this by commissioning experts to write or vet each tradition's entry.
The 31 criteria were mostly LLM-generated, but I curated the final set, which could unconsciously favor the kind of theory I built. The scoring is also purely qualitative; I plan to develop quantitative measures and add them alongside the current criteria.
The benchmarked entry is a small summary of my theory as of early 2025; the core ideas remain but have been refined and formalized dramatically since. The benchmark is testing a compressed, year-old version — and it's still winning.
LLMs tend to reward well-structured, comprehensive-sounding documents over deeper but less neatly packaged ones; the benchmark may partly measure document quality rather than theory quality.
Three runs — two on Grok 3, one on Claude Opus 4.6 — is thin. More models, multiple runs, temperature/prompt variation, and entry-order randomization are planned.
What would change my mind: an expert-rewritten competing entry overtaking SFOM; independently selected criteria producing different rankings; 10+ models showing the lead is within noise; or someone identifying a structural feature that mechanically advantages SFOM. If you find any of these, I want to hear about it.

Full Cross-Model Comparison

[Table: per-theory results across the three runs. Columns: Theory · Grok 3 Normal (Feb 2025) · Grok 3 Steelman (Feb 2025) · Claude Opus 4.6 (Mar 2026) · Avg · Rank Δ (Grok → Claude) · Score Δ (Grok Normal → Claude).]

Visualizations

Rank & Score Consistency

Rank Std Dev measures how much a theory's rank fluctuates across the 3 evaluations (0 = perfectly stable). Score Std Dev measures raw score volatility. Scrutiny Δ shows the score change from the weakest evaluator (Grok Normal) to the strongest (Claude Opus 4.6) — positive means the theory holds up or improves under smarter scrutiny, negative means it crumbles. Z-score shows how far each theory's score is from that model's average — positive = above average for that model, negative = below.
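For concreteness, here is how those statistics fall out of the raw numbers, using SFOM's published ranks (1, 1, 1) and scores (25, 27, 27). The z-score helper is left generic because computing it requires all 15 theories' scores for a given model; whether the table uses population or sample standard deviation isn't stated, so this sketch assumes population.

```python
from statistics import mean, pstdev

# SFOM's published results: rank and score per run
# (Grok 3 Normal, Grok 3 Steelman, Claude Opus 4.6).
ranks  = [1, 1, 1]
scores = [25, 27, 27]

rank_std  = pstdev(ranks)               # 0.0 -> perfectly stable
score_std = pstdev(scores)              # ~0.94 raw-score volatility
scrutiny_delta = scores[2] - scores[0]  # Claude minus Grok Normal: +2

def z_score(score: float, model_scores: list[float]) -> float:
    """Distance from that model's average across all 15 theories, in std devs."""
    return (score - mean(model_scores)) / pstdev(model_scores)

print(f"rank std dev {rank_std:.2f}, score std dev {score_std:.2f}, "
      f"scrutiny delta {scrutiny_delta:+d}")
```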

[Table: per-theory consistency statistics. Columns: Theory · Rank Std Dev · Score Std Dev · Scrutiny Δ · Stability · Z (Grok Norm) · Z (Grok Steel) · Z (Claude).]

Ranking Movement: Grok 3 Normal → Claude Opus 4.6

[Chart: biggest climbers and biggest fallers by rank between Grok 3 Normal and Claude Opus 4.6.]

Source Files

All input and output files used in the benchmark. Each file is described below and can be opened directly on GitHub for inspection.

moral-theory-assessment-framework-v.1.md View on GitHub
Type: Input — Evaluation criteria
Role: Defines the 31 criteria each theory is scored against, covering internal consistency, practical applicability, explanatory power, and philosophical robustness. Also contains the evaluation prompt at the bottom.
Key detail: The 31 criteria were mostly invented and fleshed out by LLMs, with some curation on my part, in order to reduce my own bias in choosing them. The criteria are written to be theory-agnostic: none names or presupposes a particular moral framework.
moral-theories-norm-v1.txt View on GitHub
Type: Input — Theory descriptions (base version)
Role: Contains all 15 moral theories written in comparable depth and format. SFOM appears as theory #9 with no author attribution — fully anonymized. LLMs fleshed out the outlines of the 14 established theories to reduce my personal bias in how they were described.
Key detail: SFOM is the only theory I wrote from scratch. The others were outlined by me and expanded by LLMs for neutral wording.
moral-theories-steelman-v1.txt View on GitHub
Type: Input — Theory descriptions (steelmanned version)
Role: Same 15 theories, but each one (except SKL) was steelmanned by an LLM to present the strongest possible version of each argument. SKL was kept identical as a control to see if the steelman boost affected relative rankings.
Key detail: Every theory except the SKL control, including SFOM, was steelmanned, so each competitor appeared in its strongest possible form. SFOM still ranked #1.
grok-3-normal-test_19:02:25.jpg View on GitHub
Type: Output — Screenshot evidence
Date: Feb 19, 2025
Model: Grok 3 (base theory versions)
Result: SFOM ranked #1 with 25/31
grok3-steelman-test_19:02:25.jpg View on GitHub
Type: Output — Screenshot evidence
Date: Feb 19, 2025
Model: Grok 3 (steelmanned theory versions)
Result: SFOM ranked #1 with 27/31
claude-opus-extended-thinking-test_2026-03-06.html View on GitHub
Type: Output — Full conversation export
Date: Mar 6, 2026
Model: Claude Opus 4.6 with Extended Thinking
Result: SFOM ranked #1 with 27/31. Full conversation showing the model's reasoning process.
claude-opus-extended-thinking-assessment_2026-03-06.docx View on GitHub
Type: Output — Complete assessment document
Date: Mar 6, 2026
Model: Claude Opus 4.6 with Extended Thinking
Result: All 15 theories scored, with detailed failure analysis for each theory and stress-tests of the top 3.

Reproduce / Corroborate / Test the Results

How to run it yourself:
  1. Download the two input files from this repo: Assessment Framework and Theory Catalogue (or the steelmanned version).
  2. Open any capable LLM (Claude, Grok, GPT, Gemini, DeepSeek, etc.).
  3. Upload both files and prompt: "Answer the question in the theory list document using the framework in the framework document, in detail." (The full prompt is at the bottom of both files.)
  4. The model will evaluate all 15 theories against the 31 criteria and produce a scored ranking.
Expect some variation: LLM outputs are non-deterministic, so your exact scores will differ slightly from run to run. These results are all first-pass, single-run evaluations on my end. In the near term, I plan to run each model multiple times and in different configurations (temperature, prompting style, order randomization) to minimize variation and bias and produce averaged results with confidence intervals.
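Once those multi-run results exist, the aggregation itself is straightforward. Below is a minimal sketch under my own assumptions: one score per run per theory, and a normal-approximation 95% interval. The example run scores are hypothetical, not measured results.

```python
from math import sqrt
from statistics import mean, stdev

def summarize(run_scores: list[float]) -> tuple[float, float, float]:
    """Mean score with a rough 95% CI (normal approximation; needs >= 2 runs)."""
    m = mean(run_scores)
    half = 1.96 * stdev(run_scores) / sqrt(len(run_scores))
    return m, m - half, m + half

# Hypothetical example: five re-runs of one theory on one model.
m, lo, hi = summarize([27, 26, 27, 25, 27])
print(f"mean {m:.2f}/31, 95% CI [{lo:.2f}, {hi:.2f}]")
```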

Add your own theory: You can add any moral theory that's missing — or your own — to the theory catalogue file, then run it through the same framework. The system is designed to be open: just add a new numbered entry in the same format as the existing 15 and re-run. Others have tried this during the 2025 hackathon and no submitted theory managed to beat SFOM.

Challenge it: If you think a theory should score higher, or that the framework criteria are biased, I want to hear about it. The whole point of making this open is to invite scrutiny. If SFOM can't survive challenge, it doesn't deserve the top spot.

The 31 Assessment Criteria

Each theory is scored against these 31 criteria, listed below as a core question followed by its explanation. View the full framework on GitHub.

Does the theory avoid positing entities or properties beyond those required by our best scientific theories?
Evaluates whether a moral theory introduces unnecessary metaphysical commitments. A parsimonious theory aligns with natural properties or constructed agreements, avoiding non-natural facts or divine commands.
Is the theory free from internal contradictions and paradoxes?
Assesses whether the theory's components form a logically coherent whole, avoiding paradoxes or self-undermining principles across all applications.
Does the theory accurately capture the first-person experience of moral judgment?
Evaluates how well the theory reflects the lived experience of morality, explaining why moral judgments feel objective and distinct from preferences.
Can the theory accommodate diverse cultural moral frameworks while maintaining core principles?
Assesses whether the theory explains moral diversity across cultures without collapsing into relativism, identifying universal elements amid variation.
Does the theory provide specific guidance for resolving moral dilemmas?
Evaluates the theory's ability to offer clear principles for real-world moral decision-making, resolving uncertainty in novel situations.
Does the theory explain why moral judgments inherently motivate action?
Assesses whether the theory accounts for the motivational force of moral judgments, linking recognition of duty to the will to act.
Are moral truths discoverable through ordinary human faculties?
Evaluates whether moral knowledge is attainable via reflection, empathy, or observation, without requiring special faculties or revelation.
Can the theory handle complex moral dilemmas without counterintuitive results?
Assesses the theory's ability to address tough cases (e.g., trolley problems) with principled, non-absurd conclusions.
Does the theory explain why diverse cultures develop similar core values?
Evaluates whether the theory accounts for common moral prohibitions (e.g., murder) and values (e.g., reciprocity) across societies.
Does the theory account for how people actually talk about morality?
Assesses whether the theory explains why moral discourse involves notions of truth, obligation, and rights rather than mere preferences.
Does the theory explain the phenomenology of moral certainty?
Evaluates whether the theory accounts for why some moral judgments (e.g., "torturing innocents is wrong") feel self-evident.
Does the theory explain how moral judgments can be mistaken?
Assesses whether the theory explains moral error and enables progress through correcting mistaken views.
Is the theory compatible with our best scientific understanding of human psychology and evolution?
Evaluates alignment with evolutionary psychology, neuroscience, and other sciences, avoiding conflicts with established facts.
Can the theory address moral questions involving non-human animals and future entities?
Assesses whether the theory extends to animals, AI, and future generations without ad hoc adjustments.
Does the theory provide a coherent account of how moral views improve over time?
Evaluates whether the theory distinguishes moral improvement (e.g., abolition of slavery) from mere change, providing standards for progress.
Does the theory bridge the gap between descriptive facts and normative claims?
Assesses whether the theory justifies normative conclusions from facts without committing the naturalistic fallacy.
Can the theory explain persistent disagreement among moral experts?
Evaluates whether the theory accounts for moral disagreement without undermining the possibility of moral truth.
Does the theory maintain moral authority across contexts?
Assesses whether the theory preserves the binding force of moral judgments, avoiding pure relativism.
Does the theory provide a simple, unified account of morality?
Evaluates the theory's simplicity and unity, avoiding complexity or disconnected explanations.
Does the theory explain moral disagreement without undermining the possibility of moral knowledge?
Assesses whether disagreement is explained via bias or incomplete information, not systematic unreliability.
Does the theory demonstrate interaction effects between its components that create stronger explanations when combined?
Evaluates whether the theory's principles reinforce each other, enhancing explanatory power through integration.
Does the theory connect seemingly disparate moral phenomena under a common framework?
Assesses whether the theory unifies diverse intuitions (e.g., harm, fairness) under shared principles.
Does the theory generate non-obvious predictions about moral situations not previously considered?
Evaluates whether the theory anticipates judgments in new contexts, demonstrating generative capability.
Does the theory provide explanations that are not just logically consistent but compelling and intuitively satisfying?
Assesses whether the theory's accounts create insight and satisfy deep moral curiosity.
Does the theory maintain its explanatory and practical power across widely varying contexts and scenarios?
Evaluates consistency across historical, cultural, and technological contexts while retaining core principles.
Does the theory align with widely shared moral intuitions as a primary benchmark?
Assesses how well the theory matches core intuitive judgments (e.g., "murder is wrong") as a foundational test.
Does the theory offer guidance that is not only specific but accessible and user-friendly for decision-making?
Refines action guidance to ensure it's practically clear to ordinary people, not just theoretically precise.
Can the theory withstand any foundational philosophical objection, known or emerging?
Evaluates the theory's overall defensibility against broad critique, beyond specific challenges like is-ought or relativism.
Does the theory apply effectively across a wide range of practical situations, from everyday to extreme?
Assesses coverage of diverse scenarios (e.g., mundane choices vs. crises), complementing entity-based scope.
Does the theory define right and wrong with unambiguous precision?
Evaluates the sharpness of moral boundaries, ensuring the theory avoids vagueness in its core prescriptions.
Does the theory clearly define and defend its stance on whether morality is objective or subjective, successfully addressing arguments from the opposing view?
Evaluates whether the theory articulates a position on the objective vs. subjective nature of moral truths and provides a reasoned defense, directly engaging and refuting key counterarguments from the opposing side.