What MyCorum.ai claims.
What it cannot claim.
What it is trying to measure.
This page exists for one reason: a platform that positions itself as a decision-quality improvement system has an obligation to be precise about the evidence behind its claims. We document what is verified, what is partial, what is genuinely unknown, and where our research agenda begins.
1. Four structural properties — verified in code
MyCorum.ai's deliberative engine, Le Corum, rests on four architectural properties that distinguish it from multi-agent systems, prompt chaining, and model ensembling. Each property is verifiable in the production codebase, not asserted through prompting.
Property 1 — R1 Isolation
```python
results = await asyncio.gather(*[analyze(persona) for persona in personas])
# All 5 calls complete before any result is passed to another model
# SHA-256 hash of all 5 independent outputs logged as R1_ISOLATION_PROOF
```
Round 1 analysis is fully parallel. No model sees another's response until all five have completed. This is enforced at the Python coroutine level — not through prompt instructions, which could be violated. The SHA-256 proof in the telemetry provides an auditable record that all five outputs existed before any cross-model interaction occurred.
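The mechanism above can be sketched end to end. This is a minimal, self-contained illustration of the pattern, not the production code: `analyze`, `run_round_one`, and the persona names are hypothetical stand-ins, and the real system calls live model APIs rather than returning stub strings.

```python
import asyncio
import hashlib

async def analyze(persona: str) -> str:
    # Stand-in for a real model call; production code would hit an LLM API here.
    await asyncio.sleep(0)
    return f"analysis from {persona}"

async def run_round_one(personas: list[str]) -> tuple[list[str], str]:
    # gather() resolves only once every coroutine has finished, so no result
    # can be passed to another model before all five outputs exist.
    results = await asyncio.gather(*[analyze(p) for p in personas])
    # Hash the concatenated outputs as an auditable isolation proof.
    digest = hashlib.sha256("\n".join(results).encode()).hexdigest()
    return list(results), digest

personas = ["Analyst", "Strategist", "Skeptic", "Builder", "Contrarian"]
outputs, proof = asyncio.run(run_round_one(personas))
```

The isolation guarantee comes from the structure of `gather`, not from anything the models are told: the hash is computed over outputs that already exist, which is what makes the proof auditable after the fact.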
Property 2 — Algorithmic divergence measurement
```python
kle = kernel_language_entropy(embeddings, bandwidth=0.5)
# Based on: Farquhar et al. (2024), "Detecting hallucinations in large
# language models using semantic entropy", Nature, doi:10.1038/s41586-024-07421-0
biodiversity_index = 0.5 * entropy + 0.3 * inverted_agreement + 0.2 * cluster_count
```
Semantic divergence is measured using Kernel Language Entropy on text embeddings — not self-evaluated by the models. The confidence score reflects actual geometric distance between model outputs in embedding space, not how confident each model reports being. The KLE method is adapted from Farquhar et al., 2024 — the first published application of this measure to deliberation stopping criteria.
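One common construction of a kernel-entropy divergence measure is the von Neumann entropy of a normalized similarity kernel over the output embeddings. The sketch below is an illustration of that idea under our own assumptions (an RBF kernel, trace normalization), not the production `kernel_language_entropy` implementation:

```python
import numpy as np

def kernel_language_entropy(embeddings: np.ndarray, bandwidth: float = 0.5) -> float:
    """Von Neumann entropy of an RBF kernel over output embeddings.
    Higher entropy means the outputs occupy more distinct regions of
    embedding space, i.e. more semantic divergence."""
    # Pairwise squared distances between the model outputs' embeddings.
    sq_dists = np.sum((embeddings[:, None, :] - embeddings[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq_dists / (2 * bandwidth ** 2))
    K = K / np.trace(K)                    # normalize to a density-like matrix
    eigvals = np.linalg.eigvalsh(K)
    eigvals = eigvals[eigvals > 1e-12]     # drop numerical zeros
    return float(-np.sum(eigvals * np.log(eigvals)))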
Property 3 — Anti-convergence enforcement
```python
if biodiversity_index < 0.25 or agreement_score > 0.90:
    trigger_devil_advocate_round()
# Triggered by condition, not by prompt instruction
# The Contrarian persona cannot opt out
```
When consensus forms too rapidly, the orchestrator triggers an adversarial round. This is an algorithmic condition — the threshold values (0.25 / 0.90) are hyperparameters, not prompts. Premature agreement is treated as a system failure, not a success. The Contrarian persona is activated whether or not the deliberation appears to be converging naturally.
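The trigger logic above is small enough to show in full. This is an illustrative sketch, not the orchestrator's code: the function names are ours, inputs are assumed normalized to [0, 1], and the 50/30/20 weights are taken from the biodiversity snippet earlier on this page.

```python
def biodiversity_index(entropy: float, agreement: float,
                       n_clusters: int, max_clusters: int = 5) -> float:
    # Weighted composite: 50% entropy, 30% inverted agreement, 20% cluster count.
    # Inputs assumed normalized to [0, 1]; weights from the snippet above.
    return 0.5 * entropy + 0.3 * (1.0 - agreement) + 0.2 * (n_clusters / max_clusters)

def should_trigger_contrarian(bi: float, agreement_score: float,
                              bi_floor: float = 0.25,
                              agree_ceiling: float = 0.90) -> bool:
    # Purely algorithmic condition: neither a model nor a prompt can bypass it.
    return bi < bi_floor or agreement_score > agree_ceiling
```

Because the condition is evaluated on measured quantities, "everyone agrees" is itself a trigger: high agreement forces the adversarial round rather than ending the deliberation.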
Property 4 — Mandatory dissent (Minority Report)
```python
SYNTHESIS_SCHEMA = {
    "minority_positions": {"type": "array", "minItems": 0},
    "required": ["recommendation", "confidence", "minority_positions", "..."],
    "strict": True,
}
# synthesis_verifier.py: check_minority_quality()
# Rule: minority content > 50 words, overlap with consensus < 70%
```
The Minority Report is a required field in the JSON output schema — not optional, not generated only when divergence is high. A synthesis that omits minority positions fails schema validation and is retried. The synthesis verifier additionally checks that the minority content is substantive (minimum 50 words) and genuinely distinct from the consensus position (overlap threshold < 70%).
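The two verifier rules described above (substantive length, genuine distinctness) can be sketched as follows. This is a hypothetical re-creation, not the actual `check_minority_quality` from `synthesis_verifier.py`; in particular, the word-level overlap measure here is an assumption about how "overlap" is computed.

```python
def check_minority_quality(minority_text: str, consensus_text: str,
                           min_words: int = 50, max_overlap: float = 0.70) -> bool:
    """Return True only if the dissent is substantive (> min_words) and
    genuinely distinct from the consensus (vocabulary overlap < max_overlap)."""
    minority_words = minority_text.lower().split()
    if len(minority_words) <= min_words:
        return False  # too short to count as substantive dissent
    consensus_vocab = set(consensus_text.lower().split())
    unique_minority = set(minority_words)
    overlap = sum(w in consensus_vocab for w in unique_minority) / len(unique_minority)
    return overlap < max_overlap
```

A synthesis failing either check would be rejected and regenerated, which is what makes the Minority Report structurally mandatory rather than a stylistic convention.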
2. The Decision-Maker — the 1 in 1 + 5
The Dream Team is not six models. It is 5 models + 1 human.
The distinction between The A-Team and The Dream Team is not the number of rounds. It is the presence of the Decision-Maker — the human — inside the deliberation loop, holding final authority at each inflection point.
In The A-Team, Le Corum deliberates for you. You receive the Corum Synthesis when deliberation is complete. The five minds have disagreed, challenged each other, and converged — or preserved their dissent — without your intervention.
In The Dream Team, the architecture is fundamentally different. You — the Decision-Maker — are present between rounds. You can pause the deliberation, redirect a line of inquiry, introduce new information, or signal that a particular position requires deeper scrutiny before the next round begins. The five minds respond to you. You are not a prompt. You are the authority the deliberation is structured around.
The research question this raises: does Decision-Maker presence improve the calibration of the final recommendation, or does it introduce anchoring bias — the human steering toward a preferred conclusion? This is one of the open empirical questions we intend to measure with VIP cohort data.
| Service | Architecture | Human role | Optimal use |
|---|---|---|---|
| The Expert | 1 model selected by MyPilot via benchmark scoring. Smart routing, not deliberation. | Passive — receives output | Single-domain questions requiring the best available expert |
| The A-Team | 5 models, adversarial deliberation, up to 4 adaptive rounds. Fully automated. | Passive — receives Corum Synthesis | Complex decisions where breadth of perspective matters, no time for interaction |
| The Dream Team | 5 models + 1 human Decision-Maker in the loop. Up to 6 rounds. Human pauses between rounds. | Active — the Decision-Maker. Steers, redirects, holds final authority. | Highest-stakes decisions where the human brings irreplaceable context, judgment, or authority |
3. What we claim — and what we do not
No model in Round 1 has access to another model's output. This is a structural guarantee of the asyncio.gather architecture, auditable via the SHA-256 proof in telemetry logs.
KLE on embeddings produces a divergence score independent of model self-assessment. The method is academically grounded (Farquhar et al., Nature 2024).
Reference: Farquhar et al. (2024). Detecting hallucinations in large language models using semantic entropy. Nature, 630, 625–630. doi:10.1038/s41586-024-07421-0

JSON schema with `strict: true` and synthesis_verifier.py quality checks (minimum 50 words, <70% overlap with consensus) ensure substantive dissent is always present in the output.
The five models are architecturally distinct, trained by different teams on partially different corpora. However, a significant portion of their training data overlaps (Common Crawl, Wikipedia, code repositories). Full Condorcet-style independence is not achieved. What is achieved is architectural diversity combined with adversarial structural pressure.
The score is a composite of KLE divergence (45%), structured agreement assessment (35%), and verbalized confidence (20%). It measures the quality and depth of the deliberation process. It does not yet have empirical validation against real-world decision outcomes — that calibration is part of the research agenda below.
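The 45/35/20 composite can be written down directly. This is a hypothetical sketch under two stated assumptions: inputs are pre-normalized to [0, 1], and divergence is inverted so that convergence after a full deliberation raises the score (the page does not specify the sign convention).

```python
def confidence_score(kle_divergence: float, agreement: float,
                     verbalized: float) -> float:
    """45/35/20 composite, scaled to the 0-10 display range.
    ASSUMPTION: divergence enters inverted; inputs assumed in [0, 1]."""
    raw = 0.45 * (1.0 - kle_divergence) + 0.35 * agreement + 0.20 * verbalized
    return round(10.0 * raw, 1)
```

Note what this arithmetic does and does not encode: it aggregates properties of the deliberation itself, and nothing in the formula ties the result to real-world outcome frequency, which is exactly why the calibration work below is needed.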
This is the core claim. It is architecturally motivated and theoretically grounded. It is not yet empirically validated at scale. The first calibration data will come from the VIP cohort (March–April 2026). We will publish the results regardless of what they show.
Theoretical case: the human brings irreplaceable contextual knowledge that the models cannot access. Counter-case: human presence introduces anchoring and confirmation bias. We do not yet have data to distinguish these effects. This is a primary research question.
4. Known limitations — stated honestly
Training data overlap. The models in Le Corum share significant training data. Their "independence" is architectural and instructional — not statistical in the Condorcet sense. A deliberation between five models all trained heavily on Wikipedia does not guarantee five genuinely independent beliefs about a question rooted in Wikipedia facts.
KLE measures linguistic divergence, not epistemic divergence. Two models can produce linguistically diverse outputs while holding the same underlying belief. The epistemic_extractor.py module (GO/PIVOT/STOP extraction) partially addresses this, but the gap between linguistic and epistemic divergence is real and not fully closed.
The confidence score is not calibrated against outcomes. A score of 8.2/10 does not mean the decision is 82% likely to succeed. It means the deliberation was deep and the synthesis was well-supported by the panel. Until we have outcome data, interpreting the score as a probability is incorrect.
The KS stopping criterion is statistically weak at n=5. Kolmogorov-Smirnov tests on five samples have low statistical power. The novelty tracker (novelty_tracker.py) partially compensates, but the stopping criterion remains heuristic at this panel size. A bootstrap/permutation test approach is planned for J+45.
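The planned replacement can be illustrated with a permutation test on the difference of round means. This is a sketch of the general technique, not the planned implementation; the function name and the mean-difference statistic are our assumptions.

```python
import random

def permutation_pvalue(round_a: list[float], round_b: list[float],
                       n_perm: int = 10_000, seed: int = 0) -> float:
    """Permutation test on the difference of means between two rounds.
    Unlike the KS statistic, its validity does not rest on asymptotics,
    so it remains meaningful at n = 5 samples per round."""
    rng = random.Random(seed)
    observed = abs(sum(round_a) / len(round_a) - sum(round_b) / len(round_b))
    pooled = round_a + round_b
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        a, b = pooled[:len(round_a)], pooled[len(round_a):]
        if abs(sum(a) / len(a) - sum(b) / len(b)) >= observed:
            hits += 1
    return hits / n_perm
```

With five samples per round there are only C(10, 5) = 252 distinct splits, so the smallest achievable p-value is about 0.008; that is coarse, but the test is exact rather than relying on a large-sample approximation.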
The Dream Team introduces human-in-loop bias risk. A Decision-Maker who steers the deliberation toward a preferred conclusion can produce a Corum Synthesis that confirms rather than challenges their prior. This is a known risk of any advisory system. MyPilot's neutrality principle (never recommend a service that exceeds the actual complexity of the question) is a partial mitigation — not a solution.
5. Open questions
These are the questions MyCorum.ai cannot answer today — and is committed to trying to answer with data:
Q1. The core hypothesis. Controlled comparison: same question, same user, The Expert (best single model) vs. The A-Team (5 models, adversarial). Outcome tracked over 30–90 days. Requires user consent to outcome reporting.
Q2. The Condorcet boundary for this architecture. At what divergence level does the confidence score lose predictive validity? This is empirically measurable once outcome data exists.
Q3. A-Team vs. Dream Team on comparable decisions with comparable stakes. Controlling for question complexity. Primary variable: does human steering increase or decrease calibration of the final confidence score?
Q4. EU vs. MENA vs. CN presets on identical strategic questions with regional context. Does Mercury-2 (G42, UAE) produce structurally different positions than the US-centric default panel on questions involving MENA regulatory or cultural context? KLE measurement across preset conditions.
Q5. In cases where the consensus recommendation was later judged incorrect by the Decision-Maker, was the correct position present in the Minority Report? Retrospective analysis on outcomes data. If yes, this is a novel finding about the informational value of structured dissent.
6. Empirical research agenda
50 VIP users, first 30 days of deliberations. Confidence score vs. self-reported outcome quality at 30 days. First empirical dataset for confidence calibration in multi-model deliberation. Results published regardless of outcome — positive or negative.
Replace the current KS stopping criterion with a bootstrap/permutation test approach, which is statistically valid at n=5. Quantify improvement in stopping precision on the calibration dataset.
First publication of the BI = entropy(50%) + inverted_agreement(30%) + cluster_count(20%) composite measure and its behavior on deliberation data. Open dataset accompanying the note for independent replication.
First published stopping criterion for multi-model deliberation. Comparison of KS, bootstrap, and heuristic approaches on the VIP dataset. Includes the known limitations section from this page as part of the methodology.
Controlled study on Q3 above. Requires sufficient scale (n ≥ 200 comparable decision pairs) and outcome tracking. First empirical evidence on whether human-in-the-loop deliberation outperforms automated deliberation on high-stakes decisions.
7. For researchers — contact and collaboration
We are interested in collaboration with researchers working on multi-agent AI, epistemic calibration, decision quality under uncertainty, and human-AI teaming. If you are studying any of the open questions above — or if you find errors in our claims — we want to hear from you.
We will share anonymized deliberation data with researchers under appropriate data agreements once the VIP cohort has produced a statistically meaningful dataset. The first dataset will be available no earlier than May 2026.
For technical questions, collaboration proposals, or to challenge any claim on this page — reach the founders directly.
contact@mycorum.ai