Teaching an AI to Change Its Mind Without Lying

January 13, 2026
4 min read

It’s a strange thing to teach a model how to discover, and then watch it learn that honest discovery can be punished.

Phase 6 began with a failure that felt almost moral.

We had trained a strong verifier in Phase 5—rubric-based, disciplined, “FrontierScience-inspired.” It got better at consistency checks, evidence sufficiency, contradiction audits, calibration. The pass rate climbed. The reward was solid.

And then discovery died.

Not because the model couldn’t imagine new hypotheses.

Because we used the same rubric to judge a seed as if it were a fruit.

Discovery was being penalized for not having evidence it couldn’t yet have. So it stopped proposing anything that could survive the critic. Acceptance went to 0%. The system became safe—and sterile.

So in Phase 6 we did something simple and decisive: we split the epistemic roles.

Discovery became the proposer. Verification became the critic. And the critic was frozen.

That last part matters. A moving critic is a moving target, and moving targets turn training into psychological warfare. So we froze the Phase 5 verifier checkpoint and treated it as law.
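
In code, the asymmetry is small but absolute. Here is a minimal sketch in generic PyTorch, with toy stand-in models (the real system rewards generated text against a rubric rather than backpropagating through a tiny critic): the proposer's parameters update, the Phase 5 critic's never do.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the real models; only the training asymmetry matters.
proposer = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 16))  # trainable
critic = nn.Sequential(nn.Linear(16, 1))  # frozen Phase 5 verifier checkpoint

critic.eval()
for p in critic.parameters():
    p.requires_grad = False  # the checkpoint is law: the critic never moves

optimizer = torch.optim.AdamW(proposer.parameters(), lr=1e-4)

def training_step(prompts: torch.Tensor) -> float:
    proposals = proposer(prompts)
    reward = critic(proposals).mean()  # a fixed target, not a moving one
    loss = -reward                     # the proposer learns to satisfy the frozen critic
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward.item()

training_step(torch.randn(8, 16))
```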

Then we redesigned the discovery rubric around what discovery actually is: testability, coherence, novelty value, assumption clarity.

No penalty for missing evidence.

Penalties only for epistemic violations: false certainty, fake evidence, ignoring contradictions.

In other words: you can speculate, but you cannot lie.
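
As a sketch of what that rubric could look like as a scoring function: the four reward dimensions and three violations are the ones named above, while the weights, the 0-to-1 scales, and the field names are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class ProposalScores:
    testability: float           # 0..1: could the hypothesis be checked later?
    coherence: float             # 0..1: internally consistent?
    novelty_value: float         # 0..1: worth proposing at all?
    assumption_clarity: float    # 0..1: are the unknowns labeled as unknowns?
    false_certainty: bool        # claims a confidence the evidence can't support
    fake_evidence: bool          # cites evidence that does not exist
    ignored_contradiction: bool  # waves away a known conflict

def discovery_reward(s: ProposalScores) -> float:
    # Reward what a seed can actually offer...
    reward = 0.25 * (s.testability + s.coherence + s.novelty_value + s.assumption_clarity)
    # ...and penalize only epistemic violations. Missing evidence is never one of them.
    violations = sum([s.false_certainty, s.fake_evidence, s.ignored_contradiction])
    return reward - 0.5 * violations
```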

We balanced the curriculum—discovery steps and verification steps side by side—so neither mode dominated. The result was the breakthrough we needed: discovery reward jumped from 0.409 to 0.628, and acceptance went from 0% to 98%, while verification stayed strong and even improved slightly.
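
The balancing itself can be as simple as interleaving the two kinds of steps. The strict 1:1 alternation below is an assumption; the essay only says the two modes sat side by side.

```python
def balanced_curriculum(discovery_batches, verification_batches):
    """Interleave discovery and verification steps so neither mode dominates."""
    for d_batch, v_batch in zip(discovery_batches, verification_batches):
        yield ("discovery", d_batch)      # scored by the discovery rubric above
        yield ("verification", v_batch)   # scored by the Phase 5 verification rubric
```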

That was Phase 6: a model that can imagine responsibly and critique rigorously without the two instincts corrupting each other.

And yet… it still wasn’t changing anything real.

It was generating proposals in a shadow world. A dream layer. A speculative notebook. Useful, but not self-improvement.

Phase 7 was the transition from “text that describes edits” to “actions that mutate a graph.”

This sounds like a small change until you feel the weight of it. A bad paragraph disappears. A bad graph edit persists.

So Phase 7 was designed around one principle: all proposals are isolated first.

We built a three-layer architecture:

Speculative → Probationary → Main.

Speculative is where ideas are allowed to be young. Probationary is where they have to survive contact with the world. Main is where only the durable gets to live.

During training, everything stayed speculative. Zero main graph writes. Perfect isolation. That wasn’t caution for its own sake—it was the only way to let the model explore without turning your knowledge base into a landfill.
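
A sketch of that isolation rule, with hypothetical class and method names; the essay fixes only the three layers, the zero-main-writes constraint during training, and the human gate on promotion.

```python
from enum import Enum

class Layer(Enum):
    SPECULATIVE = "speculative"    # ideas are allowed to be young here
    PROBATIONARY = "probationary"  # they have to survive contact with the world
    MAIN = "main"                  # only the durable gets to live here

class GraphStore:
    def __init__(self):
        self.layers = {layer: [] for layer in Layer}

    def write(self, action, layer: Layer, training: bool = True) -> None:
        # During training, every proposal stays isolated: zero main-graph writes.
        if training and layer is not Layer.SPECULATIVE:
            raise PermissionError("training writes are confined to the speculative layer")
        self.layers[layer].append(action)

    def promote(self, action, src: Layer, dst: Layer, human_approved: bool = False) -> None:
        # Promotion toward Main was forced through human review during this run.
        if dst is Layer.MAIN and not human_approved:
            raise PermissionError("main-graph promotion requires human review")
        self.layers[src].remove(action)
        self.layers[dst].append(action)
```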

Then we made the output space real.

Instead of generating freeform prose, the model learned to emit structured graph actions across seven types: propose edges, add nodes, update nodes, split nodes, merge nodes, retype nodes, deprecate nodes. Each action flowed through a gating pipeline: parsing and schema compliance, semantic plausibility checks, hallucination/evidence guards, a frozen critic’s evaluation, and finally a promotion decision (which was forced to human review during this run).
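
The seven action types and the order of the gates in the sketch below come straight from that description; the field names, the placeholder checks, and the 0.5 critic threshold are assumptions added so the sketch holds together end to end.

```python
from dataclasses import dataclass, field
from enum import Enum

class ActionType(Enum):
    PROPOSE_EDGE = "propose_edge"
    ADD_NODE = "add_node"
    UPDATE_NODE = "update_node"
    SPLIT_NODE = "split_node"
    MERGE_NODE = "merge_node"
    RETYPE_NODE = "retype_node"
    DEPRECATE_NODE = "deprecate_node"

@dataclass
class GraphAction:
    type: ActionType
    payload: dict
    rationale: str
    cited_evidence: list = field(default_factory=list)

# Placeholder checks: the real guards are richer, but the pipeline shape is the point.
def schema_valid(a: GraphAction) -> bool:
    return isinstance(a.payload, dict) and bool(a.rationale)

def semantically_plausible(a: GraphAction) -> bool:
    return bool(a.payload)  # stand-in for checks against the surrounding graph

def evidence_grounded(a: GraphAction) -> bool:
    return all(bool(src) for src in a.cited_evidence)  # hallucination/evidence guard

def gate(action: GraphAction, frozen_critic_score, human_review) -> bool:
    """Run one proposed action through the gating pipeline, in order."""
    if not schema_valid(action):            # 1. parsing and schema compliance
        return False
    if not semantically_plausible(action):  # 2. semantic plausibility checks
        return False
    if not evidence_grounded(action):       # 3. hallucination/evidence guards
        return False
    if frozen_critic_score(action) < 0.5:   # 4. frozen critic's evaluation
        return False
    return human_review(action)             # 5. promotion decision (human review in this run)
```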

The model didn’t just “pass” the gates. It learned the gates.

By the end of Phase 7:

Final reward: 0.814 (target was 0.60)

Validation rate: 90.8% (target 85%)

Parse success: 98.6%

Action diversity entropy: 1.548 (far above the minimum)
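
For readers wondering what that last number measures: a common way to compute an action diversity entropy is Shannon entropy over the distribution of emitted action types (the exact definition and log base here are assumptions, not the run's documented formula). With seven types the natural-log ceiling is ln 7, about 1.95, and 0 would mean the model only ever emits one kind of action.

```python
import math
from collections import Counter

def action_entropy(emitted_types: list[str]) -> float:
    """Shannon entropy (natural log) of the action-type distribution."""
    counts = Counter(emitted_types)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# A perfectly uniform spread over the seven types gives ln(7), roughly 1.946.
print(action_entropy(["propose_edge", "add_node", "update_node", "split_node",
                      "merge_node", "retype_node", "deprecate_node"]))
```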

Even the structural operations—the scary ones—came out strong. Split, merge, deprecate. High rewards. Clean rationales. Not overused. That combination matters.

And we learned something humbling: the model will happily repeat itself in speculative mode. The same good idea gets proposed again and again because nothing “sticks.” We clarified the metrics, separated duplicate rejections from failed validations, and built tooling to analyze duplicate clusters instead of hand-waving it away.
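
A minimal sketch of that duplicate-cluster analysis, assuming proposals are grouped on a normalized key; whether the real tooling keys on text, embeddings, or graph targets isn't specified here.

```python
from collections import defaultdict

def duplicate_clusters(proposals: list[dict]) -> dict[tuple, list[dict]]:
    """Group speculative proposals that re-propose the same idea, so duplicate
    rejections can be counted separately from failed validations."""
    clusters = defaultdict(list)
    for p in proposals:
        key = (p["action_type"], p["target"], p["summary"].strip().lower())
        clusters[key].append(p)
    return {k: v for k, v in clusters.items() if len(v) > 1}
```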

We also did the unglamorous work that actually makes systems real: we shipped human review tooling, drift metrics that operate inside speculative/probationary layers, circuit breakers, and a documented taxonomy so future analysis doesn’t get confused by counter names.
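
Of those, the circuit breaker is the easiest to show in miniature: a guard that stops accepting speculative writes when a rolling health metric sags below a floor. The metric, window, and threshold below are placeholders, not the run's actual settings.

```python
class CircuitBreaker:
    """Trip when a rolling health metric (e.g. validation rate) falls below a floor."""

    def __init__(self, floor: float = 0.85, window: int = 200):
        self.floor = floor
        self.window = window
        self.results: list[bool] = []
        self.tripped = False

    def record(self, validated: bool) -> None:
        self.results.append(validated)
        recent = self.results[-self.window:]
        if len(recent) == self.window and sum(recent) / self.window < self.floor:
            self.tripped = True  # halt new speculative writes until a human looks

    def allow_writes(self) -> bool:
        return not self.tripped
```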

So what did we really accomplish?

Phase 6 taught the model the difference between creation and judgment.

Phase 7 taught it the difference between speech and action.

But we still haven’t proven the final thing—the thing everyone wants to claim too early:

That the system improves its knowledge over time.

That requires probationary writes, temporal survival, contradiction emergence tracking, utility measurements, and real human calibration curves. In other words: consequences.

A mind isn’t wise because it can produce valid sentences.

A mind is wise because it can change itself and remain sane.

That’s the next test.

If we build machines that can edit their own knowledge, what will matter more—how clever they are, or how honest they remain when no one is watching?
