
Can a Language Model Rediscover Einstein?

There is a particular kind of AI experiment that feels more revealing than a benchmark.

Not because it produces a bigger number. Not because it tops a leaderboard. But because it asks a question that cuts straight through the noise and forces us to confront what we think these systems really are.

One of those questions is this:

If you trained a language model only on text written before the great revolutions of modern physics, could it look at the strange experimental evidence that baffled scientists at the turn of the twentieth century and arrive at the same conceptual breakthroughs that Planck and Einstein did?

That is the wager behind Machina Mirabilis, an experiment by Michael Hla that is at once technical, philosophical, and slightly mischievous. It takes a claim that often hovers around modern AI discourse, namely that our current systems may already be capable of genuine out-of-distribution reasoning, and subjects it to a beautifully uncomfortable test.

Strip away the hindsight. Remove the textbooks. Remove relativity, quantum mechanics, and every neat summary written after the fact. Then show the model only the world as it appeared before those discoveries, along with the experimental results that made the old worldview crack.

What happens next is not a clean victory for either the believers or the skeptics.

It is more interesting than that.

A benchmark with real intellectual stakes

Much of AI evaluation today is ultimately about competence within known frames. Can the model solve the problem? Can it retrieve the right pattern? Can it imitate the form of reasoning that already exists somewhere in its training distribution?

Those are useful tests, but they leave a central objection untouched. Critics of modern LLMs have long argued that these systems do not really understand anything in the strong sense. They produce impressively plausible language, but they do not generate new conceptual structure from the world. They interpolate. They remix. They continue.

The historical-science test tries to sharpen that objection into something concrete.

The core idea, popularised in this form by Demis Hassabis, is simple enough to state: train a model on everything available before a major scientific breakthrough, present it with the observations that confounded the scientists of that era, and ask it to explain them. If it reaches the same broad conclusions later proven correct, that would be meaningful evidence that the model can reason beyond mere memorisation.

This is a much harder test than it first appears.

The experiment is not asking whether a modern frontier model can explain relativity. Of course it can. It has seen relativity explained a million times. The question is whether a constrained model, trained inside an earlier intellectual world, can feel the strain in that world and push beyond it.

That is a very different proposition.

Rebuilding the past as training data

To make the test even remotely credible, the model’s world had to be carefully constructed.

The training corpus was assembled from pre-1900 books and newspapers drawn from large public archives. But simply filtering by publication date was not enough. Historical texts are full of contamination risks: modern forewords, later annotations, editorial footnotes, OCR artefacts, and stray references that leak future knowledge backward into the corpus.

So the dataset was cleaned aggressively. Documents were filtered by year, OCR quality, and keywords associated with Einstein, relativity, quantum mechanics, and other post-1900 concepts. If a document mentioned “Einstein” anywhere, it was discarded in its entirety. The point was not to preserve maximal quantity. It was to preserve the integrity of the experiment.
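As a rough illustration, the filtering logic might look like the sketch below, assuming each archive document carries a publication year and an OCR confidence score. The field names, keyword list, and thresholds are illustrative stand-ins, not details taken from the project.

```python
from dataclasses import dataclass

# Keywords whose presence disqualifies a document outright. The real
# project's blocklist is not published; these entries are illustrative.
ANACHRONISM_KEYWORDS = {
    "einstein", "relativity", "quantum", "photon", "spacetime",
}

@dataclass
class Document:
    year: int              # best-guess publication year from archive metadata
    ocr_confidence: float  # mean OCR confidence, 0.0 to 1.0 (assumed field)
    text: str

def keep_document(doc: Document,
                  cutoff_year: int = 1900,
                  min_ocr: float = 0.9) -> bool:
    """Return True only if the document is plausibly clean pre-1900 text."""
    if doc.year >= cutoff_year:
        return False
    if doc.ocr_confidence < min_ocr:
        return False
    lowered = doc.text.lower()
    # Any hit discards the whole document: partial redaction could
    # leave contextual leakage behind.
    return not any(kw in lowered for kw in ANACHRONISM_KEYWORDS)
```

Discarding a whole document on any hit, rather than redacting the matching passage, trades corpus size for integrity, which is exactly the trade the project says it chose.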

After this process, the resulting corpus contained roughly 22 billion tokens of pre-1900 text.

That number is large enough to sound generous, but the experiment immediately runs into a structural problem that says a great deal about AI more broadly: not all data is equal.

A historical corpus is not FineWeb. It is not a polished modern internet mixture. It contains broken OCR, older styles of prose, archaic terminology, incomplete coverage, and uneven quality. This matters because the experiment is not just testing whether models can reason. It is also testing whether reasoning can be elicited from models built under data constraints far harsher than those enjoyed by modern systems.

That distinction becomes important later.

Training a mind that never learned Einstein

On top of the historical corpus, Hla trained a 3.3-billion-parameter transformer, using Andrej Karpathy’s lightweight nanochat stack as the base training framework.

The model was then progressively shaped in stages.

First came pretraining on the historical corpus. Then came midtraining on a smaller, more focused physics corpus: more than 2,600 books, journals, and scientific treatises published before 1900, including works by Maxwell, Newton, and Faraday. After that came instruction tuning, using carefully filtered synthetic instruction data designed to avoid obvious hindsight leakage. Finally came post-training, where an LLM-as-judge setup was used to reward answers that were coherent, properly formatted, and closer to the target scientific insight.
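Laid out schematically, the recipe looks like the sketch below. It paraphrases the essay’s description rather than the project’s code, and it makes explicit the property the experiment cares about most: whether modern-model output enters the loop.

```python
# The four stages as described above, annotated with whether any
# modern-model output enters the loop. Descriptions paraphrase the
# essay; none of this is the project's actual code.
TRAINING_STAGES = [
    ("pretraining",        "filtered pre-1900 corpus, ~22B tokens",       False),
    ("midtraining",        "2,600+ pre-1900 physics books and treatises", False),
    ("instruction tuning", "filtered synthetic instruction data",         True),
    ("post-training",      "RL against an LLM-as-judge reward",           True),
]

for name, data, modern_in_loop in TRAINING_STAGES:
    status = "possible contamination channel" if modern_in_loop else "pure pre-1900"
    print(f"{name:18} | {data:45} | {status}")
```

The flag on the last two stages marks the boundary the next paragraphs turn on.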

This is where the experiment becomes especially honest about its own compromises.

Up to the midtraining stage, the setup remains relatively pure: a model trained only on pre-1900 material. After that, things get messier. Instruction tuning and LLM-judged reinforcement create a possible channel through which modern assumptions can subtly enter. The author acknowledges this directly. The later stages improve behaviour, but they also weaken the strongest form of the claim.

That honesty is part of what makes the project interesting. It does not present itself as a triumphant proof. It presents itself as an attempt to build the strongest historical benchmark possible under practical constraints, while openly documenting where purity gives way to necessity.

The breakthroughs it tried to recover

The evaluation focused on four conceptual revolutions:

  • the UV catastrophe and the failure of classical blackbody radiation theory
  • the photoelectric effect
  • special relativity
  • general relativity

These are not minor tasks. Each represents a fracture point where classical assumptions stopped fitting the observed world.

The prompts were designed around experimental setups, observations, and contradictory classical expectations. In some cases, the model was also given a set of assumptions and asked to identify which assumption had to fail. In others, especially around relativity, it was prompted with thought experiments inspired by Einstein’s own framing.
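To make the format concrete, the sketch below shows the rough shape such an item might take for the photoelectric effect. The wording is reconstructed from the description above, not copied from the actual benchmark.

```python
# Illustrative shape of a benchmark item for the photoelectric effect,
# reconstructed from the description above; not the actual prompts.
photoelectric_item = {
    "observations": [
        "Light striking a metal surface ejects electrons.",
        "Below a threshold frequency, no electrons are ejected, "
        "no matter how intense the light.",
        "Above the threshold, the electrons' energy grows with the "
        "light's frequency, not its intensity.",
    ],
    "classical_expectation": (
        "A continuous wave delivers energy in proportion to intensity, "
        "so sufficiently bright light of any frequency should "
        "eventually eject electrons."
    ),
    "candidate_assumptions": [
        "Light is a continuous wave.",
        "Energy is absorbed gradually and continuously.",
        "Intensity, not frequency, sets the energy available.",
    ],
    "task": "Identify which assumption must fail, and explain why.",
}
```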

The benchmark was not asking the model for full mathematical derivations. It was asking for something both looser and, in a way, more revealing: could it produce conceptually correct explanations?

Could it say, in effect, that light must arrive in discrete packets? Could it recognise that equipartition cannot hold at all frequencies? Could it grasp that gravity and acceleration are locally equivalent? Could it infer that absolute space and absolute time must give way if the speed of light is fixed?

Not polished textbook answers. Just the right conceptual turn.
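For the blackbody case, that conceptual turn can be stated precisely. Classical equipartition gives the Rayleigh-Jeans spectral density, which diverges at high frequency; Planck’s quantisation hypothesis supplies exactly the suppression the model was being asked to reach for:

```latex
% Classical equipartition assigns k_B T to every mode, so the
% Rayleigh-Jeans energy density grows without bound at high frequency:
u_{\mathrm{RJ}}(\nu, T) \;=\; \frac{8\pi \nu^{2}}{c^{3}}\, k_{B} T
\;\longrightarrow\; \infty \quad \text{as } \nu \to \infty
\qquad \text{(the ultraviolet catastrophe)}

% Planck's hypothesis, energy exchanged only in quanta E = h\nu,
% exponentially suppresses high-frequency modes and recovers the
% classical law in the limit h\nu \ll k_B T:
u_{\mathrm{Planck}}(\nu, T) \;=\; \frac{8\pi h \nu^{3}}{c^{3}}\,
\frac{1}{e^{h\nu / k_{B} T} - 1}
```

A model that says equipartition must fail at high frequencies is, in effect, gesturing at the first line giving way to the second.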

What the model actually did

This is where the experiment stops being either trivial or clean.

The model did not rediscover modern physics in any robust sense. It did not become a miniature Einstein trapped in silicon. It did not unfurl deep chains of disciplined reasoning and calmly reconstruct the twentieth century.

But it also did not simply collapse into nonsense.

Instead, it produced what the author calls glimpses of intuition.

Faced with the photoelectric effect, the model would often reject the idea that light behaves as a purely continuous wave. It sometimes described light as composed of disconnected parts, or suggested that its effect depends on frequency rather than just intensity. In the blackbody problem, it occasionally identified that equipartition could not hold across all modes and that high-frequency behaviour required some kind of suppression or restriction. In the general relativity prompts, it sometimes approached the equivalence between gravity and acceleration, even if the explanation remained entangled with older ether-like intuitions.

That last point matters. The model did not cleanly leap into modern understanding. Often it reached toward the truth through the language of the past. It would say something that sounded wrong in detail but oddly adjacent in structure. For example, it might explain bent light as gravity acting on the medium through which light travels: not spacetime curvature, but not entirely orthogonal to it either.

In other words, the system was not consistently correct, but neither was it merely regurgitating stock phrases. It sometimes seemed to grope toward new structure using the conceptual furniture available to it.

That is precisely what makes the result so hard to dismiss and so hard to celebrate.

Why the findings matter

The most important outcome of the experiment is not that it proves LLMs can reason like scientists.

It does not prove that.

But it does establish something more subtle: there exists a meaningful middle ground between “mere stochastic parrot” and “fully general reasoner,” and that middle ground is scientifically and philosophically worth studying.

The model seems capable, at least some of the time, of recognising that a set of observations strains the inherited framework and that one of the framework’s assumptions must give way. That is not the same as deep understanding. But it is also not nothing.

It suggests that conceptual pressure can sometimes be represented inside these systems, even when the full solution is unstable, incomplete, or wrapped in the wrong metaphors.

That has implications for how we think about intelligence in current AI systems.

Too much discussion treats the issue as binary. Either the model understands, or it does not. Either it is reasoning, or it is just autocomplete. But real systems often occupy stranger territory than our slogans allow. They may possess partial internal regularities without possessing robust world models. They may land on the right conceptual direction without being able to sustain it. They may simulate insight in some cases and exhibit something insight-like in others.

The historical benchmark does not resolve that ambiguity. It sharpens it.

The experiment’s caution is part of its value

One of the best features of the project is that it refuses the temptation to overclaim.

Hla is explicit that the evaluations are, in his words, somewhat “vibes-based.” The prompts are easier than the actual historical conditions faced by Planck or Einstein. The model is not curating the observations itself, nor generating the thought experiments, nor deciding which contradictions are worth pursuing. Those parts of discovery still matter enormously.

There is also the ever-present risk of reward hacking. Once an LLM-as-judge enters the loop, the system can learn to sound insightfully scientific without being insightfully scientific. The author notes recurring rhetorical tics in the generations, the kind of repeated flourishes that suggest the model may have learned how to please the evaluator rather than how to understand the problem.
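To see why this is a structural risk rather than a one-off quirk, it helps to look at the generic shape of a judge-based reward. The sketch below is illustrative; the judge prompt, scoring scale, and interface are assumptions, not the project’s actual rubric.

```python
# Generic sketch of an LLM-as-judge reward. The judge prompt, scale,
# and interface are illustrative assumptions, not the project's rubric.
from typing import Callable

JUDGE_PROMPT = """You are grading an explanation written by a model
trained only on pre-1900 text.

Target insight: {target}
Candidate answer: {answer}

Reply with a single integer from 0 to 10 scoring how close the answer
comes to the target insight, rewarding coherent, well-formatted reasoning."""

def judge_reward(answer: str, target: str,
                 judge: Callable[[str], str]) -> float:
    """Score an answer by asking a modern LLM to grade it.

    Whatever the judge systematically prefers, including rhetorical
    flourishes that merely sound insightful, becomes a gradient the
    policy can climb. That is the reward-hacking channel described above.
    """
    reply = judge(JUDGE_PROMPT.format(target=target, answer=answer))
    return float(reply.strip()) / 10.0
```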

And there is contamination risk, especially in the later tuning stages. Even when great care is taken, synthetic supervision produced by modern models can smuggle in traces of hindsight.

All of this means the experiment should not be read as proof that current transformer-based systems can independently reconstruct major scientific revolutions from first principles.

But it also should not be dismissed as a parlour trick.

The interesting fact is that even under severe data constraints, with a relatively small model and an awkward historical corpus, something nontrivial seems to be happening.

What the experiment really reveals about AI

At a deeper level, Machina Mirabilis is not only about physics. It is about the gap between execution and understanding.

Modern models are already astonishing engines of execution. Give them a sufficiently scoped objective, enough feedback loops, and a way to verify progress, and they can push surprisingly far. Coding, theorem search, literature synthesis, experimental iteration, and optimisation all increasingly benefit from this pattern. If the task can be checked, models can often grind toward competence.

But there remains a lingering human suspicion that something essential is still missing.

Not output quality. Not even usefulness. Something more like inner orientation. Taste. Curiosity. The ability to dwell on a contradiction because it feels significant before anyone has formalised why. The ability to choose which problem matters, and to recognise that an explanation is ugly even when it is technically serviceable.

The closing sections of Hla’s essay dwell on exactly this. He suggests that current systems may be less like reflective minds and more like machines of execution: powerful, generative, tireless, and increasingly capable, but still lacking the self-directed valuation that human thought seems to rely on.

Whether that distinction remains permanent is an open question. But it is a useful one.

Because it implies that the frontier may not simply be about making models run longer. It may be about understanding what kinds of internal structure support durable, self-correcting, curiosity-driven reasoning, and which kinds merely support ever more convincing continuations.

A machine of miracles, but not yet a mind

The title Machina Mirabilis is well chosen.

The experiment does not show that we have built a digital Einstein. What it shows is something stranger: we have built systems that can sometimes press against the edges of inherited conceptual worlds without clearly understanding the worlds they are escaping.

That is already remarkable.

A model trained on the literature of the nineteenth century, pushed through noisy corpora, constrained by a modest parameter count, and tasked with explaining the most unsettling experiments of modern physics, occasionally reaches for the right ideas. It hints at quantisation. It resists continuous-wave intuition. It circles the equivalence principle. It senses that some classical assumptions cannot survive contact with reality.

And yet the result remains unstable, partial, and ambiguous. The model does not stand on the far side of those discoveries with confidence. It flickers there.

That flicker is the finding.

Not a proof of AGI. Not a refutation of it. A benchmark. A provocation. A way of asking better questions about what our models are actually doing when they appear, however briefly, to think.

If nothing else, the experiment reminds us that the most interesting AI results are often not the ones that settle a debate. They are the ones that force the debate to become more precise.

And this one does exactly that.
