On theoretical physics, AI and human creativity
Even if we're on a sigmoid, not an exponential, its middle part is looking pretty damn elongated to me...
On a peaceful Friday around a month ago, I was mindlessly browsing the internet when I stumbled upon a new blogpost by OpenAI. The post began with a rather bold title: GPT‑5.2 derives a new result in theoretical physics. Usually, when people ask me what I think about AI making advances in cutting-edge science (for example, Epoch’s FrontierMath benchmark or Sakana’s AI Scientist), I am unable or hesitant to comment. Many people do not realise the extreme level of specialisation one undergoes during their PhD. They often intuitively understand that somebody specialising in maths will have no idea about frontier physics, at least not beyond some shared body of knowledge they might have learnt in 3rd-year undergraduate courses. However, they often struggle to make this distinction between sub-areas of a given area. Surely your particle physicist nephew will be able to tell you everything about lasers or telescopes during a family reunion? Well, no, they won’t.
But it goes much deeper than that — researchers specialising in different sub-sub-areas of a common sub-area will often be unable to convey the details and intricacies of their work to each other. I was frequently very confused when other PhD students in my cohort would start telling me about neutrino physics or Monte Carlo simulations for the LHC (Large Hadron Collider), despite the fact that we all belonged to the same sub-area of physics called ‘particle physics’ and spent multiple years working in the same room. The body of scientific knowledge is vast and the level of specialisation within it is probably much greater than you think.
For this reason, I was about to scroll past the blogpost, thinking it would be some result in cosmology or condensed matter physics, which I last touched ~7 years ago. Something piqued my curiosity, though, so I swallowed the bait, clicked the link and… I could not believe what my eyes were seeing. The titular result, as well as the corresponding paper, were exactly what I spent 4 years of my life studying, to the extent that I was familiar with >50% of the paper’s bibliography. Until then, I had never been able to assess AI capabilities at the cutting edge of scientific knowledge so confidently, so I immediately got very excited, opened my PhD thesis for the first time since graduating 2 years ago, and spent many, many hours reminding myself of the relevant details, all with the goal of making a judgement: are we about to become obsolete?
In this post, I will explain the relevant physics (very briefly), share my thoughts about how impressive GPT-5.2’s achievement is, and write down some observations on AI progress, the nature of creativity, as well as the level of public discourse regarding ‘AI hype vs reality’.
What on earth are ‘single-minus gluon tree amplitudes’?
[You can skip this section if you’re only interested in the AI part and not physics.]
I will not spend much time diving into the gory details of particle physics. They’re difficult and irrelevant to our discussion here. If you really want to go for it, David Louapre has preempted me with his article about the OpenAI paper — he goes into way more physics detail there, but spends much less time talking about the AI part.
We will dissect this concept by analysing it from end to start. Don’t worry, it will get easier at every step.
Amplitudes
‘Scattering amplitudes’ are mathematical objects that, roughly speaking, allow physicists to calculate probabilities of particles interacting (or ‘scattering’) with each other. It is important to calculate them to a very high degree of accuracy — the same probabilities that are calculated on paper are then compared to the observed, empirical probabilities measured at places like the LHC. Many people think that physicists are looking for these two numbers to agree, i.e. for theory to match reality. It turns out it’s the other way around. Agreements are actually quite boring, because they just confirm what we already suspected to be true. Especially in the context of the Standard Model of particle physics, which has been around for decades, they really aren’t that exciting.
What is exciting, however, is to observe a genuine disagreement between theory and experiment. When this happens, we get a glimpse beyond the Standard Model — perhaps there is a new interaction or a new particle that our current theory has not taken into account? This ‘progress by falsification’ is precisely the Popperian Logic of Scientific Discovery1.
There is also a deeper reason why scattering amplitudes are really interesting and important. Such amplitudes, after all, are mathematical objects — objects derived from theories of the physical world. These theories in turn need to encode fundamental properties like conservation of momentum or even that probabilities of all particle interactions must sum to 1. Sometimes, physicists discover unexpected structure in the expressions for these amplitudes. Similarity between amplitudes for two different theories can uncover a hidden connection, while new, more efficient ways of calculating these expressions can often give us insights about aspects as fundamental as the geometry of space-time.
Trees
Is there enough space for entire trees in the tiny world of particle physics? Turns out there is plenty, and also for loops and legs2. Jokes aside, here is what ‘trees’ actually mean in particle physics jargon. As explained above, scattering amplitudes are calculated based on sophisticated theories of quantum fields. In principle, such theories give a precise, well-defined mathematical prescription for calculating the associated amplitudes between particles. In practice, however, applying these mathematical ‘recipes’ exactly as intended turns out prohibitively hard and requires solving complicated integrals that we simply don’t know how to handle.
Instead, physicists resort to using approximations. Essentially, instead of calculating these amplitudes exactly, they calculate the easiest terms first, then apply a more complicated correction, then an even more complicated correction, ad infinitum. At each such step, also known as an ‘order’, these terms can be graphically represented as the famous Feynman diagrams. The ‘leading order’ is the easiest to calculate, constitutes the bulk of the full answer, and can be represented as Feynman diagrams with no closed loops. Such diagrams are known as ‘tree diagrams’:

In reality, is it a good approximation to just take the tree-level terms and neglect the more sophisticated corrections? No, absolutely not. For modern experiments conducted at the LHC, the precision required is on the order of single %, sometimes even below 1%. This demands the inclusion of Feynman diagrams with 1, 2 or even 3 closed loops. These are much, much more complicated to calculate and require large computing resources (although nothing compared to AI data centres…)3. This is what kept me busy in my PhD.
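To get a rough feel for why loops matter at this level of precision, here is a back-of-the-envelope sketch. It assumes that each extra order is suppressed by one more power of the strong coupling (taken as roughly 0.118, its approximate value at the Z-boson mass scale) and that all numerical coefficients are of order one. That second assumption is mine and real processes often violate it badly, so treat the numbers as a cartoon, not a prediction. The function name and labels are made up for illustration.

```python
# Naive size of each perturbative order relative to the leading (tree-level) term.
# ASSUMPTIONS: coefficients of order one, and a coupling of ~0.118.
ALPHA_S = 0.118  # approximate strong coupling at the Z-boson mass scale


def naive_term_sizes(max_order, coupling=ALPHA_S):
    """Return coupling**k for k = 0..max_order (relative size of each order)."""
    return [coupling ** k for k in range(max_order + 1)]


terms = naive_term_sizes(3)
labels = ["LO (tree)", "NLO (1 loop)", "NNLO (2 loops)", "N3LO (3 loops)"]

for label, size in zip(labels, terms):
    print(f"{label:>15}: ~{100 * size:.2f}% of the leading-order term")
```

Under these crude assumptions, the one-loop term is a correction of order 10% and the two-loop term is of order 1%, which is why percent-level targets push calculations to two loops and beyond.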
Gluons
Gluons are simply a type of fundamental particle. That’s it. Specifically, gluons ‘mediate’ interactions between other particles called quarks. You are probably familiar with photons: you associate them with light or maybe even with the electromagnetic field, if you remember high-school physics. Properly speaking, photons are the mediators of interactions between any particles that possess an electric charge. So, naturally, two electrons can interact and emit/absorb a photon. Quarks are also electrically charged, so they can also interact via a photon. This is precisely what is going on with the blue line in the diagram above. Gluons are similar, but they mediate interactions between any particles that have a so-called ‘colour charge’4. The reason why scattering amplitudes involving gluons are interesting is simply that many processes occurring at the LHC are mediated by gluons.
Single-minus
You’ve probably heard at some point in your life that electrons have ‘spin’. Many other particles have spin as well, including gluons. In fact, the only particle in the Standard Model without spin is… the famous Higgs boson. Spin is actually a super difficult thing to understand, so we’ll not go into any details here5. If you imagine a gluon travelling through space in a given direction, its spin can, roughly speaking, be either aligned with this travel direction or anti-aligned (i.e. 180 degrees opposite). This is what’s known as ‘helicity’: particles whose spin points in the same direction as their motion are ‘right-handed’ (positive helicity), while those whose spin points in the opposite direction are ‘left-handed’ (negative helicity).
Helicities are important for scattering amplitudes simply because in a real experiment at the LHC, we do not get to control them. To get the full picture, we need to consider each case separately — calculate the amplitudes for each helicity configuration and add them up. For example, a single-minus configuration means that only one particle involved in the interaction is left-handed, while all others are right-handed. As we will see below, certain configurations lead to surprisingly simple expressions for their corresponding scattering amplitudes.
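To make the bookkeeping concrete, here is a small sketch that lists every helicity configuration for a given number of gluons and groups them by how many are left-handed. It only does the counting and computes no amplitudes. The function names are mine, and it glosses over the convention (all particles treated as outgoing or not) that decides which physical state gets which label.

```python
from collections import Counter
from itertools import product


def helicity_configurations(n_gluons):
    """Return every assignment of '+' or '-' helicity to n_gluons gluons."""
    return list(product("+-", repeat=n_gluons))


def count_by_minus(n_gluons):
    """Count configurations, grouped by the number of negative-helicity gluons."""
    return Counter(cfg.count("-") for cfg in helicity_configurations(n_gluons))


n = 5
counts = count_by_minus(n)
for n_minus in sorted(counts):
    print(f"{n_minus} minus: {counts[n_minus]} configurations")

# The 'single-minus' configurations are those with exactly one '-' label.
single_minus = ["".join(cfg) for cfg in helicity_configurations(n) if cfg.count("-") == 1]
print(single_minus)
```

For n gluons there are 2^n configurations in total, of which exactly n are single-minus, one for each choice of the left-handed gluon.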
Putting everything together
We are now in a position to understand what on earth the OpenAI paper is about.
amplitudes: objects related to the probabilities of collisions between elementary particles. Important to calculate precisely, but hard to do so.
tree: amplitudes only at the first level of the approximation. No closed loops.
gluons: elementary particles mediating interactions between quarks. Important for the LHC.
single-minus: amplitudes where only one particle has its spin pointing in the opposite direction to where it’s travelling. All others have their spin aligned with the travel direction.
The paper
So, what’s the big deal about the paper? What is it about these ‘single-minus gluon tree amplitudes’ that apparently GPT-5.2 managed to derive?
The big picture
There is a very famous result in the study of scattering amplitudes from 1986 called the Parke-Taylor formula. It pertains to tree-level amplitudes (no loops in Feynman diagrams) for interactions where exactly two gluons have a ‘minus’ helicity, while all others are of positive helicity. For these cases, the expression is extraordinarily simple — a basic ratio of just a few factors related to the momenta of the gluons (you can view it here if you want). When exactly three gluons have a ‘minus’ helicity, the corresponding expression for the amplitude is not so simple any more.
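For readers who want to see what ‘extraordinarily simple’ means, the double-minus (Parke-Taylor) result is usually quoted in spinor-helicity notation roughly as below. Here gluons i and j carry negative helicity, and each angle bracket is a spinor product built from the momenta of two gluons. I am deliberately dropping overall coupling and colour factors, and sign and normalisation conventions differ between references, so read this as the shape of the formula and not a precise statement.

```latex
A_n^{\text{tree}}\left(1^+, \ldots, i^-, \ldots, j^-, \ldots, n^+\right)
  \;\propto\;
  \frac{\langle i\, j \rangle^{4}}
       {\langle 1\, 2 \rangle \, \langle 2\, 3 \rangle \cdots \langle n\, 1 \rangle}
```

The whole n-gluon answer is a single fraction: one bracket raised to the fourth power on top, and a cyclic product of neighbouring brackets underneath.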
But what happens when exactly one gluon is of negative helicity? Well, it turns out that, due to fundamental physical reasons beyond the scope of this article, this expression must vanish, i.e. be equal to zero6. Physicists care about this for two reasons. The first reason is that, as alluded to earlier, simplifications in amplitude computations can hint at some hitherto undiscovered structures or connections between theories. The second reason, a more pragmatic one, is that tree-level amplitudes are often used as building blocks for more complicated ones7. So any simplification, or an expression that vanishes, is much appreciated.
The Parke-Taylor formula for the double-minus case, as well as the single-minus case being 0, are famous results that have been drilled into my head for half of my PhD. In fact, I have extensively used these results, both in computer code written for such calculations and also by hand. So, the paper’s straight-to-the-point title, ‘Single-minus gluon tree amplitudes are non-zero’ is highly surprising and counterintuitive to me.
Human vs machine contributions
It would definitely be unfair to describe the paper as AI-written, because it clearly isn’t. However, it would also be unfair to say that the physicists wrote it entirely on their own and used GPT-5.2 purely as a ‘helpful assistant’ chatbot. So, how much did AI contribute and where exactly? After spending several hours looking at the paper, in my mind there are three main parts to it.
The human-only part
In the first part (Section I.A), the authors find a flaw in the argument typically used to justify why single-minus gluon tree amplitudes are zero. Or rather, they realise that this argument doesn’t hold in a specific kinematic configuration (basically, when the particle momenta take special values). However, this configuration is not supported in the so-called Minkowski spacetime, the standard spacetime used in particle physics and the one that most closely describes our universe. To arrive at this realisation, you have to dig into the logic behind the argument and actively look for holes in it. The maths behind it is not super complicated: a few steps chained together that are understandable to anyone specialising in amplitudes. But the intuition that the standard argument might break in a special configuration is not obvious at all. I personally wouldn’t have tried to explore this direction, and never did. I knew about the result, understood the proof and simply used it in my daily work. I had no reason to question it.
By analysing the paper and a Twitter thread by Alex Lupsasca, one of the authors, it seems almost certain that this original insight involved only humans and not AI. Hats off to the researchers for spotting this non-trivial loophole in the standard argument presented in textbooks. It would be interesting to see if, in hindsight, one could use AI also to make this observation in the first place. Alright, I overcame my laziness and actually tried to do it with Claude Opus 4.6 (Extended Thinking), see the Appendix for details. Opus 4.6 was able to spot the flaw in the textbook argument, but only when I gave it a hint. Sonnet 4.6 was also able to spot it, by the way.
The human+AI part
Having realised that under special circumstances, the single-minus gluon amplitudes are not zero, the authors then set out to calculate an expression for them. This led to Eq. 21, which is a general formula that can be used to obtain expressions for the case of 3, 4, 5, 6, … gluons (Eqs. 29-32).
These expressions were growing complicated very, very quickly, which is typical in particle physics. It’s not even that they contain some super complicated mathematical functions — they just get extremely long as you consider collisions of more and more gluons8. However, it’s also quite common that such expressions can undergo surprising simplifications if you rewrite them in a clever way, just like with the Parke-Taylor formula I mentioned above. As such, the authors turned to GPT-5.2 Pro, which helped them identify an even more special configuration of the particles’ momenta, referred to as R1. In this setup, the expressions indeed simplify dramatically, leading to the nicely factorised formulas in Eqs. 35-38. Finally, based on these 4 worked out examples, the model conjectured a simplified version of the general formula of Eq. 21. This conjecture, Eq. 39, is the main result of the paper.
As the transcripts for this conversation were not released, it’s hard for me to speculate what exactly it means that GPT-5.2 ‘helped identify’ this special configuration of momenta. However, considering the transcript that was released regarding the followup paper (see Addendum), I wouldn’t be surprised if the human input here was minimal, something like ‘go look if there is some special kinematic configuration in which we can simplify these expressions’.
The AI-only (?) part
To prove the conjecture in Eq. 39, the team used ‘an internal scaffolded version of GPT‑5.2’. The rest of the paper (Section II.C) is a sketch of this proof. I don’t follow it fully (specifically the part in II.C.1), but it is definitely not a straightforward textbook-style exercise. It involves multiple non-obvious realisations chained together into one neat logical argument.
Once again, without an exact transcript, it’s hard to understand in detail the pathways that the model considered and how much help it needed. However, the phrasing in the announcement (‘spent roughly 12 hours reasoning through the problem’) suggests that this was a very independent effort, perhaps with only slight nudges to keep going (this is also in line with what I describe in the Addendum). Or maybe it was an entirely uninterrupted effort over 12 hours, unlike the ‘back-and-forth’ with GPT-5.2 that the researchers had in the previous section? Tough to say.
Discussion
Impressions
So, overall, how impressed should we be by this paper?
Regarding the back-and-forth between humans and AI
This is the part where GPT-5.2 managed to identify a special configuration of momenta, R1, and simplify complicated expressions under these assumptions. I believe this stage is already where it gets sufficiently impressive to be taken seriously.
I can already imagine an onslaught of comments like: ‘omg what’s the big deal, AI simplified an expression, it’s basically a more sophisticated calculator’. This is a massive strawman argument. The sort of manipulations that are involved in these calculations are not trivial and need to be applied in a clever manner. It’s NOT a matter of just expanding brackets like (a+b)(a-b) or something like that. Yes, such manipulations can be automated with computer algebra systems like Wolfram Mathematica, but in order to do so, you already need to have a strong sense of how exactly to apply them, in what order, how to collect the terms, how to factorise them, and so forth. It’s much more involved than hitting a magic button called ‘Simplify’ on Eqs. 29-32 and immediately getting their simplified forms, Eqs. 35-38. If you do it incorrectly, you go back to square one or get some outcomes that are not useful9.
The fact that AI has managed to do this does not imply a super-human skillset, but it does imply a PhD-level skillset. How could one claim that this is hype if 5 years ago LLMs were barely able to do basic modular arithmetic? Similarly, this particular example on its own is not enough to claim ‘novel results in theoretical physics’. However, it does mean that AI is now capable of performing a large chunk of what a PhD student spends their time on. You’d be surprised how much time is devoted to ‘grindwork’ as opposed to this romanticised vision of scientists with grey beards brainstorming in front of a blackboard.
Regarding AI’s final proof of the conjecture
I’ll be direct — as someone that specialised in precisely such amplitude calculations, there’s no chance I would be able to come up with this proof without guidance from somebody else. It’s just too many pieces of knowledge, from too many ‘nooks and crannies’ at the boundaries of theoretical particle physics, for me to be able to combine them together in a proof like that. This is not to say that there are no humans capable of coming up with this proof, but the bar is very high — probably requiring a team of collaborators above PhD level and with tons of experience.
I find the proof described in the paper just about understandable with serious effort. That, of course, is completely different from being able to come up with it in the first place. Unless the authors had to walk the model through individual steps (the way they would have to walk me through them) — and it doesn’t look like they did that — it’s impossible for me not to be impressed.
Regarding the followup paper on gravity (see Addendum)
I was honestly very surprised by this paper — generalising results from the case of gluons (carriers of the strong force) to gravitons (carriers of gravity) is NOT easy at all. There are profound, fundamental differences between the two theories and you have to use different sets of properties, relations, theorems, etc. in the case of gravity. Suffice it to say that while I understand ~90% of the original paper on gluons, I understand maybe ~30% of this later paper on gravitons.
The fact that the authors essentially dumped the first paper into the model and asked it to generalise it to gravity is very, very impressive. This is the sort of stuff that would take at least weeks, if not multiple months, of concentrated effort by researchers at the top of their game.
Regarding the initial insight, which was made by humans
I foresee that if someone wants to be skeptical of this paper and downplay AI’s accomplishments, they will latch onto the ‘Human-only part’ I described above. That is, they will say that the initial insight, which came from humans, is what really matters — that it’s one thing to prove closed-ended conjectures where the results can be verified, but looking for holes in dogmatic particle physics literature is a different level entirely. I agree that they are significantly different. Is AI already good enough to be simply told ‘go do theoretical physics’ and we can come back a day later to see major breakthroughs? No, it’s not, but as I argue later on, it’s an unreasonable bar to have before we take AI in science seriously. Moreover, I’m not sure whether such creative insight is going to remain ‘for human eyes only’ much longer (see the proxy experiment in the Appendix).

The overall verdict
I tried very hard not to be impressed by all this and… I failed. Whether it’s the kind of (i) very helpful and qualified back-and-forth between human and AI about cutting-edge concepts, (ii) proposing and then proving conjectures, or (iii) literally generalising entire papers from quantum chromodynamics to quantum gravity, this is very good. As I have spent many words arguing, the assistance from AI was genuinely strong, and it’s scary to think that ChatGPT was released not even 4 years ago.
While the initial creative insight indeed came from humans, and for now we don’t have a way of initiating such super open-ended explorations in LLMs, I think I feel comfortable publicly declaring: this is my personal ‘Move 37’ moment.
Creativity and intelligence
It’s pretty clear to me that we are now reaching a stage where the boundary between ‘LLMs blindly recall information from their training data’ and ‘LLMs show genuine creativity at the literal frontier of science’ gets very, very blurry.
When deciding on whether LLMs can display ‘true creativity’, I believe it’s highly unfair to pitch them against some underspecified, purist definition of the term. We should be comparing them against the human baseline instead. And with humans, it is not correct that true scientific creativity involves inventing things from scratch. Nothing in cutting-edge science is invented from scratch — we all ‘stand on the shoulders of giants’ and ‘merely’ connect previously seen facts and observations. The extent to which such inventions appear impressive grows with the ‘distance’ between these ideas. Connecting two ideas from the same page of an undergraduate textbook is not equivalent to proving Fermat’s Last Theorem by connecting modular forms and elliptic curves, as Andrew Wiles famously did. But even though LLMs are not yet ready to zero-shot the whole Langlands program, they clearly are capable of connecting concepts familiar only to a small group of well-trained PhD-level theoretical physicists.
Regarding my toy test in the Appendix, upon further reflection, I do stand by my claim that I am less than 50% confident in my ability to match Opus 4.6’s answer when given the same prompt. When it comes to the final proof of the conjecture, my confidence in producing this myself, without strong assistance, is <5%. While I definitely wasn’t the best PhD student ever, I am also doubtful that most of my peers would fare better. To say that what Opus did there constitutes mere recall of training data, as opposed to creativity, is within everyone’s right, but it also implies that none but the brightest PhD students are capable of creativity. Are you ready to make this claim?
Intuition
I do think it’s important to remember that the original insight of this paper came from humans. This magical factor, otherwise known as ‘intuition’, is still something where humans have an advantage over AI, even if it turns out that AI will acquire such intuitive research sense in the future. It’s humans that had to first ask the question: ‘what if this statement contained in multiple advanced Quantum Field Theory textbooks is… wrong?’. This, in my opinion, is a higher bar to reach than the misguided notion of ‘creativity’, which I think will soon become indistinguishable from an LLM exploring various RL rollouts and combining facts across a wide range of research areas. Figuring out what to explore in the first place can be much more challenging. In advanced maths, there’s a common saying that sometimes the real art is not figuring out the answers, but rather figuring out the right questions to ask.
I’m not sure how/if the current LLM paradigm will get us to the other side of this chasm, at which point AI in maths and physics would go from ‘augmentation’ to ‘automation’. But with every month I have less and less confidence that the chasm will remain without a bridge forever, either due to LLMs or some other future systems. After all, everything seems like magic when you don’t understand it. Current human inventions and discoveries would be magic to Leonardo da Vinci. Heck, they would be absolute sorcery even to Max Planck, Albert Einstein and the Wright brothers.
We don’t fundamentally understand the nature of research intuition, so we tend to assign it exclusively to our own kind. It is entirely possible that there is some secret sauce that makes humans unique and that it can’t be reproduced by machines. But given the recent history of AI, I would be cautious about assuming it. The constant goalpost shifting is quite telling here: for AI skeptics, chess was an archetypal representation of human creativity and ingenuity until Deep Blue beat Garry Kasparov, and then it was just about raw computational power and pattern matching. How do we know that scientific intuition itself does not arise in a similar manner?
'Sure, these modern humans learned to bang two rocks together and make sparks. That’s pretty fun, but what can they actually do with this? They're just stochastic apes who learned to pattern-match faster than us.' — a Neanderthal, approximately 40000 years ago
However, I do concede that I have a large amount of uncertainty here. Two things can be true at the same time: I am skeptical of claims that AI progress is surely just going to peter out as if described by a sigmoid, but I also want to see much more evidence that the ‘intuition chasm’ from above will be bridged. In particular, the lack of clean data with which to train such intuition into LLMs seems to be a problem. Besides, the concept of research intuition is so open-ended that I’m not sure how it could be taught with RL. To reduce my uncertainty, I would like to learn more about the following topics:
what is the nature of intelligence? I’m not very strong on the philosophical foundations of intelligence, computation and agency. I’ve been meaning to read this book called What is Intelligence? released by the Antikythera institute. I’ve heard very good opinions of it.
I would also like to know more about RL, and in particular RL with Verifiable Rewards. I am very curious as to what sort of a system this ‘internally scaffolded GPT-5.2’ was. Was it really a better scaffolding for the same model that is available publicly to everyone? It was suggested to me in a conversation that it could also have been a variant heavily optimised for pure RL rewards, without considering safety, alignment or even being easy to converse with as a chatbot. Perhaps it was actually an ensemble of multiple RL models, or at least one model performing several rollouts in parallel, with a separate ‘scoring model’ aggregating these results and picking the best direction to explore? I would like to know more about the latest techniques in RL and hopefully get an idea of what sort of an alien beast proved Eq. 39.
If data and current RL approaches are insufficient to cross the intuition chasm, can this be solved with something like continual learning? Could AI systems learn research intuitions through thousands of conversations with professional physicists or mathematicians, kind of like an apprentice learning on the job from their master?
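To make the ‘several rollouts plus a scoring model’ speculation above a bit more tangible, here is a toy sketch of that pattern. To be very clear, this is not a description of whatever OpenAI actually ran, about which I know nothing. The ‘model’ is just a random number generator nudging a number around, the ‘scoring model’ is a hand-written distance to a known target, and every name in it is hypothetical.

```python
import random


def propose_step(state, rng):
    """Hypothetical stand-in for one model rollout: propose a nearby candidate."""
    return state + rng.uniform(-1.0, 1.0)


def score(candidate, target):
    """Hypothetical stand-in for a scoring model: higher means closer to target."""
    return -abs(candidate - target)


def best_of_n_search(start, target, n_rollouts=8, n_rounds=20, seed=0):
    """Each round, sample several candidates and keep the best-scoring one."""
    rng = random.Random(seed)
    state = start
    for _ in range(n_rounds):
        candidates = [propose_step(state, rng) for _ in range(n_rollouts)]
        # Keep the current state as a candidate so the score never gets worse.
        candidates.append(state)
        state = max(candidates, key=lambda c: score(c, target))
    return state


result = best_of_n_search(start=0.0, target=3.0)
print(f"final state: {result:.3f}")
```

The interesting (and hard) part in real research is of course the scorer. Here it knows the answer in advance, whereas for an open conjecture it would need some verifiable signal, which is exactly where my questions about RL with Verifiable Rewards come in.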
Public discourse
One disappointing aspect of this whole case study is the absolutely tragic level of public discourse around the paper. I have read enough Twitter/LinkedIn takes and watched enough YouTube videos to notice certain clusters of people (ooops, here is that pattern-matching behaviour again):
the epistemically sound, well-meaning skeptics — it is with a large dose of relief that I admit these people exist and I very much appreciate their existence. Sadly, they are in a small minority, likely because it takes a lot of time and experience to formulate refined, nuanced views on any given topic.
the ‘this is not true creativity™!’ crowd — people that will endlessly shift the goalpost for what constitutes creativity or intelligence, so that they will never have to concede they are impressed. I would like to ask them to commit to a fixed definition — let’s see how long it takes before AI is able to satisfy it.
purists who are only interested in raw LLMs and for some reason treat tool usage as cheating — this is a really bizarre position to take. A large group of skeptics say that all of this could be achieved by humans using external tools like Python and Wolfram Mathematica, as if this is supposed to make LLMs look less impressive. For me, it actually makes them more impressive — they were able to match or even surpass what humans could have done, all while not using such tools. In fact, by sheer coincidence, a few days after the OpenAI paper on gluons was released, Stephen Wolfram of Wolfram Mathematica announced an MCP that connects the Wolfram language to LLMs. I also found out that a community-built Lean MCP exists. I think these are profound changes. LLMs will now be able to use hugely powerful theorem provers and computer algebra systems, which is going to massively extend their capabilities and contribute to reducing hallucinations in mathematics.
I have seen hardcore AI skeptics online say this constitutes ‘cheating’ and that it doesn’t change how (un)impressed they are. This belief is just bizarre. If anything, the story of human creation over the past century or so has been all about access to better tools and scaffolding. Better storage and retrieval of information, better connectivity, offloading computation to other programs, and so forth. Why are we not judging AI by the same standards? The only answer that comes to mind is that this stems from a deeply rooted human-centric bias and a sense of exceptionalism.
people misrepresenting the paper and what AI has done: I’m not saying this was done maliciously, but everybody should strive to maintain a high degree of epistemic integrity, and if they’re operating outside of their area of expertise, they should not be stating their claims with the usual degree of confidence. For this reason, I have not written a similar blogpost, or even a LinkedIn hot-take, on AI solving an Erdos problem or an Australian bloke using AI to come up with a cancer vaccine for his dog. I have insufficient skill to judge these stories.
Case study
To illustrate this more concretely, let us consider the following. A friend of mine sent me a link to this YouTube video asking if its author maybe has some good points that I agree with. Unfortunately, the video contains multiple flaws from groups 2, 3 and 4 above, with the author confidently downplaying the contributions of AI to the paper. Let’s zoom in on some specific issues:
“I take the time to actually read what’s happening”
When the gentleman says he takes the time to read what’s happening, in this case it is impossible, because — excuse my directness — he lacks the expertise to judge what’s really happening. What he means is that he read the short blogpost on OpenAI’s website. This blogpost is obviously a massively simplified version of the full paper, which I have now spent hours engaging with, and he has not.
“I’m not a physicist, I’m not a mathematician, but I can read”
Again, apologies for my rather direct language, but no, you can’t — you are not able to read particle physics papers. Just as I did not offer my takes on the Erdos problem or the cancer vaccine story, because I could not possibly have an informed opinion on these topics, the author does not have an informed opinion on the particle physics paper. Opinions are easy to have, but not all of them carry equal weight.
He points out that base models have not made much progress on the ARC-AGI prize, and that the improvements have instead come from test-time compute. So? Why does this matter? Your brain is not that different from Leonardo da Vinci’s brain — presumably, it’s slightly more capable due to better nutrition and so on, but that is a small difference compared to the one between humans and chimpanzees. Yet, despite all his creativity and talent, da Vinci couldn’t have dreamt of doing maths or physics that any undergrad can do in the 21st century. What has changed since then? We just have better ‘scaffolding’ — tools to retain and share information, teach us new content faster, a structured education system, etc. Why does it matter that our ‘base brains’ are more or less the same, if we can do so much more than da Vinci could?
The gentleman latches onto the quote that ‘dialogue between physicists and LLMs can generate fundamentally new knowledge’, saying that it’s ultimately the human that generated these discoveries, while AI simply automated the laborious part. First of all, no, this is incorrect — as a single example, the AI managed to identify a novel kinematic configuration of particle momenta, which the gentleman does not address, because it’s in the paper, not the blogpost written for the general audience. Second, there is a difference between a dialogue like ‘can you explain superconductivity to me’ and ‘here’s a conjecture that you yourself have come up with, and that nobody has previously proven, now please go ahead and find a proof’. Lastly, how is progress in science achieved? By individuals working alone for years? No, obviously it’s through a constant dialogue, e.g. between a PhD advisor and their student, between collaborators talking over Zoom, etc. This is how I see the relation between humans and AI in this paper — not as a dumb calculator that mindlessly simplifies equations using mechanical rules.
He says that ‘AI cannot in principle make new discoveries’, but provides no justification for this claim. (Perhaps he explains it in a different video that I have not seen, in which case I do apologise.)
Just to be clear, it is not my intention to pick on this person in particular — it’s just one out of multiple examples I have seen regarding this paper, let alone regarding general discussions on AI capabilities.
As the level of knowledge needed to judge AI accomplishments gets higher and higher, I suspect that such criticisms and misrepresentations will become stronger and more common. A part of this will be due to genuine, honest misunderstanding, a part due to not following AI capabilities closely, but I suspect another contributing factor will be the growing need to protect human exceptionalism and feel good about ourselves. This will be matched nicely by the continued jaggedness of frontier AI capabilities. How come AI is creating new theoretical physics, but it doesn’t know that if I want to wash my car at a carwash, I first need to drive it there? How come it is solving Erdos problems, but it can’t figure out how to drink from a cup flipped upside down?
Addendum — gravity
When I was wrapping up this article, I noticed that in the meantime, the same group of authors published another paper, this time deriving analogous results for gravitons, not gluons. This is a natural extension — once the proof has been constructed for gluons, often the next thing to do is to try to adapt it to the case of gravitational interactions. However, I want to be very clear that this is not an easy task. Gravity is infamously hard to work with (due to multiple profound reasons way beyond our scope here), so ‘adapting’ the proof is not a matter of just changing Greek letters in a few places. While the skeleton of the proof might be the same, how you get from A to B, then from B to C, and so forth, requires using a different set of theorems, tricks, consistency checks, etc. At every step, there are pitfalls awaiting that are not obvious to someone that is not well-versed in research on quantum gravity. For this reason, the second paper is much less understandable to me — I did not specialise in gravity amplitudes, only in gluon amplitudes.
Thankfully, this time the authors also released logs of the conversation with GPT-5.2-Pro, so I had a quick look at those. Interestingly, they again mention the new internal model, but do not say what its contribution to this paper was. The provided transcript seems to be from the standard GPT-5.2-Pro available to everyone through the chat interface. Annoyingly, the transcript shows only the final outputs, not the reasoning summaries visible in the chat interface. Still, what we can see is very interesting.
I was surprised by how minimal the prompts were. I was expecting a long back-and-forth with the model where physicists would strongly guide its intuition, correct its mistakes, explain some complex concepts, point it to the right proofs, theorems, etc. Instead, the conversation was very basic. It’s actually quite amusing to read the subsequent prompts. They provide basically no strong steering, only nudges to keep going. First, the authors dumped the full LaTeX source code of the ‘gluon paper’ (the one I discussed in this blogpost) alongside the following prompt:
Read and understand the following paper. The key manipulations occur in Appendices A and B, make sure you understand those to the maximum precision.
Then, they simply ask it to ‘generalize this paper to the gravity case’. They also provide two caveats that the model should pay attention to (some relations for gluons and gravitons are different, so the model needs to adapt to these new requirements).
These are all the remaining prompts in this multi-turn conversation:
‘go ahead’
‘yes calculate outside of SD first. This is the first step.’
‘yes please do that’
‘ok do the “next natural step” then’
‘use the matrix tree thm to simplify your expression on Rn’
‘is there any regime where we can further simplify Vbar... not only Rn but further assuming more conditions. For instance a particular ordering of the zbar etc...try to find a simplification and be super explicit. precision is key’
‘We have acquired a vast knowledge. Regenerate the entire paper of the first prompt into a new paper, changing gluon by gravitons’
‘For clarify:
- for simplification R_n we will use the new simplification R_{n,n-1} with the outlier
- For appendix B we will write the cayley tree identity and the unordered gravity form factor we have studied
- for requeriments, rather than color identities and soft theorem, we use the permutation symmetry and the leading and subleading graviton theorem in this language
- introduction can be left for later. Just write few short paragraphs’
‘output the latex code. The document needs to be a paper, no informal comments are appropriate. If you are not sure about a step you can use \textcolor{red}. Also the soft factor has a factor of 1/2’
‘it must mimik the gluon paper as close as possible, same notation, algebra etc.’
‘can we bypass the discussion of R_n and go straight to R_{n,n-1}? That is, merge the arguments given to provide the minimal description and derivation of the amplitude in R_{n,n-1}. Also give the explicit example of M_{12345} evaluated in R_{5,4} as a motivation. Respond with the whole latex doc’
It sounds like one of those prompts of the sort ‘sounds good Chat, now refactor my codebase. Make no mistakes and do not hallucinate’ — except that it actually works. There is basically no strong steering of the model.
We have to bear in mind that in this conversation transcript, GPT is not operating from scratch, in the sense that it has the full context of the previous paper about gluon amplitudes and it’s ‘merely’ adapting it to gravity. Does this make it any less impressive? No, it doesn’t. The inferences it has to make, the adaptations and novel reasoning are all very impressive. Yet again, it was the AI — not the human — who came up with a special kinematic region where formulas simplify dramatically (see Eq. 18 in the paper and p. 80 in the transcript). By the authors’ own admission, it was AI that generated the core ideas for this paper. I also assume that the new, internally scaffolded model was responsible for much (if not all) of Appendices B and C.
In terms of the bigger picture, I find the timelines of these two publications very telling. The original paper about gluon amplitudes came out on arXiv on February 12. The subsequent transcript of the conversation with GPT-5.2-Pro about graviton amplitudes shows the date as February 17 [10]. The corresponding paper came out on March 4, that is 15 days later. People coming from an ML background might be surprised that progress in theoretical particle physics can be glacially slow, with researchers often releasing no more than 3 papers a year. Even for an extension following from the paper on gluon amplitudes, 15 days from the original insight to publication is insanely quick. Without AI, I have no doubt that it would have taken at best a few weeks, if not months, to come up with the same ideas and proofs.
Such a speedup is only possible if AI is doing most of the heavy lifting. Indeed, this quote from Alex Lupsasca is quite telling:
This is strong evidence that AI can push the frontier of theoretical physics—and, even more importantly, that it compresses the discovery cycle by shifting effort toward checking and exposition!
Conclusions
What started with a knee-jerk click on a seemingly clickbaity title ended up with me going down the rabbit hole of recent AI-driven advances in science and pondering the nature of creativity for hours on end. I’ve always had some kind of an intrinsic aversion to the notion of human exceptionality, so seeing AI operate at the boundaries of human understanding is not that shocking to me. And while I wouldn’t bet money on it fully replicating human ‘insight’ — however we define this term — I also won’t be surprised if this happens sooner than we expect.
What was a big shock, however, was to see these changes start to unfold so quickly. At a time when online discourse seems to revolve around how businesses are struggling with AI adoption and how the bubble is about to pop, witnessing AI tackle the cutting-edge of my own PhD is quite unsettling. It feels like we are about to reach the same threshold as the one coding agents crossed somewhere in 2025. Codex and Claude Code are still not perfect, overcomplicate code, reward hack, and so on, but clearly they have what it takes to overtake some of the best software engineers in the world. I expect the same to happen in theoretical physics or maths — I just didn’t expect it to arrive so soon. And just like coding agents still need to be accompanied by robust unit tests, PR reviews and a skilled human, some physicists will quickly figure out how to integrate AI into their workflows and provide this structure for their purposes.
It’s hard or impossible to predict when, or if, this trend will end. I echo the sentiment of Bartosz Naskręcki, one of the contributors to Epoch’s FrontierMath benchmark, who recently experienced his own ‘Move 37’ when one of his maths problems was solved by GPT-5.4:
There is still a huge pool of questions and ideas that remain out of reach for current models. We do not know how things will scale, but the current progress goes far beyond my expectations from last year.
At the same time, as I argued above, it will get increasingly difficult to even assess whether the trend is continuing. Fewer and fewer people will be able to give informed takes, so the proportion of opinions worth listening to will asymptotically tend to 0. We will also keep getting used to these changes and they will become normalised. What any chatbot can do now would have seemed like magic in 2020 and the same will be true of systems we will create in 2030. But we will not be that impressed with them — we will think it obvious that LLMs are just pattern-matching better than any PhD student on the planet. And I’m sure we will find plenty of reasons to downplay the news if, within a decade or two, we see AI solve one of the Millennium problems — a recent survey by the Forecasting Research Institute puts the median at just under 2040, while prediction markets indicate <2035.
The wheels of AI in maths and physics research have been well and truly set in motion, and they are spinning faster than I expected just a few months ago. I thought I would see the sort of results I described here on a >3 year time scale. Among professional researchers, the gap between those who embrace AI and those who shun it might be very significant, as many will still see it as a ‘glorified calculator’ with which they do not wish to pollute their work. But hey, maybe Don Knuth can convince them?
Appendix
I was curious whether the initial human realisation in the paper (Section I.A) — that single-minus amplitudes are not 0 in some kinematics — could also be made independently by AI, so I tried it with Opus 4.6 (Extended Thinking) through the chat interface [11]. I used three different prompts which differed in how much I hinted at the correct direction to explore. I also switched web search off so that there is no risk of the model finding the OpenAI paper. I tried each prompt 3-4 times to see if there’s much variation — the responses were very similar across re-runs (almost identical reasoning traces and conclusions) [12]. Here are the prompts:
(1) An innocent question:
"In theoretical particle physics, there is a famous fact that single-minus tree-level gluon amplitudes vanish. The standard proof relies on a power-counting argument: one needs to consider how polarisation vectors are contracted with the momenta in the numerators of trivalent tree-level Feynman diagrams.
What do you think about this argument? Does it always hold or are there any exceptions where it might be invalid?"
(2) An admission that the argument is flawed, but no details:
"(first paragraph as above)
Recently, a flaw was discovered in this argument. Please find it without looking through the internet."
(3) An admission that the argument is flawed + a hint:
"(first paragraph as above)
It was recently discovered by a group of theoretical physicists that this argument is actually incorrect in a specific kinematic regime. I want you to identify this regime without looking through the internet for a confirmation. Please specify exactly what the kinematic configuration is and in what spacetime signature it lives, in the format (x, y) (for example, Minkowski is (1,3))."
In all cases, I looked through the provided summaries of the reasoning traces, not just the final outputs.
With prompt (1), there was no sign of the LLM going in the direction consistent with the paper. It repeatedly told me that these amplitudes are not zero at ‘loop-level’ (when Feynman diagrams have closed loops), and that the proof I mentioned is not very elegant (and it recalled more sophisticated proofs) [13]. This is correct, but there was no sign of it considering some non-standard kinematic configuration, which was the insight that human physicists had in the paper. Overall, it pretty much behaved like I would if I was given this exact prompt. It didn’t question the fact itself, it just recalled some standard textbook caveats it saw during training.
With prompt (2), it consistently spent significantly more time reasoning than with the other two prompts. This is not surprising, because I told it that a loophole has been found, without specifying it exactly, so it worked hard to identify it. While it did not succeed, I was genuinely impressed by the thoroughness and breadth of arguments it explored in the reasoning trace [14]. It converged on a conclusion similar to the one in (1), but it did a better job than I would have done in a day or more.
Prompt (3) is where it gets interesting. Across all runs, the model immediately picks up the right trace (literally and metaphorically…) and starts questioning the relevant assumptions. Very quickly, it spots the flaw in the proof and identifies the same special kinematic configuration as the authors. Its logic is exactly as presented by humans in Section I.A.
It’s hard to convey to a non-expert how strong/weak of a hint prompt (3) included. Yes, it’s true that I told it the issue with the proof lies in a specific kinematic configuration and that’s what it should explore. However, realising that this configuration is supported in a different type of spacetime (the ‘Klein signature’), and that this means the amplitude will be non-zero, is definitely not trivial. Honestly speaking, if I was given the same prompt, I am <50% confident I would be able to spot the issue, even back in my PhD when I was working with scattering amplitudes every day. Admittedly, I wasn’t familiar with this non-standard ‘Klein signature’, but even if I was, connecting all the pieces together to find this flaw in the argument is still not easy. It requires the sort of subtle intuition that comes with years of experience and that is pretty much impossible to explain to non-physicists. In any case, even with this hint in the prompt, it is far from ‘ah yes, of course the argument is flawed in this way’.
Where it gets even more interesting is how such anomalies between theory and experiment get explained away. They are often incorporated as extensions to an existing theory, before this theory becomes stretched beyond any reasonable limit and a whole new paradigm is needed. For example, how far could you modify and extend Newtonian mechanics before you just need to give up and accept that these pesky electrons are better described via quantum mechanics? This is essentially the Kuhnian Structure of Scientific Revolutions.
Sometimes, these legs even get amputated! Who knew particle physicists could be so brutal? See p. 71 here.
Computing expressions corresponding to diagrams with one or more loops gets complicated extremely quickly. If memory serves me right, for this paper we had to consider 2.7 million Feynman diagrams with two loops.
No, this does not literally refer to colours like the colour of your T-shirt. Particle physics has an age-old tradition of using common words in totally unrelated contexts.
The first and only time in my life I felt like I understood spin was after taking a super theoretical graduate-level course on group theory and particle physics. Most undergrad courses on quantum mechanics just present spin as a ‘revealed truth’, without giving you a deeper explanation. This course showed us how spin just naturally emerges out of seemingly unrelated maths. The connections between pure maths and particle physics are absolutely astounding and they are the reason why theoretical physics is beautiful.
The argument only works at tree-level, i.e. for the simplest Feynman diagrams in the amplitude computation — the ones that have no loops in them. If you calculate the expressions corresponding to single-minus terms with one loop, you get a non-zero result (though, unusually, these expressions do not diverge to infinity, which is another hint of simplicity of the single-minus helicity configuration).
If you’re coming from an AI/NLP background, the closest analogy I can think of is the recursive process of building a BPE tokeniser. It’s kind of like that.
In particle physics, it’s not uncommon to work with equations that, when saved to a computer file, take up tens or hundreds of gigabytes. I’m not talking about numerical results — literally the symbolic equations that need to be manipulated.
For those of you more mathematically inclined, think of how a classic mistake in high-school maths is to arrive at a stage in your proof where you’ve shown that 0=0.
How do I know this? The header of the transcript PDF shows 2/28/26, but this is the date the PDF was generated. On the other hand, the footer shows a link to the original conversation:
https://chatgpt.com/g/g-p-697aaab1cdac819192a12c02d13cc337-gr-singmin/c/69940a2f-e574-8327-bfc9-0cb2620ae75c
The link is dead. However, when you create a new ChatGPT conversation, the first part of the URL after /c/ is the Unix timestamp in hex. Here, we have 69940a2f, which is 1771309615 in Unix time, i.e. Tuesday, 17 February 2026 at 06:26:55 GMT.
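For anyone who wants to reproduce this decoding, here is a minimal Python sketch. It rests on the same assumption as the footnote, namely that the first dash-separated chunk of the conversation ID is a creation time in seconds since the Unix epoch, written in hex; this is an observation about the URLs, not a documented guarantee. The helper name `conversation_start` is hypothetical, and the year 2026 for the March 4 publication date is assumed from context.

```python
from datetime import date, datetime, timezone


def conversation_start(url: str) -> datetime:
    """Decode the hex prefix of the conversation ID as a Unix timestamp (UTC)."""
    # The conversation ID is whatever follows "/c/" in the URL.
    conversation_id = url.rstrip("/").split("/c/")[1]
    # Assumption (from the footnote): the first dash-separated chunk is the
    # creation time in seconds since the epoch, written in hexadecimal.
    hex_prefix = conversation_id.split("-")[0]
    return datetime.fromtimestamp(int(hex_prefix, 16), tz=timezone.utc)


URL = (
    "https://chatgpt.com/g/g-p-697aaab1cdac819192a12c02d13cc337-gr-singmin"
    "/c/69940a2f-e574-8327-bfc9-0cb2620ae75c"
)

started = conversation_start(URL)
print(started.isoformat())  # 2026-02-17T06:26:55+00:00

# Gap between the conversation and the paper's release date quoted in the
# main text (March 4; the year 2026 is assumed here).
days_to_paper = (date(2026, 3, 4) - started.date()).days
print(days_to_paper)  # 15
```

The second printed value matches the 15 days between the conversation and the graviton paper mentioned in the main text.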
The paper of course used GPT-5.2 Pro, which I don’t have access to and I don’t feel like paying for a subscription (for reasons obvious to anyone reading this in March 2026).
Sadly, I’m not able to share these chats as I have access to Opus 4.6 through a Team plan, so I can’t share the logs outside of my organisation.
Just to be 100% clear, the proofs it recalled are more sophisticated, relying on concepts like ‘BCFW relations’ and ‘supersymmetric Ward identities’, but these are also familiar to people specialising in scattering amplitudes. So it definitely didn’t come up with these proofs on its own.
For those of you in the know, it considered: BCJ relations, BCFW recursion, ghosts, subtleties of dimensional regularisation and gauges other than the Feynman gauge.


This is great Jakub, thanks for writing!
Nerdsnipe warning on the below.
I'd be super interested in a framework that domain experts can use to more quickly review whether AI results in a particular domain or paper are impressive (and, by extension, the sort of information paper writers should aim to share if they want their paper to serve as evidence that a model has done something genuinely impressive).
I think you do that well here: for example, taking into account the difficulty of the topic, or arguing that it's useful for paper writers to share prompts they used as part of the appendix.
I think a framework or checklist (similar to STREAM https://arxiv.org/pdf/2508.09853) could make these ideas practicable for paper writers in ways that improve the epistemics of this field in ways that could be useful.
Best case scenario, these checklists become useful as we look for better ways of evaluating whether AI systems are pushing forward the frontier of knowledge (a la Epoch), for giving feedback during training, for assessing sandbagging on safety research, etc. That seems like it could be quite important to me.
(That said, I'm hesitant about advocating for this TOO hard because it doesn't seem obvious to me that this would scale to that, or how useful these frameworks alone would be. Maybe bad epistemics here... just mean that academics and random folk underinvest in AI? If that's all, the stakes don't seem too high. But I'm probably missing stuff!)
Wow, this was a long read. Thanks for spending the time to write it up on top of the time spent reviewing the actual paper.
It’s great seeing another particle physicist around here. And I liked the comparison of modern students with Da Vinci - that’s a good argument.