Discussion about this post

Edward Kembery

This is great Jakub, thanks for writing!

Nerdsnipe warning on the below.

I'd be super interested in a framework that domain experts can use to more quickly review whether AI results in a particular domain or paper are impressive (and, by extension, the sort of information paper writers should aim to share if they want their paper to serve as evidence that a model has done something genuinely impressive).

I think you do that well here: for example, taking into account the difficulty of the topic, or arguing that it's useful for paper writers to share prompts they used as part of the appendix.

I think a framework or checklist (similar to STREAM https://arxiv.org/pdf/2508.09853) could make these ideas practicable for paper writers and, in doing so, improve the epistemics of this field.

Best case scenario, these checklists become useful as we look for better ways of evaluating whether AI systems are pushing forward the frontier of knowledge (a la Epoch), for giving feedback during training, for assessing sandbagging on safety research, etc. That seems like it could be quite important to me.

(That said, I'm hesitant about advocating for this TOO hard because it doesn't seem obvious to me that this would scale to that, or how useful these frameworks alone would be. Maybe bad epistemics here... just mean that academics and random folk underinvest in AI? If that's all, the stakes don't seem too high. But I'm probably missing stuff!)

Nazar Bartosik

Wow, this was a long read. Thanks for spending the time to write it up on top of the time spent reviewing the actual paper.

It’s great seeing another particle physicist around here. And I liked the comparison of modern students with Da Vinci - that’s a good argument.
