Academic Highlights from NeurIPS 2025, San Diego

By Christopher William Driggers-Ellis on Dec 17, 2025
A picture of the Marriott Marquis hotel behind the San Diego Convention Center. The latter was the venue for NeurIPS 2025.

In this post, I will give a day-by-day summary of academic highlights from NeurIPS 2025. There were hundreds of presentations and thousands of posters given in San Diego between Dec. 1 and 8, and I was not present in Mexico City for the work presented there. The papers summarized in this post are by no means a comprehensive list of the papers and posters published at NeurIPS this year. I will provide highlights of the oral sessions I attended and list the best posters from the poster sessions I went to. The latter were too numerous to describe individually, so they are given in tables.

Tuesday, Dec. 2

I arrived in San Diego the night of Dec. 1 and checked into my hotel. Dec. 2 marked the beginning of the conference with a day of industry presentations and expos.

Lambda Expo Workshop: Multimodal Superintelligence Workshop

The speaker began by introducing the three horsemen of the multimodal apocalypse:

  • Data: Data is balkanized. Over 500 multimodal (MM) datasets exist, but there are no standards.
  • Compute: Compute is not always available. For those who always have a cluster, or can keep the same cluster ad infinitum, that is a great privilege.
  • Execution: The human capital can be lacking. Practitioners do not always follow best practices and are not omnipotent. Many machine learning researchers do not work with multimodal data at all. They also all have to get on the same page.

Lambda's LAILA workflow is about taming all three horsemen at once, removing these problems.

Jason Zheng, Stanford | Lambda

Zheng described how to unify data. The process begins with the data as they are now, distributed across numerous silos such as GPU VRAM, RAM, and different storage systems.

To speed up retrieval of data in these numerous locales, LAILA tracks them with unique resource identifiers. Further, to address the problem of non-standard dataset formats, which often block generalization from one benchmark to inference on real-world data, LAILA introduces a universal dataset format.

Unification through resource location speeds up training and inference by increasing training throughput. The universal dataset format saves time by removing the need for preprocessing before training and/or inference on new data. Further, the format accommodates all (popular) modalities.
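Lambda showed no code for this, but the general idea is easy to sketch: a registry resolves a unique resource identifier to a record in one canonical, modality-agnostic format regardless of where the underlying bytes live. The class and field names below are entirely my own invention, not LAILA's.

```python
from dataclasses import dataclass

@dataclass
class UnifiedSample:
    """Hypothetical 'universal format' record: one optional field per modality."""
    uri: str                       # unique resource identifier for the sample
    text: str | None = None
    image_path: str | None = None
    audio_path: str | None = None

class DatasetRegistry:
    """Toy registry: maps URIs to records so retrieval is a single lookup."""
    def __init__(self):
        self._index: dict[str, UnifiedSample] = {}

    def register(self, sample: UnifiedSample) -> None:
        self._index[sample.uri] = sample

    def fetch(self, uri: str) -> UnifiedSample:
        return self._index[uri]

registry = DatasetRegistry()
registry.register(UnifiedSample(uri="laila://corpus-a/0001",
                                text="a dog barks",
                                audio_path="/mnt/a/0001.wav"))
print(registry.fetch("laila://corpus-a/0001").text)
```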

Jessica Nicholson, University of Bath | Lambda

Zheng then moved to the Hybrid Execution Engine but merely promised more information about it later, and handed things over to Jessica Nicholson to explain Unified Compute and its inner workings.

She discussed Neural Policy Orchestration. The concept of orchestrating a neural policy unifies the training objective, the compute available, and the universe of discourse of your dataset.

Jessica advertised that more information about Lambda and LAILA would be available at Booth #713, and I looked forward to seeing them there.

Multimodal Foundation Models - Amir Zadeh, Carnegie Mellon University | Lambda

Amir Zadeh, the first author listed on the expo's page in the NeurIPS schedule, took the stage and showed the classic video of the McGurk effect, by which a listener's perception of a spoken syllable is shaped not only by the sound but by subconsciously reading the lips of the speaker. Two videos of the same man speaking are presented. The man's lips say different syllables in each, but the audio is exactly the same. 70-80% of people, Zadeh explained, change their judgment of the syllable according to which video is shown to them.

The point of this was to demonstrate that multimodal reasoning is pervasive in natural neural networks (our brains). Zadeh remarked that unimodal reasoning is actually rare in animals and people. The core function of multimodal reasoning, Zadeh continued, is to predict any modality from any other.

He introduced Massively Multimodal Modeling (4M). This out-of-the-box multimodal model receives data in one of a set of modalities and predicts what the input in the other modalities would be for the same multimodal sample. For this reason, says Zadeh, the model is also inherently multitask, and he showed several examples of this any-to-any inference and generation.

Zadeh presented a matrix of input modalities against outputs in the other modalities studied, including text, visible-light images, audio, and various other imaging methods. 4M is able to perform any-to-any MM generation and achieves SOTA in various tasks compared to unitask models, showing that 4M leverages multi-task MM learning without losing fidelity in any one task.

4M features a redesigned architecture and is scaled up to support tens of modalities, with data and model size on the billions scale. Naturally, this implies training lengths in the trillions of tokens.

Expo Talk Panel: Agentic AI/RL, Cloud Native & Pytorch

Just as I sat down to attend the panel, Davide Testuggine (SE, Meta) was finishing his presentation titled PyTorch-Native RL & Agentic Development At Scale.

Everyone clapped, and Aksel Joonas Reedi of Hugging Face took the stage to give his talk: RL Environments are Forever.

Reedi introduced RL environments, which he says are a new frontier in post-training. However, they face two problems:

  • The training is costly.
  • When problems are too easy, there is not enough signal, and when they are too hard, the agents receive too poor a reward to improve their performance.

DeepSeek-v3.2 is a triumph of RL environment training. It boasts a new SOTA for performance and used 1800 such environments for post-training. The downside is that these environments were scattered throughout GitHub and had to be scrounged up.

The subject of this talk is a collaboration between Meta and HF called OpenEnv, which standardizes RL environment training through an end-to-end framework. OpenEnv not only standardizes all stages of the RL environment training pipeline, unifying the training process for all models, but it also integrates with existing HF APIs.

Environments are hosted live through the HF API, just like many transformer models. Environments can be created and deployed to the Hugging Face Hub with the new openenv executable.

OpenEnv and Reinforcement Learning

Daniel Han and Sanyam Bhutani described the results of a hackathon hosted to demonstrate reinforcement learning for agents using the new openenv project.

In this hackathon, LLMs were trained through reinforcement learning to play chess, Wordle, Mario, Pac-Man, and even Pokemon Showdown.

Bhutani described in this talk how to build one's own environment for a similar project, which is a very simple endeavor, requiring just one line of code to initialize the RL environment: openenv init my_env.

The strength of openenv is the universal, type-safe interface it provides for RL training:

  • reset: starts a new episode
  • step: executes an action and returns observations
  • state: grabs episode metadata

For example, RL training for the game of Connect4 has the following interface (a minimal sketch follows the list):

  • reset: take the chips out of the board
  • step: make a move
  • state: get the arrangement of chips on the board
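To make the shape of that interface concrete, here is a toy Connect4 environment with reset/step/state methods. This is my own illustration of the pattern described in the talk, not the actual OpenEnv API; class and method details are assumptions.

```python
import numpy as np

class Connect4Env:
    """Toy reset/step/state environment in the style described in the talk."""

    def __init__(self):
        self.board = np.zeros((6, 7), dtype=int)   # 0 empty, +1 / -1 for the two players
        self.player = 1
        self.turn = 0

    def reset(self):
        """Start a new episode: take the chips out of the board."""
        self.board[:] = 0
        self.player, self.turn = 1, 0
        return self.board.copy()                   # initial observation

    def step(self, column: int):
        """Execute an action (drop a chip) and return the new observation.
        Assumes the chosen column is not already full."""
        row = int(np.where(self.board[:, column] == 0)[0].max())
        self.board[row, column] = self.player
        self.player, self.turn = -self.player, self.turn + 1
        return self.board.copy()

    def state(self):
        """Grab episode metadata."""
        return {"turn": self.turn, "to_move": self.player}

env = Connect4Env()
obs = env.reset()
obs = env.step(3)
print(env.state())
```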

Han took over, transitioning to a more detailed example of RL for the game 2048, which had a live notebook available through Google Colab.

Sharing his screen, Han walked through the RL environment with the notebook on display. The notebook, he explained, also contained pointers on how to design RL training objectives that avoid degenerate strategies.

Han also told the audience that just one prompt, fed to the model many, many times, is sufficient to perform RL once an environment is in place, and that this is true not just of simple games like Connect4 and 2048 but of complicated ones like Pokemon Showdown. The model follows the reward function, and if it is well made, the rest falls into place.
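Han did not show the training loop itself, so the following is a deliberately tiny sketch of the pattern he described: one fixed prompt, sampled many times, with the reward function doing all the shaping. The softmax "policy" and the reward here are stand-ins of my own, not openenv's API or the hackathon code.

```python
import math
import random

ACTIONS = ["up", "down", "left", "right"]
logits = {a: 0.0 for a in ACTIONS}                  # tiny policy: one logit per action

def sample_action():
    weights = [math.exp(logits[a]) for a in ACTIONS]
    return random.choices(ACTIONS, weights=weights)[0]

def reward(action):
    return 1.0 if action == "left" else 0.0         # pretend the environment rewards "left"

for _ in range(500):                                # the same prompt, fed again and again
    batch = []
    for _ in range(32):
        action = sample_action()
        batch.append((action, reward(action)))
    baseline = sum(r for _, r in batch) / len(batch)
    for action, r in batch:                         # crude REINFORCE-style bandit update
        logits[action] += 0.1 * (r - baseline)

print(max(logits, key=logits.get))                  # converges to "left"
```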

Wednesday, Dec. 3

Keynote - Rich Sutton, University of Alberta & Openmind Research Institute

Sutton says that, while ML is wildly successful, both economically and scientifically, we have in many ways lost sight of the goal. In this talk, he introduces his vision for superintelligence from experience with the titular OaK architecture.

Outline:

  • Sutton's Quest for AI-agents that respect the bitter lesson.
  • Everything must be done at runtime in big worlds
  • OaK architecture.

Sutton's Quest

Sutton's quest is to design an AI agent that is

  • Generalizable
  • Experiential
  • Able to discover novel abstractions from inference
  • Without bitterness, meaning that nothing domain-dependent is built-in.

The bitter lesson, as Sutton calls it, is the revelation that AI should not depend on developer knowledge. It is the culmination of historical observations:

  1. AI developers try to build knowledge into agents.
  2. This helps in the short term.
  3. In the long term, it plateaus the performance of the model.
  4. Breakthroughs eventually arrive, masking the problem.

Big Worlds

The Big World Perspective is something that Sutton has acquired from experience with agents: the space of the problem is inherently larger and more complex than the framework used to model the agent.

This implies that whatever the agent learns cannot be a complete representation of the problem space but, rather, an approximation. This makes the problem space (world) appear to be moving.

Sutton says that inaccuracies in the design-time understanding of the problem space necessitate run-time learning and abstractions, since the problem is misunderstood, or only approximately understood, at any given time.

OaK Architecture

It is with the bitter lesson and the big world perspective in mind that Sutton introduces the OaK architecture. He began by stating that there is substantial consensus across disciplines on the mechanics of intelligent decision making. Different fields, he claims, merely attach different names to the same decision-making processes. He said this over a table of near-synonymous terms from different scientific and intellectual disciplines.

Providing examples from the table, he used this consensus as justification for a consensus model of intelligent agents:

  • Perception: Observations produce state-feature vectors
  • Reactive Policy: Produces actions appropriate for the state
  • Value Function: Calculates reward a la RL
  • Transition model: Predicts the consequences of alternate actions.

Previous approaches focus on learning what to do at the next step, at each step. Meanwhile, Sutton's Options and Knowledge (OaK) architecture compares alternative options. Options are policy and termination condition pairs $(\pi, \gamma)$. In the OaK architecture, agents compare many such options and learn based on the value of following these options to their termination conditions.

Instead of learning what to do at individual steps, the agent learns a high-level transition model that creates plans for multi-step state transitions.

OaK inherently adds auxiliary subproblems to a known problem and builds up from these to learn the overall task. Features learned by OaK define subproblems that produce options and inform the transition model.
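Sutton showed no code, but the option abstraction itself is compact. Below is my own rendering of an option as a policy paired with a termination condition, with termination modeled as a deterministic predicate for simplicity; it is an illustration of the idea, not OaK's implementation.

```python
from dataclasses import dataclass
from typing import Callable

State, Action = int, int                      # placeholder types for illustration

@dataclass
class Option:
    """An option: a policy pi together with a termination condition gamma."""
    pi: Callable[[State], Action]             # what to do in each state
    gamma: Callable[[State], bool]            # when to stop following the option

def run_option(option: Option, state: State,
               step: Callable[[State, Action], State]) -> State:
    """Follow the option's policy until its termination condition fires."""
    while not option.gamma(state):
        state = step(state, option.pi(state))
    return state

# Toy usage: an option that walks a counter up to 10, in a world where
# `step` simply adds the chosen action to the state.
walk_to_ten = Option(pi=lambda s: 1, gamma=lambda s: s >= 10)
print(run_option(walk_to_ten, 0, step=lambda s, a: s + a))   # -> 10
```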

Subproblem analysis in AI/RL is not a new concept. Sutton produced citations going back to curiosity in RL (1991) showing that researchers have looked at subproblems for a long time. However, OaK sets itself apart by answering four open questions with respect to subproblems in RL:

  1. Q: What should the subproblems be? - A: Different for different tasks, but they have a unified representation in OaK through the abstraction of options.
  2. Q: Where should they come from? - A: Agents must create them. They cannot all be pre-programmed or anticipated by designers.
  3. Q: Can the agent generate its own subproblems? - A: Yes. OaK creates a subproblem for each feature $i$, together with an associated reward weight $\kappa_i$.
  4. Q: How do the subproblems help the main problem? - A: OaK leverages the option abstraction to solve subproblems, learning large, multi-step transitions through the transition model that maximize the features associated with each subproblem, and weighing the importance of each subproblem by its $\kappa$ value.

While a lofty ambition, OaK has not yet been completely implemented. Sutton explains that several issues with different steps in the pipeline are holding it back from complete implementation. Many steps rely on continual deep learning, which has not yet been achieved in the RL literature, while others have gaps in the design even though the math behind them remains sound. For now, while not yet practicable, the theory of the OaK architecture remains promising.

Oral 1C Theory 1 - Optimal Mistake Bounds for Transductive Online Learning - Zachary Chase, Kent State & Jonathan Shafer, MIT.

Theory:

  • Q: Is unlabeled data useful for classification?
  • A: In PAC learning, No

But with online learning, it might be.

Transductive Online Learning

For some domain $\mathcal{X}$ and hypothesis class $\mathcal{H} \subseteq \{0,1\}^{\mathcal{X}}$, we can describe online learning as a zero-sum game between a Learner and an Adversary.

In each trial, the learner predicts a label, the adversary chooses the true label, and the two are compared; the learner's goal is to make as few mistakes as possible.

In transductive online learning, all trials are known to the adversary, and the adversary must choose all of the labels beforehand. The learner is given more information, namely the full sequence of unlabeled trials from the beginning.

Chase et al. (2025) show that, for every hypothesis class, the transductive online mistake bound grows at least as $\Omega(\sqrt{\cdot})$ of the standard mistake bound, and that there exists some $\mathcal{H}$ for which the transductive online mistake bound is also $O(\sqrt{\cdot})$ of the standard mistake bound, so the square-root relationship is tight.
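In symbols (my notation: $M(\mathcal{H})$ for the standard online mistake bound and $M^{\mathrm{trans}}(\mathcal{H})$ for the transductive one), the two results read roughly as:

$$M^{\mathrm{trans}}(\mathcal{H}) \;=\; \Omega\!\left(\sqrt{M(\mathcal{H})}\right) \quad \text{for every } \mathcal{H}, \qquad M^{\mathrm{trans}}(\mathcal{H}) \;=\; O\!\left(\sqrt{M(\mathcal{H})}\right) \quad \text{for some } \mathcal{H}.$$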

Outline of Proof

Proof of the lower ($\Omega$) bound follows from a greedy algorithm and the corresponding proof of its minimal error.

The upper ($O$) bound for some $\mathcal{H}$ among the hypothesis classes follows from the idea that sparse codewords are easy to guess. In other words, every time the transductive online learner makes a mistake at a single trial $x_t$, it gains a substantial amount of information about the whole sequence of trials $\{x_i\}_{i=1}^{n}$. For example, knowing that a bit sequence is one-hot encoded would permit the learner to make only a single mistake on the sequence by guessing 0 at every step: the learner would only be wrong on the one bit that is not zero.

It was revealed at the end of the talk that their paper won runner-up for best paper in the track.

High Dimensional Calibration From Swap Regret - Maxwell Fishelson, MIT

Fishelson began the presentation with the following thought experiment. Imagine that you must predict the likelihood of rain each day. The goal of the forecaster is to produce a distribution of forecasts that matches the empirical frequency of the realized outcomes. Complicating matters, we suppose that the outcome, whether it rains each day, is chosen by an adversary. At the outset, Fishelson notes, the task is impossible to do perfectly, since the adversary is capable of choosing the opposite outcome at each step.

This thought experiment models single-class calibration, where for some convex subspace $\mathcal{P} \subseteq \mathbb{R}^d$, the forecaster chooses positions in $\mathcal{P}$ and the calibration error is the one-norm distance between the forecaster's prediction and wherever the adversary places the true value in $\mathcal{P}$.

He then moved to multi-class calibration, wherein the forecaster must assign probabilities to many different possible outcomes rather than one of two. For such multiclass problems, we ask for calibration error less than some $\epsilon$ times $T$ and prove bounds on $T$.

Fishelson et al. (2025) propose TreeCal, a calibration error optimizer that generalizes to all $\mathcal{P}$ and error norms, with bounds $\epsilon T \ge \mathrm{Cal}^{\|\cdot\|^2}_T$, where $\mathrm{Cal}^{\|\cdot\|^2}_T$ is the calibration error, and $T \le \left(\mathrm{diam}(\mathcal{P})/\sqrt{\epsilon}\right)^{O\left(\mathrm{Rate}(\mathcal{P},\,\|\cdot\|)/\epsilon^2\right)}$.

TreeCal

Consider a tree over a horizon of $H^L$ timesteps. TreeCal splits the tree into $L$ layers, with intervals of size $H^{(L-i)}$ at each layer $i \in [1, L]$. TreeCal produces labels for forecasts as the set of layer intervals that contain the forecast.

The error is characterized with a calibration error similar to a proper scoring rule, which measures distance from a curve using a Bregman divergence. A Bregman divergence is calculated for each layer of the tree, and the sum of the divergences forms a telescoping sum down its layers. This removes all divergences except for the top layer's.

With the assumption that there is a regularization function $R$ that is strongly convex and $\rho$-Lipschitz, Fishelson et al. (2025) can prove their bounds on $T$. Note, however, that this does not give insight into $R$ itself.

Poster Session 1: 11am - 2pm

ID | Title | Link
806 | NestedFP: High-Performance, Memory-Efficient Dual-Precision Floating Point Support for LLMs | https://arxiv.org/abs/2506.02024
2112 | Caption This, Reason That: VLMs Caught in the Middle | https://arxiv.org/abs/2505.21538
2204 | PINN Balls: Scaling Second-Order Methods for PINNs with Domain Decomposition and Adaptive Sampling | https://arxiv.org/abs/2510.21262
24xx | Structural Causal Bandits under Markov Equivalence | https://openreview.net/pdf?id=IL1wvzOgqD
2513 | Structural Causal Bandits Under Markov Equivalence | https://openreview.net/forum?id=3aFwsZxM5H
27xx | Connecting Jensen–Shannon and Kullback–Leibler Divergences: A New Bound for Representation Learning | https://arxiv.org/abs/2510.20644
2706 | Fantastic Features and Where to Find Them: A Probing Method to Combine Features from Multiple Foundation Models | https://bramtoula.github.io/combo/
2714 | ProteinConformers: Benchmark Dataset for Simulating Protein Conformational Landscape Diversity and Plausibility | https://openreview.net/forum?id=GClrNUTqly
2913 | The Complexity of Symmetric Equilibria in Min-Max Optimization and Team Zero-Sum Games | https://arxiv.org/abs/2502.08519
3011 | Optimality and NP-Hardness of Transformers in Learning Markovian Dynamical Functions | https://arxiv.org/abs/2510.18638
3513 | Integration Matters for Learning PDEs with Backwards SDEs | https://neurips.cc/virtual/2025/loc/san-diego/poster/116412 (SPOTLIGHT)
4004 | Token embeddings violate the manifold hypothesis | https://arxiv.org/abs/2504.01002
4207 | Set-LLM: A Permutation-Invariant LLM | https://arxiv.org/abs/2505.15433
4507 | ML4CO-Bench-101 | https://github.com/Thinklab-SJTU/ML4CO-Bench-101
5003 | End-to-End Vision Tokenizer Tuning | https://arxiv.org/abs/2505.10562
5105 | DitHub: A Modular Framework for Incremental Open-Vocabulary Object Detection | https://aimagelab.github.io/DitHub/
5314 | WearVQA: A Visual Question Answering Benchmark for Wearables in Egocentric Authentic Real-world Scenarios | https://openreview.net/forum?id=s5p9ByKN1j

Poster Session 2: 4:30pm - 7:30pm

ID | Title | Link
108 | CovMatch: Cross-Covariance Guided Multimodal Dataset Distillation with Trainable Text Encoder | https://www.arxiv.org/abs/2510.18583
502 | Bayesian Concept Bottleneck Models with LLM Priors | https://arxiv.org/abs/2410.15555
904 | On the $O(\frac{\sqrt{d}}{K^{1/4}})$ Convergence Rate of AdamW Measured by $\ell_1$ Norm | https://arxiv.org/abs/2505.11840
1016 | Beyond Scalars: Concept-Based Alignment Analysis in Vision Transformers | https://openreview.net/forum?id=l1DDTSqFq7
3012 | Regularized least squares learning with heavy-tailed noise is minimax optimal | https://openreview.net/pdf?id=TjQP5hc3WC
3018 | Robustness in Both Domains: CLIP Needs a Robust Text Encoder | https://github.com/LIONS-EPFL/LEAF
4011 | Do Language Models Use Their Depth Efficiently? | https://arxiv.org/abs/2505.13898
4610 | Multi-Kernel Correlation-Attention Vision Transformer for Enhanced Contextual Understanding and Multi-Scale Integration | https://openreview.net/forum?id=64WeVllQjq
4911 | Differentiable Hierarchical Visual Tokenization | https://github.com/dsb-ifi/dHT/ (SPOTLIGHT)

Thursday, Dec. 4

Keynote 3: Yejin Choi, Stanford

Choi began the keynote by summarizing the various spectres that haunt our field today: entropy, the fruitlessness of further brute-force upscaling, and finally mode collapse, wherein model performance tanks when models are trained on AI-generated output.

Prismatic Synthesis

Princeton has introduced the Vendi Score, but the original Vendi Score did not correlate with out-of-distribution generalization. For that use case, Choi devised the G-Vendi score.

Further, she introduced Prismatic Synthesis. Sample data are fed to a DeepSeek R1-32B model, which generates synthetic data. The data are then filtered for diversity and quality (about 70% of it). The results are further refined by another R1 model.

The approach outperforms baselines while using a 20x smaller model and has no reliance on supervision; that is, it does not need human labelers.

In conclusion, Choi states, reasoning requires data that transcends the internet. Synthetic data can rescue us, and systematic diversification of the data improves it.

RL as Pre-Training (RLP)

Next, she introduced RL as Pre-Training. The current paradigm is next-token prediction (NTP). This is learning by naive guessing, she posits, and it can be greatly improved through the introduction of RL into the NTP pipeline. Whereas NTP has no explicit reasoning step, RLP introduces an explicit reasoning step into pre-training.

RLP learns a Thought Policy and compares the entropy between a no-thought policy and the Thought Policy. The information gain (entropy difference) from the training step with the Thought Policy compared to the one without serves as the reward function.
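Choi described this information-gain reward only at a high level; as I understood it, and in my own notation, the per-token reward is roughly the log-likelihood improvement that the thought buys:

$$r_t \;\approx\; \log p_\theta\!\left(x_t \mid x_{<t},\, c_t\right) \;-\; \log p\!\left(x_t \mid x_{<t}\right),$$

where $c_t$ is the sampled thought for position $t$: a positive reward means the thought made the observed next token more likely than predicting it without a thought.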

Empirical evaluation asked two questions:

  • Can RLP improve reasoning ability of a base model?
  • Can RLP match the performance of a fully trained model, even with 200B fewer parameters?

Answers:

  • RLP outperforms the baseline Qwen3-1.7B-Base by 19% and CPT by 17% across math and science benchmarks. After SFT, RLP compounds its advantage over the baselines by several percentage points.
  • RLP continues to outperform the fully-trained Qwen with 200B fewer params, averaging a 35% increase.

Overall, the results gave Choi's team pause and made her question whether the very foundations of pre-training and NTP should be revisited. She pointed out that her team is neither the first nor the last to ask this question.

Concluding Remarks

AI should be democratized, says Choi. It should be of humans, by humans, for humans:

  • Of Humans: Belong and originate from human beings and reflect human values
  • By Humans: AI should be created and moderated by the people, not by large corporations or governments
  • For Humans: AI should serve humans, not other AI or the other way around.

OpenThought3 has approached the popularization of effortful RL. Instead of riding trends and cutting papers into thin slivers to get published, Choi says that researchers should focus on earthmoving publications following unconventional methods.

Choi summarized the "Art of Artificial Reasoning." It is all about the data, she says. At first, the internet was cannibalized; then humans were made to write further data; and when this was no longer sufficient for brute-force scaling, we moved on to synthetic datasets.

The solution, she argues, linking back to her own work, is to synthesize in-distribution data through cannibalization of OOD data. Choi justifies the approach by pointing out that the universe of knowledge is much greater than the available writings on the internet, or than what human annotators are capable of reproducing manually. Other synthetic approaches are really no stronger (although they are faster) than having humans write more data, since they cannot produce data from a different distribution than the training data.

Finally, she posits a few open questions:

  • Are there new theories of intelligence, knowledge, and reasoning that will allow humans to access the broader universe of knowledge themselves?
  • Humans do not have a context window size of 1M tokens: is this a limitation or a good thing?

Poster Session 3

ID | Title | Link
612 | Neurosymbolic Diffusion Models | https://arxiv.org/abs/2505.13138
706 | Multiclass Loss Geometry Matters for Generalization of Gradient in Separable Classification | https://arxiv.org/abs/2505.22359
902 | Q3R: Quadratic Reweighted Rank Regularizer for Effective Low-Rank Training | https://arxiv.org/abs/2511.04485
907 | Differentiable Sparsity via D-Gating: Simple and Versatile Structured Penalization | https://arxiv.org/abs/2509.23898 (SPOTLIGHT)
1109 | ElliCE: Efficient and Provably Robust Algorithmic Recourse via the Rashomon Sets | https://neurips.cc/virtual/2025/loc/san-diego/poster/118970
1412 | Utility Engineering: Analyzing and Controlling Emergent Value Systems in AI | https://arxiv.org/abs/2502.08640
2215 | A physics-preserved transfer learning method for differential equations | https://arxiv.org/abs/2505.01281
2512 | Investigating Hallucinations of Time Series Foundation Models Through Signal Subspace Analysis | https://neurips.cc/virtual/2025/loc/san-diego/poster/115394
4211 | Adaptive LoRA Experts Allocation and Selection for Federated Fine-Tuning | https://arxiv.org/abs/2509.15087
xxxx | A Compressive-Expressive Communication Framework for Compositional Representations | https://arxiv.org/abs/2501.19182

Poster Session 4

ID | Title | Link
3005 | Assessing the Quality of Denoising Diffusion Models in Wasserstein Distance | https://arxiv.org/abs/2506.09681
3909 | Structured Linear CDEs: Maximally Expressive and Parallel-in-Time Sequence Models | https://arxiv.org/abs/2505.17761
4015 | Learning to Add, Multiply, and Execute Algorithmic Instructions Exactly with Neural Networks | https://arxiv.org/abs/2502.16763

Friday, Dec. 5

Oral 5A: EvoLM: In Search of the Lost Language Model Training Dynamics - Zhenting Qi, Harvard

Qi began by introducing the three types of training in modern LMs. Training consists of pretraining, mid-training (domain-specific objectives), and post-training (fine-tuning).

Open Questions in LM Training

I sat for Oral 5A in the F-H Exhibition Halls in wait for the fifth poster session. With bated breath, I anticipated the latest and greatest in LLM research to be presented on the biggest stage that the machine learning community has to offer.

A young man stood center-stage for a long time, audio technicians and event organizers fiddling with the audio equipment to make sure everything was in working order. Nearer to me, a man stood atop a smaller stage behind the camera used to broadcast the festivities to the conference website and to the 12 projector screens arranged in a 2x6 matrix throughout the conjoined exhibit halls.

In pre-training and post-training, there is a host of unanswered questions: "Does scaling up pretraining always improve post-training?", "Does RL inject new capabilities beyond the pretrained model?", "How do we allocate compute between SFT and RL?", etc.

While it is not possible to answer every one of these questions, Qi says, his team introduces EvoLM, an end-to-end pipeline and model suite for training and evaluation. It has pretraining, CPT, SFT, and RL training modules and encapsulates evaluation as well. The model itself is a billion-scale (1B or 4B) Llama architecture.

Scaling up Training

Pretraining is scaled from 20B tokens to 320B tokens. Mid-training is scaled up as well; CPT mid-training is performed with some 50B tokens. While CPT alone would normally trigger forgetting, replaying pretraining data can mitigate this problem, Qi says. Then, as CPT data increase, post-trained model performance improves.

SFT post-training compute is scaled from 1 epoch to 32 epochs on a dataset of 50k examples. SFT-ed models show improved performance with diminishing returns on OOD tasks with RL.

Finally, RL is scaled up from 1 to 32 epochs, and it improves performance on both ID and OOD tasks, but with diminishing returns. RL further increases the chance of sampling high-quality trajectories.

Post-training is scaled up for both SFT and RL with a further 100k examples.

Conclusions

Qi et al. release 100+ LMs trained at scale with their pipeline.

Their future plans include training larger models and multimodal architectures. They also plan to explore more diverse post-training objectives.

Large Language Diffusion Models - Shen Nie, Renmin University of China

Nie introduces Large Language Diffusion with mAsking (LLaDA). It features unsupervised pretraining with masked tokens, supervised finetuning through response masking, and language generation by predicting masked tokens in parallel rather than linear next token prediction.

Nie compares generative models to a high-dimensional random distribution. In this way, the output of such models is a sample from that distribution space. Further, the properties of LLMs theoretically originate from the generative paradigm, not exclusively from the autoregressive framework.

Currently, LLMs perform left-to-right next-token prediction, token by token. However, the intuition that language contains redundancies suggests we can parallelize the processing of tokens. Diffusion models are naturally capable of bidirectional reasoning, but it was an open question at the outset how to model text with a diffusion model. The answer was a Masked Diffusion Model: with probability $t \in [0,1]$, each token in the sequence is replaced with a special mask token.

In training, the diffusion model (a transformer) accepts a sequence with arbitrarily masked tokens and must predict the true tokens behind the masks. The model is scored with the CE loss on the masked positions.
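Nie's slides stayed at the diagram level; a minimal PyTorch-style sketch of one such training step (random masking ratio, cross-entropy only on masked positions) could look like the following. `model` stands in for any bidirectional transformer over token IDs, the names are mine rather than LLaDA's, and the paper's full objective includes additional reweighting omitted in this sketch.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # assumed id reserved for the special [MASK] token

def masked_diffusion_step(model, tokens):
    """One masked-diffusion training step: mask each token with probability t,
    then score cross-entropy only on the masked positions."""
    batch, length = tokens.shape
    t = torch.rand(batch, 1)                                  # per-sequence masking ratio in [0, 1]
    is_masked = torch.rand(batch, length) < t                 # which positions to corrupt
    corrupted = torch.where(is_masked, torch.full_like(tokens, MASK_ID), tokens)

    logits = model(corrupted)                                 # (batch, length, vocab), bidirectional
    return F.cross_entropy(logits[is_masked], tokens[is_masked])
```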

Naturally, this creates a relationship with a famous related work: BERT. BERT uses fixed-ratio masked prediction, but such non-generative models lack an explicit probability representation.

Nie demonstrates the scalability of LLaDA to large context windows and shows examples of its capabilities as a chatbot interacting with a user. Whereas autoregressive models fall into the so-called reversal curse, wherein they fail to learn equivalence relationships in both directions (they can reproduce 'A is B' but not 'B is A'), LLaDA breaks those shackles.

While promising, Nie cautions, there are still challenges for the diffusion language model: supervision is sparse, the independence of masked tokens from the visible-token distribution is not provable, and system optimizations must be made as well.

Nie calls back to Sutton's Bitter Lesson, featuring the quote "We have to learn the bitter lesson that building in how we think we think does not work in the long run." With respect to LLaDA, this lesson teaches us that LMs should not necessarily be front-to-back, that is to say sequential. LLaDA is a first step in this direction.

Sejnowski-Hinton Award 2025: Random synaptic feedback weights support error backpropagation for deep learning - Timothy Lillicrap, University College London & Daniel Cownden, University of St. Andrews

At NeurIPS 2025, the 2025 Sejnowski-Hinton Award for papers that contribute to computational theories of the brain using AI insights was awarded to Timothy Lillicrap, Daniel Cownden, Douglas Tweed, and Colin Akerman on the basis of their work Random synaptic feedback weights support error backpropagation for deep learning. Lillicrap et al. (2016) observed that brains learn quickly, computing complex behaviors in what should be trillions of parameters. The question for AI research and neuroscience is how.

Backpropagation in the Brain

Neuroscientists have outlined how learning occurs on the small scale, at the level of individual neurons, but this understanding does not scale up to the level of the whole organism. In ML, it is known that backpropagation is what scales learning up. The authors questioned whether the same was true for real brains, and it turns out that Crick (1989), Grossberg (1987), and other authors in the 80s had already investigated the idea. They called biological backpropagation unrealistic for many reasons. For one, the authors summarize, Crick (1989) questioned what goal biological backpropagation would be optimizing toward. Grossberg (1987) conceived of the weight transport problem, which asks how weights, such as the ones necessary for backpropagation, could be transported in the brain.

The Backward Pass

In the 2010s, the team was "stoked" to understand the relationship between backpropagation and biological neural networks. They benchmarked several training methods with a two-layer NN and noticed that one called Feedback Alignment (FA) converged as quickly as, or more quickly than, backpropagation. This comparison between the two methods held in benchmarks with larger models and more complex data and tasks.

FA works, they prove, in the sense that it will converge, but Lillicrap et al. admit that there is no guarantee on the speed of convergence despite the empirical results. Further, there are cases in which it will not work (will not converge quickly).
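To make the contrast with backpropagation concrete, here is a small numpy sketch of feedback alignment on a two-layer regression network: the backward pass uses a fixed random matrix B in place of the transpose of the forward weights, so no weight transport is needed. This is my own illustration under toy assumptions, not the authors' benchmark code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 20, 64, 5

W1 = rng.normal(0.0, 0.1, (n_hidden, n_in))   # forward weights, layer 1
W2 = rng.normal(0.0, 0.1, (n_out, n_hidden))  # forward weights, layer 2
B  = rng.normal(0.0, 0.1, (n_hidden, n_out))  # FIXED random feedback weights, never trained

W_true = rng.normal(0.0, 0.5, (n_out, n_in))  # toy linear teacher providing targets
lr = 0.02

for step in range(10_000):
    x = rng.normal(0.0, 1.0, (n_in, 1))
    y = W_true @ x                             # regression target

    h = np.tanh(W1 @ x)                        # forward pass
    y_hat = W2 @ h
    e = y_hat - y                              # output error

    # Backprop would use W2.T @ e here; feedback alignment uses the fixed random B.
    delta_h = (B @ e) * (1.0 - h**2)

    W2 -= lr * e @ h.T                         # both layers still receive local updates
    W1 -= lr * delta_h @ x.T

    if step % 2000 == 0:
        print(step, float(np.mean(e**2)))      # error falls without weight transport
```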

Direct Feedback Alignment (DFA)

Further, DFA extends the FA training paradigm by broadcasting the error directly back to every layer of the network, all the way to the first ones.

Keynote: Demystifying Depth: Principles of Learning in Deep Neural Networks - Andrew Saxe, University College London Theory of Learning Lab

For all their breathtaking and seemingly miraculous abilities, Saxe points out that AIs and LLMs are still practically black boxes. We understand very little of the inner workings behind them.

The goal of deep learning theory is to explain how we go from raw data to strong results with large amounts of compute, without going through the process of computer simulation (i.e., without running experiments directly).

Since tackling physical problems explicitly is very difficult, so-called surrogate models are necessary to understand the problem at some level of abstraction we can reason about. For example, whereas many gas molecules push against the walls of a box and create pressure within it, the equations for calculating that pressure are not written in terms of the individual gas molecules. Saxe's contribution is the same, but for deep learning.

Deep Linear Neural Network

Deep Linear Neural Networks rephrase the problem of gradient descent. The MSE loss becomes nonconvex, and the equation of gradient flow dynamics (gradient descent) is altered. However, thanks to the linearity, the error surface can be visualized in 2D. As it turns out, shallow networks converge very quickly whereas deeper networks do so slowly. The simplicity of the deep linear NN equations actually allows for an analytical solution, and this analysis bears out the empirical data.

Training speed is shown to be $O(1/(s\,b_0^D))$, where $s$ is the input-output correlation, $b_0$ is the initial layer weight, and $D$ is the depth. Further manipulation of the deep linear equations reproduces theoretically what empirical results have told us about the relationship between weight initialization and training speed: random initialization creates an exponential dependence on the depth, whereas modern approaches show that convergence is $O(1)$ with respect to depth.
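The qualitative claim is easy to check numerically. The toy below (my sketch, not Saxe's code) trains a depth-D chain of scalar weights by gradient descent to fit a target input-output correlation s, and counts the steps needed; for small initial weights b0, the step count grows rapidly with depth.

```python
import numpy as np

def steps_to_learn(depth, s=1.0, b0=0.25, lr=0.01, tol=0.05, max_steps=200_000):
    """Gradient descent on a product-of-scalars 'deep linear net' fitting target s."""
    w = np.full(depth, b0)
    for step in range(max_steps):
        y = np.prod(w)                          # network output for unit input
        err = y - s
        if abs(err) < tol:
            return step
        # d/dw_i of 0.5*(prod(w) - s)^2 is err * prod of the other weights
        grad = err * np.array([np.prod(np.delete(w, i)) for i in range(depth)])
        w -= lr * grad
    return max_steps

for depth in (1, 2, 4, 6):
    print(depth, steps_to_learn(depth))         # steps grow quickly with depth for small b0
```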

SVD change of variables

An SVD change of variables in the Deep Linear NN equations produces a hierarchy of saddle points, wherein said points surround local maxima in a pyramidal shape. The SVD further allows us to understand how a DLNN will behave for some "world" without actually having to train the model.

PCR

Work with linear attention reproduces the hierarchy and shows that such models implement PCR (principal component regression) in context. This is a known algorithm, and it provides an example of a training surface on which we know what is happening at all points during training.

Nonlinear networks pose the final challenge before we can apply Saxe's work to real problems. Thanks to gating, nonlinear deep NNs are effectively compositions of linear networks, and this is how Saxe models them. However, he admits that the understanding is only approximate. There is also the phenomenon of neural race conditions, in which one path of the nonlinear network comes to dominate over time during training. Paths that learn faster (minimize loss faster) will dominate, and losing pathways do not matter in the long run. This sparsifies the network that theoreticians need to consider to understand training behavior.

As an example, Saxe shows an application of his new theory to multilingual translation, wherein samples in any of a set of languages may be translated to another language in the set. There are many ways to connect the subnetworks that represent the model for each language pair. Following the results stated previously, the quickest training and best generalization fall out of a network that shares neurons across every language pair. Further, his work permits us to quantify how much of the available data (~40%) must be trained on before the model generalizes from ID tasks to OOD.

Poster Session 5

ID | Title | Link
712 | Adaptive Riemannian ADMM for Nonsmooth Optimization: Optimal Complexity Without Smoothing | https://arxiv.org/abs/2510.18617
714 | Finite-Time Analysis of Stochastic Nonconvex Nonsmooth Optimization on the Riemannian Manifolds | https://arxiv.org/abs/2510.21468
2711 | DreamPRM: Domain-Reweighted Process Reward Model for Multimodal Reasoning | https://arxiv.org/abs/2505.20241
2807 | Universal Sequence Preconditioning | https://arxiv.org/abs/2502.06545
3011 | On the Necessity of Adaptive Regularisation: Optimal Anytime Online Learning on $\ell_p$ Balls | https://arxiv.org/abs/2506.19752
xxxx | Sinusoidal Initialization, Time for a New Start | https://arxiv.org/abs/2505.12909

Saturday, Dec. 6

ML for Physical Science Workshop

Continual Learning for Particle Accelerators - Malachi Schram, Jefferson Lab, US DOE

Schram first gave an overview of his work, beginning with the motivation. His team would like to incorporate ML into workflows for particle accelerators at Oak Ridge. In particular, this work discusses efforts to apply ML to automate the detection of faults in the beam pulses responsible for supplying a controlled stream of particles to the accelerator.

The data are time series with a flow rate of 100 MB/s. The Differential Current Monitor (DCM) monitors the beam pulses, which are fired with brief pauses between them. The goal is to understand whether the waveform of the pauses changes in response to a beam fault and, if so, how to leverage AI/ML to detect the fault (classify the status). A Siamese neural network (SNN) was developed and later augmented with a Gaussian process approximation. The ROC curve shows a true fault detection rate of 60%.

Data Drift

Over time, the character of the data will change in response to measurable parameters in the system; the beam, for instance, is tuned continuously. Data drift may also occur due to non-measurable parameters, unexpectedly and unwelcomely. Changes in the hardware associated with side effects from testing, such as the heating of components, can affect the speed, timing, and strength of the beam. Thus, a model that can classify the beam status in the presence of such immeasurable data drift is necessary. Conditional Siamese NNs (CSNNs) are employed and compared to the traditional approach. The CSNN prevails both in terms of accuracy and time, because conditional learning has been shown to maintain performance against the effects of drifting data.

Continual Learning Experiment

Instrument data were collected and segmented into train and test sets for continual learning under a variety of beam configurations. CL addresses the drift due to known changes, but there is catastrophic forgetting, and traditional training actually shows decaying performance in the long run. Adding Growing Selective Replay, wherein a set of remembered replays is kept while everything else is discarded, fixes the decay problem.
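The talk described Growing Selective Replay only at a high level. The sketch below shows the general shape of such a scheme under my own assumptions: keep a small, growing memory of selected past samples and mix it into each new training batch. The selection rule is a random placeholder, since the talk did not specify how samples are chosen.

```python
import random

class GrowingSelectiveReplay:
    """Toy continual-learning buffer: remember a selected subset of past samples."""
    def __init__(self, keep_per_task=100, seed=0):
        self.memory = []                  # samples retained from earlier configurations
        self.keep_per_task = keep_per_task
        self.rng = random.Random(seed)

    def remember(self, task_samples):
        # Placeholder selection rule (random); the actual criterion was not given.
        kept = self.rng.sample(task_samples, min(self.keep_per_task, len(task_samples)))
        self.memory.extend(kept)          # the buffer grows with each new beam configuration

    def training_batch(self, new_samples, replay_fraction=0.5):
        n_replay = int(len(new_samples) * replay_fraction)
        replay = self.rng.sample(self.memory, min(n_replay, len(self.memory)))
        return new_samples + replay       # mix old and new to counter forgetting
```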

Summary

AI/ML is more effective than traditional techniques for detecting beam faults, UQ methods have been introduced for OOD data, and CL counteracts the catastrophic forgetting shown in the traditional training regime.

Sunday, Dec. 7

The PokéAgent Challenge: Competitive and Long-Context Learning at Scale

Kaggle Game Arena - MinMin Chen, Google DeepMind

Chen introduces the concept of Jagged Intelligence: models have excelled at complex problem solving, but they are still capable of nonsensical errors (blunders) in various games, even ones as simple as tic-tac-toe.

Further, static benchmarks are becoming oversaturated and are contaminating the training data. Thus, dynamic, competitive arenas that rank models by human preference, agentic performance, or games are becoming commonplace.

Games challenge agents not just in problem solving ability but in their ability to form a theory of mind for their opponent and to plan in the long term.

Game Arena

Chen provides an overview of the arena. There is a game environment in which the game(s) take place. An agent harness builds prompts and parses agentic actions, acting as a mediator between the models and the game environments. An IO loop is established, and an external visualizer and leaderboard provide a UI for displaying game information and reasoning traces, and a ranking of agents, respectively.

Turn-based classic board games and card games were the first to be adapted to the arena, using OpenSpiel, an open-source framework for RL and games with 100+ supported games.

The prompt builder inserts several arguments into a general prompt that asks the model to play a game, explains the game, lists the legal moves from the current position, and prompts the model to make its next move. Chen points out that we can also test models on move legality by removing the legal-moves list. Experiments with the latter configuration reveal that even the strongest models sometimes struggle to produce a legal move, showing an illegal-move rate of 17%. Self-consistency and re-thinking have been used to cut this down to 0%.
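Chen showed the prompt only on a slide; the function below is a rough reconstruction of the pattern she described (game rules, current position, optionally the legal-move list), with all wording and names my own rather than Game Arena's actual template.

```python
def build_game_prompt(game_name, rules, position, legal_moves=None):
    """Assemble a harness prompt; omit `legal_moves` to also test move legality."""
    parts = [
        f"You are playing {game_name}.",
        f"Rules: {rules}",
        f"Current position: {position}",
    ]
    if legal_moves is not None:                  # dropping this section forces the model
        parts.append("Legal moves: " + ", ".join(legal_moves))  # to reason about legality itself
    parts.append("Reply with your next move.")
    return "\n".join(parts)

print(build_game_prompt("chess", "standard FIDE rules", "FEN: starting position",
                        legal_moves=["e2e4", "d2d4", "g1f3"]))
```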

The Arena was the backbone of the chess tournament benchmark. A single-elimination knockout tournament was instated with Burstein seeding. Any LLM that continued to produce illegal moves after three re-think steps was eliminated. A Swiss-system event was also implemented, in which every model faced every other model. OpenAI's o3 won the tournament, with Grok 4 coming second and Gemini 2.5 Pro third.

Discussion

Some people may find the chess benchmark regressive, since we have had models that play games at a high level for a long time. However, Chen explains, such perspectives overlook the fact that the models benchmarked are not specifically designed to play chess (or other games), whereas the models of decades ago could do nothing but play chess. It is a triumph of LLMs' generalizability.

On the other hand, models are sensitive to input format and appear to be stronger at openings (which are memorized) and weaker in the mid- and endgame. Other challenges include high variance, especially for games without perfect information, and inference cost ballooning as the game continues in the long run.

Lightning Talks

The lightning talks were given by winners and high-performing teams from a pair of agentic Pokemon competitions that took place before NeurIPS. The winners of the two competitions (tracks) were given awards. The competitions were as follows:

  • Track 1: Best-of-99 single-elimination tournament of OU Pokemon battles in Gen 1 and Gen 9
  • Track 2: Speedrunning competition for Pokemon Emerald. Agents race to beat the first gym leader, Roxanne.

Curriculum Learning for Pokemon Battle (Track 1) - Qiao Wang, Google

There is a massive state space for Pokemon battles, so the team takes a two-phase approach.

  • Phase 1: Fine-tuning. Qabra is fine-tuned on 65K games from various generations.
  • Phase 2: The model learns on an additional 100K games with a coach.

The system achieves a 0.8 win ratio over the coach and took 2nd place in the competition. Code is available on GitHub.

PA Agent (Track 1, Gen 1 OU winner) - Xianwei Shi, PingAn Life Insurance

The high combination space for teambuilding is a particular challenge that compounds the complexity of Pokemon battles on top of actually having to play the game.

The team's solution starts by sampling teams from Smogon forums and combines RL + Transformers to iterate over inter-model battles, starting with the sampled teams and human battle data as a baseline.

The training begins with all human battle data and gradually decreases the proportion of human data as training continues, until all combatants are agents.

Foul Play: A Competitive Pokemon Showdown (Track 1, Gen 9 OU Winner) - Patrick Mariglia

Foul Play is a Monte Carlo method for playing Pokemon. Mariglia starts by introducing the main information problem with Pokemon as an AI problem: despite the fact that the model has just five legal moves at any point in the game, each one of these moves can give rise to many child game states, and it is rarely possible to know which of the states the game will land in until after the decision has been made and the turn has played out.

His team used the PokeEngine battle engine, which performs Monte Carlo search to efficiently explore a simplified tree of possible game states. The tree is "simplified" in the sense that the 16 different damage rolls that would normally have to be considered are ignored.

Information is king in the game of Pokemon, says Mariglia (this is true from personal experience). Playing well is about understanding the information given to you and performing deductions on the available data. For instance, if a Landorus-Therian outspeeds a Garchomp (does its move for the turn first), then it must be holding a Choice Scarf, which boosts speed. The Choice Scarf does not announce itself, but its presence can be inferred from the fact that a nominally slower Pokemon outsped a faster one.

For cases where something is not certain, Set Prediction is used most of the time. Pokemon Showdown hosts battle data that Foul Play accesses, and it also analyzes the game state. Using the probabilities from Showdown's databanks, the model builds a decision tree from the game state, performs Monte Carlo search over the possible states, and picks a move.
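Mariglia's description maps onto a fairly standard sample-then-search loop. The sketch below (mine, not Foul Play's code) shows the shape of it: sample concrete opponent sets from usage statistics, score each candidate move against the sampled states, and pick the move that does best on average.

```python
import random
from collections import defaultdict

def choose_move(my_moves, sample_opponent_set, evaluate, n_samples=50):
    """Pick the move with the best average outcome across sampled hidden information.

    sample_opponent_set(): draws one concrete opponent set from usage statistics.
    evaluate(move, opponent_set): search/rollout score of `move` against that set.
    """
    totals = defaultdict(float)
    for _ in range(n_samples):
        opponent_set = sample_opponent_set()          # resolve the hidden information
        for move in my_moves:
            totals[move] += evaluate(move, opponent_set)
    return max(my_moves, key=lambda m: totals[m])

# Toy usage with made-up numbers: "protect" scores best across sampled sets.
moves = ["earthquake", "protect", "switch"]
best = choose_move(
    moves,
    sample_opponent_set=lambda: random.choice(["choice-scarf", "assault-vest"]),
    evaluate=lambda m, s: {"earthquake": 0.4, "protect": 0.7, "switch": 0.5}[m],
)
print(best)
```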

The code and documentation are available on GitHub.

Hamburg Pokerunners (Track 2 Second Place) - Arian Urdu, Univ. Hamburg

The team's solution began with DreamerV3, which learns a model of the game world and trains an actor-critic in imagination. However, training the world model was prohibitively expensive.

The final solution relies on recurrent PPO. It samples the game world, feeds embeddings of the world to an LSTM, and learns a DNN to navigate. Sampling the world, rather than having to learn it ahead of time, simplifies training and the world representation.

Heatz (Track 2 Winner) - Junik Bae

This was the only system able to beat Roxanne in less than an hour (40:14). The approach prompts an LLM to write a scripted policy for navigation subtasks and evaluates it in the environment. An RL agent begins with the scripted policy and undergoes RL training to improve on the LLM's original "intuition."

The code is available on GitHub.

It ended with a panel, but I had to leave for my presentation at the Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling workshop.

Christopher standing by the portrait-style OPTiCAL poster at the LLMEval workshop at NeurIPS, December 7, 2025.

References

  1. Chase, Z., Hanneke, S., Moran, S., and Shafer, J. Optimal Mistake Bounds for Transductive Online Learning. Advances in Neural Information Processing Systems 38. San Diego, CA. 2025.
  2. Fishelson, M., Golowich, N., Mohri, M., and Schneider, J. High-Dimensional Calibration from Swap Regret. Advances in Neural Information Processing Systems 38. San Diego, CA. 2025.
  3. Lillicrap, T., Cownden, D., Tweed, D. B., and Akerman, C. J. Random synaptic feedback weights support error backpropagation for deep learning. Nature Communications, 7(13276). 2016.
  4. Crick, F. The Recent Excitement About Neural Networks. Nature, 337, 129-132. 1989.
  5. Grossberg, S. Competitive Learning: From Interactive Activation to Adaptive Resonance. Cognitive Science, 11, 23-63. 1987.

For more information about our research, return to our homepage: ufdatastudio.com.
