Academic Highlights from NeurIPS 2025, San Diego

By Christopher William Driggers-Ellis on Dec 17, 2025
A picture of the Marriott Marquis hotel behind the San Diego Convention Center. The latter was the venue for NeurIPS 2025.

In this post, I will give a day-by-day summary of academic highlights from NeurIPS 2025. There were hundreds of presentations and thousands of posters given in San Diego between Dec. 1 and 8, and I was not present in Mexico City for the work presented there. The papers summarized in this post are by no means a comprehensive list of the papers and posters published at NeurIPS this year. I will provide highlights of the oral sessions I attended and list the best posters from the poster sessions I went to. The latter were too numerous to describe individually, so they are given in tables.

Tuesday, Dec. 2

I arrived in San Diego the night of Dec. 1 and checked into my hotel. Dec. 2 marked the beginning of the conference with a day of industry presentations and expos.

Lambda Expo Workshop: Multimodal Superintelligence Workshop

The speaker began by introducing the three horsemen of the multimodal apocalypse:

  • Data: Data is balkanized. Over 500 multimodal (MM) datasets exist, but there are no standards.
  • Compute: Compute is not always available. For those who always have a cluster, or can keep the same cluster ad infinitum, that is a great privilege.
  • Execution: The human capital can be lacking. Practitioners do not always follow best practices and are not omnipotent. Many machine learning researchers do not work with multimodal data at all. They also all have to get on the same page.

Lambda's LAILA workflow is about taming all three horsemen at once, removing these problems.

Jason Zheng, Stanford | Lambda

Zheng described how to unify data. The process begins with the data as they are now, distributed across numerous silos such as GPU VRAM, RAM, and different storage systems.

To speed up retrieval of data in these numerous locales, LAILA tracks them with unique resource identifiers. Further, to address the problem of non-standard dataset formats, which often block generalization from one benchmark to inference on real-world data, LAILA introduces a universal dataset format.

Unification through resource location speeds up training and inference by increasing training throughput. The universal dataset format saves time by removing the need for preprocessing before training and/or inference on new data. Further, the format accommodates all (popular) modalities.
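Lambda showed no code for this, but the general idea is easy to sketch: a registry resolves a unique resource identifier to a record in one canonical, modality-agnostic format regardless of where the underlying bytes live. The class and field names below are entirely my own invention, not LAILA's.

```python
from dataclasses import dataclass

@dataclass
class UnifiedSample:
    """Hypothetical 'universal format' record: one optional field per modality."""
    uri: str                       # unique resource identifier for the sample
    text: str | None = None
    image_path: str | None = None
    audio_path: str | None = None

class DatasetRegistry:
    """Toy registry: maps URIs to records so retrieval is a single lookup."""
    def __init__(self):
        self._index: dict[str, UnifiedSample] = {}

    def register(self, sample: UnifiedSample) -> None:
        self._index[sample.uri] = sample

    def fetch(self, uri: str) -> UnifiedSample:
        return self._index[uri]

registry = DatasetRegistry()
registry.register(UnifiedSample(uri="laila://corpus-a/0001",
                                text="a dog barks",
                                audio_path="/mnt/a/0001.wav"))
print(registry.fetch("laila://corpus-a/0001").text)
```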

Jessica Nicholson, University of Bath | Lambda

Zheng then moved to the Hybrid Execution Engine but merely promised more information about it later, and handed things over to Jessica Nicholson to explain Unified Compute and its inner workings.

She discussed Neural Policy Orchestration. The concept of orchestrating a neural policy unifies the training objective, the compute available, and the universe of discourse of your dataset.

Jessica advertised that more information about Lambda and LAILA would be available at Booth #713, and I looked forward to seeing them there.

Multimodal Foundation Models - Amir Zadeh, Carnegie Mellon University | Lambda

Amir Zadeh, the first author listed on the expo's page in the NeurIPS schedule, took the stage and showed the classic video of the McGurk effect, by which a listener's perception of a spoken syllable is shaped not only by the sound but by subconsciously reading the lips of the speaker. Two videos of the same man speaking are presented. The man's lips say different syllables in each, but the audio is exactly the same. 70-80% of people, Zadeh explained, change their judgment of the syllable according to which video is shown to them.

The point of this was to demonstrate that multimodal reasoning is pervasive in natural neural networks (our brains). Zadeh remarked that unimodal reasoning is actually rare in animals and people. The core function of multimodal reasoning, Zadeh continued, is to predict any modality from any other.

He introduced Massively Multimodal Modeling (4M). This out-of-the-box multimodal model receives data in one of a set of modalities and predicts what the input in the other modalities would be for the same multimodal sample. For this reason, says Zadeh, the model is also inherently multitask, and he showed several examples of this any-to-any inference and generation.

Zadeh presented a matrix of input modalities against outputs in the other modalities studied, including text, visible-light images, audio, and various other imaging methods. 4M is able to perform any-to-any MM generation and achieves SOTA in various tasks compared to unitask models, showing that 4M leverages multi-task MM learning without losing fidelity in any one task.

4M features a redesigned architecture and is scaled up to support tens of modalities, with data and model size on the billions scale. Naturally, this implies training lengths in the trillions of tokens.

Expo Talk Panel: Agentic AI/RL, Cloud Native & Pytorch

Just as I sat down to attend the panel, Davide Testuggine (SE, Meta) was finishing his presentation titled PyTorch-Native RL & Agentic Development At Scale.

Everyone clapped, and Aksel Joonas Reedi of Hugging Face took the stage to give his talk: RL Environments are Forever.

Reedi introduced RL environments, which he says are a new frontier in post-training. However, they face two problems:

  • The training is costly.
  • When problems are too easy, there is not enough signal, and when they are too hard, the agents receive too poor a reward to improve their performance.

DeepSeek-v3.2 is a triumph of RL environment training. It boasts a new SOTA for performance and used 1800 such environments for post-training. The downside is that these environments were scattered throughout GitHub and had to be scrounged up.

The subject of this talk is a collaboration between Meta and HF called OpenEnv, which standardizes RL environment training through an end-to-end framework. OpenEnv not only standardizes all stages of the RL environment training pipeline, unifying the training process for all models, but it also integrates with existing HF APIs.

Environments are hosted live through the HF API, just like many transformer models. Environments can be created and deployed to the Hugging Face Hub with the new openenv executable.

OpenEnv and Reinforcement Learning

Daniel Han and Sanyam Bhutani described the results of a hackathon hosted to demonstrate reinforcement learning for agents using the new openenv project.

In this hackathon, LLMs were trained through reinforcement learning to play chess, Wordle, Mario, Pac-Man, and even Pokemon Showdown.

Bhutani described in this talk how to build one's own environment for a similar project, which is a very simple endeavor, requiring just one line of code to initialize the RL environment: openenv init my_env.

The strength of openenv is the universal, type-safe interface it provides for RL training:

  • reset: starts a new episode
  • step: executes an action and returns observations
  • state: grabs episode metadata

For example, RL training for the game of Connect4 has the following interface (a minimal sketch follows the list):

  • reset: take the chips out of the board
  • step: make a move
  • state: get the arrangement of chips on the board
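To make the shape of that interface concrete, here is a toy Connect4 environment with reset/step/state methods. This is my own illustration of the pattern described in the talk, not the actual OpenEnv API; class and method details are assumptions.

```python
import numpy as np

class Connect4Env:
    """Toy reset/step/state environment in the style described in the talk."""

    def __init__(self):
        self.board = np.zeros((6, 7), dtype=int)   # 0 empty, +1 / -1 for the two players
        self.player = 1
        self.turn = 0

    def reset(self):
        """Start a new episode: take the chips out of the board."""
        self.board[:] = 0
        self.player, self.turn = 1, 0
        return self.board.copy()                   # initial observation

    def step(self, column: int):
        """Execute an action (drop a chip) and return the new observation.
        Assumes the chosen column is not already full."""
        row = int(np.where(self.board[:, column] == 0)[0].max())
        self.board[row, column] = self.player
        self.player, self.turn = -self.player, self.turn + 1
        return self.board.copy()

    def state(self):
        """Grab episode metadata."""
        return {"turn": self.turn, "to_move": self.player}

env = Connect4Env()
obs = env.reset()
obs = env.step(3)
print(env.state())
```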

Han took over, transitioning to a more detailed example of RL for the game 2048, which had a live notebook available through Google Colab.

Sharing his screen, Han walked through the RL environment with the notebook on display. The notebook, he explained, also contained pointers on how to design RL training objectives that avoid degenerate strategies.

Han also told the audience that just one prompt, fed to the model many, many times, is sufficient to perform RL once an environment is in place, and that this is true not just of simple games like Connect4 and 2048 but of complicated ones like Pokemon Showdown. The model follows the reward function, and if it is well made, the rest falls into place.
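Han did not show the training loop itself, so the following is a deliberately tiny sketch of the pattern he described: one fixed prompt, sampled many times, with the reward function doing all the shaping. The softmax "policy" and the reward here are stand-ins of my own, not openenv's API or the hackathon code.

```python
import math
import random

ACTIONS = ["up", "down", "left", "right"]
logits = {a: 0.0 for a in ACTIONS}                  # tiny policy: one logit per action

def sample_action():
    weights = [math.exp(logits[a]) for a in ACTIONS]
    return random.choices(ACTIONS, weights=weights)[0]

def reward(action):
    return 1.0 if action == "left" else 0.0         # pretend the environment rewards "left"

for _ in range(500):                                # the same prompt, fed again and again
    batch = []
    for _ in range(32):
        action = sample_action()
        batch.append((action, reward(action)))
    baseline = sum(r for _, r in batch) / len(batch)
    for action, r in batch:                         # crude REINFORCE-style bandit update
        logits[action] += 0.1 * (r - baseline)

print(max(logits, key=logits.get))                  # converges to "left"
```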

Wednesday, Dec. 3

Keynote - Rich Sutton, University of Alberta & Openmind Research Institute

Sutton says that, while ML is wildly successful, both economically and scientifically, we have in many ways lost sight of the goal. In this talk, he introduces his vision for superintelligence from experience with the titular OaK architecture.

Outline:

  • Sutton's Quest for AI-agents that respect the bitter lesson.
  • Everything must be done at runtime in big worlds
  • OaK architecture.

Sutton's Quest

Sutton's quest is to design an AI agent that is

  • Generalizable
  • Experiential
  • Able to discover novel abstractions from inference
  • Without bitterness, meaning that nothing domain-dependent is built-in.

The bitter lesson, as Sutton calls it, is the revelation that AI should not depend on developer knowledge. It is the culmination of historical observations:

  1. AI developers try to build knowledge into agents.
  2. This helps in the short term.
  3. In the long term, it plateaus the performance of the model.
  4. Breakthroughs eventually arrive, masking the problem.

Big Worlds

The Big World Perspective is something that Sutton has acquired from experience with agents: the space of the problem is inherently larger and more complex than the framework used to model the agent.

This implies that whatever the agent learns cannot be a complete representation of the problem space but, rather, an approximation. This makes the problem space (world) appear to be moving.

Sutton says that inaccuracies in the design-time understanding of the problem space necessitate run-time learning and abstractions, since the problem is misunderstood, or only approximately understood, at any given time.

OaK Architecture

It is with the bitter lesson and the big world perspective in mind that Sutton introduces the OaK architecture. He began by stating that there is substantial consensus across disciplines on the mechanics of intelligent decision making. Different fields, he claims, merely attach different names to the same decision-making processes. He said this over a table of near-synonymous terms from different scientific and intellectual disciplines.

Providing examples from the table, he used this consensus as justification for a consensus model of intelligent agents:

  • Perception: Observations produce state-feature vectors
  • Reactive Policy: Produces actions appropriate for the state
  • Value Function: Calculates reward a la RL
  • Transition model: Predicts the consequences of alternate actions.

Previous approaches focus on learning what to do at the next step, at each step. Meanwhile, Sutton's Options and Knowledge (OaK) architecture compares alternative options. Options are policy and termination condition pairs $(\pi, \gamma)$. In the OaK architecture, agents compare many such options and learn based on the value of following these options to their termination conditions.

Instead of learning what to do at individual steps, the agent learns a high-level transition model that creates plans for multi-step state transitions.

OaK inherently adds auxiliary subproblems to a known problem and builds up from these to learn the overall task. Features learned by OaK define subproblems that produce options and inform the transition model.
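Sutton showed no code, but the option abstraction itself is compact. Below is my own rendering of an option as a policy paired with a termination condition, with termination modeled as a deterministic predicate for simplicity; it is an illustration of the idea, not OaK's implementation.

```python
from dataclasses import dataclass
from typing import Callable

State, Action = int, int                      # placeholder types for illustration

@dataclass
class Option:
    """An option: a policy pi together with a termination condition gamma."""
    pi: Callable[[State], Action]             # what to do in each state
    gamma: Callable[[State], bool]            # when to stop following the option

def run_option(option: Option, state: State,
               step: Callable[[State, Action], State]) -> State:
    """Follow the option's policy until its termination condition fires."""
    while not option.gamma(state):
        state = step(state, option.pi(state))
    return state

# Toy usage: an option that walks a counter up to 10, in a world where
# `step` simply adds the chosen action to the state.
walk_to_ten = Option(pi=lambda s: 1, gamma=lambda s: s >= 10)
print(run_option(walk_to_ten, 0, step=lambda s, a: s + a))   # -> 10
```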

Subproblem analysis in AI/RL is not a new concept. Sutton produced citations going back to curiosity in RL (1991) showing that researchers have looked at subproblems for a long time. However, OaK sets itself apart by answering four open questions with respect to subproblems in RL:

  1. Q: What should the subproblems be? - A: Different for different tasks, but they have a unified representation in OaK through the abstraction of options.
  2. Q: Where should they come from? - A: Agents must create them. They cannot all be pre-programmed or anticipated by designers.
  3. Q: Can the agent generate its own subproblems? - A: Yes. OaK creates a subproblem for each feature $i$, together with an associated reward weight $\kappa_i$.
  4. Q: How do the subproblems help the main problem? - A: OaK leverages the option abstraction to solve subproblems, learning large, multi-step transitions through the transition model that maximize the features associated with each subproblem, and weighing the importance of each subproblem by its $\kappa$ value.

While a lofty ambition, OaK has not yet been completely implemented. Sutton explains that several issues with different steps in the pipeline are holding it back from complete implementation. Many steps rely on continual deep learning, which has not yet been achieved in the RL literature, while others have gaps in the design even though the math behind them remains sound. For now, while not yet practicable, the theory of the OaK architecture remains promising.

Oral 1C Theory 1 - Optimal Mistake Bounds for Transductive Online Learning - Zachary Chase, Kent State & Jonathan Shafer, MIT.

Theory:

  • Q: Is unlabeled data useful for classification?
  • A: In PAC learning, No

But with online learning, it might be.

Transductive Online Learning

For some domain $\mathcal{X}$ and hypothesis class $\mathcal{H} \subseteq \{0,1\}^{\mathcal{X}}$, we can describe online learning as a zero-sum game between a Learner and an Adversary.

In each trial, the learner predicts a label, the adversary chooses the true label, and the two are compared; the learner's goal is to make as few mistakes as possible.

In transductive online learning, all trials are known to the adversary, and the adversary must choose all of the labels beforehand. The learner is given more information, namely the full sequence of unlabeled trials from the beginning.

Chase et al. (2025) show that, for every hypothesis class, the transductive online mistake bound grows at least as $\Omega(\sqrt{\cdot})$ of the standard mistake bound, and that there exists some $\mathcal{H}$ for which the transductive online mistake bound is also $O(\sqrt{\cdot})$ of the standard mistake bound, so the square-root relationship is tight.
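In symbols (my notation: $M(\mathcal{H})$ for the standard online mistake bound and $M^{\mathrm{trans}}(\mathcal{H})$ for the transductive one), the two results read roughly as:

$$M^{\mathrm{trans}}(\mathcal{H}) \;=\; \Omega\!\left(\sqrt{M(\mathcal{H})}\right) \quad \text{for every } \mathcal{H}, \qquad M^{\mathrm{trans}}(\mathcal{H}) \;=\; O\!\left(\sqrt{M(\mathcal{H})}\right) \quad \text{for some } \mathcal{H}.$$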

Outline of Proof

Proof of the lower ($\Omega$) bound follows from a greedy algorithm and the corresponding proof of its minimal error.

The upper ($O$) bound for some $\mathcal{H}$ among the hypothesis classes follows from the idea that sparse codewords are easy to guess. In other words, every time the transductive online learner makes a mistake at a single trial $x_t$, it gains a substantial amount of information about the whole sequence of trials $\{x_i\}_{i=1}^{n}$. For example, knowing that a bit sequence is one-hot encoded would permit the learner to make only a single mistake on the sequence by guessing 0 at every step: the learner would only be wrong on the one bit that is not zero.

It was revealed at the end of the talk that their paper won runner-up for best paper in the track.

High Dimensional Calibration From Swap Regret - Maxwell Fishelson, MIT

Fishelson began the presentation with the following thought experiment. Imagine that you must predict the likelihood of rain each day. The goal of the forecaster is to produce a distribution of forecasts that matches the empirical frequency of the realized outcomes. Complicating matters, we suppose that the outcome, whether it rains each day, is chosen by an adversary. At the outset, Fishelson notes, the task is impossible to do perfectly, since the adversary is capable of choosing the opposite outcome at each step.

This thought experiment models single-class calibration, where for some convex subspace $\mathcal{P} \subseteq \mathbb{R}^d$, the forecaster chooses positions in $\mathcal{P}$ and the calibration error is the one-norm distance between the forecaster's prediction and wherever the adversary places the true value in $\mathcal{P}$.

He then moved to multi-class calibration, wherein the forecaster must assign probabilities to many different possible outcomes rather than one of two. For such multiclass problems, we ask for calibration error less than some $\epsilon$ times $T$ and prove bounds on $T$.

Fishelson et al. (2025) propose TreeCal, a calibration error optimizer that generalizes to all $\mathcal{P}$ and error norms, with bounds $\epsilon T \ge \mathrm{Cal}^{\|\cdot\|^2}_T$, where $\mathrm{Cal}^{\|\cdot\|^2}_T$ is the calibration error, and $T \le \left(\mathrm{diam}(\mathcal{P})/\sqrt{\epsilon}\right)^{O\left(\mathrm{Rate}(\mathcal{P},\,\|\cdot\|)/\epsilon^2\right)}$.

TreeCal

Consider a tree over a horizon of $H^L$ timesteps. TreeCal splits the tree into $L$ layers, with intervals of size $H^{(L-i)}$ at each layer $i \in [1, L]$. TreeCal produces labels for forecasts as the set of layer intervals that contain the forecast.

The error is characterized with a calibration error similar to a proper scoring rule, which measures distance from a curve using a Bregman divergence. A Bregman divergence is calculated for each layer of the tree, and the sum of the divergences forms a telescoping sum down its layers. This removes all divergences except for the top layer's.

With the assumption that there is a regularization function $R$ that is strongly convex and $\rho$-Lipschitz, Fishelson et al. (2025) can prove their bounds on $T$. Note, however, that this does not give insight into $R$ itself.

Poster Session 1: 11am - 2pm

ID | Title | Link
806 | NestedFP: High-Performance, Memory-Efficient Dual-Precision Floating Point Support for LLMs | https://arxiv.org/abs/2506.02024
2112 | Caption This, Reason That: VLMs Caught in the Middle | https://arxiv.org/abs/2505.21538
2204 | PINN Balls: Scaling Second-Order Methods for PINNs with Domain Decomposition and Adaptive Sampling | https://arxiv.org/abs/2510.21262
24xx | Structural Causal Bandits under Markov Equivalence | https://openreview.net/pdf?id=IL1wvzOgqD
2513 | Structural Causal Bandits Under Markov Equivalence | https://openreview.net/forum?id=3aFwsZxM5H
27xx | Connecting Jensen–Shannon and Kullback–Leibler Divergences: A New Bound for Representation Learning | https://arxiv.org/abs/2510.20644
2706 | Fantastic Features and Where to Find Them: A Probing Method to Combine Features from Multiple Foundation Models | https://bramtoula.github.io/combo/
2714 | ProteinConformers: Benchmark Dataset for Simulating Protein Conformational Landscape Diversity and Plausibility | https://openreview.net/forum?id=GClrNUTqly
2913 | The Complexity of Symmetric Equilibria in Min-Max Optimization and Team Zero-Sum Games | https://arxiv.org/abs/2502.08519
3011 | Optimality and NP-Hardness of Transformers in Learning Markovian Dynamical Functions | https://arxiv.org/abs/2510.18638
3513 | Integration Matters for Learning PDEs with Backwards SDEs | https://neurips.cc/virtual/2025/loc/san-diego/poster/116412 (SPOTLIGHT)
4004 | Token embeddings violate the manifold hypothesis | https://arxiv.org/abs/2504.01002
4207 | Set-LLM: A Permutation-Invariant LLM | https://arxiv.org/abs/2505.15433
4507 | ML4CO-Bench-101 | https://github.com/Thinklab-SJTU/ML4CO-Bench-101
5003 | End-to-End Vision Tokenizer Tuning | https://arxiv.org/abs/2505.10562
5105 | DitHub: A Modular Framework for Incremental Open-Vocabulary Object Detection | https://aimagelab.github.io/DitHub/
5314 | WearVQA: A Visual Question Answering Benchmark for Wearables in Egocentric Authentic Real-world Scenarios | https://openreview.net/forum?id=s5p9ByKN1j

Poster Session 2: 4:30pm - 7:30pm

ID | Title | Link
108 | CovMatch: Cross-Covariance Guided Multimodal Dataset Distillation with Trainable Text Encoder | https://www.arxiv.org/abs/2510.18583
502 | Bayesian Concept Bottleneck Models with LLM Priors | https://arxiv.org/abs/2410.15555
904 | On the $O(\frac{\sqrt{d}}{K^{1/4}})$ Convergence Rate of AdamW Measured by $\ell_1$ Norm | https://arxiv.org/abs/2505.11840
1016 | Beyond Scalars: Concept-Based Alignment Analysis in Vision Transformers | https://openreview.net/forum?id=l1DDTSqFq7
3012 | Regularized least squares learning with heavy-tailed noise is minimax optimal | https://openreview.net/pdf?id=TjQP5hc3WC
3018 | Robustness in Both Domains: CLIP Needs a Robust Text Encoder | https://github.com/LIONS-EPFL/LEAF
4011 | Do Language Models Use Their Depth Efficiently? | https://arxiv.org/abs/2505.13898
4610 | Multi-Kernel Correlation-Attention Vision Transformer for Enhanced Contextual Understanding and Multi-Scale Integration | https://openreview.net/forum?id=64WeVllQjq
4911 | Differentiable Hierarchical Visual Tokenization | https://github.com/dsb-ifi/dHT/ (SPOTLIGHT)

Thursday, Dec. 4

Keynote 3: Yejin Choi, Stanford

Choi began the keynote by summarizing the various spectres that haunt our field today: entropy, the fruitlessness of further brute-force upscaling, and finally mode collapse, wherein model performance tanks when models are trained on AI-generated output.

Prismatic Synthesis

Princeton has introduced the Vendi Score, but the original Vendi Score did not correlate with out-of-distribution generalization. For that use case, Choi devised the G-Vendi score.

Further, she introduced Prismatic Synthesis. Sample data are fed to a DeepSeek R1-32B model, which generates synthetic data. The data are then filtered for diversity and quality (about 70% of it). The results are further refined by another R1 model.

The approach outperforms baselines while using a 20x smaller model and has no reliance on supervision; that is, it does not need human labelers.

In conclusion, Choi states, reasoning requires data that transcends the internet. Synthetic data can rescue us, and systematic diversification of the data improves it.

RL as Pre-Training (RLP)

Next, she introduced RL as Pre-Training. The current paradigm is next-token prediction (NTP). This is learning by naive guessing, she posits, and it can be greatly improved through the introduction of RL into the NTP pipeline. Whereas NTP has no explicit reasoning step, RLP introduces an explicit reasoning step into pre-training.

RLP learns a Thought Policy and compares the entropy between a no-thought policy and the Thought Policy. The information gain (entropy difference) from the training step with the Thought Policy compared to the one without serves as the reward function.
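Choi described this information-gain reward only at a high level; as I understood it, and in my own notation, the per-token reward is roughly the log-likelihood improvement that the thought buys:

$$r_t \;\approx\; \log p_\theta\!\left(x_t \mid x_{<t},\, c_t\right) \;-\; \log p\!\left(x_t \mid x_{<t}\right),$$

where $c_t$ is the sampled thought for position $t$: a positive reward means the thought made the observed next token more likely than predicting it without a thought.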

Empirical evaluation asked two questions:

  • Can RLP improve reasoning ability of a base model?
  • Can RLP match the performance of a fully trained model, even with 200B fewer parameters?

Answers:

  • RLP outperforms the baseline Qwen3-1.7B-Base by 19% and CPT by 17% across math and science benchmarks. After SFT, RLP compounds its advantage over the baselines by several percentage points.
  • RLP continues to outperform the fully-trained Qwen with 200B fewer params, averaging a 35% increase.

Overall, the results gave Choi's team pause and made her question whether the very foundations of pre-training and NTP should be revisited. She pointed out that her team is neither the first nor the last to ask this question.

Concluding Remarks

AI should be democratized, says Choi. It should be of humans, by humans, for humans:

  • Of Humans: Belong and originate from human beings and reflect human values
  • By Humans: AI should be created and moderated by the people, not by large corporations or governments
  • For Humans: AI should serve humans, not other AI or the other way around.

OpenThought3 has approached the popularization of effortful RL. Instead of riding trends and cutting papers into thin slivers to get published, Choi says that researchers should focus on earthmoving publications following unconventional methods.

Choi summarized the "Art of Artificial Reasoning." It is all about the data, she says. At first, the internet was cannibalized; then humans were made to write further data; and when this was no longer sufficient for brute-force scaling, we moved on to synthetic datasets.

The solution, she argues, linking back to her own work, is to synthesize in-distribution data through cannibalization of OOD data. Choi justifies the approach by pointing out that the universe of knowledge is much greater than the available writings on the internet, or than what human annotators are capable of reproducing manually. Other synthetic approaches are really no stronger (although they are faster) than having humans write more data, since they cannot produce data from a different distribution than the training data.

Finally, she posits a few open questions:

  • Are there new theories of intelligence, knowledge, and reasoning that will allow humans to access the broader universe of knowledge themselves?
  • Humans do not have a context window size of 1M tokens: is this a limitation or a good thing?

Poster Session 3

ID | Title | Link
612 | Neurosymbolic Diffusion Models | https://arxiv.org/abs/2505.13138
706 | Multiclass Loss Geometry Matters for Generalization of Gradient in Separable Classification | https://arxiv.org/abs/2505.22359
902 | Q3R: Quadratic Reweighted Rank Regularizer for Effective Low-Rank Training | https://arxiv.org/abs/2511.04485
907 | Differentiable Sparsity via D-Gating: Simple and Versatile Structured Penalization | https://arxiv.org/abs/2509.23898 (SPOTLIGHT)
1109 | ElliCE: Efficient and Provably Robust Algorithmic Recourse via the Rashomon Sets | https://neurips.cc/virtual/2025/loc/san-diego/poster/118970
1412 | Utility Engineering: Analyzing and Controlling Emergent Value Systems in AI | https://arxiv.org/abs/2502.08640
2215 | A physics-preserved transfer learning method for differential equations | https://arxiv.org/abs/2505.01281
2512 | Investigating Hallucinations of Time Series Foundation Models Through Signal Subspace Analysis | https://neurips.cc/virtual/2025/loc/san-diego/poster/115394
4211 | Adaptive LoRA Experts Allocation and Selection for Federated Fine-Tuning | https://arxiv.org/abs/2509.15087
xxxx | A Compressive-Expressive Communication Framework for Compositional Representations | https://arxiv.org/abs/2501.19182

Poster Session 4

ID | Title | Link
3005 | Assessing the Quality of Denoising Diffusion Models in Wasserstein Distance | https://arxiv.org/abs/2506.09681
3909 | Structured Linear CDEs: Maximally Expressive and Parallel-in-Time Sequence Models | https://arxiv.org/abs/2505.17761
4015 | Learning to Add, Multiply, and Execute Algorithmic Instructions Exactly with Neural Networks | https://arxiv.org/abs/2502.16763

Friday, Dec. 5

Oral 5A: EvoLM: In Search of the Lost Language Model Training Dynamics - Zhenting Qi, Harvard

Qi began by introducing the three types of training in modern LMs. Training consists of pretraining, mid-training (domain-specific objectives), and post-training (fine-tuning).

Open Questions in LM Training

I sat for Oral 5A in the F-H Exhibition Halls in wait for the fifth poster session. With bated breath, I anticipated the latest and greatest in LLM research to be presented on the biggest stage that the machine learning community has to offer.

A young man stood center-stage for a long time, audio technicians and event organizers fiddling with the audio equipment to make sure everything was in working order. Nearer to me, a man stood atop a smaller stage behind the camera used to broadcast the festivities to the conference website and to the 12 projector screens arranged in a 2x6 matrix throughout the conjoined exhibit halls.

In pre-training and post-training, there is a host of unanswered questions: "Does scaling up pretraining always improve post-training?", "Does RL inject new capabilities beyond the pretrained model?", "How do we allocate compute between SFT and RL?", etc.

While it is not possible to answer every one of these questions, Qi says, his team introduces EvoLM, an end-to-end pipeline and model suite for training and evaluation. It has pretraining, CPT, SFT, and RL training modules and encapsulates evaluation as well. The model itself is a billion-scale (1B or 4B) Llama architecture.

Scaling up Training

Pretraining is scaled from 20B tokens to 320B tokens. Mid-training is scaled up as well; CPT mid-training is performed with some 50B tokens. While CPT alone would normally trigger forgetting, replaying pretraining data can mitigate this problem, Qi says. Then, as CPT data increase, post-trained model performance improves.

SFT post-training compute is scaled from 1 epoch to 32 epochs on a dataset of 50k examples. SFT-ed models show improved performance with diminishing returns on OOD tasks with RL.

Finally, RL is scaled up from 1 to 32 epochs, and it improves performance on both ID and OOD tasks, but with diminishing returns. RL further increases the chance of sampling high-quality trajectories.

Post-training is scaled up for both SFT and RL with a further 100k examples.

Conclusions

Qi et al. release 100+ LMs trained at scale with their pipeline.

Their future plans include training larger models and multimodal architectures. They also plan to explore more diverse post-training objectives.

Large Language Diffusion Models - Shen Nie, Renmin University of China

Nie introduces Large Language Diffusion with mAsking (LLaDA). It features unsupervised pretraining with masked tokens, supervised finetuning through response masking, and language generation by predicting masked tokens in parallel rather than linear next token prediction.

Nie compares generative models to a high-dimensional random distribution. In this way, the output of such models is a sample from that distribution space. Further, the properties of LLMs theoretically originate from the generative paradigm, not exclusively from the autoregressive framework.

Currently, LLMs perform left-to-right next-token prediction, token by token. However, the intuition that language contains redundancies suggests we can parallelize the processing of tokens. Diffusion models are naturally capable of bidirectional reasoning, but it was an open question at the outset how to model text with a diffusion model. The answer was a Masked Diffusion Model: with probability $t \in [0,1]$, each token in the sequence is replaced with a special mask token.

In training, the diffusion model (a transformer) accepts a sequence with arbitrarily masked tokens and must predict the true tokens behind the masks. The model is scored with the CE loss on the masked positions.
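Nie's slides stayed at the diagram level; a minimal PyTorch-style sketch of one such training step (random masking ratio, cross-entropy only on masked positions) could look like the following. `model` stands in for any bidirectional transformer over token IDs, the names are mine rather than LLaDA's, and the paper's full objective includes additional reweighting omitted in this sketch.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # assumed id reserved for the special [MASK] token

def masked_diffusion_step(model, tokens):
    """One masked-diffusion training step: mask each token with probability t,
    then score cross-entropy only on the masked positions."""
    batch, length = tokens.shape
    t = torch.rand(batch, 1)                                  # per-sequence masking ratio in [0, 1]
    is_masked = torch.rand(batch, length) < t                 # which positions to corrupt
    corrupted = torch.where(is_masked, torch.full_like(tokens, MASK_ID), tokens)

    logits = model(corrupted)                                 # (batch, length, vocab), bidirectional
    return F.cross_entropy(logits[is_masked], tokens[is_masked])
```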

Naturally, this creates a relationship with a famous related work: BERT. BERT uses fixed-ratio masked prediction, but such non-generative models lack an explicit probability representation.

Nie demonstrates the scalability of LLaDA to large context windows and shows examples of its capabilities as a chatbot interacting with a user. Whereas autoregressive models fall into the so-called reversal curse, wherein they fail to learn equivalence relationships in both directions (they can reproduce 'A is B' but not 'B is A'), LLaDA breaks those shackles.

While promising, Nie cautions, there are still challenges for the diffusion language model: supervision is sparse, the independence of masked tokens from the visible-token distribution is not provable, and system optimizations must be made as well.

Nie calls back to Sutton's Bitter Lesson, featuring the quote "We have to learn the bitter lesson that building in how we think we think does not work in the long run." With respect to LLaDA, this lesson teaches us that LMs should not necessarily be front-to-back, that is to say sequential. LLaDA is a first step in this direction.

Sejnowski-Hinton Award 2025: Random synaptic feedback weights support error backpropagation for deep learning - Timothy Lillicrap, University College London & Daniel Cownden, University of St. Andrews

At NeurIPS 2025, the 2025 Sejnowski-Hinton Award for papers that contribute to computational theories of the brain using AI insights was awarded to Timothy Lillicrap, Daniel Cownden, Douglas Tweed, and Colin Akerman on the basis of their work Random synaptic feedback weights support error backpropagation for deep learning. Lillicrap et al. (2016) observed that brains learn quickly, computing complex behaviors in what should be trillions of parameters. The question for AI research and neuroscience is how.

Backpropagation in the Brain

Neuroscientists have outlined how learning occurs on the small scale, at the level of individual neurons, but this understanding does not scale up to the level of the whole organism. In ML, it is known that backpropagation is what scales learning up. The authors questioned whether the same was true for real brains, and it turns out that Crick (1989), Grossberg (1987), and other authors in the 80s had already investigated the idea. They called biological backpropagation unrealistic for many reasons. For one, the authors summarize, Crick (1989) questioned what goal biological backpropagation would be optimizing toward. Grossberg (1987) conceived of the weight transport problem, which asks how weights, such as the ones necessary for backpropagation, could be transported in the brain.

The Backward Pass

In the 2010s, the team was "stoked" to understand the relationship between backpropagation and biological neural networks. They benchmarked several training methods with a two-layer NN and noticed that one called Feedback Alignment (FA) converged as quickly as, or more quickly than, backpropagation. This comparison between the two methods held in benchmarks with larger models and more complex data and tasks.

FA works, they prove, in the sense that it will converge, but Lillicrap et al. admit that there is no guarantee on the speed of convergence despite the empirical results. Further, there are cases in which it will not work (will not converge quickly).
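To make the contrast with backpropagation concrete, here is a small numpy sketch of feedback alignment on a two-layer regression network: the backward pass uses a fixed random matrix B in place of the transpose of the forward weights, so no weight transport is needed. This is my own illustration under toy assumptions, not the authors' benchmark code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 20, 64, 5

W1 = rng.normal(0.0, 0.1, (n_hidden, n_in))   # forward weights, layer 1
W2 = rng.normal(0.0, 0.1, (n_out, n_hidden))  # forward weights, layer 2
B  = rng.normal(0.0, 0.1, (n_hidden, n_out))  # FIXED random feedback weights, never trained

W_true = rng.normal(0.0, 0.5, (n_out, n_in))  # toy linear teacher providing targets
lr = 0.02

for step in range(10_000):
    x = rng.normal(0.0, 1.0, (n_in, 1))
    y = W_true @ x                             # regression target

    h = np.tanh(W1 @ x)                        # forward pass
    y_hat = W2 @ h
    e = y_hat - y                              # output error

    # Backprop would use W2.T @ e here; feedback alignment uses the fixed random B.
    delta_h = (B @ e) * (1.0 - h**2)

    W2 -= lr * e @ h.T                         # both layers still receive local updates
    W1 -= lr * delta_h @ x.T

    if step % 2000 == 0:
        print(step, float(np.mean(e**2)))      # error falls without weight transport
```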

Direct Feedback Alignment (DFA)

Further, DFA extends the FA training paradigm by broadcasting the error directly back to every layer of the network, all the way to the first ones.

Keynote: Demystifying Depth: Principles of Learning in Deep Neural Networks - Andrew Saxe, University College London Theory of Learning Lab

For all their breathtaking and seemingly miraculous abilities, Saxe points out that AIs and LLMs are still practically black boxes. We understand very little of the inner workings behind them.

The goal of deep learning theory is to explain how we go from raw data to strong results with large amounts of compute, without going through the process of computer simulation (i.e., without running experiments directly).

Since tackling physical problems explicitly is very difficult, so-called surrogate models are necessary to understand the problem at some level of abstraction we can reason about. For example, whereas many gas molecules push against the walls of a box and create pressure within it, the equations for calculating that pressure are not written in terms of the individual gas molecules. Saxe's contribution is the same, but for deep learning.

Deep Linear Neural Network

Deep Linear Neural Networks rephrase the problem of gradient descent. The MSE loss becomes nonconvex, and the equation of gradient flow dynamics (gradient descent) is altered. However, thanks to the linearity, the error surface can be visualized in 2D. As it turns out, shallow networks converge very quickly whereas deeper networks do so slowly. The simplicity of the deep linear NN equations actually allows for an analytical solution, and this analysis bears out the empirical data.

Training speed is shown to be $O(1/(s\,b_0^D))$, where $s$ is the input-output correlation, $b_0$ is the initial layer weight, and $D$ is the depth. Further manipulation of the deep linear equations reproduces theoretically what empirical results have told us about the relationship between weight initialization and training speed: random initialization creates an exponential dependence on the depth, whereas modern approaches show that convergence is $O(1)$ with respect to depth.
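The qualitative claim is easy to check numerically. The toy below (my sketch, not Saxe's code) trains a depth-D chain of scalar weights by gradient descent to fit a target input-output correlation s, and counts the steps needed; for small initial weights b0, the step count grows rapidly with depth.

```python
import numpy as np

def steps_to_learn(depth, s=1.0, b0=0.25, lr=0.01, tol=0.05, max_steps=200_000):
    """Gradient descent on a product-of-scalars 'deep linear net' fitting target s."""
    w = np.full(depth, b0)
    for step in range(max_steps):
        y = np.prod(w)                          # network output for unit input
        err = y - s
        if abs(err) < tol:
            return step
        # d/dw_i of 0.5*(prod(w) - s)^2 is err * prod of the other weights
        grad = err * np.array([np.prod(np.delete(w, i)) for i in range(depth)])
        w -= lr * grad
    return max_steps

for depth in (1, 2, 4, 6):
    print(depth, steps_to_learn(depth))         # steps grow quickly with depth for small b0
```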

SVD change of variables

An SVD change of variables in the Deep Linear NN equations produces a hierarchy of saddle points, wherein said points surround local maxima in a pyramidal shape. The SVD further allows us to understand how a DLNN will behave for some "world" without actually having to train the model.

PCR

Work with linear attention reproduces the hierarchy and shows that such models implement PCR (principal component regression) in context. This is a known algorithm, and it provides an example of a training surface on which we know what is happening at all points during training.

Nonlinear networks pose the final challenge before we can apply Saxe's work to real problems. Thanks to gating, nonlinear deep NNs are effectively compositions of linear networks, and this is how Saxe models them. However, he admits that the understanding is only approximate. There is also the phenomenon of neural race conditions, in which one path of the nonlinear network comes to dominate over time during training. Paths that learn faster (minimize loss faster) will dominate, and losing pathways do not matter in the long run. This sparsifies the network that theoreticians need to consider to understand training behavior.

As an example, Saxe shows an application of his new theory to multilingual translation, wherein samples in any of a set of languages may be translated to another language in the set. There are many ways to connect the subnetworks that represent the model for each language pair. Following the results stated previously, the quickest training and best generalization fall out of a network that shares neurons across every language pair. Further, his work permits us to quantify how much of the available data (~40%) must be trained on before the model generalizes from ID tasks to OOD.

Poster Session 5

ID | Title | Link
712 | Adaptive Riemannian ADMM for Nonsmooth Optimization: Optimal Complexity Without Smoothing | https://arxiv.org/abs/2510.18617
714 | Finite-Time Analysis of Stochastic Nonconvex Nonsmooth Optimization on the Riemannian Manifolds | https://arxiv.org/abs/2510.21468
2711 | DreamPRM: Domain-Reweighted Process Reward Model for Multimodal Reasoning | https://arxiv.org/abs/2505.20241
2807 | Universal Sequence Preconditioning | https://arxiv.org/abs/2502.06545
3011 | On the Necessity of Adaptive Regularisation: Optimal Anytime Online Learning on $\ell_p$ Balls | https://arxiv.org/abs/2506.19752
xxxx | Sinusoidal Initialization, Time for a New Start | https://arxiv.org/abs/2505.12909

Saturday, Dec. 6

ML for Physical Science Workshop

Continual Learning for Particle Accelerators - Malachi Schram, Jefferson Lab, US DOE

Schram first gave an overview of his work, beginning with the motivation. His team would like to incorporate ML into workflows for particle accelerators at Oak Ridge. In particular, this work discusses efforts to apply ML to automate the detection of faults in the beam pulses responsible for supplying a controlled stream of particles to the accelerator.

The data are time series with a flow rate of 100 MB/s. The Differential Current Monitor (DCM) monitors the beam pulses, which are fired with brief pauses between them. The goal is to understand whether the waveform of the pauses changes in response to a beam fault and, if so, how to leverage AI/ML to detect the fault (classify the status). A Siamese neural network (SNN) was developed and later augmented with a Gaussian process approximation. The ROC curve shows a true fault detection rate of 60%.

Data Drift

Over time, the character of the data will change in response to measurable parameters in the system; the beam, for instance, is tuned continuously. Data drift may also occur due to non-measurable parameters, unexpectedly and unwelcomely. Changes in the hardware associated with side effects from testing, such as the heating of components, can affect the speed, timing, and strength of the beam. Thus, a model that can classify the beam status in the presence of such immeasurable data drift is necessary. Conditional Siamese NNs (CSNNs) are employed and compared to the traditional approach. The CSNN prevails both in terms of accuracy and time, because conditional learning has been shown to maintain performance against the effects of drifting data.

Continual Learning Experiment

Instrument data were collected and segmented into train and test sets for continual learning under a variety of beam configurations. CL addresses the drift due to known changes, but there is catastrophic forgetting, and traditional training actually shows decaying performance in the long run. Adding Growing Selective Replay, wherein a set of remembered replays is kept while everything else is discarded, fixes the decay problem.
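The talk described Growing Selective Replay only at a high level. The sketch below shows the general shape of such a scheme under my own assumptions: keep a small, growing memory of selected past samples and mix it into each new training batch. The selection rule is a random placeholder, since the talk did not specify how samples are chosen.

```python
import random

class GrowingSelectiveReplay:
    """Toy continual-learning buffer: remember a selected subset of past samples."""
    def __init__(self, keep_per_task=100, seed=0):
        self.memory = []                  # samples retained from earlier configurations
        self.keep_per_task = keep_per_task
        self.rng = random.Random(seed)

    def remember(self, task_samples):
        # Placeholder selection rule (random); the actual criterion was not given.
        kept = self.rng.sample(task_samples, min(self.keep_per_task, len(task_samples)))
        self.memory.extend(kept)          # the buffer grows with each new beam configuration

    def training_batch(self, new_samples, replay_fraction=0.5):
        n_replay = int(len(new_samples) * replay_fraction)
        replay = self.rng.sample(self.memory, min(n_replay, len(self.memory)))
        return new_samples + replay       # mix old and new to counter forgetting
```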

Summary

AI/ML is more effective than traditional techniques for detecting beam faults, UQ methods have been introduced for OOD data, and CL counteracts the catastrophic forgetting shown in the traditional training regime.

Sunday, Dec. 7

The PokéAgent Challenge: Competitive and Long-Context Learning at Scale

Kaggle Game Arena - MinMin Chen, Google DeepMind

Chen introduces the concept of Jagged Intelligence: models have excelled at complex problem solving, but they are still capable of nonsensical errors (blunders) in various games, even ones as simple as tic-tac-toe.

Further, static benchmarks are becoming oversaturated and are contaminating the training data. Thus, dynamic, competitive arenas that rank models by human preference, agentic performance, or games are becoming commonplace.

Games challenge agents not just in problem solving ability but in their ability to form a theory of mind for their opponent and to plan in the long term.

Game Arena

Chen provides an overview of the arena. There is a game environment in which the game(s) take place. An agent harness builds prompts and parses agentic actions, acting as a mediator between the models and the game environments. An IO loop is established, and an external visualizer and leaderboard provide a UI for displaying game information and reasoning traces, and a ranking of agents, respectively.

Turn-based classic board games and card games were the first to be adapted to the arena, using OpenSpiel, an open-source framework for RL and games with 100+ supported games.

The prompt builder inserts several arguments into a general prompt that asks the model to play a game, explains the game, lists the legal moves from the current position, and prompts the model to make its next move. Chen points out that we can also test models on move legality by removing the legal-moves list. Experiments with the latter configuration reveal that even the strongest models sometimes struggle to produce a legal move, showing an illegal-move rate of 17%. Self-consistency and re-thinking have been used to cut this down to 0%.
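Chen showed the prompt only on a slide; the function below is a rough reconstruction of the pattern she described (game rules, current position, optionally the legal-move list), with all wording and names my own rather than Game Arena's actual template.

```python
def build_game_prompt(game_name, rules, position, legal_moves=None):
    """Assemble a harness prompt; omit `legal_moves` to also test move legality."""
    parts = [
        f"You are playing {game_name}.",
        f"Rules: {rules}",
        f"Current position: {position}",
    ]
    if legal_moves is not None:                  # dropping this section forces the model
        parts.append("Legal moves: " + ", ".join(legal_moves))  # to reason about legality itself
    parts.append("Reply with your next move.")
    return "\n".join(parts)

print(build_game_prompt("chess", "standard FIDE rules", "FEN: starting position",
                        legal_moves=["e2e4", "d2d4", "g1f3"]))
```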

The Arena was the backbone of the chess tournament benchmark. A single-elimination knockout tournament was instated with Burstein seeding. Any LLM that continued to produce illegal moves after three re-think steps was eliminated. A Swiss-system event was also implemented, in which every model faced every other model. OpenAI's o3 won the tournament, with Grok 4 coming second and Gemini 2.5 Pro third.

Discussion

Some people may find the chess benchmark regressive, since we have had models that play games at a high level for a long time. However, Chen explains, such perspectives overlook the fact that the models benchmarked are not specifically designed to play chess (or other games), whereas the models of decades ago could do nothing but play chess. It is a triumph of LLMs' generalizability.

On the other hand, models are sensitive to input format and appear to be stronger at openings (which are memorized) and weaker in the mid- and endgame. Other challenges include high variance, especially for games without perfect information, and inference cost ballooning as the game continues in the long run.

Lightning Talks

The lightning talks were given by winners and high-performing teams from a pair of agentic Pokemon competitions that took place before NeurIPS. The winners of the two competitions (tracks) were given awards. The competitions were as follows:

  • Track 1: Best-of-99 single-elimination tournament of OU Pokemon battles in Gen 1 and Gen 9
  • Track 2: Speedrunning competition for Pokemon Emerald. Agents race to beat the first gym leader, Roxanne.

Curriculum Learning for Pokemon Battle (Track 1) - Qiao Wang, Google

There is a massive state space for Pokemon battles, so the team takes a two-phase approach.

  • Phase 1: Fine-tuning. Qabra is fine-tuned on 65K games from various generations.
  • Phase 2: The model learns on an additional 100K games with a coach.

The system achieves a 0.8 win ratio over the coach and took 2nd place in the competition. Code is available on GitHub.

PA Agent (Track 1, Gen 1 OU winner) - Xianwei Shi, PingAn Life Insurance

The high combination space for teambuilding is a particular challenge that compounds the complexity of Pokemon battles on top of actually having to play the game.

The team's solution starts by sampling teams from Smogon forums and combines RL + Transformers to iterate over inter-model battles, starting with the sampled teams and human battle data as a baseline.

The training begins with all human battle data and gradually decreases the proportion of human data as training continues, until all combatants are agents.

Foul Play: A Competitive Pokemon Showdown (Track 1, Gen 9 OU Winner) - Patrick Mariglia

Foul Play is a Monte Carlo method for playing Pokemon. Mariglia starts by introducing the main information problem with Pokemon as an AI problem: despite the fact that the model has just five legal moves at any point in the game, each one of these moves can give rise to many child game states, and it is rarely possible to know which of the states the game will land in until after the decision has been made and the turn has played out.

His team used the PokeEngine battle engine, which performs Monte Carlo search to efficiently explore a simplified tree of possible game states. The tree is "simplified" in the sense that the 16 different damage rolls that would normally have to be considered are ignored.

Information is king in the game of Pokemon, says Mariglia (this is true from personal experience). Playing well is about understanding the information given to you and performing deductions on the available data. For instance, if a Landorus-Therian outspeeds a Garchomp (does its move for the turn first), then it must be holding a Choice Scarf, which boosts speed. The Choice Scarf does not announce itself, but its presence can be inferred from the fact that a nominally slower Pokemon outsped a faster one.

For cases where something is not certain, Set Prediction is used most of the time. Pokemon Showdown hosts battle data that Foul Play accesses, and it also analyzes the game state. Using the probabilities from Showdown's databanks, the model builds a decision tree from the game state, performs Monte Carlo search over the possible states, and picks a move.
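Mariglia's description maps onto a fairly standard sample-then-search loop. The sketch below (mine, not Foul Play's code) shows the shape of it: sample concrete opponent sets from usage statistics, score each candidate move against the sampled states, and pick the move that does best on average.

```python
import random
from collections import defaultdict

def choose_move(my_moves, sample_opponent_set, evaluate, n_samples=50):
    """Pick the move with the best average outcome across sampled hidden information.

    sample_opponent_set(): draws one concrete opponent set from usage statistics.
    evaluate(move, opponent_set): search/rollout score of `move` against that set.
    """
    totals = defaultdict(float)
    for _ in range(n_samples):
        opponent_set = sample_opponent_set()          # resolve the hidden information
        for move in my_moves:
            totals[move] += evaluate(move, opponent_set)
    return max(my_moves, key=lambda m: totals[m])

# Toy usage with made-up numbers: "protect" scores best across sampled sets.
moves = ["earthquake", "protect", "switch"]
best = choose_move(
    moves,
    sample_opponent_set=lambda: random.choice(["choice-scarf", "assault-vest"]),
    evaluate=lambda m, s: {"earthquake": 0.4, "protect": 0.7, "switch": 0.5}[m],
)
print(best)
```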

The code and documentation are available on GitHub.

Hamburg Pokerunners (Track 2 Second Place) - Arian Urdu, Univ. Hamburg

The team's solution began with DreamerV3, which learns a model of the game world and trains an actor-critic in imagination. However, training the world model was prohibitively expensive.

The final solution relies on recurrent PPO. It samples the game world, feeds embeddings of the world to an LSTM, and learns a DNN to navigate. Sampling the world, rather than having to learn it ahead of time, simplifies training and the world representation.

Heatz (Track 2 Winner) - Junik Bae

This was the only system able to beat Roxanne in less than an hour (40:14). The approach prompts an LLM to write a scripted policy for navigation subtasks and evaluates it in the environment. An RL agent begins with the scripted policy and undergoes RL training to improve on the LLM's original "intuition."

The code is available on GitHub.

It ended with a panel, but I had to leave for my presentation at the Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling workshop.

Christopher standing by the portrait-style OPTiCAL poster at the LLMEval workshop at NeurIPS, December 7, 2025.

References

  1. Chase, Z., Hanneke, S., Moran, S., and Shafer, J. Optimal Mistake Bounds for Transductive Online Learning. Advances in Neural Information Processing Systems 38. San Diego, CA. 2025.
  2. Fishelson, M., Golowich, N., Mohri, M., and Schneider, J. High-Dimensional Calibration from Swap Regret. Advances in Neural Information Processing Systems 38. San Diego, CA. 2025.
  3. Lillicrap, T., Cownden, D., Tweed, D. B., and Akerman, C. J. Random synaptic feedback weights support error backpropagation for deep learning. Nature Communications, 7(13276). 2016.
  4. Crick, F. The Recent Excitement About Neural Networks. Nature, 337, 129-132. 1989.
  5. Grossberg, S. Competitive Learning: From Interactive Activation to Adaptive Resonance. Cognitive Science, 11, 23-63. 1987.

For more information about our research, return to our homepage: ufdatastudio.com.
