This post will highlight academic contributions that I saw at the 2026 Conference on Computer Vision and Pattern Recognition (CVPR26). I attended CVPR to view new research there on my dissertation area, comic understanding, and on a secondary research area I have been involved with for some time, reasoning benchmarks for Vision Language Models (VLMs) and other types of Multimodal LLM (MLLM). Even before I registered for or attended the conference, I had picked out a number of talks, posters and oral presentations and created an itinerary of them for the week revolving around these two subjects. As a result, this post provides a focused look at work representing these twin research areas in CVPR26.
Dr. Chuang Gan began his talk with a video of his daughter playing a game in which she placed differently-sized disks into spaces that were fitted specifically for the individual disks. She completed the game without any special training. His point was that his daughter, as an agent, had some preconditioning for the task that allowed her to make a plan and complete it, although she had to revise the plan whenever she placed the wrong disk in the wrong hole.
We are still very far from this sort of general intelligence in embodied agents such as robots, Gan said, but his work seeks to close the performance gap through two recent projects: MindJourney and Action Images.
MindJourney - As an example, Gan shows an image of a living room facing an open kitchen and poses the question, "If I sit on the couch and face the chairs, will the kitchen be to my right?" A VLM lacks the ability to imagine the space, as a human would, to answer the question. Thus, MindJourney has the VLM collaborate with a world model at test time to observe the space and gather enough data to answer spatial reasoning questions. Naively, the VLM would simply be allowed to propose steps in the world model, but this would rely on the shaky spatial reasoning capabilities that we are already trying to mitigate. Instead, the VLM and the world model are connected through a spatial beam search that iteratively traverses the space. For several SOTA VLMs, the MindJourney approach outperforms the baseline.
Action Images - World action models in the video modality generalize well, but policy generation does not tend to generalize. Gan proposes transferring the priors learned in video generation to policy generation and explains the Action Images approach his team devised for this purpose. The method relies on Multiview Action Projection, wherein pixel-grounded action images are leveraged to track embodied agents' actions in video space. For a frame of RGB video, the action images method encodes a normal point, action point, and upper point into the red, green and blue channels of the video, respectively. The method is trained with three datasets, DROID, RLBench, and BridgeV2, in a pipeline that leverages video, audio and text encoding with a large video generation backbone. Real-world qualitative results are promising, and quantitative in-domain and zero-shot performance increases significantly on the RLBench test data. Gan also remarks that the model can generalize to out-of-domain tasks.
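To make the idea of a spatial beam search concrete, here is a minimal sketch in Python. It is not the MindJourney implementation: the grid "world model" (`imagine`), the action set, and the scoring function `toy_score` are all hypothetical stand-ins. In the real system, as described in the talk, a learned world model would produce imagined views and the VLM would judge them; here a hand-written score plays that role.

```python
from typing import Callable, List, Tuple

# Toy camera pose: (x, y, heading in degrees). Hypothetical stand-in for a real pose.
Pose = Tuple[int, int, int]

ACTIONS = ("forward", "turn_left", "turn_right")
_STEP = {0: (1, 0), 90: (0, 1), 180: (-1, 0), 270: (0, -1)}


def imagine(pose: Pose, action: str) -> Pose:
    """Toy 'world model': return the pose reached after one egocentric action."""
    x, y, heading = pose
    if action == "forward":
        dx, dy = _STEP[heading]
        return (x + dx, y + dy, heading)
    if action == "turn_left":
        return (x, y, (heading + 90) % 360)
    if action == "turn_right":
        return (x, y, (heading - 90) % 360)
    raise ValueError(f"unknown action: {action}")


def spatial_beam_search(
    start: Pose,
    score_view: Callable[[Pose], float],
    beam_width: int = 2,
    depth: int = 3,
) -> Tuple[float, List[str], Pose]:
    """Keep the `beam_width` best imagined trajectories at each depth.

    `score_view` stands in for the VLM's judgment of how useful a view is.
    Returns (score, actions, pose) for the best view seen at any depth.
    """
    beam: List[Tuple[List[str], Pose]] = [([], start)]
    best = (score_view(start), [], start)
    for _ in range(depth):
        candidates = []
        for actions, pose in beam:
            for action in ACTIONS:
                nxt = imagine(pose, action)
                candidates.append((score_view(nxt), actions + [action], nxt))
        # Stable sort on score only, so ties keep their expansion order.
        candidates.sort(key=lambda c: c[0], reverse=True)
        top = candidates[:beam_width]
        beam = [(acts, pose) for _, acts, pose in top]
        if top[0][0] > best[0]:
            best = top[0]
    return best


def toy_score(pose: Pose) -> float:
    """Hypothetical scorer: prefer standing at (2, 0) while facing heading 90."""
    x, y, heading = pose
    return -(abs(x - 2) + abs(y)) - (0 if heading == 90 else 1)


score, actions, pose = spatial_beam_search((0, 0, 0), toy_score)
print(actions, pose, score)
```

The point of the beam is that the search never commits to a single proposed step: several candidate trajectories survive each round, so one poor suggestion does not derail the exploration.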
We now review oral presentations from the CogVL workshop, beginning with "Can Vision-Language Models Count? A Synthetic Benchmark and Analysis of Attention-Based Interventions" by Sengupta et al.
The authors focus on counting tasks for vision language models, but they research the effect that incidental prompt and image properties, rather than the number or positioning of relevant objects, have on model attention and performance. The results were somewhat surprising. Specifying image features in the prompt aided performance, but models counted objects in the most generic images best. In this blogpost, we explore the problem, the authors’ benchmarking method, what these results mean, and their impact on work at the UF Data Studio.
Counting is an important task in many application domains. When presenting his team’s work, Sengupta used pathology and robotics as two examples. Further, he stated in the presentation, the needs of counters in such tasks often go well beyond the numbers and labels of the relevant objects. Sengupta therefore characterizes a domain gap between specialized counting approaches, citing SAM-based and open-world approaches as examples, and LLMs. The former answer the question of how many instances of some object there are but cannot answer in-depth questions, while an LLM cannot count objects in an image without the aid of a vision module, which makes it a VLM. VLMs would appear well-positioned to bridge the domain gap, but the authors cite poor performance for VLM counters in previous investigations on the topic. The current work seeks to leverage controlled, synthetic data to understand what features of the image-prompt pair drive VLM underperformance.
The authors’ method is very clear. As Sengupta summarized it in his presentation, the authors curate a benchmark of 50 base images that are perturbed with one of four image transformations to measure the effect of visual features on VLM performance. Additionally, the authors evaluate five types of prompts, each meant to probe a cognitive cue at increasing cognitive load (level of description). Three open-source VLMs are benchmarked on the image-prompt pairs. The primary performance metrics are Accuracy and the Mean Relative Count Error (MRCE), which is the mean, over images, of the ratio between the absolute error of the stated object count and the true number of objects. More importantly, however, the authors probe attention over visual tokens using heatmaps and an intersection-over-union-based (IoU-based) quantitative measure.
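The metrics above can be sketched in a few lines of Python. The accuracy and MRCE functions follow the definitions given in the paragraph. The attention measure is only an assumption: the presentation described it as IoU-based, so the sketch thresholds an attention map and compares it with a ground-truth object mask, and the threshold value and function names are hypothetical.

```python
from typing import Sequence


def accuracy(pred: Sequence[int], true: Sequence[int]) -> float:
    """Fraction of images whose stated count exactly matches the true count."""
    if len(pred) != len(true) or not true:
        raise ValueError("pred and true must be non-empty and the same length")
    return sum(p == t for p, t in zip(pred, true)) / len(true)


def mrce(pred: Sequence[int], true: Sequence[int]) -> float:
    """Mean Relative Count Error: mean of |pred - true| / true over all images."""
    if len(pred) != len(true) or not true:
        raise ValueError("pred and true must be non-empty and the same length")
    if any(t <= 0 for t in true):
        raise ValueError("true counts must be positive for a relative error")
    return sum(abs(p - t) / t for p, t in zip(pred, true)) / len(true)


def attention_iou(
    attention: Sequence[Sequence[float]],
    object_mask: Sequence[Sequence[int]],
    threshold: float = 0.5,  # assumed cutoff, not taken from the paper
) -> float:
    """IoU between thresholded attention and a binary object mask (assumed form)."""
    intersection = 0
    union = 0
    for att_row, mask_row in zip(attention, object_mask):
        for a, m in zip(att_row, mask_row):
            attended = a >= threshold
            is_object = bool(m)
            intersection += int(attended and is_object)
            union += int(attended or is_object)
    return intersection / union if union else 0.0


predicted = [3, 5, 8, 10]
actual = [3, 4, 10, 10]
print(accuracy(predicted, actual), mrce(predicted, actual))
```

Accuracy only rewards exact matches, whereas MRCE distinguishes a model that is off by one from a model that is off by ten, which is why the two are reported together.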
While similar to the UF Data Studio’s OPTiCAL and other VLM reasoning benchmarks in many ways, the authors take a new perspective on VLM reasoning benchmarks and demonstrate very clearly that VLMs pay attention to irrelevant details, particularly properties of the prompt, that do not aid them in the counting task. This is a novel finding that calls into question existing work, including previous OPTiCAL publications.
The current work explores counting and the effect of image and prompt perturbations on counting performance, but the question of what these perturbations do to performance in other tasks remains largely open. This paper was a workshop submission, so we understand the authors limiting their scope, but a promising direction for future work is the inclusion of other spatial and compositional reasoning tasks. As one would expect from what has already been said, I think it would be interesting to follow a similar procedure using the OPTiCAL data to measure spatial and positional reasoning task performance in an abstract environment. We may well reproduce the finding that unrelated stimuli in the image and prompt have dramatic, unexpected bearings on the models’ output via the attention measure used by Sengupta et al. The novelty of this work will be in the probe of positional and compositional reasoning tasks, and we may investigate mitigation strategies as an additional contribution.
The authors classify existing image-language datasets into two classes, perceptual similarity and semantic similarity. To correlate both types of similarity, the authors manually curate anonymous captions wherein the names of specific objects in semantically similar images are replaced with category labels. Images are grouped together into sets of 532, and all images in a set have the same ground truth anonymous caption.
The group-anonymous caption pairs make up the training data for the image retrieval method relsim. The method retrieves images that are semantically similar to some query image rather than the most similar images according to a quantitative distance metric. The presenter gives several qualitative examples comparing the results from DINO NN and relsim for several image queries.
The workshop began with an address from the chief organizer. He expressed gratitude for our attendance and spoke to the work remaining to be done in the space of generative AI for storytelling. The workshop organizers had accepted half of the submissions and crowned a best paper, and the program included four keynotes and a poster session.
Diffusion models provide impressive generative results, but there is a compute resource gap. There is therefore a need for efficient diffusion models that would enable high-quality generation at lower cost. Diffusion architectures typically consist of tokenizers, VAEs, and the diffusion model. The bottleneck arises from token redundancy at the tokenization step and from the diffusion architecture's step redundancy. Gu's work focuses on making diffusion models more efficient, first by reducing the number of tokens and then by post-training tokenizer adaptation. Gu et al. achieve a 64x compression ratio with the first approach, but post-training has proved more difficult. The final solution, DC-Gen, consists of three separate fine-tuning phases but results in significant inference-time acceleration. Additionally, DC-Gen "unlocks" 4K image generation and achieves quality comparable to the original diffusion models with a significant speedup.
Moving forward, Gu reports a reduction of the total number of steps in Any-Step Video Diffusion Distillation: AnyFlow. AnyFlow overcomes the key forward training limitation of diffusion models, replacing it with Flow Map Distillation that simulates many steps with a specialized distillation policy. Results on the T2V Causal benchmark, as well as qualitative examples, show that the model achieves SOTA video generation performance with fewer NFEs (diffusion steps).
Here, we summarize a talk that Dr. Kiyoharu Aizawa gave at the Workshop on Generative AI for Storytelling (AISTORY) during the 2026 Conference on Computer Vision and Pattern Recognition (CVPR26). There is no recording available for the talk, but the UF Data Studio’s own Christopher Driggers-Ellis attended the workshop. We summarize the proceedings from his notes.
The talk concerns itself with the titular question “How do Humans Read Manga?” As the procedure and results of his talk will reveal, Aizawa is not asking about user preferences for style or genre. Rather, he investigates the mechanics of reading manga itself, which yields valuable insights for practitioners looking to design manga reader applications or integrate machine learning into the manga reading experience.
Aizawa explained his method at length. His team devised an interface in which the user is shown manga panel-by-panel, rather than page-by-page, and permitted to navigate through the manga with the left and right arrow keys. Panel boundaries are adjusted to make sure that any illustrations that cross panel boundaries are still visible to participants, although this implies leakage from the adjacent panel as a consequence. The primary metric is how much time the participant spends reading each panel. The experiment's dataset is much larger than those of previous user studies, consisting of seven whole volumes out of 109 from the Manga109 dataset, representing diverse genres. The total size of the dataset is over 5000 panels, and each participant viewed them all, with the ability to suspend and resume the experiment baked into the panel viewer. Eighteen Japanese-speaking university students served as participants, and they were compensated with JP¥4000 each.
I recently completed a survey of the comic understanding literature, although it is not yet published, and I found very few user studies or other types of HCI research in the survey. As a result, any user study on questions concerning manga or other types of comic book data has novelty and value to researchers and practitioners alike.
While it would take significant work to reproduce the experiment, I would like to reproduce and extend it for English language manga and American participants. At the same time, as someone who reads manga frequently for my own entertainment, I can confirm the results of the user study on an anecdotal and rational basis. The eye can "read" a textless image much more quickly than text in an image, and I can personally attest to the fact that pages with more text, particularly dense text boxes or speech balloons, take longer to read than pages with no text or with less dense collections of text. I am therefore pleased with Aizawa’s work in the HCI direction, but I recognize the limited scope of the reported experiment. The only metric mentioned in the presentation is the panel reading time, recorded for each panel and participant, and his team draws various conclusions from this data in combination with properties of the panels.
Aizawa performs a user study on the time users spend reading manga, but there are a variety of other questions that his procedure can be adapted or expanded to answer, and there is a massive opportunity in reproducing his experiment with these adjustments and with western audiences. Therefore, I am inspired to repeat his experiment with additional data collection apparatus, metrics and with American participants using English translations of manga. These adjustments will indicate empirically whether the reading time results gleaned by Aizawa for Japanese participants reading in the original Japanese are reproduced for Americans reading English translations of manga. Further, the additional data collection apparatus will enable new insights for researchers and practitioners.
Modern drug discovery follows three routes: Animal Immunization, Library Screening and B-Cell Mining. Each of them is inefficient (Library Screening) or has ethical concerns (Animal Immunization, B-Cell Mining). We would rather perform experiments in silico, says Simon Kohl.
There has been massive progress in this direction in the last five years. AF2 (2020), RFDiffusion and ProteinMPNN (2023), AlphaProteo and AF3 (2024), Latent-X1 (2025) and finally Latent-X2, Chai-2, and JAM (2026) are just a few examples showing the trajectory of computational drug discovery, which the presenter characterizes as exponential growth.
To understand computational drug discovery, we begin with an introduction to proteins, which are the building blocks of life. Proteins are chains of amino acids. There are 20 amino acids, but the chains may fold in a variety of ways. A protein's structure determines its function (rather than its composition of amino acids alone). This is why structural representation in silico was such an important step forward. A single wrong mutation in a protein could totally alter or disrupt its function. This would be, Kohl says, as if changing a single pixel in an image made it undecipherable. Therefore, protein folding methods must be incredibly precise to be effective. This is the key innovation of AlphaFold2 (AF2). CASP14 is a protein folding benchmark used to evaluate computational folding models, and on it AF2 achieves performance comparable to in-lab experiments. Thanks to AF2, Kohl says, we can say that protein structure prediction is "solved."
Whereas AF2 presents a molecular microscope, Kohl continues, we would like to move forward by predicting the structure of unseen proteins with desirable properties. In other words, we would like to make the protein method generative. The core philosophy of Kohl's solution is to treat atoms in the protein like a point cloud and condition on a target to make a binder from scratch. This is RFDiffusion, which starts from raw atomic noise to generate coherent targets and binders.
AF2 was the root node that gave rise to the generative RFDiffusion, and Kohl posits that this is a design trajectory that has played out in ML research throughout the years. ImageNet is his prime example. From a massive image dataset, ImageNet posed metrics and a challenge met by inferential image classifiers, which in turn gave way to image and video generation. The CASP project plays the same role for AF2, which is the inferential forerunner to the generative RFDiffusion.
Next up, the first foundation model Latent-X1 generates unseen features on the fly, and experiments pose the ultimate level of validation for protein folding models. Prompt, Generation, Prediction and finally Wet Lab Validation form a pipeline that demonstrates the models' efficacy. Latent-X1 has been validated for Peptides and mini-binders with success rates for cyclic peptides exceeding . Mini-binders are mid-size proteins with a diverse range of folds. These are useful for gene and cell therapies. Without optimization, Latent-X1 produces binding affinities in the picomolar range, exceeding expectations. Latent-X2 innovates on X1, its predecessor, by enabling the generation of unseen drug classes.
Kohl relates the industry's desire for low liability, manufacturability, and low polyreactivity. Typically, drug manufacturers must pursue these three properties iteratively in what Kohl calls a game of Whack-A-Mole. The Latent series of foundation models produces drugs with these three properties in the first generation. Kohl reports an ex vivo experiment on blood samples from 10 healthy donors that, while not fantastic, promises the relevant properties without additional iterations.
Kohl concludes that his team's frontier models promise a future in which drug discovery research will no longer be hamstrung by a dependence on naturally occurring proteins, for the experiments he details throughout the talk demonstrate the efficacy of the Latent-series models' protein generation capabilities. Additionally, the Latent-Y model provides an agentic interface for the Latent-X2 protein generation model. Open problems include protein dynamics, human drug response, and systems biology. Proteins are not static, as the figures in the presentation would suggest, and their dynamics have been largely ignored. It is not known what would happen if human beings were exposed to any of the drugs that have been discovered by Latent-series models, and there is no clear way of testing any hypotheses. Predictions for the next 10 years include self-improving design, white cell prediction and simulation, personalization, and prediction of human drug response, culminating in the solution of all human disease.
"Biology," Kohl remarked before taking the audience questions, "is finally becoming programmable."
Chow is, according to the speaker introduction, a leader in the quantum computing industry with classical physics and mathematics backgrounds. He led IBM Quantum Experience, the first cloud-available quantum compute cluster. In this talk, Chow spoke about his team at IBM and their quantum computing research.
At IBM, Chow states, researchers deliver in four strategic areas: Silicon, a la Moore's Law; Quantum, solving problems intractable for classical systems at scale; AI; and algorithms, which Chow characterizes as the mathematical bridge that connects the previous three directions to important applications.
Chow presents a roadmap or timeline of quantum computing research at IBM and the deployment strategy. The top, colored portion of the roadmap shows increasing layers of abstraction that have built a quantum computing software stack to enable high-level quantum programming. Ten years ago, quantum devices were first put onto the cloud by IBM. This enabled researchers in remote locations, outside of physics labs, to run quantum programs. It was not until 2023, however, that quantum computers could run programs that are not possible on classical devices. This year, a theoretical result guarantees that quantum computing methods have a provable advantage over all classical methods of solving or approximating problems.
Looking forward, Chow continues, IBM expects to deliver the first fault-tolerant quantum compute cluster at scale, with million operations and broad failure resistance. In 2033 and beyond, Chow says, IBM expects to actually run Shor's algorithm.
Two key prerequisites for bringing quantum compute to the world were developing a platform that scales beyond classical computation and activating a network of institutions with problems that quantum compute could serve. Key computational areas include Hamiltonian Simulation, Optimization, Machine Learning, and Differential Equations.
To enable design for quantum compute clusters, IBM provides the Qiskit platform, an open SDK for quantum computing. IBM also offers the most performant quantum computers on the market with 240,000 CLOPS, 156 or 120 qubits, 2Q-median error rate.
Delivering both the hardware and software capabilities necessary to do work at scale in quantum computers, IBM maintains a network of 341 members comprising some 176 academic institutions and 76 industry partners. Chow is proud to have taken quantum compute out of the physics lab and to have brought it to the developer's fingertips. He is sure that developers can take advantage of the compute IBM provides, but they will require compute clusters with many more qubits and logical gates than are presently available. The Starling and Blue Jay systems on IBM's docket for 2029 and 2033, respectively, will bridge these gaps.
Starling will be a fault-tolerant cluster with an addressable logical qubit space and a universal instruction set over the entire set of logical qubits. Recent publications from IBM have been working toward removing roadblocks to Starling's scalable, fault-tolerant computation. The method relies on a topological homology between a sheet and a torus to connect qubits and on novel methods for packaging bump bonds and other aspects of the chip architecture to make the architecture modular. The X-ray micrographs of existing IBM quantum architectures, including the Heron and Loon, demonstrate the increasing complexity of IBM's quantum chips. Because qubit chips rely on superconductor technologies, they are cryogenically stored. Racks of the qubit infrastructure can be kept in cryogenic storage and cooled to 1.5 millikelvin, which is 1.5 thousandths of a kelvin above absolute zero. Power consumption is measured in megawatts and the compute cluster will require thousands of square feet of floor space, but Chow assures us that the power demands are less than the demands of growing GPU clusters.
Overall, the direction that Chow sees is the harmonization of all three types of compute hardware. Clusters will have the CPU, the GPU and the Qubit Processing Unit (QPU), and a key to future applications will be prudent integration of all three hardware types. In experiments IBM completed in collaboration with RIKEN, a Japanese lab, and the University of Manchester, researchers integrated both the classical architectures and the new QPU to study iron-sulfide molecules and validate the half-Möbius topology of a manufactured molecule, respectively. Neither is technically an example of quantum advantage, since the experiments are possible with classical simulations, but they are examples of verifiable physical experiments that IBM's quantum compute can complete now.
From a machine learning perspective, Chow continues, there is the opportunity to explore quantum feature spaces. Qubits, for example, enable the utilization of an infinite-dimensional Hilbert space. The possibility is known to exist for highly regularized qubit structures. Conversely, AI could help us analyze quantum computing experiments, since data gathered through quantum compute experiments can serve as training data for AI systems.
Finally, Chow moves on to the Reference Architecture. IBM published a paper on such a reference earlier in 2026. This reference architecture serves as the blueprint for the design of scalable quantum computers. Chow delivered numerous remarks about the evolution of this reference architecture, showing the iterations that were made to deliver the final reference. Over time, the distribution of work to the QPU or traditional processing units will be more and more abstracted from the user, and the QPU will be integrated into computing clusters just as the GPU is integrated with the CPU today. To utilize the architecture, IBM is publishing an open-source Qiskit quantum computing SDK, and Chow presents a timeline for the rollout of its features.
The paper is concerned with faithfulness evaluation, that is to say, whether a VLM pays attention to the relevant area(s) of an image when it answers a question. Faithfulness was judged only for correct answers, and the judge model only checks tool outputs. The authors hypothesize that faithfulness is low because reward design is outcome-dominant and sparse, only rewarding or punishing the agent for a correct or incorrect answer.
The authors design Tool-Aware Policy Optimization, in which the model thinks and crops a region of the image, and a judge determines whether the model was faithful in its cropping of the image. The process is iterative and continues until the correct region is highlighted by the model. Reward design focuses on the output accuracy but includes a term for tool use to mitigate potential reward hacking, which is what the method is meant to combat in the first place.
In benchmarks, CodeV performs slightly better than state of the art (SOTA) proprietary VLMs on a variety of VQA datasets. The authors also demonstrate increases in model faithfulness. The authors provide their work as open source but warn that the static rubrics are expensive and can still be reward hacked. Verifiable signals can also be difficult to find for general tasks.
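The difference between an outcome-only reward and a reward with a tool-use term can be illustrated with a small Python sketch. This is not the authors' reward function: the additive form, the weight `tool_weight`, and the boolean judge verdict are assumptions made purely for illustration.

```python
def outcome_only_reward(answer_correct: bool) -> float:
    """Sparse, outcome-dominant reward: only the final answer matters."""
    return 1.0 if answer_correct else 0.0


def tool_aware_reward(
    answer_correct: bool,
    crop_judged_faithful: bool,
    tool_weight: float = 0.3,  # hypothetical weight, not from the paper
) -> float:
    """Outcome reward plus a smaller term for a crop the judge deems faithful."""
    outcome = 1.0 if answer_correct else 0.0
    tool = 1.0 if crop_judged_faithful else 0.0
    return outcome + tool_weight * tool


# Two rollouts reach the right answer; only the first cropped the relevant region.
rollouts = [
    {"answer_correct": True, "crop_judged_faithful": True},
    {"answer_correct": True, "crop_judged_faithful": False},
]
for r in rollouts:
    print(
        outcome_only_reward(r["answer_correct"]),
        tool_aware_reward(r["answer_correct"], r["crop_judged_faithful"]),
    )
```

Under the outcome-only reward, both rollouts are indistinguishable, so a policy has no incentive to look at the right region. The added term separates them while keeping the answer as the dominant signal.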
InfiniBench is a benchmark for visual spatial reasoning with customizable scene complexity baked in. We can think of image or video captions or descriptions of 3D environments as lying on an axis of complexity. Previous works on spatial reasoning, the presenter explains, focus on visual input, spatial questions, and how they impact real-world applications. Models consistently underperform, but why remains an open question. Part of the issue is that there is no way to control for the scene complexity in existing benchmarks, and this is where the new InfiniBench comes in.
The first instinct in constructing InfiniBench was to search existing real-world datasets for images, but these lacked sufficient numbers of complex images. Instead, InfiniBench consists of synthetic data and a 3-stage generation pipeline that creates 3D scenes according to user input.
First, an LLM agent is asked to create a high-level description of the 3D scene that the user asked for. It generates constraints that go to a Layout Optimizer, and a CoT loop closes with a refinement of the constraints. The Layout Optimizer, the second part of the pipeline, uses cluster-based optimization to develop a coherent composition of objects in the scene. Finally, a camera trajectory optimizer forms the third stage.
InfiniBench effectively diagnoses failures in VLM spatial reasoning. VLM accuracy plummets as the number of objects scales. Models always perform better with a bird's-eye view of the scene, and they generally perform better with exocentric perspectives than with egocentric ones.
As with NeurIPS 2025, I attended most if not all of the poster sessions hosted for CVPR. Below are tables of what I thought were the best posters relevant to the comic understanding and MLLM reasoning research areas. I have separated workshop poster sessions from main conference posters and days of the main conference from one another. As with the poster tables in the NeurIPS blogpost, I include the poster ID from the conference program, the paper title, and a link to an open-access preprint where available.
Special thanks to the Computer Vision Foundation for organizing and hosting the CVPR conference. Special thanks to the staff of the Colorado Convention Center for managing the logistics of the operation. And special thanks to the city of Denver for hosting the attendees throughout the week of June 3 to June 7, 2026.
For more information about our research, return to our homepage: ufdatastudio.com.