Papers

Academic papers and research lineage

This archive traces how AI capability, safety, evaluation, and governance ideas evolve over time alongside public reporting and discussion.

RePEc: Research Papers in Economics · date pending

Mastering Atari, Go, chess and shogi by planning with a learned model

Abstract: Constructing agents with planning capabilities has long been one of the main challenges in the pursuit of artificial intelligence. Tree-based planning methods have enjoyed huge success in challenging domains, such as chess [1] and Go [2], where a perfect simulator is available. However, in real-world problems, the dynamics governing the environment are often complex and unknown. Here we present the MuZero algorithm, which, by combining a tree-based search with a learned model, achieves superhuman performance in a range of challenging and visually complex domains, without any knowledge of their underlying dynamics. The MuZero algorithm learns an iterable model that produces predictions relevant to planning: the action-selection policy, the value function and the reward. When evaluated on 57 different Atari games [3] (the canonical video game environment for testing artificial intelligence techniques, in which model-based planning approaches have historically struggled [4]), the MuZero algorithm achieved state-of-the-art performance. When evaluated on Go, chess and shogi (canonical environments for high-performance planning), the MuZero algorithm matched, without any knowledge of the game dynamics, the superhuman performance of the AlphaZero algorithm [5] that was supplied with the rules of the game.

Julian Schrittwieser · Ioannis Antonoglou · Thomas Hubert · Karen Simonyan · Laurent Sifre · Simon Schmitt · Arthur Guez · Edward Lockhart · Demis Hassabis · Thore Graepel · Timothy Lillicrap · David Silver

arXiv cs.CV · Jul 17, 2026

PanoAffordanceNet: Towards Holistic Affordance Grounding in 360° Indoor Environments

arXiv:2603.09760v2 Announce Type: replace Abstract: Global perception is essential for embodied agents in 360° spaces, yet current affordance grounding remains largely object-centric and restricted to perspective views. To bridge this gap, we introduce a novel task: Holistic Affordance Grounding in 360° Indoor Environments. This task faces unique challenges, including severe geometric distortions from Equirectangular Projection (ERP), semantic dispersion, and cross-scale alignment difficulties. We propose PanoAffordanceNet, an end-to-end framework featuring a Distortion-Aware Spectr…

arXiv cs.LG · Jul 17, 2026

Stop Thinking, Start Looking: Efficient Post-Training for Multimodal Document Question Answering via Reasoning-Free Alignment

arXiv:2607.14682v1 Announce Type: cross Abstract: Efficient multimodal document question answering with explicit visual grounding, that is, locating the precise document region that supports each answer, remains an open challenge. Current approaches bifurcate into Supervised Fine-Tuning (SFT), which requires large annotated datasets and reaches optimization plateaus, and reasoning-centric Reinforcement Learning (RL), which depends on verbose intermediate traces that inflate inference token cost without clear benefit. We introduce Perception-RFT, a training framework that applies Group Relative Policy O…

arXiv cs.CL · Jul 17, 2026

Smarter and Cheaper at Once: Byte-Exact KV-Cache Grafting Turns a Frozen Small Model into a Verified-Knowledge Flywheel

arXiv:2607.14431v1 Announce Type: new Abstract: We report a way to make a frozen small language model both more capable and dramatically cheaper at once, without changing any weights. Verified knowledge is deposited once as a byte-exact key-value (KV) state artifact and later restored, by graft, into a fresh inference context. The restore is bit-exact: under a pinned deterministic configuration, the grafted logits are byte-for-byte identical to a fresh computation (SHA-256 equality), with zero KL divergence and 100% argmax agreement over fifty samples. We show that own-position graft is the u…

arXiv cs.AI · Jul 17, 2026

VTM-Nav: Hierarchical Visual-Topological Memory for Cross-Episode Object-Goal Navigation

arXiv:2607.14514v1 Announce Type: cross Abstract: Object-goal navigation requires an embodied agent to locate and reach an instance of a specified object category in an indoor environment. Recent training-free approaches leverage vision-language models (VLMs) for open-vocabulary semantic reasoning, but are typically evaluated under an episodic protocol that resets all scene-specific state after each episode. We introduce Cross-Episode Object-Goal Navigation, in which an agent repeatedly operates in the same scene, retains only self-acquired experience, and keeps its model parameters fixed. To…

arXiv cs.CV · Jul 17, 2026

CRISP: Constrained Refinement via Iterative Squeezing Process for Robust Medical Image Segmentation under Domain Shift

arXiv:2607.15231v1 Announce Type: new Abstract: Distribution shift in medical imaging remains a central bottleneck for the clinical translation of medical AI. Failure to address it can lead to severe performance degradation in unseen environments and exacerbate health inequities. Existing methods for domain adaptation are inherently limited by exhausting predefined possibilities through simulated shifts or pseudo-supervision. Such strategies struggle in the open-ended and unpredictable real world, where distribution shifts are effectively infinite. To address this challenge, we adopt the "Ran…

arXiv cs.AI · Jul 17, 2026

Reward-Free Evolving Agents via Pairwise Validator

arXiv:2607.14408v1 Announce Type: new Abstract: A self-evolving agentic loop repeatedly proposes a tweaked version of an agent (its prompt template or program) and accepts or rejects the change based on a per-iteration quality signal. Designing that signal is often the costly part of the project: a reliable scalar reward requires domain expertise and labeled examples that are themselves as expensive to assemble as the agent's underlying task. We propose replacing the scalar at the accept/reject gate with a pairwise validator: a frozen LLM that, given the parent and child candidate, returns a…

arXiv cs.LG · Jul 17, 2026

Learning in Infinitesimal Non-Compositional Sketches

arXiv:2607.15107v1 Announce Type: new Abstract: This paper develops a categorical framework -- Learning in Infinitesimal Non-Compositional Sketches (LINCS) -- as the repair of non-compositionality: failures of diagrams to factor through quotient sketches lifted to the tangent category setting. Machine learning problems are specified as sketches: graphs with commutativity conditions $\mathcal D$, limit cones $\mathcal L$, and colimit cocones $\mathcal K$, generalizing the usual scalarization of loss functions or vector space assumptions. Non-compositionality is defined purely as failure of a u…

arXiv cs.AI · Jul 17, 2026

SceneBind: Binding What and Where Across Vision, Audio and Language

arXiv:2607.15265v1 Announce Type: cross Abstract: We present SceneBind, an omni-modal representation of realistic scenes with joint semantic and 3D spatial understanding across vision, audio and language. Existing omni-modal encoders excel at instance-level semantics (i.e., what is present), but often lack explicit spatial structure (i.e., where it is). SceneBind addresses this gap by representing each scene as a semantic-spatial entity, combining a global semantic embedding with object-centric semantic-spatial slots. This representation explicitly captures object-level semantics, spatial att…

arXiv cs.CV · Jul 17, 2026

Structural-Semantic Reciprocal Learning for Unsupervised Visible-Infrared Person Re-Identification

arXiv:2607.15220v1 Announce Type: new Abstract: Unsupervised visible-infrared person re-identification (USVI-ReID) is challenging due to the large modality gap and the lack of cross-modal identity annotations. Progressive association paradigms have been proposed to gradually bridge the gap, but they suffer from two critical bottlenecks: reliance on ambiguous global representations and unchecked propagation of pseudo-label noise in an open-loop manner. To address these issues, we propose Structural-Semantic Reciprocal Learning (SSRL), a framework that transforms open-loop association into a se…

arXiv cs.RO · Jul 17, 2026

Temporal Cascading of Planning and Control for Quadrotor MPC

arXiv:2512.12427v2 Announce Type: replace Abstract: Many aerial tasks involving quadrotors demand both instant reactivity and long-horizon planning for obstacle avoidance, energy efficiency, or trajectory tracking. High-fidelity models enable accurate control but are too slow for long horizons. Low-fidelity planners scale but cannot directly control the system, necessitating cascaded architectures. Prevailing hierarchical approaches plan with a simplified model and use a high-fidelity controller for tracking, yet this decomposition is inherently suboptimal. The controller is limited by the co…

arXiv cs.AI · Jul 17, 2026

Beyond Generalist LLMs: Specialist Agentic Systems for Structured Code Workflow Execution

arXiv:2607.14456v1 Announce Type: cross Abstract: Large Language Models (LLMs) have accelerated the adoption of software development agents, now widely available as Integrated Development Environment (IDE) extensions and standalone applications. While these agents are typically general-purpose, it remains unclear whether specialist agents justify their additional development effort. We investigate this question in the context of business process automation, focusing on the transformation of Business Process Model and Notation (BPMN) diagrams into executable agentic workflows. Since BPMN speci…

arXiv cs.CL · Jul 17, 2026

D-Cut: Adaptive Verification Depth Pruning for Batched Speculative Decoding

arXiv:2607.14647v1 Announce Type: new Abstract: Speculative decoding accelerates large language model (LLM) inference without compromising output quality. Recent parallel drafting methods further improve single-request performance by decoupling draft length from drafting latency, enabling longer drafts and higher mean accepted tokens (MAT). However, under high request concurrency, long drafts waste substantial computation on rejected tokens, increasing verification cost and potentially making speculative decoding slower than autoregressive decoding. We present D-Cut, an adaptive pruning metho…

arXiv cs.CV · Jul 17, 2026

Quality-Aware Robust Multi-View Clustering for Heterogeneous Observation Noise

arXiv:2602.22568v2 Announce Type: replace Abstract: Deep multi-view clustering has achieved remarkable progress but remains vulnerable to complex noise in real-world applications. Existing noise-robust methods predominantly rely on a simplified binary assumption, treating data as either perfectly clean or completely corrupted. This overlooks the prevalence of heterogeneous observation noise, where contamination intensity varies continuously across data. To bridge this gap, we propose a novel framework termed Quality-Aware Robust Multi-View Clustering (QARMVC). Specifically, QARMVC em…

arXiv cs.LG · Jul 17, 2026

Stabilizing Native Low-Rank LLM Pretraining

arXiv:2602.12429v2 Announce Type: replace Abstract: Foundation models have achieved remarkable success, yet their growing parameter counts pose significant computational and memory challenges. Low-rank factorization offers a promising route to reduce training and inference costs, but the community lacks a stable recipe for training models from scratch using exclusively low-rank weights while matching the performance of the dense model. We demonstrate that Large Language Models (LLMs) can be trained from scratch using exclusively low-rank factorized weights for all non-embedding matrices witho…

arXiv cs.RO · Jul 17, 2026

Communication-Efficient Relative Pose Estimation with Vision Foundation Models for Ephemeral Collaborative Perception

arXiv:2607.14539v1 Announce Type: new Abstract: Relative pose estimation is a fundamental capability for collaborative perception and coordination in multi-robot systems. However, robots encountering each other in real-world environments often operate in short interaction windows and must operate under limited communication bandwidth with intermittent or missing visual overlap caused by occlusions or limited fields of view. Existing approaches typically rely on global reference frames, assume sustained view overlap, or incur prohibitive communication costs, thereby limiting their applicabilit…

arXiv cs.IR · Jul 17, 2026

Bridge Evidence: Static Retrieval Utility Does Not Predict Causal Utility in Multi-Step Agentic Search

arXiv:2607.15253v1 Announce Type: new Abstract: Retrieval systems are trained and evaluated on a static idea of usefulness: hand a document and a question to a reader model, see whether the answer improves, and score the document accordingly. The idea holds up when a document is read on its own. It breaks when a language model works as a search agent, issuing several queries and reasoning across turns, because a document can matter for what it lets the agent do next rather than for what it says about the current question. We measure that gap rather than argue it. Using a ReAct-style agent ove…

arXiv cs.AI · Jul 17, 2026

Global drivers and barriers to the public acceptance of autonomous vehicles: Evidence from 17 countries

arXiv:2607.14436v1 Announce Type: cross Abstract: This study investigated the public acceptance of Society of Automotive Engineers Level 3 conditionally automated cars, which can self-drive under certain specified conditions but require the human driver to remain ready to resume control when requested. Previous Unified Theory of Acceptance and Use of Technology 2 (UTAUT2)-based research has focused mainly on European samples, and so it is still unclear whether the same factors shape acceptance across broader world regions. This knowledge gap was addressed using the L3Pilot Global User Accepta…

arXiv cs.LG · Jul 17, 2026

BadWAM: When World-Action Models Dream Right but Act Wrong

arXiv:2607.15207v1 Announce Type: new Abstract: World-action models (WAMs) are emerging as a promising foundation for embodied control: rather than predicting actions alone, they learn representations that couple action generation with future world prediction. This coupling is often viewed as a source of robustness, interpretability, and safety, as a robot's action can in principle be checked against its imagined future. In this paper, we show that this assumption is fragile. We introduce BadWAM, a unified framework for modeling and evaluating World-Action Drift Attacks: a new class of WAM-sp…

arXiv cs.CV · Jul 17, 2026

AnyStyle: Single-Pass Multimodal Stylization for 3D Gaussian Splatting

arXiv:2602.04043v2 Announce Type: replace Abstract: The growing demand for rapid and scalable 3D asset creation has driven interest in feed-forward 3D reconstruction methods, with 3D Gaussian Splatting (3DGS) emerging as an effective scene representation. While recent approaches have demonstrated pose-free reconstruction from unposed image collections, integrating stylization or appearance control into such pipelines remains underexplored. Existing attempts largely rely on image-based conditioning, which limits both controllability and flexibility. In this work, we introduce AnyStyle, a feed-…

arXiv cs.CV · Jul 17, 2026

Hierarchical Denoising For Multi-Step Visual Reasoning

arXiv:2607.15278v1 Announce Type: new Abstract: Video models are evolving into vision foundation models, yet they still lack human-like multi-step reasoning. Streaming autoregressive diffusion models are efficient but limited in reasoning, while bidirectional diffusion enables global revision with high inference costs due to dense frame-level denoising. Both paradigms struggle to achieve logical consistency and low-latency streaming for complex reasoning tasks. We propose HDR (Hierarchical Denoising for Visual Reasoning), a unified framework that integrates hierarchical latents into causal vi…

arXiv cs.AI · Jul 17, 2026

Symbal: Detecting Systematic Misalignments in Model-Generated Captions

arXiv:2607.15216v1 Announce Type: cross Abstract: Multimodal large language models (MLLMs) often introduce errors when generating image captions, resulting in misaligned image-text pairs. Our work focuses on a class of captioning errors that we refer to as systematic misalignments, where a recurring error in MLLM-generated captions is closely associated with the presence of a specific visual feature in the paired image. Given a vision-language dataset with MLLM-generated captions, our aim in this work is to detect such errors, a task we refer to as systematic misalignment detection. As our fi…

arXiv cs.CV · Jul 17, 2026

MAGiSt3R: Multi-Agent Feed-forward 3D Reconstruction from Monocular RGB Videos

arXiv:2607.15211v1 Announce Type: new Abstract: This paper presents MAGiSt3R, a multi-agent 3D reconstruction framework performing reconstruction and camera tracking for monocular RGB videos at almost 10 FPS. MAGiSt3R relies on a feed-forward model from the 3R family to process RGB videos and regress local point maps, and on a merging model, MAGMA, that combines local maps at both intra-agent and inter-agent levels to obtain the final global point map. Furthermore, MAGiSt3R performs pose graph optimization to mitigate cumulative camera drift occurring along the feed-forward pipeline. We evalu…

arXiv cs.LG · Jul 17, 2026

LATTICE: Graph Self-Supervised Learning for Multimodal Spatial Omics Integration

arXiv:2607.14410v1 Announce Type: new Abstract: Spatially resolved omics studies increasingly combine transcriptomic and epigenomic assays, yet downstream analysis is often still performed using single-modality pipelines. We present LATTICE (Latent Alignment of Tissue-level and Transcriptomic Information for Cross-modal Embedding), a graph-based self-supervised framework that learns spot-level representations from harmonized multimodal features. LATTICE integrates five aligned modality blocks per Visium spot: Visium RNA, scMultiome RNA, scMultiome ATAC, spatial ATAC, and spatial CUT&Tag. The…
