Paper Tracking · Archive

Every weekly run is preserved, grouped by month, with the latest weeks at the top.

June 2026 48 papers

Week of June 29

Graph × LLM

Physics Question Scene Graph: Fine-grained Evaluation of Physical Plausibility in Text-to-Video Generation

↑ 1 📚 20084 ★ 0 Jun 23

Problem: Video generation models are increasingly capable of producing realistic videos, but they still struggle to generate videos that follow basic physical laws. Comp
Model:
Code: atinpothiraj/pqsg

A Fair Evaluation of Graph Foundation Models for Node Property Prediction

↑ 0 📚 1890 ★ 0 Jun 23

Problem: Due to the wide use of graph-structured data in different fields of industry and science, the development of Graph Foundation Models (GFMs) has recently attract
Model:
Code: not released

Understanding Rollout Error in Graph World Models

↑ 0 ★ 0 Jun 26

Problem: World models are often used for planning by rolling learned dynamics forward. Many planning environments, however, are not vectors or images; they are graphs of
Model:
Code: Hik289/graph_

Vision-Language Models

Wan-Streamer v0.1: End-to-end Real-time Interactive Foundation Models

↑ 98 📚 83 ★ 0 Jun 22

Problem: We present Wan-Streamer, a native-streaming, end-to-end interactive foundation model designed from the ground up for real-time, low-latency, full-duplex audio-v
Model:
Code: not released

ShutterMuse: Capture-Time Photography Guidance with MLLMs

↑ 45 📚 7140 ★ 0 Jun 23

Problem: Real-world photography requires capture-time guidance for both camera framing and subject pose. Yet existing aesthetic cropping benchmarks mainly evaluate post-
Model:
Code: not released

ViQ: Text-Aligned Visual Quantized Representations at Any Resolution

↑ 38 📚 7100 ★ 0 Jun 24

Problem: A unified representation for text and vision is a natural pursuit, as it enables simpler multimodal modeling and more efficient training. However, representing
Model:
Code: yuxumin/ViQ

World Models

Qwen-AgentWorld: Language World Models for General Agents

↑ 139 📚 554 ★ 0 Jun 23

Problem: A world model predicts environment dynamics based on current observations and actions, serving as a core cognitive mechanism for reasoning and planning. In this
Model:
Code: expressjs/express.git

DomainShuttle: Freeform Open Domain Subject-driven Text-to-video Generation

↑ 64 📚 8106 ★ 0 Jun 23

Problem: Open domain subject-driven text-to-video (S2V) generation has drawn significant interest in academia and industry. Open domain S2V mainly involves two scenarios
Model:
Code: not released
⚠ Interested, but agent could not fetch the PDF — summary based on abstract only.

Hallucination in World Models is Predictable and Preventable

↑ 8 📚 21396 ★ 0 Jun 25

Problem: Modern generative world models render increasingly realistic action-controllable futures, yet they frequently hallucinate: rollouts remain visually fluent while
Model:
Code: google-research/robodesk

Spatial Single-Cell Study

SP-Mind: An Autonomous Reasoning Agent for Spatial Proteomics Analysis

↑ 0 📚 95 ★ 0 Jun 23

Problem: Spatial proteomics enables single-cell-resolution characterization of protein expression within tissue architecture, playing a critical role in understanding tu
Model:
Code: not released

Week of June 22

Graph × LLM

Weaving Multi-Source Evidence for Biomedical Reasoning: The BioMedHop Benchmark and BioWeave Framework

↑ 0 📚 24101 ★ 0 Jun 15

Problem: Biomedical question answering (QA) increasingly requires reasoning over interacting entities, where supporting evidence is scattered across biomedical knowledge
Model:
Code: not released

AdaSTORM: Scaling LLM Reasoning on Dynamic Graphs via Adaptive Spatio-Temporal Multi-Agent Collaboration

↑ 0 📚 19129 ★ 0 Jun 15

Problem: Large Language Models (LLMs) demonstrate remarkable potential in dynamic graph reasoning, but suffer from a scaling bottleneck: current models can only handle g
Model:
Code: not released

What Should a Streaming Video Model Remember?

↑ 0 📚 3581 ★ 0 Jun 15

Problem: Streaming video understanding models must answer queries at any moment during an ongoing stream, using only what they have observed so far and under fixed memor
Model:
Code: not released

Vision-Language Models

Moebius: 0.2B Lightweight Image Inpainting Framework with 10B-Level Performance

↑ 115 📚 34132 ★ 0 Jun 16

Problem: While 10B-level industrial foundation models have pushed the boundaries of image inpainting, their prohibitive computational costs severely hinder practical dep
Model:
Code: black-forest-labs/flux

S-Agent: Spatial Tool-Use Elicits Reasoning for Spatial Intelligence

↑ 36 📚 46068 ★ 0 Jun 18

Problem: Real-world spatial intelligence requires reasoning over a continuous and evolving 3D world, yet existing VLMs and tool-augmented agents largely remain tied to s
Model:
Code: not released

FreeStyle: Free Control of Style-Content Dual-Reference Generation from Community LoRA Mining

↑ 26 📚 27146 ★ 0 Jun 18

Problem: Style-content dual-reference generation aims to synthesize an image that preserves the structure and semantics of a content reference while adopting the style o
Model:
Code: not released
⚠ Interested, but agent could not fetch the PDF — summary based on abstract only.

World Models

ImageWAM: Do World Action Models Really Need Video Generation, or Just Image Editing?

↑ 15 📚 15824 ★ 0 Jun 17

Problem: World Action Models (WAMs) commonly rely on video generation to bridge visual world modeling and robot control. However, video-based WAMs face three coupled lim
Model:
Code: yuyangalin/ImageWAM

EgoCS-400K: An Egocentric Gameplay Dataset for World Models

↑ 15 📚 13282 ★ 0 Jun 16

Problem: The shift from video generation to interactive world modeling places new demands on data: beyond captioned videos, world models require temporally aligned video
Model:
Code: not released

Kairos: A Native World Model Stack for Physical AI

↑ 35 ★ 0 Jun 15

Problem: World models are transitioning from passive visual generators to foundational, operational infrastructure for Physical AI: they must natively acquire world know
Model:
Code: not released
⚠ Interested, but agent could not fetch the PDF — summary based on abstract only.

Week of June 15

Graph × LLM

APEX: A Network-Native Time-Series Foundation Model for Forecasting and Anomaly Detection for Wireless Edge Operations

↑ 1 📚 150 ★ 0 Jun 9

Problem: Generic time-series foundation models fail to capture wireless network telemetry characteristics—bursty, zero-inflated signals with cross-protocol dependencies.
Model: APEX: network-native decoder-only transformer for forecasting enterprise AP telemetry and anomaly detection, available in cloud (269M) and edge (10.5M) variants.
Code: not released

Detecting Differences Is Not Understanding Structure: Large Language Models Fail at Graph Isomorphism

↑ 0 ★ 0 Jun 8

Problem: LLMs achieve high accuracy on graph isomorphism detection but fail to maintain permutation invariance when nodes are relabeled, suggesting pattern exploitation rather than genuine structural reasoning.
Model: approach: diagnostic evaluation protocol testing permutation invariance of LLMs (GPT-4o, Gemini, Llama) on graph isomorphism tasks across multiple serialization formats and prompting strategies.
Code: not released
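
As an illustration of the permutation-invariance property this entry refers to, here is a minimal sketch; it is not the paper's evaluation protocol. Relabeling the nodes of a graph with a random permutation yields an isomorphic graph, so a judge that reasons about structure should give the same verdict before and after relabeling. The helper names `relabel` and `degree_sequence` are hypothetical.

```python
import random

def relabel(edges, n, seed=0):
    """Return the edge list of an isomorphic copy under a random node permutation."""
    rng = random.Random(seed)
    perm = list(range(n))
    rng.shuffle(perm)
    return [(perm[u], perm[v]) for u, v in edges]

def degree_sequence(edges, n):
    """Sorted degree sequence: a simple invariant that every isomorphic copy shares."""
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return sorted(deg)

# A 4-node path graph and a relabeled (isomorphic) copy of it.
path = [(0, 1), (1, 2), (2, 3)]
copy = relabel(path, 4, seed=1)

# The invariant is unchanged by relabeling; a structure-aware judge's
# answer should be equally unaffected.
same_invariant = degree_sequence(path, 4) == degree_sequence(copy, 4)
```

Matching degree sequences are only a necessary condition for isomorphism; the point of the sketch is how relabeled test pairs can be produced.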

Vision-Language Models

InterleaveThinker: Reinforcing Agentic Interleaved Generation

↑ 77 📚 42325 ★ 0 Jun 10

Problem: Image generators cannot produce interleaved text-image sequences due to architectural constraints, limiting applications in visual narratives and embodied manipulation.
Model: InterleaveThinker: multi-agent framework with planner and critic agents that retrofits existing image generators for interleaved generation using GRPO-based trajectory optimization.
Code: zhengdian1/InterleaveThinker

LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories

↑ 53 📚 8682 ★ 0 Jun 11

Problem: Existing vision-language-action models lack laboratory-specific training data and cannot accommodate diverse robot embodiments needed for scientific protocol execution.
Model: LabVLA: vision-language-action model combining Qwen3-VL-4B-Instruct backbone with FAST action token pretraining and flow matching posttraining via DiT action expert
Code: not released

SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning

↑ 89 ★ 0 Jun 11

Problem: Vision-language models struggle with spatial reasoning tasks; existing action interfaces for tool-augmented agents limit flexible composition of perception results.
Model: SpatialClaw: training-free framework using persistent stateful Python kernel as action interface for iterative spatial reasoning with VLM-backed agents
Code: Video-MME/Video-MME-v2

World Models

Avatar V: Scaling Video-Reference Avatar Video Generation

↑ 4 ★ 0 Jun 10

Problem: Existing avatar video generation methods condition on single static images, failing to capture dynamic behavioral patterns and identity nuances required for production-quality talking avatars.
Model: Avatar V: production-scale framework for video-reference-conditioned avatar generation using Diffusion Transformer with Sparse Reference Attention, motion representation stream, and identity-aware super-resolution refinement.
Code: not released

Scale Buys Interpolation, Structure Buys a Horizon: Certified Predictability for Equivariant World Models

↑ 0 📚 17151 ★ 0 Jun 11

Problem: World models lack per-prediction trustworthiness certificates and prediction horizons; average error does not indicate whether specific predictions can be trusted or for how long.
Model: approach: Equivariant latent world models with computable multi-step certification via Lyapunov spectrum stratification, proving orbit-constant error under equivariance and horizon bounds T_j(ε)∼log(1/ε)/λ_j
Code: not released
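
To make the horizon scaling in this entry concrete, here is a small worked sketch. It treats the relation T_j(ε) ∼ log(1/ε)/λ_j as an equality with the constant set to 1, which is an assumption for illustration only; the function name `horizon` and the sample values are hypothetical.

```python
import math

def horizon(eps, lam):
    """Illustrative horizon log(1/eps) / lam, with the constant set to 1 (assumption)."""
    if not (0 < eps < 1) or lam <= 0:
        raise ValueError("need 0 < eps < 1 and lam > 0")
    return math.log(1.0 / eps) / lam

# Halving the rate lam doubles the horizon at a fixed eps.
h_fast = horizon(1e-3, 0.5)
h_slow = horizon(1e-3, 0.25)

# Squaring eps (1e-3 -> 1e-6) also only doubles the horizon:
# the dependence on eps is logarithmic.
h_tiny = horizon(1e-6, 0.5)
```

Under this reading, a much smaller ε extends the horizon only modestly, while a smaller λ_j extends it proportionally.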

WEAVER, Better, Faster, Longer: An Effective World Model for Robotic Manipulation

↑ 2 ★ 0 Jun 11

Problem: Existing robot world models fail to simultaneously achieve high fidelity, long-horizon consistency, and efficient inference for manipulation tasks.
Model: WEAVER (World Estimation Across Views for Embodied Reasoning): a multi-view world model combining flow matching, diffusion forcing, pretrained encoders, and latent reward prediction for robot manipulation.
Code: Lightning-AI/torchmetrics

Spatial Single-Cell Study

OCOO-T: A Simple and Scalable Virtual Cell Model for Transcriptional Perturbation Response Prediction

↑ 0 📚 329 ★ 0 Jun 11

Problem: Predicting single-cell transcriptional responses to perturbations requires models that capture population-level distribution shifts without relying on complex auxiliary encoders or specialized latent spaces.
Model: OCOO-T: flow-matching-based Transformer model that directly denoises continuous gene expression profiles conditioned on perturbation embeddings, dosage, and cell covariates via adaptive layer normalization.
Code: not released

HiST: A Hierarchical Sparse Transformer for Cross-Modal Spatial Transcriptomics Modeling

↑ 0 ★ 0 Jun 12

Problem: Inferring gene expression from gigapixel H&E histology images at sparse, irregular spatial locations requires efficient multiscale modeling without dense-grid overhead.
Model: HiST: hierarchical sparse transformer with dyadic encoder-decoder on sparse tissue footprint, sparse window attention, and slide calibration token for cross-slide robustness.
Code: wwyi1828/HiST

Adaptive spatial blocking for scalable clustering inference with applications to high-throughput spatial proteomics

↑ 0 ★ 0 Jun 10

Problem: Existing Ripley's K-function methods for spatial clustering are computationally prohibitive for large-scale spatial proteomics data due to O(n²) complexity.
Model: B-KAMP (block-based KAMP): adaptive spatial blocking algorithm aggregating clustering evidence across disjoint rectangular blocks with asymptotic normal inference.
Code: mingyugo/B_KAMP
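
The quadratic cost mentioned in this entry can be seen in a naive Ripley's K estimate, sketched below. This is the textbook pairwise baseline without edge correction, not the B-KAMP algorithm; the function name `ripley_k_naive` and the toy points are hypothetical.

```python
import math

def ripley_k_naive(points, r, area):
    """Naive Ripley's K at radius r, no edge correction: O(n^2) pair loop."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    close_pairs = 0
    for i in range(n):
        for j in range(n):
            # Count ordered pairs of distinct points within distance r.
            if i != j and math.dist(points[i], points[j]) <= r:
                close_pairs += 1
    return area * close_pairs / (n * (n - 1))

# Four corners of a unit square: each point has two neighbours at distance 1.
corners = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
k_small = ripley_k_naive(corners, 0.5, area=1.0)
k_unit = ripley_k_naive(corners, 1.0, area=1.0)
```

The double loop visits every ordered pair, which is what becomes prohibitive at the scale of spatial proteomics data.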

Week of June 8

Graph × LLM

When Graph Tokens Sink: A Mechanistic Analysis of Graph Language Models

↑ 2 📚 1201 ★ 0 Jun 2

Problem: Graph Language Models may develop internal pathologies where graph tokens become activation outliers without meaningfully representing graph structure.
Model: approach: mechanistic interpretability analysis of Graph Language Models (LLaGA, TEA-GLM) through graph sink token detection and intervention experiments
Code: not released

The Post-GCN Decade Revisited: Curvature-Stratified Evaluation of Relational Learning

↑ 0 📚 6199 ★ 0 Jun 4

Problem: Standard benchmarks average model performance across heterogeneous datasets, obscuring geometry-dependent performance variations and misleading conclusions about generalization.
Model: CurvBench: curvature-stratified evaluation framework partitioning datasets by intrinsic geometry (positive, negative, near-zero curvature) to reveal geometry-dependent model performance trade-offs.
Code: https://sirbabbage.github.io/CurvBench_HOME/

A Graph Foundation Model with Spectral Parsing and Prototype-Guided Spatial Propagation

↑ 0 📚 12963 ★ 0 Jun 2

Problem: Graph foundation models struggle with cross-graph transfer due to diverse graph structures and entangled spectral components requiring different propagation behaviors.
Model: SPG: graph foundation model combining learnable Chebyshev spectral filters for feature decomposition with Gromov-Wasserstein prototype geometry for transferable structural knowledge.
Code: not released

Vision-Language Models

LoomVideo: Unifying Multimodal Inputs into Video Generation and Editing

↑ 21 📚 27459 ★ 0 Jun 3

Problem: Existing unified video generation and editing models are computationally expensive, relying on massive parameters and token concatenation that quadruples self-attention complexity.
Model: LoomVideo: 5B-parameter unified video generation and editing architecture using MLLM encoder, Deepstack injection, and zero-overhead Scale-and-Add conditioning.
Code: MSALab-PKU/LoomVideo

Personal AI Agent for Camera Roll VQA

↑ 19 📚 7113 ★ 0 Jun 2

Problem: AI assistants cannot efficiently answer personalized questions over thousands of personal camera roll photos spanning years.
Model: camroll-agent: conversational AI agent with hierarchical memory and tools for efficient navigation over large personalized visual memory
Code: not released

Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators

↑ 9 📚 3973 ★ 0 Jun 4

Problem: Vision-Language Models struggle with spatial reasoning beyond observed images, failing to infer unobserved layouts and reason from alternative viewpoints with limited egocentric observations.
Model: Astra: agentic spatial reasoning framework coupling Astra-VL (RL-trained Qwen3-VL policy) with Astra-WM (Bagel-based world simulator for action-conditioned novel-view generation with view consistency tuning)
Code: not released

World Models

Dream.exe: Can Video Generation Models Dream Executable Robot Manipulation?

↑ 16 📚 4009 ★ 0 Jun 3

Problem: Standard video generation benchmarks measure visual quality, not whether generated robot manipulation videos produce executable physical behavior.
Model: Dream.exe: evaluation framework for video-to-execution grounding in robotic manipulation, combining video assessment, trajectory extraction, and physics simulator execution.
Code: not released

Towards World Models in Biomedical Research

↑ 0 📚 38743 ★ 0 Jun 4

Problem: Current biomedical AI focuses on static pattern recognition rather than simulating how biological systems evolve under interventions and perturbations.
Model: approach: biomedical world models that learn latent representations of biological states and intervention-conditioned dynamics to simulate future trajectories
Code: not released

PiL-World: A Chunk-Wise World Model for VLA Policy-in-the-Loop Evaluation

↑ 0 📚 5409 ★ 0 Jun 4

Problem: Existing world models for robot evaluation use open-loop prediction, but VLA policies operate in closed-loop feedback; a world model for policy-in-the-loop evaluation is needed.
Model: PiL-World: chunk-wise world model for closed-loop VLA evaluation using action-derived visual control, latent multi-view history conditioning, and joint multi-view prediction.
Code: not released

Spatial Single-Cell Study

Do Foundation Models See Biology? Evaluating Attention Coherence with Spatial Transcriptomics in Glioblastoma

↑ 0 📚 3679 ★ 0 Jun 3

Problem: Whether attention maps from pathology foundation models capture genuine biological signals remains unknown, hindering clinical trust and regulatory approval.
Model: approach: spatial transcriptomics-based framework for objective evaluation of attention coherence in pathology foundation models using attention-based multiple instance learning
Code: not released

GC-MoE: Genomics-Guided Cell-Type-Specific Mixture of Experts for Histology-Based Single-Cell Spatial Transcriptomics

↑ 0 📚 1102 ★ 0 Jun 1

Problem: Predict gene expression for individual cells from histology images and cell locations, accounting for cell-type-dependent expression variability.
Model: GC-MoE: Genomics-Guided Cell-Type-Specific Mixture of Experts that routes cells to type-specific experts and incorporates cell-type co-expression priors from scRNA-seq data.
Code: not released

Week of June 1

Graph × LLM

UI-KOBE: Knowledge-Oriented Behavior Exploration for Lightweight Graph-Guided GUI Agents

↑ 10 📚 42163 ★ 0 May 27

Problem: Lightweight mobile GUI agents struggle with end-to-end planning due to limited model capacity; large VLMs are costly and privacy-invasive.
Model: UI-KOBE: framework that constructs app-specific knowledge graphs through autonomous exploration and uses them to guide lightweight agents via local decision-making at runtime.
Code: YuxiangChai/UI-KOBE

Vision-Language Models

Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments

↑ 117 📚 674 ★ 0 May 27

Problem: Embodied AI systems are fragmented across task families and robot embodiments, limiting generalization across manipulation, navigation, and diverse platforms.
Model: Qwen-VLA: unified vision-language-action model extending Qwen3.5-4B with DiT-based flow-matching action decoder for cross-task, cross-embodiment embodied control.
Code: QwenLM/Qwen-VLA

Why Far Looks Up: Probing Spatial Representation in Vision-Language Models

↑ 45 📚 9553 ★ 0 May 27

Problem: Vision-language models achieve high spatial reasoning benchmark scores but may rely on statistical shortcuts rather than structured 3D understanding.
Model: approach: Representation-level analysis framework using minimal contrastive pairs to measure spatial axis organization and disentanglement in VLM embeddings, plus SpatialTunnel synthetic benchmark.
Code: not released

EarlyTom: Early Token Compression Completes Fast Video Understanding

↑ 28 📚 19794 ★ 0 May 27

Problem: Video-LLMs suffer from inefficient vision encoding, which dominates time-to-first-token latency despite existing token compression methods.
Model: EarlyTom: Early Token Compression framework for video LLMs, using inner vision encoder frame merging and decoupled spatial token selection.
Code: not released

World Models

minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models

↑ 50 📚 36656 ★ 0 May 27

Problem: Converting video diffusion foundation models into real-time interactive world models requires scattered techniques across data, training, and inference pipelines.
Model: minWM: full-stack framework converting T2V/TI2V bidirectional diffusion models into camera-controllable few-step autoregressive video world models via Causal Forcing/Forcing++ distillation
Code: shengshu-ai/minWM

YoCausal: How Far is Video Generation from World Model? A Causality Perspective

↑ 42 📚 16919 ★ 0 May 27

Problem: Video diffusion models may perceive temporal direction without understanding causality; existing benchmarks rely on synthetic data with limited real-world generalization.
Model: YoCausal: two-level benchmark with Reverse Surprise Index (RSI) and Causality Cognition Index (CCI) metrics for evaluating causal cognition in video diffusion models.
Code: genmoai/models

AdaState: Self-Evolving Anchors for Streaming Video Generation

↑ 6 📚 1390 ★ 0 May 27

Problem: Autoregressive video diffusion models anchor to static first frames, suppressing dynamics and locking scene composition despite natural evolution during generation.
Model: AdaState: replaces static first-frame anchor with adaptive latent state that denoises alongside content at each chunk, evolving with generated scenes via relative time formulation.
Code: not released

May 2026 22 papers

Week of May 25

Graph × LLM

GLeVE: Graph-Guided Lesion Grounding with Proposal Verification in 3D CT

↑ 0 📚 7548 ★ 0 May 21

Problem: Aligning free-text radiology report descriptions to 3D CT volumes for precise lesion localization and verifiable clinical interpretation.
Model: GLeVE: Graph-guided lesion grounding framework using relation-aware graph reasoning, anatomy-aware proposal verification, and octree-based refinement for lesion-wise text-image alignment.
Code: JSLiam94/GLeVE

S2Aligner: Pair-Efficient and Transferable Pre-Training for Sparse Text-Attributed Graphs

↑ 0 📚 2080 ★ 0 May 18

Problem: Graph foundation models struggle with sparse text-attributed graphs where node texts are missing, noisy, or uneven, causing unreliable structure-semantics alignment and transfer bias.
Model: S2Aligner: sparsity-aware and structure-enhanced LLM-as-Aligner framework that decouples semantic alignment from structural modeling via content-structure factorization and sparsity-aware cross-domain risk balancing.
Code: not released

Deep Neural Sheaf Diffusion

↑ 0 📚 1013 ★ 0 May 18

Problem: Scaling Graph Neural Networks to depth is hindered by representation collapse and vanishing signals in existing sheaf diffusion methods.
Model: Deep Neural Sheaf Diffusion (DNSD): sheaf-based GNN replacing sheaf Laplacian with adjacency operator, adding normalization, odd nonlinearities, and gating to maintain informative signals across layers.
Code: not released

Vision-Language Models

IndusAgent: Reinforcing Open-Vocabulary Industrial Anomaly Detection with Agentic Tools

↑ 81 📚 57 ★ 0 May 19

Problem: MLLMs struggle with domain-misaligned reasoning and hallucinated inferences in open-vocabulary industrial anomaly detection across unseen products.
Model: IndusAgent: tool-augmented agentic framework with supervised fine-tuning on Indus-CoT dataset and accuracy-gated reinforcement learning for industrial anomaly detection.
Code: not released

LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning

↑ 42 📚 15499 ★ 0 May 20

Problem: Current multimodal LLMs struggle with audio-visual reasoning because text-based chain-of-thought compresses continuous signals into discrete tokens, losing temporal grounding.
Model: LatentOmni: cross-modal reasoning framework interleaving textual reasoning with audio-visual latent states, using feature-level supervision and Omni-Sync Position Embedding for temporal alignment.
Code: not released

RankE: End-to-End Post-Training for Discrete Text-to-Image Generation with Decoder Co-Evolution

↑ 9 📚 19747 ★ 0 May 19

Problem: Discrete autoregressive text-to-image models suffer from latent covariate shift when optimizing only the policy with a frozen decoder, causing alignment-fidelity trade-offs.
Model: RankE: End-to-end post-training framework for discrete text-to-image generation that co-evolves the AR policy and VQ decoder through alternating ranking-based optimization.
Code: not released

World Models

FlowLong: Inference-time Long Video Generation via Manifold-constrained Tweedie Matching

↑ 24 📚 17822 ★ 0 May 19

Problem: Extending video diffusion models to generate long sequences without training, while avoiding quality degradation and temporal inconsistency.
Model: FlowLong: inference-time framework using manifold-constrained Tweedie matching and stochastic early-phase sampling for long video generation
Code: not released

Q-ARVD: Quantizing Autoregressive Video Diffusion Models

↑ 19 📚 13800 ★ 0 May 20

Problem: Quantizing autoregressive video diffusion models is unexplored; standard quantization schemes designed for bidirectional diffusion transformers perform suboptimally on ARVDs.
Model: Q-ARVD: quantization framework for autoregressive video diffusion models using final-quality-guided frame-weighting and outlier-aware adaptive dual-scale quantization
Code: not released

Efficient Agentic Reasoning Through Self-Regulated Simulative Planning

↑ 7 📚 41260 ★ 0 May 21

Problem: Current agentic LLMs lack control over when and how to plan, causing inefficient token use without reliable accuracy gains.
Model: SR²AM (Self-Regulated Simulative Reasoning Agentic LLM): decomposes decision-making into three systems (simulative reasoning via world model, self-regulation via learned configurator, and reactive execution), implemented as distinct chain-of-thought stages within an LLM.
Code: sailing-lab/sr2am

Spatial Single-Cell Study

AgentCo-op: Retrieval-Based Synthesis of Interoperable Multi-Agent Workflows

↑ 0 📚 20431 ★ 0 May 19

Problem: Multi-agent workflow design in open-ended scientific settings lacks curated training sets, reliable metrics, and standardized interfaces between tools and agents.
Model: AgentCo-op: retrieval-based synthesis framework that composes reusable skills, tools, and external agents into executable workflows through typed artifact handoffs and bounded evidence-guided local repair.
Code: ma-compbio-lab/AgentCo-Op

Week of May 18

Graph × LLM

Decoupled and Divergence-Conditioned Prompt for Multi-domain Dynamic Graph Foundation Models

↑ 0 📚 136813 ★ 0 May 13

Problem: Multi-domain dynamic graphs have incompatible temporal semantics and divergent patterns, causing negative transfer in unified pre-training and adaptation.
Model: DyGFM: Dynamic Graph Foundation Model with semantic-temporal decoupled pre-training, divergence-aware expert routing, and divergence-conditioned prompt generation.
Code: RingBDStack/DyGFM

GFMate: Empowering Graph Foundation Models with Test-time Prompt Tuning

↑ 0 📚 13731 ★ 0 May 14

Problem: Existing graph foundation model prompts entangle source domain information and pre-training strategies, limiting generalization to unseen target domains and neglecting unlabelled test data.
Model: GFMate: pre-training-agnostic test-time graph prompt tuning using centroid and layer prompts with complementary learning objective
Code: YanJiangJerry/GFMate

A Unified Graph Language Model for Multi-Domain Multi-Task Graph Alignment Instruction Tuning

↑ 0 📚 23386 ★ 0 May 12

Problem: Existing graph language models lack unified alignment of multi-domain, multi-task GNN representations with LLM token spaces.
Model: UniGraphLM: Unified Graph Language Model with graph-text pair pretraining and curriculum alignment tuning for multi-domain, multi-task graph representation alignment with LLMs.
Code: not released

Vision-Language Models

MMSkills: Towards Multimodal Skills for General Visual Agents

↑ 99 📚 30599 ★ 0 May 13

Problem: Visual agents need reusable multimodal procedural knowledge that binds actions to visual state recognition and decision-making, beyond text-only skills.
Model: MMSkills: framework for representing, generating, and utilizing reusable multimodal procedures for visual agents. Each skill couples textual procedures with runtime state cards and multi-view keyframes, generated via trajectory-to-skill generator and consulted via branch loading.
Code: DeepExperience/MMSkills

MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models

↑ 71 📚 3089 ★ 0 May 13

Problem: No benchmark systematically compares long-context LVLMs and memory-augmented agents on multimodal multi-session conversations requiring visual evidence.
Model: MemLens: benchmark with 789 questions across five memory abilities (information extraction, multi-session reasoning, temporal reasoning, knowledge update, answer refusal) at four context lengths (32K-256K tokens)
Code: xrenaf/MEMLENS

MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory

↑ 58 📚 1421 ★ 0 May 13

Problem: Existing multimodal agent memory evaluations fail to assess whether agents preserve fine-grained visual evidence needed for reasoning over time.
Model: MemEye: a visual-centric evaluation framework measuring visual evidence granularity and reasoning complexity in multimodal agent memory
Code: not released
⚠ Interested, but agent could not fetch the PDF — summary based on abstract only.

World Models

Causal Forcing++: Scalable Few-Step Autoregressive Diffusion Distillation for Real-Time Interactive Video Generation

↑ 87 📚 36802 ★ 0 May 14

Problem: Existing AR diffusion distillation methods require 4+ sampling steps; frame-wise 1–2 step generation needs efficient, scalable student initialization.
Model: Causal Forcing++: AR diffusion distillation pipeline using causal consistency distillation for few-step student initialization, avoiding expensive full PF-ODE trajectory precomputation.
Code: thu-ml/Causal-Forcing

Warp-as-History: Generalizable Camera-Controlled Video Generation from One Training Video

↑ 38 📚 18942 ★ 0 May 13

Problem: Existing camera-controlled video generation methods require large-scale camera-annotated data or expensive test-time optimization; no simple way to leverage pretrained video models' latent camera-control capability.
Model: Warp-as-History: converts camera-induced geometric warps into camera-warped pseudo-history fed through pretrained video models' native history pathway, with target-frame positional alignment and visible-token selection.
Code: yyfz/Warp-as-History

PhyMotion: Structured 3D Motion Reward for Physics-Grounded Human Video Generation

↑ 7 📚 19915 ★ 0 May 13

Problem: Existing video rewards fail to reliably score human motion realism because they operate in 2D pixel space without explicitly modeling 3D body physics and constraints.
Model: PhyMotion: physics-grounded motion reward that recovers SMPL meshes, retargets to MuJoCo simulator, and evaluates motion via three axes (kinematic plausibility, contact/balance, dynamic feasibility)
Code: not released

Spatial Single-Cell Study

DUET: Dual-Paradigm Adaptive Expert Triage with Single-cell Inductive Prior for Spatial Transcriptomics Prediction

↑ 0 📚 5699 ★ 0 May 13

Problem: Existing methods for inferring spatial gene expression from histology images oversimplify morphology-to-expression mapping and underutilize large-scale single-cell data as biological constraints.
Model: DUET: dual-paradigm framework synergizing parametric regression and memory-based retrieval with cellular inductive priors and adaptive expert triage for spatial transcriptomics prediction.
Code: Junchao-Zhu/DUET

AssayBench: An Assay-Level Virtual Cell Benchmark for LLMs and Agents

↑ 0 📚 2120 ★ 0 May 11

Problem: No standard benchmark exists for predicting phenotypic outcomes of cellular perturbations across diverse assays in drug discovery.
Model: approach: LLM and agentic systems evaluated on gene ranking prediction from free-text screen descriptions using adjusted nDCG metric
Code: Genentech/AssayBench

StateXDiff: Cell State-Contextualized Multimodal Diffusion for Single-Cell Perturbation Prediction

↑ 0 ★ 0 May 15

Problem: Predicting drug-induced cellular state changes at single-cell resolution under out-of-distribution conditions with limited multimodal information.
Model: StateXDiff: cell State-contextualized multimodal Diffusion framework integrating transcriptomic and pseudo-protein representations with mechanism-aware drug templates via latent conditional diffusion.
Code: not released

Yuzhou Chang
