Paper Tracking · Archive

Every weekly run is preserved. Latest weeks at the top.

May 2026 12 papers

Week of

Graph × LLM

Vision-Language Models

MMSkills: Towards Multimodal Skills for General Visual Agents

↑ 99 📚 30599 ★ 0 May 13
  • Problem: Visual agents need reusable multimodal procedural knowledge that binds actions to visual state recognition and decision-making, beyond text-only skills.
  • Model: "MMSkills": framework for representing, generating, and utilizing reusable multimodal procedures for visual agents. Each skill couples textual procedures with runtime state cards and multi-view keyframes, generated via a trajectory-to-skill generator and consulted via branch loading.
  • Code: DeepExperience/MMSkills

MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models

↑ 71 📚 3089 ★ 0 May 13
  • Problem: No benchmark systematically compares long-context LVLMs and memory-augmented agents on multimodal multi-session conversations requiring visual evidence.
  • Model: "MemLens": benchmark with 789 questions across five memory abilities (information extraction, multi-session reasoning, temporal reasoning, knowledge update, answer refusal) at four context lengths (32K-256K tokens)
  • Code: xrenaf/MEMLENS

MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory

↑ 58 📚 1421 ★ 0 May 13
  • Problem: Existing multimodal agent memory evaluations fail to assess whether agents preserve fine-grained visual evidence needed for reasoning over time.
  • Model: "MemEye": a visual-centric evaluation framework measuring visual evidence granularity and reasoning complexity in multimodal agent memory.
  • Code: not released
  • ⚠ Interested, but the agent could not fetch the PDF; summary based on abstract only.

World Models

Causal Forcing++: Scalable Few-Step Autoregressive Diffusion Distillation for Real-Time Interactive Video Generation

↑ 87 📚 36802 ★ 0 May 14
  • Problem: Existing AR diffusion distillation methods require 4+ sampling steps; frame-wise 1–2 step generation needs efficient, scalable student initialization.
  • Model: "Causal Forcing++": AR diffusion distillation pipeline using causal consistency distillation for few-step student initialization, avoiding expensive full PF-ODE trajectory precomputation.
  • Code: thu-ml/Causal-Forcing

Warp-as-History: Generalizable Camera-Controlled Video Generation from One Training Video

↑ 38 📚 18942 ★ 0 May 13
  • Problem: Existing camera-controlled video generation methods require large-scale camera-annotated data or expensive test-time optimization; no simple way to leverage pretrained video models' latent camera-control capability.
  • Model: "Warp-as-History": converts camera-induced geometric warps into camera-warped pseudo-history fed through pretrained video models' native history pathway, with target-frame positional alignment and visible-token selection.
  • Code: yyfz/Warp-as-History

PhyMotion: Structured 3D Motion Reward for Physics-Grounded Human Video Generation

↑ 7 📚 19915 ★ 0 May 13
  • Problem: Existing video rewards fail to reliably score human motion realism because they operate in 2D pixel space without explicitly modeling 3D body physics and constraints.
  • Model: "PhyMotion": physics-grounded motion reward that recovers SMPL meshes, retargets to MuJoCo simulator, and evaluates motion via three axes (kinematic plausibility, contact/balance, dynamic feasibility)
  • Code: not released

Spatial Single-Cell Study

DUET: Dual-Paradigm Adaptive Expert Triage with Single-cell Inductive Prior for Spatial Transcriptomics Prediction

↑ 0 📚 5699 ★ 0 May 13
  • Problem: Existing methods for inferring spatial gene expression from histology images oversimplify morphology-to-expression mapping and underutilize large-scale single-cell data as biological constraints.
  • Model: DUET: dual-paradigm framework synergizing parametric regression and memory-based retrieval with cellular inductive priors and adaptive expert triage for spatial transcriptomics prediction.
  • Code: Junchao-Zhu/DUET

StateXDiff: Cell State-Contextualized Multimodal Diffusion for Single-Cell Perturbation Prediction

↑ 0 ★ 0 May 15
  • Problem: Predicting drug-induced cellular state changes at single-cell resolution under out-of-distribution conditions with limited multimodal information.
  • Model: StateXDiff: cell State-contextualized multimodal Diffusion framework integrating transcriptomic and pseudo-protein representations with mechanism-aware drug templates via latent conditional diffusion.
  • Code: not released