Stabilizing Rubric Integration Training via Decoupled Advantage Normalization
This research explores techniques in reinforcement learning.
We propose Process-Aware Policy Optimization (PAPO) to address two limitations of existing reward designs. Outcome reward models evaluate only final-answer correctness, treating all correct responses identically regardless of reasoning quality. Process reward models (PRMs) offer richer supervision, but using PRM scores directly causes reward hacking, where models exploit verbosity to inflate scores while accuracy collapses.
CR-Eyes: A Computational Rational Model of Visual Sampling Behavior in Atari Games
This research presents techniques for computer vision.
CR-Eyes is a computationally rational model that simulates visual sampling and gameplay behavior in Atari games. It is a step toward scalable, theory-grounded user models that support design and evaluation of interactive systems.
Can AI Models Direct Each Other? Organizational Structure as a Probe into Training Limitations
This research explores techniques in machine learning.
Can an expensive AI model effectively direct a cheap one to solve software engineering tasks? We study this question by introducing ManagerWorker, a two-agent pipeline where an expensive "manager" model analyzes issues, dispatches exploration tasks, and reviews implementations. Our findings reveal both the promise and the limits of multi-agent direction.
From Human Cognition to Neural Activations: Probing the Computational Primitives of Spatial Reasoning in LLMs
This research presents techniques for language AI.
It remains unclear whether large language models' performance on spatial reasoning benchmarks reflects structured internal spatial representations or reliance on linguistic heuristics. We examine internal representations using linear probing, sparse-autoencoder-based feature analysis, and causal interventions. We find that task-relevant spatial information is encoded in intermediate layers and can causally influence behavior.
GUIDE: Resolving Domain Bias in GUI Agents through Real-Time Web Video Retrieval and Plug-and-Play Annotation
This research explores techniques in language AI.
GUIDE (GUI Unbiasing via Instructional-Video Driven Expertise) is a plug-and-play framework that resolves GUI agent domain bias by autonomously acquiring domain-specific expertise from web tutorial videos through a retrieval-augmented automated annotation pipeline.
Working Notes on Late Interaction Dynamics: Analyzing Targeted Behaviors of Late Interaction Models
This research explores techniques in machine learning.
While Late Interaction models exhibit strong retrieval performance, many of their underlying dynamics remain understudied, potentially hiding performance bottlenecks. In this work, we focus on two topics: a length bias that arises when using multi-vector scoring and the similarity distribution beyond the best scores pooled by the MaxSim operator.
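The MaxSim operator mentioned above can be illustrated with a minimal sketch. It assumes dot-product similarity and toy two-dimensional token vectors; the function names (maxsim_score, similarity_profile) and the data are hypothetical and do not come from the paper. The sketch shows the standard late-interaction score (for each query token, keep the best-matching document token, then sum) and exposes the similarities that MaxSim discards.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot-product similarity between two token vectors (an assumption)."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: List[Vector], doc_vecs: List[Vector]) -> float:
    """Sum, over query tokens, of the best similarity to any document token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def similarity_profile(query_vecs: List[Vector], doc_vecs: List[Vector]) -> List[List[float]]:
    """Per query token, all similarities sorted in descending order.

    The first entry of each row is the value MaxSim keeps; the rest is the
    distribution "beyond the best scores" that the pooling step ignores.
    """
    return [sorted((dot(q, d) for d in doc_vecs), reverse=True) for q in query_vecs]


# Toy data (made up): two query tokens, one short document, one longer document.
query = [[1.0, 0.0], [0.0, 1.0]]
short_doc = [[0.5, 0.75]]
long_doc = short_doc + [[1.0, 0.0]]  # same document with one extra token

short_score = maxsim_score(query, short_doc)
long_score = maxsim_score(query, long_doc)
profile = similarity_profile(query, long_doc)
```

Because each query token takes a maximum over document tokens, appending tokens to a document can never lower its MaxSim score. This is one mechanism that may contribute to a length effect in multi-vector scoring; whether it is the bias the authors analyze is not stated in the summary.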
Towards GUI Agents: Vision-Language Diffusion Models for GUI Grounding
This research explores techniques in language AI.
Autoregressive (AR) vision-language models have long dominated multimodal understanding, reasoning, and graphical user interface (GUI) grounding. We evaluate whether discrete diffusion vision-language models (DVLMs) can serve as a viable alternative to AR models for GUI grounding. We propose a hybrid masking schedule that combines linear and deterministic masking.
Progressive Learning with Anatomical Priors for Reliable Left Atrial Scar Segmentation from Late Gadolinium Enhancement MRI
This research enhances computer vision.
Cardiac MRI late gadolinium enhancement (LGE) enables non-invasive identification of left atrial (LA) scar, whose spatial distribution is strongly associated with atrial fibrillation (AF) severity and recurrence. However, automatic LA scar segmentation remains challenging due to low contrast, annotation variability, and lack of anatomical constraints, often leading to unreliable predictions.
Sticky and Magnetic: Evaluating Error Correction and User Adaptation in Gaze and Pinch Interaction
This research explores techniques in human-computer interaction.
The gaze-and-pinch framework offers a high-fidelity interaction modality for spatial computing in virtual reality. It remains vulnerable to coordination errors, that is, timing misalignments between gaze fixation and pinch gestures. We investigate two heuristics: STICKY selection (temporal buffer) and MAGNETIC selection (spatial field).
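The summary describes the two heuristics only as a "temporal buffer" and a "spatial field", so the following is one possible reading rather than the authors' implementation. In this sketch, STICKY selection picks the target that gaze most recently rested on within a short window before the pinch, and MAGNETIC selection picks the nearest target whose edge lies within a fixed distance of the gaze point. All names, units, and default values (Target, GazeSample, buffer_ms, field) are hypothetical.

```python
from dataclasses import dataclass
from math import hypot
from typing import List, Optional


@dataclass
class Target:
    name: str
    x: float
    y: float
    radius: float  # hit radius of the target (arbitrary units)


@dataclass
class GazeSample:
    t_ms: float  # timestamp of the gaze sample
    x: float
    y: float


def hit(targets: List[Target], x: float, y: float) -> Optional[Target]:
    """Return the first target that contains the point, or None."""
    for tg in targets:
        if hypot(tg.x - x, tg.y - y) <= tg.radius:
            return tg
    return None


def sticky_select(targets: List[Target], gaze_history: List[GazeSample],
                  pinch_t_ms: float, buffer_ms: float = 150.0) -> Optional[Target]:
    """Temporal buffer (assumed): most recently gazed target within buffer_ms before the pinch."""
    for s in sorted(gaze_history, key=lambda g: g.t_ms, reverse=True):
        if s.t_ms > pinch_t_ms or pinch_t_ms - s.t_ms > buffer_ms:
            continue  # sample is after the pinch or too old
        tg = hit(targets, s.x, s.y)
        if tg is not None:
            return tg
    return None


def magnetic_select(targets: List[Target], gaze_x: float, gaze_y: float,
                    field: float = 0.5) -> Optional[Target]:
    """Spatial field (assumed): nearest target whose edge is within `field` of the gaze point."""
    best, best_d = None, None
    for tg in targets:
        d = max(0.0, hypot(tg.x - gaze_x, tg.y - gaze_y) - tg.radius)
        if d <= field and (best_d is None or d < best_d):
            best, best_d = tg, d
    return best


# Toy scene (made up): gaze leaves target A shortly before the pinch arrives.
targets = [Target("A", 0.0, 0.0, 1.0), Target("B", 5.0, 0.0, 1.0)]
history = [GazeSample(0.0, 0.0, 0.0), GazeSample(100.0, 2.5, 0.0)]
sticky_pick = sticky_select(targets, history, pinch_t_ms=120.0)
magnetic_pick = magnetic_select(targets, 1.3, 0.0)
```

In the toy scene, the gaze has already drifted off target A when the pinch occurs, which is the kind of timing misalignment the summary describes; the temporal buffer recovers A from the recent gaze history, while the spatial field recovers it from a near miss.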
Beyond Code Snippets: Benchmarking LLMs on Repository-Level Question Answering
This research explores techniques in language AI.
Large Language Models (LLMs) have shown impressive capabilities across software engineering tasks, including question answering. We introduce StackRepoQA, the first multi-project, repository-level question answering dataset constructed from 1,318 real developer questions and accepted answers across 134 open-source Java projects.
The Multi-AMR Buffer Storage, Retrieval, and Reshuffling Problem: Exact and Heuristic Approaches
This research explores techniques in machine learning.
Buffer zones are essential in production systems to decouple sequential processes. Automating these zones requires solving the Buffer Storage, Retrieval, and Reshuffling Problem (BSRRP).
Characterizing Scam-Driven Human Trafficking Across Chinese Borders and Online Community Responses on RedNote
This research explores techniques in machine learning.
A new form of human trafficking has emerged across Chinese borders, where individuals are lured to Southeast Asia with fraudulent job offers and then coerced into operating online scams. Despite its massive economic and human toll, this scam-driven trafficking remains underexplored in academic research.
CADSmith: Multi-Agent CAD Generation with Programmatic Geometric Validation
This research presents techniques for language AI.
CADSmith is a multi-agent pipeline that generates CadQuery code from natural language. The generated code then undergoes an iterative refinement process through two nested correction loops. The outer loop combines exact measurements from the OpenCASCADE kernel with holistic visual assessment from an independent vision-language model.
AIRA_2: Overcoming Bottlenecks in AI Research Agents
This research explores techniques in language AI.
AIRA$_2$ addresses three structural performance bottlenecks in AI research agents; for example, synchronous single-GPU execution constrains sample throughput. A Hidden Consistent Evaluation protocol delivers a reliable evaluation signal, and ReAct agents dynamically scope their actions and debug interactively.
CA-TCN: A Causal-Anticausal Temporal Convolutional Network for Direct Auditory Attention Decoding
This research explores techniques in speech processing.
Auditory Attention Decoding (AAD) aims to identify the attended speech stream in a multiple-speaker scenario from neural recordings. Entrainment-based AAD approaches assume access to clean speech sources and electroencephalography (EEG) signals. In this study, we propose CA-TCN, a Causal-Anticausal Temporal Convolutional Network that directly classifies the attended speaker.
Reflect to Inform: Boosting Multimodal Reasoning via Information-Gain-Driven Verification
This research improves language AI.
Multimodal Large Language Models (MLLMs) achieve strong multimodal reasoning performance. As outputs grow longer, models drift away from image evidence and fall back on textual priors, resulting in ungrounded reasoning and hallucinations. We propose Visual Re-Examination (VRE), a self-evolving training framework that enables MLLMs to autonomously perform visual introspection during reasoning without additional visual inputs. VRE promotes iterative self-improvement by leveraging the
CALRK-Bench: Evaluating Context-Aware Legal Reasoning in Korean Law
This research explores techniques in language AI.
Legal reasoning requires understanding of the context in which rules operate. CALRK-Bench provides a new stress test for evaluating context-aware legal reasoning. The benchmark is based on the Korean legal system.
Mitigating the Reasoning Tax in Vision-Language Fine-Tuning with Input-Adaptive Depth Aggregation
This research improves language AI.
Supervised fine-tuning on visual instruction data often improves perceptual capabilities in vision-language models. We propose Input-Adaptive Depth Aggregation (IADA) to make cross-depth retrieval input-adaptive, modality-aware, and efficiently parameterized through a low-rank bottleneck.
PRISMA: Toward a Normative Information Infrastructure for Responsible Pharmaceutical Knowledge Management
This research presents techniques for knowledge management.
Most existing approaches to AI in pharmacy collapse three epistemologically distinct operations into a single technical layer: document preservation, semantic interpretation, and contextual presentation. This conflation is a root cause of recurring fragilities including loss of provenance, interpretive opacity, alert fatigue, and erosion of accountability. This paper proposes the PATOS--Lector--PRISMA infrastructure as a normative information architecture for responsible pharmaceutical knowledge management.
findsylls: A Language-Agnostic Toolkit for Syllable-Level Speech Tokenization and Embedding
This research presents techniques for language AI.
findsylls unifies classical syllable detectors and end-to-end syllabifiers under a common interface for syllable segmentation, embedding extraction, and multi-granular evaluation. The toolkit implements and standardizes widely used methods (e.g., Sylber, VG-HuBERT) and allows their components to be recombined.
GeoGuide: Hierarchical Geometric Guidance for Open-Vocabulary 3D Semantic Segmentation
This research presents techniques for computer vision.
GeoGuide is a novel framework that leverages pretrained 3D models to integrate geometry-semantic consistency for open-vocabulary 3D segmentation. Extensive experiments on ScanNet v2, Matterport3D, and nuScenes demonstrate the superior performance of GeoGuide.
Automatic Speech Recognition for Documenting Endangered Languages: Case Study of Ikema Miyakoan
This research explores techniques in speech processing.
Language endangerment poses a major challenge to linguistic diversity worldwide. Automatic speech recognition (ASR) has shown increasing potential to assist in the transcription of endangered language data. This study focuses on Ikema, a severely endangered Ryukyuan language spoken in Okinawa, Japan.
Physics-Informed Neural Networks and Sequence Encoder: Application to heating and early cooling of thermo-stamping process
This research explores techniques in machine learning.
In previous work, the Sequence Encoder for online dynamical system identification (Elaarabi et al., 2025a) and its combination with PINN (PINN-SE) were introduced and tested on both synthetic and real data scenarios. The results show that combining multiple encoders with the previously proposed method is feasible. Training the model on synthetic data generated from experimental data can help it generalize well to real experimental data.
Sparse Auto-Encoders and Holism about Large Language Models
This research explores techniques in language AI.
Large Language Model (LLM) technology suggests a meta-semantic picture of how words and complex expressions come to have the meaning that they do. It has previously been argued that LLMs adopt a form of holism about meaning. Recent work in mechanistic interpretability presents a challenge to these arguments.
Simulating Novice Students Using Machine Unlearning and Relearning in Large Language Models
This research explores techniques in language AI.
Recent research often relies on prompt engineering with large language models to simulate novice student behaviour, but it is difficult to keep the AI-simulated student at a stable novice knowledge level. Many LLMs are trained to be broadly capable, so even when prompted to "act like a novice," they can still produce expert-level explanations. We propose a knowledge-level simulation approach based on machine unlearning.