Scripts & Frames
- Oceania > Australia (0.14)
- Asia > Middle East > Jordan (0.04)
- Health & Medicine (0.52)
- Leisure & Entertainment > Games (0.47)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.68)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Scripts & Frames (0.66)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)
Paper: Generalization of Reinforcement Learners with Working and Episodic Memory
We thank the reviewers for their thoughtful and constructive feedback on our manuscript. This should help both contextualize each task's difficulty and illustrate what it involves. Reviewer 3 noted the Section 2 task descriptions could be better presented. We have reformatted it so that "the order We also changed our description of IMP ALA to match Reviewer 5's suggestion. Regarding the task suite, Reviewer 4 raised a thoughtful consideration on whether "most of the findings translate when Some 3D tasks in the suite already have '2D-like' semi-counterparts that do not require navigation, '2D-like' because everything is fully observable and the agent has a first-person point of view from a fixed point, without Spot the Difference level, was overall harder than Change Detection for our ablation models.
- North America > United States > Iowa > Johnson County > Iowa City (0.14)
- North America > United States > California > San Diego County > San Diego (0.04)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- Health & Medicine > Consumer Health (0.41)
- Education > Educational Setting > Continuing Education (0.41)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Scripts & Frames (0.41)
FLEX: A Largescale Multimodal, Multiview Dataset for Learning Structured Representations for Fitness Action Quality Assessment
Yin, Hao, Gu, Lijun, Parmar, Paritosh, Xu, Lin, Guo, Tianxiao, Fu, Weiwei, Zhang, Yang, Zheng, Tianyou
Action Quality Assessment (AQA) -- the task of quantifying how well an action is performed -- has great potential for detecting errors in gym weight training, where accurate feedback is critical to prevent injuries and maximize gains. Existing AQA datasets, however, are limited to single-view competitive sports and RGB video, lacking multimodal signals and professional assessment of fitness actions. We introduce FLEX, the first large-scale, multimodal, multiview dataset for fitness AQA that incorporates surface electromyography (sEMG). FLEX contains over 7,500 multiview recordings of 20 weight-loaded exercises performed by 38 subjects of diverse skill levels, with synchronized RGB video, 3D pose, sEMG, and physiological signals. Expert annotations are organized into a Fitness Knowledge Graph (FKG) linking actions, key steps, error types, and feedback, supporting a compositional scoring function for interpretable quality assessment. FLEX enables multimodal fusion, cross-modal prediction -- including the novel Video$\rightarrow$EMG task -- and biomechanically oriented representation learning. Building on the FKG, we further introduce FLEX-VideoQA, a structured question-answering benchmark with hierarchical queries that drive cross-modal reasoning in vision-language models. Baseline experiments demonstrate that multimodal inputs, multiview video, and fine-grained annotations significantly enhance AQA performance. FLEX thus advances AQA toward richer multimodal settings and provides a foundation for AI-powered fitness assessment and coaching. Dataset and code are available at \href{https://github.com/HaoYin116/FLEX}{https://github.com/HaoYin116/FLEX}. Link to Project \href{https://haoyin116.github.io/FLEX_Dataset}{page}.
- Research Report > New Finding (0.67)
- Research Report > Experimental Study (0.46)
- Health & Medicine > Consumer Health (1.00)
- Leisure & Entertainment > Sports (0.88)
- Health & Medicine > Diagnostic Medicine > Imaging (0.48)
- Health & Medicine > Therapeutic Area > Neurology (0.34)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Scripts & Frames (0.40)
- Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.34)
GENESIS: A Generative Model of Episodic-Semantic Interaction
D'Alessandro, Marco, D'Amato, Leo, Elkano, Mikel, Uriz, Mikel, Pezzulo, Giovanni
A central challenge in cognitive neuroscience is to explain how semantic and episodic memory, two major forms of declarative memory, typically associated with cortical and hippocampal processing, interact to support learning, recall, and imagination. Despite significant advances, we still lack a unified computational framework that jointly accounts for core empirical phenomena across both semantic and episodic processing domains. Here, we introduce the Generative Episodic-Semantic Integration System (GENESIS), a computational model that formalizes memory as the interaction between two limited-capacity generative systems: a Cortical-VAE, supporting semantic learning and generalization, and a Hippocampal-VAE, supporting episodic encoding and retrieval within a retrieval-augmented generation (RAG) architecture. GENESIS reproduces hallmark behavioral findings, including generalization in semantic memory, recognition, serial recall effects and gist-based distortions in episodic memory, and constructive episodic simulation, while capturing their dynamic interactions. The model elucidates how capacity constraints shape the fidelity and memorability of experiences, how semantic processing introduces systematic distortions in episodic recall, and how episodic replay can recombine previous experiences. Together, these results provide a principled account of memory as an active, constructive, and resource-bounded process. GENESIS thus advances a unified theoretical framework that bridges semantic and episodic memory, offering new insights into the generative foundations of human cognition.
- Health & Medicine > Therapeutic Area > Neurology (1.00)
- Health & Medicine > Consumer Health (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
- Information Technology > Artificial Intelligence > Cognitive Science (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Scripts & Frames (0.79)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)
BitMar: Low-Bit Multimodal Fusion with Episodic Memory for Edge Devices
Aman, Euhid, Carlin, Esteban, Pao, Hsing-Kuo, Beltrame, Giovanni, Sari, Ghaluh Indah Permata, Chen, Yie-Tarng
Cross-attention transformers and other multimodal vision-language models excel at grounding and generation; however, their extensive, full-precision backbones make it challenging to deploy them on edge devices. Memory-augmented architectures enhance the utilization of past context; however, most works rarely pair them with aggressive edge-oriented quantization. We introduce BitMar, a quantized multimodal transformer that proposes an external human-like episodic memory for effective image-text generation on hardware with limited resources. BitMar utilizes 1.58-bit encoders, one for text (BitNet-style) and one for vision (DiNOv2-based), to create compact embeddings that are combined and used to query a fixed-size key-value episodic memory. During vector retrieval, the BitNet decoder applies per-layer conditioning, which increases the contextual relevance of generated content. The decoder also employs attention sinks with a sliding-window mechanism to process long or streaming inputs under tight memory budgets. The combination of per-layer conditioning and sliding-window attention achieves a strong quality-speed trade-off, delivering competitive captioning and multimodal understanding at low latency with a small model footprint. These characteristics make BitMar well-suited for edge deployment.
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.95)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Scripts & Frames (0.86)
A Biologically Interpretable Cognitive Architecture for Online Structuring of Episodic Memories into Cognitive Maps
Dzhivelikian, E. A., Panov, A. I.
Cognitive maps provide a powerful framework for understanding spatial and abstract reasoning in biological and artificial agents. While recent computational models link cognitive maps to hippocampal-entorhinal mechanisms, they often rely on global optimization rules (e.g., backpropagation) that lack biological plausibility. In this work, we propose a novel cognitive architecture for structuring episodic memories into cognitive maps using local, Hebbian-like learning rules, compatible with neural substrate constraints. Our model integrates the Successor Features framework with episodic memories, enabling incremental, online learning through agent-environment interaction. We demonstrate its efficacy in a partially observable grid-world, where the architecture autonomously organizes memories into structured representations without centralized optimization. This work bridges computational neuroscience and AI, offering a biologically grounded approach to cognitive map formation in artificial adaptive agents.
- Health & Medicine > Therapeutic Area > Neurology (0.87)
- Health & Medicine > Consumer Health (0.86)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Scripts & Frames (1.00)
- Information Technology > Artificial Intelligence > Cognitive Science (1.00)
Paper: Generalization of Reinforcement Learners with Working and Episodic Memory
We thank the reviewers for their thoughtful and constructive feedback on our manuscript. This should help both contextualize each task's difficulty and illustrate what it involves. Reviewer 3 noted the Section 2 task descriptions could be better presented. We have reformatted it so that "the order We also changed our description of IMP ALA to match Reviewer 5's suggestion. Regarding the task suite, Reviewer 4 raised a thoughtful consideration on whether "most of the findings translate when Some 3D tasks in the suite already have '2D-like' semi-counterparts that do not require navigation, '2D-like' because everything is fully observable and the agent has a first-person point of view from a fixed point, without Spot the Difference level, was overall harder than Change Detection for our ablation models.
STRIVE: Structured Representation Integrating VLM Reasoning for Efficient Object Navigation
Zhu, Haokun, Li, Zongtai, Liu, Zhixuan, Wang, Wenshan, Zhang, Ji, Francis, Jonathan, Oh, Jean
Figure 1: STRIVE can conduct zero-shot object navigation in diverse and complex real-world environments by leveraging our novel multi-layer representation and efficient two-stage navigation policy. Abstract-- Vision-Language Models (VLMs) have been increasingly integrated into object navigation tasks for their rich prior knowledge and strong reasoning abilities. However, applying VLMs to navigation presents two key challenges: effectively parsing and structuring complex environment information and determining when and how to query VLMs. T o address these challenges, we propose a novel framework that incrementally constructs a multi-layer environment representation consisting of viewpoints, object nodes, and room nodes during navigation. Viewpoints and object nodes facilitate intra-room exploration and accurate target localization, while room nodes support efficient inter-room planning. Building on this structured representation, we propose a novel two-stage navigation policy, integrating high-level planning guided by VLM reasoning with low-level VLM-assisted exploration to efficiently and reliably locate a goal object. Object navigation is a fundamental task in robotics, where an agent must locate an instance of a given object category in unknown environments. This task is particularly challenging, as it requires the agent to understand complex visual information, reason about spatial relationships, and make decisions based on both current and past observations. Advances in Vision-Language Models (VLMs) [1], [2], [3] have demonstrated strong capabilities in contextual visual understanding and common-sense reasoning. However, existing approaches often face two significant challenges: First, the input to VLMs typically lacks a structured representation of the environment and is often restricted to local observations.
- Information Technology > Artificial Intelligence > Representation & Reasoning > Scripts & Frames (0.81)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Commonsense Reasoning (0.68)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.67)
Pre-Storage Reasoning for Episodic Memory: Shifting Inference Burden to Memory for Personalized Dialogue
Kim, Sangyeop, Lee, Yohan, Kim, Sanghwa, Kim, Hyunjong, Cho, Sungzoon
Effective long-term memory in conversational AI requires synthesizing information across multiple sessions. However, current systems place excessive reasoning burden on response generation, making performance significantly dependent on model sizes. We introduce PREMem (Pre-storage Reasoning for Episodic Memory), a novel approach that shifts complex reasoning processes from inference to memory construction. PREMem extracts fine-grained memory fragments categorized into factual, experiential, and subjective information; it then establishes explicit relationships between memory items across sessions, capturing evolution patterns like extensions, transformations, and implications. By performing this reasoning during pre-storage rather than when generating a response, PREMem creates enriched representations while reducing computational demands during interactions. Experiments show significant performance improvements across all model sizes, with smaller models achieving results comparable to much larger baselines while maintaining effectiveness even with constrained token budgets. Code and dataset are available at https://github.com/sangyeop-kim/PREMem.
- Asia > Middle East > UAE (0.46)
- North America > United States (0.46)
- Research Report > New Finding (0.67)
- Research Report > Promising Solution (0.48)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.97)
- Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.88)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Scripts & Frames (0.84)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.68)