AITopics

Genre: Research Report (0.59)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Neural Information Processing SystemsDec-26-2025, 19:51:44 GMT

Revealing the unseen: Benchmarking video action recognition under occlusion

In this work, we study the effect of occlusion on video action recognition. Tofacilitate this study, we propose three benchmark datasets and experiment withseven different video action recognition models. These datasets include two synthetic benchmarks, UCF-101-O and K-400-O, which enabled understanding the effects of fundamental properties of occlusion via controlled experiments. We also propose a real-world occlusion dataset, UCF-101-Y-OCC, which helps in further validating the findings of this study. We find several interesting insights such as 1) transformers are more robust than CNN counterparts, 2) pretraining make modelsrobust against occlusions, and 3) augmentation helps, but does not generalize well to real-world occlusions. In addition, we propose a simple transformer based compositional model, termed as CTx-Net, which generalizes well under this distribution shift. We observe that CTx-Net outperforms models which are trained using occlusions as augmentation, performing significantly better under natural occlusions.

benchmarking video action recognition, name change, occlusion, (5 more...)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.40)

Tekaya, Nidham, Waldner, Manuela, Zeppelzauer, Matthias

A Matter of Time: Revealing the Structure of Time in Vision-Language Models

arXiv.org Artificial IntelligenceOct-23-2025

Large-scale vision-language models (VLMs) such as CLIP have gained popularity for their generalizable and expressive multimodal representations. By leveraging large-scale training data with diverse textual metadata, VLMs acquire open-vocabulary capabilities, solving tasks beyond their training scope. This paper investigates the temporal awareness of VLMs, assessing their ability to position visual content in time. We introduce TIME10k, a benchmark dataset of over 10,000 images with temporal ground truth, and evaluate the time-awareness of 37 VLMs by a novel methodology. Our investigation reveals that temporal information is structured along a low-dimensional, non-linear manifold in the VLM embedding space. Based on this insight, we propose methods to derive an explicit ``timeline'' representation from the embedding space. These representations model time and its chronological progression and thereby facilitate temporal reasoning tasks. Our timeline approaches achieve competitive to superior accuracy compared to a prompt-based baseline while being computationally efficient. All code and data are available at https://tekayanidham.github.io/timeline-page/.

large language model, machine learning, natural language, (15 more...)

doi: 10.1145/3746027.3758163

2510.19559

Country: Europe > Austria (0.28)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.47)

arXiv.org Artificial IntelligenceOct-22-2025

AndroidControl-Curated: Revealing the True Potential of GUI Agents through Benchmark Purification

Leung, Ho Fai, Xi, Xiaoyan, Zuo, Fei

On-device virtual assistants like Siri and Google Assistant are increasingly pivotal, yet their capabilities are hamstrung by a reliance on rigid, developer-dependent APIs. GUI agents offer a powerful, API-independent alternative, but their adoption is hindered by the perception of poor performance, as even the best models (e.g. Qwen3-VL-235B) scores are capped at around 60% on benchmarks like AndroidControl, far from viability for real-world use. Our research reveals that issue lies not only with the models but with the benchmarks themselves. We identified notable shortcomings in AndroidControl, including ambiguities and factual errors, which systematically underrates agent capabilities. To address this critical oversight, we enhanced AndroidControl into AndroidControl-Curated, a refined version of the benchmark improved through a rigorous purification pipeline. On this enhanced benchmark, state-of-the-art models achieve success rates nearing 75% on complex tasks (15% improvement), reflecting that on-device GUI agents are actually closer to practical deployment than previously thought. We introduce our new SOTA model, Magma-R1- 3B, post-trained on just 2.4k curated samples using 60 hours of an H20 GPU (approximately $60). Despite being 200 times smaller in parameters, this model delivers performance comparable to Qwen3- VL-235B. We release both AndroidControl-Curated benchmark and Magma-R1 model to the research community, encouraging adoption of this enhanced benchmark to better reflect model capabilities and accelerate the development of robust, on-device virtual assistants.

artificial intelligence, chatbot, natural language, (14 more...)

2510.18488

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.94)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.68)

Bassini, Chiara Fusar, Xu, Alice Lixuan, Canales, Jorge Sánchez, Hirth, Lion, Kaack, Lynn H.

Revealing the empirical flexibility of gas units through deep clustering

arXiv.org Artificial IntelligenceSep-5-2025

The flexibility of a power generation unit determines how quickly and often it can ramp up or down. In energy models, it depends on assumptions on the technical characteristics of the unit, such as its installed capacity or turbine technology. In this paper, we learn the empirical flexibility of gas units from their electricity generation, revealing how real-world limitations can lead to substantial differences between units with similar technical characteristics. Using a novel deep clustering approach, we transform 5 years (2019-2023) of unit-level hourly generation data for 49 German units from 100 MWp of installed capacity into low-dimensional embeddings. Our unsupervised approach identifies two clusters of peaker units (high flexibility) and two clusters of non-peaker units (low flexibility). The estimated ramp rates of non-peakers, which constitute half of the sample, display a low empirical flexibility, comparable to coal units. Non-peakers, predominantly owned by industry and municipal utilities, show limited response to low residual load and negative prices, generating on average 1.3 GWh during those hours. As the transition to renewables increases market variability, regulatory changes will be needed to unlock this flexibility potential.

artificial intelligence, data mining, machine learning, (18 more...)

2504.16943

Country: Europe > Germany (0.29)

Genre: Research Report > New Finding (0.94)

Industry:

Energy > Power Industry (1.00)
Government > Regional Government > Europe Government (0.46)
Materials > Metals & Mining > Coal (0.34)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

Neural Information Processing SystemsJan-20-2025, 00:28:53 GMT

Towards Revealing the Mystery behind Chain of Thought: A Theoretical Perspective

Recent studies have discovered that Chain-of-Thought prompting (CoT) can dramatically improve the performance of Large Language Models (LLMs), particularly when dealing with complex tasks involving mathematics or reasoning. Despite the enormous empirical success, the underlying mechanisms behind CoT and how it unlocks the potential of LLMs remain elusive. In this paper, we take a first step towards theoretically answering these questions. Specifically, we examine the expressivity of LLMs with CoT in solving fundamental mathematical and decision-making problems. By using circuit complexity theory, we first give impossibility results showing that bounded-depth Transformers are unable to directly produce correct answers for basic arithmetic/equation tasks unless the model size grows super-polynomially with respect to the input length. In contrast, we then prove by construction that autoregressive Transformers of constant size suffice to solve both tasks by generating CoT derivations using a commonly used math language format.

decision-making problem, revealing, theoretical perspective, (4 more...)

Genre: Research Report (0.62)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Neural Information Processing SystemsJan-19-2025, 22:48:02 GMT

Revealing the unseen: Benchmarking video action recognition under occlusion

In this work, we study the effect of occlusion on video action recognition. Tofacilitate this study, we propose three benchmark datasets and experiment withseven different video action recognition models. These datasets include two synthetic benchmarks, UCF-101-O and K-400-O, which enabled understanding the effects of fundamental properties of occlusion via controlled experiments. We also propose a real-world occlusion dataset, UCF-101-Y-OCC, which helps in further validating the findings of this study. We find several interesting insights such as 1) transformers are more robust than CNN counterparts, 2) pretraining make modelsrobust against occlusions, and 3) augmentation helps, but does not generalize well to real-world occlusions.

benchmarking video action recognition, occlusion, video action recognition, (4 more...)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Vision (0.96)

Daily Mail - Science & techJan-7-2025, 21:36:01 GMT

World's first 'city of the future' welcomes first residents who'll live there rent-free... but there's a catch

The world's first'city of the future' is nearly ready to welcome its first residents. Developed by car maker Toyota, 'Woven City' sits at the base of Mount Fuji in Japan and features at least 11 'smart' homes powered by hydrogen, AI and other technologies. CEO Akio Toyoda said the 10 billion utopia would serve as a'lab' for innovators to develop the technologies of tomorrow. The city is poised to welcome its first 100 residents, which will be employees, this fall, who will live there for free -- though they'll need to already be Toyota employees and work on developing experimental tech for the company. The program will then expand to 2,200 more people, who will include innovators and their families, parents and pets.

resident, toyota, woven city, (10 more...)

Daily Mail - Science & tech

Country:

Asia > Japan (0.63)
North America > United States > Nevada > Clark County > Las Vegas (0.06)

Genre: Press Release (0.37)

Industry: Automobiles & Trucks > Manufacturer (1.00)

Technology: Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (0.80)

Islam, Md Mirajul, Uddin, Md Nahiyan, Hasana, Maoyejatun, Pandit, Debojit, Rahman, Nafis Mahmud, Chellappan, Sriram, Azam, Sami, Islam, A. B. M. Alim Al

Revealing the Self: Brainwave-Based Human Trait Identification

arXiv.org Artificial IntelligenceDec-25-2024

People exhibit unique emotional responses. In the same scenario, the emotional reactions of two individuals can be either similar or vastly different. For instance, consider one person's reaction to an invitation to smoke versus another person's response to a query about their sleep quality. The identification of these individual traits through the observation of common physical parameters opens the door to a wide range of applications, including psychological analysis, criminology, disease prediction, addiction control, and more. While there has been previous research in the fields of psychometrics, inertial sensors, computer vision, and audio analysis, this paper introduces a novel technique for identifying human traits in real time using brainwave data. To achieve this, we begin with an extensive study of brainwave data collected from 80 participants using a portable EEG headset. We also conduct a statistical analysis of the collected data utilizing box plots. Our analysis uncovers several new insights, leading us to a groundbreaking unified approach for identifying diverse human traits by leveraging machine learning techniques on EEG data. Our analysis demonstrates that this proposed solution achieves high accuracy. Moreover, we explore two deep-learning models to compare the performance of our solution. Consequently, we have developed an integrated, real-time trait identification solution using EEG data, based on the insights from our analysis. To validate our approach, we conducted a rigorous user evaluation with an additional 20 participants. The outcomes of this evaluation illustrate both high accuracy and favorable user ratings, emphasizing the robust potential of our proposed method to serve as a versatile solution for human trait identification.

accuracy, artificial intelligence, machine learning, (19 more...)

2412.19041

Country: North America > United States (0.68)

Genre: Research Report > New Finding (0.68)

Industry:

Health & Medicine > Therapeutic Area > Neurology (1.00)
Health & Medicine > Consumer Health (0.94)
Health & Medicine > Therapeutic Area > Psychiatry/Psychology > Mental Health (0.54)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.95)

arXiv.org Artificial IntelligenceNov-14-2024

Revealing the Evolution of Order in Materials Microstructures Using Multi-Modal Computer Vision

Ter-Petrosyan, Arman, Holden, Michael, Bilbrey, Jenna A., Akers, Sarah, Doty, Christina, Yano, Kayla H., Wang, Le, Paudel, Rajendra, Lang, Eric, Hattar, Khalid, Comes, Ryan B., Du, Yingge, Matthews, Bethany E., Spurgeon, Steven R.

The development of high-performance materials for microelectronics, energy storage, and extreme environments depends on our ability to describe and direct property-defining microstructural order. Our present understanding is typically derived from laborious manual analysis of imaging and spectroscopy data, which is difficult to scale, challenging to reproduce, and lacks the ability to reveal latent associations needed for mechanistic models. Here, we demonstrate a multi-modal machine learning (ML) approach to describe order from electron microscopy analysis of the complex oxide La$_{1-x}$Sr$_x$FeO$_3$. We construct a hybrid pipeline based on fully and semi-supervised classification, allowing us to evaluate both the characteristics of each data modality and the value each modality adds to the ensemble. We observe distinct differences in the performance of uni- and multi-modal models, from which we draw general lessons in describing crystal order using computer vision.

artificial intelligence, evolution, machine learning, (19 more...)

2411.09896

Country:

North America > United States > Tennessee > Knox County > Knoxville (0.14)
North America > United States > Delaware > New Castle County > Newark (0.14)
North America > United States > Colorado > Boulder County > Boulder (0.14)
(7 more...)

Genre: Research Report (0.64)

Industry:

Government > Regional Government > North America Government > United States Government (0.68)
Energy > Renewable (0.68)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)