AITopics | available dataset

Automated Facility Enumeration for Building Compliance Checking using Door Detection and Large Language Models

Zhang, Licheng, Le, Bach, Akhtar, Naveed, Ngo, Tuan

arXiv.org Artificial Intelligence, Sep-29-2025

ABSTRACT: Building compliance checking (BCC) is a critical process for ensuring that constructed facilities meet regulatory standards. A core component of BCC is the accurate enumeration of facility types and their spatial distribution. Despite its importance, this problem has been largely overlooked in the literature, posing a significant challenge for BCC and leaving a critical gap in existing workflows. Performing this task manually is time-consuming and labor-intensive. Recent advances in large language models (LLMs) offer new opportunities to enhance automation by combining visual recognition with reasoning capabilities. In this paper, we introduce a new task for BCC: automated facility enumeration, which involves validating the quantity of each facility type against statutory requirements. To address it, we propose a novel method that integrates door detection with LLM-based reasoning. We are the first to apply LLMs to this task and further enhance their performance through a Chain-of-Thought (CoT) pipeline. Experiments on both real-world and synthetic floor plan data demonstrate the effectiveness and robustness of our method.

PRACTICAL APPLICATIONS: This work demonstrates the potential of LLMs to achieve accurate and generalizable automated facility enumeration.

large language model, machine learning, natural language, (18 more...)

2509.17283

Country: Europe > Switzerland (0.28)

Genre:

Research Report (1.00)
Workflow (0.88)

Industry:

Law (1.00)
Construction & Engineering (1.00)
Government (0.68)
Materials > Construction Materials (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.72)

Exploring Subjective Tasks in Farsi: A Survey Analysis and Evaluation of Language Models

Rooein, Donya, Plaza-del-Arco, Flor Miriam, Nozza, Debora, Hovy, Dirk

arXiv.org Artificial Intelligence, Sep-9-2025

Given Farsi's speaker base of over 127 million people and the growing availability of digital text, including more than 1.3 million articles on Wikipedia, it is considered a middle-resource language. However, this label quickly crumbles when the situation is examined more closely. We focus on three subjective tasks (Sentiment Analysis, Emotion Analysis, and Toxicity Detection) and find significant challenges in data availability and quality, despite the overall increase in data availability. We review 110 publications on subjective tasks in Farsi and observe a lack of publicly available datasets. Furthermore, existing datasets often lack essential demographic factors, such as age and gender, that are crucial for accurately modeling subjectivity in language. When evaluating prediction models using the few available datasets, the results are highly unstable across both datasets and models. Our findings indicate that the volume of data is insufficient to significantly improve a language's prospects in NLP.

large language model, machine learning, natural language, (20 more...)

2509.05719

Country:

Asia (0.93)
North America > Canada (0.46)
North America > Mexico (0.28)

Genre:

Overview (1.00)
Research Report > New Finding (0.66)

Industry: Information Technology > Services (0.47)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
(2 more...)

Political Leaning and Politicalness Classification of Texts

Volf, Matous, Simko, Jakub

arXiv.org Artificial Intelligence, Jul-21-2025

This paper addresses the challenge of automatically classifying text according to political leaning and politicalness using transformer models. We compose a comprehensive overview of existing datasets and models for these tasks, finding that current approaches create siloed solutions that perform poorly on out-of-distribution texts. To address this limitation, we compile a diverse dataset by combining 12 datasets for political leaning classification and creating a new dataset for politicalness by extending 18 existing datasets with the appropriate label. Through extensive benchmarking with leave-one-in and leave-one-out methodologies, we evaluate the performance of existing models and train new ones with enhanced generalization capabilities.
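
The leave-one-out and leave-one-in protocols mentioned in this abstract can be sketched as split generators over a list of dataset names. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the reading of "leave-one-in" as "train on a single dataset, test on all the others" is an assumption that the abstract does not spell out.

```python
def leave_one_out_splits(names):
    """Yield (train_names, test_name): train on all datasets but one,
    then test on the held-out dataset."""
    for held_out in names:
        yield [n for n in names if n != held_out], held_out


def leave_one_in_splits(names):
    """Yield (train_name, test_names): train on a single dataset,
    then test on every other dataset (assumed interpretation)."""
    for kept in names:
        yield kept, [n for n in names if n != kept]


if __name__ == "__main__":
    # Placeholder dataset names, not the ones used in the paper.
    for train, test in leave_one_out_splits(["news_a", "news_b", "social_c"]):
        print("train:", train, "test:", test)
```

Both generators produce one split per dataset, so a model trained under either protocol is always scored on data from a source it never saw during training.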

large language model, machine learning, natural language, (20 more...)

2507.13913

Country:

Europe (1.00)
North America > United States > Maryland (0.28)

Genre: Research Report > New Finding (1.00)

Industry:

Government (1.00)
Media > News (0.93)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(3 more...)

Generalization of Video-Based Heart Rate Estimation Methods To Low Illumination and Elevated Heart Rates

Acharya, Bhargav, Saakyan, William, Hammer, Barbara, Drimalla, Hanna

arXiv.org Artificial Intelligence, Mar-11-2025

Heart rate is a physiological signal that provides information about an individual's health and affective state. Remote photoplethysmography (rPPG) allows the estimation of this signal from video recordings of a person's face. Classical rPPG methods make use of signal processing techniques, while recent rPPG methods utilize deep learning networks. Methods are typically evaluated on datasets collected in well-lit environments with participants at resting heart rates. However, little investigation has been done on how well these methods adapt to variations in illumination and heart rate. In this work, we systematically evaluate representative state-of-the-art methods for remote heart rate estimation. Specifically, we evaluate four classical methods and four deep learning-based rPPG estimation methods in terms of their generalization ability to changing scenarios, including low lighting conditions and elevated heart rates. For a thorough evaluation of existing approaches, we collected a novel dataset called CHILL, which systematically varies heart rate and lighting conditions. The dataset consists of recordings from 45 participants in four different scenarios. The video data was collected under two different lighting conditions (high and low) and normal and elevated heart rates. In addition, we selected two public datasets to conduct within- and cross-dataset evaluations of the rPPG methods. Our experimental results indicate that classical methods are not significantly impacted by low-light conditions. Meanwhile, some deep learning methods were found to be more robust to changes in lighting conditions but encountered challenges in estimating high heart rates. The cross-dataset evaluation revealed that the selected deep learning methods underperformed when influencing factors such as elevated heart rates and low lighting conditions were not present in the training set.

dataset, heart rate, scenario, (16 more...)

2503.11697

Country:

Oceania > Australia (0.04)
Europe > United Kingdom > England > Tyne and Wear > Newcastle (0.04)
Europe > Germany > North Rhine-Westphalia (0.04)

Genre: Research Report (1.00)

Industry: Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

The Heap: A Contamination-Free Multilingual Code Dataset for Evaluating Large Language Models

Katzy, Jonathan, Popescu, Razvan Mihai, van Deursen, Arie, Izadi, Maliheh

arXiv.org Artificial Intelligence, Jan-16-2025

The data-intensive training process of Large Language Models (LLMs) has driven the release of numerous large-scale datasets, particularly for code, to facilitate the development of new models. This rapid increase in the amount of training data used to pre-train LLMs has resulted in extensive datasets covering almost all publicly available code [1]-[3]. To assess the success of such LLMs in downstream tasks, fresh data not seen during training is needed. Otherwise such evaluations are contaminated, possibly resulting in overly optimistic results. Unfortunately, obtaining such non-contaminated data is increasingly difficult. In fact, a recent study establishes …

To cover more specific use cases, we also include domain-specific languages such as Mathematica, Emacs-Lisp, and Coq. A complete list of all languages included in the dataset is presented in Table I.

B. Query

We focus on repositories that have one of the targeted languages as the main language of the repository. We further select only repositories that are licensed under non-permissive licenses. We choose non-permissive licenses as an initial filter for repositories, as many large-scale datasets focus on exclusively unlicensed or permissively licensed code [2], [3], [5].
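
The repository selection described in this excerpt (main language among the targeted languages, non-permissive license) can be illustrated with a small filter. This is a sketch under stated assumptions rather than the authors' pipeline: the `Repo` record, the `select_repos` helper, and the example license set are all hypothetical, since the excerpt does not list the licenses that were actually used.

```python
from dataclasses import dataclass

# Illustrative assumption: a few copyleft SPDX identifiers standing in for
# "non-permissive" licenses. The excerpt does not name the real list.
NON_PERMISSIVE = {"GPL-2.0-only", "GPL-3.0-only", "AGPL-3.0-only"}


@dataclass
class Repo:
    name: str
    main_language: str  # the language reported as the repository's main one
    license_id: str     # SPDX-style license identifier


def select_repos(repos, target_languages, allowed_licenses=NON_PERMISSIVE):
    """Keep repositories whose main language is targeted and whose
    license is in the allowed (non-permissive) set."""
    return [
        r for r in repos
        if r.main_language in target_languages
        and r.license_id in allowed_licenses
    ]


if __name__ == "__main__":
    candidates = [
        Repo("proofs", "Coq", "GPL-3.0-only"),
        Repo("utils", "Coq", "MIT"),
        Repo("scripts", "Python", "GPL-3.0-only"),
    ]
    print([r.name for r in select_repos(candidates, {"Coq", "Emacs-Lisp"})])
```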

dataset, deduplication, duplicate, (12 more...)

2501.09653

Country:

Europe > Netherlands > South Holland > Delft (0.06)
North America > United States > New York > New York County > New York City (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.70)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

ArEEG_Words: Dataset for Envisioned Speech Recognition using EEG for Arabic Words

Darwish, Hazem, Malah, Abdalrahman Al, Jallad, Khloud Al, Ghneim, Nada

arXiv.org Artificial Intelligence, Nov-27-2024

Brain-Computer-Interface (BCI) aims to support communication-impaired patients by translating neural signals into speech. A notable research topic in BCI involves Electroencephalography (EEG) signals that measure the electrical activity in the brain. While significant advancements have been made in BCI EEG research, a major limitation still exists: the scarcity of publicly available EEG datasets for non-English languages, such as Arabic. To address this gap, we introduce the ArEEG_Words dataset, a novel EEG dataset recorded from 22 participants with a mean age of 22 years (5 female, 17 male) using a 14-channel Emotiv Epoc X device. The participants were asked to avoid anything that affects the nervous system, such as coffee, alcohol, and cigarettes, for 8 hours before recording. They were asked to stay calm in a calm room while imagining one of the 16 Arabic words for 10 seconds. The 16 words are commonly used ones such as up, down, left, and right. A total of 352 EEG recordings were collected, and each recording was divided into multiple 250 ms signals, resulting in a total of 15,360 EEG signals. To the best of our knowledge, ArEEG_Words is the first dataset of its kind in the Arabic EEG domain. Moreover, it is publicly available to researchers, and we hope that it will fill the gap in Arabic EEG research.

artificial intelligence, participant, speech recognition, (9 more...)

2411.18888

Country:

Europe > Portugal (0.04)
Europe > Netherlands (0.04)
Europe > Belgium > Flanders (0.04)
Asia > Middle East > Syria (0.04)

Genre: Research Report (1.00)

Industry:

Health & Medicine > Therapeutic Area > Neurology (0.69)
Leisure & Entertainment > Sports > Golf (0.48)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (0.95)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.51)

MLP-SLAM: Multilayer Perceptron-Based Simultaneous Localization and Mapping With a Dynamic and Static Object Discriminator

Li, Taozhe, Sun, Wei

arXiv.org Artificial Intelligence, Oct-14-2024

Visual Simultaneous Localization and Mapping (V-SLAM) systems have seen significant development in recent years, demonstrating high precision in environments with limited dynamic objects. However, their performance deteriorates significantly when they are deployed in settings with a higher presence of movable objects, such as environments with pedestrians, cars, and buses, which are common in outdoor scenes. To address this issue, we propose a Multilayer Perceptron (MLP)-based real-time stereo SLAM system that leverages complete geometry information to avoid information loss. Moreover, there is currently no publicly available dataset for directly evaluating the effectiveness of dynamic and static feature classification methods. To bridge this gap, we have created a publicly available dataset containing over 50,000 feature points. Experimental results demonstrate that our MLP-based dynamic and static feature point discriminator achieves superior performance compared to other methods on this dataset. Furthermore, the MLP-based real-time stereo SLAM system shows the highest average precision and fastest speed on the outdoor KITTI tracking datasets compared to other dynamic SLAM systems. The open-source code and datasets are available at https://github.com/TaozheLi/MLP-SLAM.

artificial intelligence, feature point, machine learning, (15 more...)

2410.10669

Country:

North America > United States > Oklahoma > Cleveland County > Norman (0.14)
North America > United States > Virginia (0.04)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Perceptrons (1.00)

Personalized Knowledge Tracing through Student Representation Reconstruction and Class Imbalance Mitigation

Chen, Zhiyu, Ji, Wei, Xiao, Jing, Liu, Zitao

arXiv.org Artificial Intelligence, Sep-10-2024

Knowledge tracing is a technique that predicts students' future performance by analyzing their learning process through historical interactions with intelligent educational platforms, enabling a precise evaluation of their knowledge mastery. Recent studies have achieved significant progress by leveraging powerful deep neural networks. These models construct complex input representations using questions, skills, and other auxiliary information but overlook individual student characteristics, which limits the capability for personalized assessment. Additionally, the available datasets in the field exhibit class imbalance issues: models that simply predict every response as correct can yield impressive accuracy without substantial effort. In this paper, we propose PKT, a novel approach for personalized knowledge tracing. PKT reconstructs representations from sequences of interactions with a tutoring platform to capture latent information about the students. Moreover, PKT incorporates focal loss to prioritize minority classes, thereby achieving more balanced predictions. Extensive experimental results on four publicly available educational datasets demonstrate the advanced predictive performance of PKT in comparison with 16 state-of-the-art models. To ensure the reproducibility of our research, the code is publicly available at https://anonymous.4open.science/r/PKT.
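
To make the focal loss mentioned in this abstract concrete, here is a minimal sketch of the standard binary form, FL(p_t) = -α_t (1 - p_t)^γ log(p_t). It is not taken from the PKT code: the function name is hypothetical, and the defaults γ = 2 and α = 0.25 are the values commonly used in the focal loss literature, assumed here because the abstract does not state PKT's configuration.

```python
import math


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss for a single binary prediction.

    p: predicted probability of the positive class, with 0 < p < 1
    y: true label, 0 or 1
    gamma: focusing parameter; larger values down-weight easy examples more
    alpha: weight on the positive class (the negative class gets 1 - alpha)
    """
    p_t = p if y == 1 else 1.0 - p              # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-dependent weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    # A confidently correct prediction contributes far less than a wrong one.
    print(binary_focal_loss(0.9, 1))
    print(binary_focal_loss(0.1, 1))
```

With `gamma=0` and `alpha=1` the expression reduces to plain cross-entropy for positive examples, which shows that the `(1 - p_t) ** gamma` factor is what shifts the training signal toward hard, typically minority-class, examples.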

dataset, knowledge, student, (15 more...)

2409.06745

Country:

Asia > China > Guangdong Province > Guangzhou (0.04)
Africa > Middle East > Morocco (0.04)

Genre: Research Report > Promising Solution (0.68)

Industry: Education (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Generation of Granular-Balls for Clustering Based on the Principle of Justifiable Granularity

Jia, Zihang, Zhang, Zhen, Pedrycz, Witold

arXiv.org Artificial Intelligence, May-15-2024

Efficient and robust data clustering remains a challenging task in the field of data analysis. Recent efforts have explored the integration of granular-ball (GB) computing with clustering algorithms to address this challenge, yielding promising results. However, existing methods for generating GBs often rely on single indicators to measure GB quality and employ threshold-based or greedy strategies, potentially leading to GBs that do not accurately capture the underlying data distribution. To address these limitations, this article introduces a novel GB generation method. The originality of this method lies in leveraging the principle of justifiable granularity to measure the quality of a GB for clustering tasks. To be precise, we define the coverage and specificity of a GB and introduce a comprehensive measure for assessing GB quality. Utilizing this quality measure, the method incorporates a binary tree pruning-based strategy and an anomaly detection method to determine the best combination of sub-GBs for each GB and identify abnormal GBs, respectively. Compared to previous GB generation methods, the new method maximizes the overall quality of generated GBs while ensuring alignment with the data distribution, thereby enhancing the rationality of the generated GBs. Experimental results obtained from both synthetic and publicly available datasets underscore the effectiveness of the proposed GB generation method, showcasing improvements in clustering accuracy and normalized mutual information.

algorithm, dataset, information granule, (12 more...)

2405.06904

Country:

Asia > China > Liaoning Province > Dalian (0.04)
North America > Canada > Alberta > Census Division No. 11 > Edmonton Metropolitan Region > Edmonton (0.04)
Europe > Poland > Masovia Province > Warsaw (0.04)
(3 more...)

Genre: Research Report (0.82)

Industry: Education (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.93)

Publicly available datasets of breast histopathology H&E whole-slide images: A scoping review

Tafavvoghi, Masoud, Bongo, Lars Ailo, Shvetsov, Nikita, Busund, Lill-Tove Rasmussen, Møllersen, Kajsa

arXiv.org Artificial Intelligence, Dec-6-2023

Advancements in digital pathology and computing resources have made a significant impact in the field of computational pathology for breast cancer diagnosis and treatment. However, access to high-quality labeled histopathological images of breast cancer is a big challenge that limits the development of accurate and robust deep learning models. In this scoping review, we identified the publicly available datasets of breast H&E stained whole-slide images (WSI) that can be used to develop deep learning algorithms. We systematically searched nine scientific literature databases and nine research data repositories and found 17 publicly available datasets containing 10385 H&E WSIs of breast cancer. Moreover, we reported image metadata and characteristics for each dataset to assist researchers in selecting proper datasets for specific tasks in breast cancer computational pathology. In addition, we compiled two lists of breast H&E patches and private datasets as supplementary resources for researchers. Notably, only 28% of the included articles utilized multiple datasets, and only 14% used an external validation set, suggesting that the performance of other developed models may be susceptible to overestimation. The TCGA-BRCA was used in 52% of the selected studies. This dataset has a considerable selection bias that can impact the robustness and generalizability of the trained algorithms. There is also a lack of consistent metadata reporting of breast WSI datasets that can be an issue in developing accurate deep learning models, indicating the necessity of establishing explicit guidelines for documenting breast WSI dataset characteristics and metadata.

breast cancer, dataset, wsis, (14 more...)

2306.01546

Country:

North America > United States > New York > New York County > New York City (0.14)
Europe > Switzerland > Basel-City > Basel (0.05)
Europe > Norway (0.04)
(14 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)
Overview (1.00)

Industry: Health & Medicine > Therapeutic Area > Oncology > Breast Cancer (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
