AITopics | comprehensive dataset

Why Do Multi-Agent LLM Systems Fail?

Neural Information Processing Systems, Jun-13-2026, 21:47:42 GMT

Despite enthusiasm for Multi-Agent LLM Systems (MAS), their performance gains on popular benchmarks are often minimal. This gap highlights a critical need for a principled understanding of why MAS fail. Addressing this question requires systematic identification and analysis of failure patterns. We introduce MAST-Data, a comprehensive dataset of 1600+ annotated traces collected across 7 popular MAS frameworks. MAST-Data is the first multi-agent system dataset to outline the failure dynamics in MAS for guiding the development of better future systems.

artificial intelligence, mast-data, proceedings, (6 more...)

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)

A Comprehensive Dataset for Underground Miner Detection in Diverse Scenario

Addy, Cyrus, Gurumadaiah, Ajay Kumar, Gao, Yixiang, Awuah-Offei, Kwame

arXiv.org Artificial Intelligence, Jun-27-2025

Underground mining operations face significant safety challenges that make emergency response capabilities crucial. While robots have shown promise in assisting with search and rescue operations, their effectiveness depends on reliable miner detection capabilities. Deep learning algorithms offer potential solutions for automated miner detection, but require comprehensive training datasets, which are currently lacking for underground mining environments. This paper presents a novel thermal imaging dataset specifically designed to enable the development and validation of miner detection systems for potential emergency applications. We systematically captured thermal imagery of various mining activities and scenarios to create a robust foundation for detection algorithms. To establish baseline performance metrics, we evaluated several state-of-the-art object detection algorithms including YOLOv8, YOLOv10, YOLO11, and RT-DETR on our dataset. While not exhaustive of all possible emergency situations, this dataset serves as a crucial first step toward developing reliable thermal-based miner detection systems that could eventually be deployed in real emergency scenarios. This work demonstrates the feasibility of using thermal imaging for miner detection and establishes a foundation for future research in this critical safety application.

artificial intelligence, deep learning, machine learning, (14 more...)

arXiv:2506.21451

Country: North America > United States > Missouri (0.15)

Genre: Research Report > New Finding (0.47)

Industry: Materials > Metals & Mining (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.86)

AceParse: A Comprehensive Dataset with Diverse Structured Texts for Academic Literature Parsing

Ji, Huawei, Deng, Cheng, Xue, Bo, Jin, Zhouyang, Ding, Jiaxin, Gan, Xiaoying, Fu, Luoyi, Wang, Xinbing, Zhou, Chenghu

arXiv.org Artificial Intelligence, Sep-16-2024

With the development of data-centric AI, the focus has shifted from model-driven approaches to improving data quality. Academic literature, a crucial data source, is predominantly stored in PDF format and must be parsed into text before further processing. However, parsing diverse structured texts in academic literature remains challenging due to the lack of datasets that cover various text structures. In this paper, we introduce AceParse, the first comprehensive dataset designed to support the parsing of a wide range of structured texts, including formulas, tables, lists, algorithms, and sentences with embedded mathematical expressions. Based on AceParse, we fine-tuned a multimodal model, named AceParser, which accurately parses various structured texts within academic literature. This model outperforms the previous state-of-the-art by 4.1% in terms of F1 score and by 5% in Jaccard Similarity, demonstrating the potential of multimodal models in academic literature parsing. Our dataset is available at https://github.com/JHW5981/AceParse.
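
The abstract reports gains in F1 score and Jaccard Similarity, but this listing does not say how either metric is computed. As a minimal sketch, assuming both are measured over whitespace-separated tokens of the parsed output (an assumption, not the authors' confirmed protocol), the two metrics could look like this. The function names `token_f1` and `jaccard_similarity` and the sample strings are hypothetical.

```python
from collections import Counter


def token_f1(predicted: str, reference: str) -> float:
    """F1 over whitespace tokens, counting repeated tokens (multiset overlap)."""
    pred_tokens = predicted.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a perfect match; exactly one empty is a total miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def jaccard_similarity(predicted: str, reference: str) -> float:
    """Jaccard similarity over the sets of distinct whitespace tokens."""
    pred_set = set(predicted.split())
    ref_set = set(reference.split())
    if not pred_set and not ref_set:
        return 1.0
    return len(pred_set & ref_set) / len(pred_set | ref_set)


# Hypothetical parser output versus ground truth for one formula.
predicted = r"\frac { a } { b } + c"
reference = r"\frac { a } { b } + d"
print(f"F1: {token_f1(predicted, reference):.4f}")            # F1: 0.8889
print(f"Jaccard: {jaccard_similarity(predicted, reference):.4f}")  # Jaccard: 0.7500
```

F1 here respects token repetition, while Jaccard only compares distinct tokens, which is why a single wrong symbol lowers the two scores by different amounts.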

aceparse, dataset, literature, (14 more...)

arXiv:2409.10016

Country:

Asia > China > Shanghai > Shanghai (0.05)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Europe > Russia (0.04)
(2 more...)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)

LegiLM: A Fine-Tuned Legal Language Model for Data Compliance

Zhu, Linkai, Yang, Lu, Li, Chaofan, Hu, Shanwen, Liu, Lu, Yin, Bin

arXiv.org Artificial Intelligence, Sep-8-2024

Ensuring compliance with international data protection standards for privacy and data security is a crucial but complex task, often requiring substantial legal expertise. This paper introduces LegiLM, a novel legal language model specifically tailored for consulting on data or information compliance. LegiLM leverages a pre-trained GDPR Fines dataset and has been fine-tuned to automatically assess whether particular actions or events breach data security and privacy regulations. By incorporating a specialized dataset that includes global data protection laws, meticulously annotated policy documents, and relevant privacy policies, LegiLM is optimized for addressing data compliance challenges. The model integrates advanced legal reasoning methods and information retrieval enhancements to improve accuracy and reliability in practical legal consulting scenarios. Our evaluation using a custom benchmark dataset demonstrates that LegiLM excels in detecting data regulation breaches, offering sound legal justifications, and recommending necessary compliance modifications, setting a new benchmark for AI-driven legal compliance solutions. Our resources are publicly available at https://github.com/DAOLegalAI/LegiLM

compliance, language model, legilm, (16 more...)

arXiv:2409.13721

Country:

North America > United States > California (0.14)
Asia > Macao (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > China > Beijing > Beijing (0.04)

Genre: Research Report (0.40)

Industry:

Law (1.00)
Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.73)

Intellecta Cognitiva: A Comprehensive Dataset for Advancing Academic Knowledge and Machine Reasoning

PS, Ajmal, PS, Ditto, VG, Jithin

arXiv.org Artificial Intelligence, Apr-13-2024

The Intellecta dataset emerges as an innovative synthetic dataset, engineered to enhance the cognitive processing capabilities of contemporary language models. With a composition of 11.53 billion tokens, integrating 8.01 billion tokens of synthetic data with 3.52 billion tokens of rich textbook data, Intellecta is crafted to foster advanced reasoning and comprehensive educational narrative generation. Leveraging the Mixtral-8x7B-Instruct-v0.1 model, the dataset facilitates the generation of complex thought processes and detailed, textbook-style explanations, thus enabling language models to engage in both critical thinking and profound educational discourse. This hybrid dataset stands as a testament to the potential of synthetic data in pushing the boundaries of AI, offering a repository that is not only vast and varied but also refined to align with ethical standards and intellectual rigor.
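
The token counts quoted in the abstract can be checked with simple arithmetic. The sketch below only restates the figures given above (8.01 billion synthetic, 3.52 billion textbook) and derives each part's share of the total. The helper name `composition_shares` is hypothetical and nothing here comes from the authors' code.

```python
def composition_shares(parts: dict[str, float]) -> dict[str, float]:
    """Return each part's fraction of the summed total."""
    total = sum(parts.values())
    return {name: count / total for name, count in parts.items()}


# Token counts in billions, as stated in the abstract.
intellecta_tokens_b = {"synthetic": 8.01, "textbook": 3.52}

total_b = sum(intellecta_tokens_b.values())
shares = composition_shares(intellecta_tokens_b)

print(f"total: {total_b:.2f}B tokens")  # total: 11.53B tokens
for name, share in shares.items():
    print(f"{name}: {share:.1%}")       # synthetic: 69.5%, textbook: 30.5%
```

The two parts sum to the stated 11.53 billion tokens, with roughly seven in ten tokens being synthetic.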

dataset, instruction, language model, (12 more...)

arXiv:2404.13065

Country: Asia > Myanmar > Tanintharyi Region > Dawei (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Enhancing Formal Theorem Proving: A Comprehensive Dataset for Training AI Models on Coq Code

Florath, Andreas

arXiv.org Artificial Intelligence, Apr-2-2024

In the realm of formal theorem proving, the Coq proof assistant stands out for its rigorous approach to verifying mathematical assertions and software correctness. Despite the advances in artificial intelligence and machine learning, the specialized nature of Coq syntax and semantics poses unique challenges for Large Language Models (LLMs). Addressing this gap, we present a comprehensive dataset specifically designed to enhance LLMs' proficiency in interpreting and generating Coq code. This dataset, derived from a collection of over 10,000 Coq source files, encompasses a wide array of propositions, proofs, and definitions, enriched with metadata including source references and licensing information. Our primary aim is to facilitate the development of LLMs capable of generating syntactically correct and semantically meaningful Coq constructs, thereby advancing the frontier of automated theorem proving. Initial experiments with this dataset have showcased its significant potential; models trained on this data exhibited enhanced accuracy in Coq code generation. Notably, a particular experiment revealed that a fine-tuned LLM was capable of generating 141 valid proofs for a basic lemma, highlighting the dataset's utility in facilitating the discovery of diverse and valid proof strategies. This paper discusses the dataset's composition, the methodology behind its creation, and the implications of our findings for the future of machine learning in formal verification. The dataset is accessible for further research and exploration: https://huggingface.co/datasets/florath/coq-facts-props-proofs-gen0-v1

qed, reflexivity, simpl, (13 more...)

arXiv:2403.12627

Country: Europe > Germany > North Rhine-Westphalia > Cologne Region > Aachen (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)

ArabicaQA: A Comprehensive Dataset for Arabic Question Answering

Abdallah, Abdelrahman, Kasem, Mahmoud, Abdalla, Mahmoud, Mahmoud, Mohamed, Elkasaby, Mohamed, Elbendary, Yasser, Jatowt, Adam

arXiv.org Artificial Intelligence, Mar-26-2024

In this paper, we address the significant gap in Arabic natural language processing (NLP) resources by introducing ArabicaQA, the first large-scale dataset for machine reading comprehension and open-domain question answering in Arabic. This comprehensive dataset, consisting of 89,095 answerable and 3,701 unanswerable questions created by crowdworkers to look similar to answerable ones, along with additional labels of open-domain questions, marks a crucial advancement in Arabic NLP resources. We also present AraDPR, the first dense passage retrieval model trained on the Arabic Wikipedia corpus, specifically designed to tackle the unique challenges of Arabic text retrieval. Furthermore, our study includes extensive benchmarking of large language models (LLMs) for Arabic question answering, critically evaluating their performance in the Arabic language context. In conclusion, ArabicaQA, AraDPR, and the benchmarking of LLMs in Arabic question answering offer significant advancements in the field of Arabic NLP. The dataset and code are publicly accessible for further research at https://github.com/DataScienceUIBK/ArabicaQA.

arabicaqa, arxiv preprint arxiv, dataset, (14 more...)

arXiv:2403.17848

Country:

Europe > Austria > Tyrol > Innsbruck (0.04)
Africa > Middle East > Egypt > Cairo Governorate > Cairo (0.04)
North America > United States > New York > New York County > New York City (0.04)
(5 more...)

Genre: Research Report > New Finding (0.46)

Industry: Education (0.35)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Question Answering (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Big Data and Deep Learning in Smart Cities: A Comprehensive Dataset for AI-Driven Traffic Accident Detection and Computer Vision Systems

Adewopo, Victor, Elsayed, Nelly, Elsayed, Zag, Ozer, Murat, Zekios, Constantinos, Abdelgawad, Ahmed, Bayoumi, Magdy

arXiv.org Artificial Intelligence, Jan-7-2024

In the dynamic urban landscape, where the interplay of vehicles and pedestrians defines the rhythm of life, integrating advanced technology for safety and efficiency is increasingly crucial. This study delves into the application of cutting-edge technological methods in smart cities, focusing on enhancing public safety through improved traffic accident detection. Action recognition plays a pivotal role in interpreting visual data and tracking object motion such as human pose estimation in video sequences. The challenges of action recognition include variability in rapid actions, limited datasets, and environmental factors such as weather, illumination, and occlusions. In this paper, we present a novel comprehensive dataset for traffic accident detection. This dataset is specifically designed to bolster computer vision and action recognition systems in predicting and detecting road traffic accidents. We integrated datasets from a wide variety of data sources, road networks, weather conditions, and regions across the globe. This approach is underpinned by empirical studies, aiming to contribute to the discourse on how technology can enhance the quality of life in densely populated areas. This research aims to bridge existing research gaps by introducing benchmark datasets that leverage state-of-the-art algorithms tailored for traffic accident detection in smart cities. These datasets are expected to advance academic research and also enhance real-time accident detection applications, contributing significantly to the evolution of smart urban environments. Our study marks a pivotal step towards safer, more efficient smart cities, harnessing the power of AI and machine learning to transform urban living.

ai-driven traffic accident detection, big data and deep learning, detection and computer vision system, (2 more...)

arXiv:2401.03587

Genre: Research Report (0.69)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Data Science > Data Mining > Big Data (0.40)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.40)

BaitBuster-Bangla: A Comprehensive Dataset for Clickbait Detection in Bangla with Multi-Feature and Multi-Modal Analysis

Imran, Abdullah Al, Shovon, Md Sakib Hossain, Mridha, M. F.

arXiv.org Artificial Intelligence, Oct-13-2023

This study presents a large multi-modal Bangla YouTube clickbait dataset consisting of 253,070 data points collected through an automated process using the YouTube API and Python web automation frameworks. The dataset contains 18 diverse features categorized into metadata, primary content, engagement statistics, and labels for individual videos from 58 Bangla YouTube channels. A rigorous preprocessing step has been applied to denoise, deduplicate, and remove bias from the features, ensuring unbiased and reliable analysis. As the largest and most robust clickbait corpus in Bangla to date, this dataset provides significant value for natural language processing and data science researchers seeking to advance modeling of clickbait phenomena in low-resource languages. Its multi-modal nature allows for comprehensive analyses of clickbait across content, user interactions, and linguistic dimensions to develop more sophisticated detection methods with cross-linguistic applications.

clickbait detection, comprehensive dataset, multi-feature and multi-modal analysis, (1 more...)

arXiv:2310.11465

Genre: Research Report (0.40)

Industry: Marketing (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language (0.87)

ivrit.ai: A Comprehensive Dataset of Hebrew Speech for AI Research and Development

Marmor, Yanir, Misgav, Kinneret, Lifshitz, Yair

arXiv.org Artificial Intelligence, Jul-17-2023

We introduce "ivrit.ai", a comprehensive Hebrew speech dataset, addressing the distinct lack of extensive, high-quality resources for advancing Automated Speech Recognition (ASR) technology in Hebrew. With over 3,300 speech hours and over a thousand diverse speakers, ivrit.ai offers a substantial compilation of Hebrew speech across various contexts. It is delivered in three forms to cater to varying research needs: raw unprocessed audio, data post-Voice Activity Detection, and partially transcribed data. The dataset stands out for its legal accessibility, permitting use at no cost, thereby serving as a crucial resource for researchers, developers, and commercial entities. ivrit.ai opens up numerous applications, offering vast potential to enhance AI capabilities in Hebrew. Future efforts aim to expand ivrit.ai further, thereby advancing Hebrew's standing in AI research and technology.

large language model, machine learning, natural language, (21 more...)

arXiv:2307.0872

Country:

North America > United States > Pennsylvania (0.04)
Europe > Norway > Central Norway > Trøndelag > Trondheim (0.04)
Asia > Middle East > Israel (0.04)

Genre: Research Report (0.50)

Industry:

Law (0.67)
Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
