AITopics | dataset card

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

A Dataset Card

Neural Information Processing Systems, Feb-8-2026, 14:26:24 GMT

Table 4 contains the full set of topics for the k = 30 LDA model introduced in 4. Personal (7.96%): ive, didnt, thing, bit, thought, week, wanted, started, pretty, id. Art (2.70%): art, design, de, images, ikea, image, painting, collection, piano, photo. C Most Frequent Top-Level Domains. Figure 8: Manually labeled images with watermarks and images related to logos or ads. Sentence | Image | CLIP Similarity. Our new service for teams to manage their fleets for racing.

artificial intelligence, social media, uk 0, (16 more...)

Neural Information Processing Systems

Country:

Europe > Italy > Tuscany (0.04)
Asia > India (0.04)
Asia > China (0.04)

Industry: Retail (0.37)

Technology:

Information Technology > Communications > Social Media (0.30)
Information Technology > Artificial Intelligence (0.30)

No Language Data Left Behind: A Comparative Study of CJK Language Datasets in the Hugging Face Ecosystem

Choi, Dasol, Park, Woomyoung, Song, Youngsook

arXiv.org Artificial Intelligence, Oct-16-2025

Recent advances in Natural Language Processing (NLP) have underscored the crucial role of high-quality datasets in building large language models (LLMs). However, while extensive resources and analyses exist for English, the landscape for East Asian languages - particularly Chinese, Japanese, and Korean (CJK) - remains fragmented and underexplored, despite these languages together serving over 1.6 billion speakers. To address this gap, we investigate the Hugging Face ecosystem from a cross-linguistic perspective, focusing on how cultural norms, research environments, and institutional practices shape dataset availability and quality. Drawing on more than 3,300 datasets, we employ quantitative and qualitative methods to examine how these factors drive distinct creation and curation patterns across Chinese, Japanese, and Korean NLP communities. Our findings highlight the large-scale and often institution-driven nature of Chinese datasets, grassroots community-led development in Korean NLP, and an entertainment- and subculture-focused emphasis in Japanese collections. By uncovering these patterns, we reveal practical strategies for enhancing dataset documentation, licensing clarity, and cross-lingual resource sharing - ultimately guiding more effective and culturally attuned LLM development in East Asia. We conclude by discussing best practices for future dataset curation and collaboration, aiming to strengthen resource development across all three languages.

artificial intelligence, large language model, natural language, (16 more...)

arXiv.org Artificial Intelligence

2507.04329

Country:

North America > United States > Minnesota (0.28)
Asia > East Asia (0.24)

Genre: Research Report > New Finding (0.34)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.89)

SIEVE: Towards Verifiable Certification for Code-datasets

Mbodji, Fatou Ndiaye, Diallo, El-hacen, Samhi, Jordan, Liu, Kui, Klein, Jacques, Bissyande, Tegawendé F.

arXiv.org Artificial Intelligence, Oct-3-2025

Code agents and empirical software engineering rely on public code datasets, yet these datasets lack verifiable quality guarantees. Static 'dataset cards' inform, but they are neither auditable nor do they offer statistical guarantees, making it difficult to attest to dataset quality. Teams build isolated, ad-hoc cleaning pipelines. This fragments effort and raises cost. We present SIEVE, a community-driven framework. It turns per-property checks into Confidence Cards: machine-readable, verifiable certificates with anytime-valid statistical bounds. We outline a research plan to bring SIEVE to maturity, replacing narrative cards with anytime-verifiable certification. This shift is expected to lower quality-assurance costs and increase trust in code-datasets.

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2510.02166

Country:

North America > United States (0.31)
Europe > Middle East > Malta (0.14)

Genre:

Research Report (0.65)
Personal > Interview (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (0.71)
Information Technology > Artificial Intelligence > Machine Learning (0.47)

SAGE: A Realistic Benchmark for Semantic Understanding

Goel, Samarth, Lee, Reagan J., Ramchandran, Kannan

arXiv.org Artificial Intelligence, Sep-26-2025

As large language models (LLMs) achieve strong performance on traditional benchmarks, there is an urgent need for more challenging evaluation frameworks that probe deeper aspects of semantic understanding. We introduce SAGE (Semantic Alignment & Generalization Evaluation), a rigorous benchmark designed to assess both embedding models and similarity metrics across five categories: Human Preference Alignment, Transformation Robustness, Information Sensitivity, Clustering Performance, and Retrieval Robustness. Unlike existing benchmarks that focus on isolated capabilities, SAGE evaluates semantic understanding through adversarial conditions, noisy transformations, and nuanced human judgment tasks across 30+ datasets. Our comprehensive evaluation of 9 embedding models and classical metrics reveals significant performance gaps, with no single approach excelling across all dimensions. For instance, while state-of-the-art embedding models like OpenAI's text-embedding-3-large dominate in aligning with human preferences (0.682 vs. 0.591 for the best classical metric), they are significantly outperformed by classical metrics on information sensitivity tasks, where Jaccard Similarity achieves a score of 0.905 compared to the top embedding score of 0.794. SAGE further uncovers critical trade-offs: OpenAI's text-embedding-3-small achieves the highest clustering performance (0.483) but demonstrates extreme brittleness with the lowest robustness score (0.011). SAGE exposes critical limitations in current semantic understanding capabilities and provides a more realistic assessment of model robustness for real-world deployment.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2509.21310

Country:

Europe (1.00)
North America > United States > California > Alameda County > Berkeley (0.14)

Genre: Research Report (1.00)

Industry: Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Navigating Dataset Documentations in AI: A Large-Scale Analysis of Dataset Cards on Hugging Face

Yang, Xinyu, Liang, Weixin, Zou, James

arXiv.org Artificial Intelligence, Jan-24-2024

Advances in machine learning are closely tied to the creation of datasets. While data documentation is widely recognized as essential to the reliability, reproducibility, and transparency of ML, we lack a systematic empirical understanding of current dataset documentation practices. To shed light on this question, here we take Hugging Face, one of the largest platforms for sharing and collaborating on ML models and datasets, as a prominent case study. By analyzing all 7,433 dataset cards on Hugging Face, our investigation provides an overview of the Hugging Face dataset ecosystem and insights into dataset documentation practices, yielding 5 main findings: (1) The dataset card completion rate shows marked heterogeneity correlated with dataset popularity. (2) A granular examination of each section within the dataset card reveals that practitioners seem to prioritize the Dataset Description and Dataset Structure sections, while the Considerations for Using the Data section receives the lowest proportion of content. (3) By analyzing the subsections within each section and utilizing topic modeling to identify key topics, we uncover what is discussed in each section, and underscore significant themes encompassing both technical and social impacts, as well as limitations within the Considerations for Using the Data section. (4) Our findings also highlight the need for improved accessibility and reproducibility of datasets in the Usage sections. (5) In addition, our human annotation evaluation emphasizes the pivotal role of comprehensive dataset content in shaping individuals' perceptions of a dataset card's overall quality. Overall, our study offers a unique perspective on analyzing dataset documentation through large-scale data science analysis and underlines the need for more thorough dataset documentation in machine learning research.

dataset, dataset card, documentation, (16 more...)

arXiv.org Artificial Intelligence

2401.13822

Country:

North America > United States > California > Santa Clara County > Palo Alto (0.04)
South America > Paraguay > Asunción > Asunción (0.04)
North America > United States > New York > New York County > New York City (0.04)
(3 more...)

Genre:

Overview (1.00)
Research Report > Experimental Study (0.68)
Research Report > New Finding (0.48)

Industry: Social Sector (0.49)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

The State of Documentation Practices of Third-party Machine Learning Models and Datasets

Oreamuno, Ernesto Lang, Khan, Rohan Faiyaz, Bangash, Abdul Ali, Stinson, Catherine, Adams, Bram

arXiv.org Artificial Intelligence, Dec-22-2023

Model stores offer third-party ML models and datasets for easy project integration, minimizing coding efforts. One might hope to find detailed specifications of these models and datasets in the documentation, leveraging documentation standards such as model and dataset cards. In this study, we use statistical analysis and hybrid card sorting to assess the state of the practice of documenting model cards and dataset cards in one of the largest model stores in use today, Hugging Face (HF). Our findings show that only 21,902 models (39.62%) and 1,925 datasets (28.48%) have documentation. Furthermore, we observe inconsistency in ethics and transparency-related documentation for ML models and datasets.

dataset, documentation, model card, (14 more...)

arXiv.org Artificial Intelligence

2312.15058

Country:

North America > United States (0.28)
North America > Canada > Alberta (0.14)

Genre: Research Report > New Finding (1.00)

Industry: Government > Regional Government (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

WikiMT++ Dataset Card

Zhou, Monan, Wu, Shangda, Wang, Yuan, Li, Wei

arXiv.org Artificial Intelligence, Sep-23-2023

Table 1 shows the specific names and number of classes of genre and emotion labels. WikiMT++ is an expanded and refined version of WikiMusicText (WikiMT), featuring 1010 curated lead sheets in ABC notation. To expand application scenarios of WikiMT, we add both objective (album, lyrics, video) and subjective emotion (12 emotion adjectives) and emo_4q (Russell 4Q) attributes, enhancing its usability for music information retrieval, conditional music generation, automatic composition, and emotion classification, etc.

2.1 Attributes from WikiMT or Information: The titles, artists, genres, and descriptions are directly inherited from WikiMT. However, as they were originally curated from openly accessible sources, potential constraints and wrongs still exist. For better precision and completeness, we update these attributes through CLaMP

dataset, information, lyric, (16 more...)

arXiv.org Artificial Intelligence

2309.13259

Country:

Europe > Italy > Lombardy > Milan (0.05)
Asia > China > Shanghai > Shanghai (0.05)
Asia > China > Beijing > Beijing (0.05)

Genre: Research Report (0.40)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language (0.38)
