AITopics | quality signal

quality signal

RedPajama: an Open Dataset for Training Large Language Models

Neural Information Processing Systems | Mar-22-2026, 15:02:41 GMT

Large language models are increasingly becoming a cornerstone technology in artificial intelligence, the sciences, and society as a whole, yet the optimal strategies for dataset composition and filtering remain largely elusive. Many of the top-performing models lack transparency in their dataset curation and model development processes, posing an obstacle to the development of fully open language models. In this paper, we identify three core data-related challenges that must be addressed to advance open-source language models. These include (1) transparency in model development, including the data curation process, (2) access to large quantities of high-quality data, and (3) availability of artifacts and metadata for dataset curation and analysis. To address these challenges, we release RedPajama-V1, an open reproduction of the LLaMA training dataset. In addition, we release RedPajama-V2, a massive web-only dataset consisting of raw, unfiltered text data together with quality signals and metadata. Together, the RedPajama datasets comprise over 100 trillion tokens spanning multiple domains and with their quality signals facilitate the filtering of data, aiming to inspire the development of numerous new datasets. To date, these datasets have already been used in the training of strong language models used in production, such as Snowflake Arctic, Salesforce's XGen and AI2's OLMo. To provide insight into the quality of RedPajama, we present a series of analyses and ablation studies with decoder-only language models with up to 1.6B parameters. Our findings demonstrate how quality signals for web data can be effectively leveraged to curate high-quality subsets of the dataset, underscoring the potential of RedPajama to advance the development of transparent and high-performing language models at scale.
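As a rough illustration of how per-document quality signals might be used to curate a subset, the sketch below filters a few toy records with simple threshold rules. The signal names (`word_count`, `perplexity`, `frac_duplicate_lines`), the thresholds, and the record layout are all hypothetical and chosen only for illustration; they are not taken from the RedPajama release.

```python
from typing import Callable, Dict, List

Signals = Dict[str, float]
Rule = Callable[[float], bool]

# Hypothetical rules: each maps a signal name to a pass/fail predicate.
# Names and thresholds are illustrative assumptions, not RedPajama's own.
RULES: Dict[str, Rule] = {
    "word_count": lambda v: 50 <= v <= 100_000,
    "perplexity": lambda v: v <= 500.0,
    "frac_duplicate_lines": lambda v: v <= 0.3,
}


def passes(signals: Signals, rules: Dict[str, Rule] = RULES) -> bool:
    """A document passes only if every rule's signal is present and satisfied."""
    return all(name in signals and rule(signals[name]) for name, rule in rules.items())


def filter_documents(docs: List[dict], rules: Dict[str, Rule] = RULES) -> List[dict]:
    """Keep the documents whose quality signals satisfy all rules."""
    return [doc for doc in docs if passes(doc["quality_signals"], rules)]


# Toy records: "b" is too short, "c" has high perplexity.
docs = [
    {"id": "a", "quality_signals": {"word_count": 420, "perplexity": 180.0, "frac_duplicate_lines": 0.02}},
    {"id": "b", "quality_signals": {"word_count": 12, "perplexity": 95.0, "frac_duplicate_lines": 0.0}},
    {"id": "c", "quality_signals": {"word_count": 900, "perplexity": 1250.0, "frac_duplicate_lines": 0.1}},
]
kept_ids = [doc["id"] for doc in filter_documents(docs)]
```

Keeping the rules in a dictionary means a different filtering recipe is just a different `rules` argument, which is one plausible way to compare curation strategies over the same raw data.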

artificial intelligence, language model, natural language, (10 more...)

Neural Information Processing Systems

Genre: Research Report > New Finding (0.58)

Technology: Information Technology > Artificial Intelligence > Natural Language (1.00)

d34497330b1fd6530f7afd86d0df9f76-Paper-Datasets_and_Benchmarks_Track.pdf

Neural Information Processing Systems | Feb-18-2026, 06:40:43 GMT

large language model, machine learning, natural language, (20 more...)

Neural Information Processing Systems

Country:

North America > United States > California (0.14)
Africa > South Africa (0.14)
North America > United States > Virginia (0.04)
(7 more...)

Genre:

Research Report > New Finding (0.67)
Research Report > Experimental Study (0.45)

Industry:

Law (1.00)
Information Technology (1.00)
Health & Medicine (1.00)
Government > Regional Government > North America Government > United States Government (1.00)

Technology:

Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
(4 more...)

d34497330b1fd6530f7afd86d0df9f76-Paper-Datasets_and_Benchmarks_Track.pdf

Neural Information Processing Systems | Oct-10-2025, 17:37:10 GMT

dataset, quality signal, rpv2, (14 more...)

Neural Information Processing Systems

Country:

North America > United States > California (0.14)
Africa > South Africa (0.14)
North America > United States > Virginia (0.04)
(8 more...)

Genre:

Research Report > New Finding (0.67)
Research Report > Experimental Study (0.45)

Industry:

Law (1.00)
Information Technology (1.00)
Health & Medicine (1.00)
Government > Regional Government > North America Government > United States Government (1.00)

Technology:

Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
(4 more...)

AI Answer Engine Citation Behavior: An Empirical Analysis of the GEO-16 Framework

Kumar, Arlen, Palkhouski, Leanid

arXiv.org Artificial Intelligence | Sep-16-2025

AI answer engines increasingly mediate access to domain knowledge by generating responses and citing web sources. We introduce GEO-16, a 16-pillar auditing framework that converts on-page quality signals into banded pillar scores and a normalized GEO score G that ranges from 0 to 1. Using 70 product-intent prompts, we collected 1,702 citations across three engines (Brave Summary, Google AI Overviews, and Perplexity) and audited 1,100 unique URLs. In our corpus, the engines differed in the GEO quality of the pages they cited, and pillars related to Metadata and Freshness, Semantic HTML, and Structured Data showed the strongest associations with citation. Logistic models with domain-clustered standard errors indicate that overall page quality is a strong predictor of citation, and simple operating points (for example, G at least 0.70 combined with at least 12 pillar hits) align with substantially higher citation rates in our data. We report per-engine contrasts, vertical effects, threshold analysis, and diagnostics, then translate findings into a practical playbook for publishers. The study is observational and focuses on English-language B2B SaaS pages; we discuss limitations, threats to validity, and reproducibility considerations.
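To make the operating point in the abstract concrete, here is a minimal sketch of how banded pillar scores could be turned into a normalized score and checked against the example threshold (G at least 0.70 with at least 12 pillar hits). The abstract does not give the band scale or the definition of a "hit", so the 0 to 3 band range, the hit cutoff of 2, and all function names below are assumptions.

```python
from typing import Sequence

NUM_PILLARS = 16
MAX_BAND = 3   # assumption: each pillar is scored on bands 0..3
HIT_BAND = 2   # assumption: a pillar counts as a "hit" at band 2 or higher


def geo_score(bands: Sequence[int]) -> float:
    """Normalize the summed pillar bands into a score G in [0, 1]."""
    if len(bands) != NUM_PILLARS:
        raise ValueError(f"expected {NUM_PILLARS} pillar bands, got {len(bands)}")
    if any(not 0 <= b <= MAX_BAND for b in bands):
        raise ValueError(f"bands must lie in 0..{MAX_BAND}")
    return sum(bands) / (NUM_PILLARS * MAX_BAND)


def pillar_hits(bands: Sequence[int]) -> int:
    """Count pillars whose band reaches the assumed hit cutoff."""
    return sum(1 for b in bands if b >= HIT_BAND)


def meets_operating_point(bands: Sequence[int], min_g: float = 0.70, min_hits: int = 12) -> bool:
    """Check the example operating point quoted in the abstract."""
    return geo_score(bands) >= min_g and pillar_hits(bands) >= min_hits


# Two made-up audits: one strong page, one weak page.
strong_page = [3] * 10 + [2] * 4 + [1] * 2
weak_page = [3] * 6 + [1] * 10
```

Under these assumptions `strong_page` has G = 40/48 with 14 hits and clears the operating point, while `weak_page` has G = 28/48 with 6 hits and does not.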

engine, large language model, natural language, (14 more...)

arXiv.org Artificial Intelligence

2509.10762

Country: North America > United States > California (0.15)

Genre:

Research Report > Experimental Study (0.69)
Research Report > New Finding (0.69)

Industry: Information Technology (0.50)

Technology:

Information Technology > Information Management (0.91)
Information Technology > Communications > Web (0.70)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.48)

RedPajama: an Open Dataset for Training Large Language Models

Neural Information Processing Systems | May-27-2025, 17:56:31 GMT

dataset, language model, redpajama, (6 more...)

Neural Information Processing Systems

Country: North America > United States > Virginia (0.07)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.63)

RedPajama: an Open Dataset for Training Large Language Models

Weber, Maurice, Fu, Daniel, Anthony, Quentin, Oren, Yonatan, Adams, Shane, Alexandrov, Anton, Lyu, Xiaozhong, Nguyen, Huu, Yao, Xiaozhe, Adams, Virginia, Athiwaratkun, Ben, Chalamala, Rahul, Chen, Kezhen, Ryabinin, Max, Dao, Tri, Liang, Percy, Ré, Christopher, Rish, Irina, Zhang, Ce

arXiv.org Artificial Intelligence | Nov-19-2024

Large language models are increasingly becoming a cornerstone technology in artificial intelligence, the sciences, and society as a whole, yet the optimal strategies for dataset composition and filtering remain largely elusive. Many of the top-performing models lack transparency in their dataset curation and model development processes, posing an obstacle to the development of fully open language models. In this paper, we identify three core data-related challenges that must be addressed to advance open-source language models. These include (1) transparency in model development, including the data curation process, (2) access to large quantities of high-quality data, and (3) availability of artifacts and metadata for dataset curation and analysis. To address these challenges, we release RedPajama-V1, an open reproduction of the LLaMA training dataset. In addition, we release RedPajama-V2, a massive web-only dataset consisting of raw, unfiltered text data together with quality signals and metadata. Together, the RedPajama datasets comprise over 100 trillion tokens spanning multiple domains and with their quality signals facilitate the filtering of data, aiming to inspire the development of numerous new datasets. To date, these datasets have already been used in the training of strong language models used in production, such as Snowflake Arctic, Salesforce's XGen and AI2's OLMo. To provide insight into the quality of RedPajama, we present a series of analyses and ablation studies with decoder-only language models with up to 1.6B parameters. Our findings demonstrate how quality signals for web data can be effectively leveraged to curate high-quality subsets of the dataset, underscoring the potential of RedPajama to advance the development of transparent and high-performing language models at scale.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2411.12372

Country:

North America > United States > California (0.14)
Africa > South Africa (0.14)
North America > United States > Virginia (0.04)
(8 more...)

Genre: Research Report > New Finding (0.85)

Industry:

Law (1.00)
Information Technology (1.00)
Health & Medicine (1.00)
Government > Regional Government > North America Government > United States Government (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Beyond Imitation: Leveraging Fine-grained Quality Signals for Alignment

Guo, Geyang, Zhao, Ranchi, Tang, Tianyi, Zhao, Wayne Xin, Wen, Ji-Rong

arXiv.org Artificial Intelligence | Nov-7-2023

Alignment with human preference is a desired property of large language models (LLMs). Currently, the main alignment approach is based on reinforcement learning from human feedback (RLHF). Despite its effectiveness, RLHF is intricate to implement and train, so recent studies explore alternative alignment approaches based on supervised fine-tuning (SFT). A major limitation of SFT is that it essentially performs imitation learning, which cannot fully capture what the expected behaviors are. To address this issue, we propose an improved alignment approach named FIGA. Unlike prior methods, we incorporate fine-grained (i.e., token- or phrase-level) quality signals that are derived by contrasting good and bad responses. Our approach makes two major contributions. First, we curate a refined alignment dataset that pairs initial responses with the corresponding revised ones. Second, we devise a new loss function that can leverage fine-grained quality signals to instruct the learning of LLMs for alignment. Extensive experiments demonstrate the effectiveness of our approach in comparison with a number of competitive baselines.
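One way to picture token-level quality signals derived by contrasting an initial response with its revision is a plain sequence diff. The sketch below uses Python's standard `difflib.SequenceMatcher` over whitespace tokens and assigns -1 to tokens that appear only in the initial response and +1 to tokens that appear only in the revision. The diff method, the whitespace tokenization, and the +1/-1 weights are assumptions for illustration and are not claimed to be the FIGA procedure.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

Weighted = List[Tuple[str, int]]


def token_quality_signals(initial: str, revised: str) -> Tuple[Weighted, Weighted]:
    """Contrast two responses and return per-token weights for each.

    Tokens removed or replaced in the initial response get -1,
    tokens added or substituted in the revised response get +1,
    and tokens shared by both keep weight 0.
    """
    bad_tokens = initial.split()
    good_tokens = revised.split()
    bad_weights = [0] * len(bad_tokens)
    good_weights = [0] * len(good_tokens)

    matcher = SequenceMatcher(a=bad_tokens, b=good_tokens, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            for i in range(i1, i2):
                bad_weights[i] = -1
        if tag in ("replace", "insert"):
            for j in range(j1, j2):
                good_weights[j] = 1

    return list(zip(bad_tokens, bad_weights)), list(zip(good_tokens, good_weights))


initial_response = "The capital of Australia is Sydney"
revised_response = "The capital of Australia is Canberra"
bad_signals, good_signals = token_quality_signals(initial_response, revised_response)
penalized = [tok for tok, w in bad_signals if w < 0]
rewarded = [tok for tok, w in good_signals if w > 0]
```

In a training setup, weights like these could scale per-token loss terms so that only the tokens that differ between the two responses are emphasized or discouraged, while shared tokens contribute nothing extra.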

arxiv preprint arxiv, dataset, language model, (14 more...)

arXiv.org Artificial Intelligence

2311.04072

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.05)
Asia > Japan > Honshū > Kansai > Osaka Prefecture > Osaka (0.05)
Asia > India > Jharkhand > Ranchi (0.04)
Asia > China > Beijing > Beijing (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Frequency-Based Sleep Stage Detections by Single EEG Derivation in Healthy Human Subjects

Hirai, Nobuhide (Stanford University) | Nishino, Seiji (Stanford University)

AAAI Conferences | Mar-25-2012

The need for sleep monitoring is increasing in modern society. However, sleep stage scoring is time-consuming, and large inconsistencies may exist among scorers. The recording setup is also complicated and usually needs to be professionally prepared. If simple, small equipment could record human EEG and detect sleep stages, it would bring significant benefits to a large population. We thus developed a simple frequency-based sleep stage classifier using a single EEG derivation and evaluated its performance. It showed the potential to work as well as other known automated classifiers. The classifier was not based on specific frequency bands or EEG patterns. It could perform as well with poor-quality signals and could easily be adapted to score other biological signals.

artificial intelligence, classifier, machine learning, (17 more...)

AAAI Conferences

2012 AAAI Spring Symposium Series

Country:

North America > United States > California > Santa Clara County > Palo Alto (0.04)
Asia > Japan (0.04)

Industry: Health & Medicine > Therapeutic Area > Neurology (0.47)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.70)

Mechanisms for Making Crowds Truthful

Jurca, R., Faltings, B.

Journal of Artificial Intelligence Research | Mar-17-2009

We consider schemes for obtaining truthful reports on a common but hidden signal from large groups of rational, self-interested agents. One example is online feedback mechanisms, where users provide observations about the quality of a product or service so that other users can have an accurate idea of what quality they can expect. However, (i) providing such feedback is costly, and (ii) there are many motivations for providing incorrect feedback. Both problems can be addressed by reward schemes which (i) cover the cost of obtaining and reporting feedback, and (ii) maximize the expected reward of a rational agent who reports truthfully. We address the design of such incentive-compatible rewards for feedback generated in environments with pure adverse selection. Here, the correlation between the true knowledge of an agent and her beliefs regarding the likelihoods of reports of other agents can be exploited to make honest reporting a Nash equilibrium. In this paper we extend existing methods for designing incentive-compatible rewards by also considering collusion. We analyze different scenarios, where, for example, some or all of the agents collude. For each scenario we investigate whether a collusion-resistant, incentive-compatible reward scheme exists, and use automated mechanism design to specify an algorithm for deriving an efficient reward mechanism.
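The incentive-compatibility condition described above can be checked numerically for a small case. The sketch below assumes binary signals, a single reference reporter who is truthful, and a payment that depends only on the agent's report and the reference report. The posterior beliefs, the "pay 1 on a match" payment table, and every name in the code are illustrative assumptions. The sketch ignores reporting costs and collusion, which the paper treats separately.

```python
from typing import Dict

# posterior[own_signal][ref_signal]: belief about the reference signal given one's own.
Posterior = Dict[int, Dict[int, float]]
# payment[report][ref_report]: reward paid for a report given the reference report.
Payment = Dict[int, Dict[int, float]]


def expected_payment(own: int, report: int, posterior: Posterior, payment: Payment) -> float:
    """Expected reward of submitting `report` after observing `own`,
    assuming the reference reporter reports her signal truthfully."""
    return sum(prob * payment[report][ref] for ref, prob in posterior[own].items())


def truthful_is_best_response(posterior: Posterior, payment: Payment, margin: float = 0.0) -> bool:
    """True if, for every observed signal, honest reporting earns at least
    `margin` more than any lie (the honest-equilibrium condition)."""
    for own in posterior:
        honest = expected_payment(own, own, posterior, payment)
        for lie in payment:
            if lie != own and honest < expected_payment(own, lie, posterior, payment) + margin:
                return False
    return True


# Hypothetical payment: reward 1 when the report matches the reference report.
match_payment: Payment = {1: {1: 1.0, 0: 0.0}, 0: {1: 0.0, 0: 1.0}}

# Beliefs under which matching rewards make honesty a best response.
posterior_a: Posterior = {1: {1: 0.8, 0: 0.2}, 0: {1: 0.4, 0: 0.6}}
# Beliefs under which an agent who saw 0 still expects the reference to say 1.
posterior_b: Posterior = {1: {1: 0.8, 0: 0.2}, 0: {1: 0.6, 0: 0.4}}

honest_under_a = truthful_is_best_response(posterior_a, match_payment)
honest_under_b = truthful_is_best_response(posterior_b, match_payment)
```

Under `posterior_a` honest reporting is a best response, but under `posterior_b` an agent who observed 0 does better by reporting 1, which suggests why the payments need to be designed against the agents' beliefs rather than fixed in advance.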

agent, equilibrium, mechanism, (14 more...)

Journal of Artificial Intelligence Research

doi: 10.1613/jair.2621

AI Access Foundation

10590

Journal of Artificial Intelligence Research

Country: