AITopics | balanced dataset

balanced dataset

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

OCCGEN: Selection of Real-world Multilingual Parallel Data Balanced in Gender within Occupations

Neural Information Processing Systems, Apr-24-2026, 11:29:24 GMT

This paper describes the OCCGEN toolkit, which allows extracting multilingual parallel data balanced in gender within occupations. OCCGEN can extract datasets that reflect gender diversity (beyond binary) more fairly in society to be further used to explicitly mitigate occupational gender stereotypes. We propose two use cases that extract evaluation datasets for machine translation in four high-resource languages from different linguistic families and in a low-resource African language. Our analysis of these use cases shows that translation outputs in high-resource languages tend to worsen in feminine subsets (compared to masculine), especially in the directions containing English. This is confirmed by the human evaluation. We hypothesize that a sound language generation may contribute to paying less attention to the source sentence and to overgeneralizing to the most frequent gender forms.

artificial intelligence, natural language, occupation, (17 more...)

Neural Information Processing Systems

Country:

North America (0.68)
Europe > Spain (0.28)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Opinion Mining and Analysis Using Hybrid Deep Neural Networks

Hidri, Adel, Alsaif, Suleiman Ali, Alahmari, Muteeb, AlShehri, Eman, Hidri, Minyar Sassi

arXiv.org Artificial Intelligence, Nov-20-2025

Understanding customer attitudes has become a critical component of decision-making due to the growing influence of social media and e-commerce. Text-based opinions are the most structured, hence playing an important role in sentiment analysis. Most of the existing methods, which include lexicon-based approaches and traditional machine learning techniques, are insufficient for handling contextual nuances and scalability. While the latter has limitations in model performance and generalization, deep learning (DL) has achieved improvement, especially on semantic relationship capturing with recurrent neural networks (RNNs) and convolutional neural networks (CNNs). The aim of the study is to enhance opinion mining by introducing a hybrid deep neural network model that combines a bidirectional gated recurrent unit (BGRU) and long short-term memory (LSTM) layers to improve sentiment analysis, particularly addressing challenges such as contextual nuance, scalability, and class imbalance. To substantiate the efficacy of the proposed model, we conducted comprehensive experiments utilizing benchmark datasets, encompassing IMDB movie critiques and Amazon product evaluations. The introduced hybrid BGRU-LSTM (HBGRU-LSTM) architecture attained a testing accuracy of 95%, exceeding the performance of traditional DL frameworks such as LSTM (93.06%), CNN+LSTM (93.31%), and GRU+LSTM (92.20%). Moreover, our model exhibited a noteworthy enhancement in recall for negative sentiments, escalating from 86% (unbalanced dataset) to 96% (balanced dataset), thereby ensuring a more equitable and just sentiment classification. Furthermore, the model diminished misclassification loss from 20.24% for the unbalanced dataset to 13.3% for the balanced dataset, signifying enhanced generalization and resilience.
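The abstract reports recall for the negative class separately from overall accuracy. The sketch below illustrates why the two can diverge on an imbalanced test set. It is not the authors' evaluation code: the helper name `recall_for_class` and the toy labels are assumptions made for illustration only.

```python
import numpy as np

def recall_for_class(y_true, y_pred, cls):
    """Fraction of samples whose true label is `cls` that were predicted as `cls`."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    mask = y_true == cls
    return float((y_pred[mask] == cls).mean())

# Toy, made-up labels: 8 positive (1) and 2 negative (0) reviews.
y_true = np.array([1] * 8 + [0] * 2)
# A classifier biased toward the majority class misses one of the two negatives.
y_pred = np.array([1] * 8 + [1, 0])

accuracy = float((y_true == y_pred).mean())
recall_negative = recall_for_class(y_true, y_pred, cls=0)
recall_positive = recall_for_class(y_true, y_pred, cls=1)

print(f"accuracy={accuracy:.2f}, "
      f"recall(neg)={recall_negative:.2f}, recall(pos)={recall_positive:.2f}")
```

In this toy case overall accuracy is high while half of the minority-class samples are missed, which is the kind of gap a per-class recall figure exposes.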

artificial intelligence, deep learning, machine learning, (17 more...)

arXiv.org Artificial Intelligence

doi: 10.3390/technologies13050175

2511.14796

Country:

Europe (1.00)
Asia > Middle East (0.67)
North America > United States (0.46)
Asia > India (0.46)

Genre:

Overview (1.00)
Research Report > New Finding (0.67)

Industry:

Banking & Finance (0.93)
Media > Film (0.46)
Information Technology > Services > e-Commerce Services (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Based on Data Balancing and Model Improvement for Multi-Label Sentiment Classification Performance Enhancement

Su, Zijin, Lyu, Huanzhu, Niu, Yuren, Liu, Yiming

arXiv.org Artificial Intelligence, Nov-20-2025

Multi-label sentiment classification plays a vital role in natural language processing by detecting multiple emotions within a single text. However, existing datasets like GoEmotions often suffer from severe class imbalance, which hampers model performance, especially for underrepresented emotions. To address this, we constructed a balanced multi-label sentiment dataset by integrating the original GoEmotions data, emotion-labeled samples from Sentiment140 using a RoBERTa-base-GoEmotions model, and manually annotated texts generated by GPT-4 mini. Based on this dataset, we developed an enhanced multi-label classification model that combines pre-trained FastText embeddings, convolutional layers for local feature extraction, bidirectional LSTM for contextual learning, and an attention mechanism to highlight sentiment-relevant words. A sigmoid-activated output layer enables multi-label prediction, and mixed precision training improves computational efficiency. Experimental results demonstrate significant improvements in accuracy, precision, recall, F1-score, and AUC compared to models trained on imbalanced data, highlighting the effectiveness of our approach. Sentiment analysis, a core task in natural language processing, systematically identifies and categorizes opinions expressed in text, typically classifying them as positive, negative, or neutral [1].
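The abstract mentions a sigmoid-activated output layer for multi-label prediction. A minimal sketch of that final step is given below, assuming the network has already produced one logit per emotion. The 0.5 threshold, the function names, and the example logits are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function, mapping each logit to a value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_labels(logits, threshold=0.5):
    """Turn per-label logits into independent 0/1 decisions.

    Unlike softmax, each label is scored on its own, so a text can
    receive several labels at once, or none at all.
    """
    probs = sigmoid(np.asarray(logits, dtype=float))
    return (probs >= threshold).astype(int)

# Made-up logits for two texts over three hypothetical emotion labels.
logits = np.array([[2.0, -1.0, 0.5],
                   [-3.0, -2.0, -0.1]])
preds = predict_labels(logits)
print(preds)
```

The first text is assigned two labels and the second none, which is the behavior that distinguishes a multi-label output from a single-label softmax classifier.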

artificial intelligence, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2511.14073

Country: Asia > China (0.29)

Genre: Research Report > New Finding (0.48)

Industry:

Health & Medicine (0.47)
Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

pressing issue for the research community and that our work is potentially a very important contribution and clearly

Neural Information Processing Systems, Oct-3-2025, 02:11:54 GMT

A formal definition of EI is provided in Frazier et al. 2018

artificial intelligence, classifier, machine learning, (18 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Constructing balanced datasets for predicting failure modes in structural systems under seismic hazards

Kim, Jungho, Kim, Taeyong

arXiv.org Machine Learning, Feb-26-2025

Accurate prediction of structural failure modes under seismic excitations is essential for seismic risk and resilience assessment. Traditional simulation-based approaches often result in imbalanced datasets dominated by non-failure or frequently observed failure scenarios, limiting the effectiveness in machine learning-based prediction. To address this challenge, this study proposes a framework for constructing balanced datasets that include distinct failure modes. The framework consists of three key steps. First, critical ground motion features (GMFs) are identified to effectively represent ground motion time histories. Second, an adaptive algorithm is employed to estimate the probability densities of various failure domains in the space of critical GMFs and structural parameters. Third, samples generated from these probability densities are transformed into ground motion time histories by using a scaling factor optimization process. A balanced dataset is constructed by performing nonlinear response history analyses on structural systems with parameters matching the generated samples, subjected to corresponding transformed ground motion time histories. Deep neural network models are trained on balanced and imbalanced datasets to highlight the importance of dataset balancing. To further evaluate the framework's applicability, numerical investigations are conducted using two different structural models subjected to recorded and synthetic ground motions. The results demonstrate the framework's robustness and effectiveness in addressing dataset imbalance and improving machine learning performance in seismic failure mode prediction.

artificial intelligence, failure mode, machine learning, (17 more...)

arXiv.org Machine Learning

2503.01882

Country:

North America > United States > California > Alameda County > Berkeley (0.14)
Asia (0.14)

Genre: Research Report > New Finding (1.00)

Industry:

Materials > Construction Materials (0.68)
Energy > Oil & Gas > Upstream (0.68)
Government > Regional Government > North America Government > United States Government (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.88)

Online Social Support Detection in Spanish Social Media Texts

Tash, Moein Shahiki, Ramos, Luis, Ahani, Zahra, Monroy, Raul, Kolesnikova, Olga, Calvo, Hiram, Sidorov, Grigori

arXiv.org Artificial Intelligence, Feb-9-2025

The advent of social media has transformed communication, enabling individuals to share their experiences, seek support, and participate in diverse discussions. While extensive research has focused on identifying harmful content like hate speech, the recognition and promotion of positive and supportive interactions remain largely unexplored. This study proposes an innovative approach to detecting online social support in Spanish-language social media texts. We introduce the first annotated dataset specifically created for this task, comprising 3,189 YouTube comments classified as supportive or non-supportive. To address data imbalance, we employed GPT-4o to generate paraphrased comments and create a balanced dataset. We then evaluated social support classification using traditional machine learning models, deep learning architectures, and transformer-based models, including GPT-4o, but only on the unbalanced dataset. Subsequently, we utilized a transformer model to compare the performance between the balanced and unbalanced datasets. Our findings indicate that the balanced dataset yielded improved results for Task 2 (Individual and Group) and Task 3 (Nation, Other, LGBTQ, Black Community, Women, Religion), whereas GPT-4o performed best for Task 1 (Social Support and Non-Support). This study highlights the significance of fostering a supportive online environment and lays the groundwork for future research in automated social support detection.

accuracy, f1-score, weighted f1-score, (16 more...)

arXiv.org Artificial Intelligence

2502.0964

Country:

South America (0.04)
North America > Mexico > Mexico City > Mexico City (0.04)
North America > Central America (0.04)
(2 more...)

Genre: Research Report > New Finding (0.89)

Industry:

Health & Medicine > Therapeutic Area > Psychiatry/Psychology (0.46)
Law > Civil Rights & Constitutional Law (0.46)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Reduced-order modeling and classification of hydrodynamic pattern formation in gravure printing

Rothmann-Brumm, Pauline, Brunton, Steven L., Scherl, Isabel

arXiv.org Artificial Intelligence, Jan-24-2025

Hydrodynamic pattern formation phenomena in printing and coating processes are still not fully understood. However, fundamental understanding is essential to achieve high-quality printed products and to tune printed patterns according to the needs of a specific application like printed electronics, graphical printing, or biomedical printing. The aim of the paper is to develop an automated pattern classification algorithm based on methods from supervised machine learning and reduced-order modeling. We use the HYPA-p dataset, a large image dataset of gravure-printed images, which shows various types of hydrodynamic pattern formation phenomena. It enables the correlation of printing process parameters and resulting printed patterns for the first time. 26880 images of the HYPA-p dataset have been labeled by a human observer as dot patterns, mixed patterns, or finger patterns; 864000 images (97%) are unlabeled. A singular value decomposition (SVD) is used to find the modes of the labeled images and to reduce the dimensionality of the full dataset by truncation and projection. Selected machine learning classification techniques are trained on the reduced-order data. We investigate the effect of several factors, including classifier choice, whether or not fast Fourier transform (FFT) is used to preprocess the labeled images, data balancing, and data normalization. The best performing model is a k-nearest neighbor (kNN) classifier trained on unbalanced, FFT-transformed data with a test error of 3%, which outperforms a human observer by 7%. Data balancing slightly increases the test error of the kNN-model to 5%, but also increases the recall of the mixed class from 90% to 94%. Finally, we demonstrate how the trained models can be used to predict the pattern class of unlabeled images and how the predictions can be correlated to the printing process parameters, in the form of regime maps.
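The pipeline described in the abstract (SVD modes from labeled images, truncation, projection of the full dataset, then a classifier on the reduced data) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the mean-centering step, the Euclidean kNN with k=3, the rank r=2, all function names, and the synthetic random vectors standing in for flattened images are choices made here for the example.

```python
import numpy as np

def fit_modes(X_labeled, r):
    """Compute the mean and the first r SVD modes of the labeled data.

    Rows of X_labeled are flattened images. Centering before the SVD
    is an assumption of this sketch.
    """
    mean = X_labeled.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_labeled - mean, full_matrices=False)
    return mean, Vt[:r]  # truncation: keep only the r leading modes

def project(X, mean, modes):
    """Project data onto the retained modes, giving r coordinates per sample."""
    return (X - mean) @ modes.T

def knn_predict(Z_train, y_train, Z_query, k=3):
    """Plain k-nearest-neighbor majority vote using Euclidean distance."""
    preds = []
    for z in Z_query:
        dists = np.linalg.norm(Z_train - z, axis=1)
        nearest = np.argsort(dists)[:k]
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)

# Synthetic stand-in data: two well-separated classes in 50 dimensions.
rng = np.random.default_rng(0)
X_labeled = np.vstack([rng.normal(0.0, 1.0, (20, 50)),
                       rng.normal(5.0, 1.0, (20, 50))])
y_labeled = np.array([0] * 20 + [1] * 20)
X_unlabeled = np.vstack([rng.normal(0.0, 1.0, (5, 50)),
                         rng.normal(5.0, 1.0, (5, 50))])

mean, modes = fit_modes(X_labeled, r=2)
Z_labeled = project(X_labeled, mean, modes)      # reduced-order training data
Z_unlabeled = project(X_unlabeled, mean, modes)  # same modes reused for new data
preds = knn_predict(Z_labeled, y_labeled, Z_unlabeled, k=3)
print(Z_unlabeled.shape, preds)
```

The key design point is that the modes are fitted once on the labeled subset and then reused to project unlabeled samples, so both live in the same low-dimensional coordinate system before classification.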

artificial intelligence, data quality, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2501.16381

Country:

Europe (0.94)
North America > United States (0.93)

Genre: Research Report (0.82)

Industry: Energy > Oil & Gas > Upstream (1.00)

Making Bias Amplification in Balanced Datasets Directional and Interpretable

Tokas, Bhanu, Nair, Rahul, Kerner, Hannah

arXiv.org Artificial Intelligence, Dec-15-2024

Most of the ML datasets we use today are biased. When we train models on these biased datasets, they often not only learn dataset biases but can also amplify them -- a phenomenon known as bias amplification. Several co-occurrence-based metrics have been proposed to measure bias amplification between a protected attribute A (e.g., gender) and a task T (e.g., cooking). However, these metrics fail to measure biases when A is balanced with T. To measure bias amplification in balanced datasets, recent work proposed a predictability-based metric called leakage amplification. However, leakage amplification cannot identify the direction in which biases are amplified. In this work, we propose a new predictability-based metric called directional predictability amplification (DPA). DPA measures directional bias amplification, even for balanced datasets. Unlike leakage amplification, DPA is easier to interpret and less sensitive to attacker models (a hyperparameter in predictability-based metrics). Our experiments on tabular and image datasets show that DPA is an effective metric for measuring directional bias amplification. The code will be available soon.

amplification, artificial intelligence, machine learning, (14 more...)

arXiv.org Artificial Intelligence

2412.1106

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
North America > United States > Arizona (0.04)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Northeast Materials Database (NEMAD): Enabling Discovery of High Transition Temperature Magnetic Compounds

Itani, Suman, Zhang, Yibo, Zang, Jiadong

arXiv.org Artificial Intelligence, Sep-23-2024

The discovery of novel magnetic materials with greater operating temperature ranges and optimized performance is essential for advanced applications. Current data-driven approaches are challenging and limited due to the lack of accurate, comprehensive, and feature-rich databases. This study aims to address this challenge by introducing a new approach that uses Large Language Models (LLMs) to create a comprehensive, experiment-based, magnetic materials database named the Northeast Materials Database (NEMAD), which consists of 26,706 magnetic materials (www.nemad.org). The database incorporates chemical composition, magnetic phase transition temperatures, structural details, and magnetic properties. Enabled by NEMAD, machine learning models were developed to classify materials and predict transition temperatures. Our classification model achieved an accuracy of 90% in categorizing materials as ferromagnetic (FM), antiferromagnetic (AFM), and non-magnetic (NM). The regression models predict Curie (Néel) temperature with a coefficient of determination (R²) of 0.86 (0.85) and a mean absolute error (MAE) of 62 K (32 K). These models identified 62 (19) FM (AFM) candidates with a predicted Curie (Néel) temperature above 500 K (100 K) from the Materials Project. This work shows the feasibility of combining LLMs for automated data extraction and machine learning models in accelerating the discovery of magnetic materials.

compound, database, magnetic material, (16 more...)

arXiv.org Artificial Intelligence

2409.15675

Country:

North America > United States > New Hampshire (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.82)

Industry: Energy > Renewable (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
