authentic data
Bemba Speech Translation: Exploring a Low-Resource African Language
Farouq, Muhammad Hazim Al, Wassie, Aman Kassahun, Moslem, Yasmin
This paper describes our system submission to the International Conference on Spoken Language Translation (IWSLT 2025), low-resource languages track, namely for Bemba-to-English speech translation. We built cascaded speech translation systems based on Whisper and NLLB-200, and employed data augmentation techniques, such as back-translation. We investigate the effect of using synthetic data and discuss our experimental setup.
Synthetic vs. Gold: The Role of LLM-Generated Labels and Data in Cyberbullying Detection
Kazemi, Arefeh, Kalaivendan, Sri Balaaji Natarajan, Wagner, Joachim, Qadeer, Hamza, Davis, Brian
This study investigates the role of LLM-generated synthetic data in cyberbullying detection. We conduct a series of experiments where we replace some or all of the authentic data with synthetic data, or augment the authentic data with synthetic data. We find that synthetic cyberbullying data can be the basis for training a classifier for harm detection that reaches performance close to that of a classifier trained with authentic data. Combining authentic with synthetic data shows improvements over the baseline of training on authentic data alone for the test data for all three LLMs tried. These results highlight the viability of synthetic data as a scalable, ethically viable alternative in cyberbullying detection while emphasizing the critical impact of LLM selection on performance outcomes.
Mitigating Health Data Poverty: Generative Approaches versus Resampling for Time-series Clinical Data
Marchesi, Raffaele, Micheletti, Nicolo, Jurman, Giuseppe, Osmani, Venet
Several approaches have been developed to mitigate algorithmic bias stemming from health data poverty, where minority groups are underrepresented in training datasets. Augmenting the minority class using resampling (such as SMOTE) is a widely used approach due to the simplicity of the algorithms. However, these algorithms decrease data variability and may introduce correlations between samples, giving rise to the use of generative approaches based on GANs. Generating high-dimensional, time-series, authentic data that provides wide distribution coverage of the real data remains a challenging task for both resampling and GAN-based approaches. In this work we propose the CA-GAN architecture, which addresses some of the shortcomings of the current approaches, and we provide a detailed comparison with both SMOTE and WGAN-GP*, using a high-dimensional, time-series, real dataset of 3343 hypotensive Caucasian and Black patients. We show that our approach is better at both generating authentic data of the minority class and remaining within the original distribution of the real data.
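The resampling idea mentioned in this abstract can be illustrated with a minimal SMOTE-style sketch. It is not the authors' code and not a reference SMOTE implementation: it treats each minority sample as a flat feature vector (ignoring the time-series structure), and the function name `smote_like` and its parameters are hypothetical. It shows why such synthetic points always lie on line segments between existing samples, which is the source of the reduced variability and correlation the abstract describes.

```python
import math
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    minority: list of equal-length lists of floats (hypothetical flat features).
    k: number of nearest neighbours considered for each base sample.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank all other minority samples by Euclidean distance to the base.
        others = [p for idx, p in enumerate(minority) if idx != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Pick a random position on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


if __name__ == "__main__":
    minority_samples = [[1.0, 2.0], [2.0, 3.0], [3.0, 1.0], [2.0, 2.0]]
    for point in smote_like(minority_samples, n_new=3):
        print(point)
```

Because every generated point is a convex combination of two existing samples, the synthetic data can never leave the region already covered by the minority class.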
When Is TTS Augmentation Through a Pivot Language Useful?
Robinson, Nathaniel, Ogayo, Perez, Gangu, Swetha, Mortensen, David R., Watanabe, Shinji
Developing Automatic Speech Recognition (ASR) for low-resource languages is a challenge due to the small amount of transcribed audio data. For many such languages, audio and text are available separately, but not audio with transcriptions. Using text, speech can be synthetically produced via text-to-speech (TTS) systems. However, many low-resource languages do not have quality TTS systems either. We propose an alternative: produce synthetic audio by running text from the target language through a trained TTS system for a higher-resource pivot language. We investigate when and how this technique is most effective in low-resource settings. In our experiments, using several thousand synthetic TTS text-speech pairs and duplicating authentic data to balance yields optimal results. Our findings suggest that searching over a set of candidate pivot languages can lead to marginal improvements and that, surprisingly, ASR performance can be harmed by increases in measured TTS quality. Application of these findings improves ASR by 64.5% and 45.0% character error reduction rate (CERR) respectively for two low-resource languages: Guaraní and Suba.
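The CERR figures in this abstract can be read with the help of a small sketch. It assumes that CERR means the relative reduction in character error rate (CER) with respect to a baseline, which the abstract does not spell out, and that CER is the character-level edit distance divided by the reference length. The function names and the example strings are hypothetical.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalised by reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def cerr(baseline_cer: float, new_cer: float) -> float:
    """Relative CER reduction in percent (assumed definition of CERR)."""
    if baseline_cer <= 0:
        raise ValueError("baseline CER must be positive")
    return 100.0 * (baseline_cer - new_cer) / baseline_cer


if __name__ == "__main__":
    reference = "synthetic speech"          # made-up transcript
    baseline_output = "sintetic spech"      # made-up baseline hypothesis
    improved_output = "synthetic spech"     # made-up improved hypothesis
    base = cer(reference, baseline_output)
    new = cer(reference, improved_output)
    print(f"baseline CER={base:.3f}, new CER={new:.3f}, CERR={cerr(base, new):.1f}%")
```

Under this reading, a CERR of 64.5% would mean the improved system makes roughly a third as many character errors as the baseline.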
Can Synthetic Data Make AI Better? Discover the Benefits of Synthetic Data
Although artificial intelligence (AI) is advancing at an exponential rate, the technology still has limitations. So, can synthetic data be the solution for all AI-related concerns? In the fourth industrial revolution, every industry sector has discovered the potential of modern technologies such as artificial intelligence (AI) and machine learning (ML). Almost every other organization is deploying AI to create more efficient business processes and to ensure better customer satisfaction. But startups, SOHOs, and small and medium businesses (SMBs) face a major issue when adopting AI: the cold start problem.
Discovering the Benefits of Synthetic Data
Although artificial intelligence (AI) is advancing at an exponential rate, the technology still has limitations. So, can synthetic data be the solution for all AI-related concerns? In the fourth industrial revolution, every industry sector has discovered the potential of modern technologies such as AI and ML. Almost every other organization is deploying AI to create more efficient business processes and to ensure better customer satisfaction. But startups, SOHOs, and small and medium businesses (SMBs) face a major issue when adopting AI: the cold start problem.
EETimes - What Is Synthetic Data and Why Is It Critical for the Future of AI?
Advanced AI development today is still deeply rooted in 1950s computer science philosophies, including the phrase "garbage in, garbage out." The adage reminds us that an AI model is only as good as the data it's trained on. For everything from advanced cancer screenings to suggesting a new movie, data scientists need large and diverse datasets to train AI models. This can be a significant challenge with real-world data. Often protected for privacy reasons, authentic data can be hard to come by and can also be expensive to source, and potentially not as diverse as desired.
Does Synthetic Data Hold The Secret To Artificial Intelligence?
Could synthetic data be the solution to rapidly train artificial intelligence (AI) algorithms? There are advantages and disadvantages to synthetic data; however, many technology experts believe that synthetic data is the key to democratizing machine learning and to accelerating the testing and adoption of artificial intelligence algorithms in our daily lives. Data that a computer artificially manufactures, rather than measuring and collecting it from real-world situations, is called synthetic data. The data is anonymized and created based on user-specified parameters so that it is as close as possible to the properties of data from real-world scenarios. One way to create synthetic data is to use real-world data but strip the identifying aspects, such as names, emails, social security numbers and addresses, from the data set so that it is anonymized.
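The last step described above, stripping identifying aspects from real-world records, can be sketched as follows. This is only an illustration under assumed field names (`name`, `email`, `ssn`, `address`); the article does not describe a concrete schema or tool, and the helper `strip_identifiers` is hypothetical.

```python
from typing import Any, Dict, Iterable, List, Set

# Hypothetical field names standing in for "names, emails, social security
# numbers and addresses"; a real data set would use its own schema.
IDENTIFYING_FIELDS: Set[str] = {"name", "email", "ssn", "address"}


def strip_identifiers(
    records: Iterable[Dict[str, Any]],
    identifying: Set[str] = IDENTIFYING_FIELDS,
) -> List[Dict[str, Any]]:
    """Return copies of the records with the identifying fields removed."""
    return [
        {key: value for key, value in record.items() if key not in identifying}
        for record in records
    ]


if __name__ == "__main__":
    raw = [
        {"name": "Jane Doe", "email": "jane@example.com", "age": 41, "visits": 3},
        {"name": "John Roe", "ssn": "000-00-0000", "age": 35, "visits": 7},
    ]
    for row in strip_identifiers(raw):
        print(row)
```

The input records are left unchanged; only the returned copies have the listed fields dropped.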