global data
- Europe (0.04)
- South America > Brazil > São Paulo (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- (9 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
Tuning LLMs by RAG Principles: Towards LLM-native Memory
Wei, Jiale, Wu, Shuchi, Liu, Ruochen, Ying, Xiang, Shang, Jingbo, Tao, Fangbo
Memory, additional information beyond the training of large language models (LLMs), is crucial to various real-world applications, such as personal assistants. The two mainstream solutions for incorporating memory into the generation process are long-context LLMs and retrieval-augmented generation (RAG). In this paper, we first systematically compare these two types of solutions on three renovated/new datasets and show that (1) long-context solutions, although more expensive, find it easier to capture the big picture and better answer queries that require considering the memory as a whole; and (2) when queries concern specific information, RAG solutions are more competitive, especially when the keywords can be explicitly matched. We therefore propose a novel method, RAG-Tuned-LLM, which fine-tunes a relatively small (e.g., 7B) LLM using data generated following RAG principles, so that it combines the advantages of both solutions. Extensive experiments on three datasets demonstrate that RAG-Tuned-LLM can beat long-context LLMs and RAG methods across a wide range of query types.
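The data-generation step described in this abstract can be pictured with a short sketch. This is a minimal, hypothetical illustration rather than the paper's pipeline: retrieval is approximated by a toy lexical scorer, and the function names (score, build_rag_tuning_records) are invented for the example; the resulting records would then be fed to a standard SFT pipeline for a ~7B model.

```python
# Minimal sketch, assuming the paper's recipe of turning memory + queries into
# RAG-formatted fine-tuning records; the retriever here is a crude stand-in.
from collections import Counter
import json, math

def score(query, chunk):
    """Crude lexical overlap score standing in for a real retriever (e.g. BM25/dense)."""
    q, c = Counter(query.lower().split()), Counter(chunk.lower().split())
    overlap = sum(min(q[t], c[t]) for t in q)
    return overlap / (math.sqrt(len(chunk.split())) + 1e-9)

def build_rag_tuning_records(memory_chunks, qa_pairs, top_k=3):
    """Turn (query, answer) pairs plus a memory corpus into SFT records whose
    prompts embed the top-k retrieved chunks, mimicking the RAG input format."""
    records = []
    for query, answer in qa_pairs:
        ranked = sorted(memory_chunks, key=lambda ch: score(query, ch), reverse=True)
        context = "\n".join(f"- {ch}" for ch in ranked[:top_k])
        prompt = f"Memory:\n{context}\n\nQuestion: {query}\nAnswer:"
        records.append({"prompt": prompt, "completion": " " + answer})
    return records

if __name__ == "__main__":
    memory = ["The user moved to Lisbon in 2022.", "The user is allergic to peanuts."]
    qa = [("Where does the user live?", "Lisbon, since 2022.")]
    print(json.dumps(build_rag_tuning_records(memory, qa), indent=2))
    # The resulting records would then be used to fine-tune a relatively small LLM.
```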
- Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
- Europe > Romania > Sud-Muntenia Development Region > Giurgiu County > Giurgiu (0.04)
- Asia > China > Guangxi Province > Nanning (0.04)
- Research Report > New Finding (0.46)
- Research Report > Promising Solution (0.34)
Contrastive Federated Learning with Tabular Data Silos
Ginanjar, Achmad, Li, Xue, Hua, Wen
Learning from data silos is a difficult task for organizations that need to obtain knowledge of objects that appear in multiple independent data silos. Objects held by multiple organizations, such as government agencies, are referred to by different identifiers, such as driver's license, passport number, and tax file number. The data distributions in data silos are mostly non-IID (Independently and Identically Distributed), unlabeled, and vertically partitioned (i.e., having different attributes). Privacy concerns compound these issues, and together these conditions inhibit enthusiasm for collaborative work. While Federated Learning (FL) has been proposed to address these issues, the difficulty of labeling, namely label costliness, often hinders optimal model performance. A potential solution lies in contrastive learning, an unsupervised self-learning technique that represents semantic data by contrasting similar data pairs. However, contrastive learning is currently not designed to handle tabular data silos that exist across multiple organizations, where data linkage by quasi-identifiers is needed. To address these challenges, we propose using semi-supervised contrastive federated learning, which we refer to as Contrastive Federated Learning with Data Silos (CFL). Our approach tackles the aforementioned issues with an integrated solution. Our experimental results demonstrate that CFL outperforms current methods in addressing these challenges and provides improvements in accuracy. Additionally, we present positive results that showcase the advantages of our contrastive federated learning approach in complex client environments.
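The cross-silo contrastive objective can be sketched in a few lines. This is a minimal illustration under assumptions the abstract does not spell out: two vertically partitioned parties whose rows are already aligned by quasi-identifier linkage, invented module names (TabularEncoder, nt_xent), and an NT-Xent-style loss standing in for the paper's actual objective.

```python
# Minimal sketch: each party encodes its own attribute slice; rows describing the
# same linked entity form positive pairs, all other rows in the batch are negatives.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TabularEncoder(nn.Module):
    """Maps one party's feature slice to a shared embedding space."""
    def __init__(self, in_dim, emb_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, emb_dim))
    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def nt_xent(z_a, z_b, temperature=0.1):
    """Contrastive loss: row i of party A should match row i of party B (same entity)."""
    logits = z_a @ z_b.t() / temperature
    targets = torch.arange(z_a.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Toy batch: 16 linked records, party A holds 5 attributes, party B holds 7.
enc_a, enc_b = TabularEncoder(5), TabularEncoder(7)
x_a, x_b = torch.randn(16, 5), torch.randn(16, 7)
loss = nt_xent(enc_a(x_a), enc_b(x_b))
loss.backward()  # in FL, each party would update its encoder locally and share only updates
print(float(loss))
```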
- Research Report > New Finding (0.89)
- Research Report > Promising Solution (0.88)
- Information Technology > Security & Privacy (1.00)
- Government (1.00)
No Filter: Cultural and Socioeconomic Diversity in Contrastive Vision-Language Models
Pouget, Angéline, Beyer, Lucas, Bugliarello, Emanuele, Wang, Xiao, Steiner, Andreas Peter, Zhai, Xiaohua, Alabdulmohsin, Ibrahim
We study cultural and socioeconomic diversity in contrastive vision-language models (VLMs). Using a broad range of benchmark datasets and evaluation metrics, we bring to attention several important findings. First, the common filtering of training data to English image-text pairs disadvantages communities of lower socioeconomic status and negatively impacts cultural understanding. Notably, this performance gap is not captured by - and even at odds with - the currently popular evaluation metrics derived from the Western-centric ImageNet and COCO datasets. Second, pretraining with global, unfiltered data before fine-tuning on English content can improve cultural understanding without sacrificing performance on said popular benchmarks. Third, we introduce the task of geo-localization as a novel evaluation metric to assess cultural diversity in VLMs. Our work underscores the value of using diverse data to create more inclusive multimodal systems and lays the groundwork for developing VLMs that better represent global perspectives.
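The geo-localization evaluation the abstract introduces can be approximated with an off-the-shelf contrastive VLM. This is only a zero-shot sketch: the public openai/clip-vit-base-patch32 checkpoint stands in for the models studied in the paper, and a five-country list stands in for a real benchmark.

```python
# Minimal sketch of geo-localization as a zero-shot evaluation for a contrastive VLM.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

countries = ["Brazil", "India", "Nigeria", "Japan", "Germany"]
prompts = [f"a photo taken in {c}" for c in countries]

def predict_country(image: Image.Image) -> str:
    """Return the country whose text prompt best matches the image embedding."""
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape: (1, num_countries)
    return countries[int(logits.argmax())]

# Geo-localization accuracy over a geographically labelled image set would then serve
# as the cultural-diversity metric:
# acc = mean(predict_country(img) == true_country for img, true_country in eval_set)
```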
- Europe (0.04)
- South America > Brazil > São Paulo (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- (9 more...)
Text is All You Need: Personalizing ASR Models using Controllable Speech Synthesis
Yang, Karren, Hu, Ting-Yao, Chang, Jen-Hao Rick, Koppula, Hema Swetha, Tuzel, Oncel
Adapting generic speech recognition models to specific individuals is a challenging problem due to the scarcity of personalized data. Recent works have proposed boosting the amount of training data using personalized text-to-speech synthesis. Here, we ask two fundamental questions about this strategy: when is synthetic data effective for personalization, and why is it effective in those cases? To address the first question, we adapt a state-of-the-art automatic speech recognition (ASR) model to target speakers from four benchmark datasets representative of different speaker types. We show that ASR personalization with synthetic data is effective in all cases, but particularly when (i) the target speaker is underrepresented in the global data, and (ii) the capacity of the global model is limited. To address the second question of why personalized synthetic data is effective, we use controllable speech synthesis to generate speech with varied styles and content. Surprisingly, we find that the text content of the synthetic data, rather than style, is important for speaker adaptation. These results lead us to propose a data selection strategy for ASR personalization based on speech content.
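The content-based data selection idea can be illustrated with a toy ranking function. The paper's actual criterion is not given in the abstract, so this sketch simply ranks candidate sentences by vocabulary overlap with the target speaker's domain; select_texts and the example data are hypothetical.

```python
# Minimal sketch of content-based text selection for ASR personalization;
# the TTS synthesis and fine-tuning steps are only indicated in comments.
from collections import Counter

def select_texts(candidates, speaker_domain_texts, k=2):
    """Pick the k candidate sentences whose content best matches the speaker's domain."""
    domain_vocab = Counter(w for t in speaker_domain_texts for w in t.lower().split())
    def content_score(sentence):
        words = sentence.lower().split()
        return sum(domain_vocab[w] for w in words) / len(words)
    return sorted(candidates, key=content_score, reverse=True)[:k]

candidates = [
    "turn on the living room lights",
    "schedule my physical therapy session",
    "what is the weather tomorrow",
]
domain = ["remind me about physical therapy", "book a therapy appointment"]
print(select_texts(candidates, domain))
# Each selected text would be passed to a controllable TTS (varying content, fixed style),
# and the synthetic (audio, text) pairs used to fine-tune the ASR model for the speaker.
```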
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Asia > Middle East > Jordan (0.04)
Leveraging the power of AI and machine learning for more resilient data centers - ET CIO
By Sachin Bhalla

According to market research company Technavio, the global data center market is poised to grow by $304.87 billion between 2020 and 2024, and it will grow at an even faster pace in the Asia-Pacific region. An S&P study estimates that between 2017 and 2022 the Asia-Pacific region would reach a 10% CAGR, compared with the 7% CAGR expected for the global data center industry. The data center industry is changing to serve the needs of today's business landscape. It comes as no surprise, then, when organisations discuss plans to enhance their data center infrastructure with technologies like Artificial Intelligence (AI) and to focus on automation to improve uptime while controlling costs--all of which are important for companies to drive operational efficiency and business resiliency.
- North America > United States (0.18)
- Asia > India (0.06)
On-Device Learning with Cloud-Coordinated Data Augmentation for Extreme Model Personalization in Recommender Systems
Gu, Renjie, Niu, Chaoyue, Yan, Yikai, Wu, Fan, Tang, Shaojie, Jia, Rongfeng, Lyu, Chengfei, Chen, Guihai
Data heterogeneity is an intrinsic property of recommender systems, making models trained over the global data on the cloud, which is the mainstream in industry, non-optimal for each individual user's local data distribution. To deal with data heterogeneity, model personalization with on-device learning is a potential solution. However, on-device training using a user's small number of local samples will incur severe overfitting and undermine the model's generalization ability. In this work, we propose a new device-cloud collaborative learning framework, called CoDA, to break the dilemmas of purely cloud-based learning and on-device learning. The key principle of CoDA is to retrieve similar samples from the cloud's global pool to augment each user's local dataset for training the recommendation model. Specifically, after coarse-grained sample matching on the cloud, a personalized sample classifier is further trained on each device for fine-grained sample filtering, which can learn the boundary between the local data distribution and the outside data distribution. We also build an end-to-end pipeline to support the flows of data, model, computation, and control between the cloud and each device. We have deployed CoDA in a recommendation scenario of Mobile Taobao. Online A/B testing results show the remarkable performance improvement of CoDA over both cloud-based learning without model personalization and on-device training without data augmentation. Overhead testing on a real device demonstrates the computation, storage, and communication efficiency of the on-device tasks in CoDA.
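The two-stage retrieve-then-filter augmentation can be sketched as follows. This is a schematic stand-in, not the deployed system: cosine similarity to the user's centroid plays the role of the cloud's coarse-grained matcher, and a logistic-regression classifier plays the role of the on-device personalized sample filter.

```python
# Minimal sketch of cloud-coordinated augmentation with on-device filtering.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
global_pool = rng.normal(size=(1000, 16))           # cloud's global sample embeddings
local_samples = rng.normal(loc=0.5, size=(50, 16))  # one user's on-device samples

# Stage 1 (cloud): coarse-grained matching - retrieve candidates nearest to the user's centroid.
centroid = local_samples.mean(axis=0)
sims = global_pool @ centroid / (np.linalg.norm(global_pool, axis=1) * np.linalg.norm(centroid))
candidates = global_pool[np.argsort(-sims)[:200]]

# Stage 2 (device): fine-grained filtering - train a classifier to separate the local
# distribution from outside data, then keep only candidates it accepts as "local-like".
X = np.vstack([local_samples, candidates])
y = np.array([1] * len(local_samples) + [0] * len(candidates))
clf = LogisticRegression(max_iter=1000).fit(X, y)
accepted = candidates[clf.predict_proba(candidates)[:, 1] > 0.5]
augmented_local_set = np.vstack([local_samples, accepted])
print(len(accepted), "retrieved samples kept for on-device training")
```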
- Asia > China > Shanghai > Shanghai (0.04)
- North America > United States > Virginia (0.04)
- North America > United States > Texas (0.04)
- Information Technology > Security & Privacy (1.00)
- Information Technology > Services (0.69)
What's Ahead for Data Processing, Hosting Industry?
Artificial intelligence (AI) may have popularized big data, but applications like the internet of things (IoT) are taking it to the next level. The need to analyze large datasets for insights, together with rising internet traffic, is increasing the demand for data and digital storage services. The global data center market should expand by $304.87 billion between 2020 and 2024, according to a Technavio report, putting the current global data center market size at more than $300 billion. According to Internet World Stats, there were more than 4.5 billion Internet users worldwide in mid-2019. That's 58.8% of the global population with Internet access, and the number is growing as internet penetration improves.
- Europe > Lithuania (0.07)
- Asia > Singapore (0.07)
- North America > United States (0.05)
- Information Technology > Cloud Computing (0.84)
- Information Technology > Communications > Networks (0.58)
- Information Technology > Artificial Intelligence (0.53)
How India can become an AI powerhouse
Data is turning out to be more valuable than we thought. Google and Facebook's ad revenues exceeded $200 billion last year. They can soon hope for an even bigger source of income: the Artificial Intelligence (AI) business built using the data of billions of individuals. No wonder getting hold of data by paying top dollar is the new game in the digital world. This may explain the sudden investments of Google, Facebook, Intel, and many others in India, one of the world's largest data generators.
- Health & Medicine > Therapeutic Area (0.52)
- Health & Medicine > Pharmaceuticals & Biotechnology (0.52)
- Information Technology > Services (0.51)