candidate data
LAMDAS: LLM as an Implicit Classifier for Domain-specific Data Selection
Wu, Jian, Yu, Hang, Liu, Bingchang, Yang, Wenjie, Di, Peng, Li, Jianguo, Zhang, Yue
Adapting large language models (LLMs) to specific domains often faces a critical bottleneck: the scarcity of high-quality, human-curated data. While large volumes of unchecked data are readily available, indiscriminately using them for fine-tuning risks introducing noise and degrading performance. Strategic data selection is thus crucial, requiring a method that is both accurate and efficient. Existing approaches, categorized as similarity-based and direct optimization methods, struggle to simultaneously achieve these goals. In this paper, we introduce LAMDAS (LLM As an iMplicit classifier for domain-specific DAta Selection), a novel approach that leverages the pre-trained LLM itself as an implicit classifier, thereby bypassing explicit feature engineering and computationally intensive optimization processes. LAMDAS reframes data selection as a one-class classification problem, identifying candidate data that "belongs" to the target domain defined by a small reference dataset. Extensive experimental results demonstrate that LAMDAS not only exceeds the performance of full-data training while using a fraction of the data but also outperforms nine state-of-the-art (SOTA) baselines under various scenarios. Furthermore, LAMDAS achieves the most compelling balance between performance gains and computational efficiency among all evaluated baselines.
InsBank: Evolving Instruction Subset for Ongoing Alignment
Shi, Jiayi, Li, Yiwei, Feng, Shaoxiong, Yuan, Peiwen, Wang, Xinglin, Zhang, Yueqi, Tan, Chuyi, Pan, Boyuan, Ren, Huan, Hu, Yao, Li, Kan
Large language models (LLMs) typically undergo instruction tuning to enhance alignment. Recent studies emphasize that quality and diversity of instruction data are more crucial than quantity, highlighting the need to select diverse, high-quality subsets to reduce training costs. However, how to evolve these selected subsets alongside the development of new instruction data remains insufficiently explored. To achieve LLMs' ongoing alignment, we introduce Instruction Bank (InsBank), a continuously updated repository that integrates the latest valuable instruction data. We further propose Progressive Instruction Bank Evolution (PIBE), a novel framework designed to evolve InsBank effectively and efficiently over time. PIBE employs a gradual data selection strategy to maintain long-term efficiency, leveraging a representation-based diversity score to capture relationships between data points and retain historical information for comprehensive diversity evaluation. This also allows for flexible combination of diversity and quality scores during data selection and ranking. Extensive experiments demonstrate that PIBE significantly outperforms baselines in InsBank evolution and is able to extract budget-specific subsets, demonstrating its effectiveness and adaptability.
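The idea of combining a quality score with a representation-based diversity score during selection can be illustrated with a small sketch. This is not the PIBE algorithm from the abstract, whose details are not given here. It is a generic greedy selection under stated assumptions: each item carries a hypothetical quality score in [0, 1] and an embedding vector, diversity is taken to be the cosine distance to the nearest already-selected item, and `alpha` is an assumed mixing weight.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def select_subset(items, budget, alpha=0.5):
    """Greedily pick up to `budget` items from a list of (quality, embedding).

    Assumed scoring rule (illustrative only):
        alpha * quality + (1 - alpha) * diversity
    Returns the selected indices in the order they were picked.
    """
    selected = []
    remaining = list(range(len(items)))
    while remaining and len(selected) < budget:

        def combined(i):
            quality, emb = items[i]
            # Diversity: distance to the closest already-selected item
            # (defaults to 1.0 while nothing has been selected yet).
            diversity = min(
                (cosine_distance(emb, items[j][1]) for j in selected),
                default=1.0,
            )
            return alpha * quality + (1 - alpha) * diversity

        best = max(remaining, key=combined)
        selected.append(best)
        remaining.remove(best)
    return selected


# Two near-duplicate high-quality items and one distinct lower-quality item.
items = [(0.9, (1.0, 0.0)), (0.8, (1.0, 0.0)), (0.5, (0.0, 1.0))]
balanced = select_subset(items, budget=2, alpha=0.5)
quality_only = select_subset(items, budget=2, alpha=1.0)
```

With `alpha=1.0` the selection reduces to ranking by quality alone, while lower values of `alpha` favor items that differ from what has already been chosen. Because the budget is a parameter, the same routine can return subsets of different sizes.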
Active machine learning for spatio-temporal predictions using feature embedding
Aryandoust, Arsam, Pfenninger, Stefan
Active learning (AL) could contribute to solving critical environmental problems through improved spatiotemporal predictions. Yet such predictions involve high-dimensional feature spaces with mixed data types and missing data, which existing methods have difficulties dealing with. Here, we propose a novel batch AL method that fills this gap. We encode and cluster features of candidate data points, and query the best data based on the distance of embedded features to their cluster centers. We introduce a new metric of informativeness that we call embedding entropy and a general class of neural networks that we call embedding networks for using it. Empirical tests on forecasting electricity demand show a simultaneous reduction in average prediction RMSE by up to 63-88% and data usage by up to 50-69% compared to passive learning (PL) benchmarks. Examples include the electricity consumption of buildings, required to operate sustainable power grids; the travel time between city zones, required for the smart charging of electric vehicles; and meteorological conditions, required for weather-based forecasting of wind and solar electricity generation. Sensing and labeling the ground truth data that is necessary for making these predictions in time and space usually comes at a high cost. This cost constrains the total number of sensors that we can place and use to query new data. A fundamental question that arises for many spatiotemporal prediction tasks is where and when to measure and query the data required to make the best possible predictions while staying within a maximum budget for sensors and data.
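The selection step described in the abstract (cluster the embedded features, then query points based on their distance to their cluster centers) can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: the embeddings are taken as given, plain Lloyd's k-means stands in for whatever clustering the paper uses, and the `farthest` flag is hypothetical, since the abstract does not say whether points near or far from the centers are preferred. The embedding entropy metric is not reproduced here.

```python
import math
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_center(point, centers):
    """Index of the center closest to `point`."""
    return min(range(len(centers)), key=lambda c: squared_distance(point, centers[c]))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns (centers, assignment)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        assignment = [nearest_center(p, centers) for p in points]
        # Move each center to the mean of its members (keep it if empty).
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    # Final assignment so that labels match the returned centers.
    assignment = [nearest_center(p, centers) for p in points]
    return centers, assignment


def select_batch(embedded, k, budget, farthest=True):
    """Return indices of up to `budget` candidates, ranked by the distance
    of each embedded point to its own cluster center."""
    centers, assignment = kmeans(embedded, k)
    dist = [
        math.sqrt(squared_distance(p, centers[a]))
        for p, a in zip(embedded, assignment)
    ]
    order = sorted(range(len(embedded)), key=lambda i: dist[i], reverse=farthest)
    return order[:budget]


# Four points near the origin and one outlier, clustered with k=1.
candidates = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (10.0, 10.0)]
most_atypical = select_batch(candidates, k=1, budget=1, farthest=True)
most_typical = select_batch(candidates, k=1, budget=1, farthest=False)
```

Ranking by distance to the cluster center gives one cheap proxy for how typical or atypical a candidate is within its cluster. Which end of the ranking is more informative depends on the task and is left as a parameter here.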
How is AI Changing the World of Assessments?
Until the last few years, artificial intelligence existed only in the domain of science fiction and fantasy. Today, however, it has become part of everyday life, in social as well as business environments. From the military, automotive, agriculture, legal, and healthcare sectors to education, this technology has touched almost every field, impacting human lives to a great extent. AI systems are capable of reducing human effort in numerous areas. Their applications help get work done faster and with accurate results.
Interview: Ashutosh Garg, CEO at Eightfold.ai - insideBIGDATA
I recently caught up with Ashutosh Garg, CEO at Eightfold.ai, to discuss how he and his team have deployed AI and machine learning to help with the needs of the talent management industry. With 6000 research citations, 50 patents, 35 peer-reviewed research publications, and the outstanding Ph.D. thesis award from UIUC for his Ph.D. thesis in Machine Learning, it's fair to say that Ashutosh is one of the world's experts in machine learning. After his time managing Search and Personalization efforts at both Google and IBM Research, Ashutosh founded Bloomreach, a leading vendor for Digital Experience Platforms. Now, he is applying his experience to the problem he is most truly passionate about: helping the world's talent find their most meaningful and fulfilling work. Can you give us a sense for what form of AI/machine learning is being used in your product?
3 Recruitment Tasks Supercharged with Artificial Intelligence
The recruitment process continues to lengthen as the search for highly skilled talent increases, the fear of making a bad hire remains, and the quality of active candidates is lacking. The average time to fill in 2014 (22.9 days) was nearly double that of 2010 (12.6 days), and many reports indicate that the number increased further from 2014 to 2016. The average time to fill in 2016 is now at a record high of 29 days, according to the DHI-DFH Vacancy Duration Measure, which analyzed the entire US labor market. Well, reviewing a resume is only the beginning. The mundane tasks of pre-screening, interviewing, validating, and reference/background checking candidates are where the holdup lies.