The Zeno's Paradox of 'Low-Resource' Languages
Nigatu, Hellina Hailu, Tonja, Atnafu Lambebo, Rosman, Benjamin, Solorio, Thamar, Choudhury, Monojit
The disparity in the languages commonly studied in Natural Language Processing (NLP) is typically reflected by referring to languages as low- vs. high-resourced. However, there is limited consensus on what exactly qualifies as a 'low-resource language.' To understand how NLP papers define and study 'low-resource' languages, we qualitatively analyzed 150 papers from the ACL Anthology and popular speech-processing conferences that mention the keyword 'low-resource.' Based on our analysis, we show how several interacting axes contribute to the 'low-resourcedness' of a language and why that makes it difficult to track progress for each individual language. We hope our work (1) elicits explicit definitions of the terminology when it is used in papers and (2) provides grounding for the different axes to consider when connoting a language as low-resource.
Rephrasing natural text data with different languages and quality levels for Large Language Model pre-training
Pieler, Michael, Bellagente, Marco, Teufel, Hannah, Phung, Duy, Cooper, Nathan, Tow, Jonathan, Rocha, Paulo, Adithyan, Reshinth, Alyafeai, Zaid, Pinnaparaju, Nikhil, Zhuravinskyi, Maksym, Riquelme, Carlos
Recently published work on rephrasing natural text data for pre-training LLMs has shown promising results when combining the original dataset with the synthetically rephrased data. We build upon previous work by replicating existing results on C4 and extending them with our optimized rephrasing pipeline to the English, German, Italian, and Spanish Oscar subsets of CulturaX. Our pipeline leads to increased performance on standard evaluation benchmarks in both the mono- and multilingual setup. In addition, we provide a detailed study of our pipeline, investigating the choice of the base dataset and LLM for the rephrasing, as well as the relationship between the model size and the performance after pre-training. By exploring data with different perceived quality levels, we show that gains decrease with higher quality. Furthermore, we find the difference in performance between model families to be bigger than between different model sizes. This highlights the necessity for detailed tests before choosing an LLM to rephrase large amounts of data. Moreover, we investigate the effect of pre-training with synthetic data on supervised fine-tuning. Here, we find increasing but inconclusive results that depend heavily on the benchmark used. These results (again) highlight the need for better benchmarking setups. In summary, we show that rephrasing multilingual and low-quality data is a very promising direction to extend LLM pre-training data.
Magnetic Milli-spinner for Robotic Endovascular Surgery
Wu, Shuai, Leanza, Sophie, Lu, Lu, Chang, Yilong, Li, Qi, Stone, Diego, Zhao, Ruike Renee
Vascular diseases such as thrombosis, atherosclerosis, and aneurysm, which can lead to blockage of blood flow or blood vessel rupture, are common and life-threatening. Conventional minimally invasive treatments utilize catheters, or long tubes, to guide small devices or therapeutic agents to targeted regions for intervention. Unfortunately, catheters suffer from difficult and unreliable navigation in narrow, winding vessels such as those found in the brain. Magnetically actuated untethered robots, which have been extensively explored as an alternative, are promising for navigation in complex vasculatures and vascular disease treatments. Most current robots, however, cannot swim against high flows or are inadequate in treating certain conditions. Here, we introduce a multifunctional and magnetically actuated milli-spinner robot for rapid navigation and performance of various treatments in complicated vasculatures. The milli-spinner, with a unique hollow structure including helical fins and slits for propulsion, generates a distinct flow field upon spinning. The milli-spinner is the fastest-ever untethered magnetic robot for movement in tubular environments, easily achieving speeds of 23 cm/s, demonstrating promise as an untethered medical device for effective navigation in blood vessels and robotic treatment of numerous vascular diseases.
LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior
Wang, Hanyu, Suri, Saksham, Ren, Yixuan, Chen, Hao, Shrivastava, Abhinav
We present LARP, a novel video tokenizer designed to overcome limitations in current video tokenization methods for autoregressive (AR) generative models. Unlike traditional patchwise tokenizers that directly encode local visual patches into discrete tokens, LARP introduces a holistic tokenization scheme that gathers information from the visual content using a set of learned holistic queries. This design allows LARP to capture more global and semantic representations, rather than being limited to local patch-level information. Furthermore, it offers flexibility by supporting an arbitrary number of discrete tokens, enabling adaptive and efficient tokenization based on the specific requirements of the task. To align the discrete token space with downstream AR generation tasks, LARP integrates a lightweight AR transformer as a training-time prior model that predicts the next token on its discrete latent space. By incorporating the prior model during training, LARP learns a latent space that is not only optimized for video reconstruction but is also structured in a way that is more conducive to autoregressive generation. Moreover, this process defines a sequential order for the discrete tokens, progressively pushing them toward an optimal configuration during training, ensuring smoother and more accurate AR generation at inference time. Training proceeds in two stages: first, the LARP tokenizer is trained with a lightweight AR prior model to learn an AR-friendly latent space; second, an AR generative model is trained on LARP's discrete tokens to synthesize high-fidelity videos. Comprehensive experiments demonstrate LARP's strong performance, achieving state-of-the-art FVD on the UCF101 class-conditional video generation benchmark.
Decoding Reading Goals from Eye Movements
Shubi, Omer, Hadar, Cfir Avraham, Berzak, Yevgeni
Readers can have different goals with respect to the text they are reading. Can these goals be decoded from the pattern of their eye movements over the text? In this work, we examine for the first time whether it is possible to decode two types of reading goals that are common in daily life: information seeking and ordinary reading. Using large scale eye-tracking data, we apply to this task a wide range of state-of-the-art models for eye movements and text that cover different architectural and data representation strategies, and further introduce a new model ensemble. We systematically evaluate these models at three levels of generalization: new textual item, new participant, and the combination of both. We find that eye movements contain highly valuable signals for this task. We further perform an error analysis which builds on prior empirical findings on differences between ordinary reading and information seeking and leverages rich textual annotations. This analysis reveals key properties of textual items and participant eye movements that contribute to the difficulty of the task.
LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations
Orgad, Hadas, Toker, Michael, Gekhman, Zorik, Reichart, Roi, Szpektor, Idan, Kotek, Hadas, Belinkov, Yonatan
Large language models (LLMs) often produce errors, including factual inaccuracies, biases, and reasoning failures, collectively referred to as "hallucinations". Recent studies have demonstrated that LLMs' internal states encode information regarding the truthfulness of their outputs, and that this information can be utilized to detect errors. In this work, we show that the internal representations of LLMs encode much more information about truthfulness than previously recognized. We first discover that the truthfulness information is concentrated in specific tokens, and leveraging this property significantly enhances error detection performance. Yet, we show that such error detectors fail to generalize across datasets, implying that -- contrary to prior claims -- truthfulness encoding is not universal but rather multifaceted. Next, we show that internal representations can also be used for predicting the types of errors the model is likely to make, facilitating the development of tailored mitigation strategies. Lastly, we reveal a discrepancy between LLMs' internal encoding and external behavior: they may encode the correct answer, yet consistently generate an incorrect one. Taken together, these insights deepen our understanding of LLM errors from the model's internal perspective, which can guide future research on enhancing error analysis and mitigation.
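The idea of detecting errors from internal states can be illustrated with a generic linear probe. The sketch below is not the authors' detector: it trains a plain logistic-regression probe on synthetic vectors that stand in for hidden states taken at one chosen token position. The data generator, the dimensionality, and all function names are assumptions made for illustration.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_synthetic_states(n=200, dim=4, seed=0):
    """Hypothetical stand-in for hidden states: label 1 = correct answer, 0 = error.

    The first two coordinates are shifted by the label; the rest are pure noise.
    """
    rng = random.Random(seed)
    features, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        shift = 2.0 if y == 1 else -2.0
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += shift
        x[1] += shift
        features.append(x)
        labels.append(y)
    return features, labels


def train_probe(features, labels, lr=0.1, epochs=200):
    """Fit logistic regression with full-batch gradient descent."""
    dim = len(features[0])
    n = len(features)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(w, b, features, labels):
    """Fraction of examples whose predicted label matches the true label."""
    correct = 0
    for x, y in zip(features, labels):
        predicted = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        correct += int(predicted == y)
    return correct / len(labels)
```

With real models, the feature vectors would be activations extracted at a specific layer and token; the abstract's finding that such detectors fail to generalize across datasets would correspond to evaluating `probe_accuracy` on states from a different dataset than the one used in `train_probe`.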
A Tutorial on Clinical Speech AI Development: From Data Collection to Model Validation
Ng, Si-Ioi, Xu, Lingfeng, Siegert, Ingo, Cummins, Nicholas, Benway, Nina R., Liss, Julie, Berisha, Visar
There has been a surge of interest in leveraging speech as a marker of health for a wide spectrum of conditions. The underlying premise is that any neurological, mental, or physical deficits that impact speech production can be objectively assessed via automated analysis of speech. Recent advances in speech-based Artificial Intelligence (AI) models for diagnosing and tracking mental health, cognitive, and motor disorders often use supervised learning, similar to mainstream speech technologies like recognition and verification. However, clinical speech AI has distinct challenges, including the need for specific elicitation tasks, small available datasets, diverse speech representations, and uncertain diagnostic labels. As a result, application of the standard supervised learning paradigm may lead to models that perform well in controlled settings but fail to generalize in real-world clinical deployments. With translation into real-world clinical scenarios in mind, this tutorial paper provides an overview of the key components required for robust development of clinical speech AI. Specifically, this paper will cover the design of speech elicitation tasks and protocols most appropriate for different clinical conditions, collection of data and verification of hardware, development and validation of speech representations designed to measure clinical constructs of interest, development of reliable and robust clinical prediction models, and ethical and participant considerations for clinical speech AI. The goal is to provide comprehensive guidance on building models whose inputs and outputs link to the more interpretable and clinically meaningful aspects of speech, that can be interrogated and clinically validated on clinical datasets, and that adhere to ethical, privacy, and security considerations by design.
Not All LLM-Generated Data Are Equal: Rethinking Data Weighting in Text Classification
Kuo, Hsun-Yu, Liao, Yin-Hsiang, Chao, Yu-Chieh, Ma, Wei-Yun, Cheng, Pu-Jen
Synthetic data augmentation via large language models (LLMs) allows researchers to leverage additional training data, thus enhancing the performance of downstream tasks, especially when real-world data is scarce. However, the generated data can deviate from the real-world data, and this misalignment can lead to poor outcomes when the trained model is applied in practice. Therefore, we proposed efficient weighted-loss approaches to align synthetic data with the real-world distribution by emphasizing high-quality and diversified data generated by LLMs, using only a small amount of real-world data. We empirically assessed the effectiveness of our method on multiple text classification tasks, and the results showed that applying our approaches to a BERT-level model robustly outperformed standard cross-entropy and other data weighting approaches, providing potential solutions for effectively leveraging synthetic data from any suitable data generator for model training.
The quantity and quality of data play a significant role in many Natural Language Processing (NLP) tasks. However, because data for a specific task in a particular domain is often scarce, collecting it may require expertise, which runs into budget limitations. Fortunately, large language models (LLMs) provide a practical solution to this problem. LLMs, such as the GPT series (Brown et al., 2020; OpenAI, 2022; OpenAI et al., 2024), can be leveraged to generate synthetic data that mimics real-world examples, thereby enriching the training set (Wang et al., 2023). However, training models with LLM-generated data can lead to drawbacks such as model collapse (Shumailov et al., 2023; Dohmatob et al., 2024), tail phenomena, and reinforced LM biases (Wang et al., 2023). Moreover, based on our empirical study, the performance of models trained on synthetic data without proper processing can be lower than that of models trained on much smaller real-world data (Sec. 
Previous works adopted data filtering strategies to obtain high-quality or varied data (Dubey et al., 2024; MetaAI, 2024; Chiang et al., 2023; West et al., 2022).
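In contrast to filtering, a weighted loss keeps every synthetic example but scales its contribution. The sketch below shows only the generic mechanism, a weighted mean of per-example cross-entropy losses. How the weights are derived (for instance from quality or diversity estimates) is not described in this excerpt, so the `weights` argument is a hypothetical placeholder, as are the function names.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities, subtracting the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_cross_entropy(batch_logits, labels, weights):
    """Weighted mean of per-example cross-entropy losses.

    `weights` holds one non-negative value per example (assumed to come from
    some external quality/diversity estimate); uniform weights recover the
    standard mean cross-entropy.
    """
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    losses = [-math.log(softmax(lg)[y]) for lg, y in zip(batch_logits, labels)]
    return sum(w * l for w, l in zip(weights, losses)) / total_weight
```

Setting a weight to zero removes an example entirely, so hard filtering is the special case in which every weight is either 0 or 1.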
LongReward: Improving Long-context Large Language Models with AI Feedback
Zhang, Jiajie, Hou, Zhongni, Lv, Xin, Cao, Shulin, Hou, Zhenyu, Niu, Yilin, Hou, Lei, Dong, Yuxiao, Feng, Ling, Li, Juanzi
Though significant advancements have been achieved in developing long-context large language models (LLMs), the compromised quality of LLM-synthesized data for supervised fine-tuning (SFT) often affects the long-context performance of SFT models and leads to inherent limitations. In principle, reinforcement learning (RL) with appropriate reward signals can further enhance models' capacities. However, how to obtain reliable rewards in long-context scenarios remains unexplored. To this end, we propose LongReward, a novel method that utilizes an off-the-shelf LLM to provide rewards for long-context model responses along four human-valued dimensions: helpfulness, logicality, faithfulness, and completeness, each with a carefully designed assessment pipeline. By combining LongReward with the offline RL algorithm DPO, we are able to effectively improve long-context SFT models. Our experiments indicate that LongReward not only significantly improves models' long-context performance but also enhances their ability to follow short instructions. We also find that long-context DPO with LongReward and conventional short-context DPO can be used together without hurting either one's performance.
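As a rough illustration of how per-dimension rewards can feed DPO, the sketch below builds one preference pair from scored responses and evaluates the standard DPO loss on it. The averaging of the four dimension scores and the best-versus-worst pair selection are assumptions made for illustration; the abstract does not say how LongReward aggregates its scores. All function names are hypothetical.

```python
import math

DIMENSIONS = ("helpfulness", "logicality", "faithfulness", "completeness")


def aggregate_reward(scores):
    """Average the four dimension scores (assumed aggregation, not from the paper)."""
    return sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)


def pick_preference_pair(candidates):
    """Return (chosen, rejected): the highest- and lowest-reward responses.

    `candidates` is a list of (response_text, scores_dict) tuples.
    """
    ranked = sorted(candidates, key=lambda c: aggregate_reward(c[1]))
    return ranked[-1][0], ranked[0][0]


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin)).

    The margin is the policy's log-probability gain over the reference model
    on the chosen response minus the same gain on the rejected response.
    """
    z = beta * ((policy_chosen_logp - ref_chosen_logp)
                - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(z)) computed without overflow for large |z|
    if z > 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

When the policy equals the reference model the margin is zero and the loss is log 2; the loss falls as the policy raises the chosen response's likelihood relative to the rejected one.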
CycleResearcher: Improving Automated Research via Automated Review
Weng, Yixuan, Zhu, Minjun, Bao, Guangsheng, Zhang, Hongbo, Wang, Jindong, Zhang, Yue, Yang, Linyi
The automation of scientific discovery has been a long-standing goal within the research community, driven by the potential to accelerate knowledge creation. While significant progress has been made using commercial large language models (LLMs) as research assistants or idea generators, the possibility of automating the entire research process with open-source LLMs remains largely unexplored. This paper explores the feasibility of using open-source post-trained LLMs as autonomous agents capable of performing the full cycle of automated research and review, from literature review and manuscript preparation to peer review and paper revision. Our iterative preference training framework consists of CycleResearcher, which conducts research tasks, and CycleReviewer, which simulates the peer review process, providing iterative feedback via reinforcement learning. To train these models, we develop two new datasets, Review-5k and Research-14k, reflecting real-world machine learning research and peer review dynamics. Our results demonstrate that CycleReviewer achieves a 26.89% improvement in mean absolute error (MAE) over individual human reviewers in predicting paper scores, indicating that LLMs can surpass expert-level performance in research evaluation. In research, the papers generated by the CycleResearcher model achieved a score of 5.36 in simulated peer reviews, surpassing the preprint level of 5.24 from human experts and approaching the accepted paper level of 5.69. This work represents a significant step toward fully automated scientific inquiry, providing ethical safeguards and advancing AI-driven research capabilities. The code, dataset, and model weights are released at http://github/minjun-zhu/Researcher.
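For readers unfamiliar with the metric, the sketch below shows how an MAE comparison of this kind can be computed. It assumes the reported improvement is a relative reduction in MAE against a baseline, which the abstract does not state explicitly; the function names are hypothetical and no scores from the paper are used.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute gap between predicted and reference scores."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


def relative_improvement(baseline_mae, model_mae):
    """Fractional reduction in MAE relative to the baseline (assumed definition)."""
    if baseline_mae <= 0:
        raise ValueError("baseline MAE must be positive")
    return (baseline_mae - model_mae) / baseline_mae
```

Under this definition, a baseline MAE of 2.0 and a model MAE of 1.5 give a relative improvement of 0.25, that is, 25%.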