Tradutor: Building a Variety Specific Translation Model
Sousa, Hugo, Almasian, Satya, Campos, Ricardo, Jorge, Alípio
Language models have become foundational to many widely used systems. However, these seemingly advantageous models are double-edged swords. While they excel in tasks related to resource-rich languages like English, they often lose the fine nuances of language forms, dialects, and varieties that are inherent to languages spoken in multiple regions of the world. Languages like European Portuguese are neglected in favor of their more popular counterpart, Brazilian Portuguese, leading to suboptimal performance in various linguistic tasks. To address this gap, we introduce the first open-source translation model specifically tailored for European Portuguese, along with a novel dataset specifically designed for this task. Results from automatic evaluations on two benchmark datasets demonstrate that our best model surpasses existing open-source translation systems for Portuguese and approaches the performance of industry-leading closed-source systems for European Portuguese. By making our dataset, models, and code publicly available, we aim to support and encourage further research, fostering advancements in the representation of underrepresented language varieties.
Forecasting Local Ionospheric Parameters Using Transformers
Alford-Lago, Daniel J., Curtis, Christopher W., Ihler, Alexander T., Zawdie, Katherine A., Drob, Douglas P.
Accurate and efficient modeling of Earth's ionosphere has a significant impact on research and operational communities due to its effects on radio communications, radar performance [1, 2, 3], and satellite drag [4]. Success in forecasting key parameters such as the F2 layer critical frequency (foF2) and height (hmF2) and the total electron content (TEC) allows one to anticipate and mitigate the impacts of ionospheric variability on such systems. Over the past decades, many modeling approaches have been developed to predict these ionospheric parameters with increasing accuracy and skill. These models may be broadly categorized as empirical, physics-based, and, more recently, machine learning methods. Empirical models often rely on extensive historical datasets to establish statistical relationships between ionospheric parameters and geophysical variables. The International Reference Ionosphere (IRI) model [5] is a widely used standard that provides monthly averages of various ionospheric parameters based on many decades of past observations. IRI has seen continual development and improvement over the years, adding a host of submodels that capture specific aspects of the ionosphere, such as the CCIR [6, 7] and URSI [8] foF2 models for representing the diurnal variations of the peak plasma density across the globe, the AMTB [9] and SHU-2015 [10] models for even more harmonic expansions of hmF2, and NeQuick 2 [11] for improved topside electron density accuracy and thus better estimates of TEC [12, 13]. So, while large empirical models like IRI continue to improve, the number of options needed to address each domain and source of variance in the ionosphere also grows, and choosing the appropriate settings may be prohibitive without expert knowledge of each submodel.
Large Language Models Struggle to Describe the Haystack without Human Help: Human-in-the-loop Evaluation of LLMs
Li, Zongxia, Calvo-Bartolomé, Lorena, Hoyle, Alexander, Xu, Paiheng, Dima, Alden, Fung, Juan Francisco, Boyd-Graber, Jordan
A common use of NLP is to facilitate the understanding of large document collections, with a shift from using traditional topic models to Large Language Models (LLMs). Yet the effectiveness of using LLMs for large corpus understanding in real-world applications remains under-explored. This study measures the knowledge users acquire with unsupervised and supervised LLM-based exploratory approaches or traditional topic models on two datasets. While LLM-based methods generate more human-readable topics and show higher average win probabilities than traditional models for data exploration, they produce overly generic topics for domain-specific datasets that do not easily allow users to learn much about the documents. Adding human supervision to the LLM generation process improves data exploration by mitigating hallucination and over-genericity but requires greater human effort. In contrast, traditional models like Latent Dirichlet Allocation (LDA) remain effective for exploration but are less user-friendly. We show that LLMs struggle to describe the haystack of large corpora without human help, particularly domain-specific data, and face scaling and hallucination limitations due to context length constraints. Dataset available at https://huggingface.co/datasets/zli12321/Bills.
Deriving Representative Structure from Music Corpora
Shapiro, Ilana, Huang, Ruanqianqian, Novack, Zachary, Wang, Cheng-i, Dong, Hao-Wen, Berg-Kirkpatrick, Taylor, Dubnov, Shlomo, Lerner, Sorin
Western music is an innately hierarchical system of interacting levels of structure, from fine-grained melody to high-level form. In order to analyze music compositions holistically and at multiple granularities, we propose a unified, hierarchical meta-representation of musical structure called the structural temporal graph (STG). For a single piece, the STG is a data structure that defines a hierarchy of progressively finer structural musical features and the temporal relationships between them. We use the STG to enable a novel approach for deriving a representative structural summary of a music corpus, which we formalize as a dually NP-hard combinatorial optimization problem extending the Generalized Median Graph problem. Our approach first applies simulated annealing to develop a measure of structural distance between two music pieces rooted in graph isomorphism. Our approach then combines the formal guarantees of SMT solvers with nested simulated annealing over structural distances to produce a structurally sound, representative centroid STG for an entire corpus of STGs from individual pieces. To evaluate our approach, we conduct experiments verifying that structural distance accurately differentiates between music pieces, and that derived centroids accurately structurally characterize their corpora.
Advancing Out-of-Distribution Detection via Local Neuroplasticity
Canevaro, Alessandro, Schmidt, Julian, Marvi, Mohammad Sajad, Yu, Hang, Martius, Georg, Jordan, Julian
In the domain of machine learning, the assumption that training and test data share the same distribution is often violated in real-world scenarios, requiring effective out-of-distribution (OOD) detection. This paper presents a novel OOD detection method that leverages the unique local neuroplasticity property of Kolmogorov-Arnold Networks (KANs). Unlike traditional multilayer perceptrons, KANs exhibit local plasticity, allowing them to preserve learned information while adapting to new tasks. Our method compares the activation patterns of a trained KAN against its untrained counterpart to detect OOD samples. We validate our approach on benchmarks from image and medical domains, demonstrating superior performance and robustness compared to state-of-the-art techniques. These results underscore the potential of KANs in enhancing the reliability of machine learning systems in diverse environments.
mStyleDistance: Multilingual Style Embeddings and their Evaluation
Qiu, Justin, Zhu, Jiacheng, Patel, Ajay, Apidianaki, Marianna, Callison-Burch, Chris
Style embeddings are useful for stylistic analysis and style transfer; however, only English style embeddings have been made available. We introduce Multilingual StyleDistance (mStyleDistance), a multilingual style embedding model trained using synthetic data and contrastive learning. We train the model on data from nine languages and create a multilingual STEL-or-Content benchmark (Wegmann et al., 2022) that serves to assess the embeddings' quality. We also employ our embeddings in an authorship verification task involving different languages. Our results show that mStyleDistance embeddings outperform existing models on these multilingual style benchmarks and generalize well to unseen features and languages. We make our model publicly available at https://huggingface.co/StyleDistance/mstyledistance.
Visualizing Machine Learning Models for Enhanced Financial Decision-Making and Risk Management
Ganguly, Priyam, Garine, Ramakrishna, Mukherjee, Isha
This study emphasizes how crucial it is to visualize machine learning models, especially for the banking industry, in order to improve interpretability and support predictions in high-stakes financial settings. Visual tools enable performance improvements and support the creation of innovative financial models by offering crucial insights into the algorithmic decision-making processes. Within a financial machine learning framework, the research uses visually guided experiments to make important concepts, such as risk assessment and portfolio allocation, more understandable. The study also examines variations in trading tactics and how they relate to risk appetite, concluding that the frequency of portfolio rebalancing is negatively correlated with risk tolerance. Visualization plays a large part in surfacing these findings. The study concludes by presenting a novel method of locally stochastic asset weighing, where visualization facilitates data extraction and validation. This highlights the usefulness of these methods in furthering the field of financial machine learning research.
Voter Model Meets Rumour Spreading: A Study of Consensus Protocols on Graphs with Agnostic Nodes [Extended Version]
Gauy, Marcelo Matheus, Abramishvili, Anna, Colli, Eduardo, Madeira, Tiago, Mallmann-Trenn, Frederik, Vasconcelos, Vinícius Franco, Marzagão, David Kohan
Problems of consensus in multi-agent systems are often viewed as a series of independent, simultaneous local decisions made between a limited set of options, all aimed at reaching a global agreement. Key challenges in these protocols include estimating the likelihood of various outcomes and finding bounds for how long it may take to achieve consensus, if it occurs at all. To date, little attention has been given to the case where some agents have no initial opinion. In this paper, we introduce a variant of the consensus problem which includes what we call 'agnostic' nodes and frame it as a combination of two known and well-studied processes: voter model and rumour spreading. We show (1) a martingale that describes the probability of consensus for a given colour, (2) bounds on the number of steps for the process to end using results from rumour spreading and voter models, (3) closed formulas for the probability of consensus in a few special cases, and (4) that the computational complexity of estimating the probability with a Markov chain Monte Carlo process is $O(n^2 \log n)$ for general graphs and $O(n\log n)$ for Erdős-Rényi graphs, which makes it an efficient method for estimating probabilities of consensus. Furthermore, we present experimental results suggesting that the number of runs needed for a given standard error decreases when the number of nodes increases.
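To make the idea of estimating a consensus probability by repeated simulation concrete, the following is a minimal Python sketch. It is not the authors' protocol or their Markov chain Monte Carlo procedure: the asynchronous "pull" update rule (a random node copies the opinion of a random neighbour, and agnostic neighbours have no effect), the function names `run_once` and `estimate_consensus_probability`, and the `max_steps` cutoff are all assumptions made for illustration. Agnostic nodes are represented by `None`.

```python
import random
from math import sqrt


def run_once(adj, init, rng, max_steps=1_000_000):
    """Simulate one run of an assumed pull-style process.

    adj  : adjacency list, adj[v] is a non-empty list of neighbours of v
    init : initial opinions; None marks an agnostic node (hypothetical encoding)
    Returns the consensus colour, or None if max_steps is reached first.
    """
    state = list(init)
    n = len(state)
    counts = {}
    for c in state:
        counts[c] = counts.get(c, 0) + 1

    for _ in range(max_steps):
        # Consensus: every node holds the same non-agnostic colour.
        for colour, k in counts.items():
            if colour is not None and k == n:
                return colour
        v = rng.randrange(n)       # node that updates
        u = rng.choice(adj[v])     # neighbour it looks at
        # Assumed rule: v adopts u's opinion; agnostic neighbours change nothing.
        if state[u] is not None and state[u] != state[v]:
            counts[state[v]] -= 1
            counts[state[u]] = counts.get(state[u], 0) + 1
            state[v] = state[u]
    return None


def estimate_consensus_probability(adj, init, colour, runs=2000, seed=0):
    """Monte Carlo estimate of P(consensus on `colour`) with its standard error."""
    rng = random.Random(seed)
    wins = sum(run_once(adj, init, rng) == colour for _ in range(runs))
    p = wins / runs
    se = sqrt(p * (1 - p) / runs)  # binomial standard error of the estimate
    return p, se


if __name__ == "__main__":
    # Complete graph on 6 nodes: two red, one blue, three agnostic.
    n = 6
    adj = [[j for j in range(n) if j != i] for i in range(n)]
    init = ["red", "red", "blue", None, None, None]
    p, se = estimate_consensus_probability(adj, init, "red")
    print(f"estimated P(red consensus) = {p:.3f} (standard error {se:.3f})")
```

The returned standard error shrinks as the number of runs grows, which is the quantity the abstract refers to when it discusses how many runs are needed for a given standard error. Under this assumed rule, a graph whose opinionated nodes all share one colour reaches consensus on that colour with certainty, since agnostic nodes cannot spread anything.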
Obliviate: Efficient Unmemorization for Protecting Intellectual Property in Large Language Models
Russinovich, Mark, Salem, Ahmed
Recent copyright agreements between AI companies and content creators have highlighted the need for precise control over language models' ability to reproduce copyrighted content. While existing approaches rely on either complete concept removal through unlearning or simple output filtering, we propose Obliviate, a novel post-training technique that selectively prevents verbatim reproduction of specific text while preserving semantic understanding. Obliviate operates by selecting tokens within memorized sequences and modifying the model's probability distribution to prevent exact reproduction while maintaining contextual understanding. We evaluate Obliviate on multiple large language models (LLaMA-3.1 8B, LLaMA-3.1-instruct 8B, Qwen-2.5-7B, and Yi-1.5 6B) across both synthetic memorization tasks and organic copyright content. Our results demonstrate that Obliviate achieves orders of magnitude reduction, e.g., 100x, in verbatim memorization while maintaining model performance within 1% of baseline on standard benchmarks (HellaSwag, MMLU, TruthfulQA, and Winogrande). This makes Obliviate particularly suitable for practical deployment scenarios where companies need to efficiently address copyright concerns in pretrained models without compromising their general capabilities.
A Rapid Test for Accuracy and Bias of Face Recognition Technology
Knott, Manuel, Serna, Ignacio, Mann, Ethan, Perona, Pietro
Measuring the accuracy of face recognition (FR) systems is essential for improving performance and ensuring responsible use. Accuracy is typically estimated using large annotated datasets, which are costly and difficult to obtain. We propose a novel method for 1:1 face verification that benchmarks FR systems quickly and without manual annotation, starting from approximate labels (e.g., from web search results). Unlike previous methods for training set label cleaning, ours leverages the embedding representation of the models being evaluated, achieving high accuracy in smaller-sized test datasets. Our approach reliably estimates FR accuracy and ranking, significantly reducing the time and cost of manual labeling. We also introduce the first public benchmark of five FR cloud services, revealing demographic biases, particularly lower accuracy for Asian women. Our rapid test method can democratize FR testing, promoting scrutiny and responsible use of the technology.