Yerevan
Synthetic Data for any Differentiable Target
Thrush, Tristan, Park, Sung Min, Brunborg, Herman, Bailey, Luke, Roed, Marcel, Band, Neil, Potts, Christopher, Hashimoto, Tatsunori
What are the limits of controlling language models via synthetic training data? We develop a reinforcement learning (RL) primitive, the Dataset Policy Gradient (DPG), which can precisely optimize synthetic data generators to produce a dataset of targeted examples. When used for supervised fine-tuning (SFT) of a target model, these examples cause the target model to do well on a differentiable metric of our choice. Our approach achieves this by taking exact data attribution via higher-order gradients and using those scores as policy gradient rewards. We prove that this procedure closely approximates the true, intractable gradient for the synthetic data generator. To illustrate the potential of DPG, we show that, using only SFT on generated examples, we can cause the target model's LM head weights to (1) embed a QR code, (2) embed the pattern $\texttt{67}$, and (3) have lower $\ell^2$ norm. We additionally show that we can cause the generator to (4) rephrase inputs in a new language and (5) produce a specific UUID, even though neither of these objectives is conveyed in the generator's input prompts. These findings suggest that DPG is a powerful and flexible technique for shaping model properties using only synthetic training examples.
A dating app, a niqab and a 9mm gun - how a US woman was hired to end a UK family feud
Betro initially fled the scene but returned by taxi just after midnight and fired three shots at the family home. By 13:30 BST, she was at Manchester Airport and flew to the US, prosecutors said. Days later, Nazir followed and according to Betro, the pair rented a car and drove to Seattle "just for a road trip" with stops at an amusement park, Area 51 in Nevada, Los Angeles and San Francisco. She told jurors she did not know there had been a shooting in Measham Grove and Nazir had not mentioned it during his time in the States. The investigation to find Betro and bring her co-conspirators to justice not only spanned several years but was hampered by the pandemic and involved the FBI, National Crime Agency and two UK police forces.
Methods to Increase the Amount of Data for Speech Recognition for Low Resource Languages
Ayrapetyan, Alexan, Kostandian, Sofia, Yeroyan, Ara, Yerznkanyan, Mher, Karpov, Nikolay, Tadevosyan, Nune, Lavrukhin, Vitaly, Ginsburg, Boris
This study explores methods to increase data volume for low-resource languages using techniques such as crowdsourcing, pseudo-labeling, advanced data preprocessing and various permissive data sources such as audiobooks, Common Voice, YouTube. While these methods are well-explored for highresource languages, their application for low-resource languages remains underexplored. Using Armenian and Georgian as case studies, we demonstrate how linguistic and resource-specific characteristics influence the success of these methods. This work provides practical guidance for researchers to choose cost-effective and quality-driven dataset extension strategies for low-resource languages. The key takeaway from various data extension approaches is that paid crowd-sourcing offers the best balance between cost and quality, outperforming volunteer crowd-sourcing, open-source audiobooks, and unlabeled data usage. Ablation study shows that models trained on the expanded datasets outperform existing baselines and achieve 5.73% for Gergian and 9.9% for Armenian ASR word error rate using a relatively small FastConformer architecture. We open-sourced both the Armenian and Georgian models to allow further research and practical applications.
Scalable Temporal Anomaly Causality Discovery in Large Systems: Achieving Computational Efficiency with Binary Anomaly Flag Data
Asres, Mulugeta Weldezgina, Omlin, Christian Walter, Collaboration, The CMS-HCAL
Extracting anomaly causality facilitates diagnostics once monitoring systems detect system faults. Identifying anomaly causes in large systems involves investigating a more extensive set of monitoring variables across multiple subsystems. However, learning causal graphs comes with a significant computational burden that restrains the applicability of most existing methods in real-time and large-scale deployments. In addition, modern monitoring applications for large systems often generate large amounts of binary alarm flags, and the distinct characteristics of binary anomaly data -- the meaning of state transition and data sparsity -- challenge existing causality learning mechanisms. This study proposes an anomaly causal discovery approach (AnomalyCD), addressing the accuracy and computational challenges of generating causal graphs from binary flag data sets. The AnomalyCD framework presents several strategies, such as anomaly flag characteristics incorporating causality testing, sparse data and link compression, and edge pruning adjustment approaches. We validate the performance of this framework on two datasets: monitoring sensor data of the readout-box system of the Compact Muon Solenoid experiment at CERN, and a public data set for information technology monitoring. The results demonstrate the considerable reduction of the computation overhead and moderate enhancement of the accuracy of temporal causal discovery on binary anomaly data sets.
AI in the Cosmos
Artificial intelligence (AI) is revolutionizing research by enabling the efficient analysis of large datasets and the discovery of hidden patterns. In astrophysics, AI has become essential, transforming the classification of celestial sources, data modeling, and the interpretation of observations. In this review, I highlight examples of AI applications in astrophysics, including source classification, spectral energy distribution modeling, and discuss the advancements achievable through generative AI. However, the use of AI introduces challenges, including biases, errors, and the "black box" nature of AI models, which must be resolved before their application. These issues can be addressed through the concept of Human-Guided AI (HG-AI), which integrates human expertise and domain-specific knowledge into AI applications. This approach aims to ensure that AI is applied in a robust, interpretable, and ethical manner, leading to deeper insights and fostering scientific excellence.
Vision Transformers for Efficient Indoor Pathloss Radio Map Prediction
Ghukasyan, Edvard, Khachatrian, Hrant, Mkrtchyan, Rafayel, Raptis, Theofanis P.
Vision Transformers (ViTs) have demonstrated remarkable success in achieving state-of-the-art performance across various image-based tasks and beyond. In this study, we employ a ViT-based neural network to address the problem of indoor pathloss radio map prediction. The network's generalization ability is evaluated across diverse settings, including unseen buildings, frequencies, and antennas with varying radiation patterns. By leveraging extensive data augmentation techniques and pretrained DINOv2 weights, we achieve promising results, even under the most challenging scenarios.
Formulation of probability theory problem with subtle condition
Problems in probability theory prove to be one of the most challenging for students. Here, we formulate and discuss four related problems in probability theory that proved difficult for first to fourth-year undergraduate students whose first language was not English. These examples emphasize how crucial it is to understand the conditions and requirements of the problems precisely before starting to solve them. We discuss the solutions to those problems in detail, complement them with numerical estimations, and link the conditions in the problems to the logical statements in Python programming language. We also tested two widely used chatbots (GPT-4o and Claude 3.5 Sonnet) by checking their responses to these problems.
In-context Learning in Presence of Spurious Correlations
Harutyunyan, Hrayr, Darbinyan, Rafayel, Karapetyan, Samvel, Khachatrian, Hrant
Large language models exhibit a remarkable capacity for in-context learning, where they learn to solve tasks given a few examples. Recent work has shown that transformers can be trained to perform simple regression tasks in-context. This work explores the possibility of training an in-context learner for classification tasks involving spurious features. We find that the conventional approach of training in-context learners is susceptible to spurious features. Moreover, when the meta-training dataset includes instances of only one task, the conventional approach leads to task memorization and fails to produce a model that leverages context for predictions. Based on these observations, we propose a novel technique to train such a learner for a given classification task. Remarkably, this in-context learner matches and sometimes outperforms strong methods like ERM and GroupDRO. However, unlike these algorithms, it does not generalize well to other tasks. We show that it is possible to obtain an in-context learner that generalizes to unseen tasks by training on a diverse dataset of synthetic in-context learning instances.
Temporal Many-valued Conditional Logics: a Preliminary Report
Alviano, Mario, Giordano, Laura, Duprรฉ, Daniele Theseider
In this paper we propose a many-valued temporal conditional logic. We start from a many-valued logic with typicality, and extend it with the temporal operators of the Linear Time Temporal Logic (LTL), thus providing a formalism which is able to capture the dynamics of a system, trough strict and defeasible temporal properties. We also consider an instantiation of the formalism for gradual argumentation.
Speaker Tagging Correction With Non-Autoregressive Language Models
Kirakosyan, Grigor, Karamyan, Davit
Speech applications dealing with conversations require not only recognizing the spoken words but also determining who spoke when. The task of assigning words to speakers is typically addressed by merging the outputs of two separate systems, namely, an automatic speech recognition (ASR) system and a speaker diarization (SD) system. In practical settings, speaker diarization systems can experience significant degradation in performance due to a variety of factors, including uniform segmentation with a high temporal resolution, inaccurate word timestamps, incorrect clustering and estimation of speaker numbers, as well as background noise. Therefore, it is important to automatically detect errors and make corrections if possible. We used a second-pass speaker tagging correction system based on a non-autoregressive language model to correct mistakes in words placed at the borders of sentences spoken by different speakers. We first show that the employed error correction approach leads to reductions in word diarization error rate (WDER) on two datasets: TAL and test set of Fisher. Additionally, we evaluated our system in the Post-ASR Speaker Tagging Correction challenge and observed significant improvements in cpWER compared to baseline methods.