 elmo


Interpreting and improving natural-language processing (in machines) with natural language-processing (in the brain)

Mariya Toneva, Leila Wehbe

Neural Information Processing Systems

We use brain imaging recordings of subjects reading complex natural text to interpret word and sequence embeddings from 4 recent NLP models - ELMo, USE, BERT and Transformer-XL. We study how their representations differ across layer depth, context length, and attention type.


ELMO: Efficiency via Low-precision and Peak Memory Optimization in Large Output Spaces

Zhang, Jinbin, Ullah, Nasib, Schultheis, Erik, Babbar, Rohit

arXiv.org Artificial Intelligence

Large output spaces, also referred to as Extreme multilabel classification (XMC), is a setting that arises, e.g., in large-scale tagging and product-to-product recommendation, and is characterized by the number of labels ranging from hundreds of thousands to millions. This means that the linear classification head, usually only a tiny fraction of the overall model, turns into the main driver for compute and memory demand. Current state-of-the-art XMC methods predominantly rely on FP16-FP32 mixed-precision training, which we show can be unstable, and inefficient in terms of memory usage and computational overhead. Meanwhile, existing low-precision methods typically retain higher precision for the classification layer. In this work, we propose ELMO, a pure low-precision training framework for XMC models using BFloat16 and Float8 data types. By leveraging Kahan summation and stochastic rounding, we demonstrate that XMC models can be effectively trained entirely in Float8, without relying on single-precision master weights or tensor scaling. Low-precision training, combined with our proposed memory optimizations -- gradient fusion and chunking -- enables significant reductions in GPU memory usage. For example, we train a 3-million-label XMC model with only 6.6 GiB of GPU memory, compared to the 39.7 GiB required by the optimized SOTA method, Renee, without compromising accuracy.
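The two numerical techniques named in this abstract, Kahan summation and stochastic rounding, can be illustrated with a minimal sketch. This is not the paper's implementation: it runs in ordinary Python double precision rather than BFloat16 or Float8, and the function names `naive_sum`, `kahan_sum`, and `stochastic_round` are hypothetical, chosen here only for illustration.

```python
import math
import random


def naive_sum(values):
    """Plain left-to-right accumulation; rounding error grows with length."""
    total = 0.0
    for v in values:
        total += v
    return total


def kahan_sum(values):
    """Compensated (Kahan) summation.

    A second variable carries the low-order bits that were lost when the
    previous term was added, and feeds them back into the next addition.
    """
    total = 0.0
    comp = 0.0  # running compensation for lost low-order bits
    for v in values:
        y = v - comp            # apply the correction from the last step
        t = total + y           # low-order bits of y may be lost here
        comp = (t - total) - y  # recover what was lost (algebraically zero)
        total = t
    return total


def stochastic_round(x, step, rng):
    """Round x to a multiple of `step` (an assumed uniform grid).

    Rounds up with probability equal to the fractional distance from the
    lower grid point, so the result equals x in expectation.
    """
    lower = math.floor(x / step) * step
    frac = (x - lower) / step
    return lower + step if rng.random() < frac else lower


if __name__ == "__main__":
    values = [0.1] * 100_000
    exact = math.fsum(values)  # correctly rounded reference sum
    print("naive error:", abs(naive_sum(values) - exact))
    print("kahan error:", abs(kahan_sum(values) - exact))

    rng = random.Random(0)
    samples = [stochastic_round(0.3, 1.0, rng) for _ in range(10_000)]
    print("mean of stochastically rounded 0.3:", sum(samples) / len(samples))
```

Both ideas address the same problem: when updates are small relative to the precision of the accumulator, round-to-nearest can discard them entirely. Compensation tracks the discarded part explicitly, while stochastic rounding keeps it on average.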



Sesame Street puppet Elmo's X account posts anti-Jewish rant after hacking

Al Jazeera

The makers of Sesame Street have deleted a slew of offensive social media posts after hackers hijacked the puppet Elmo's X account to launch a tirade about Jews and Jeffrey Epstein. The posts on Elmo's account on Sunday called for the extermination of Jewish people, referred to United States President Donald Trump as a "puppet" of Israeli Prime Minister Benjamin Netanyahu and demanded the release of law enforcement files about Epstein, the accused sex trafficker who died in 2019. The posts attracted a flurry of attention online before being deleted a short time after they were uploaded on Sunday. "Elmo's X account was compromised by an unknown hacker who posted disgusting messages, including antisemitic and racist posts," a spokesperson for the Sesame Workshop told Al Jazeera in a statement on Monday. "The account has since been secured."


ELMO: Enhanced Real-time LiDAR Motion Capture through Upsampling

Jang, Deok-Kyeong, Yang, Dongseok, Jang, Deok-Yun, Choi, Byeoli, Shin, Donghoon, Lee, Sung-hee

arXiv.org Artificial Intelligence

This paper introduces ELMO, a real-time upsampling motion capture framework designed for a single LiDAR sensor. Modeled as a conditional autoregressive transformer-based upsampling motion generator, ELMO achieves 60 fps motion capture from a 20 fps LiDAR point cloud sequence. The key feature of ELMO is the coupling of the self-attention mechanism with thoughtfully designed embedding modules for motion and point clouds, significantly elevating the motion quality. To facilitate accurate motion capture, we develop a one-time skeleton calibration model capable of predicting user skeleton offsets from a single-frame point cloud. Additionally, we introduce a novel data augmentation technique utilizing a LiDAR simulator, which enhances global root tracking to improve environmental understanding. To demonstrate the effectiveness of our method, we compare ELMO with state-of-the-art methods in both image-based and point cloud-based motion capture. We further conduct an ablation study to validate our design principles. ELMO's fast inference time makes it well-suited for real-time applications, exemplified in our demo video featuring live streaming and interactive gaming scenarios. Furthermore, we contribute a high-quality LiDAR-mocap synchronized dataset comprising 20 different subjects performing a range of motions, which can serve as a valuable resource for future research. The dataset and evaluation code are available at https://movin3d.github.io/ELMO_SIGASIA2024/


Reviews: GLoMo: Unsupervised Learning of Transferable Relational Graphs

Neural Information Processing Systems

This paper presents a method to transfer graph structures learned on unlabeled data to downstream tasks, which is a conceptual shift from existing research that aims to transfer features (e.g., embeddings). The method consists of jointly training a feature and graph predictor using an unsupervised objective (which are decoupled) and then extracting only the output of the graph predictor for downstream tasks, where it is multiplicatively applied to arbitrary features. The method yields small improvements on a variety of NLP and vision tasks, and the qualitative analysis of the learned graphs does not convince me that it learns "meaningful" substructures. Overall, however, the paper has a compelling and promising idea (graph transfer), and it seems like there is room to improve on its results, so I'm a weak accept. Detailed comments: - Is "unsupervisedly" a word? It sounds weird... - The objective function in eq 3 is interesting and could have potential uses outside of just graph induction, as it seems especially powerful from the ablations in table 2...


Decoding Probing: Revealing Internal Linguistic Structures in Neural Language Models using Minimal Pairs

He, Linyang, Chen, Peili, Nie, Ercong, Li, Yuanning, Brennan, Jonathan R.

arXiv.org Artificial Intelligence

Inspired by cognitive neuroscience studies, we introduce a novel 'decoding probing' method that uses a minimal pairs benchmark (BLiMP) to probe internal linguistic characteristics in neural language models layer by layer. By treating the language model as the 'brain' and its representations as 'neural activations', we decode grammaticality labels of minimal pairs from the intermediate layers' representations. This approach reveals: 1) Self-supervised language models capture abstract linguistic structures in intermediate layers that GloVe and RNN language models cannot learn. 2) Information about syntactic grammaticality is robustly captured through the first third of GPT-2's layers and also distributed in later layers. As sentence complexity increases, more layers are required for learning grammatical capabilities. 3) Morphological and semantics/syntax interface-related features are harder to capture than syntax. 4) For Transformer-based models, both embeddings and attentions capture grammatical features but show distinct patterns. Different attention heads exhibit similar tendencies toward various linguistic phenomena, but with varied contributions.


A Context-Sensitive Word Embedding Approach for The Detection of Troll Tweets

Yilmaz, Seyhmus, Zavrak, Sultan

arXiv.org Artificial Intelligence

In this study, we aimed to address the growing concern of trolling behavior on social media by developing and evaluating a set of model architectures for the automatic detection of troll tweets. Utilizing deep learning techniques and pre-trained word embedding methods such as BERT, ELMo, and GloVe, we evaluated the performance of each architecture using metrics such as classification accuracy, F1 score, AUC, and precision. Our results indicate that BERT and ELMo embedding methods performed better than the GloVe method, likely due to their ability to provide contextualized word embeddings that better capture the nuances and subtleties of language use in online social media. Additionally, we found that CNN and GRU encoders performed similarly in terms of F1 score and AUC, suggesting their effectiveness in extracting relevant information from input text. The best-performing method was found to be an ELMo-based architecture that employed a GRU classifier, with an AUC score of 0.929. This research highlights the importance of utilizing contextualized word embeddings and appropriate encoder methods in the task of troll tweet detection, which can assist social-based systems in improving their performance in identifying and addressing trolling behavior on their platforms.


On "Scientific Debt" in NLP: A Case for More Rigour in Language Model Pre-Training Research

Nityasya, Made Nindyatama, Wibowo, Haryo Akbarianto, Aji, Alham Fikri, Winata, Genta Indra, Prasojo, Radityo Eko, Blunsom, Phil, Kuncoro, Adhiguna

arXiv.org Artificial Intelligence

This evidence-based position paper critiques current research practices within the language model pre-training literature. Despite rapid recent progress afforded by increasingly better pre-trained language models (PLMs), current PLM research practices often conflate different possible sources of model improvement, without conducting proper ablation studies and principled comparisons between different models under comparable conditions. These practices (i) leave us ill-equipped to understand which pre-training approaches should be used under what circumstances; (ii) impede reproducibility and credit assignment; and (iii) render it difficult to understand: "How exactly does each factor contribute to the progress that we have today?" We provide a case in point by revisiting the success of BERT over its baselines, ELMo and GPT-1, and demonstrate how -- under comparable conditions where the baselines are tuned to a similar extent -- these baselines (and even-simpler variants thereof) can, in fact, achieve competitive or better performance than BERT. These findings demonstrate how disentangling different factors of model improvements can lead to valuable new insights. We conclude with recommendations for how to encourage and incentivize this line of work, and accelerate progress towards a better and more systematic understanding of what factors drive the progress of our foundation models today.


A Comprehensive Empirical Evaluation of Existing Word Embedding Approaches

Zaland, Obaidullah, Abulaish, Muhammad, Fazil, Mohd.

arXiv.org Artificial Intelligence

Vector-based word representations help countless Natural Language Processing (NLP) tasks capture both semantic and syntactic regularities of the language. In this paper, we present the characteristics of existing word embedding approaches and analyze them with regard to many classification tasks. We categorize the methods into two main groups. Traditional approaches mostly use matrix factorization to produce word representations, and they are not able to capture the semantic and syntactic regularities of the language very well. Neural-network-based approaches, on the other hand, can capture sophisticated regularities of the language and preserve the word relationships in the generated word representations. We report experimental results on multiple classification tasks and highlight the scenarios where one approach performs better than the rest.