lrec
Shoulders of Giants: A Look at the Degree and Utility of Openness in NLP Research
Ranathunga, Surangika, de Silva, Nisansa, Jayakody, Dilith, Fernando, Aloka
We analysed a sample of NLP research papers archived in the ACL Anthology in an attempt to quantify the degree of openness and the benefit of such an open culture in the NLP community. We observe that papers published in different NLP venues show different patterns related to artefact reuse. We also note that more than 30% of the papers we analysed do not release their artefacts publicly, despite promising to do so. Further, we observe a wide language-wise disparity in publicly available NLP-related artefacts.
ZAEBUC-Spoken: A Multilingual Multidialectal Arabic-English Speech Corpus
Hamed, Injy, Eryani, Fadhl, Palfreyman, David, Habash, Nizar
The corpus comprises twelve hours of Zoom meetings involving multiple speakers role-playing a work situation where Students brainstorm ideas for a certain topic and then discuss it with an Interlocutor. The meetings cover different topics and are divided into phases with different language setups. The corpus presents a challenging set for automatic speech recognition (ASR), including two languages (Arabic and English) with Arabic spoken in multiple variants (Modern Standard Arabic, Gulf Arabic, and Egyptian Arabic) and English used with various accents. Adding to the complexity of the corpus, there is also code-switching between these languages and dialects. As part of our work, we take inspiration from established sets of transcription guidelines to present a set of guidelines handling issues of conversational speech, code-switching and orthography of both languages. We further enrich the corpus with two layers of annotations: (1) dialectness level annotation for the portion of the corpus where mixing occurs between different variants of Arabic, and (2) automatic morphological annotations, including tokenization, lemmatization, and part-of-speech tagging.
Can ChatGPT Compute Trustworthy Sentiment Scores from Bloomberg Market Wraps?
Lefort, Baptiste, Benhamou, Eric, Ohana, Jean-Jacques, Saltiel, David, Guez, Beatrice, Challet, Damien
We used a dataset of daily Bloomberg Financial Market Summaries from 2010 to 2023, reposted on large financial media, to determine how global news headlines may affect stock market movements using ChatGPT and a two-stage prompt approach. We document a statistically significant positive correlation between the sentiment score and future equity market returns over the short to medium term, which reverts to a negative correlation over longer horizons. Validation of this correlation pattern across multiple equity markets indicates its robustness across equity regions and resilience to non-linearity, evidenced by comparison of Pearson and Spearman correlations. Finally, we provide an estimate of the optimal horizon that strikes a balance between reactivity to new information and correlation.
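The comparison of Pearson and Spearman correlations across horizons can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the use of simple forward returns, and the alignment of one score per price observation are all assumptions made here for illustration, and the inputs are expected to be non-constant series.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))

def forward_returns(prices, horizon):
    """Simple return from day t to day t + horizon (hypothetical helper)."""
    return [prices[t + horizon] / prices[t] - 1
            for t in range(len(prices) - horizon)]

def correlation_by_horizon(scores, prices, horizons):
    """Map each horizon to (Pearson, Spearman) of score vs. forward return.

    Assumes scores[t] is the sentiment score observed on the day of prices[t].
    """
    out = {}
    for h in horizons:
        future = forward_returns(prices, h)
        aligned = scores[:len(future)]
        out[h] = (pearson(aligned, future), spearman(aligned, future))
    return out
```

When the two coefficients agree in sign and rough magnitude at a given horizon, the relationship is unlikely to be driven by a few extreme observations or by a strongly non-linear dependence, which is the sense of robustness the abstract refers to.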
On the Effectiveness of Linear Models for One-Class Collaborative Filtering
Sedhain, Suvash (Australian National University) | Menon, Aditya Krishna (Australian National University and NICTA) | Sanner, Scott (Oregon State University and Australian National University) | Braziunas, Darius (Rakuten Kobo Inc)
In many personalised recommendation problems, there are examples of items users prefer or like, but no examples of items they dislike. A state-of-the-art method for such implicit feedback, or one-class collaborative filtering (OC-CF), problems is SLIM, which makes recommendations based on a learned item-item similarity matrix. While SLIM has been shown to perform well on implicit feedback tasks, we argue that it is hindered by two limitations: first, it does not produce user-personalised predictions, which hampers recommendation performance; second, it involves solving a constrained optimisation problem, which impedes fast training. In this paper, we propose LRec, a variant of SLIM that overcomes these limitations without sacrificing any of SLIM's strengths. At its core, LRec employs linear logistic regression; despite this simplicity, LRec consistently and significantly outperforms all existing methods on a range of datasets. Our results thus illustrate that the OC-CF problem can be effectively tackled via linear classification models.
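As a rough illustration of treating OC-CF as linear classification, the sketch below fits one L2-regularised logistic regression per user. The abstract does not spell out the formulation, so the following are assumptions of this sketch and not necessarily LRec as published: each item is a training example, its features are the interactions of the other users with that item, its label is whether the target user interacted with it, and training uses plain batch gradient descent. The names `fit_user_model` and `recommend` are hypothetical.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_user_model(R, u, l2=0.1, lr=0.5, epochs=200):
    """Fit a per-user logistic regression by batch gradient descent.

    R is a binary user-by-item matrix (list of lists). User u's own row is
    excluded from the features so the label does not leak into the input
    (a design choice of this sketch).
    """
    n_users, n_items = len(R), len(R[0])
    others = [v for v in range(n_users) if v != u]
    w = [0.0] * len(others)
    b = 0.0
    for _ in range(epochs):
        grad_w = [l2 * wk for wk in w]  # gradient of the L2 penalty
        grad_b = 0.0
        for i in range(n_items):
            x = [R[v][i] for v in others]
            err = sigmoid(sum(wk * xk for wk, xk in zip(w, x)) + b) - R[u][i]
            for k, xk in enumerate(x):
                grad_w[k] += err * xk / n_items
            grad_b += err / n_items
        w = [wk - lr * g for wk, g in zip(w, grad_w)]
        b -= lr * grad_b
    return others, w, b

def recommend(R, u, top_k=2, **fit_kwargs):
    """Rank the items user u has not interacted with by predicted probability."""
    others, w, b = fit_user_model(R, u, **fit_kwargs)
    scored = []
    for i in range(len(R[0])):
        if R[u][i] == 0:
            x = [R[v][i] for v in others]
            scored.append((sigmoid(sum(wk * xk for wk, xk in zip(w, x)) + b), i))
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [i for _, i in scored[:top_k]]
```

Because every user has an independent, unconstrained convex problem, the per-user fits could be run in parallel, which is in keeping with the abstract's point about avoiding a constrained optimisation.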