Goto

Collaborating Authors

 trigram


Automatic essay scoring: leveraging Jaccard coefficient and Cosine similaritywith n-gram variation in vector space model approach

Cahyani, Andharini Dwi, Fathoni, Moh. Wildan, Rachman, Fika Hastarita, Basuki, Ari, Amin, Salman, Khotimah, Bain Khusnul

arXiv.org Artificial Intelligence

Automated essay scoring (AES) is a vital area of research aiming to provide efficient and accurate assessment tools for evaluating written content. This study investigates the effectiveness of two popular similarity metrics, Jaccard coefficient, and Cosine similarity, within the context of vector space models(VSM)employing unigram, bigram, and trigram representations. The data used in this research was obtained from the formative essay of the citizenship education subject in a junior high school. Each essay undergoes preprocessing to extract features using n-gram models, followed by vectorization to transform text data into numerical representations. Then, similarity scores are computed between essays using both Jaccard coefficient and Cosine similarity. The performance of the system is evaluated by analyzing the root mean square error (RMSE), which measures the difference between the scores given by human graders and those generated by the system. The result shows that the Cosine similarity outperformed the Jaccard coefficient. In terms of n-gram, unigrams have lower RMSE compared to bigrams and trigrams.


Complexity counts: global and local perspectives on Indo-Aryan numeral systems

Cathcart, Chundra

arXiv.org Artificial Intelligence

The numeral systems of Indo-Aryan languages such as Hindi, Gujarati, and Bengali are highly unusual in that unlike most numeral systems (e.g., those of English, Chinese, etc.), forms referring to 1--99 are highly non-transparent and are cannot be constructed using straightforward rules. As an example, Hindi/Urdu *ikyānve* `91' is not decomposable into the composite elements *ek* `one' and *nave* `ninety' in the way that its English counterpart is. This paper situates Indo-Aryan languages within the typology of cross-linguistic numeral systems, and explores the linguistic and non-linguistic factors that may be responsible for the persistence of complex systems in these languages. Using cross-linguistic data from multiple databases, we develop and employ a number of cross-linguistically applicable metrics to quantifies the complexity of languages' numeral systems, and demonstrate that Indo-Aryan languages have decisively more complex numeral systems than the world's languages as a whole, though individual Indo-Aryan languages differ from each other in terms of the complexity of the patterns they display. We investigate the factors (e.g., religion, geographic isolation, etc.) that underlie complexity in numeral systems, with a focus on South Asia, in an attempt to develop an account of why complex numeral systems developed and persisted in certain Indo-Aryan languages but not elsewhere. Finally, we demonstrate that Indo-Aryan numeral systems adhere to certain general pressures toward efficient communication found cross-linguistically, despite their high complexity. We call for this somewhat overlooked dimension of complexity to be taken seriously when discussing general variation in cross-linguistic numeral systems.


Historical and psycholinguistic perspectives on morphological productivity: A sketch of an integrative approach

Baayen, Harald, Berg, Kristian, Mohamed, Maziyah

arXiv.org Artificial Intelligence

In this study, we approach morphological productivity from two perspectives: a cognitive-computational perspective, and a diachronic perspective zooming in on an actual speaker, Thomas Mann. For developing the first perspective, we make use of a cognitive computational model of the mental lexicon, the discriminative lexicon model. For computational mappings between form and meaning to be productive, in the sense that novel, previously unencountered words, can be understood and produced, there must be systematicities between the form space and the semantic space. If the relation between form and meaning would be truly arbitrary, a model could memorize form and meaning pairings, but there is no way in which the model would be able to generalize to novel test data. For Finnish nominal inflection, Malay derivation, and English compounding, we explore, using the Discriminative Lexicon Model as a computational tool, to trace differences in the degree to which inflectional and word formation patterns are productive. We show that the DLM tends to associate affix-like sublexical units with the centroids of the embeddings of the words with a given affix. For developing the second perspective, we study how the intake and output of one prolific writer, Thomas Mann, changes over time. We show by means of an examination of what Thomas Mann is likely to have read, and what he wrote, that the rate at which Mann produces novel derived words is extremely low. There are far more novel words in his input than in his output. We show that Thomas Mann is less likely to produce a novel derived word with a given suffix the greater the average distance is of the embeddings of all derived words to the corresponding centroid, and discuss the challenges of using speaker-specific embeddings for low-frequency and novel words.


Reviews: Anchor-Free Correlated Topic Modeling: Identifiability and Algorithm

Neural Information Processing Systems

If it was in the paper that F is set equal to the number of classes, I missed it. Please consider adding a sentence giving your explanation for the decrease, in any case. I also appreciate the authors reporting the accuracy of a simple LDA baseline in the author response. In general the author response addressed my major concerns. I discuss particular strengths and areas for improvement below.


GUIDEQ: Framework for Guided Questioning for progressive informational collection and classification

Mishra, Priya, Racha, Suraj, Ponkshe, Kaustubh, Akarsh, Adit, Ramakrishnan, Ganesh

arXiv.org Artificial Intelligence

Question Answering (QA) is an important part of tasks like text classification through information gathering. These are finding increasing use in sectors like healthcare, customer support, legal services, etc., to collect and classify responses into actionable categories. LLMs, although can support QA systems, they face a significant challenge of insufficient or missing information for classification. Although LLMs excel in reasoning, the models rely on their parametric knowledge to answer. However, questioning the user requires domain-specific information aiding to collect accurate information. Our work, GUIDEQ, presents a novel framework for asking guided questions to further progress a partial information. We leverage the explainability derived from the classifier model for along with LLMs for asking guided questions to further enhance the information. This further information helps in more accurate classification of a text. GUIDEQ derives the most significant key-words representative of a label using occlusions. We develop GUIDEQ's prompting strategy for guided questions based on the top-3 classifier label outputs and the significant words, to seek specific and relevant information, and classify in a targeted manner. Through our experimental results, we demonstrate that GUIDEQ outperforms other LLM-based baselines, yielding improved F1-Score through the accurate collection of relevant further information. We perform various analytical studies and also report better question quality compared to our method.


A longitudinal sentiment analysis of Sinophobia during COVID-19 using large language models

Wang, Chen, Chandra, Rohitash

arXiv.org Artificial Intelligence

The COVID-19 pandemic has exacerbated xenophobia, particularly Sinophobia, leading to widespread discrimination against individuals of Chinese descent. Large language models (LLMs) are pre-trained deep learning models used for natural language processing (NLP) tasks. The ability of LLMs to understand and generate human-like text makes them particularly useful for analysing social media data to detect and evaluate sentiments. We present a sentiment analysis framework utilising LLMs for longitudinal sentiment analysis of the Sinophobic sentiments expressed in X (Twitter) during the COVID-19 pandemic. The results show a significant correlation between the spikes in Sinophobic tweets, Sinophobic sentiments and surges in COVID-19 cases, revealing that the evolution of the pandemic influenced public sentiment and the prevalence of Sinophobic discourse. Furthermore, the sentiment analysis revealed a predominant presence of negative sentiments, such as annoyance and denial, which underscores the impact of political narratives and misinformation shaping public opinion. The lack of empathetic sentiment which was present in previous studies related to COVID-19 highlights the way the political narratives in media viewed the pandemic and how it blamed the Chinese community. Our study highlights the importance of transparent communication in mitigating xenophobic sentiments during global crises.


A Lean Transformer Model for Dynamic Malware Analysis and Detection

Quertier, Tony, Marais, Benjamin, Barrué, Grégoire, Morucci, Stéphane, Azé, Sévan, Salladin, Sébastien

arXiv.org Artificial Intelligence

Malware is a fast-growing threat to the modern computing world and existing lines of defense are not efficient enough to address this issue. This is mainly due to the fact that many prevention solutions rely on signature-based detection methods that can easily be circumvented by hackers. Therefore, there is a recurrent need for behavior-based analysis where a suspicious file is ran in a secured environment and its traces are collected to reports for analysis. Previous works have shown some success leveraging Neural Networks and API calls sequences extracted from these execution reports. Recently, Large Language Models and Generative AI have demonstrated impressive capabilities mainly in Natural Language Processing tasks and promising applications in the cybersecurity field for both attackers and defenders. In this paper, we design an Encoder-Only model, based on the Transformers architecture, to detect malicious files, digesting their API call sequences collected by an execution emulation solution. We are also limiting the size of the model architecture and the number of its parameters since it is often considered that Large Language Models may be overkill for specific tasks such as the one we are dealing with hereafter. In addition to achieving decent detection results, this approach has the advantage of reducing our carbon footprint by limiting training and inference times and facilitating technical operations with less hardware requirements. We also carry out some analysis of our results and highlight the limits and possible improvements when using Transformers to analyze malicious files.


T-FREE: Tokenizer-Free Generative LLMs via Sparse Representations for Memory-Efficient Embeddings

Deiseroth, Björn, Brack, Manuel, Schramowski, Patrick, Kersting, Kristian, Weinbach, Samuel

arXiv.org Artificial Intelligence

Tokenizers are crucial for encoding information in Large Language Models, but their development has recently stagnated, and they contain inherent weaknesses. Major limitations include computational overhead, ineffective vocabulary use, and unnecessarily large embedding and head layers. Additionally, their performance is biased towards a reference corpus, leading to reduced effectiveness for underrepresented languages. To remedy these issues, we propose T-FREE, which directly embeds words through sparse activation patterns over character triplets, and does not require a reference corpus. T-FREE inherently exploits morphological similarities and allows for strong compression of embedding layers. In our exhaustive experimental evaluation, we achieve competitive downstream performance with a parameter reduction of more than 85% on these layers. Further, T-FREE shows significant improvements in cross-lingual transfer learning.


Promoting Constructive Deliberation: Reframing for Receptiveness

Kambhatla, Gauri, Lease, Matthew, Rajadesingan, Ashwin

arXiv.org Artificial Intelligence

To promote constructive discussion of controversial topics online, we propose automatic reframing of disagreeing responses to signal receptiveness to a preceding comment. Drawing on research from psychology, communications, and linguistics, we identify six strategies for reframing. We automatically reframe replies to comments according to each strategy, using a Reddit dataset. Through human-centered experiments, we find that the replies generated with our framework are perceived to be significantly more receptive than the original replies and a generic receptiveness baseline. We illustrate how transforming receptiveness, a particular social science construct, into a computational framework, can make LLM generations more aligned with human perceptions. We analyze and discuss the implications of our results, and highlight how a tool based on our framework might be used for more teachable and creative content moderation.


Word frequency and sentiment analysis of twitter messages during Coronavirus pandemic

Rajput, Nikhil Kumar, Grover, Bhavya Ahuja, Rathi, Vipin Kumar, Bansal, Riya

arXiv.org Artificial Intelligence

The COVID-19 epidemic has had a great impact on social media conversation, especially on sites like Twitter, which has emerged as a hub for public reaction and information sharing. This paper deals by analyzing a vast dataset of Twitter messages related to this disease, starting from January 2020. Two approaches were used: a statistical analysis of word frequencies and a sentiment analysis to gauge user attitudes. Word frequencies are modeled using unigrams, bigrams, and trigrams, with power law distribution as the fitting model. The validity of the model is confirmed through metrics like Sum of Squared Errors (SSE), R-squared ($R^2$), and Root Mean Squared Error (RMSE). High $R^2$ and low SSE/RMSE values indicate a good fit for the model. Sentiment analysis is conducted to understand the general emotional tone of Twitter users messages. The results reveal that a majority of tweets exhibit neutral sentiment polarity, with only 2.57\% expressing negative polarity.