In recent times, code-mixing has become prevalent in social networking as people communicate in multiple languages. This is become a trend and is significantly popular especially in multilingual countries. This has led to the generation of large code-mixed text having useful topics of information dispersed. However, it is very challenging as the code-mixed social media text suffers from its associated linguistic complexities. The main focus of this work is discovery of latent topics indicating useful information from code-mixed social media text overcoming the barriers of random language switch. We evaluate the resulting topic aspect clusters on standard lexical semantic evaluation tasks and show that our method produces substantially better semantic representations than code-mixed counter parts.
With a sharp rise in fluency and users of "Hinglish" in linguistically diverse country, India, it has increasingly become important to analyze social content written in this language in platforms such as Twitter, Reddit, Facebook. This project focuses on using deep learning techniques to tackle a classification problem in categorizing social content written in Hindi-English into Abusive, Hate-Inducing and Not offensive categories. We utilize bi-directional sequence models with easy text augmentation techniques such as synonym replacement, random insertion, random swap, and random deletion to produce a state of the art classifier that outperforms the previous work done on analyzing this dataset.
In order for our computer systems to be more human-like, with a higher emotional quotient, they need to be able to process and understand intrinsic human language phenomena like humour. In this paper, we consider a subtype of humour - puns, which are a common type of wordplay-based jokes. In particular, we consider code-mixed puns which have become increasingly mainstream on social media, in informal conversations and advertisements and aim to build a system which can automatically identify the pun location and recover the target of such puns. We first study and classify code-mixed puns into two categories namely intra-sentential and intra-word, and then propose a four-step algorithm to recover the pun targets for puns belonging to the intra-sentential category. Our algorithm uses language models, and phonetic similarity-based features to get the desired results. We test our approach on a small set of code-mixed punning advertisements, and observe that our system is successfully able to recover the targets for 67% of the puns.
Kumar, Upendra (Indian Institute of Information Technology, Sri City, AP) | Singh, Vishal (Indian Institute of Information Technology, Sri City, AP) | Andrew, Chris (Indian Institute of Information Technology, Sri City, AP) | Reddy, Santhoshini (Indian Institute of Information Technology, Sri City, AP) | Das, Amitava (Indian Institute of Information Technology, Sri City, AP)
In this research work, we develop a state-of-art model for identifying sentiment in Hindi-English code-mixed language. We introduce new phonemic sub-word units for Hindi-English code-mixed text along with a hierarchical deep learning model which uses these sub-word units for predicting sentiment. The results indicate that the model yields a significant increase in accuracy as compared to other models.
State-of-the-art unsupervised multilingual models (e.g., multilingual BERT) have been shown to generalize in a zero-shot cross-lingual setting. This generalization ability has been attributed to the use of a shared subword vocabulary and joint training across multiple languages giving rise to deep multilingual abstractions. We evaluate this hypothesis by designing an alternative approach that transfers a monolingual model to new languages at the lexical level. More concretely, we first train a transformer-based masked language model on one language, and transfer it to a new language by learning a new embedding matrix with the same masked language modeling objective--freezing parameters of all other layers. This approach does not rely on a shared vocabulary or joint training. However, we show that it is competitive with multilingual BERT on standard cross-lingual classification benchmarks and on a new Cross-lingual Question Answering Dataset (XQuAD). Our results contradict common beliefs of the basis of the generalization ability of multilingual models and suggest that deep monolingual models learn some abstractions that generalize across languages. We also release XQuAD as a more comprehensive cross-lingual benchmark, which comprises 240 paragraphs and 1190 question-answer pairs from SQuAD v1.1 translated into ten languages by professional translators.