Multilingual parallel text corpora provide a powerful means for propagating linguistic knowledge across languages. We present a model which jointly learns linguistic structure for each language while inducing links between them. Our model supports fully symmetrical knowledge transfer, utilizing any combination of supervised and unsupervised data across language barriers. The proposed nonparametric Bayesian model effectively combines cross-lingual alignment with target language predictions. This architecture is a potent alternative to projection methods which decompose these decisions into two separate stages. We apply this approach to the task of morphological segmentation, where the goal is to separate a word into its individual morphemes. When tested on a parallel corpus of Hebrew and Arabic, our joint bilingual model effectively incorporates all available evidence from both languages, yielding significant performance gains.
Lee, Sungjin (Pohang University of Science and Technology (POSTECH)) | Noh, Hyungjong (Pohang University of Science and Technology (POSTECH)) | Lee, Kyusong (Pohang University of Science and Technology (POSTECH)) | Lee, Gary Geunbae (Pohang University of Science and Technology (POSTECH))
The demand for computer-assisted language learning systems that can provide corrective feedback on language learners’ speaking has increased. However, it is not a trivial task to detect grammatical errors in oral conversations because of the unavoidable errors of automatic speech recognition systems. To provide corrective feedback, a novel method to detect grammatical errors in speaking performance is proposed. The proposed method consists of two sub-models: the grammaticality-checking model and the error-type classification model. We automatically generate grammatical errors that learners are likely to commit and construct error patterns based on the articulated errors. When a particular speech pattern is recognized, the grammaticality-checking model performs a binary classification based on the similarity between the error patterns and the recognition result using the confidence score. The error-type classification model chooses the error type based on the most similar error pattern and the error frequency extracted from a learner corpus. The grammaticality checking method largely outperformed the two comparative models by 56.36% and 42.61% in F-score while keeping the false positive rate very low. The error-type classification model exhibited very high performance with a 99.6% accuracy rate. Because high precision and a low false positive rate are important criteria for the language-tutoring setting, the proposed method will be helpful for intelligent computer-assisted language learning systems.
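The two-stage decision described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the error patterns, the frequency table, the token-level similarity from `difflib.SequenceMatcher`, the rule of multiplying similarity by the recognizer confidence, and the `threshold` value are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical error patterns (token sequence, error type); illustrative only.
ERROR_PATTERNS = [
    ("he go to school".split(), "subject-verb agreement"),
    ("i have went home".split(), "verb form"),
    ("she is teacher".split(), "missing article"),
]

# Hypothetical relative error frequencies, standing in for learner-corpus counts.
ERROR_FREQ = {
    "subject-verb agreement": 0.5,
    "verb form": 0.3,
    "missing article": 0.2,
}


def similarity(a, b):
    """Token-level similarity in [0, 1] (an assumed measure, not the paper's)."""
    return SequenceMatcher(None, a, b).ratio()


def check(recognized, confidence, threshold=0.8):
    """Return (is_error, error_type) for one recognition result.

    Stage 1 (grammaticality check): flag an error when the best pattern
    similarity, scaled by the recognizer confidence, reaches the threshold.
    Stage 2 (error-type classification): pick the type of the most similar
    pattern, breaking ties by the assumed corpus frequency.
    """
    tokens = recognized.lower().split()
    scored = [(similarity(tokens, pat), etype) for pat, etype in ERROR_PATTERNS]
    best = max(score for score, _ in scored)
    if best * confidence < threshold:
        return False, None
    candidates = [etype for score, etype in scored if score == best]
    return True, max(candidates, key=lambda e: ERROR_FREQ[e])


flagged = check("he go to school", confidence=0.95)
low_confidence = check("he go to school", confidence=0.5)
grammatical = check("he goes to school", confidence=0.95)
```

Scaling by the confidence score means that a low-confidence recognition is not flagged even when it matches an error pattern exactly, which is one simple way to keep the false positive rate low.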
Elections are a vital part of democracy, allowing people to vote for the candidate they think can best lead the country. A candidate's campaign aims to demonstrate to the public why that candidate is the best choice. However, in this age of constant media coverage and digital communications, the candidate is scrutinized at every step. A single misquote or negative news story about a candidate can be the difference between winning and losing the election. It therefore becomes crucial to have a public relations manager who can guide and direct the candidate's campaign by prioritizing specific campaign activities. One critical aspect of the PR manager's work is to understand the public perception of the candidate and to improve public sentiment about the candidate.
Political discourse in the United States is becoming increasingly polarized. This polarization frequently causes different communities to react very differently to the same news events. Political blogs, as a form of social media, provide a unique insight into this phenomenon. We present a multitarget, semisupervised latent variable model, MCR-LDA, that models this process by jointly analyzing political blog posts and their comment sections from different political communities to predict the degree of polarization that news topics cause. Inspecting the model after inference reveals topics and the degree to which each triggers polarization. In this approach, community responses to news topics are observed through sentiment polarity and comment volume, the latter serving as a proxy for the level of interest in the topic. In this context, we also present computational methods to assign sentiment polarity to the comments; these polarities serve as targets for latent variable models that predict polarity from the topics in the blog content. Our results show that jointly modeling communities with different political beliefs using MCR-LDA does not sacrifice accuracy in sentiment polarity prediction compared with approaches tailored to specific communities, and additionally provides a view of the polarization in responses from the different communities.
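The response variables named in this abstract (sentiment polarity and comment volume per community) can be illustrated with a small sketch. The abstract does not say how polarity is assigned, so the toy lexicon, the function names, and the definition of polarization as the gap between community mean polarities are all assumptions, and the topic model itself is not shown.

```python
# Hypothetical toy lexicon; not the method used in the paper.
POSITIVE = {"great", "good", "agree", "support"}
NEGATIVE = {"bad", "terrible", "disagree", "wrong"}


def comment_polarity(comment):
    """Return +1, 0, or -1 from lexicon hit counts (assumed scheme)."""
    tokens = comment.lower().split()
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return (pos > neg) - (pos < neg)


def response_targets(comments):
    """Summarize one community's response to a post: mean polarity and volume."""
    volume = len(comments)
    if volume == 0:
        return {"polarity": 0.0, "volume": 0}
    polarity = sum(comment_polarity(c) for c in comments) / volume
    return {"polarity": polarity, "volume": volume}


def polarization(comments_by_community):
    """Assumed polarization score: spread of mean polarity across communities."""
    means = [response_targets(c)["polarity"] for c in comments_by_community.values()]
    if not means:
        return 0.0
    return max(means) - min(means)


responses = {
    "community_a": ["great post", "i agree"],
    "community_b": ["this is wrong", "terrible"],
}
targets_a = response_targets(responses["community_a"])
gap = polarization(responses)
```

In this toy setup the per-community `polarity` and `volume` values play the role of the observed targets that a supervised topic model would be trained to predict from the blog text.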
Detecting and aggregating sentiments toward people, organizations, and events expressed in unstructured social media have become critical text mining operations. Early systems detected sentiments over whole passages, whereas more recently, target-specific sentiments have been of greater interest. In this paper, we present MTTDSC, a multi-task target-dependent sentiment classification system that is informed by feature representation learnt for the related auxiliary task of passage-level sentiment classification. The auxiliary task uses a gated recurrent unit (GRU) and pools GRU states, followed by an auxiliary fully-connected layer that outputs passage-level predictions. In the main task, these GRUs contribute auxiliary per-token representations over and above word embeddings. The main task has its own, separate GRUs. The auxiliary and main GRUs send their states to a different fully connected layer, trained for the main task. Extensive experiments using two auxiliary datasets and three benchmark datasets (of which one is new, introduced by us) for the main task demonstrate that MTTDSC outperforms state-of-the-art baselines. Using word-level sensitivity analysis, we present anecdotal evidence that prior systems can make incorrect target-specific predictions because they miss sentiments expressed by words independent of target.
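The data flow described in this abstract can be sketched in plain Python. This is a rough, untrained illustration rather than the MTTDSC implementation: the layer sizes, random weights, mean pooling, concatenation of auxiliary states onto word embeddings, and the pooling of concatenated main and auxiliary states before the main classifier are assumptions, and the way the target is encoded is omitted because the abstract does not specify it.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def softmax(v):
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pool(states):
    return [sum(col) / len(states) for col in zip(*states)]


class GRU:
    """Single-layer GRU with random, untrained weights (biases omitted)."""

    def __init__(self, in_dim, hid_dim, rng):
        self.hid_dim = hid_dim
        self.Wz = rand_matrix(hid_dim, in_dim + hid_dim, rng)  # update gate
        self.Wr = rand_matrix(hid_dim, in_dim + hid_dim, rng)  # reset gate
        self.Wh = rand_matrix(hid_dim, in_dim + hid_dim, rng)  # candidate state

    def step(self, x, h):
        z = [sigmoid(a) for a in matvec(self.Wz, x + h)]
        r = [sigmoid(a) for a in matvec(self.Wr, x + h)]
        gated = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(a) for a in matvec(self.Wh, x + gated)]
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]

    def run(self, xs):
        h = [0.0] * self.hid_dim
        states = []
        for x in xs:
            h = self.step(x, h)
            states.append(h)
        return states


def forward(embeddings, aux_gru, main_gru, W_aux, W_main):
    # Auxiliary task: GRU over tokens, pooled states, passage-level prediction.
    aux_states = aux_gru.run(embeddings)
    passage_probs = softmax(matvec(W_aux, mean_pool(aux_states)))
    # Main task: auxiliary states extend each token's word embedding.
    main_inputs = [e + a for e, a in zip(embeddings, aux_states)]
    main_states = main_gru.run(main_inputs)
    # Both GRUs feed a separate fully connected layer for the main task.
    joint = [m + a for m, a in zip(main_states, aux_states)]
    target_probs = softmax(matvec(W_main, mean_pool(joint)))
    return passage_probs, target_probs


rng = random.Random(0)
EMB, HID, CLASSES = 4, 3, 3  # illustrative sizes only
aux_gru = GRU(EMB, HID, rng)
main_gru = GRU(EMB + HID, HID, rng)
W_aux = rand_matrix(CLASSES, HID, rng)
W_main = rand_matrix(CLASSES, 2 * HID, rng)
sentence = [[rng.uniform(-1, 1) for _ in range(EMB)] for _ in range(5)]
passage_probs, target_probs = forward(sentence, aux_gru, main_gru, W_aux, W_main)
```

The point of the sketch is the wiring: the auxiliary GRU is used twice, once for its own passage-level head and once as an extra per-token feature source for the main task, while the two output layers stay separate.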