

Measuring Conditional Independence by Independent Residuals: Theoretical Results and Application in Causal Discovery

AAAI Conferences

We investigate the relationship between conditional independence (CI) x ⊥ y | Z and the independence of the two residuals x - E(x|Z) ⊥ y - E(y|Z), where x and y are two random variables and Z is a set of random variables. We show that if x, y, and Z are generated by a linear structural equation model and all external influences follow Gaussian distributions, then x ⊥ y | Z if and only if x - E(x|Z) ⊥ y - E(y|Z). That is, the test of x ⊥ y | Z can be relaxed to a simpler unconditional independence test of x - E(x|Z) ⊥ y - E(y|Z). Furthermore, if all these external influences follow non-Gaussian distributions and the model satisfies the structural faithfulness condition, then we still have x ⊥ y | Z ⇔ x - E(x|Z) ⊥ y - E(y|Z). We apply the results above to the causal discovery problem, where the causal directions are generally determined by a set of V-structures and their consistent propagations, so CI-test-based methods can only return a set of Markov equivalence classes. We show that in the linear non-Gaussian context, x - E(x|Z) ⊥ y - E(y|Z) ⇒ x - E(x|Z) ⊥ z or y - E(y|Z) ⊥ z (∀ z ∈ Z) if Z is a minimal d-separator, which implies that z causes x (or y) if z directly connects to x (or y). Therefore, we conclude that CIs carry useful information for distinguishing Markov equivalence classes. In summary, compared with the existing discretization-based and kernel-based CI testing methods, the proposed method provides a simpler way to measure CI, requiring only one unconditional independence test and two regression operations. When applied to causal discovery, it can find more causal relationships, which is validated experimentally.
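
The testing recipe summarized above (two regressions on Z followed by one unconditional independence test on the residuals) can be sketched as follows. The ordinary-least-squares regressions and the Pearson-correlation test are illustrative stand-ins consistent with the linear Gaussian setting; in the non-Gaussian case a nonparametric independence test would be substituted.

```python
import numpy as np
from scipy import stats

def residual_ci_test(x, y, Z, alpha=0.05):
    """Test x ⊥ y | Z via (1) two regressions of x and y on Z and
    (2) one unconditional independence test on the residuals.
    OLS and Pearson correlation are illustrative choices here."""
    Zb = np.column_stack([np.ones(len(x)), Z])          # add intercept column
    rx = x - Zb @ np.linalg.lstsq(Zb, x, rcond=None)[0]  # x - E(x|Z)
    ry = y - Zb @ np.linalg.lstsq(Zb, y, rcond=None)[0]  # y - E(y|Z)
    r, p = stats.pearsonr(rx, ry)                        # unconditional test
    return p > alpha, p   # True -> residuals look independent -> accept x ⊥ y | Z

# toy example: x <- z -> y, so x ⊥ y | z should hold
rng = np.random.default_rng(0)
z = rng.normal(size=2000)
x = 0.8 * z + rng.normal(size=2000)
y = -0.5 * z + rng.normal(size=2000)
print(residual_ci_test(x, y, z.reshape(-1, 1)))
```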


RNN-Based Sequence-Preserved Attention for Dependency Parsing

AAAI Conferences

Recurrent neural networks (RNNs) combined with an attention mechanism have proved useful for various NLP tasks, including machine translation, sequence labeling, and syntactic parsing. The attention mechanism is usually applied by estimating the weights (or importance) of the inputs and taking the weighted sum of the inputs as derived features. Although such features have demonstrated their effectiveness, they may fail to capture sequence information because a simple weighted sum is used to produce them. The order of the words does matter to the meaning and structure of a sentence, especially for syntactic parsing, which aims to recover the structure from a sequence of words. In this study, we propose an RNN-based attention mechanism to capture relevant and sequence-preserved features from a sentence, and use the derived features to perform dependency parsing. We evaluated graph-based and transition-based parsing models enhanced with the RNN-based sequence-preserved attention on both the English PTB and Chinese CTB datasets. The experimental results show that the enhanced systems achieved significant increases in parsing accuracy.
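
A minimal sketch of the contrast drawn above, assuming a PyTorch-style encoder: instead of collapsing the attention-weighted states into a single sum, the weighted states are read in their original order by a small GRU, so the derived feature retains word order. The layer choices are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SequencePreservedAttention(nn.Module):
    """Sketch: attention weights are computed as usual, but the weighted
    encoder states are pooled by a GRU instead of a plain sum."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)                 # attention scorer
        self.rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)

    def forward(self, enc):                                   # enc: (batch, seq, hidden)
        weights = torch.softmax(self.score(enc), dim=1)       # (batch, seq, 1)
        weighted = weights * enc                              # keep per-position states
        # a plain attention feature would be weighted.sum(dim=1);
        # here a GRU reads the weighted sequence so word order still matters
        _, h = self.rnn(weighted)
        return h.squeeze(0)                                   # (batch, hidden)

feat = SequencePreservedAttention(64)(torch.randn(2, 10, 64))
print(feat.shape)   # torch.Size([2, 64])
```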


Neural Networks Incorporating Dictionaries for Chinese Word Segmentation

AAAI Conferences

In recent years, deep neural networks have achieved significant success in Chinese word segmentation and many other natural language processing tasks. Most of these algorithms are end-to-end trainable systems that can effectively process and learn from large-scale labeled datasets. However, these methods typically lack the capability to handle rare words and data whose domain differs from that of the training data. Previous statistical methods have demonstrated that human knowledge can provide valuable information for handling rare cases and domain-shift problems. In this paper, we seek to address the problem of incorporating dictionaries into neural networks for the Chinese word segmentation task. Two different methods that extend the bi-directional long short-term memory network are proposed to perform the task. To evaluate the performance of the proposed methods, state-of-the-art supervised models and domain adaptation approaches are compared with our methods on nine datasets from different domains. The experimental results demonstrate that the proposed methods can achieve better performance than other state-of-the-art neural network methods and domain adaptation approaches in most cases.
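
One simple way dictionary knowledge can be exposed to a character-level segmenter is as per-character indicator features that are concatenated with the character embeddings before the BiLSTM. The feature scheme below (whether a character can start or end a dictionary word of a given length) is a hypothetical illustration, not necessarily either of the paper's two methods.

```python
def dictionary_features(chars, dictionary, max_len=4):
    """For each character, mark whether it can start or end a dictionary
    word of length 2..max_len (illustrative feature scheme)."""
    n = len(chars)
    feats = [[0] * (2 * (max_len - 1)) for _ in range(n)]
    for i in range(n):
        for length in range(2, max_len + 1):
            word = "".join(chars[i:i + length])
            if len(word) == length and word in dictionary:
                feats[i][length - 2] = 1                             # starts a word
                feats[i + length - 1][max_len - 1 + length - 2] = 1  # ends a word
    return feats   # concatenate these with character embeddings downstream

dico = {"中国", "国人", "中国人"}
print(dictionary_features(list("中国人说"), dico))
```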


Meta Multi-Task Learning for Sequence Modeling

AAAI Conferences

Semantic composition functions play a pivotal role in neural representation learning of text sequences. In spite of their success, most existing models suffer from an underfitting problem: they apply the same shared composition function at every position in the sequence, and thus lack the expressive power to capture the richness of compositionality. Moreover, the composition functions of different tasks are independent of each other and learned from scratch. In this paper, we propose a new scheme for sharing the composition function across multiple tasks. Specifically, we use a shared meta-network to capture the meta-knowledge of semantic composition and to generate the parameters of the task-specific semantic composition models. We conduct extensive experiments on two types of tasks, text classification and sequence tagging, which demonstrate the benefits of our approach. We also show that the shared meta-knowledge learned by the proposed model can be regarded as off-the-shelf knowledge and easily transferred to new tasks.
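
The parameter-generation idea described above can be sketched as a small hypernetwork: a shared meta-network maps a task representation to the weights of a task-specific composition function. The linear composition and layer sizes below are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class MetaComposition(nn.Module):
    """Shared meta-network generates the parameters of a task-specific
    (here, simple linear) composition function from a task embedding."""
    def __init__(self, task_dim, in_dim, out_dim):
        super().__init__()
        self.in_dim, self.out_dim = in_dim, out_dim
        # meta-network: task embedding -> flattened weight matrix + bias
        self.meta = nn.Linear(task_dim, in_dim * out_dim + out_dim)

    def forward(self, x, task_emb):                  # x: (batch, seq, in_dim)
        params = self.meta(task_emb)                 # generated, not stored per task
        W = params[: self.in_dim * self.out_dim].view(self.in_dim, self.out_dim)
        b = params[self.in_dim * self.out_dim:]
        return torch.tanh(x @ W + b)                 # task-specific composition

comp = MetaComposition(task_dim=8, in_dim=32, out_dim=32)
out = comp(torch.randn(4, 12, 32), torch.randn(8))
print(out.shape)   # torch.Size([4, 12, 32])
```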


Incorporating Discriminator in Sentence Generation: a Gibbs Sampling Method

AAAI Conferences

Generating plausible and fluent sentences with desired properties has long been a challenge. Most recent works use recurrent neural networks (RNNs) and their variants to predict the following words given the previous sequence and a target label. In this paper, we propose a novel framework to generate constrained sentences via Gibbs sampling. Candidate sentences are revised and updated iteratively, with sampled new words replacing old ones. Our experiments show the effectiveness of the proposed method in generating plausible and diverse sentences.
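
A minimal sketch of a Gibbs-style revision loop in the spirit of the description above: each position is revisited in turn, candidate replacements are weighted by a language-model term times a discriminator term for the desired property, and a new word is sampled. The scoring functions are assumed black boxes, and the toy demo is not the paper's setup.

```python
import random

def gibbs_revise(sentence, vocab, lm_score, disc_score, sweeps=3, temp=1.0):
    """Iteratively resample each word, weighting candidates by fluency
    (lm_score) times the desired property (disc_score)."""
    words = list(sentence)
    for _ in range(sweeps):
        for i in range(len(words)):
            cands, weights = [], []
            for w in vocab:
                trial = words[:i] + [w] + words[i + 1:]
                cands.append(w)
                weights.append((lm_score(trial) * disc_score(trial)) ** (1 / temp))
            words[i] = random.choices(cands, weights=weights, k=1)[0]
    return words

# toy demo: prefer sentences containing the word "good"
vocab = ["the", "movie", "was", "good", "bad"]
lm = lambda s: 1.0                           # uniform stand-in language model
disc = lambda s: 2.0 if "good" in s else 0.5  # stand-in discriminator
print(gibbs_revise(["the", "movie", "was", "bad"], vocab, lm, disc))
```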


Attention-based Belief or Disbelief Feature Extraction for Dependency Parsing

AAAI Conferences

Existing neural dependency parsers usually encode each word in a sentence with bi-directional LSTMs and estimate the score of an arc from the LSTM representations of the head and the modifier, possibly missing context information relevant to the arc being considered. In this study, we propose a neural feature extraction method that learns to extract arc-specific features. We apply a neural attention method to collect evidence for and against each possible head-modifier pair, from which our model computes certainty scores of belief and disbelief and determines the final arc score by subtracting the disbelief score from the belief score. By explicitly introducing these two kinds of evidence, arc candidates can compete against each other based on more relevant information, especially when they share the same head or modifier. This makes it possible to better discriminate between two or more competing arcs by presenting their rivals (disbelief evidence). Experiments on various datasets show that our arc-specific feature extraction mechanism significantly improves the performance of bi-directional LSTM-based models by explicitly modeling long-distance dependencies. For both English and Chinese, the proposed model achieves higher accuracy on the dependency parsing task than most existing neural attention-based models.
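
A compact sketch of the scoring idea, assuming BiLSTM states are already computed: two attention passes gather evidence for and against a candidate head-modifier pair, and the arc score is belief minus disbelief. The linear parameterization below is an assumption for illustration, not the paper's exact model.

```python
import torch
import torch.nn as nn

class BeliefDisbeliefArcScorer(nn.Module):
    """Arc score = belief(evidence for) - disbelief(evidence against)."""
    def __init__(self, dim):
        super().__init__()
        self.q_for = nn.Linear(2 * dim, dim)       # query for supporting evidence
        self.q_against = nn.Linear(2 * dim, dim)   # query for opposing evidence
        self.belief = nn.Linear(dim, 1)
        self.disbelief = nn.Linear(dim, 1)

    def forward(self, states, head_idx, mod_idx):  # states: (seq_len, dim)
        pair = torch.cat([states[head_idx], states[mod_idx]])
        ev_for = torch.softmax(states @ self.q_for(pair), dim=0) @ states
        ev_against = torch.softmax(states @ self.q_against(pair), dim=0) @ states
        # final arc score: certainty of belief minus certainty of disbelief
        return self.belief(ev_for) - self.disbelief(ev_against)

scorer = BeliefDisbeliefArcScorer(dim=50)
print(scorer(torch.randn(9, 50), head_idx=2, mod_idx=5))
```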


Geometric Relationship between Word and Context Representations

AAAI Conferences

Pre-trained distributed word representations have proven useful in various natural language processing (NLP) tasks. However, the geometric basis of word representations and their relation to the representations of words' contexts has not been carefully studied. In this study, we first investigate this geometric relationship under a general framework abstracted from several typical word representation learning approaches, and find that only the directions of word representations are well associated with their context vector representations, while the magnitudes are not. To make better use of the information contained in the magnitudes of word representations, we propose a hierarchical Gaussian model combined with maximum a posteriori estimation to learn word representations, and extend it to represent polysemous words. Our word representations have been evaluated on multiple NLP tasks, and the experimental results show that the proposed model achieves promising results compared with several popular word representations.
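
The direction-versus-magnitude distinction above can be made concrete with a small decomposition: split each word vector into a unit direction and a norm, and check how the direction alone aligns with the corresponding context vector. This is purely illustrative and is not the paper's hierarchical Gaussian estimation procedure.

```python
import numpy as np

def direction_and_magnitude(word_vecs, context_vecs):
    """Return per-word cosine alignment between word direction and context
    direction, plus the word-vector magnitudes kept as a separate quantity."""
    dirs = word_vecs / np.linalg.norm(word_vecs, axis=1, keepdims=True)
    ctx_dirs = context_vecs / np.linalg.norm(context_vecs, axis=1, keepdims=True)
    cosines = np.sum(dirs * ctx_dirs, axis=1)            # direction alignment
    magnitudes = np.linalg.norm(word_vecs, axis=1)       # kept separately
    return cosines, magnitudes

rng = np.random.default_rng(0)
w = rng.normal(size=(5, 100))
c = w * rng.uniform(0.5, 2.0, size=(5, 1)) + 0.1 * rng.normal(size=(5, 100))
cos, mag = direction_and_magnitude(w, c)
print(np.round(cos, 3), np.round(mag, 2))
```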


Adaptive Co-attention Network for Named Entity Recognition in Tweets

AAAI Conferences

In this study, we investigate the problem of named entity recognition for tweets. Named entity recognition is an important task in natural language processing and has been studied extensively in recent decades. Previous named entity recognition methods usually used only the textual content when processing tweets. However, many tweets contain not only textual content but also images, and such visual information is also valuable for the named entity recognition task. To make full use of both textual and visual information, this paper proposes a novel method to process tweets that contain multimodal information. We extend a bi-directional long short-term memory network with conditional random fields and an adaptive co-attention network to achieve this task. To evaluate the proposed method, we constructed a large-scale labeled dataset of multimodal tweets. Experimental results demonstrate that the proposed method achieves better performance than previous methods in most cases.
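
A simplified sketch of fusing word and image-region features before the CRF layer: attention from each word over the image regions produces a visual context vector, and a gate decides how much of it to mix into the word representation. The single-pass attention and gating below are assumptions, not the paper's exact adaptive co-attention network.

```python
import torch
import torch.nn as nn

class CoAttentionFusion(nn.Module):
    """Fuse per-word visual context into word features via attention + gate."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim, bias=False)   # word -> region query space
        self.gate = nn.Linear(2 * dim, 1)             # how much image to mix in

    def forward(self, words, regions):  # words: (seq, dim), regions: (n_regions, dim)
        attn = torch.softmax(self.proj(words) @ regions.T, dim=1)  # (seq, n_regions)
        visual = attn @ regions                        # image context per word
        g = torch.sigmoid(self.gate(torch.cat([words, visual], dim=-1)))
        return words + g * visual                      # fused features for the CRF

fused = CoAttentionFusion(128)(torch.randn(12, 128), torch.randn(49, 128))
print(fused.shape)   # torch.Size([12, 128])
```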


Community Detection in Attributed Graphs: An Embedding Approach

AAAI Conferences

Community detection is a fundamental and widely studied problem that seeks groups of nodes that are densely connected internally and well separated from the rest of the graph. With the proliferation of rich information available for entities in real-world networks, it is useful to discover communities in attributed graphs, where nodes carry attributes. However, most existing attributed community detection methods directly use the original network topology, which leads to poor results because inherent community structures are ignored. In this paper, we propose a novel embedding-based model to discover communities in attributed graphs. Specifically, based on the observation of densely connected structures within communities, we develop a novel community structure embedding method that encodes inherent community structures via underlying community memberships. Based on node attributes and the community structure embedding, we formulate attributed community detection as a nonnegative matrix factorization optimization problem. Moreover, we carefully design iterative updating rules that ensure a converging solution is found. Extensive experiments conducted on 19 attributed graph datasets with overlapping and non-overlapping ground-truth communities show that our proposed model, CDE, can accurately identify attributed communities and significantly outperforms 7 state-of-the-art methods.
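
The joint-factorization idea can be sketched with standard multiplicative updates: a shared nonnegative membership matrix U approximates both a community-structure matrix S and the attribute matrix X, i.e. we minimize ||S - UV^T||^2 + ||X - UW^T||^2. The objective and updates below are a generic illustration, not the exact CDE formulation.

```python
import numpy as np

def attributed_nmf(S, X, k, iters=200, eps=1e-9):
    """Joint NMF of a community-structure matrix S (n x n) and an attribute
    matrix X (n x d) with a shared membership matrix U (n x k)."""
    n, d = X.shape
    rng = np.random.default_rng(0)
    U, V, W = rng.random((n, k)), rng.random((n, k)), rng.random((d, k))
    for _ in range(iters):
        U *= (S @ V + X @ W) / (U @ (V.T @ V + W.T @ W) + eps)
        V *= (S.T @ U) / (V @ (U.T @ U) + eps)
        W *= (X.T @ U) / (W @ (U.T @ U) + eps)
    return U   # rows give soft community memberships

# toy graph with two obvious communities and matching attributes
S = np.kron(np.eye(2), np.ones((4, 4)))
X = np.kron(np.eye(2), np.ones((4, 2)))
print(attributed_nmf(S, X, k=2).argmax(axis=1))   # two blocks of nodes
```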


Learning Context-Specific Word/Character Embeddings

AAAI Conferences

Unsupervised word representations have demonstrated improvements in predictive generalization on various NLP tasks. Most existing models are in fact better at capturing the relatedness among words than their "genuine" similarity, because context representations are often computed as a sum (or an average) of the neighbors' embeddings. This simplifies the computation but ignores the important fact that the meaning of a word is determined by its context, reflecting not only the surrounding words but also the rules used to combine them (i.e., compositionality). On the other hand, much effort has been devoted to learning a single-prototype representation per word, which is problematic because many words are polysemous, and a single-prototype model cannot capture the phenomena of homonymy and polysemy. We present a neural network architecture that jointly learns word embeddings and context representations from large datasets. The explicitly produced context representations are further used to learn context-specific and multi-prototype word embeddings. Our embeddings were evaluated on several NLP tasks, and the experimental results demonstrate that the proposed model outperforms its competitors and is applicable to intrinsically "character-based" languages.
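
The multi-prototype idea can be illustrated with a minimal selection rule: a word keeps several sense vectors, and the prototype most similar to the current context representation is chosen for that occurrence. The cosine-based selection below is an assumption for illustration, not the paper's training objective.

```python
import numpy as np

def context_specific_vector(prototypes, context_vec):
    """Pick the sense vector whose direction best matches the context."""
    sims = prototypes @ context_vec / (
        np.linalg.norm(prototypes, axis=1) * np.linalg.norm(context_vec) + 1e-9)
    best = int(np.argmax(sims))
    return prototypes[best], best

# toy example: a word with a "finance" sense and a "river" sense
bank = np.array([[1.0, 0.0, 0.2],    # finance prototype
                 [0.0, 1.0, 0.1]])   # river prototype
ctx = np.array([0.1, 0.9, 0.0])      # context built from surrounding words
vec, sense = context_specific_vector(bank, ctx)
print(sense, vec)                    # selects the river sense
```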