AITopics | Statistical Learning

Statistical Learning

Distantly Labeling Data for Large Scale Cross-Document Coreference

Singh, Sameer, Wick, Michael, McCallum, Andrew

arXiv.org Artificial Intelligence, May-24-2010

Cross-document coreference, the problem of resolving entity mentions across multi-document collections, is crucial to automated knowledge base construction and data mining tasks. However, the scarcity of large labeled data sets has hindered supervised machine learning research for this task. In this paper, we develop and demonstrate an approach based on "distantly-labeling" a data set from which we can train a discriminative cross-document coreference model. In particular, we build a dataset of more than a million people mentions extracted from 3.5 years of New York Times articles and leverage Wikipedia for distant labeling with a generative model (and measure the reliability of such labeling); we then train and evaluate a conditional random field coreference model that has factors on cross-document entities as well as mention-pairs. This coreference model obtains high accuracy in resolving mentions and entities that are not present in the training data, indicating applicability to non-Wikipedia data. Given the large amount of data, our work is also an exercise demonstrating the scalability of our approach.

coreference, expert system, text processing, (20 more...)

arXiv.org Artificial Intelligence

1005.4298

Country:

North America > United States > Massachusetts > Hampshire County > Amherst (0.14)
North America > United States > California > San Francisco County > San Francisco (0.14)

Industry:

Media > Film (0.46)
Leisure & Entertainment > Sports (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Expert Systems (0.68)
(2 more...)

High-dimensional variable selection for Cox's proportional hazards model

Fan, Jianqing, Feng, Yang, Wu, Yichao

arXiv.org Machine Learning, May-19-2010

Variable selection in high dimensional space has challenged many contemporary statistical problems from many frontiers of scientific disciplines. Recent technological advances have made it possible to collect a huge amount of covariate information such as microarray, proteomic and SNP data via bioimaging technology while observing survival information on patients in clinical studies. Thus, the same challenge applies to survival analysis in order to understand the association between genomics information and clinical information about the survival time. In this work, we extend the sure screening procedure of Fan and Lv (2008) to Cox's proportional hazards model, with an iterative version available. Numerical simulation studies have shown encouraging performance of the proposed method in comparison with other techniques such as LASSO. This demonstrates the utility and versatility of the iterative sure independent screening scheme.
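
As a reading aid, the screening idea can be written down in terms of the Cox partial log-likelihood. The notation below is an assumption made for illustration and does not come from the abstract: \delta_i is the event indicator, y_i the observed time, R(y_i) the risk set at y_i, and u_m a per-covariate marginal utility. This is a sketch of one common way to state the idea, not necessarily the exact formulation used by the authors.

```latex
% Cox partial log-likelihood for coefficient vector \beta
\ell(\beta) = \sum_{i=1}^{n} \delta_i \Big[ x_i^\top \beta
  - \log \sum_{j \in R(y_i)} \exp\big(x_j^\top \beta\big) \Big]

% Marginal utility of covariate m: fit each covariate on its own
u_m = \max_{\beta_m} \sum_{i=1}^{n} \delta_i \Big[ x_{im}\beta_m
  - \log \sum_{j \in R(y_i)} \exp\big(x_{jm}\beta_m\big) \Big]
```

Under this reading, covariates would be ranked by u_m and only the top-ranked ones passed on to a penalized fit; an iterative variant would repeat the ranking conditional on the covariates already selected.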

health & medicine, optimization problem, proportional hazard model, (20 more...)

arXiv.org Machine Learning

1002.3315

Country: North America > United States > North Carolina (0.14)

Genre:

Research Report > New Finding (0.66)
Research Report > Experimental Study (0.66)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Therapeutic Area > Oncology (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.89)
Information Technology > Biomedical Informatics > Translational Bioinformatics (0.88)

Graph-Structured Multi-task Regression and an Efficient Optimization Method for General Fused Lasso

Chen, Xi, Kim, Seyoung, Lin, Qihang, Carbonell, Jaime G., Xing, Eric P.

arXiv.org Machine Learning, May-19-2010

We consider the problem of learning a structured multi-task regression, where the output consists of multiple responses that are related by a graph and the correlated response variables are dependent on the common inputs in a sparse but synergistic manner. Previous methods such as l1/l2-regularized multi-task regression assume that all of the output variables are equally related to the inputs, although in many real-world problems, outputs are related in a complex manner. In this paper, we propose graph-guided fused lasso (GFlasso) for structured multi-task regression that exploits the graph structure over the output variables. We introduce a novel penalty function based on a fusion penalty to encourage highly correlated outputs to share a common set of relevant inputs. In addition, we propose a simple yet efficient proximal-gradient method for optimizing GFlasso that can also be applied to any optimization problem with a convex smooth loss and the general class of fusion penalties defined on arbitrary graph structures. By exploiting the structure of the non-smooth "fusion penalty", our method achieves a faster convergence rate than the standard first-order method, the sub-gradient method, and is significantly more scalable than the widely adopted second-order cone-programming and quadratic-programming formulations. In addition, we provide an analysis of the consistency property of the GFlasso model. Experimental results not only demonstrate the superiority of GFlasso over the standard lasso but also show the efficiency and scalability of our proximal-gradient method.
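
To make the shape of such a penalty concrete, here is a minimal sketch of a graph-guided fused lasso style objective in plain Python. Everything specific in it is an assumption for illustration: the function name `gflasso_objective`, the 1/2 factor on the squared loss, the edge format `(m, l, r)`, and the choice of `|r|` as the edge weight are not stated in the abstract.

```python
def gflasso_objective(X, Y, B, edges, lam, gamma):
    """Evaluate a GFlasso-style objective (illustrative sketch only).

    X: n x J inputs, Y: n x K outputs, B: J x K coefficients (lists of lists).
    edges: list of (m, l, r) with output indices m, l and correlation r
           (hypothetical format; edge weight is assumed to be |r|).
    lam, gamma: weights of the lasso and fusion terms.
    """
    n, J, K = len(X), len(B), len(B[0])

    # Smooth part: half the squared error summed over samples and outputs.
    loss = 0.0
    for i in range(n):
        for k in range(K):
            pred = sum(X[i][j] * B[j][k] for j in range(J))
            loss += (Y[i][k] - pred) ** 2
    loss *= 0.5

    # Lasso part: encourages sparsity in every coefficient.
    lasso = sum(abs(B[j][k]) for j in range(J) for k in range(K))

    # Fusion part: for each edge, pull the coefficient columns of the two
    # outputs together (or toward opposite signs when r is negative).
    fusion = 0.0
    for m, l, r in edges:
        sign = 1.0 if r >= 0 else -1.0
        fusion += abs(r) * sum(abs(B[j][m] - sign * B[j][l]) for j in range(J))

    return loss + lam * lasso + gamma * fusion
```

The fusion term is zero exactly when connected outputs have identical (or sign-flipped) coefficient columns, which is how a penalty of this kind would encourage correlated outputs to share relevant inputs.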

artificial intelligence, health & medicine, optimization problem, (17 more...)

arXiv.org Machine Learning

1005.3579

Country: North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.14)

Genre: Research Report (0.82)

Industry:

Health & Medicine > Therapeutic Area (0.93)
Health & Medicine > Pharmaceuticals & Biotechnology (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (1.00)

A Ranking Based Model for Automatic Image Annotation in a Social Network

Denoyer, Ludovic (University Pierre et Marie Curie - LIP6) | Gallinari, Patrick (University Pierre et Marie Curie - LIP6)

AAAI Conferences, May-17-2010

We propose a relational ranking model for learning to tag images in social media sharing systems. This model learns to associate a ranked list of tags with unlabeled images by simultaneously considering content information (visual or textual) and relational information among the images. It is able to handle implicit relations like content similarities, and explicit ones like friendship or authorship. The model itself is based on a transductive algorithm that learns from both labeled and unlabeled data. Experiments on a real corpus extracted from Flickr show the effectiveness of this model.

artificial intelligence, relation, social media, (16 more...)

AAAI Conferences

Fourth International AAAI Conference on Weblogs and Social Media

Country:

North America > United States (0.14)
Europe > France (0.14)

Industry: Information Technology > Services (0.74)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.49)

Predicting the Speed, Scale, and Range of Information Diffusion in Twitter

Yang, Jiang (University of Michigan) | Counts, Scott (Microsoft Research)

AAAI Conferences, May-17-2010

We present results of network analyses of information diffusion on Twitter, via users’ ongoing social interactions as denoted by “@username” mentions. Incorporating survival analysis, we constructed a novel model to capture the three major properties of information diffusion: speed, scale, and range. On the whole, we find that some properties of the tweets themselves predict greater information propagation, but that properties of the users, in particular the rate at which a user has historically been mentioned, are equal or stronger predictors. Implications for end users and system designers are discussed.

artificial intelligence, social media, tweet, (17 more...)

AAAI Conferences

Fourth International AAAI Conference on Weblogs and Social Media

Country: North America > United States > Michigan (0.14)

Genre:

Research Report > New Finding (0.68)
Research Report > Experimental Study (0.46)

Industry:

Information Technology > Services (0.70)
Health & Medicine (0.55)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.34)

Discovering Serendipitous Information from Wikipedia by Using Its Network Structure

Noda, Yohei (University of Tokyo) | Kiyota, Yoji (University of Tokyo) | Nakagawa, Hiroshi (University of Tokyo)

AAAI Conferences, May-17-2010

Many researchers have conducted studies on extracting relevant information from web documents. However, there are few studies on extracting serendipitous information. We propose methods to discover unexpected information from Wikipedia by using its network structure, for example, the distance between two categories. We evaluated two methods: a classification-based method using support vector machines (SVMs), and a ranking-based method using regression. We demonstrate the advantages of regression over classification.

artificial intelligence, discovering serendipitous information, social media, (3 more...)

AAAI Conferences

Fourth International AAAI Conference on Weblogs and Social Media

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.53)

What’s Worthy of Comment? Content and Comment Volume in Political Blogs

Yano, Tae (Carnegie Mellon University) | Smith, Noah A. (Carnegie Mellon University)

AAAI Conferences, May-17-2010

What makes a blog post noteworthy? One measure of the popularity or breadth of interest of a blog post is the extent to which readers of the blog are inspired to leave comments on the post. In this paper, we study the relationship between the text contents of a blog post and the volume of response it will receive from blog readers. Modeling this relationship has the potential to reveal the interests of a blog's readership community to its authors, readers, advertisers, and scientists

In research on blog data, comments are often ignored, and it is easy to see why: comments are very noisy, full of nonstandard grammar and spelling, usually unedited, often cryptic and uninformative, at least to those outside the blog's community. A few studies have focused on information in comments. Mishne and Glance (2006) showed the value of comments in characterizing the social repercussions of a post, including popularity and controversy. Their large-scale user study correlated popularity and comment activity.

prediction, social media, us government, (18 more...)

AAAI Conferences

Fourth International AAAI Conference on Weblogs and Social Media

Country:

Asia > Middle East (0.69)
North America > United States (0.68)

Industry: Government > Regional Government (0.46)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.69)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.69)
(2 more...)

To Be a Star Is Not Only Metaphoric: From Popularity to Social Linkage

Stoica, Alina Mihaela (Orange Labs and LIAFA, University Paris 7) | Couronne, Thomas (Orange Labs) | Beuscart, Jean-Samuel (Orange Labs)

AAAI Conferences, May-17-2010

The emergence of online platforms that mix self-publishing activities and social networking offers new possibilities for building online reputation and visibility. In this paper we present a method to analyze online popularity that takes into consideration both the success of the published content and the social network topology. First, we adapt Kohonen self-organizing maps in order to cluster the users of online platforms depending on their audience and authority characteristics. Then, we perform a detailed analysis of the manner in which nodes are organized in the social network. Finally, we study the relationship between the local network structure around each node and the corresponding user’s popularity. We apply this method to the MySpace music social network. We observe that the most popular artists are centers of star-shaped social structures and that there exists a fraction of artists who are involved in community and social activity dynamics independently of their popularity. This method, based on a learning algorithm and on network analysis, appears to be a robust and intuitive technique for a rich description of online behavior.

artificial intelligence, social media, vertex, (19 more...)

AAAI Conferences

Fourth International AAAI Conference on Weblogs and Social Media

Country: North America > United States > New York (0.14)

Industry: Information Technology > Services (0.77)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.94)

Effective Question Recommendation Based on Multiple Features for Question Answering Communities

Kabutoya, Yutaka (NTT Cyber Solutions Laboratories, NTT Corporation) | Iwata, Tomoharu (NTT Cyber Solutions Laboratories, NTT Corporation) | Shiohara, Hisako (NTT Cyber Solutions Laboratories, NTT Corporation) | Fujimura, Ko (NTT Cyber Solutions Laboratories, NTT Corporation)

AAAI Conferences, May-17-2010

We propose a new method of recommending questions to answerers so as to suit the answerers’ knowledge and interests in User-Interactive Question Answering (QA) communities. A question recommender can help answerers select the questions that interest them. This increases the number of answers, which will activate QA communities. An effective question recommender should satisfy the following three requirements: First, its accuracy should be higher than the existing category-based approach; more than 50% of answerers select the questions to answer according to a fixed system of categories. Second, it should be able to recommend unanswered questions because more than 2,000 questions are posted every day. Third, it should be able to support even those people who have never answered a question previously, because more than 50% of users in current QA communities have never given any answer. To achieve an effective question recommender, we use question histories as well as the answer histories of each user by combining collaborative filtering schemes and content-based filtering schemes. Experiments on real log data sets of a famous Japanese QA community, Oshiete goo, show that our recommender satisfies the three requirements.

accuracy, artificial intelligence, category, (20 more...)

AAAI Conferences

Fourth International AAAI Conference on Weblogs and Social Media

Country: Asia > Japan > Honshū (0.14)

Genre:

Research Report > New Finding (0.30)
Questionnaire & Opinion Survey (0.30)

Technology:

Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.70)
Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.61)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.49)

Empirical Analysis of User Participation in Online Communities: the Case of Wikipedia

Ciampaglia, Giovanni Luca (Università della Svizzera Italiana) | Vancheri, Alberto (Università della Svizzera Italiana)

AAAI Conferences, May-17-2010

We study the distribution of the activity period of users in five of the largest localized versions of the free, online encyclopedia Wikipedia. We find it to be consistent with a mixture of two truncated log-normal distributions. Using this model, the temporal evolution of these systems can be analyzed, showing that the statistical description is consistent over time.
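
For readers unfamiliar with the model class, the following is a minimal sketch of the density of a mixture of two truncated log-normal distributions. The function names, the shared truncation interval `[lo, hi]`, and the parameterization by `(mu, sigma)` pairs and a mixing weight `w` are assumptions made for illustration; the abstract does not specify how the authors parameterize or fit the model.

```python
import math


def lognorm_pdf(x, mu, sigma):
    """Density of a log-normal with log-mean mu and log-std sigma."""
    if x <= 0:
        return 0.0
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))


def lognorm_cdf(x, mu, sigma):
    """Cumulative distribution function of the same log-normal."""
    if x <= 0:
        return 0.0
    return 0.5 * (1.0 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2.0))))


def truncated_lognorm_pdf(x, mu, sigma, lo, hi):
    """Log-normal density restricted to [lo, hi] and renormalized."""
    if x < lo or x > hi:
        return 0.0
    mass = lognorm_cdf(hi, mu, sigma) - lognorm_cdf(lo, mu, sigma)
    return lognorm_pdf(x, mu, sigma) / mass


def mixture_pdf(x, w, comp1, comp2, lo, hi):
    """Two-component mixture; comp1 and comp2 are (mu, sigma) pairs,
    w is the weight of the first component (0 <= w <= 1)."""
    mu1, s1 = comp1
    mu2, s2 = comp2
    return (w * truncated_lognorm_pdf(x, mu1, s1, lo, hi)
            + (1.0 - w) * truncated_lognorm_pdf(x, mu2, s2, lo, hi))
```

Because each component is renormalized over the truncation interval, the mixture integrates to one on `[lo, hi]` for any weight `w` between 0 and 1.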

artificial intelligence, social media, wikipedia, (18 more...)

AAAI Conferences

Fourth International AAAI Conference on Weblogs and Social Media

Country: North America > United States (0.14)

Genre: Research Report (0.48)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.47)
