AITopics

It is often desirable to extract structured information from raw web pages for better information browsing, query answering, and pattern mining. many such Information Extraction (IE) technologies are costly and applying them at the web-scale is impractical. In this paper, we propose a novel prioritization approach where candidate pages from the corpus are ordered according to their expected contribution to the extraction results and those with higher estimated potential are extracted earlier. Systems employing this approach can stop the extraction process at any time when the resource gets scarce (i.e., not all pages in the corpus can be processed), without worrying about wasting extraction effort on unimportant pages. More specifically, we define a novel notion to measure the value of extraction results and design various mechanisms for estimating a candidate page’s contribution to this value. We further design and build the Extraction Prioritization (EP) system with efficient scoring and scheduling algorithms, and experimentally demonstrate that EP significantly outperforms the naive approach and is more flexible than the classifier approach.

artificial intelligence, information, natural language, (17 more...)

Twenty-Fourth AAAI Conference on Artificial Intelligence

Country:

North America > United States > New York (0.14)
North America > United States > Pennsylvania (0.14)

Industry: Consumer Products & Services > Restaurants (0.47)

Technology:

Information Technology > Communications > Web (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Data Science > Data Mining > Text Mining (0.62)
Information Technology > Artificial Intelligence > Natural Language > Information Extraction (0.62)

Conformal Mapping by Computationally Efficient Methods

Pintilie, Stefan (University of Waterloo) | Ghodsi, Ali (University of Waterloo)

Dimensionality reduction is the process by which a set of data points in a higher dimensional space are mapped to a lower dimension while maintaining certain properties of these points relative to each other. One important property is the preservation of the three angles formed by a triangle consisting of three neighboring points in the high dimensional space. If this property is maintained for those same points in the lower dimensional embedding then the result is a conformal map. However, many of the commonly used nonlinear dimensionality reduction techniques, such as Locally Linear Embedding (LLE) or Laplacian Eigenmaps (LEM), do not produce conformal maps. Post-processing techniques formulated as instances of semi-definite programming (SDP) problems can be applied to the output of either LLE or LEM to produce a conformal map. However, the effectiveness of this approach is limited by the computational complexity of SDP solvers. This paper will propose an alternative post-processing algorithm that produces a conformal map but does not require a solution to a SDP problem and so is more computationally efficient thus allowing it to be applied to a wider selection of datasets. Using this alternative solution, the paper will also propose a new algorithm for 3D object classification. An interesting feature of the 3D classification algorithm is that it is invariant to the scale and the orientation of the surface.

algorithm, artificial intelligence, data mining, (17 more...)

Twenty-Fourth AAAI Conference on Artificial Intelligence

Country: North America > Canada > Ontario (0.14)

Genre: Workflow (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.55)

Zhang, Yi (Carnegie Mellon University)

Multi-Task Active Learning with Output Constraints

Many problems in information extraction, text mining, natural language processing and other fields exhibit the same property: multiple prediction tasks are related in the sense that their outputs (labels) satisfy certain constraints. In this paper, we propose an active learning framework exploiting such relations among tasks. Intuitively, with task outputs coupled by constraints, active learning can utilize not only the uncertainty of the prediction in a single task but also the inconsistency of predictions across tasks. We formalize this idea as a cross-task value of information criteria, in which the reward of a labeling assignment is propagated and measured over all relevant tasks reachable through constraints. A specific example of our framework leads to the cross entropy measure on the predictions of coupled tasks, which generalizes the entropy in the classical single-task uncertain sampling. We conduct experiments on two real-world problems: web information extraction and document classification. Empirical results demonstrate the effectiveness of our framework in actively collecting labeled examples for multiple related tasks.

active learning, artificial intelligence, inductive learning, (22 more...)

Twenty-Fourth AAAI Conference on Artificial Intelligence

Country: North America > United States > Wisconsin (0.14)

Genre: Research Report (0.34)

Technology:

Information Technology > Data Science > Data Mining > Text Mining (0.76)
Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.49)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.47)

Automatic Attribution of Quoted Speech in Literary Narrative

Elson, David K. (Columbia University) | McKeown, Kathleen R. (Columbia University)

We describe a method for identifying the speakers of quoted speech in natural-language textual stories. We have assembled a corpus of more than 3,000 quotations, whose speakers (if any) are manually identified, from a collection of 19th and 20th century literature by six authors. Using rule-based and statistical learning, our method identifies candidate characters, determines their genders, and attributes each quote to the most likely speaker. We divide the quotes into syntactic classes in order to leverage common discourse patterns, which enable rapid attribution for many quotes. We apply learning algorithms to the remainder and achieve an overall accuracy of 83%.

artificial intelligence, category, natural language, (20 more...)

Twenty-Fourth AAAI Conference on Artificial Intelligence

Country: North America > United States (0.28)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Data Science > Data Mining (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.89)

Friedland, Lisa (University of Massachusetts Amherst)

Detecting Social Ties and Copying Events from Afﬁliation Data

AAAI ConferencesJul-12-2010

The goal of my work is to detect implicit social ties or closely-linked entities within a data set. In data consisting of people (or other entities) and their afﬁliations or discrete attributes, we identify unusually similar pairs of people, and we pose the question: Can their similarity be explained by chance, or it is due to a direct (“copying”) relationship between the people? The thesis will explore how to assess this question, and in particular how one’s judgments and conﬁdence depend not only on the two people in question but also on properties of the entire data set. I will provide a framework for solving this problem and experiment with it across multiple synthetic and real-world data sets. My approach requires a model of the copying relationship, a model of independent people, and a method for distinguishing between them. I will focus on two aspects of the problem: (1) choosing background models to ﬁt arbitrary, correlated afﬁliation data, and (2) understanding how the ability to detect copies is affected by factors like data sparsity and the numbers of people and afﬁliations, independent of the ﬁt of the models.

artificial intelligence, data mining, social tie and copying event, (14 more...)

Fifteenth AAAI/SIGART Doctoral Consortium

Country: North America > United States > Massachusetts > Hampshire County > Amherst (0.15)

Technology:

Information Technology > Artificial Intelligence (1.00)
Information Technology > Data Science > Data Mining (0.47)

Nguyen, Huy, Choi, Deokjai

Application of Data Mining to Network Intrusion Detection: Classifier Selection Model

arXiv.org Artificial IntelligenceJul-7-2010

As network attacks have increased in number and severity over the past few years, intrusion detection system (IDS) is increasingly becoming a critical component to secure the network. Due to large volumes of security audit data as well as complex and dynamic properties of intrusion behaviors, optimizing performance of IDS becomes an important open problem that is receiving more and more attention from the research community. The uncertainty to explore if certain algorithms perform better for certain attack classes constitutes the motivation for the reported herein. In this paper, we evaluate performance of a comprehensive set of classifier algorithms using KDD99 dataset. Based on evaluation results, best algorithms for each attack category is chosen and two classifier algorithm selection models are proposed. The simulation result comparison indicates that noticeable performance improvement and real-time intrusion detection can be achieved as we apply the proposed models to detect different kinds of network attacks.

algorithm, decision tree learning, neural network, (17 more...)

1007.1268

Country:

North America > United States (0.14)
North America > Canada (0.14)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)
(5 more...)

Achar, Avinash, Laxman, Srivatsan, Sastry, P. S.

A unified view of Automata-based algorithms for Frequent Episode Discovery

arXiv.org Artificial IntelligenceJul-5-2010

Frequent Episode Discovery framework is a popular framework in Temporal Data Mining with many applications. Over the years many different notions of frequencies of episodes have been proposed along with different algorithms for episode discovery. In this paper we present a unified view of all such frequency counting algorithms. We present a generic algorithm such that all current algorithms are special cases of it. This unified view allows one to gain insights into different frequencies and we present quantitative relationships among different frequencies. Our unified view also helps in obtaining correctness proofs for various algorithms as we show here. We also point out how this unified view helps us to consider generalization of the algorithm so that they can discover episodes with general partial orders.

algorithm, artificial intelligence, data mining, (19 more...)

1007.069

Country: Asia > India (0.14)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Machine Learning (0.93)

Shojaie, Ali, Michailidis, George

Discovering Graphical Granger Causality Using the Truncating Lasso Penalty

arXiv.org Machine LearningJul-3-2010

Components of biological systems interact with each other in order to carry out vital cell functions. Such information can be used to improve estimation and inference, and to obtain better insights into the underlying cellular mechanisms. Discovering regulatory interactions among genes is therefore an important problem in systems biology. Whole-genome expression data over time provides an opportunity to determine how the expression levels of genes are affected by changes in transcription levels of other genes, and can therefore be used to discover regulatory interactions among genes. In this paper, we propose a novel penalization method, called truncating lasso, for estimation of causal relationships from time-course gene expression data. The proposed penalty can correctly determine the order of the underlying time series, and improves the performance of the lasso-type estimators. Moreover, the resulting estimate provides information on the time lag between activation of transcription factors and their effects on regulated genes. We provide an efficient algorithm for estimation of model parameters, and show that the proposed method can consistently discover causal relationships in the large $p$, small $n$ setting. The performance of the proposed model is evaluated favorably in simulated, as well as real, data examples. The proposed truncating lasso method is implemented in the R-package grangerTlasso and is available at http://www.stat.lsa.umich.edu/~shojaie.

artificial intelligence, health & medicine, machine learning, (16 more...)

arXiv.org Machine Learning

1007.0499

Country: North America > United States (0.68)

Genre: Research Report (0.82)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (1.00)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.96)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)

Twycross, Jamie, Aickelin, Uwe, Whitbrook, Amanda

Detecting Anomalous Process Behaviour using Second Generation Artificial Immune Systems

arXiv.org Artificial IntelligenceJun-18-2010

Artificial Immune Systems have been successfully applied to a number of problem domains including fault tolerance and data mining, but have been shown to scale poorly when applied to computer intrusion detec- tion despite the fact that the biological immune system is a very effective anomaly detector. This may be because AIS algorithms have previously been based on the adaptive immune system and biologically-naive mod- els. This paper focuses on describing and testing a more complex and biologically-authentic AIS model, inspired by the interactions between the innate and adaptive immune systems. Its performance on a realistic process anomaly detection problem is shown to be better than standard AIS methods (negative-selection), policy-based anomaly detection methods (systrace), and an alternative innate AIS approach (the DCA). In addition, it is shown that runtime information can be used in combination with system call information to enhance detection capability.

algorithm, health & medicine, immunology, (21 more...)

1006.3654

Industry:

Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (1.00)
Health & Medicine > Therapeutic Area > Immunology (1.00)

Technology:

Information Technology > Data Science > Data Mining > Anomaly Detection (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.48)

Greensmith, Julie, Feyereisl, Jan, Aickelin, Uwe

The DCA:SOMe Comparison A comparative study between two biologically-inspired algorithms

arXiv.org Artificial IntelligenceJun-8-2010

The Dendritic Cell Algorithm (DCA) is an immune-inspired algorithm, developed for the purpose of anomaly detection. The algorithm performs multi-sensor data fusion and correlation which results in a 'context aware' detection system. Previous applications of the DCA have included the detection of potentially malicious port scanning activity, where it has produced high rates of true positives and low rates of false positives. In this work we aim to compare the performance of the DCA and of a Self-Organizing Map (SOM) when applied to the detection of SYN port scans, through experimental analysis. A SOM is an ideal candidate for comparison as it shares similarities with the DCA in terms of the data fusion method employed. It is shown that the results of the two systems are comparable, and both produce false positives for the same processes. This shows that the DCA can produce anomaly detection results to the same standard as an established technique.

algorithm, immunology, information fusion, (22 more...)

1006.1518

Country: North America > United States > California (0.14)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine > Therapeutic Area > Immunology (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Evolutionary Systems (1.00)
Information Technology > Data Science > Data Mining > Anomaly Detection (0.96)
(2 more...)