Statistical Learning
Complex Question Answering: Unsupervised Learning Approaches and Experiments
Chali, Y., Joty, S. R., Hasan, S. A.
Complex questions that require inferencing and synthesizing information from multiple documents can be seen as a kind of topic-oriented, informative multi-document summarization where the goal is to produce a single text as a compressed version of a set of documents with a minimum loss of relevant information. In this paper, we experiment with one empirical method and two unsupervised statistical machine learning techniques: K-means and Expectation Maximization (EM), for computing relative importance of the sentences. We compare the results of these approaches. Our experiments show that the empirical approach outperforms the other two techniques and EM performs better than K-means. However, the performance of these approaches depends entirely on the feature set used and the weighting of these features. In order to measure the importance and relevance to the user query we extract different kinds of features (i.e. lexical, lexical semantic, cosine similarity, basic element, tree kernel based syntactic and shallow-semantic) for each of the document sentences. We use a local search technique to learn the weights of the features. To the best of our knowledge, no study has used tree kernel functions to encode syntactic/semantic information for more complex tasks such as computing the relatedness between the query sentences and the document sentences in order to generate query-focused summaries (or answers to complex questions). For each of our methods of generating summaries (i.e. empirical, K-means and EM) we show the effects of syntactic and shallow-semantic features over the bag-of-words (BOW) features.
Mining Default Rules from Statistical Data
Kern-Isberner, Gabriele (Technische Universität Dortmund) | Thimm, Matthias (Technische Universität Dortmund) | Finthammer, Marc (FernUniversität in Hagen) | Fisseler, Jens (FernUniversität in Hagen)
In this paper, we are interested in the qualitative knowledge that underlies some given probabilistic information. To represent such qualitative structures, we use ordinal conditional functions (OCFs, also called ranking functions) as a qualitative abstraction of probability functions. The basic idea for transforming probabilities into ordinal rankings is to find well-behaved clusterings of the negative logarithms of the probabilities. We show how popular clustering tools can be used for this, and propose measures for evaluating the clustering results in this context. From the ranking functions obtained in this way, we extract conditionals that may serve as a base for inductive default reasoning.
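The transformation from probabilities to ranks can be illustrated with a minimal sketch. It is not the authors' procedure: the function name `probabilities_to_ranks`, the `gap` threshold, and the simple gap-based grouping of sorted values are assumptions that stand in for whichever clustering tool is actually used.

```python
import math

def probabilities_to_ranks(probs, gap=1.0):
    """Map each world to an integer rank by grouping -log(p) values.

    Assumption: a new group starts whenever two consecutive sorted
    values differ by more than `gap`; this stands in for a real
    clustering tool. Zero-probability worlds are skipped here.
    """
    # The negative logarithm turns a high probability into a small value.
    scores = {world: -math.log(p) for world, p in probs.items() if p > 0}
    ordered = sorted(scores, key=scores.get)

    ranks = {}
    rank = 0
    previous = None
    for world in ordered:
        # Open a new group (the next rank) when the jump is large enough.
        if previous is not None and scores[world] - previous > gap:
            rank += 1
        ranks[world] = rank
        previous = scores[world]
    return ranks

# Hypothetical probability function over four worlds.
probs = {"a": 0.5, "b": 0.4, "c": 0.09, "d": 0.01}
ranks = probabilities_to_ranks(probs)
```

The most probable worlds share rank 0, and each further group of similar negative logarithms receives the next rank, so the ordering of the probabilities is kept while their exact values are abstracted away.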
Organizing Knowledge as an Ontology of the Domain of Resilient Computing by Means of Natural Language Processing - An Experience Report -
Avizienis, Algirdas (Vytautas Magnus University) | Grigonyte, Gintare (Saarland University and Vytautas Magnus University) | Haller, Johann (IAI) | Henke, Friedrich von (Ulm University) | Liebig, Thorsten (Ulm University) | Noppens, Olaf (Ulm University)
Scientists typically need to take a large volume of information into account in order to deal with recurring tasks such as inspecting proceedings, finding related work, or reviewing papers. Our work aims at filling the gap between text documents and structured representations of their content in the domain of resilience computing by combining computer linguistics and ontological methods. The results of our research include: a thesaurus of the domain, automatic clustering of the domain documents, a domain ontology, and a tool for constructing ontologies with the aid of domain thesauri.
VipBoost: A More Accurate Boosting Algorithm
Su, Xiaoyuan (Florida Atlantic University) | Khoshgoftaar, Taghi M | Greiner, Russell
Boosting is a well-known method for improving the accuracy of many learning algorithms. In this paper, we propose a novel boosting algorithm, VipBoost (voting on boosting classifications from imputed learning sets), which first generates multiple incomplete datasets from the original dataset by randomly removing a small percentage of observed attribute values, then uses an imputer to fill in the missing values. It then applies AdaBoost (using some base learner) to each of the imputed learning sets, producing multiple classifiers. The prediction for a new test case is the most frequent classification among these classifiers. Our empirical results show that VipBoost produces very effective classifiers that significantly improve accuracy for unstable base learners and some stable learners, especially when the initial dataset is incomplete.
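The remove, impute, train, and vote steps described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the published algorithm: all function names are hypothetical, column-mean imputation is an assumed imputer, and the AdaBoost step is left as a pluggable `fit` callable, with a nearest-centroid learner used as a stand-in so the example runs with the standard library only.

```python
import random
from collections import Counter

def mask_values(rows, fraction, rng):
    """Return a copy of rows with roughly `fraction` of values set to None."""
    return [[None if rng.random() < fraction else v for v in row] for row in rows]

def impute_column_means(rows):
    """Fill None entries with the mean of the observed values in that column."""
    filled = [list(row) for row in rows]
    for j in range(len(rows[0])):
        observed = [row[j] for row in rows if row[j] is not None]
        mean = sum(observed) / len(observed) if observed else 0.0
        for row in filled:
            if row[j] is None:
                row[j] = mean
    return filled

def vip_ensemble(rows, labels, fit, n_sets=5, fraction=0.05, seed=0):
    """Train one model per imputed copy; `fit` must return a predict function."""
    rng = random.Random(seed)
    return [fit(impute_column_means(mask_values(rows, fraction, rng)), labels)
            for _ in range(n_sets)]

def vote(models, x):
    """Return the most frequent label predicted by the ensemble."""
    return Counter(model(x) for model in models).most_common(1)[0][0]

def fit_nearest_centroid(rows, labels):
    """Stand-in base learner (NOT AdaBoost): pick the closest class mean."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    centroids = {label: [sum(col) / len(col) for col in zip(*members)]
                 for label, members in groups.items()}

    def predict(x):
        return min(centroids,
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))
    return predict

# Hypothetical toy data: two well-separated classes.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
y = ["low", "low", "low", "high", "high", "high"]
models = vip_ensemble(X, y, fit_nearest_centroid)
prediction = vote(models, [4.8, 5.2])
```

To follow the abstract more closely, `fit` would wrap an AdaBoost implementation instead of the nearest-centroid stand-in; the masking fraction and the number of imputed sets shown here are arbitrary defaults.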
Multivariate Time Series Classification with Temporal Abstractions
Batal, Iyad (University of Pittsburgh) | Sacchi, Lucia (University of Pavia) | Bellazzi, Riccardo (University of Pavia) | Hauskrecht, Milos (University of Pittsburgh)
The increase in the number of complex temporal datasets collected today has prompted the development of methods that extend classical machine learning and data mining methods to time-series data. This work focuses on methods for multivariate time-series classification. Time series classification is a challenging problem mostly because the number of temporal features that describe the data and are potentially useful for classification is enormous. We study and develop a temporal abstraction framework for generating multivariate time series features suitable for classification tasks. We propose the STF-Mine algorithm that automatically mines discriminative temporal abstraction patterns from the time series data and uses them to learn a classification model. Our experimental evaluations, carried out on both synthetic and real-world medical data, demonstrate the benefit of our approach in learning accurate classifiers for time-series datasets.
Extracting Meaning from Cell Phone Improvement Ideas
Turner, Jenine (Athenahealth) | Lencevicius, Raimondas (Qwobl) | Adler, Mark (Nokia Research Center)
Numerous companies nowadays gather product improvement ideas. Reviewing all of the resulting thousands of ideas without tools would require a great deal of time and resources. Automatic tools can help these reviewers in a number of ways. The questions we address here are categorization, finding common ideas, and finding idea trends over time. We explore techniques to answer these questions using ...
There are two additional modifications we use to adjust our feature set, that provide improvements over the original feature counts. The first is based upon our assumption that words in the title are more important than words in the other text fields. We simply weight unigrams and bigrams that appear in the title ten times as heavily as those that appear in the rest of the text.
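The title weighting can be sketched in a few lines. The function names, the whitespace tokenization, and the lowercasing are assumptions made for illustration; only the factor of ten for title unigrams and bigrams comes from the text.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def weighted_features(title, body, title_weight=10):
    """Count unigrams and bigrams, weighting title occurrences more heavily."""
    features = Counter()
    for text, weight in ((title, title_weight), (body, 1)):
        # Assumption: lowercase, whitespace-separated tokens.
        tokens = text.lower().split()
        for gram in ngrams(tokens, 1) + ngrams(tokens, 2):
            features[gram] += weight
    return features

# Hypothetical improvement idea with a title and a body field.
feats = weighted_features("longer battery life", "the battery drains too fast")
```

Here "battery" occurs once in the title and once in the body, so its count is 10 + 1, while a bigram that appears only in the body, such as "too fast", keeps a count of 1.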
Hidden Markov Random Fields Based LSI Text Semi-supervised Clustering
Min, Kerui (Fudan University) | Liu, Gang (Fudan University) | Chen, Xin (Nanjing University) | Lu, Shengqi (Fudan University)
Semi-supervised learning is an active research field. Previous results have shown that incorporating background information into the original unsupervised clustering problem can achieve higher accuracy. In this paper, we explore the cooperation between the pairwise constraints given by the user and the semantic information in natural language. In addition, we reduce the time complexity to make the algorithm feasible for large quantities of data. Experiments on corpora of different scales show the robustness and effectiveness of the proposed algorithm, whose F-measure is 20% higher than that of previous algorithms.
Hierarchical Soft Clustering and Automatic Text Summarization for Accessing the Web on Mobile Devices for Visually Impaired People
Dias, Gaël Harry (University of Beira Interior) | Pais, Sebastião (University of Beira Interior) | Cunha, Fernando (University of Beira Interior) | Costa, Hugo (University of Beira Interior) | Machado, David (University of Beira Interior) | Barbosa, Tiago (University of Beira Interior) | Martins, Bruno (University of Beira Interior)
In this paper, we propose a universal solution to web search and web browsing on handheld devices for visually impaired people. For this purpose, we propose (1) to automatically cluster web page results and (2) to summarize all the information in web pages so that speech-to-speech interaction is used efficiently to access information.
A Large Margin Approach to Anaphora Resolution for Neuroscience Knowledge Discovery
A discriminative large margin classifier based approach to anaphora resolution for neuroscience abstracts is presented. The system employs both syntactic and semantic features. A support vector machine based word sense disambiguation method, which combines evidence from three methods that use WordNet and Wikipedia, is also introduced and used for semantic features. The support vector machine anaphora resolution classifier with probabilistic outputs achieved an almost four-fold improvement in accuracy over the baseline method.
Document Clustering and Visualization with Latent Dirichlet Allocation and Self-Organizing Maps
Millar, Jeremy R. (Air Force Institute of Technology) | Peterson, Gilbert L. (Air Force Institute of Technology) | Mendenhall, Michael J. (Air Force Institute of Technology)
Clustering and visualization of large text document collections aid in browsing, navigation, and information retrieval. We present a document clustering and visualization method based on Latent Dirichlet Allocation and self-organizing maps (LDA-SOM). LDA-SOM clusters documents based on topical content and renders clusters in an intuitive two-dimensional format. Document topics are inferred using a probabilistic topic model. Then, due to the topology-preserving properties of self-organizing maps, document clusters with similar topic distributions are placed near one another in the visualization. This provides the user with an intuitive means of browsing from one cluster to another based on topics held in common. The effectiveness of LDA-SOM is evaluated on the 20 Newsgroups and NIPS data sets.