Clustering
Organizing Knowledge as an Ontology of the Domain of Resilient Computing by Means of Natural Language Processing - An Experience Report -
Avizienis, Algirdas (Vytautas Magnus University) | Grigonyte, Gintare (Saarland University and Vytautas Magnus University) | Haller, Johann (IAI) | Henke, Friedrich von (Ulm University) | Liebig, Thorsten (Ulm University) | Noppens, Olaf (Ulm University)
Scientists typically need to take a large volume of information intoย account in order to deal with re-occurring tasks such as inspectingย proceedings, finding related work, or reviewing papers. Our workย aims at filling the gap between text documents and a structuredย representations of their content in the domain of resilienceย computing by combining computer linguistics and ontologicalย methods. The results of our research include: a thesaurus of theย domain, automatic clustering of the domain documents, a domainย ontology, and a tool for constructing ontologies with the aid ofย domain thesauri.
Extracting Meaning from Cell Phone Improvement Ideas
Turner, Jenine (Athenahealth) | Lencevicius, Raimondas (Qwobl) | Adler, Mark (Nokia Research Center)
Numerous companies nowadays gather product improvement There are two additional modifications we use to adjust ideas. Reviewing all of the resulting thousands of our feature set, that provide improvements over the original ideas without tools would require a great deal of time and feature counts. The first is based upon our assumption that resources. Automatic tools can help these reviewers in a words in the title are more important than words in the other number of ways. The questions we address here are categorization, text fields. We simply weight unigrams and bigrams that finding common ideas, and finding idea trends over appear in the title ten times as heavily as those that appear in time. We explore techniques to answer these questions using the rest of the text.
Hidden Markov Random Fields Based LSI Text Semi-supervised Clustering
Min, Kerui (Fudan University) | Liu, Gang (Fudan University) | Chen, Xin (Nanjing University) | Lu, Shengqi (Fudan University)
Semi-supervised learning is an active research field. Previous results shown that unite background information into the original unsupervised clustering problem could archive higher accuracy. In this paper, we explore the cooperation between the pairwise constrains given by the user and the sematic information in natural language. In addition, we reduce the time complexity to make the algorithm feasible for large quantities of data. Experiments on different scales of corpus show the robustness and effectiveness of the proposed algorithm, which the F-measure archives 20% higher than previous algorithms.
Hierarchical Soft Clustering and Automatic Text Summarization for Accessing the Web on Mobile Devices for Visually Impaired People
Dias, Gaรซl Harry (University of Beira Interior) | Pais, Sebastiรฃo (University of Beira Interior) | Cunha, Fernando (University of Beira Interior) | Costa, Hugo (University of Beira Interior) | Machado, David (University of Beira Interior) | Barbosa, Tiago (University of Beira Interior) | Martins, Bruno (University of Beira Interior)
In this paper, we propose a universal solution to web search and web browsing on handheld devices for visually impaired people. For this purpose, we propose (1) to automatically cluster web page results and (2) to summarize all the information in web pages so that speech-to-speech interaction is used efficiently to access information.
Document Clustering and Visualization with Latent Dirichlet Allocation and Self-Organizing Maps
Millar, Jeremy R. (Air Force Institute of Technology) | Peterson, Gilbert L. (Air Force Institute of Technology) | Mendenhall, Michael J. (Air Force Institute of Technology)
Clustering and visualization of large text document collections aids in browsing, navigation, and information retrieval. We present a document clustering and visualization method based on Latent Dirichlet Allocation and self-organizing maps (LDA-SOM). LDA-SOM clusters documents based on topical content and renders clusters in an intuitive two-dimensional format. Document topics are inferred using a probabilistic topic model. Then, due to the topology preserving properties of self-organizing maps, document clusters with similar topic distributions are placed near one another in the visualization. This provides the user an intuitive means of browsing from one cluster to another based on topics held in common. The effectiveness of LDA-SOM is evaluated on the 20 Newsgroups and NIPS data sets.
KiWi: A Scalable Subspace Clustering Algorithm for Gene Expression Analysis
Griffith, Obi L., Gao, Byron J., Bilenky, Mikhail, Prichyna, Yuliya, Ester, Martin, Jones, Steven J. M.
Numerous studies have used coexpression of large expression datasets to infer functional associations between genes [1], to identify groups of related genes that are important in specific cancers or represent common tumour progression mechanisms [2], to study evolutionary change [3], for integration with other large-scale datasets [4][5], [6], and for the generation of high-quality biological interaction networks [7][8][9] [10]. A number of studies have also attempted to use coexpression to identify coregulation with the hypothesis that if two or more genes are expressed at the same time and location and at similar levels then they may be regulated by the same transcription factors and regulatory elements. This approach has shown promise particularly in simpler model organisms such as A. thaliana and S. cerevisiae [11] [12][13] [14] and many groups are currently working on implementing this idea in mammalian systems. However, traditional clustering methods have not worked particularly well on large datasets for this problem. Most methods assign each gene to only one cluster while in reality many genes likely take part in multiple processes. Also, global coexpression is measured across all conditions, whereas, it is probable that most genes are only tightly coregulated under certain conditions or locations. In recent years, a new field of clustering analysis termed subspace clustering (or biclustering) has gained increasing popularity in the analysis of gene expression data and other biological data [15][16][17][18] [19]. In contrast to traditional clustering methods such as hierarchical clustering, subspace clustering methods do not require expression to be correlated across all conditions for genes to be assigned to the same cluster. This has several advantages for data in which biologically relevant subsets exist (e.g.
Unsupervised Methods for Determining Object and Relation Synonyms on the Web
The task of identifying synonymous relations and objects, or synonym resolution, is critical for high-quality information extraction. This paper investigates synonym resolution in the context of unsupervised information extraction, where neither hand-tagged training examples nor domain knowledge is available. The paper presents a scalable, fully-implemented system that runs in O(KN log N) time in the number of extractions, N, and the maximum number of synonyms per word, K. The system, called Resolver , introduces a probabilistic relational model for predicting whether two strings are co-referential based on the similarity of the assertions containing them. On a set of two million assertions extracted from the Web, Resolver resolves objects with 78% precision and 68% recall, and resolves relations with 90% precision and 35% recall. Several variations of resolver's probabilistic model are explored, and experiments demonstrate that under appropriate conditions these variations can improve F1 by 5%. An extension to the basic Resolver system allows it to handle polysemous names with 97% precision and 95% recall on a data set from the TREC corpus.
Learning DTW Global Constraint for Time Series Classification
Niennattrakul, Vit, Ratanamahatana, Chotirat Ann
1-Nearest Neighbor with the Dynamic Time Warping (DTW) distance is one of the most effective classifiers on time series domain. Since the global constraint has been introduced in speech community, many global constraint models have been proposed including Sakoe-Chiba (S-C) band, Itakura Parallelogram, and Ratanamahatana-Keogh (R-K) band. The R-K band is a general global constraint model that can represent any global constraints with arbitrary shape and size effectively. However, we need a good learning algorithm to discover the most suitable set of R-K bands, and the current R-K band learning algorithm still suffers from an 'overfitting' phenomenon. In this paper, we propose two new learning algorithms, i.e., band boundary extraction algorithm and iterative learning algorithm. The band boundary extraction is calculated from the bound of all possible warping paths in each class, and the iterative learning is adjusted from the original R-K band learning. We also use a Silhouette index, a well-known clustering validation technique, as a heuristic function, and the lower bound function, LB_Keogh, to enhance the prediction speed. Twenty datasets, from the Workshop and Challenge on Time Series Classification, held in conjunction of the SIGKDD 2007, are used to evaluate our approach.
A Knowledge Discovery Framework for Learning Task Models from User Interactions in Intelligent Tutoring Systems
Fournier-Viger, P., Nkambou, R., Nguifo, E. Mephu
Domain experts should provide relevant domain knowledge to an Intelligent Tutoring System (ITS) so that it can guide a learner during problemsolving learning activities. However, for many ill-defined domains, the domain knowledge is hard to define explicitly. In previous works, we showed how sequential pattern mining can be used to extract a partial problem space from logged user interactions, and how it can support tutoring services during problem-solving exercises. This article describes an extension of this approach to extract a problem space that is richer and more adapted for supporting tutoring services. We combined sequential pattern mining with (1) dimensional pattern mining (2) time intervals, (3) the automatic clustering of valued actions and (4) closed sequences mining. Some tutoring services have been implemented and an experiment has been conducted in a tutoring system.
Foundations of a Multi-way Spectral Clustering Framework for Hybrid Linear Modeling
Chen, Guangliang, Lerman, Gilad
The problem of Hybrid Linear Modeling (HLM) is to model and segment data using a mixture of affine subspaces. Different strategies have been proposed to solve this problem, however, rigorous analysis justifying their performance is missing. This paper suggests the Theoretical Spectral Curvature Clustering (TSCC) algorithm for solving the HLM problem, and provides careful analysis to justify it. The TSCC algorithm is practically a combination of Govindu's multi-way spectral clustering framework (CVPR 2005) and Ng et al.'s spectral clustering algorithm (NIPS 2001). The main result of this paper states that if the given data is sampled from a mixture of distributions concentrated around affine subspaces, then with high sampling probability the TSCC algorithm segments well the different underlying clusters. The goodness of clustering depends on the within-cluster errors, the between-clusters interaction, and a tuning parameter applied by TSCC. The proof also provides new insights for the analysis of Ng et al. (NIPS 2001).