AITopics | Statistical Learning

Statistical Learning

Clustering of discretely observed diffusion processes

De Gregorio, Alessandro, Iacus, Stefano Maria

arXiv.org Machine Learning, Sep-23-2008

In this paper a new dissimilarity measure to identify groups of asset dynamics is proposed. The underlying generating process is assumed to be a diffusion process, the solution of a stochastic differential equation, observed at discrete times. The mesh of observations is not required to shrink to zero. As the distance between two observed paths, the quadratic distance between the corresponding estimated Markov operators is considered. Analysis of both synthetic data and real financial data from NYSE/NASDAQ stocks gives evidence that this distance seems able to capture differences in both the drift and diffusion coefficients, unlike other commonly used metrics.

artificial intelligence, banking & finance, machine learning, (16 more...)

0809.3902

Country:

Europe > Italy (0.28)
North America (0.14)
Europe > Germany (0.14)

Genre: Research Report (0.40)

Industry:

Banking & Finance > Trading (0.54)
Energy > Oil & Gas (0.48)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Finding rare objects and building pure samples: Probabilistic quasar classification from low resolution Gaia spectra

Bailer-Jones, C. A. L., Smith, K. W., Tiede, C., Sordo, R., Vallenari, A.

arXiv.org Machine Learning, Sep-19-2008

We develop and demonstrate a probabilistic method for classifying rare objects in surveys with the particular goal of building very pure samples. It works by modifying the output probabilities from a classifier so as to accommodate our expectation (priors) concerning the relative frequencies of different classes of objects. We demonstrate our method using the Discrete Source Classifier, a supervised classifier currently based on Support Vector Machines, which we are developing in preparation for the Gaia data analysis. DSC classifies objects using their very low resolution optical spectra. We look in detail at the problem of quasar classification, because identification of a pure quasar sample is necessary to define the Gaia astrometric reference frame. By varying a posterior probability threshold in DSC we can trade off sample completeness and contamination. We show, using our simulated data, that it is possible to achieve a pure sample of quasars (upper limit on contamination of 1 in 40,000) with a completeness of 65% at magnitudes of G=18.5, and 50% at G=20.0, even when quasars have a frequency of only 1 in every 2000 objects. The star sample completeness is simultaneously 99% with a contamination of 0.7%. Including parallax and proper motion in the classifier barely changes the results. We further show that not accounting for class priors in the target population leads to serious misclassifications and poor predictions for sample completeness and contamination. (Truncated)
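
The prior adjustment described in this abstract can be illustrated with a minimal sketch. It assumes the standard Bayes reweighting, in which each posterior is multiplied by the ratio of target prior to training prior and the result is renormalized; this is not necessarily the exact procedure used in DSC. The function name `adjust_for_priors` and the example probabilities are hypothetical; only the 1 in 2000 quasar frequency comes from the abstract.

```python
import numpy as np

def adjust_for_priors(posteriors, train_priors, target_priors):
    """Reweight classifier posteriors from training priors to target priors.

    Assumes the classifier output is proportional to likelihood * train_prior,
    so multiplying by target_prior / train_prior and renormalizing gives the
    posterior under the target population.
    """
    posteriors = np.asarray(posteriors, dtype=float)
    weights = (np.asarray(target_priors, dtype=float)
               / np.asarray(train_priors, dtype=float))
    unnormalized = posteriors * weights
    return unnormalized / unnormalized.sum(axis=-1, keepdims=True)

# Hypothetical case: a classifier trained on balanced (star, quasar) data
# reports P(quasar | x) = 0.9, but quasars are 1 in 2000 in the target sky.
adjusted = adjust_for_priors(
    posteriors=[0.1, 0.9],
    train_priors=[0.5, 0.5],
    target_priors=[1999 / 2000, 1 / 2000],
)
print(adjusted)
```

In this sketch a seemingly confident quasar score drops below 1% once the rarity of quasars is taken into account, which is why a high posterior threshold is needed to build a pure sample.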

artificial intelligence, machine learning, quasar, (17 more...)

doi: 10.1111/j.1365-2966.2008.13983.x

0809.3373

Country: Europe (0.46)

Genre: Research Report (0.64)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)

Finding links and initiators: a graph reconstruction problem

Mannila, Heikki, Terzi, Evimaria

arXiv.org Artificial Intelligence, Sep-17-2008

Analyzing 0-1 matrices is one of the main themes in data mining. Techniques such as clustering or mixture modelling, matrix decomposition techniques such as PCA, ICA, and NMF, and Bayesian methods all aim to answer the informal question: "Where does the matrix come from?" These approaches aim to find a probabilistic generative model that describes the observed matrix well. In this paper we consider yet another way of answering the question "Where does a 0-1 matrix M come from?" In our model, the matrix M of size n × m is considered to arise from initiators, a few entries that are initially 1. The initiators propagate their 1's by following the links of a directed influence graph G (represented by an n × n adjacency matrix). We denote the initiator matrix of size n × m by N, and we use G (of size n × n) to refer both to the directed graph between the rows of M and to its adjacency matrix. Then, we believe that the structure of N and G can tell how a matrix M has been created.
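
The forward direction of this model, generating M from N and G, can be sketched as follows. The sketch rests on two assumptions that the abstract does not spell out: an edge G[k, i] = 1 means row k influences row i, and a 1 spreads to every row reachable from an initiator. The function name `propagate` and the toy matrices are hypothetical, and the paper's actual problem is the inverse one of recovering N and G from M.

```python
import numpy as np

def propagate(G, N):
    """Spread the initiators' 1's along the directed graph until nothing changes.

    G: n x n 0-1 adjacency matrix, G[k, i] = 1 meaning row k influences row i
       (the edge direction is an assumption of this sketch).
    N: n x m 0-1 initiator matrix.
    Returns the n x m 0-1 matrix M obtained at the fixed point.
    """
    G = np.asarray(G, dtype=int)
    M = np.asarray(N, dtype=bool).copy()
    while True:
        # Row i gains a 1 in column j if some predecessor k has a 1 there.
        spread = (G.T @ M.astype(int)) > 0
        updated = M | spread
        if np.array_equal(updated, M):
            return updated.astype(int)
        M = updated

# Toy example: a chain 0 -> 1 -> 2 with initiators at (0, 0) and (2, 1).
G = [[0, 1, 0],
     [0, 0, 1],
     [0, 0, 0]]
N = [[1, 0],
     [0, 0],
     [0, 1]]
M = propagate(G, N)
print(M)
```

The loop always terminates because entries only ever change from 0 to 1.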

artificial intelligence, data mining, machine learning, (18 more...)

0809.3027

Country: North America > United States (0.47)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.68)
Information Technology > Data Science > Data Mining (0.66)

Collective Classification in Network Data

AI Magazine, Sep-15-2008

Many real-world applications produce networked data such as the world-wide web (hypertext documents connected via hyperlinks), social networks (for example, people connected by friendship links), communication networks (computers connected via communication links) and biological networks (for example, protein interaction networks). A recent focus in machine learning research has been to extend traditional machine learning classification techniques to classify nodes in such networks. In this article, we provide a brief introduction to this area of research and how it has progressed during the past decade. We introduce four of the most widely used inference algorithms for classifying networked data and empirically compare them on both synthetic and real-world data.

algorithm, classification, node, (14 more...)

Country:

North America > United States > Maryland > Prince George's County > College Park (0.14)
North America > United States > California > San Mateo County > Menlo Park (0.04)
North America > United States > New York (0.04)
(13 more...)

Genre: Research Report > Experimental Study (0.68)

Industry:

Telecommunications > Networks (0.50)
Information Technology > Networks (0.50)

Technology:

Information Technology > Communications > Networks (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)

Normalized Information Distance

Vitanyi, Paul M. B., Balbach, Frank J., Cilibrasi, Rudi L., Li, Ming

arXiv.org Artificial Intelligence, Sep-15-2008

The normalized information distance is a universal distance measure for objects of all kinds. It is based on Kolmogorov complexity and thus uncomputable, but there are ways to utilize it. First, compression algorithms can be used to approximate the Kolmogorov complexity if the objects have a string representation. Second, for names and abstract concepts, page count statistics from the World Wide Web can be used. These practical realizations of the normalized information distance can then be applied to machine learning tasks, especially clustering, to perform feature-free and parameter-free data mining. This chapter discusses the theoretical foundations of the normalized information distance and both practical realizations. It presents numerous examples of successful real-world applications based on these distance measures, ranging from bioinformatics to music clustering to machine translation.
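
The compression-based realization mentioned in this abstract is commonly written as the normalized compression distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length. A minimal sketch follows, assuming zlib as a stand-in compressor (the abstract does not name one); the helper names and sample strings are hypothetical.

```python
import zlib

def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed data, a rough proxy for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

# Hypothetical inputs: two near-identical texts and one unrelated byte pattern.
a = b"the quick brown fox jumps over the lazy dog " * 20
b = b"the quick brown fox jumps over the lazy cat " * 20
c = bytes(range(256)) * 4

print(ncd(a, b), ncd(a, c))
```

Similar inputs compress well together and yield a smaller distance than unrelated ones. Real compressors are imperfect, so the values are approximate and can stray slightly outside the interval [0, 1].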

data mining, machine learning, natural language, (19 more...)

0809.2553

Country:

Europe (0.67)
North America > United States > Washington > King County (0.28)

Genre: Research Report (0.40)

Industry:

Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Government (0.67)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.46)

Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning

Bach, Francis

arXiv.org Machine Learning, Sep-9-2008

For supervised and unsupervised learning, positive definite kernels allow the use of large and potentially infinite-dimensional feature spaces with a computational cost that depends only on the number of observations. This is usually done through the penalization of predictor functions by Euclidean or Hilbertian norms. In this paper, we explore penalizing by sparsity-inducing norms such as the l1-norm or the block l1-norm. We assume that the kernel decomposes into a large sum of individual basis kernels which can be embedded in a directed acyclic graph; we show that it is then possible to perform kernel selection through a hierarchical multiple kernel learning framework, in time polynomial in the number of selected kernels. This framework is naturally applied to nonlinear variable selection; our extensive simulations on synthetic datasets and datasets from the UCI repository show that efficiently exploring the large feature space through sparsity-inducing norms leads to state-of-the-art predictive performance.

artificial intelligence, kernel, machine learning, (18 more...)

0809.1493

Genre: Research Report (0.64)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.46)

Predictive Hypothesis Identification

Hutter, Marcus

arXiv.org Machine Learning, Sep-8-2008

While statistics focusses on hypothesis testing and on estimating (properties of) the true sampling distribution, in machine learning the performance of learning algorithms on future data is the primary issue. In this paper we bridge the gap with a general principle (PHI) that identifies hypotheses with best predictive performance. This includes predictive point and interval estimation, simple and composite hypothesis testing, (mixture) model selection, and others as special cases. For concrete instantiations we will recover well-known methods, variations thereof, and new ones. PHI nicely justifies, reconciles, and blends (a reparametrization invariant variation of) MAP, ML, MDL, and moment estimation. One particular feature of PHI is that it can genuinely deal with nested hypotheses.

estimation, hypothesis, prediction, (15 more...)

0809.1270

Country:

Oceania > Australia > Australian Capital Territory > Canberra (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
Europe > Poland (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.49)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.46)

From Data to the p-Adic or Ultrametric Model

Murtagh, Fionn

arXiv.org Machine Learning, Sep-2-2008

We model anomaly and change in data by embedding the data in an ultrametric space. Taking our initial data as cross-tabulation counts (or other input data formats), Correspondence Analysis allows us to endow the information space with a Euclidean metric. We then model anomaly or change by an induced ultrametric. The induced ultrametric that we are particularly interested in takes a sequential - e.g. temporal - ordering of the data into account. We apply this work to the flow of narrative expressed in the film script of the Casablanca movie; and to the evolution between 1988 and 2004 of the Colombian social conflict and violence.

artificial intelligence, correspondence analysis, machine learning, (19 more...)

doi: 10.1134/S2070046609010063

0809.0492

Country:

Europe > United Kingdom (0.28)
Africa > Middle East > Morocco > Casablanca-Settat Region > Casablanca (0.26)

Genre: Research Report (0.40)

Industry:

Media > Film (1.00)
Leisure & Entertainment (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.31)

The Correspondence Analysis Platform for Uncovering Deep Structure in Data and Information

Murtagh, Fionn

arXiv.org Artificial Intelligence, Sep-2-2008

We study two aspects of information semantics: (i) the collection of all relationships, (ii) tracking and spotting anomaly and change. The first is implemented by endowing all relevant information spaces with a Euclidean metric in a common projected space. The second is modelled by an induced ultrametric. A very general way to achieve a Euclidean embedding of different information spaces based on cross-tabulation counts (and from other input data formats) is provided by Correspondence Analysis. From there, the induced ultrametric that we are particularly interested in takes a sequential - e.g. temporal - ordering of the data into account. We employ such a perspective to look at narrative, "the flow of thought and the flow of language" (Chafe). In application to policy decision making, we show how we can focus analysis in a small number of dimensions.

correspondence analysis, machine learning, natural language, (20 more...)

doi: 10.1093/comjnl/bxn045

0807.0908

Country: Europe > United Kingdom (0.46)

Genre: Research Report (0.50)

Industry:

Media > Film (1.00)
Leisure & Entertainment (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.48)

Uncertainty quantification in complex systems using approximate solvers

Koutsourelakis, Phaedon-Stelios

arXiv.org Machine Learning, Aug-25-2008

This paper proposes a novel uncertainty quantification framework for computationally demanding systems characterized by a large vector of non-Gaussian uncertainties. It combines state-of-the-art techniques in advanced Monte Carlo sampling with Bayesian formulations. The key departure from existing works is the use of inexpensive, approximate computational models in a rigorous manner. Such models can readily be derived by coarsening the discretization size in the solution of the governing PDEs, increasing the time step when integration of ODEs is performed, using fewer iterations if a non-linear solver is employed or making use of lower order models. It is shown that even in cases where the inexact models provide very poor approximations of the exact response, statistics of the latter can be quantified accurately with significant reductions in the computational effort. Multiple approximate models can be used and rigorous confidence bounds of the estimates produced are provided at all stages.

artificial intelligence, bayesian inference, machine learning, (18 more...)

0808.3416

Genre: Research Report (0.70)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.69)
Information Technology > Mathematics of Computing (0.67)
