Boullé, Marc
An Efficient Shapley Value Computation for the Naive Bayes Classifier
Lemaire, Vincent, Clérot, Fabrice, Boullé, Marc
Variable selection and the measurement of input variable importance for a machine learning model have become the focus of much research. It is no longer enough to have a good model; one must also explain its decisions. This is why so many intelligibility algorithms are available today. Among them, Shapley value estimation algorithms are intelligibility methods based on cooperative game theory. In the case of the naive Bayes classifier, and to our knowledge, there is no "analytical" formulation of Shapley values. This article proposes an exact analytic expression of Shapley values in the special case of the naive Bayes classifier. We analytically compare this Shapley proposal to another frequently used indicator, the Weight of Evidence (WoE), and provide an empirical comparison of our proposal with (i) the WoE and (ii) KernelShap results on real-world datasets, discussing similar and dissimilar results. The results show that our Shapley proposal for the naive Bayes classifier provides informative results with low algorithmic complexity, so that it can be used on very large datasets with extremely low computation time.
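The additive structure behind such closed forms is easy to see in the naive Bayes log-odds, which decomposes into one Weight of Evidence term per variable. Below is a minimal Python sketch of this standard WoE decomposition, given the classifier's log-priors and per-variable log-likelihood tables; it illustrates the indicator the paper compares against, not the paper's Shapley formula itself.

```python
import numpy as np

def woe_contributions(log_priors, log_likelihoods, x):
    """Per-variable Weight of Evidence for a binary naive Bayes classifier.

    log_priors: shape (2,), log P(y=c).
    log_likelihoods: one array per variable, shape (2, n_values_i),
        holding log P(x_i = v | y = c).
    x: observed value index for each variable.

    The naive Bayes log-odds decomposes additively:
        log P(y=1|x)/P(y=0|x) = log P(y=1)/P(y=0)
                                + sum_i log P(x_i|y=1)/P(x_i|y=0)
    and the i-th summand is the WoE of variable i.
    """
    prior_odds = log_priors[1] - log_priors[0]
    woe = np.array([ll[1, v] - ll[0, v] for ll, v in zip(log_likelihoods, x)])
    return prior_odds, woe  # prior_odds + woe.sum() is the full log-odds
```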
Two-level histograms for dealing with outliers and heavy tail distributions
Boullé, Marc
Histograms are among the most popular methods used in exploratory analysis to summarize univariate distributions. In particular, irregular histograms are good non-parametric density estimators that require very few parameters: the number of bins with their lengths and frequencies. Many approaches have been proposed in the literature to infer these parameters, either assuming hypotheses about the underlying data distributions or exploiting a model selection approach. In this paper, we focus on the G-Enum histogram method, which exploits the Minimum Description Length (MDL) principle to build histograms without any user parameter and achieves state-of-the-art performance w.r.t. accuracy, parsimony, and computation time. We investigate the limits of this method in the case of outliers or heavy-tailed distributions, and suggest a two-level heuristic to deal with such cases. The first level exploits a logarithmic transformation of the data to split the data set into a list of data subsets with a controlled range of values. The second level builds a sub-histogram for each data subset and aggregates them to obtain a complete histogram. Extensive experiments show the benefits of the approach.
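The first level can be pictured with a toy splitting rule. The sketch below groups values by sign and decimal order of magnitude, so each subset spans a controlled range; the grouping rule is an assumption for illustration, and the paper's actual log-based split may differ.

```python
import numpy as np
from collections import defaultdict

def split_by_magnitude(data):
    """First-level split: group values by sign and decimal order of
    magnitude, so each subset has a controlled range of values.
    Hypothetical rule standing in for the paper's logarithmic split."""
    subsets = defaultdict(list)
    for x in data:
        if x == 0.0:
            key = ("zero", 0)
        else:
            key = ("+" if x > 0 else "-", int(np.floor(np.log10(abs(x)))))
        subsets[key].append(x)
    return subsets
```

A second level would then build one sub-histogram per subset and concatenate them into a complete histogram.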
Fast and fully-automated histograms for large-scale data sets
Mendizábal, Valentina Zelaya, Boullé, Marc, Rossi, Fabrice
G-Enum histograms are a new fast and fully automated method for irregular histogram construction. By framing histogram construction as a density estimation problem and its automation as a model selection task, these histograms leverage the Minimum Description Length (MDL) principle to derive two different model selection criteria. Several proven theoretical results about these criteria give insights into their asymptotic behavior and are used to speed up their optimisation. These insights, combined with a greedy search heuristic, are used to construct histograms in linearithmic time rather than the polynomial time incurred by previous works. The capabilities of the proposed MDL density estimation method are illustrated with reference to other fully automated methods in the literature, on both synthetic and large real-world data sets.
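A greedy bottom-up search of this kind can be sketched as follows. The `cost` callable is a placeholder for the G-Enum MDL criteria, which are not reproduced here, and the naive adjacent-pair scan shown is quadratic; the linearithmic complexity of the paper comes from a priority queue over candidate merges plus the theoretical results that prune the search.

```python
def greedy_histogram(bin_edges, counts, cost):
    """Greedy bottom-up construction of an irregular histogram.

    Starts from fine elementary bins and repeatedly applies the first
    adjacent-bin merge that decreases the model selection cost.
    `cost(edges, counts)` is a placeholder for an MDL criterion.
    """
    edges, cnts = list(bin_edges), list(counts)
    best = cost(edges, cnts)
    improved = True
    while improved and len(cnts) > 1:
        improved = False
        for i in range(len(cnts) - 1):
            e = edges[:i + 1] + edges[i + 2:]  # drop the shared inner edge
            c = cnts[:i] + [cnts[i] + cnts[i + 1]] + cnts[i + 2:]
            cand = cost(e, c)
            if cand < best:
                best, edges, cnts = cand, e, c
                improved = True
                break
    return edges, cnts
```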
Co-clustering based exploratory analysis of mixed-type data tables
Bouchareb, Aichetou, Boullé, Marc, Clérot, Fabrice, Rossi, Fabrice
Co-clustering is a class of unsupervised data analysis techniques that extract the underlying dependency structure between the instances and variables of a data table in the form of homogeneous blocks. Most of these techniques are limited to variables of the same type. In this paper, we propose a mixed-data co-clustering method based on a two-step methodology. In the first step, all variables are binarized into a number of parts chosen by the analyst, using equal-frequency discretization in the numerical case and keeping the most frequent values in the categorical case. The second step applies a co-clustering to the instances and the binary variables, leading to groups of instances and groups of variable parts. We apply this methodology to several data sets and compare the results with those of a Multiple Correspondence Analysis applied to the same data.
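The first step is simple enough to sketch in a few lines of pandas. The column lists, the number of parts, and the pooling of rare categorical values into an "other" part are illustrative assumptions, not details taken from the paper.

```python
import pandas as pd

def binarize(df, n_parts=4, numeric_cols=(), categorical_cols=()):
    """Step 1: turn every variable into binary 'variable part' columns.

    Numerical variables: equal-frequency discretization into n_parts
    intervals. Categorical variables: keep the n_parts most frequent
    values (remaining values pooled into 'other' -- an assumption).
    """
    parts = {}
    for col in numeric_cols:
        parts[col] = pd.qcut(df[col], q=n_parts, duplicates="drop")
    for col in categorical_cols:
        top = df[col].value_counts().index[:n_parts]
        parts[col] = df[col].where(df[col].isin(top), other="other")
    return pd.get_dummies(pd.DataFrame(parts))  # one 0/1 column per part
```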
Model Based Co-clustering of Mixed Numerical and Binary Data
Bouchareb, Aichetou, Boullé, Marc, Clérot, Fabrice, Rossi, Fabrice
The goal of co-clustering is to jointly perform a clustering of the rows and a clustering of the columns of a data table. Proposed by [Good, 1965] and later by [Hartigan, 1975], co-clustering is an extension of standard clustering that extracts the underlying structure in the data in the form of clusters of rows and clusters of columns. The advantage of this technique over standard clustering lies in the joint (simultaneous) analysis of the rows and columns, which enables extracting the maximum amount of information about the interdependence between the two entities. The utility of co-clustering lies in its capacity to create easily interpretable clusters and its capability to reduce a large data table into a significantly smaller matrix having the same structure as the original.
A Bayesian Model for Co-clustering Mixed Data
Bouchareb, Aichetou, Boullé, Marc, Rossi, Fabrice, Clérot, Fabrice
We propose a MAP Bayesian approach to perform and evaluate a co-clustering of mixed-type data tables. The proposed model infers an optimal segmentation of all variables, then performs a co-clustering by minimizing a Bayesian model selection cost function. One advantage of this approach is that it is free of user parameters. Another main advantage is the proposed criterion, which gives an exact measure of model quality, namely the probability of the model given the data. Continued optimization of this criterion finds better and better models while avoiding over-fitting the data. Experiments conducted on real data show the value of this co-clustering approach for exploratory analysis of large data sets.
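Generically, a MAP criterion of this kind scores a model by its negative log posterior, so that an exact criterion value directly reflects model probability. A minimal sketch, with placeholder callables standing in for the paper's closed-form prior and likelihood terms:

```python
def map_cost(log_prior, log_likelihood, model, data):
    """Negative log posterior -log P(M) - log P(D | M), up to the
    model-independent constant -log P(D): lower cost means a more
    probable co-clustering model."""
    return -(log_prior(model) + log_likelihood(data, model))
```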
Discovering Patterns in Time-Varying Graphs: A Triclustering Approach
Guigourès, Romain, Boullé, Marc, Rossi, Fabrice
This paper introduces a novel technique to track structures in time-varying graphs. The method uses a maximum a posteriori approach to fit a three-dimensional co-clustering of the source vertices, the destination vertices, and time to the data under study, in a way that does not require any hyper-parameter tuning. The three dimensions are simultaneously segmented in order to build clusters of source vertices, clusters of destination vertices, and time segments where the edge distributions across clusters of vertices follow the same evolution over the time segments. The main novelty of this approach lies in the fact that the time segments are directly inferred from the evolution of the edge distribution between the vertices, so the user is not required to provide any a priori quantization of time. Experiments conducted on artificial data illustrate the good behavior of the technique, and a study of a real-life data set shows the potential of the proposed approach for exploratory data analysis.
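The summary such a tricluster model evaluates is a three-dimensional contingency tensor of edge counts. A sketch, assuming the three partitions are given as callables (in the paper they are inferred by the MAP optimization rather than supplied by the user):

```python
import numpy as np

def block_edge_counts(edges, src_cluster, dst_cluster, time_segment):
    """Aggregate temporal edges (source, destination, timestamp) into a
    tensor indexed by (source cluster, destination cluster, time
    segment). The three mapping callables are a hypothetical interface."""
    n_s = 1 + max(src_cluster(s) for s, _, _ in edges)
    n_d = 1 + max(dst_cluster(d) for _, d, _ in edges)
    n_t = 1 + max(time_segment(t) for _, _, t in edges)
    counts = np.zeros((n_s, n_d, n_t), dtype=int)
    for s, d, t in edges:
        counts[src_cluster(s), dst_cluster(d), time_segment(t)] += 1
    return counts
```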
Co-Clustering Network-Constrained Trajectory Data
Mahrsi, Mohamed Khalil El, Guigourès, Romain, Rossi, Fabrice, Boullé, Marc
In recent years, clustering moving-object trajectories has been gaining interest from both the data mining and machine learning communities. This problem, however, has been studied mainly in the setting where objects move freely in Euclidean space. In this paper, we study the problem of clustering trajectories of vehicles whose movement is restricted by the underlying road network. We model the relations between these trajectories and road segments as a bipartite graph and cluster its vertices. We demonstrate our approach on synthetic data and show how it can be useful for inferring knowledge about flow dynamics and the behavior of drivers using the road network.
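The bipartite modeling step can be sketched directly, assuming a trajectory is represented as the sequence of road segment identifiers it traverses (a hypothetical input format):

```python
from collections import Counter

def trajectory_segment_graph(trajectories):
    """Weighted bipartite graph between trajectories and road segments.

    trajectories: dict mapping a trajectory id to the sequence of road
    segment ids it traverses. Edge weights count segment traversals.
    """
    edges = Counter()
    for traj_id, segments in trajectories.items():
        for seg_id in segments:
            edges[(traj_id, seg_id)] += 1
    return edges  # keys are (trajectory vertex, road segment vertex)
```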
A Study of the Spatio-Temporal Correlations in Mobile Calls Networks
Guigourès, Romain, Boullé, Marc, Rossi, Fabrice
Over the last few years, the amount of data collected by companies has increased significantly, and data analysis methods have to evolve to meet new demands. In this article, we introduce a practical analysis of a large database from a telecommunications operator. The problem is to segment a territory and characterize the retrieved areas according to the mobile telephony behavior of their inhabitants. Our data consist of call detail records collected over five months in France. We propose a two-stage analysis. The first stage groups source antennas whose originating calls are similarly distributed over target antennas, and conversely for target antennas with respect to source antennas. A geographic projection of the data is used to display the results on a map of France. The second stage discretizes time into periods between which we observe changes in the distributions of calls emerging from the clusters of source antennas. This enables an analysis of temporal changes in inhabitant behavior in every area of the country.
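The quantity the first stage compares is, for each source antenna, the empirical distribution of its calls over target antennas. A minimal sketch, assuming the call detail records are reduced to (source, target) antenna index pairs:

```python
import numpy as np

def source_call_distributions(cdrs, n_antennas):
    """Row-normalized source-to-target call count matrix: row i is the
    empirical distribution of calls from antenna i over target antennas.
    cdrs: iterable of (source_antenna, target_antenna) index pairs."""
    counts = np.zeros((n_antennas, n_antennas))
    for src, dst in cdrs:
        counts[src, dst] += 1
    totals = counts.sum(axis=1, keepdims=True)
    return counts / np.where(totals == 0, 1, totals)
```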
Universal Approximation of Edge Density in Large Graphs
Boullé, Marc
In this paper, we present a novel way to summarize the structure of large graphs, based on non-parametric estimation of edge density in directed multigraphs. Following a co-clustering approach, we use a clustering of the vertices with a piecewise-constant estimation of the density of the edges across the clusters, and address the problem of automatically and reliably inferring the number of clusters, that is, the granularity of the co-clustering. We use a model selection technique with a data-dependent prior and obtain an exact evaluation criterion for the posterior probability of edge density estimation models. We demonstrate, both theoretically and empirically, that our data-dependent modeling technique is consistent, resilient to noise, valid non-asymptotically, and asymptotically behaves as a universal approximator of the true edge density in directed multigraphs. We evaluate our method using artificial graphs and demonstrate its practical value on real-world graphs. The method is both robust and scalable. It is able to extract insightful patterns in the unsupervised learning setting and to provide state-of-the-art accuracy when used as a preparation step for supervised learning.
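Given a vertex clustering, the piecewise-constant estimate itself is straightforward: the density of block (k, l) is the number of edges from cluster k to cluster l per source-target vertex pair in the block. A sketch under the assumption that the clustering is already known (inferring it, and its granularity, is the paper's contribution):

```python
import numpy as np

def block_densities(edges, cluster, n_clusters):
    """Average number of edges per (source, target) vertex pair in each
    block of a directed multigraph, given cluster: vertex -> index."""
    sizes = np.zeros(n_clusters)
    for k in cluster.values():
        sizes[k] += 1
    counts = np.zeros((n_clusters, n_clusters))
    for s, d in edges:  # repeated (s, d) pairs count as multi-edges
        counts[cluster[s], cluster[d]] += 1
    pairs = np.outer(sizes, sizes)  # ordered pairs, self-pairs included
    return counts / np.where(pairs == 0, 1, pairs)
```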