A Survey of Dimension Estimation Methods
Binnie, James A. D., Dłotko, Paweł, Harvey, John, Malinowski, Jakub, Yim, Ka Man
It is a standard assumption that datasets in high dimension have an internal structure, meaning that they in fact lie on, or near, subsets of lower dimension. In many instances it is important to understand the real dimension of the data, and hence the complexity of the dataset at hand. A great variety of dimension estimators have been developed to find the intrinsic dimension of data, but there is little guidance on how to use these estimators reliably. This survey reviews a wide range of dimension estimation methods, categorising them by the geometric information they exploit: tangential estimators, which detect a local affine structure; parametric estimators, which rely on dimension-dependent probability distributions; and estimators which use topological or metric invariants. The paper evaluates the performance of these methods, as well as investigating their varying responses to curvature and noise. Key issues addressed include robustness to hyperparameter selection, sample size requirements, accuracy in high dimensions, precision, and performance on non-linear geometries. In identifying the best hyperparameters for benchmark datasets, overfitting is frequent, indicating that many estimators may not generalise well beyond the datasets on which they have been tested.
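The parametric family described above includes maximum-likelihood estimators built on nearest-neighbour distances. As a minimal sketch (in the style of the Levina–Bickel MLE; the function name, test data, and parameter choices here are ours, not any particular implementation benchmarked in the survey):

```python
import numpy as np

def mle_intrinsic_dimension(X, k=10):
    """Maximum-likelihood intrinsic dimension estimate: for each point,
    invert the mean log-ratio of its k nearest-neighbour distances,
    then average the per-point estimates."""
    dists = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    knn = np.sort(dists, axis=1)[:, 1:k + 1]        # drop self-distance
    log_ratios = np.log(knn[:, -1:] / knn[:, :-1])  # log(T_k / T_j), j < k
    return float(np.mean((k - 1) / log_ratios.sum(axis=1)))

# A 2-dimensional plane embedded linearly in 5 ambient dimensions
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 2)) @ rng.standard_normal((2, 5))
est = mle_intrinsic_dimension(X, k=10)
print(round(est, 2))  # close to 2, the true intrinsic dimension
```

Even this simple estimator exhibits the issues the survey studies: the answer depends on the neighbourhood size k, and bias grows with curvature and noise.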
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Switzerland (0.04)
- Europe > Poland > Masovia Province > Warsaw (0.04)
- (12 more...)
- Overview (1.00)
- Research Report > New Finding (0.67)
- Education (0.45)
- Government (0.45)
- Information Technology > Data Science > Data Mining (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.67)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.67)
Multi-Scale Node Embeddings for Graph Modeling and Generation
Milocco, Riccardo, Jansen, Fabian, Garlaschelli, Diego
Lying at the interface between Network Science and Machine Learning, node embedding algorithms take a graph as input and encode its structure onto output vectors that represent nodes in an abstract geometric space, enabling various vector-based downstream tasks such as network modelling, data compression, link prediction, and community detection. Two apparently unrelated limitations affect these algorithms. On one hand, it is not clear what the basic operation defining vector spaces, i.e. the vector sum, corresponds to in terms of the original nodes in the network. On the other hand, while the same input network can be represented at multiple levels of resolution by coarse-graining the constituent nodes into arbitrary block-nodes, the relationship between node embeddings obtained at different hierarchical levels is not understood. Here, building on recent results in network renormalization theory, we address these two limitations at once and define a multiscale node embedding method that, upon arbitrary coarse-grainings, ensures statistical consistency of the embedding vector of a block-node with the sum of the embedding vectors of its constituent nodes. We illustrate the power of this approach on two economic networks that can be naturally represented at multiple resolution levels: namely, the international trade between (sets of) countries and the input-output flows among (sets of) industries in the Netherlands. We confirm the statistical consistency between networks retrieved from coarse-grained node vectors and networks retrieved from sums of fine-grained node vectors, a result that cannot be achieved by alternative methods. Several key network properties, including a large number of triangles, are successfully replicated already from embeddings of very low dimensionality, allowing for the generation of faithful replicas of the original networks at arbitrary resolution levels.
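The consistency the abstract asks of embeddings mirrors a simple fact about coarse-graining itself: merging nodes into block-nodes sums their link weights, so block strengths are exactly sums of constituent strengths. A toy NumPy sketch of that aggregation step only (the matrix and partition are our illustrative choices, not the paper's method):

```python
import numpy as np

# Toy coarse-graining: merge nodes into block-nodes via a membership
# matrix P, summing link weights.  Block strengths then equal the sums
# of constituent node strengths -- the scalar analogue of the
# vector-sum consistency the multiscale embedding enforces.
A = np.array([[0., 2., 1., 0.],
              [2., 0., 0., 3.],
              [1., 0., 0., 1.],
              [0., 3., 1., 0.]])

P = np.array([[1., 0.],     # nodes 0, 1 -> block 0
              [1., 0.],
              [0., 1.],     # nodes 2, 3 -> block 1
              [0., 1.]])

A_block = P.T @ A @ P       # block-level link weights
s_nodes = A.sum(axis=1)     # node strengths
s_blocks = A_block.sum(axis=1)

print(A_block)
print(s_blocks, P.T @ s_nodes)  # identical: strength is additive
```

The hard part, which the paper addresses, is making the learned embedding vectors inherit this additivity rather than just the raw link weights.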
- North America > United States > New York > New York County > New York City (0.04)
- Europe > Netherlands > South Holland > Leiden (0.04)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- Europe > Italy (0.04)
- Banking & Finance (0.92)
- Health & Medicine > Therapeutic Area (0.46)
- Government > Commerce (0.34)
- Information Technology > Data Science > Data Mining (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Deep reinforcement learning for weakly coupled MDP's with continuous actions
Robledo, Francisco, Ayesta, Urtzi, Avrachenkov, Konstantin
This paper introduces the Lagrange Policy for Continuous Actions (LPCA), a reinforcement learning algorithm specifically designed for weakly coupled MDP problems with continuous action spaces. LPCA addresses the challenge of resource constraints dependent on continuous actions by introducing a Lagrange relaxation of the weakly coupled MDP problem within a neural network framework for Q-value computation. This approach effectively decouples the MDP, enabling efficient policy learning in resource-constrained environments. We present two variations of LPCA: LPCA-DE, which utilizes differential evolution for global optimization, and LPCA-Greedy, a method that incrementally and greedily selects actions based on Q-value gradients. Comparative analysis against other state-of-the-art techniques across various settings highlights LPCA's robustness and efficiency in managing resource allocation while maximizing rewards.
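The Lagrangian decoupling at the heart of this approach can be illustrated on a toy problem where known concave rewards stand in for learned Q-values (the budget, reward shapes, grid search, and bisection scheme below are our illustrative choices, not the paper's algorithm):

```python
import numpy as np

# Each sub-problem i has reward r_i(a) for a continuous action a, and the
# actions share a global budget sum(a_i) <= B.  Penalising resource use
# with a multiplier lam decouples the problem: each sub-problem can be
# solved independently, and lam is tuned until the budget binds.
peaks = np.array([0.8, 0.5, 0.3])           # per-subproblem optimal actions
def reward(i, a):
    return -(a - peaks[i]) ** 2             # concave surrogate for a Q-value

B = 1.0                                      # shared resource budget
grid = np.linspace(0.0, 1.0, 1001)

def best_actions(lam):
    # Decoupled step: each sub-problem maximises reward_i(a) - lam * a alone.
    return np.array([grid[np.argmax(reward(i, grid) - lam * grid)]
                     for i in range(len(peaks))])

# Bisection on the multiplier until the budget constraint is satisfied.
lo, hi = 0.0, 10.0
for _ in range(50):
    lam = 0.5 * (lo + hi)
    if best_actions(lam).sum() > B:
        lo = lam
    else:
        hi = lam
a = best_actions(hi)
print(a, a.sum())  # feasible actions, budget (approximately) binding
```

LPCA replaces the hand-written rewards with neural Q-values and the grid search with differential evolution (LPCA-DE) or gradient-guided greedy selection (LPCA-Greedy), but the decoupling principle is the same.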
- Europe > France > Occitanie > Haute-Garonne > Toulouse (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Massachusetts (0.04)
- (3 more...)
Shining light on data: Geometric data analysis through quantum dynamics
Experimental sciences have come to depend heavily on our ability to organize and interpret high-dimensional datasets. Natural laws, conservation principles, and inter-dependencies among observed variables yield geometric structure, with fewer degrees of freedom, on the dataset. We introduce the frameworks of semiclassical and microlocal analysis to data analysis and develop a novel yet natural uncertainty principle for extracting fine-scale features of this geometric structure in data, crucially dependent on data-driven approximations to the quantum mechanical processes underlying geometric optics. This leads to the first tractable algorithm for the approximation of wave dynamics and geodesics on data manifolds, with rigorous probabilistic convergence rates under the manifold hypothesis. We demonstrate our algorithm on real-world datasets, including an analysis of population mobility data during the COVID-19 pandemic, achieving a four-fold improvement in dimensionality reduction over the existing state of the art and revealing anomalous behavior exhibited by less than 1.2% of the entire dataset. Our work initiates the study of data-driven quantum dynamics for analyzing datasets, and we outline several future directions for research.
- North America > United States > Massachusetts > Suffolk County > Boston (0.04)
- North America > United States > Georgia (0.04)
- North America > United States > Rhode Island > Providence County > Providence (0.04)
- (29 more...)
- Research Report (1.00)
- Instructional Material > Course Syllabus & Notes (0.67)
- Government > Regional Government > North America Government > United States Government (1.00)
- Energy (1.00)
- Education > Educational Setting > Online (0.92)
- (2 more...)
Low-Rank Representations Towards Classification Problem of Complex Networks
Çelik, Murat, Taşdemir, Ali Baran, Özkahya, Lale
Complex networks representing social interactions, brain activity, and molecular structures have been studied widely in order to understand and predict their characteristics as graphs. Models and algorithms for these networks are used in real-life applications such as search engines and recommender systems. In general, such networks are modelled by constructing a low-dimensional Euclidean embedding of the vertices of the network, where the proximity of vertices in the Euclidean space hints at the likelihood of an edge (link). In this work, we study the performance of such low-rank representations of real-life networks on a network classification problem.
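One concrete low-rank representation of this kind is the spectral embedding built from the leading eigenpairs of the adjacency matrix: inner products of node vectors approximate adjacency entries, and the approximation can only improve as the rank grows. A sketch on a random toy graph (our construction, not the paper's datasets):

```python
import numpy as np

# Rank-r node embedding from the leading eigenpairs of a symmetric
# adjacency matrix; the rank-r reconstruction error is non-increasing
# in r (Eckart-Young), and the full rank recovers A exactly.
rng = np.random.default_rng(1)
n = 30
A = (rng.random((n, n)) < 0.2).astype(float)
A = np.triu(A, 1)
A = A + A.T                      # symmetric toy graph, no self-loops

w, V = np.linalg.eigh(A)

def reconstruction_error(r):
    idx = np.argsort(-np.abs(w))[:r]          # r largest |eigenvalues|
    A_hat = (V[:, idx] * w[idx]) @ V[:, idx].T
    return np.linalg.norm(A - A_hat)

print(reconstruction_error(2), reconstruction_error(8))
```

The classification question studied in the paper is, in effect, how much of a network's identity survives in the rows of such a low-rank factor.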
- Asia > Middle East > Republic of Türkiye > Ankara Province > Ankara (0.05)
- North America > United States > Alaska > Anchorage Municipality > Anchorage (0.04)
- Europe > Hungary > Hajdú-Bihar County > Debrecen (0.04)
Intrinsic Dimensionality Estimation within Tight Localities: A Theoretical and Experimental Analysis
Amsaleg, Laurent, Chelly, Oussama, Houle, Michael E., Kawarabayashi, Ken-ichi, Radovanović, Miloš, Treeratanajaru, Weeris
Accurate estimation of Intrinsic Dimensionality (ID) is of crucial importance in many data mining and machine learning tasks, including dimensionality reduction, outlier detection, similarity search and subspace clustering. However, since their convergence generally requires sample sizes (that is, neighborhood sizes) on the order of hundreds of points, existing ID estimation methods may have only limited usefulness for applications in which the data consists of many natural groups of small size. In this paper, we propose a local ID estimation strategy that is stable even for 'tight' localities consisting of as few as 20 sample points. The estimator applies MLE techniques over all available pairwise distances among the members of the sample, based on a recent extreme-value-theoretic model of intrinsic dimensionality, the Local Intrinsic Dimension (LID). Our experimental results show that our proposed estimation technique can achieve notably smaller variance, while maintaining comparable levels of bias, at much smaller sample sizes than state-of-the-art estimators.
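The extreme-value MLE underlying this strategy has a compact closed form, which the sketch below checks on synthetic distances drawn exactly from the model (the sampling scheme is our sanity check; the paper's contribution is applying this over all pairwise distances within tight localities, with further analysis we do not reproduce here):

```python
import numpy as np

def lid_mle(dists):
    """Closed-form MLE of local intrinsic dimension from a sample of
    distances, with w taken as the largest observed distance."""
    dists = np.asarray(dists, dtype=float)
    w = dists.max()
    return -1.0 / np.mean(np.log(dists / w))

# Distances drawn from the LID model F(r) = (r/w)^d with w = 1, d = 7
rng = np.random.default_rng(0)
d_true = 7
r = rng.random(5000) ** (1.0 / d_true)
est = lid_mle(r)
print(round(est, 2))  # close to 7
```

Using all pairwise distances in a 20-point locality yields 190 distance samples rather than 20 neighbour distances, which is the source of the variance reduction the abstract reports.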
- Information Technology > Data Science > Data Mining (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Manifold learning via quantum dynamics
We introduce an algorithm for computing geodesics on sampled manifolds that relies on simulation of quantum dynamics on a graph embedding of the sampled data. Our approach exploits classic results in semiclassical analysis and the quantum-classical correspondence, and forms a basis for techniques to learn the manifold from which a dataset is sampled, and subsequently for nonlinear dimensionality reduction of high-dimensional datasets. We illustrate the new algorithm with data sampled from model manifolds and also by a clustering demonstration based on COVID-19 mobility data. Finally, our method reveals interesting connections between the discretization provided by data sampling and quantization.
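The quantum-classical correspondence the algorithm exploits can be seen in miniature on a cycle graph: a wavepacket evolved under the graph Laplacian (playing the role of the Hamiltonian) travels around the ring at its group velocity. All parameters below are illustrative choices of ours, not the paper's construction:

```python
import numpy as np

# Free quantum dynamics on a cycle graph.  A Gaussian wavepacket with an
# initial momentum phase propagates ballistically, tracing out the
# geodesic (here, the ring itself) as in geometric optics.
N = 100
L = 2 * np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)
L[0, -1] = L[-1, 0] = -1.0                # periodic boundary conditions

x = np.arange(N)
x0, sigma, p = 20, 5.0, np.pi / 2         # centre, width, momentum
psi0 = np.exp(-((x - x0) ** 2) / (4 * sigma ** 2)) * np.exp(1j * p * x)
psi0 /= np.linalg.norm(psi0)

w, V = np.linalg.eigh(L)                  # spectral decomposition of H = L
t = 10.0
psi_t = V @ (np.exp(-1j * w * t) * (V.conj().T @ psi0))

peak0 = int(np.argmax(np.abs(psi0) ** 2))
peak_t = int(np.argmax(np.abs(psi_t) ** 2))
print(peak0, peak_t)                      # the peak has moved along the ring
```

Tracking where such a packet's probability mass travels is the graph-level analogue of following a geodesic on the underlying manifold.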
- North America > United States > Georgia (0.04)
- South America > Venezuela (0.04)
- South America > Peru (0.04)
- (25 more...)
- Health & Medicine (1.00)
- Energy (1.00)
- Government > Regional Government > North America Government > United States Government (0.67)
Dimensionality Reduction for Binary Data through the Projection of Natural Parameters
Landgraf, Andrew J., Lee, Yoonkyung
Principal component analysis (PCA) for binary data, known as logistic PCA, has become a popular alternative for dimensionality reduction of binary data. It is motivated as an extension of ordinary PCA by means of a matrix factorization, akin to the singular value decomposition, that maximizes the Bernoulli log-likelihood. We propose a new formulation of logistic PCA which extends Pearson's formulation of a low-dimensional data representation with minimum error to binary data. Our formulation does not require a matrix factorization, as previous methods do, but instead looks for projections of the natural parameters from the saturated model. Due to this difference, the number of parameters does not grow with the number of observations, and the principal component scores on new data can be computed with simple matrix multiplication. We derive explicit solutions for data matrices of special structure and provide computationally efficient algorithms for solving for the principal component loadings. Through simulation experiments and an analysis of medical diagnosis data, we compare our formulation of logistic PCA to the previous formulation, as well as to ordinary PCA, to demonstrate its benefits.
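The projection idea can be sketched as follows: replace each binary entry by a large signed natural parameter from the saturated model, project onto a low-dimensional subspace, and map back through the sigmoid. In the sketch below, an ordinary SVD supplies the subspace as a crude stand-in for the likelihood-maximising loadings the paper derives; all names and parameter values are ours:

```python
import numpy as np

# Binary data with planted low-rank logit structure
rng = np.random.default_rng(0)
n, p, k, m = 100, 10, 2, 4.0              # m caps the saturated logits
theta_true = rng.standard_normal((n, k)) @ rng.standard_normal((k, p))
X = (rng.random((n, p)) < 1 / (1 + np.exp(-theta_true))).astype(float)

theta_sat = m * (2 * X - 1)               # saturated natural parameters
mu = theta_sat.mean(axis=0)
_, _, Vt = np.linalg.svd(theta_sat - mu, full_matrices=False)
U = Vt[:k].T                              # p x k orthonormal loadings

# Scores for (new) data require only a matrix multiplication --
# the property the abstract highlights.
scores = (theta_sat - mu) @ U
theta_hat = mu + scores @ U.T             # projected natural parameters
probs = 1 / (1 + np.exp(-theta_hat))      # fitted Bernoulli means
print(scores.shape, probs.shape)
```

Because the loadings live in the p-dimensional column space rather than in a factorization involving all n rows, the parameter count stays fixed as observations accumulate.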
- Europe > Austria > Vienna (0.14)
- North America > United States > Ohio (0.04)