Intrinsic Dimension Estimation
Intrinsic Dimension Estimation for Radio Galaxy Zoo using Diffusion Models
Roset, Joan Font-Quer, Mohan, Devina, Scaife, Anna
In this work, we estimate the intrinsic dimension (iD) of the Radio Galaxy Zoo (RGZ) dataset using a score-based diffusion model. We examine how the iD estimates vary as a function of Bayesian neural network (BNN) energy scores, which measure how similar the radio sources are to the MiraBest subset of the RGZ dataset. We find that out-of-distribution sources exhibit higher iD values, and that the overall iD for RGZ exceeds those typically reported for natural image datasets. Furthermore, we analyse how iD varies across Fanaroff-Riley (FR) morphological classes and as a function of the signal-to-noise ratio (SNR). While no clear difference in iD is found between the FR I and FR II classes, a weak trend toward higher SNR at lower iD is observed. Future work using the RGZ dataset could make use of the relationship between iD and energy scores to quantitatively study and improve the representations learned by various self-supervised learning algorithms.
A Novel Approach for Intrinsic Dimension Estimation
Özçoban, Kadir, Manguoğlu, Murat, Yetkin, Emrullah Fatih
Dimensionality reduction approaches are crucial in many machine learning applications such as computer vision, robotics, natural language processing, medical diagnosis, recommendation systems, and industrial IoT settings such as predictive maintenance, all of which generate and process large amounts of data and variables. In general, dimensionality reduction improves the performance of machine learning tasks by removing redundant features. In this regard, both linear and non-linear dimensionality reduction methods, in particular manifold learning techniques, are especially effective since they are based on preserving the geometric structure of the original feature space. Several such approaches are already available and have been studied extensively in the literature, such as principal component analysis (PCA), multidimensional scaling (MDS), Laplacian eigenmaps (LE), and others. We refer the reader to (Lee and Verleysen, 2007) for a comprehensive survey of the available methods.
Intrinsic Dimension Estimation for Robust Detection of AI-Generated Texts
Tulchinskii, Eduard, Kuznetsov, Kristian, Kushnareva, Laida, Cherniavskii, Daniil, Barannikov, Serguei, Piontkovskaya, Irina, Nikolenko, Sergey, Burnaev, Evgeny
The rapidly increasing quality of AI-generated content makes it difficult to distinguish between human-written and AI-generated texts, which may lead to undesirable consequences for society. It therefore becomes increasingly important to study properties of human texts that are invariant over different text domains and varying proficiency of human writers, can be easily calculated for any language, and can robustly separate natural and AI-generated texts regardless of the generation model and sampling method. In this work, we propose such an invariant for human-written texts, namely the intrinsic dimensionality of the manifold underlying the set of embeddings for a given text sample. We show that the average intrinsic dimensionality of fluent texts in a natural language hovers around $9$ for several alphabet-based languages and around $7$ for Chinese, while the average intrinsic dimensionality of AI-generated texts for each language is $\approx 1.5$ lower, with a clear statistical separation between the human-generated and AI-generated distributions. This property allows us to build a score-based artificial text detector. The proposed detector's accuracy is stable over text domains, generator models, and human writer proficiency levels, outperforming SOTA detectors in model-agnostic and cross-domain scenarios by a significant margin.
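The kind of quantity this abstract is built on can be illustrated with a minimal sketch: a maximum-likelihood intrinsic-dimension estimator (Levina-Bickel, with inverse-averaging in the spirit of the MacKay-Ghahramani correction) applied to a point cloud standing in for text embeddings. This is a generic illustration on synthetic data, not the paper's own estimator; the sample sizes and `k` are illustrative choices.

```python
import numpy as np

def mle_id(X, k=10):
    """Levina-Bickel MLE intrinsic dimension of a point cloud X of shape (n, d),
    averaging the inverse per-point estimates before inverting."""
    # pairwise Euclidean distances, sorted row-wise (column 0 is the point itself)
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    D.sort(axis=1)
    Tk = D[:, k:k + 1]              # distance to the k-th nearest neighbour
    logs = np.log(Tk / D[:, 1:k])   # log(T_k / T_j) for j = 1..k-1
    return 1.0 / logs.mean(axis=1).mean()

# sanity check: a 2-D Gaussian cloud embedded in 5-D should give an ID close to 2
rng = np.random.default_rng(0)
X = np.zeros((1000, 5))
X[:, :2] = rng.standard_normal((1000, 2))
print(round(mle_id(X, k=15), 2))
```

In the paper's setting, `X` would be the contextual embeddings of the tokens of one text sample, and the per-sample estimate would serve as the detection score.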
CA-PCA: Manifold Dimension Estimation, Adapted for Curvature
Gilbert, Anna C., O'Neill, Kevin
Much of modern data analysis in high dimensions relies on the premise that data, while embedded in a high-dimensional space, lie on or near a submanifold of lower dimension. This allows one to embed the data in a space of lower dimension while preserving much of the essential structure, with benefits including faster computation and data visualization. This lower dimension, hereafter referred to as the intrinsic dimension (ID) of the underlying manifold, often enters as a parameter of the dimension-reduction scheme. For instance, in each of the Johnson-Lindenstrauss-type results for manifolds by [13] and [4], the target dimension depends on the ID. Furthermore, the ID is a parameter of popular dimension reduction methods such as t-SNE [28] and multidimensional scaling [12, 16]. Therefore, it may be beneficial to estimate the ID before running further analysis, since compressing the data too much may destroy underlying structure, and it may be computationally expensive to re-run algorithms with a new dimension parameter, if such an error is even detectable.
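The estimator that CA-PCA adapts can be sketched as plain local PCA: fit principal components to a neighbourhood of a point and count how many are needed to reach a variance threshold. The sketch below is the uncorrected baseline only (the curvature adaptation itself is not reproduced), and the neighbourhood size and threshold are illustrative choices.

```python
import numpy as np

def local_pca_id(X, center_idx, k=50, var_threshold=0.95):
    """Baseline local-PCA ID estimate: the number of principal components
    needed to explain `var_threshold` of the variance in the k-neighbourhood."""
    dists = np.linalg.norm(X - X[center_idx], axis=1)
    nbrs = X[np.argsort(dists)[:k]]          # k nearest neighbours (incl. centre)
    nbrs = nbrs - nbrs.mean(axis=0)          # centre the local patch
    s = np.linalg.svd(nbrs, compute_uv=False)        # singular values
    var_ratio = s ** 2 / (s ** 2).sum()              # explained-variance shares
    return int(np.searchsorted(np.cumsum(var_ratio), var_threshold) + 1)

# sanity check: a noisy 2-D plane embedded in 6-D should give 2
rng = np.random.default_rng(1)
X = np.zeros((500, 6))
X[:, :2] = rng.standard_normal((500, 2))
X += 1e-3 * rng.standard_normal(X.shape)
print(local_pca_id(X, center_idx=0))
```

On curved manifolds this baseline tends to over-count at large neighbourhood radii, because curvature leaks variance into normal directions; correcting for that effect is exactly the point of CA-PCA.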
Intrinsic Dimension Estimation Using Packing Numbers
We propose a new algorithm to estimate the intrinsic dimension of data sets. The method is based on geometric properties of the data and requires neither parametric assumptions on the data generating model nor input parameters to set. The method is compared to a similar, widely used algorithm from the same family of geometric techniques. Experiments show that our method is more robust in terms of the data generating distribution and more reliable in the presence of noise.
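The packing-number idea can be sketched in a few lines: greedily build r-separated subsets of the data at two scales and read the dimension off the slope of log packing number versus log radius. This is a toy sketch, not the paper's exact algorithm; greedy packing is order-dependent, and the two radii are illustrative choices.

```python
import numpy as np

def packing_number(X, r):
    """Size of a greedy maximal r-separated subset of X (order-dependent)."""
    centers = []
    for x in X:
        if not centers or min(np.linalg.norm(x - c) for c in centers) > r:
            centers.append(x)
    return len(centers)

def packing_id(X, r1, r2):
    """Capacity-dimension estimate from packing numbers at two radii:
    D = -(log M(r2) - log M(r1)) / (log r2 - log r1)."""
    m1, m2 = packing_number(X, r1), packing_number(X, r2)
    return -(np.log(m2) - np.log(m1)) / (np.log(r2) - np.log(r1))

# sanity check: a unit circle embedded in 3-D is one-dimensional
t = np.linspace(0, 2 * np.pi, 2000, endpoint=False)
X = np.stack([np.cos(t), np.sin(t), np.zeros_like(t)], axis=1)
print(round(packing_id(X, 0.1, 0.2), 2))
```

In practice one would average over several radius pairs (and permutations of the data) to smooth out the order-dependence of the greedy packing.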
Intrinsic dimension estimation for discrete metrics
Macocco, Iuri, Glielmo, Aldo, Grilli, Jacopo, Laio, Alessandro
Real-world datasets characterized by discrete features are ubiquitous: from categorical surveys to clinical questionnaires, from unweighted networks to DNA sequences. Nevertheless, the most common unsupervised dimensionality reduction methods are designed for continuous spaces, and their use on discrete spaces can lead to errors and biases. In this letter we introduce an algorithm to infer the intrinsic dimension (ID) of datasets embedded in discrete spaces. We demonstrate its accuracy on benchmark datasets, and we apply it to analyze a metagenomic dataset for species fingerprinting, finding a surprisingly small ID, of order 2. This suggests that evolutionary pressure acts on a low-dimensional manifold despite the high dimensionality of sequence space.
Scikit-dimension: a Python package for intrinsic dimension estimation
Bac, Jonathan, Mirkes, Evgeny M., Gorban, Alexander N., Tyukin, Ivan, Zinovyev, Andrei
Dealing with uncertainty in applications of machine learning to real-life data critically depends on knowledge of the intrinsic dimensionality (ID). A number of methods have been suggested for estimating ID, but no standard package to easily apply them one by one or all at once has been implemented in Python. This technical note introduces \texttt{scikit-dimension}, an open-source Python package for intrinsic dimension estimation. The \texttt{scikit-dimension} package provides a uniform implementation of most of the known ID estimators, based on the scikit-learn application programming interface, to evaluate global and local intrinsic dimension, as well as generators of synthetic toy and benchmark datasets widespread in the literature. The package is developed with tools for assessing code quality, coverage, unit testing, and continuous integration. We briefly describe the package and demonstrate its use in a large-scale (more than 500 datasets) benchmarking of methods for ID estimation on real-life and synthetic data. The source code is available from https://github.com/j-bac/scikit-dimension and the documentation from https://scikit-dimension.readthedocs.io .
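To make concrete what one of the packaged estimators computes, here is the TwoNN estimator of Facco et al. (in its maximum-likelihood form) written in plain NumPy; TwoNN is among the estimators \texttt{scikit-dimension} implements, but this sketch is not the package's API.

```python
import numpy as np

def twonn_id(X):
    """TwoNN intrinsic-dimension estimate: uses the ratio of each point's
    second- to first-nearest-neighbour distance (Facco et al., 2017)."""
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    D.sort(axis=1)            # row-wise ascending; column 0 is the point itself
    mu = D[:, 2] / D[:, 1]    # ratio of 2nd to 1st neighbour distance
    return len(X) / np.log(mu).sum()

# sanity check: a 3-D Gaussian cloud embedded in 8-D should give an ID close to 3
rng = np.random.default_rng(2)
X = np.zeros((1000, 8))
X[:, :3] = rng.standard_normal((1000, 3))
print(round(twonn_id(X), 2))
```

With the package itself, the equivalent estimate should be obtainable through its scikit-learn-style interface, presumably something along the lines of `skdim.id.TwoNN().fit(X)` followed by reading the fitted `dimension_` attribute; consult the package documentation for the exact call.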