data site
Data-intrinsic approximation in metric spaces
Dölz, Jürgen, Multerer, Michael
Analysis and processing of data is a vital part of our modern society and requires vast amounts of computational resources. To reduce the computational burden, compressing and approximating data has become a central topic. We consider the approximation of labeled data samples, mathematically described as site-to-value maps between finite metric spaces. Within this setting, we identify the discrete modulus of continuity as an effective data-intrinsic quantity to measure the regularity of site-to-value maps without imposing further structural assumptions. We investigate the consistency of the discrete modulus of continuity in the infinite data limit and propose an algorithm for its efficient computation. Building on these results, we present a sample-based approximation theory for labeled data. For data subject to statistical uncertainty, we consider multilevel approximation spaces and a variant of the multilevel Monte Carlo method to compute statistical quantities of interest. Our considerations connect approximation theory for labeled data in metric spaces to the covering problem for (random) balls on the one hand, and the efficient evaluation of the discrete modulus of continuity to combinatorial optimization on the other hand. We provide extensive numerical studies to illustrate the feasibility of the approach and to validate our theoretical results.
- North America > United States > New York (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
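The discrete modulus of continuity named in the abstract above can be illustrated with a brute-force sketch. It assumes the common definition, the largest distance between two values whose sites lie within distance t of each other, and it assumes Euclidean metrics on both sites and values. The function name is hypothetical, and this quadratic-cost loop over all pairs is not the efficient algorithm the authors propose.

```python
import numpy as np

def discrete_modulus_of_continuity(sites, values, t):
    """Brute-force max of d(values[i], values[j]) over pairs with d(sites[i], sites[j]) <= t.

    Assumption: both metrics are Euclidean; the abstract allows general metric spaces.
    """
    sites = np.asarray(sites, dtype=float)
    values = np.asarray(values, dtype=float)
    # Treat 1-D inputs as n points in one dimension.
    if sites.ndim == 1:
        sites = sites[:, None]
    if values.ndim == 1:
        values = values[:, None]
    # Pairwise distance matrices between sites and between values.
    d_sites = np.linalg.norm(sites[:, None, :] - sites[None, :, :], axis=-1)
    d_values = np.linalg.norm(values[:, None, :] - values[None, :, :], axis=-1)
    # The diagonal (distance 0) always qualifies for t >= 0, so the max is defined.
    return float(d_values[d_sites <= t].max())
```

For sites 0, 1, 2, 4 with values 0, 1, 4, 16, the pairs within distance 1 are (0, 1) and (1, 2), so the modulus at t = 1 is the larger value gap, 3.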
Optimal Recovery Meets Minimax Estimation
DeVore, Ronald, Nowak, Robert D., Parhi, Rahul, Petrova, Guergana, Siegel, Jonathan W.
A fundamental problem in statistics and machine learning is to estimate a function $f$ from possibly noisy observations of its point samples. The goal is to design a numerical algorithm to construct an approximation $\hat f$ to $f$ in a prescribed norm that asymptotically achieves the best possible error (as a function of the number $m$ of observations and the variance $\sigma^2$ of the noise). This problem has received considerable attention in both nonparametric statistics (noisy observations) and optimal recovery (noiseless observations). Quantitative bounds require assumptions on $f$, known as model class assumptions. Classical results assume that $f$ is in the unit ball of a Besov space. In nonparametric statistics, the best possible performance of an algorithm for finding $\hat f$ is known as the minimax rate and has been studied in this setting under the assumption that the noise is Gaussian. In optimal recovery, the best possible performance of an algorithm is known as the optimal recovery rate and has also been determined in this setting. While one would expect that the minimax rate recovers the optimal recovery rate when the noise level $\sigma$ tends to zero, it turns out that the current results on minimax rates do not carefully determine the dependence on $\sigma$ and the limit cannot be taken. This paper handles this issue and determines the noise-level-aware (NLA) minimax rates for Besov classes when error is measured in an $L_q$-norm with matching upper and lower bounds. The end result is a reconciliation between minimax rates and optimal recovery rates. The NLA minimax rate continuously depends on the noise level and recovers the optimal recovery rate when $\sigma$ tends to zero.
- North America > United States > Texas > Brazos County > College Station (0.04)
- North America > United States > Wisconsin > Dane County > Madison (0.04)
- North America > United States > California > San Diego County > San Diego (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
On Quasi-Localized Dual Pairs in Reproducing Kernel Hilbert Spaces
Harbrecht, Helmut, Kempf, Rüdiger, Multerer, Michael
In scattered data approximation, the span of a finite number of translates of a chosen radial basis function is used as the approximation space, and the basis of translates is used for representing the approximant. However, this natural choice is by no means mandatory, and different choices, such as the Lagrange basis, are possible and might offer additional features. In this article, we discuss different alternatives together with their canonical duals. We study a localized version of the Lagrange basis, localized orthogonal bases, such as the Newton basis, and multiresolution versions thereof, constructed by means of samplets. We argue that the choice of orthogonal bases is particularly useful as they lead to symmetric preconditioners. All bases under consideration are compared numerically to illustrate their feasibility for scattered data approximation. We provide benchmark experiments in two spatial dimensions and consider the reconstruction of an implicit surface as a relevant application from computer graphics.
- North America > United States > New York (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Switzerland > Basel-City > Basel (0.04)
Distributed sequential federated learning
Wang, Z. F., Zhang, X. Y., Chang, Y-c I.
The analysis of data stored at multiple sites has become more popular, raising new concerns about the security of data storage and communication. Federated learning, which does not require centralizing data, is a common approach to avoiding heavy data transportation, securing valued data, and protecting personal information. Therefore, determining how to aggregate the information obtained from the analysis of data at separate local sites has become an important statistical issue. The commonly used averaging methods may not be suitable due to data nonhomogeneity and incomparable results among individual sites, and applying them may result in the loss of information obtained from the individual analyses. Using a sequential method in federated learning with distributed computing can facilitate the integration and accelerate the analysis process. We develop a data-driven method for efficiently and effectively aggregating valued information by analyzing local data without encountering potential issues such as information security and heavy transportation due to data communication. In addition, the proposed method can preserve the properties of classical sequential adaptive design, such as data-driven sample size and estimation precision, when applied to generalized linear models. We use numerical studies of simulated data and an application to COVID-19 data collected from 32 hospitals in Mexico to illustrate the proposed method.
- North America > Mexico (0.24)
- Asia > China (0.04)
- North America > United States > New York (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
High Dimensional Restrictive Federated Model Selection with multi-objective Bayesian Optimization over shifted distributions
Sun, Xudong, Bommert, Andrea, Pfisterer, Florian, Rahnenführer, Jörg, Lang, Michel, Bischl, Bernd
A novel machine learning optimization process, coined Restrictive Federated Model Selection (RFMS), is proposed for scenarios in which, for example, data from healthcare units cannot leave the site where it is stored and it is forbidden to run training algorithms on remote data sites because of technical, privacy, or trust concerns. To carry out clinical research under this scenario, an analyst can train a machine learning model only on the local data site, but it is still possible to execute a statistical query at a certain cost by sending a machine learning model to some of the remote data sites and receiving performance measures as feedback, presumably because prediction is usually much cheaper than training. Compared to federated learning, which optimizes the model parameters directly by carrying out training across all data sites, RFMS trains model parameters on only one local data site but optimizes hyper-parameters jointly across the other data sites, since hyper-parameters play an important role in machine learning performance. The aim is to obtain a Pareto optimal model with respect to both local and remote unseen prediction losses, which could generalize well across data sites. In this work, we specifically consider high dimensional data with shifted distributions over data sites. As an initial investigation, Bayesian Optimization, especially multi-objective Bayesian Optimization, is used to guide an adaptive hyper-parameter optimization process to select models under the RFMS scenario. Empirical results show that using only the local data site to tune hyper-parameters generalizes poorly across data sites, compared to methods that utilize both local and remote performances. Furthermore, in terms of dominated hypervolumes, multi-objective Bayesian Optimization algorithms show increased performance across multiple data sites among other candidates.
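Pareto optimality with respect to local and remote losses, as described above, can be made concrete with a small sketch. It assumes each candidate model is summarized by a hypothetical pair (local loss, remote loss), both to be minimized. It only filters a fixed candidate list and does not reproduce the multi-objective Bayesian Optimization procedure of the paper.

```python
def pareto_front(losses):
    """Return indices of candidates not dominated in (local_loss, remote_loss).

    Candidate j dominates candidate i if j is no worse in both losses
    and strictly better in at least one (both losses are minimized).
    """
    front = []
    for i, (local_i, remote_i) in enumerate(losses):
        dominated = any(
            local_j <= local_i
            and remote_j <= remote_i
            and (local_j < local_i or remote_j < remote_i)
            for j, (local_j, remote_j) in enumerate(losses)
            if j != i
        )
        if not dominated:
            front.append(i)
    return front
```

With the illustrative losses (0.10, 0.40), (0.20, 0.20), (0.30, 0.30), (0.15, 0.35), (0.40, 0.10), only the third candidate is dominated (by the second), so the front consists of indices 0, 1, 3, and 4.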
Distributed multivariable modeling for signature development under data protection constraints
Zöller, Daniela, Lenz, Stefan, Binder, Harald
Data protection constraints frequently require distributed analysis of data, i.e. individual-level data remains at many different sites, but analysis nevertheless has to be performed jointly. The data exchange is often handled manually, requiring explicit permission before transfer, i.e. the number of data calls and the amount of data should be limited. Thus, only simple summary statistics are typically transferred and aggregated with just a single call, but this does not allow for complex statistical techniques, e.g., automatic variable selection for prognostic signature development. We propose a multivariable regression approach for building a prognostic signature by automatic variable selection that is based on aggregated data from different locations in iterative calls. To minimize the amount of transferred data and the number of calls, we also provide a heuristic variant of the approach. To further strengthen data protection, the approach can also be combined with a trusted third party architecture. We evaluate our proposed method in a simulation study comparing our results to the results obtained with the pooled individual data. The proposed method is seen to be able to detect covariates with true effect to a comparable extent as a method based on individual data, although the performance is moderately decreased if the number of sites is large. In a typical scenario, the heuristic decreases the number of data calls from more than 10 to 3. To make our approach widely available for application, we provide an implementation on top of the DataSHIELD framework.
- Europe > Germany > Baden-Württemberg > Freiburg (0.04)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Europe > Slovenia > Drava > Municipality of Benedikt > Benedikt (0.04)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.90)
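The idea of fitting a joint regression from aggregated data only can be illustrated for ordinary least squares, where each site needs to share just two summary matrices. This is a minimal sketch of that standard identity, not the authors' variable-selection method and not the DataSHIELD API; the function names `site_summary` and `pooled_ols` are hypothetical.

```python
import numpy as np

def site_summary(X, y):
    """Computed locally at one site: only X^T X and X^T y leave the site."""
    return X.T @ X, X.T @ y

def pooled_ols(summaries):
    """Combine per-site summaries and solve the pooled normal equations.

    Assumption: the summed Gram matrix is invertible (enough pooled data,
    no perfectly collinear covariates).
    """
    gram = sum(g for g, _ in summaries)
    moment = sum(m for _, m in summaries)
    return np.linalg.solve(gram, moment)
```

Because the Gram matrices and moment vectors add across sites, the result matches the least-squares fit on the pooled individual-level data up to floating-point error, while no individual records are transferred.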
Probabilistic Combination of Classifier and Cluster Ensembles for Non-transductive Learning
Acharya, Ayan, Hruschka, Eduardo R., Ghosh, Joydeep, Sarwar, Badrul, Ruvini, Jean-David
Unsupervised models can provide supplementary soft constraints to help classify new target data under the assumption that similar objects in the target set are more likely to share the same class label. Such models can also help detect possible differences between training and target distributions, which is useful in applications where concept drift may take place. This paper describes a Bayesian framework that takes as input class labels from existing classifiers (designed based on labeled data from the source domain), as well as cluster labels from a cluster ensemble operating solely on the target data to be classified, and yields a consensus labeling of the target data. This framework is particularly useful when the statistics of the target data drift or change from those of the training data. We also show that the proposed framework is privacy-aware and allows performing distributed learning when data/models have sharing restrictions. Experiments show that our framework can yield superior results to those provided by applying classifier ensembles only.
- North America > United States > Texas > Travis County > Austin (0.14)
- South America > Brazil > São Paulo (0.04)
- North America > United States > California > Santa Clara County > San Jose (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Clustering (0.94)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.88)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.66)
A Privacy-Aware Bayesian Approach for Combining Classifier and Cluster Ensembles
Acharya, Ayan, Hruschka, Eduardo R., Ghosh, Joydeep
This paper introduces a privacy-aware Bayesian approach that combines ensembles of classifiers and clusterers to perform semi-supervised and transductive learning. We consider scenarios where instances and their classification/clustering results are distributed across different data sites and have sharing restrictions. As a special case, the privacy-aware computation of the model when instances of the target data are distributed across different data sites is also discussed. Experimental results show that the proposed approach can provide good classification accuracies while adhering to the data/model sharing constraints.
- Asia > Middle East > Jordan (0.05)
- South America > Brazil > São Paulo (0.04)
- North America > United States > Texas (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.85)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.71)