Supplementary Material: Experimental Design for Linear Functionals in Reproducing Kernel Hilbert Spaces. A: Estimability results

Neural Information Processing Systems

In the following section, we describe proofs of implications or equivalences between certain conditions studied in this work. We use a different formulation of the relative-bias condition due to Proposition 5. Since $C = BM$ and $C = (BX)X$, we can define $L = BX$. Lemma 2. The assumption in Definition 4 implies the assumption in Definition 1 with $\nu = J_k$. Likewise, $q_G = W^{1/2} C V_0^{-1/2} X (X^\top V_0^{-1} X)^{-1} \epsilon_G$ is distributed as $N(0, I_k)$. The final inequality in the statement of the probability bound follows from taking a square root and applying the triangle inequality, which finishes the proof.
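The distributional claim above is an instance of Gaussian whitening. As a hedged sketch (the matrix $A$ below is a generic stand-in for the product $W^{1/2} C V_0^{-1/2} X (X^\top V_0^{-1} X)^{-1}$, and $\Sigma$ for the covariance of $\epsilon_G$; neither is defined in this excerpt):

```latex
% Minimal sketch: any linear map A satisfying A \Sigma A^\top = I_k
% turns a centered Gaussian vector into a standard one.
% Assumption: \epsilon \sim N(0, \Sigma) with \Sigma positive definite.
\[
  q = A\epsilon
  \quad\Longrightarrow\quad
  \operatorname{Cov}(q) = A\,\Sigma\,A^\top = I_k,
  \qquad\text{hence } q \sim N(0, I_k).
\]
```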


A benchmark of categorical encoders for binary classification

Matteucci, Federico, Arzamasov, Vadim, Boehm, Klemens

arXiv.org Artificial Intelligence

Categorical encoders transform categorical features into numerical representations that are indispensable for a wide range of machine learning models. Existing encoder benchmark studies lack generalizability because of their limited choice of (1) encoders, (2) experimental factors, and (3) datasets. Additionally, inconsistencies arise from the adoption of varying aggregation strategies. This paper presents the most comprehensive benchmark of categorical encoders to date: an extensive evaluation of 32 encoder configurations from diverse families, under 36 combinations of experimental factors, on 50 datasets. The study shows the profound influence of dataset selection, experimental factors, and aggregation strategies on the benchmark's conclusions, aspects disregarded in previous encoder benchmarks.
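To make concrete what a categorical encoder does, here is a minimal Python sketch of two common encoder families (one-hot and target/mean encoding) on a toy data frame; the column names are hypothetical, and this is not the benchmark's implementation:

```python
import pandas as pd

# Toy data with a categorical feature and a binary target
# (hypothetical column names, for illustration only).
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green", "blue", "red"],
    "y":     [1,     0,      1,     0,       1,      0],
})

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(df["color"], prefix="color")

# Target (mean) encoding: replace each category with the mean
# of the binary target within that category.
target_means = df.groupby("color")["y"].mean()
target_enc = df["color"].map(target_means)

print(one_hot)
print(target_enc)
```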


Improved active output selection strategy for noisy environments

Prochaska, Adrian, Pillas, Julien, Bäker, Bernard

arXiv.org Machine Learning

The test bench time needed for model-based calibration can be reduced with active learning methods for test design. This paper presents an improved strategy for active output selection, i.e., the task of learning multiple models over the same input dimensions, which suits the needs of calibration tasks. Compared to an existing strategy, we take into account the noise estimate that is inherent to Gaussian processes. The method is validated on three different toy examples; its performance matches or exceeds that of the best existing strategy in each example. In a best-case scenario, the new strategy needs at least 10% fewer measurements than all other active or passive strategies. Future work will evaluate the strategy on a real-world application and implement more sophisticated active-learning strategies for query placement.
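As a minimal sketch of noise-aware output selection (assumptions: scikit-learn's GaussianProcessRegressor with a WhiteKernel whose fitted noise level stands in for the GP-inherent noise estimate, and an illustrative selection rule that scores each output by predictive uncertainty in excess of its noise floor; this is not the authors' exact criterion):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(30, 2))       # shared input space
# Two toy outputs with different noise levels (illustrative).
Y = {
    "out_a": np.sin(3 * X[:, 0]) + 0.05 * rng.standard_normal(30),
    "out_b": X[:, 1] ** 2 + 0.30 * rng.standard_normal(30),
}

X_cand = rng.uniform(-1, 1, size=(200, 2))  # candidate query points

scores = {}
for name, y in Y.items():
    kernel = RBF(length_scale=0.5) + WhiteKernel(noise_level=0.1)
    gp = GaussianProcessRegressor(kernel=kernel).fit(X, y)
    # Fitted noise level acts as the noise estimate inherent to the GP.
    noise = gp.kernel_.k2.noise_level
    _, std = gp.predict(X_cand, return_std=True)
    # Score: predictive variance in excess of the noise floor, so a
    # noisy output is not over-sampled merely for being noisy.
    scores[name] = np.mean(np.maximum(std**2 - noise, 0.0))

next_output = max(scores, key=scores.get)
print("query next on:", next_output, scores)
```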


Risk estimation for high-dimensional lasso regression

Homrighausen, Darren, McDonald, Daniel J.

arXiv.org Machine Learning

In high-dimensional estimation, analysts are faced with more parameters $p$ than available observations $n$, and asymptotic analysis of performance allows the ratio $p/n\rightarrow \infty$. This situation makes regularization both necessary and desirable in order for estimators to possess theoretical guarantees. However, the amount of regularization, often determined by one or more tuning parameters, is integral to achieving good performance. In practice, choosing the tuning parameter is done through resampling methods (e.g. cross-validation), generalized information criteria, or reformulating the optimization problem (e.g. square-root lasso or scaled sparse regression). Each of these techniques comes with varying levels of theoretical guarantee for the low- or high-dimensional regimes. However, there are some notable deficiencies in the literature. The theory, and sometimes practice, of many methods relies on either the knowledge or estimation of the variance parameter, which is difficult to estimate in high dimensions. In this paper, we provide theoretical intuition suggesting that some previously proposed approaches based on information criteria work poorly in high dimensions. We introduce a suite of new risk estimators leveraging the burgeoning literature on high-dimensional variance estimation. Finally, we compare our proposal to many existing methods for choosing the tuning parameters for lasso regression by providing an extensive simulation to examine their finite sample performance. We find that our new estimators perform quite well, often better than the existing approaches across a wide range of simulation conditions and evaluation criteria.
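To illustrate what a plug-in risk estimator for lasso tuning can look like, here is a hedged Python sketch: a Cp-style criterion along the lasso path, using the count of nonzero coefficients as the lasso's degrees of freedom (Zou, Hastie & Tibshirani, 2007) and a crude residual-based variance estimate from a mid-path fit. The paper's proposed estimators are more sophisticated; this is only a stand-in.

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(1)
n, p, s = 100, 200, 5                      # high-dimensional: p > n
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:s] = 2.0
y = X @ beta + rng.standard_normal(n)

# Solution path over a grid of penalty levels (alphas decreasing).
alphas, coefs, _ = lasso_path(X, y, n_alphas=50)

# Crude plug-in variance estimate from a mid-path fit, with a
# degrees-of-freedom correction; a stand-in for the high-dimensional
# variance estimators the paper builds on.
mid = len(alphas) // 2
resid_mid = y - X @ coefs[:, mid]
df_mid = np.count_nonzero(coefs[:, mid])
sigma2_hat = resid_mid @ resid_mid / max(n - df_mid, 1)

# Cp-style risk estimate at each penalty level: training error plus
# a complexity penalty, with df = number of nonzero coefficients.
risks = []
for j in range(len(alphas)):
    resid = y - X @ coefs[:, j]
    df = np.count_nonzero(coefs[:, j])
    risks.append(resid @ resid / n + 2.0 * sigma2_hat * df / n)

best = int(np.argmin(risks))
print("selected alpha:", alphas[best],
      "nonzeros:", np.count_nonzero(coefs[:, best]))
```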