Regression
Multiclass Classification Using TensorFlow
In the previous article, I discussed building a linear regression model using TensorFlow. In this article, I will try to solve a multiclass classification problem using TensorFlow. I have used the MNIST digit recognizer dataset here. Please note that although a Convolutional Neural Network might have worked better, since this is an image recognition problem, I have used a generic neural network because I wanted to showcase solving a classification problem with neural networks. The dataset consists of 784 pixel columns, where each row represents a 28 x 28 image flattened into a row vector, and a label column giving the digit, from 0 to 9, that the image represents.
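As a rough illustration of the computation such a network performs on this data layout, here is a minimal framework-free NumPy sketch (the article itself uses TensorFlow, whose code is not shown in this excerpt). The random stand-in data, the hidden-layer size of 64, and all function names are assumptions made for illustration only; no training step is included.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data with the layout described above: 784 pixel columns per row
# and one integer label (0-9) per row. Random values are used only so the
# sketch runs without the actual dataset.
X = rng.random((32, 784))
y = rng.integers(0, 10, size=32)

def one_hot(labels, num_classes=10):
    """Turn integer labels into one-hot rows."""
    out = np.zeros((labels.size, num_classes))
    out[np.arange(labels.size), labels] = 1.0
    return out

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Hypothetical architecture: one hidden layer of 64 ReLU units, 10 outputs.
W1 = rng.normal(0.0, 0.05, (784, 64))
b1 = np.zeros(64)
W2 = rng.normal(0.0, 0.05, (64, 10))
b2 = np.zeros(10)

def forward(inputs):
    """Forward pass: dense + ReLU, then dense + softmax over the 10 classes."""
    hidden = np.maximum(0.0, inputs @ W1 + b1)
    return softmax(hidden @ W2 + b2)

probs = forward(X)
# Categorical cross-entropy, the usual loss for multiclass classification.
loss = -np.mean(np.sum(one_hot(y) * np.log(probs + 1e-12), axis=1))
pred = probs.argmax(axis=1)
```

Each row of `probs` is a probability distribution over the ten digits, and `pred` picks the most probable digit per image. With untrained weights the predictions are essentially arbitrary.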
Approximate Bayesian Computation Based on Maxima Weighted Isolation Kernel Mapping
This paper addresses the problem of precisely estimating the parameters of a stochastic model corresponding to branching processes. A branching process is a stochastic process consisting of collections of random variables indexed by the natural numbers. Branching processes are often used to describe population models (Jagers, 1989; Athreya and Ney, 2012); for example, models in population genetics showing genetic drift (Burden and Simon, 2016; Chen et al., 2017). In contrast to statistical approaches, branching processes enable the study of the dynamics of cell evolution and, as a consequence, have become a popular approach to cancer cell evolution research (West et al., 2016). However, particularly in the case of cancer cell evolution, as well as in branching processes in general, the ultimate extinction of a population often occurs (Devroye, 1998). For this reason, with an initially uniform distribution of parameters, branching process models tend to yield unevenly distributed data consisting of sparse and dense regions. The stochastic nature of the data is another obstacle in estimating the parameters of a branching process model, especially in the case of cancer cell evolution (Nagornov et al., 2021). Moreover, simulations based on a model of cell mutations, population evolution, and tumor/cancer subpopulations commonly lead to the emergence of many clones and rarely to the appearance of cancer cells.
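The tendency toward extinction mentioned above can be seen in a toy simulation. The following is a minimal sketch of a Galton-Watson branching process, not the paper's cell-evolution model; the offspring probabilities, the number of generations, and the function name are assumptions chosen for illustration.

```python
import random

def simulate_population(offspring_probs, generations, rng, start=1, cap=10_000):
    """Return the population size per generation of a Galton-Watson process.

    offspring_probs[k] is the probability that one individual leaves k offspring.
    The run stops early on extinction or when the population exceeds `cap`.
    """
    sizes = [start]
    for _ in range(generations):
        current = sizes[-1]
        if current == 0 or current > cap:
            break
        # Draw an offspring count for every individual in this generation.
        counts = rng.choices(range(len(offspring_probs)),
                             weights=offspring_probs, k=current)
        sizes.append(sum(counts))
    return sizes

rng = random.Random(42)
# Hypothetical offspring law: P(0)=0.3, P(1)=0.4, P(2)=0.3, mean 1.0 (critical case).
probs = [0.3, 0.4, 0.3]
runs = [simulate_population(probs, 50, rng) for _ in range(200)]
extinct_fraction = sum(run[-1] == 0 for run in runs) / len(runs)
print(f"fraction of runs extinct within 50 generations: {extinct_fraction:.2f}")
```

Most runs die out quickly while a few persist, which gives a sense of how simulated data from such models can end up with dense and sparse regions.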
Python for Finance: Investment Fundamentals & Data Analytics
Learn how to code in Python. Take your career to the next level. Work with Python's conditional statements, functions, sequences, and loops. Work with scientific packages, like NumPy. Understand how to use the data analysis toolkit, Pandas. Plot graphs with Matplotlib. Use Python to solve real-world tasks. Get a job as a data scientist with Python. Acquire solid financial acumen. Carry out in-depth investment analysis. Build investment portfolios. Calculate risk and return of individual securities. Calculate risk and return of investment portfolios. Apply best practices when working with financial data. Use univariate and multivariate regression analysis. Understand the Capital Asset Pricing Model. Compare securities in terms of their Sharpe ratio. Perform Monte Carlo simulations. Learn how to price options by applying the Black Scholes formula. Be comfortable applying for a developer job in a financial institution. You'll need to install Anaconda. Do you want to learn how to use Python in a working environment? Are you a young professional interested in a career in Data Science? Would you like to explore how Python can be applied in the world of Finance and solve portfolio optimization problems?
Meta-Learners for Estimation of Causal Effects: Finite Sample Cross-Fit Performance
In recent years there has been a growing interest in the estimation of causal effects using machine learning algorithms, particularly in the field of economics (Athey, 2018). The newly emerging synthesis of machine learning methods with causal inference has a large potential for a more comprehensive estimation of causal effects (Lechner, 2018). On the one hand, it enables a more flexible estimation of average effects which are of main interest in microeconometrics (Imbens & Wooldridge, 2009). On the other hand, it advances the estimation beyond the average effects and allows for a systematic analysis of effect heterogeneity (Athey & Imbens, 2017). Both of these aspects contribute to a better description of the causal mechanisms and thus to a possibly more efficient treatment allocation (Zhao, Zeng, Rush, & Kosorok, 2012; Kitagawa & Tetenov, 2018; Athey & Wager, 2021; Nie, Brunskill, & Wager, 2021). Hence, applied empirical researchers can greatly benefit from the usage of machine learning methods ranging from evaluation of public policies and business decisions to designing personalized interventions (Andini, Ciani, de Blasio, D'Ignazio, & Salvestrini, 2018; Bansak et al., 2018). Machine learning estimators as such are, however, primarily designed for prediction problems and thus cannot be used directly for causal inference. Therefore, new approaches for the estimation of causal parameters using machine learning emerged (see Athey & Imbens, 2019, for an overview). In particular, the development of the so-called meta-learners has received considerable attention (see e.g.
A Priori Denoising Strategies for Sparse Identification of Nonlinear Dynamical Systems: A Comparative Study
Cortiella, Alexandre, Park, Kwang-Chun, Doostan, Alireza
In recent years, identification of nonlinear dynamical systems from data has become increasingly popular. Sparse regression approaches, such as Sparse Identification of Nonlinear Dynamics (SINDy), fostered the development of novel governing equation identification algorithms assuming the state variables are known a priori and the governing equations lend themselves to sparse, linear expansions in a (nonlinear) basis of the state variables. In the context of the identification of governing equations of nonlinear dynamical systems, one faces the problem of identifiability of model parameters when state measurements are corrupted by noise. Measurement noise affects the stability of the recovery process yielding incorrect sparsity patterns and inaccurate estimation of coefficients of the governing equations. In this work, we investigate and compare the performance of several local and global smoothing techniques to a priori denoise the state measurements and numerically estimate the state time-derivatives to improve the accuracy and robustness of two sparse regression methods to recover governing equations: Sequentially Thresholded Least Squares (STLS) and Weighted Basis Pursuit Denoising (WBPDN) algorithms. We empirically show that, in general, global methods, which use the entire measurement data set, outperform local methods, which employ a neighboring data subset around a local point. We additionally compare Generalized Cross Validation (GCV) and Pareto curve criteria as model selection techniques to automatically estimate near optimal tuning parameters, and conclude that Pareto curves yield better results. The performance of the denoising strategies and sparse regression methods is empirically evaluated through well-known benchmark problems of nonlinear dynamical systems.
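To make the sparse regression step concrete, here is a minimal NumPy sketch of Sequentially Thresholded Least Squares as it is commonly described for SINDy: fit by least squares, zero out small coefficients, and refit on the surviving terms. It is not the authors' implementation. The toy system dx/dt = -2x, the candidate library [1, x, x^2], the threshold of 0.1, and the use of exact noise-free derivatives (so no denoising is involved) are all assumptions for illustration.

```python
import numpy as np

def stls(theta, dxdt, threshold, n_iter=10):
    """Sequentially thresholded least squares.

    theta: (n_samples, n_basis) library of candidate functions of the state.
    dxdt:  (n_samples, n_states) state time-derivatives.
    Returns a sparse coefficient matrix of shape (n_basis, n_states).
    """
    xi = np.linalg.lstsq(theta, dxdt, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(xi) < threshold
        xi[small] = 0.0
        # Refit each state equation using only the surviving basis functions.
        for k in range(dxdt.shape[1]):
            keep = ~small[:, k]
            if keep.any():
                xi[keep, k] = np.linalg.lstsq(theta[:, keep], dxdt[:, k],
                                              rcond=None)[0]
    return xi

# Toy example: x(t) = 3 exp(-2t), so dx/dt = -2x exactly.
t = np.linspace(0.0, 2.0, 200)
x = 3.0 * np.exp(-2.0 * t)
theta = np.column_stack([np.ones_like(t), x, x**2])  # basis: 1, x, x^2
dxdt = (-2.0 * x).reshape(-1, 1)

xi = stls(theta, dxdt, threshold=0.1)
print(xi.ravel())  # coefficients of 1, x, x^2
```

On noise-free data the only surviving coefficient is the one on `x`, close to -2. With noisy measurements the numerically estimated derivatives degrade this recovery, which is the setting the denoising strategies in the abstract target.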
Geometry- and Accuracy-Preserving Random Forest Proximities
Rhodes, Jake S., Cutler, Adele, Moon, Kevin R.
Random forests are considered one of the best out-of-the-box classification and regression algorithms due to their high level of predictive performance with relatively little tuning. Pairwise proximities can be computed from a trained random forest which measure the similarity between data points relative to the supervised task. Random forest proximities have been used in many applications including the identification of variable importance, data imputation, outlier detection, and data visualization. However, existing definitions of random forest proximities do not accurately reflect the data geometry learned by the random forest. In this paper, we introduce a novel definition of random forest proximities called Random Forest-Geometry- and Accuracy-Preserving proximities (RF-GAP). We prove that the proximity-weighted sum (regression) or majority vote (classification) using RF-GAP exactly match the out-of-bag random forest prediction, thus capturing the data geometry learned by the random forest. We empirically show that this improved geometric representation outperforms traditional random forest proximities in tasks such as data imputation and provides outlier detection and visualization results consistent with the learned data geometry. Random forests [1] are well-known, powerful predictors comprised of an ensemble of binary recursive decision trees. Random forests are easily adapted for both classification and regression, are trivially parallelizable, can [...] was first defined by Leo Breiman as the proportion of trees in which the observations reside in the same terminal node [16].
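For reference, the traditional proximity attributed to Breiman above (the proportion of trees in which two observations fall in the same terminal node) can be sketched in a few lines. This is the original definition, not RF-GAP. The input is assumed to be a table of terminal-node indices with one row per observation and one column per tree; the values below are made up for illustration.

```python
def breiman_proximity(leaves):
    """Pairwise proximities from terminal-node indices.

    leaves[i][t] is the index of the terminal node that observation i
    reaches in tree t. The proximity of i and j is the fraction of trees
    in which both observations land in the same terminal node.
    """
    n = len(leaves)
    n_trees = len(leaves[0])
    prox = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            shared = sum(a == b for a, b in zip(leaves[i], leaves[j]))
            prox[i][j] = shared / n_trees
    return prox

# Hypothetical terminal-node indices for 3 observations in a 3-tree forest.
leaves = [
    [1, 4, 2],
    [1, 4, 3],
    [2, 5, 3],
]
prox = breiman_proximity(leaves)
for row in prox:
    print([round(p, 3) for p in row])
```

Observations 0 and 1 share a terminal node in two of the three trees, so their proximity is 2/3, while observations 0 and 2 never share one and get a proximity of 0.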
Re-calibrating Photometric Redshift Probability Distributions Using Feature-space Regression
Dey, Biprateep, Newman, Jeffrey A., Andrews, Brett H., Izbicki, Rafael, Lee, Ann B., Zhao, David, Rau, Markus Michael, Malz, Alex I.
Many astrophysical analyses depend on estimates of redshifts (a proxy for distance) determined from photometric (i.e., imaging) data alone. Inaccurate estimates of photometric redshift uncertainties can result in large systematic errors. However, probability distribution outputs from many photometric redshift methods do not follow the frequentist definition of a Probability Density Function (PDF) for redshift -- i.e., the fraction of times the true redshift falls between two limits $z_{1}$ and $z_{2}$ should be equal to the integral of the PDF between these limits. Previous works have used the global distribution of Probability Integral Transform (PIT) values to re-calibrate PDFs, but offsetting inaccuracies in different regions of feature space can conspire to limit the efficacy of the method. We leverage a recently developed regression technique that characterizes the local PIT distribution at any location in feature space to perform a local re-calibration of photometric redshift PDFs. Though we focus on an example from astrophysics, our method can produce PDFs which are calibrated at all locations in feature space for any use case.
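The PIT diagnostic mentioned above can be illustrated with a toy example: the PIT value of an object is the predicted cumulative distribution evaluated at its true redshift, and for calibrated PDFs these values are uniformly distributed on [0, 1]. The sketch below only computes global PIT values, not the local re-calibration the abstract proposes. Gaussian predicted PDFs, the scatter values, and all names are assumptions made for illustration.

```python
import math
import random

def gaussian_cdf(x, mean, std):
    """CDF of a normal distribution, standing in for a predicted redshift PDF."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def pit_values(true_values, means, stds):
    """PIT = predicted CDF evaluated at the true value, one per object."""
    return [gaussian_cdf(z, m, s) for z, m, s in zip(true_values, means, stds)]

def central_fraction(pit):
    """Fraction of PIT values in [0.25, 0.75]; about 0.5 if PIT is uniform."""
    return sum(0.25 <= p <= 0.75 for p in pit) / len(pit)

rng = random.Random(0)
n = 5000
means = [rng.uniform(0.1, 1.0) for _ in range(n)]
true_z = [rng.gauss(m, 0.05) for m in means]  # true scatter is 0.05

calibrated = pit_values(true_z, means, [0.05] * n)     # correct width
overconfident = pit_values(true_z, means, [0.02] * n)  # width too narrow

print(round(central_fraction(calibrated), 3))
print(round(central_fraction(overconfident), 3))
```

The calibrated case puts about half of the PIT values in the central interval, while the overconfident PDFs push PIT values toward 0 and 1 and leave the center underpopulated.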
Fairness implications of encoding protected categorical attributes
Mougan, Carlos, Alvarez, Jose M., Patro, Gourab K, Ruggieri, Salvatore, Staab, Steffen
Protected attributes are often presented as categorical features that need to be encoded before feeding them into a machine learning algorithm. Encoding these attributes is paramount as they determine the way the algorithm will learn from the data. Categorical feature encoding has a direct impact on the model performance and fairness. In this work, we compare the accuracy and fairness implications of the two most well-known encoders: one-hot encoding and target encoding. We distinguish between two types of induced bias that can arise while using these encodings and can lead to unfair models. The first type, irreducible bias, is due to direct group category discrimination and a second type, reducible bias, is due to large variance in less statistically represented groups. We take a deeper look into how regularization methods for target encoding can improve the induced bias while encoding categorical features. Furthermore, we tackle the problem of intersectional fairness that arises when mixing two protected categorical features leading to higher cardinality. This practice is a powerful feature engineering technique used for boosting model performance. We study its implications on fairness as it can increase both types of induced bias.
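The two encoders compared above can be sketched as follows. The abstract does not specify which regularization methods are studied, so the additive smoothing toward the global mean used here is only one common choice, named as an assumption. The toy data and function names are likewise hypothetical.

```python
from collections import defaultdict

def one_hot_encode(categories):
    """Return one indicator column per category level, plus the level order."""
    levels = sorted(set(categories))
    rows = [[1 if c == level else 0 for level in levels] for c in categories]
    return rows, levels

def target_encode(categories, targets, smoothing=0.0):
    """Map each category to its mean target, shrunk toward the global mean.

    smoothing acts like a number of pseudo-observations at the global mean,
    so rarely observed categories are pulled toward it more strongly.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    return {c: (sums[c] + smoothing * global_mean) / (counts[c] + smoothing)
            for c in counts}

# Toy data: group "b" has only two observations, both with target 1.
categories = ["a"] * 8 + ["b"] * 2
targets = [1, 0, 1, 0, 1, 0, 1, 0, 1, 1]

rows, levels = one_hot_encode(categories)
plain = target_encode(categories, targets)
smoothed = target_encode(categories, targets, smoothing=5.0)
print(levels, rows[0], rows[-1])
print(plain)
print(smoothed)
```

Without smoothing, the small group "b" is encoded by its raw mean of 1.0, which rests on two observations. With smoothing, its encoding moves toward the global mean of 0.6, illustrating how regularization damps the variance associated with less represented groups.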
On the Role of Multi-Objective Optimization to the Transit Network Design Problem
Silva, Vasco D., Finamore, Anna, Henriques, Rui
Ongoing traffic changes, including those triggered by the COVID-19 pandemic, reveal the necessity to adapt our public transport systems to the ever-changing users' needs. This work shows that single and multi objective stances can be synergistically combined to better answer the transit network design problem (TNDP). Single objective formulations are dynamically inferred from the rating of networks in the approximated (multi-objective) Pareto Front, where a regression approach is used to infer the optimal weights of transfer needs, times, distances, coverage, and costs. As a guiding case study, the solution is applied to the multimodal public transport network in the city of Lisbon, Portugal. The system takes individual trip data given by smartcard validations at CARRIS buses and METRO subway stations and uses them to estimate the origin-destination demand in the city. Then, Genetic Algorithms are used, considering both single and multi objective approaches, to redesign the bus network that better fits the observed traffic demand. The proposed TNDP optimization proved to improve results, with reductions in objective functions of up to 28.3%. The system managed to extensively reduce the number of routes, and all passenger related objectives, including travel time and transfers per trip, significantly improve. Grounded on automated fare collection data, the system can incrementally redesign the bus network to dynamically handle ongoing changes to the city traffic.
A Probabilistic Framework for Dynamic Object Recognition in 3D Environment With A Novel Continuous Ground Estimation Method
In this thesis, a probabilistic framework is developed and proposed for Dynamic Object Recognition in 3D Environments. A software package is developed using C++ and Python in ROS that performs the detection and tracking task. Furthermore, a novel Gaussian Process Regression (GPR) based method is developed to detect ground points in regular, sloped, and rough urban scenarios. The ground surface behavior is assumed to only demonstrate local input-dependent smoothness. kernel's length-scales are obtained. Bayesian inference is implemented using the \textit{Maximum a Posteriori} criterion. The log-marginal likelihood function is assumed to be a multi-task objective function, to represent a whole-frame unbiased view of the ground at each frame, because adjacent segments may not have similar ground structure in an uneven scene while having shared hyper-parameter values. Simulation results show the effectiveness of the proposed method in uneven and rough scenes, where it outperforms similar Gaussian process based ground segmentation methods.