 Regression


Large-Scale Cell-Level Quality of Service Estimation on 5G Networks Using Machine Learning Techniques

arXiv.org Artificial Intelligence

This study presents a general machine learning framework to estimate the traffic-measurement-level experience rate at given throughput values, in the form of a Key Performance Indicator, for the cells on base stations across various cities, using busy-hour counter data and several technical parameters together with the network topology. Relying on feature engineering techniques, scores of additional predictors are proposed to enhance the effects of raw correlated counter values over the corresponding targets, and to effectively represent the underlying interactions among groups of cells within nearby spatial locations. End-to-end regression modeling is applied to the transformed data, with results presented on unseen cities of varying sizes.


Plant species richness prediction from DESIS hyperspectral data: A comparison study on feature extraction procedures and regression models

arXiv.org Artificial Intelligence

The diversity of terrestrial vascular plants plays a key role in maintaining the stability and productivity of ecosystems. Monitoring species compositional diversity across large spatial scales is challenging and time-consuming. The advanced spectral and spatial specification of the recently launched DESIS (the DLR Earth Sensing Imaging Spectrometer) instrument provides a unique opportunity to test the potential for monitoring plant species diversity with spaceborne hyperspectral data. This study provides a quantitative assessment of the ability of DESIS hyperspectral data to predict plant species richness in two different habitat types in southeast Australia. Spectral features were first extracted from the DESIS spectra, then regressed against on-ground estimates of plant species richness, with a two-fold cross-validation scheme to assess the predictive performance. We tested and compared the effectiveness of Principal Component Analysis (PCA), Canonical Correlation Analysis (CCA), and Partial Least Squares analysis (PLS) for feature extraction, and Kernel Ridge Regression (KRR), Gaussian Process Regression (GPR), and Random Forest Regression (RFR) for species richness prediction. The best prediction results were r=0.76 and RMSE=5.89 for the Southern Tablelands region, and r=0.68 and RMSE=5.95 for the Snowy Mountains region. Relative importance analysis for the DESIS spectral bands showed that the red-edge, red, and blue spectral regions were more important for predicting plant species richness than the green bands and the near-infrared bands beyond red-edge. We also found that the DESIS hyperspectral data performed better than Sentinel-2 multispectral data in the prediction of plant species richness. Our results provide a quantitative reference for future studies exploring the potential of spaceborne hyperspectral data for plant biodiversity mapping.
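The evaluation protocol described above (two-fold cross-validation scored with Pearson's r and RMSE) can be sketched in a few lines. This is only an illustration under stated assumptions: a one-feature least-squares line stands in for the KRR, GPR, and RFR models of the study, the numbers are invented toy values, and all function names are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x (stand-in for the study's regressors)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def rmse(a, b):
    """Root mean square error between predictions and observations."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def two_fold_cv(xs, ys):
    """Train on one half, predict the other, swap, then score the pooled held-out predictions."""
    idx = list(range(len(xs)))
    folds = [idx[0::2], idx[1::2]]  # deterministic split: even vs. odd positions
    preds = [0.0] * len(xs)
    for k in (0, 1):
        train, test = folds[1 - k], folds[k]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in test:
            preds[i] = a + b * xs[i]
    return pearson_r(preds, ys), rmse(preds, ys)

# Invented toy data: one extracted spectral feature vs. on-ground species richness.
feature = [0.1, 0.4, 0.5, 0.9, 1.2, 1.5, 1.7, 2.0]
richness = [12, 15, 14, 20, 22, 27, 26, 31]
r, err = two_fold_cv(feature, richness)
print(f"r={r:.2f} RMSE={err:.2f}")
```

Every prediction scored here comes from a model that never saw that sample during fitting, which is the point of reporting cross-validated r and RMSE rather than training-set fit.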


Sequentially Controlled Text Generation

arXiv.org Artificial Intelligence

While GPT-2 generates sentences that are remarkably human-like, longer documents can ramble and do not follow human-like writing structure. We study the problem of imposing structure on long-range text. We propose a novel controlled text generation task, sequentially controlled text generation, and identify a dataset, NewsDiscourse, as a starting point for this task. We develop a sequentially controlled text generation pipeline with generation and editing. We test different degrees of structural awareness and show that, in general, more structural awareness results in higher control-accuracy, grammaticality, coherency and topicality, approaching human-level writing performance.


One-vs-All Logistic Regression for Image Recognition in Python

#artificialintelligence

This article continues a series of dedicated articles that began some time ago. The series aims to help the reader understand the basic concepts leading to Machine Learning for biomedical data, like the difference between linear and logistic regression, the Cost Function, Regularized Logistic Regression, and Gradient (see the Reference section). Each implementation is written from scratch, and we will not use optimized machine learning packages like Scikit-learn, PyTorch, or TensorFlow. The only requirements are an updated version of Python 3, some fundamental libraries, and the desire to read this post to the end! Regressions (linear, logistic, for single and multiple variables) are statistical models helpful in finding correlations between observed dataset variables and answering whether those correlations are statistically significant.
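To make the one-vs-all idea concrete, here is a minimal from-scratch sketch; it is not the article's own code. It assumes plain batch gradient descent on the (optionally regularized) logistic cost, uses only the standard library, and replaces image pixels with invented 2-D toy points. All function names and hyperparameters are hypothetical.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_binary(X, y, lr=0.1, epochs=2000, lam=0.0):
    """Batch gradient descent for one binary classifier; returns [bias, w1, ..., wn]."""
    m, n = len(X), len(X[0])
    theta = [0.0] * (n + 1)
    for _ in range(epochs):
        grad = [0.0] * (n + 1)
        for xi, yi in zip(X, y):
            h = sigmoid(theta[0] + sum(t * v for t, v in zip(theta[1:], xi)))
            err = h - yi
            grad[0] += err
            for j, v in enumerate(xi):
                grad[j + 1] += err * v
        for j in range(n + 1):
            reg = lam * theta[j] if j > 0 else 0.0  # the bias term is not regularized
            theta[j] -= lr * (grad[j] + reg) / m
    return theta

def train_one_vs_all(X, labels, **kwargs):
    """Fit one binary classifier per class: that class (1) against all others (0)."""
    return {
        c: train_binary(X, [1.0 if l == c else 0.0 for l in labels], **kwargs)
        for c in sorted(set(labels))
    }

def predict(models, x):
    """Return the class whose classifier assigns the highest probability to x."""
    def score(theta):
        return sigmoid(theta[0] + sum(t * v for t, v in zip(theta[1:], x)))
    return max(models, key=lambda c: score(models[c]))

# Invented toy data: three well-separated clusters standing in for image features.
X = [[0.0, 0.0], [0.5, 0.2], [0.2, 0.6],
     [4.0, 0.0], [4.3, 0.4], [3.8, 0.5],
     [0.0, 4.0], [0.4, 4.2], [0.5, 3.7]]
labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
models = train_one_vs_all(X, labels)
print([predict(models, x) for x in X])
```

For real images, each row of `X` would be a flattened pixel vector and a vectorized library such as NumPy would replace the inner loops; the one-classifier-per-class structure stays the same.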


$l_{1-2}$ GLasso: $L_{1-2}$ Regularized Multi-task Graphical Lasso for Joint Estimation of eQTL Mapping and Gene Network

arXiv.org Machine Learning

Developments in sequencing technology have allowed us to obtain more and more genomic data since the publication of the first human genome sequence. Computational techniques can help us to mine meaningful information from raw data and understand how gene expression is regulated in cells. In general, these problems include identifying cancer gene co-expression (co-expression: simultaneous expression of two or more genes) modules, determining SNP-gene relationships through eQTL (expression quantitative trait locus) mapping, and determining gene-gene relationships by estimating gene network structure, etc. (Rockman and Kruglyak, 2006; Gardner and Faith, 2005). Given a dataset containing single nucleotide polymorphisms (SNPs) and mRNA expression, the problem is to understand the SNP-gene and gene-gene relationships.


Machine-Learning Prediction of the Computed Band Gaps of Double Perovskite Materials

arXiv.org Artificial Intelligence

Prediction of the electronic structure of functional materials is essential for the engineering of new devices. Conventional electronic structure prediction methods based on density functional theory (DFT) suffer from not only high computational cost, but also limited accuracy arising from the approximations of the exchange-correlation functional. Surrogate methods based on machine learning have garnered much attention as a viable alternative to bypass these limitations, especially in the prediction of solid-state band gaps, which motivated this research study. Herein, we construct a random forest regression model for band gaps of double perovskite materials, using a dataset of 1306 band gaps computed with the GLLBSC (Gritsenko, van Leeuwen, van Lenthe, and Baerends solid correlation) functional. Among the 20 physical features employed, we find that the bulk modulus, superconductivity temperature, and cation electronegativity exhibit the highest importance scores, consistent with the physics of the underlying electronic structure. Using the top 10 features, a model accuracy of 85.6% with a root mean square error of 0.64 eV is obtained, comparable to previous studies. Our results are significant in the sense that they attest to the potential of machine learning regressions for the rapid screening of promising candidate functional materials.


Augmenting data-driven models for energy systems through feature engineering: A Python framework for feature engineering

arXiv.org Artificial Intelligence

Data-driven modeling is an approach in energy systems modeling that has been gaining popularity. In data-driven modeling, machine learning methods such as linear regression, neural networks, or decision-tree-based methods are applied. While these methods do not require domain knowledge, they are sensitive to data quality. Therefore, improving data quality in a dataset is beneficial for creating machine learning-based models. Data quality can be improved through preprocessing methods. One type of preprocessing is feature engineering, which focuses on evaluating and improving the quality of certain features inside the dataset. Feature engineering includes methods such as feature creation, feature expansion, and feature selection. In this work, a Python framework containing different feature engineering methods is presented. This framework contains different methods for feature creation, expansion, and selection; in addition, methods for transforming or filtering data are implemented. The implementation of the framework is based on the Python library scikit-learn. The framework is demonstrated on a case study from energy demand prediction. A data-driven model is created including selected feature engineering methods. The results show an improvement in prediction accuracy through the engineered features.
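As a rough illustration of two of the steps named above, feature creation and feature selection, the sketch below builds lagged copies of a demand series and keeps only those that correlate with the target. It does not reproduce the presented framework or its scikit-learn-based API: the function names, the correlation threshold, and the periodic toy series are all assumptions made for this example.

```python
import math

def add_lags(series, lags):
    """Feature creation: one row per time step t, holding series[t - lag] for each lag."""
    start = max(lags)
    rows = [[series[t - lag] for lag in lags] for t in range(start, len(series))]
    target = series[start:]
    return rows, target

def pearson(a, b):
    """Pearson correlation; returns 0.0 when either input has zero variance."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / math.sqrt(va * vb)

def select_by_correlation(rows, target, names, threshold=0.5):
    """Feature selection: keep features whose |correlation| with the target meets the threshold."""
    keep = []
    for j, name in enumerate(names):
        column = [row[j] for row in rows]
        if abs(pearson(column, target)) >= threshold:
            keep.append(name)
    return keep

# Invented toy demand series with a period of four time steps.
demand = [10, 20, 30, 20] * 5
lags = [1, 4]
rows, target = add_lags(demand, lags)
selected = select_by_correlation(rows, target, [f"lag_{k}" for k in lags])
print(selected)
```

On this toy series the lag matching the period carries the signal while the one-step lag does not, which is the kind of distinction a selection step is meant to surface before a model is trained.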


Supervised Machine Learning: Classification

#artificialintelligence

This course introduces you to one of the main types of modeling families of supervised Machine Learning: Classification. You will learn how to train predictive models to classify categorical outcomes and how to use error metrics to compare across different models. The hands-on section of this course focuses on using best practices for classification, including train and test splits, and handling data sets with unbalanced classes.

By the end of this course you should be able to:
- Differentiate uses and applications of classification and classification ensembles
- Describe and use logistic regression models
- Describe and use decision tree and tree-ensemble models
- Describe and use other ensemble methods for classification
- Use a variety of error metrics to compare and select the classification model that best suits your data
- Use oversampling and undersampling as techniques to handle unbalanced classes in a data set

Who should take this course? This course targets aspiring data scientists interested in acquiring hands-on experience with Supervised Machine Learning Classification techniques in a business setting.
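The oversampling and undersampling techniques mentioned in the course outline can be sketched without any library. The version below is the simplest random variant, not necessarily what the course teaches; function names and the toy data are hypothetical.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority samples until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for cls, n in counts.items():
        pool = [x for x, label in zip(X, y) if label == cls]
        for _ in range(target - n):
            X_out.append(rng.choice(pool))
            y_out.append(cls)
    return X_out, y_out

def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to match the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    X_out, y_out = [], []
    for cls in counts:
        pool = [x for x, label in zip(X, y) if label == cls]
        for x in rng.sample(pool, target):
            X_out.append(x)
            y_out.append(cls)
    return X_out, y_out

# Invented toy data: eight samples of class 0 and two of class 1.
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2
X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)
print(Counter(y_over), Counter(y_under))
```

Resampling is normally applied to the training split only, after the train and test split, so that the test set keeps the original class balance and duplicated samples cannot leak into evaluation.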


Measuring tail risk at high-frequency: An $L_1$-regularized extreme value regression approach with unit-root predictors

arXiv.org Machine Learning

We study tail risk dynamics in high-frequency financial markets and their connection with trading activity and market uncertainty. We introduce a dynamic extreme value regression model accommodating both stationary and local unit-root predictors to appropriately capture the time-varying behaviour of the distribution of high-frequency extreme losses. To characterize trading activity and market uncertainty, we consider several volatility and liquidity predictors, and propose a two-step adaptive $L_1$-regularized maximum likelihood estimator to select the most appropriate ones. We establish the oracle property of the proposed estimator for selecting both stationary and local unit-root predictors, and show its good finite sample properties in an extensive simulation study. Studying the high-frequency extreme losses of nine large liquid U.S. stocks using 42 liquidity and volatility predictors, we find the severity of extreme losses to be well predicted by low levels of price impact in periods of high volatility of liquidity and volatility.


HiClass: a Python library for local hierarchical classification compatible with scikit-learn

arXiv.org Artificial Intelligence

HiClass is an open-source Python library for local hierarchical classification entirely compatible with scikit-learn. It contains implementations of the most common design patterns for hierarchical machine learning models found in the literature, that is, the local classifiers per node, per parent node and per level. Additionally, the package contains implementations of hierarchical metrics, which are more appropriate for evaluating classification performance on hierarchical data. The documentation includes installation and usage instructions, examples within tutorials and interactive notebooks, and a complete description of the API. HiClass is released under the simplified BSD license, encouraging its use in both academic and commercial environments.
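To show why hierarchical metrics suit hierarchical data, the sketch below computes hierarchical precision, recall, and F1 using the common set-overlap definition, in which each label is expanded to the full path from the root. It is a library-free illustration and does not call or reproduce the HiClass API; the function name and the toy taxonomy are hypothetical, and node names are assumed to be unique across the hierarchy.

```python
def hierarchical_scores(y_true, y_pred):
    """Micro-averaged hierarchical precision, recall, and F1.

    Each label is a path from the root to the assigned node, for example
    ["Animal", "Mammal", "Cat"]. Node names are assumed unique in the hierarchy.
    """
    overlap = pred_total = true_total = 0
    for true_path, pred_path in zip(y_true, y_pred):
        true_set, pred_set = set(true_path), set(pred_path)
        overlap += len(true_set & pred_set)
        pred_total += len(pred_set)
        true_total += len(true_set)
    precision = overlap / pred_total if pred_total else 0.0
    recall = overlap / true_total if true_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Invented toy taxonomy: the first prediction gets the parent nodes right but the leaf wrong.
y_true = [["Animal", "Mammal", "Cat"], ["Animal", "Bird", "Eagle"]]
y_pred = [["Animal", "Mammal", "Dog"], ["Animal", "Bird", "Eagle"]]
precision, recall, f1 = hierarchical_scores(y_true, y_pred)
print(f"hP={precision:.3f} hR={recall:.3f} hF1={f1:.3f}")
```

A flat accuracy score would count the first prediction as entirely wrong, whereas the hierarchical scores give partial credit for the two correct ancestor nodes.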