Goto

Collaborating Authors

A Leisurely Look at Versions and Variants of the Cross Validation Estimator

arXiv.org Machine Learning

Many versions of cross-validation (CV) exist in the literature; and each version though has different variants. All are used interchangeably by many practitioners; yet, without explanation to the connection or difference among them. This article has three contributions. First, it starts by mathematical formalization of these different versions and variants that estimate the error rate and the Area Under the ROC Curve (AUC) of a classification rule, to show the connection and difference among them. Second, we prove some of their properties and prove that many variants are either redundant or "not smooth". Hence, we suggest to abandon all redundant versions and variants and only keep the leave-one-out, the $K$-fold, and the repeated $K$-fold. We show that the latter is the only among the three versions that is "smooth" and hence looks mathematically like estimating the mean performance of the classification rules. However, empirically, for the known phenomenon of "weak correlation", which we explain mathematically and experimentally, it estimates both conditional and mean performance almost with the same accuracy. Third, we conclude the article with suggesting two research points that may answer the remaining question of whether we can come up with a finalist among the three estimators: (1) a comparative study, that is much more comprehensive than those available in literature and conclude no overall winner, is needed to consider a wide range of distributions, datasets, and classifiers including complex ones obtained via the recent deep learning approach. (2) we sketch the path of deriving a rigorous method for estimating the variance of the only "smooth" version, repeated $K$-fold CV, rather than those ad-hoc methods available in the literature that ignore the covariance structure among the folds of CV.


A review of machine learning applications in wildfire science and management

arXiv.org Machine Learning

Artificial intelligence has been applied in wildfire science and management since the 1990s, with early applications including neural networks and expert systems. Since then the field has rapidly progressed congruently with the wide adoption of machine learning (ML) in the environmental sciences. Here, we present a scoping review of ML in wildfire science and management. Our objective is to improve awareness of ML among wildfire scientists and managers, as well as illustrate the challenging range of problems in wildfire science available to data scientists. We first present an overview of popular ML approaches used in wildfire science to date, and then review their use in wildfire science within six problem domains: 1) fuels characterization, fire detection, and mapping; 2) fire weather and climate change; 3) fire occurrence, susceptibility, and risk; 4) fire behavior prediction; 5) fire effects; and 6) fire management. We also discuss the advantages and limitations of various ML approaches and identify opportunities for future advances in wildfire science and management within a data science context. We identified 298 relevant publications, where the most frequently used ML methods included random forests, MaxEnt, artificial neural networks, decision trees, support vector machines, and genetic algorithms. There exists opportunities to apply more current ML methods (e.g., deep learning and agent based learning) in wildfire science. However, despite the ability of ML models to learn on their own, expertise in wildfire science is necessary to ensure realistic modelling of fire processes across multiple scales, while the complexity of some ML methods requires sophisticated knowledge for their application. Finally, we stress that the wildfire research and management community plays an active role in providing relevant, high quality data for use by practitioners of ML methods.


Cross validation in sparse linear regression with piecewise continuous nonconvex penalties and its acceleration

arXiv.org Machine Learning

We investigate the signal reconstruction performance of sparse linear regression in the presence of noise when piecewise continuous nonconvex penalties are used. Among such penalties, we focus on the smoothly clipped absolute deviation (SCAD) penalty. The contributions of this study are three-fold: We first present a theoretical analysis of a typical reconstruction performance, using the replica method, under the assumption that each component of the design matrix is given as an independent and identically distributed (i.i.d.) Gaussian variable. This clarifies the superiority of the SCAD estimator compared with $\ell_1$ in a wide parameter range, although the nonconvex nature of the penalty tends to lead to solution multiplicity in certain regions. This multiplicity is shown to be connected to replica symmetry breaking in the spin-glass theory, and associated phase diagrams are given. We also show that the global minimum of the mean square error between the estimator and the true signal is located in the replica symmetric phase. Second, we develop an approximate formula efficiently computing the cross-validation error without actually conducting the cross-validation, which is also applicable to the non-i.i.d. design matrices. It is shown that this formula is only applicable to the unique solution region and tends to be unstable in the multiple solution region. We implement instability detection procedures, which allows the approximate formula to stand alone and resultantly enables us to draw phase diagrams for any specific dataset. Third, we propose an annealing procedure, called nonconvexity annealing, to obtain the solution path efficiently. Numerical simulations are conducted on simulated datasets to examine these results to verify the consistency of the theoretical results and the efficiency of the approximate formula and nonconvexity annealing.


Automatic Language Identification in Texts: A Survey

Journal of Artificial Intelligence Research

Language identification (“LI”) is the problem of determining the natural language that a document or part thereof is written in. Automatic LI has been extensively researched for over fifty years. Today, LI is a key part of many text processing pipelines, as text processing techniques generally assume that the language of the input text is known. Research in this area has recently been especially active. This article provides a brief history of LI research, and an extensive survey of the features and methods used in the LI literature. We describe the features and methods using a unified notation, to make the relationships between methods clearer. We discuss evaluation methods, applications of LI, as well as off-the-shelfLI systems that do not require training by the end user. Finally, we identify open issues, survey the work to date on each issue, and propose future directions for research in LI.


On Human Robot Interaction using Multiple Modes

arXiv.org Artificial Intelligence

Today robotics is a vibrant field of research and it has tremendous application potentials not only in the area of industrial environment, battle field, construction industry and deep sea exploration but also in the household domain as a humanoid social robot. To be accepted in the household, the robots must have a higher level of intelligence and they must be capable of interacting people socially around it who is not supposed to be robot specialist. All these come under the field of human robot interaction (HRI). Our hypothesis is- "It is possible to design a multimodal human robot interaction framework, to effectively communicate with Humanoid Robots". In order to establish the above hypothesis speech and gesture have been used as a mode of interaction and throughout the thesis we validate our hypothesis by theoretical design and experimental verifications.