 Support Vector Machines


Leveraging machine learning for less developed languages: Progress on Urdu text detection

arXiv.org Artificial Intelligence

Text detection in natural scene images has applications in autonomous driving and in navigation aids for elderly and blind people. However, research on Urdu text detection is usually hindered by a lack of data resources. We have developed a dataset of scene images with Urdu text. We present the use of machine learning methods to detect Urdu text in scene images. We extract text regions using the channel-enhanced Maximally Stable Extremal Region (MSER) method. First, we classify text and noise based on their geometric properties. Next, we use a support vector machine for early discarding of non-text regions. To further remove non-text regions, we extract histogram of oriented gradients (HoG) features and train a second SVM classifier. This improves the overall performance of text region detection within the scene images. To support research on Urdu text, we aim to make the data freely available for research use. We also aim to highlight the challenges and the research gap for Urdu text detection.
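The first filtering stage described above, separating text from noise by geometric properties, can be illustrated with a minimal sketch. The abstract does not say which properties or thresholds are used, so the `Region` type, the three checks (area, aspect ratio, fill ratio), and every threshold value below are hypothetical choices for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Bounding box of a candidate region (hypothetical type), in pixels."""
    x: int
    y: int
    width: int
    height: int
    pixel_count: int  # pixels belonging to the extremal region itself


def is_text_like(region, min_area=30, max_aspect=10.0,
                 min_fill=0.1, max_fill=0.95):
    """Return True if the region passes simple geometric checks.

    All thresholds are illustrative assumptions, not values from the paper.
    """
    if region.width <= 0 or region.height <= 0:
        return False
    box_area = region.width * region.height
    if box_area < min_area:
        return False  # too small to be a readable glyph
    aspect = max(region.width / region.height,
                 region.height / region.width)
    if aspect > max_aspect:
        return False  # long thin strips are likely edges or wires
    fill = region.pixel_count / box_area
    return min_fill <= fill <= max_fill  # reject near-empty or solid boxes


def filter_regions(regions):
    """Keep only the regions that look like text candidates."""
    return [r for r in regions if is_text_like(r)]
```

Regions that survive a cheap filter of this kind would then go to the SVM stages, so the more expensive classifiers see fewer candidates.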


Hyperspectral Remote Sensing Benchmark Database for Oil Spill Detection with an Isolation Forest-Guided Unsupervised Detector

arXiv.org Artificial Intelligence

Oil spill detection has attracted increasing attention in recent years since marine oil spill accidents severely affect environments, natural resources, and the lives of coastal inhabitants. Hyperspectral remote sensing images provide rich spectral information which is beneficial for the monitoring of oil spills in complex ocean scenarios. However, most of the existing approaches are based on supervised and semi-supervised frameworks to detect oil spills from hyperspectral images (HSIs), which require a huge amount of effort to annotate a certain number of high-quality training sets. In this study, we make the first attempt to develop an unsupervised oil spill detection method based on isolation forest for HSIs. First, considering that the noise level varies among different bands, a noise variance estimation method is exploited to evaluate the noise level of different bands, and the bands corrupted by severe noise are removed. Second, kernel principal component analysis (KPCA) is employed to reduce the high dimensionality of the HSIs. Then, the probability of each pixel belonging to one of the classes of seawater and oil spills is estimated with the isolation forest, and a set of pseudo-labeled training samples is automatically produced using the clustering algorithm on the detected probability. Finally, an initial detection map can be obtained by performing the support vector machine (SVM) on the dimension-reduced data, and then, the initial detection result is further optimized with the extended random walker (ERW) model so as to improve the detection accuracy of oil spills. Experiments on airborne hyperspectral oil spill data (HOSD) created by ourselves demonstrate that the proposed method obtains superior detection performance with respect to other state-of-the-art detection approaches.
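The pseudo-labeling step above, clustering the per-pixel scores produced by the isolation forest, can be sketched as follows. The abstract does not name the clustering algorithm, so this sketch assumes a two-cluster k-means (Lloyd's algorithm) on one-dimensional scores; the function name `two_means_1d` is hypothetical, and which cluster corresponds to oil is left to the caller.

```python
def two_means_1d(scores, iterations=50):
    """Split 1-D scores into two clusters with Lloyd's algorithm (k = 2).

    Returns (low_center, high_center, labels), where labels[i] is 1 if
    scores[i] falls in the higher-scoring cluster and 0 otherwise.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    # Initialise the two centers at the extremes of the data.
    low, high = min(scores), max(scores)
    labels = [0] * len(scores)
    for _ in range(iterations):
        # Assignment step: each score goes to the nearer center.
        labels = [1 if abs(s - high) < abs(s - low) else 0 for s in scores]
        low_members = [s for s, lab in zip(scores, labels) if lab == 0]
        high_members = [s for s, lab in zip(scores, labels) if lab == 1]
        # Update step: centers move to the mean of their members.
        new_low = sum(low_members) / len(low_members) if low_members else low
        new_high = (sum(high_members) / len(high_members)
                    if high_members else high)
        if new_low == low and new_high == high:
            break  # converged
        low, high = new_low, new_high
    return low, high, labels
```

The resulting labels could then serve as training targets for the SVM stage, which is what removes the need for manual annotation.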


What Are You Anxious About? Examining Subjects of Anxiety during the COVID-19 Pandemic

arXiv.org Artificial Intelligence

COVID-19 poses disproportionate mental health consequences to the public during different phases of the pandemic. We use a computational approach to capture the specific aspects that trigger an online community's anxiety about the pandemic and investigate how these aspects change over time. First, we identified nine subjects of anxiety (SOAs) in a sample of Reddit posts (N=86) from r/COVID19_support using thematic analysis. Then, we quantified Reddit users' anxiety by training algorithms on a manually annotated sample (N=793) to automatically label the SOAs in a larger chronological sample (N=6,535). The nine SOAs align with items in various recently developed pandemic anxiety measurement scales. We observed that Reddit users' concerns about health risks remained high in the first eight months of the pandemic. These concerns diminished dramatically despite the surge of cases occurring later. In general, users' language disclosing the SOAs became less intense as the pandemic progressed. However, worries about mental health and the future increased steadily throughout the period covered in this study. People also tended to use more intense language to describe mental health concerns than health risks or death concerns. Our results suggest that this online group's mental health condition does not necessarily improve despite COVID-19 gradually weakening as a health threat due to appropriate countermeasures. Our system lays the groundwork for population health and epidemiology scholars to examine aspects that provoke pandemic anxiety in a timely fashion.


Improving Probabilistic Models in Text Classification via Active Learning

arXiv.org Artificial Intelligence

Social scientists often classify text documents to use the resulting labels as an outcome or a predictor in empirical research. Automated text classification has become a standard tool, since it requires less human coding. However, scholars still need many human-labeled documents to train automated classifiers. To reduce labeling costs, we propose a new algorithm for text classification that combines a probabilistic model with active learning. The probabilistic model uses both labeled and unlabeled data, and active learning concentrates labeling efforts on documents that are difficult to classify. Our validation study shows that the classification performance of our algorithm is comparable to state-of-the-art methods at a fraction of the computational cost. Moreover, we replicate two recently published articles and reach the same substantive conclusions with only a small proportion of the original labeled data used in those studies. We provide activeText, an open-source software package that implements our method.
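The active-learning idea above, spending labeling effort on the documents the model finds hardest, is often realised as uncertainty sampling. The sketch below assumes a binary classifier and ranks unlabeled documents by the entropy of their predicted class probability. It is a generic illustration, not the selection rule of activeText, and the function names are hypothetical.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) prediction; 0 when p is 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def select_for_labeling(probabilities, batch_size):
    """Return indices of the `batch_size` most uncertain documents.

    `probabilities[i]` is the model's predicted probability that
    unlabeled document i belongs to the positive class.
    """
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: binary_entropy(probabilities[i]),
                    reverse=True)
    return ranked[:batch_size]
```

Documents with predicted probabilities near 0.5 are selected first; after a human labels them, the model is refit and the selection repeats.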


Using Unmanned Aerial Systems (UAS) for Assessing and Monitoring Fall Hazard Prevention Systems in High-rise Building Projects

arXiv.org Artificial Intelligence

This study develops a framework for unmanned aerial systems (UASs) to monitor fall hazard prevention systems near unprotected edges and openings in high-rise building projects. A three-step machine-learning-based framework was developed and tested to detect guardrail posts from the images captured by UAS. First, a guardrail detector was trained to localize the candidate locations of posts supporting the guardrail. Since the images used in this process were collected from an actual job site, several false detections were identified. Therefore, additional constraints were introduced in the following steps to filter out false detections. Second, the research team applied a horizontal line detector to the image to properly detect floors and remove the detections that were not close to the floors. Finally, since the guardrail posts are installed with approximately normally distributed spacing, the space between them was estimated and used to find the most likely distance between two posts. The research team used various combinations of the developed approaches to monitor guardrail systems in the captured images from a high-rise building project. Comparing the precision and recall metrics indicated that the cascade classifier achieves better performance with floor detection and guardrail spacing estimation. The research outcomes illustrate that the proposed guardrail recognition system can improve the assessment of guardrails and facilitate the safety engineer's task of identifying fall hazards in high-rise building projects.
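The second and third steps above can be sketched in a few lines, under stated assumptions. Detections are taken to be (x, y) pixel coordinates of post bases and floors to be horizontal lines given by their y coordinate. The `max_offset` tolerance and the use of the median gap as the "most likely" spacing are illustrative choices, not details from the study.

```python
from statistics import median


def near_floor(detections, floor_ys, max_offset):
    """Keep (x, y) detections lying within max_offset pixels of a floor line."""
    return [(x, y) for x, y in detections
            if any(abs(y - fy) <= max_offset for fy in floor_ys)]


def typical_spacing(xs):
    """Median gap between consecutive post x-positions on one floor.

    The median is a robust stand-in for the mean of a roughly normal
    spacing distribution when a few detections are missing or spurious.
    """
    ordered = sorted(xs)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    if not gaps:
        raise ValueError("need at least two posts to estimate spacing")
    return median(gaps)
```

A detection whose distance to its neighbours differs strongly from the estimated spacing could then be flagged as a likely false positive.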


Can Transformer Models Effectively Detect Software Aspects in StackOverflow Discussion?

arXiv.org Artificial Intelligence

Dozens of new tools and technologies are being incorporated to help developers, which is becoming a source of consternation as they struggle to choose one over the others. For example, there are at least ten frameworks available to developers for developing web applications, posing a conundrum in selecting the best one that meets their needs. As a result, developers are continuously searching for all of the benefits and drawbacks of each API, framework, tool, and so on. One of the typical approaches is to examine all of the features through official documentation and discussion. This approach is time-consuming and often makes it difficult to determine which aspects are the most important to a particular developer and whether a particular aspect is important to the community at large. In this paper, we have used a benchmark API aspects dataset (Opiner) collected from StackOverflow posts and observed how Transformer models (BERT, RoBERTa, DistilBERT, and XLNet) perform in detecting software aspects in textual developer discussion with respect to the baseline Support Vector Machine (SVM) model. Through extensive experimentation, we have found that transformer models improve on the performance of the baseline SVM for most of the aspects, i.e., 'Performance', 'Security', 'Usability', 'Documentation', 'Bug', 'Legal', 'OnlySentiment', and 'Others'. However, the models fail to apprehend some of the aspects (e.g., 'Community' and 'Portability') and their performance varies depending on the aspect. Also, larger architectures like XLNet are ineffective in interpreting software aspects compared to smaller architectures like DistilBERT.


On the speed of uniform convergence in Mercer's theorem

arXiv.org Artificial Intelligence

Mercer kernels play an important role in machine learning and are the mathematical basis of such techniques as kernel density estimation and spline models [14], Support Vector Machines [11], kernel principal components analysis [10], regularization of neural networks [13] and many others. According to Aronszajn's theorem, any Mercer kernel induces a reproducing kernel Hilbert space (RKHS) and, vice versa, any RKHS corresponds to a kernel. The relationship between the latter two notions is described in the classical Mercer's theorem. The goal of this note is to refine this theorem and give some estimates on the speed of the uniform convergence stated in it.
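For reference, the classical statement that the note refines can be written as follows. This is the standard textbook form; the symbols $X$, $\mu$, $T_K$, $\lambda_i$, $\varphi_i$ and $R_n$ are notation chosen here and may differ from the note's own.

```latex
Let $X$ be a compact metric space with a finite Borel measure $\mu$ of full
support, and let $K \colon X \times X \to \mathbb{R}$ be continuous, symmetric
and positive semidefinite. The integral operator
\[
  (T_K f)(x) = \int_X K(x, y)\, f(y)\, d\mu(y), \qquad f \in L^2(X, \mu),
\]
has eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots \ge 0$ with continuous
orthonormal eigenfunctions $\varphi_1, \varphi_2, \dots$, and
\[
  K(x, y) = \sum_{i=1}^{\infty} \lambda_i\, \varphi_i(x)\, \varphi_i(y),
\]
where the series converges absolutely and uniformly on $X \times X$.
The speed of uniform convergence is the rate at which the remainder
\[
  R_n = \sup_{x, y \in X}
        \Bigl| K(x, y) - \sum_{i=1}^{n} \lambda_i\, \varphi_i(x)\, \varphi_i(y) \Bigr|
\]
tends to zero as $n \to \infty$.
```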


Chance constrained conic-segmentation support vector machine with uncertain data

arXiv.org Artificial Intelligence

In classification problems, a classifier is a function that mimics the relationship between the data vectors and their class labels. The support vector machine (SVM) is a popular classifier, which was proposed by Cortes and Vapnik [1] as a maximum margin classifier. The success of the SVM has encouraged further research into extensions to the more general multiclass case, which has been an active topic of research [2-4]. Shilton et al. [5] proposed the conic-segmentation support vector machine (CS-SVM) by introducing the concept of target space into the problem formulation and showed that some other multiclass classification models are special cases of this framework. The standard CS-SVM deals with the situation where the exact values of the data points are known.


Linear Algorithms for Robust and Scalable Nonparametric Multiclass Probability Estimation

arXiv.org Artificial Intelligence

Multiclass probability estimation is the problem of estimating conditional probabilities of a data point belonging to a class given its covariate information. It has broad applications in statistical analysis and data science. Recently a class of weighted Support Vector Machines (wSVMs) has been developed to estimate class probabilities through ensemble learning for $K$-class problems (Wu, Zhang and Liu, 2010; Wang, Zhang and Wu, 2019), where $K$ is the number of classes. The estimators are robust and achieve high accuracy for probability estimation, but their learning is implemented through pairwise coupling, which demands polynomial time in $K$. In this paper, we propose two new learning schemes, the baseline learning and the One-vs-All (OVA) learning, to further improve wSVMs in terms of computational efficiency and estimation accuracy. In particular, the baseline learning has optimal computational complexity in the sense that it is linear in $K$. Though not the most efficient in computation, the OVA offers the best estimation accuracy among all the procedures under comparison. The resulting estimators are distribution-free and shown to be consistent. We further conduct extensive numerical experiments to demonstrate finite sample performance.
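The complexity contrast above comes down to how many binary subproblems each scheme must solve. The sketch below counts them for pairwise coupling (one per unordered class pair) and for One-vs-All (one per class). It counts subproblems only and says nothing about the cost of each weighted SVM fit; the function names are hypothetical.

```python
from itertools import combinations


def pairwise_problems(k):
    """List the class pairs (i, j), i < j, used by pairwise coupling."""
    if k < 2:
        raise ValueError("need at least two classes")
    return list(combinations(range(k), 2))


def one_vs_all_problems(k):
    """List the classes, each of which is separated from all the others."""
    if k < 2:
        raise ValueError("need at least two classes")
    return list(range(k))


# For K = 10 classes, pairwise coupling needs K(K-1)/2 = 45 binary
# problems, while One-vs-All needs K = 10.
```

The pairwise count grows quadratically in $K$, whereas a scheme that is linear in $K$ solves a number of subproblems proportional to the number of classes.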


Leak Detection in Natural Gas Pipeline Using Machine Learning Models

arXiv.org Artificial Intelligence

Leak detection in gas pipelines is an important and persistent problem in the Oil and Gas industry. This is particularly important as pipelines are the most common way of transporting natural gas. This research aims to study the ability of data-driven intelligent models to detect small leaks in a natural gas pipeline using basic operational parameters and then compare the intelligent models among themselves using existing performance metrics. This project applies the observer design technique to detect leaks in natural gas pipelines using a regresso-classification hierarchical model where an intelligent model acts as a regressor and a modified logistic regression model acts as a classifier. Five intelligent models (gradient boosting, decision trees, random forest, support vector machine and artificial neural network) are studied in this project using a pipeline data stream of four weeks. The results show that while support vector machine and artificial neural networks are better regressors than the others, they do not provide the best results in leak detection due to their internal complexities and the volume of data used. The random forest and decision tree models are the most sensitive as they can detect a leak of 0.1% of nominal flow in about 2 hours. All the intelligent models had high reliability with zero false alarm rate in the testing phase. The average time to leak detection for all the intelligent models was compared to a real time transient model from the literature. The results show that intelligent models perform relatively well in the problem of leak detection. This result suggests that intelligent models could be used alongside a real time transient model to significantly improve leak detection results.
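The regressor-then-classifier hierarchy above can be illustrated with a minimal sketch: a regressor predicts the flow expected under normal operation, and a logistic function turns the shortfall between prediction and measurement into a leak probability. The abstract does not specify the "modified logistic regression", so a plain logistic function is used here, and the `slope` and `threshold` values are hypothetical.

```python
import math


def leak_probability(measured_flow, predicted_flow, nominal_flow,
                     slope=4000.0, threshold=0.0005):
    """Logistic score on the flow shortfall relative to nominal flow.

    `predicted_flow` would come from the regressor; `slope` and
    `threshold` are illustrative assumptions, not fitted values.
    """
    if nominal_flow <= 0:
        raise ValueError("nominal_flow must be positive")
    # Positive residual: less gas arrives than the regressor expects.
    residual = (predicted_flow - measured_flow) / nominal_flow
    z = slope * (residual - threshold)
    # Numerically stable logistic function for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)
```

With these placeholder parameters, a shortfall of 0.1% of nominal flow scores above 0.5 and a zero shortfall scores below it. In practice the classifier's parameters would be fitted to labeled leak and no-leak periods.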