Ensemble Learning
Top Python Libraries For Data Science with Free Courses
Dask is a powerful open-source Python parallel computing framework. Dask scales Python programs from single-core local workstations to huge distributed cloud clusters. Dask provides a familiar user experience by replicating the APIs of other PyData ecosystem programs like Pandas, Scikit-learn, and NumPy. It also offers low-level APIs that allow programmers to execute bespoke algorithms concurrently.
Automatic Identification and Classification of Share Buybacks and their Effect on Short-, Mid- and Long-Term Returns
This thesis investigates share buybacks, specifically share buyback announcements. It addresses how to recognize such announcements, the excess return of share buybacks, and the prediction of returns after a share buyback announcement. We illustrate two NLP approaches for the automated detection of share buyback announcements. Even with very small amounts of training data, we can achieve an accuracy of up to 90%. This thesis utilizes these NLP methods to generate a large dataset consisting of 57,155 share buyback announcements. By analyzing this dataset, this thesis aims to show that most companies, which have a share buyback announced are underperforming the MSCI World. A minority of companies, however, significantly outperform the MSCI World. This significant overperformance leads to a net gain when looking at the averages of all companies. If the benchmark index is adjusted for the respective size of the companies, the average overperformance disappears, and the majority underperforms even greater. However, it was found that companies that announce a share buyback with a volume of at least 1% of their market cap, deliver, on average, a significant overperformance, even when using an adjusted benchmark. It was also found that companies that announce share buybacks in times of crisis emerge better than the overall market. Additionally, the generated dataset was used to train 72 machine learning models. Through this, it was able to find many strategies that could achieve an accuracy of up to 77% and generate great excess returns. A variety of performance indicators could be improved across six different time frames and a significant overperformance was identified. This was achieved by training several models for different tasks and time frames as well as combining these different models, generating significant improvement by fusing weak learners, in order to create one strong learner.
Cross-lingual Dysarthria Severity Classification for English, Korean, and Tamil
Yeo, Eun Jung, Choi, Kwanghee, Kim, Sunhee, Chung, Minhwa
This paper proposes a cross-lingual classification method for English, Korean, and Tamil, which employs both language-independent features and language-unique features. First, we extract thirty-nine features from diverse speech dimensions such as voice quality, pronunciation, and prosody. Second, feature selections are applied to identify the optimal feature set for each language. A set of shared features and a set of distinctive features are distinguished by comparing the feature selection results of the three languages. Lastly, automatic severity classification is performed, utilizing the two feature sets. Notably, the proposed method removes different features by languages to prevent the negative effect of unique features for other languages. Accordingly, eXtreme Gradient Boosting (XGBoost) algorithm is employed for classification, due to its strength in imputing missing data. In order to validate the effectiveness of our proposed method, two baseline experiments are conducted: experiments using the intersection set of mono-lingual feature sets (Intersection) and experiments using the union set of mono-lingual feature sets (Union). According to the experimental results, our method achieves better performance with a 67.14% F1 score, compared to 64.52% for the Intersection experiment and 66.74% for the Union experiment. Further, the proposed method attains better performances than mono-lingual classifications for all three languages, achieving 17.67%, 2.28%, 7.79% relative percentage increases for English, Korean, and Tamil, respectively. The result specifies that commonly shared features and language-specific features must be considered separately for cross-language dysarthria severity classification.
Modelling the Frequency of Home Deliveries: An Induced Travel Demand Contribution of Aggrandized E-shopping in Toronto during COVID-19 Pandemics
Liu, Yicong, Wang, Kaili, Loa, Patrick, Habib, Khandker Nurul
The dramatic growth of e-shopping will undoubtedly cause significant impacts on travel demand. As a result, transportation modeller's ability to model e-shopping demand is becoming increasingly important. This study developed models to predict households' weekly home delivery frequencies. We used both classical econometric and machine learning techniques to obtain the best model. It is found that socioeconomic factors such as having an online grocery membership, household members' average age, the percentage of male household members, the number of workers in the household and various land-use factors influence home delivery demand. This study also compared the interpretations and performances of the machine learning models and the classical econometric model. Agreement is found in the variable's effects identified through the machine learning and econometric models. However, with similar recall accuracy, the ordered probit model, a classical econometric model, can accurately predict the aggregate distribution of household delivery demand. In contrast, both machine learning models failed to match the observed distribution.
ESTA: An Esports Trajectory and Action Dataset
Xenopoulos, Peter, Silva, Claudio
Sports, due to their global reach and impact-rich prediction tasks, are an exciting domain to deploy machine learning models. However, data from conventional sports is often unsuitable for research use due to its size, veracity, and accessibility. To address these issues, we turn to esports, a growing domain that encompasses video games played in a capacity similar to conventional sports. Since esports data is acquired through server logs rather than peripheral sensors, esports provides a unique opportunity to obtain a massive collection of clean and detailed spatiotemporal data, similar to those collected in conventional sports. To parse esports data, we develop awpy, an open-source esports game log parsing library that can extract player trajectories and actions from game logs. Using awpy, we parse 8.6m actions, 7.9m game frames, and 417k trajectories from 1,558 game logs from professional Counter-Strike tournaments to create the Esports Trajectory and Actions (ESTA) dataset. ESTA is one of the largest and most granular publicly available sports data sets to date. We use ESTA to develop benchmarks for win prediction using player-specific information.
Accurate ADMET Prediction with XGBoost
Tian, Hao, Ketkar, Rajas, Tao, Peng
The absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties are important in drug discovery as they define efficacy and safety. In this work, we applied an ensemble of features, including fingerprints and descriptors, and a tree-based machine learning model, extreme gradient boosting, for accurate ADMET prediction. Our model performs well in the Therapeutics Data Commons ADMET benchmark group. For 22 tasks, our model is ranked first in 18 tasks and top 3 in 21 tasks. The trained machine learning models are integrated in ADMETboost, a web server that is publicly available at https://ai-druglab.smu.edu/admet.
Feature Importance to Predict Mushrooms' Edibility in Python
This article aims at leveraging feature importance to assess whether all the columns within a dataset need to be used for prediction or not. Imagine you're enjoying a walk in the woods and you find some mushrooms on the side of the path. Wouldn't it be nice to input some of their features into an ML-powered application that can detect with confidence edible qualities? I'm personally not into mushroom hunting but I'm definitely into food, and I can already smell a nice dish of "tagliolini ai funghi" in front of me after a long walk. Mushrooms are fungi, part of a kingdom of their own separate from plants and animals.
Machine learning for real-time aggregated prediction of hospital admission for emergency patients - npj Digital Medicine
Machine learning for hospital operations is under-studied. We present a prediction pipeline that uses live electronic health-records for patients in a UK teaching hospital’s emergency department (ED) to generate short-term, probabilistic forecasts of emergency admissions. A set of XGBoost classifiers applied to 109,465 ED visits yielded AUROCs from 0.82 to 0.90 depending on elapsed visit-time at the point of prediction. Patient-level probabilities of admission were aggregated to forecast the number of admissions among current ED patients and, incorporating patients yet to arrive, total emergency admissions within specified time-windows. The pipeline gave a mean absolute error (MAE) of 4.0 admissions (mean percentage error of 17%) versus 6.5 (32%) for a benchmark metric. Models developed with 104,504 later visits during the Covid-19 pandemic gave AUROCs of 0.68–0.90 and MAE of 4.2 (30%) versus a 4.9 (33%) benchmark. We discuss how we surmounted challenges of designing and implementing models for real-time use, including temporal framing, data preparation, and changing operational conditions.
FLInt: Exploiting Floating Point Enabled Integer Arithmetic for Efficient Random Forest Inference
Hakert, Christian, Chen, Kuan-Hsun, Chen, Jian-Jia
In many machine learning applications, e.g., tree-based ensembles, floating point numbers are extensively utilized due to their expressiveness. Nowadays performing data analysis on embedded devices from dynamic data masses becomes available, but such systems often lack hardware capabilities to process floating point numbers, introducing large overheads for their processing. Even if such hardware is present in general computing systems, using integer operations instead of floating point operations promises to reduce operation overheads and improve the performance. In this paper, we provide \mdname, a full precision floating point comparison for random forests, by only using integer and logic operations. To ensure the same functionality preserves, we formally prove the correctness of this comparison. Since random forests only require comparison of floating point numbers during inference, we implement \mdname~in low level realizations and therefore eliminate the need for floating point hardware entirely, by keeping the model accuracy unchanged. The usage of \mdname~basically boils down to a one-by-one replacement of conditions: For instance, a comparison statement in C: if(pX[3]<=(float)10.074347) becomes if((*(((int*)(pX))+3))<=((int)(0x41213087))). Experimental evaluation on X86 and ARMv8 desktop and server class systems shows that the execution time can be reduced by up to $\approx 30\%$ with our novel approach.
Patient-specific modelling, simulation and real-time processing for respiratory diseases
Asthma is a common chronic disease of the respiratory system causing significant disability and societal burden. It affects more than 300 million people worldwide, while more than 100 million people will likely have asthma by 2025. The price of asthma varies greatly from nation to nation. Mean yearly cost can be estimated to 1900 EUR in Europe and $3100 in the United States. Managing asthma involves controlling symptoms, preventing exacerbations, and maintaining lung function. Improved asthma control is reduces the risk of exacerbations and lung function impairment while reducing the direct costs of asthma care and indirect costs associated with reduced productivity. Understanding the complex dynamics of the pulmonary system and the lung's response to disease is fundamental to the advancement of Asthma treatment. Computational models of the respiratory system seek to provide a theoretical framework to understand the interaction between structure and function. Their application can improve pulmonary medicine by a patient-specific approach to medicinal methodologies optimizing the delivery given the personalized geometry and personalized ventilation patterns. A three-fold objective is addressed within this dissertation. The first part refers to the comprehension of pulmonary pathophysiology and the mechanics of Asthma and subsequently of constrictive pulmonary conditions in general. The second part refers to the design and implementation of tools that facilitate personalized medicine to improve delivery and effectiveness. Finally, the third part refers to the self-management of the condition, meaning that medical personnel and patients have access to tools and methods that allow the first party to easily track the course of the condition and the second party, i.e. the patient to easily self-manage it alleviating the significant burden from the health system.