Accuracy
Cadence: A Practical Time-series Partitioning Algorithm for Unlabeled IoT Sensor Streams
Chowdhury, Tahiya, Aldeer, Murtadha, Laghate, Shantanu, Ortiz, Jorge
Timeseries partitioning is an essential step in most machine-learning driven, sensor-based IoT applications. This paper introduces a sample-efficient, robust, time-series segmentation model and algorithm. We show that by learning a representation specifically with the segmentation objective based on maximum mean discrepancy (MMD), our algorithm can robustly detect time-series events across different applications. Our loss function allows us to infer whether consecutive sequences of samples are drawn from the same distribution (null hypothesis) and determines the change-point between pairs that reject the null hypothesis (i.e., come from different distributions). We demonstrate its applicability in a real-world IoT deployment for ambient-sensing based activity recognition. Moreover, while many works on change-point detection exist in the literature, our model is significantly simpler and can be fully trained in 9-93 seconds on average with little variation in hyperparameters for data across different applications. We empirically evaluate Cadence on four popular change point detection (CPD) datasets where Cadence matches or outperforms existing CPD techniques.
Weighted Scaling Approach for Metabolomics Data Analysis
Biswas, Biplab, Kumar, Nishith, Hoque, Md Aminul, Alam, Md Ashad
Systematic variation is a common issue in metabolomics data analysis. Therefore, different scaling and normalization techniques are used to preprocess the data for metabolomics data analysis. Although several scaling methods are available in the literature, however, choice of scaling, transformation and/or normalization technique influence the further statistical analysis. It is challenging to choose the appropriate scaling technique for downstream analysis to get accurate results or to make a proper decision. Moreover, the existing scaling techniques are sensitive to outliers or extreme values. To fill the gap, our objective is to introduce a robust scaling approach that is not influenced by outliers as well as provides more accurate results for downstream analysis. Here, we introduced a new weighted scaling approach that is robust against outliers however, where no additional outlier detection/treatment step is needed in data preprocessing and also compared it with the conventional scaling and normalization techniques through artificial and real metabolomics datasets. We evaluated the performance of the proposed method in comparison to the other existing conventional scaling techniques using metabolomics data analysis in both the absence and presence of different percentages of outliers. Results show that in most cases, the proposed scaling technique performs better than the traditional scaling methods in both the absence and presence of outliers. The proposed method improves the further downstream metabolomics analysis. The R function of the proposed robust scaling method is available at https://github.com/nishithkumarpaul/robustScaling/blob/main/wscaling.R
Flood Prediction Using Machine Learning Models
Syeed, Miah Mohammad Asif, Farzana, Maisha, Namir, Ishadie, Ishrar, Ipshita, Nushra, Meherin Hossain, Rahman, Tanvir
Floods are one of nature's most catastrophic calamities which cause irreversible and immense damage to human life, agriculture, infrastructure and socio-economic system. Several studies on flood catastrophe management and flood forecasting systems have been conducted. The accurate prediction of the onset and progression of floods in real time is challenging. To estimate water levels and velocities across a large area, it is necessary to combine data with computationally demanding flood propagation models. This paper aims to reduce the extreme risks of this natural disaster and also contributes to policy suggestions by providing a prediction for floods using different machine learning models. This research will use Binary Logistic Regression, K-Nearest Neighbor (KNN), Support Vector Classifier (SVC) and Decision tree Classifier to provide an accurate prediction. With the outcome, a comparative analysis will be conducted to understand which model delivers a better accuracy.
On the Detection of Adaptive Adversarial Attacks in Speaker Verification Systems
Speaker verification systems have been widely used in smart phones and Internet of things devices to identify legitimate users. In recent work, it has been shown that adversarial attacks, such as FAKEBOB, can work effectively against speaker verification systems. The goal of this paper is to design a detector that can distinguish an original audio from an audio contaminated by adversarial attacks. Specifically, our designed detector, called MEH-FEST, calculates the minimum energy in high frequencies from the short-time Fourier transform of an audio and uses it as a detection metric. Through both analysis and experiments, we show that our proposed detector is easy to implement, fast to process an input audio, and effective in determining whether an audio is corrupted by FAKEBOB attacks. The experimental results indicate that the detector is extremely effective: with near zero false positive and false negative rates for detecting FAKEBOB attacks in Gaussian mixture model (GMM) and i-vector speaker verification systems. Moreover, adaptive adversarial attacks against our proposed detector and their countermeasures are discussed and studied, showing the game between attackers and defenders.
A Multifaceted Benchmarking of Synthetic Electronic Health Record Generation Models
Yan, Chao, Yan, Yao, Wan, Zhiyu, Zhang, Ziqi, Omberg, Larsson, Guinney, Justin, Mooney, Sean D., Malin, Bradley A.
Synthetic health data have the potential to mitigate privacy concerns when sharing data to support biomedical research and the development of innovative healthcare applications. Modern approaches for data generation based on machine learning, generative adversarial networks (GAN) methods in particular, continue to evolve and demonstrate remarkable potential. Yet there is a lack of a systematic assessment framework to benchmark methods as they emerge and determine which methods are most appropriate for which use cases. In this work, we introduce a generalizable benchmarking framework to appraise key characteristics of synthetic health data with respect to utility and privacy metrics. We apply the framework to evaluate synthetic data generation methods for electronic health records (EHRs) data from two large academic medical centers with respect to several use cases. The results illustrate that there is a utility-privacy tradeoff for sharing synthetic EHR data. The results further indicate that no method is unequivocally the best on all criteria in each use case, which makes it evident why synthetic data generation methods need to be assessed in context.
How should we proxy for race/ethnicity? Comparing Bayesian improved surname geocoding to machine learning methods
Political science research often requires constructing a race/ethnicity proxy variable for datasets that do not contain it, like voter registration files, lists of electoral candidates, or political donation records. Constructing such a proxy is an important step for conducting ecological inference in voting rights litigation (Barreto et al. [2019], Imai and Khanna [2016]), redistricting (DeLuca and Curiel [2022], Kenny et al. [2021]), and substantive research on the role of race/ethnicity in politics (Enos [2016], Enos et al. [2019], Grumbach and Sahn [2020]). The most common method for proxying race/ethnicity is Bayesian Improved Surname Geocoding (BISG), which uses Bayes' rule to compute a probability distribution over race/ethnicity categories conditional on a voter's surname and where they live (Elliott et al. [2008, 2009]). BISG has attained widespread popularity due to its parsimony, computational efficiency, and superior performance when compared to existing alternatives, namely spatial interpolation of Census racial-ethnic composition from Census geographies (Imai and Khanna [2016], Clark et al. [2021], Shah and Davis [2017]). While BISG performs well compared to the small suite of existing alternatives, it has not yet been benchmarked against machine learning (ML) models, which can produce race/ethnicity predictions from more flexible and potentially more accurate models. In this paper I present the results of such a benchmark. I train a range of machine learning models using voter registration data from Florida, Georgia, North Carolina, and a portion of California where voters self-report their race/ethnicity upon registration. The registries in these states contain over 26 million labelled observations, which equates to greater than a five percent non-representative sample of the United States electorate. I then compare BISG against predictions from these models made out-of-state.
XOOD: Extreme Value Based Out-Of-Distribution Detection For Image Classification
Berglind, Frej, Temam, Haron, Mukhopadhyay, Supratik, Das, Kamalika, Sajol, Md Saiful Islam, Kumar, Sricharan, Kallurupalli, Kumar
Detecting out-of-distribution (OOD) data at inference time is crucial for many applications of machine learning. We present XOOD: a novel extreme value-based OOD detection framework for image classification that consists of two algorithms. The first, XOOD-M, is completely unsupervised, while the second XOOD-L is self-supervised. Both algorithms rely on the signals captured by the extreme values of the data in the activation layers of the neural network in order to distinguish between in-distribution and OOD instances. We show experimentally that both XOOD-M and XOOD-L outperform state-of-the-art OOD detection methods on many benchmark data sets in both efficiency and accuracy, reducing false-positive rate (FPR95) by 50%, while improving the inferencing time by an order of magnitude.
Mitigating Biases in Student Performance Prediction via Attention-Based Personalized Federated Learning
Chu, Yun-Wei, Hosseinalipour, Seyyedali, Tenorio, Elizabeth, Cruz, Laura, Douglas, Kerrie, Lan, Andrew, Brinton, Christopher
Traditional learning-based approaches to student modeling generalize poorly to underrepresented student groups due to biases in data availability. In this paper, we propose a methodology for predicting student performance from their online learning activities that optimizes inference accuracy over different demographic groups such as race and gender. Building upon recent foundations in federated learning, in our approach, personalized models for individual student subgroups are derived from a global model aggregated across all student models via meta-gradient updates that account for subgroup heterogeneity. To learn better representations of student activity, we augment our approach with a self-supervised behavioral pretraining methodology that leverages multiple modalities of student behavior (e.g., visits to lecture videos and participation on forums), and include a neural network attention mechanism in the model aggregation stage. Through experiments on three real-world datasets from online courses, we demonstrate that our approach obtains substantial improvements over existing student modeling baselines in predicting student learning outcomes for all subgroups. Visual analysis of the resulting student embeddings confirm that our personalization methodology indeed identifies different activity patterns within different subgroups, consistent with its stronger inference ability compared with the baselines.
Word-level Text Highlighting of Medical Texts for Telehealth Services
Ozyegen, Ozan, Kabe, Devika, Cevik, Mucahit
The medical domain is often subject to information overload. The digitization of healthcare, constant updates to online medical repositories, and increasing availability of biomedical datasets make it challenging to effectively analyze the data. This creates additional work for medical professionals who are heavily dependent on medical data to complete their research and consult their patients. This paper aims to show how different text highlighting techniques can capture relevant medical context. This would reduce the doctors' cognitive load and response time to patients by facilitating them in making faster decisions, thus improving the overall quality of online medical services. Three different word-level text highlighting methodologies are implemented and evaluated. The first method uses TF-IDF scores directly to highlight important parts of the text. The second method is a combination of TF-IDF scores and the application of Local Interpretable Model-Agnostic Explanations to classification models. The third method uses neural networks directly to make predictions on whether or not a word should be highlighted. The results of our experiments show that the neural network approach is successful in highlighting medically-relevant terms and its performance is improved as the size of the input segment increases.
EMFlow: Data Imputation in Latent Space via EM and Deep Flow Models
The presence of missing values within high-dimensional data is an ubiquitous problem for many applied sciences. A serious limitation of many available data mining and machine learning methods is their inability to handle partially missing values and so an integrated approach that combines imputation and model estimation is vital for down-stream analysis. A computationally fast algorithm, called EMFlow, is introduced that performs imputation in a latent space via an online version of Expectation-Maximization (EM) algorithm by using a normalizing flow (NF) model which maps the data space to a latent space. The proposed EMFlow algorithm is iterative, involving updating the parameters of online EM and NF alternatively. Extensive experimental results for high-dimensional multivariate and image datasets are presented to illustrate the superior performance of the EMFlow compared to a couple of recently available methods in terms of both predictive accuracy and speed of algorithmic convergence. We provide code for all our experiments.