Murmur2Vec: A Hashing Based Solution For Embedding Generation Of COVID-19 Spike Sequences
Early detection and characterization of coronavirus disease (COVID-19), caused by SARS-CoV-2, remain critical for effective clinical response and public-health planning. The global availability of large-scale viral sequence data presents significant opportunities for computational analysis; however, existing approaches face notable limitations. Phylogenetic tree-based methods are computationally intensive and do not scale efficiently to today's multi-million-sequence datasets. Similarly, current embedding-based techniques often rely on aligned sequences or exhibit suboptimal predictive performance and high runtime costs, creating barriers to practical large-scale analysis. In this study, we focus on the most prevalent SARS-CoV-2 lineages associated with the spike protein region and introduce a scalable embedding method that leverages hashing to generate compact, low-dimensional representations of spike sequences. These embeddings are subsequently used to train a variety of machine learning models for supervised lineage classification. We conduct an extensive evaluation comparing our approach with multiple baseline and state-of-the-art biological sequence embedding methods across diverse metrics. Our results demonstrate that the proposed embeddings offer substantial improvements in efficiency, achieving up to 86.4\% classification accuracy while reducing embedding generation time by as much as 99.81\%. This highlights the method's potential as a fast, effective, and scalable solution for large-scale viral sequence analysis.
- Asia > China > Guangdong Province (0.14)
- North America > United States > California > Alameda County > Fremont (0.04)
- Asia > Pakistan > Sindh > Karachi Division > Karachi (0.04)
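The Murmur2Vec abstract above describes hashing k-mers of spike sequences into compact fixed-size vectors. A minimal sketch of that idea follows; it uses `zlib.crc32` as a stand-in for MurmurHash2 (which is not in the Python standard library), and the k-mer size, dimension, and normalization are illustrative assumptions, not the paper's exact parameters.

```python
import zlib

def hashed_kmer_embedding(seq, k=3, dim=64):
    """Feature-hashing sketch: slide a window of size k over the
    sequence and hash each k-mer into one of `dim` buckets.
    zlib.crc32 stands in for MurmurHash2 here."""
    vec = [0.0] * dim
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        h = zlib.crc32(kmer.encode())
        vec[h % dim] += 1.0          # count k-mers per bucket
    total = sum(vec) or 1.0
    return [v / total for v in vec]  # normalize to bucket frequencies

# toy spike-protein prefix, embedded into 16 dimensions
emb = hashed_kmer_embedding("MFVFLVLLPLVSSQCVNLT", k=3, dim=16)
```

Because the output size is fixed regardless of sequence length and no alignment is needed, embedding generation is a single linear pass per sequence, which is where the reported runtime savings come from.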
Benchmark Datasets for Lead-Lag Forecasting on Social Platforms
Kazemian, Kimia, Liu, Zhenzhen, Yang, Yangfanyu, Luo, Katie Z, Gu, Shuhan, Du, Audrey, Yang, Xinyu, Jansons, Jack, Weinberger, Kilian Q, Thickstun, John, Yin, Yian, Dean, Sarah
Social and collaborative platforms emit multivariate time-series traces in which early interactions--such as views, likes, or downloads--are followed, sometimes months or years later, by higher-impact outcomes like citations, sales, or reviews. We formalize this setting as Lead-Lag Forecasting (LLF): given an early usage channel (the lead), predict a correlated but temporally shifted outcome channel (the lag). Despite the ubiquity of such patterns, LLF has not been treated as a unified forecasting problem within the time-series community, largely due to the absence of standardized datasets. To anchor research in LLF, here we present two high-volume benchmark datasets--arXiv (accesses → citations of 2.3M papers) and GitHub (pushes/stars → forks of 3M repositories)--and outline additional domains with analogous lead-lag dynamics, including Wikipedia (page-views → edits), Spotify (streams → concert attendance), e-commerce (click-throughs → purchases), and LinkedIn (profile views → messages). Our datasets provide ideal testbeds for lead-lag forecasting, by capturing long-horizon dynamics across years, spanning the full spectrum of outcomes, and avoiding survivorship bias in sampling. We documented all technical details of data curation and cleaning, verified the presence of lead-lag dynamics through statistical and classification tests, and benchmarked parametric and non-parametric baselines for regression. Our study establishes LLF as a novel forecasting paradigm and lays an empirical foundation for its systematic exploration in social and usage data. The success of human activities is often measured by their collective impact, ranging from music streams and movie box office revenues to product sales and social media popularity.
These impact metrics typically follow heavy-tailed distributions (Clauset et al., 2009) and slow decay patterns across timescales (Candia et al., 2019), making early identification of future hits fundamentally challenging (Cheng et al., 2014; Martin et al., 2016). At the same time, digital platforms increasingly log online user interactions--searches, views, downloads, likes, and shares--that often precede these long-term dynamics. These temporal lead-lag dynamics are remarkably ubiquitous, spanning domains as diverse as science (Haque & Ginsparg, 2009), economics (Wu & Brynjolfsson, 2015), arts (Goel et al., 2010), culture (Gruhl et al., 2005), and social movements (Johnson et al., 2016). A systematic understanding of such lead-lag dynamics is not only crucial for anticipating and optimizing impact in digital ecosystems, but also essential for designing effective strategies that identify and promote emerging innovations and products.
- North America > United States > California (0.14)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- Information Technology (1.00)
- Government > Regional Government (0.68)
- Energy > Power Industry (0.68)
- (2 more...)
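The LLF formalization above asks for a lag outcome predicted from an early lead channel. The simplest parametric baseline of the kind the paper benchmarks can be sketched as a one-parameter least-squares fit; the numbers below are illustrative, not taken from the arXiv or GitHub benchmarks.

```python
def fit_lead_lag(pairs):
    """Minimal LLF baseline: predict the lag outcome as a scaled
    version of early lead activity, y ~ b * x, via least squares.
    `pairs` holds (lead_early_total, lag_final_total) per item,
    e.g. (first-year accesses, citations years later)."""
    sxx = sum(x * x for x, _ in pairs)
    sxy = sum(x * y for x, y in pairs)
    return sxy / sxx if sxx else 0.0

# hypothetical (accesses, citations) pairs for four papers
train = [(120, 10), (300, 28), (50, 3), (900, 95)]
b = fit_lead_lag(train)
predict = lambda lead: b * lead
```

Richer baselines would condition on the full lead trajectory rather than a single total, but even this scalar fit exposes whether a lead-lag correlation exists at all.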
The Road Less Scheduled
Defazio, Aaron, Yang, Xingyu (Alice)
Fundamental AI Research Team, Meta
Recently, Zamani and Glineur (2023) and Defazio et al. (2023) analyzed the exact worst-case convergence rate in this setting. Our approach uses an alternative form of momentum that replaces traditional momentum. From this viewpoint, the Schedule-Free updates can be seen as a version of momentum that has the same immediate effect, but with a greater delay before the remainder of the gradient is added in.
- Research Report > New Finding (0.68)
- Research Report > Experimental Study (0.67)
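The passage above describes Schedule-Free updates as momentum with the same immediate effect but a delayed contribution of the rest of the gradient. A sketch of that update rule, as commonly stated for Schedule-Free SGD (interpolation point y, fast iterate z, equal-weight average x), is below; the learning rate, beta, and toy objective are illustrative assumptions.

```python
def schedule_free_sgd(grad, w0, lr=0.1, beta=0.9, steps=500):
    """Schedule-Free SGD sketch: gradients are evaluated at an
    interpolation y of the fast iterate z and the running average x,
    so part of each gradient enters x only with a delay."""
    z = x = w0
    for t in range(1, steps + 1):
        y = (1 - beta) * z + beta * x   # gradient-evaluation point
        z = z - lr * grad(y)            # fast, SGD-like iterate
        c = 1.0 / t                     # equal-weight averaging
        x = (1 - c) * x + c * z         # x = running mean of z's
    return x                            # x is used at evaluation time

# toy quadratic f(w) = (w - 3)^2, minimized at w = 3
w = schedule_free_sgd(lambda w: 2 * (w - 3), w0=0.0)
```

Note that no learning-rate schedule appears anywhere in the loop: the averaging weight c = 1/t plays the role a decaying schedule would otherwise play.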
90fd4f88f588ae64038134f1eeaa023f-AuthorFeedback.pdf
Thank you for all the helpful comments. Several related works were raised by the reviewers, which we discuss here. We note that the authors have marked their arXiv submission as containing errors. Each of their inner loops uses SGD to solve the distance-regularized objectives. First, we use the EMA of slow weights to adjust the training parameters during optimization.
A. Algorithms. Algorithm 1: Training DHRL. 1: sample D
Within T time-steps, this upper bound on the error rate is also satisfied on all paths from s to g. As shown in the table above, the wider the initial distribution, the easier it is for the agent to explore the map; a 'fixed initial state distribution' requires less prior information about the state space. Figure 12: Changes in the graph level over training; DHRL can explore long tasks with a 'fixed initial state distribution'. The results are averaged over 4 random seeds and smoothed equally.
Categorical Classification of Book Summaries Using Word Embedding Techniques
Keskin, Kerem, Keleş, Mümine Kaya
In this study, book summaries and categories collected from book websites were classified using word embedding methods, natural language processing techniques, and machine learning algorithms. One-hot encoding, Word2Vec, and Term Frequency-Inverse Document Frequency (TF-IDF), which are frequently used word embedding methods, were applied and their success was compared. Additionally, a table showing the combinations of pre-processing methods used is provided. Looking at the results, it was observed that the Support Vector Machine, Naive Bayes, and Logistic Regression models, together with the TF-IDF and one-hot encoding techniques, gave more successful results for Turkish texts. The study also illustrates using Word2Vec to process large text data.
- Europe > Kosovo > District of Pristina > Pristina (0.06)
- Asia > Middle East > Republic of Türkiye > Adana Province > Adana (0.04)
- Asia > Afghanistan > Kabul Province > Kabul (0.04)
- (3 more...)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.55)
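TF-IDF, one of the weighting schemes the study compares, scales a term's in-document frequency by its inverse document frequency so that words common to every summary are down-weighted. A toy computation follows; real experiments would use a library implementation (e.g. scikit-learn's `TfidfVectorizer`), and the example documents are invented.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Toy TF-IDF: tf(t, d) * (log(N / df(t)) + 1), with whitespace
    tokenization. Returns one sparse dict of weights per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df = Counter(t for doc in tokenized for t in set(doc))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        vectors.append({t: (tf[t] / len(doc)) * idf[t] for t in tf})
    return vectors

vecs = tfidf_vectors(["a gripping mystery novel",
                      "a practical cooking guide",
                      "a mystery set in istanbul"])
```

In the first summary, "gripping" (unique to one document) outweighs "mystery" (shared by two), which in turn outweighs "a" (shared by all) -- exactly the discrimination property that makes TF-IDF effective for category classification.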
Underrepresentation, Label Bias, and Proxies: Towards Data Bias Profiles for the EU AI Act and Beyond
Ceccon, Marina, Cornacchia, Giandomenico, Pezze, Davide Dalle, Fabris, Alessandro, Susto, Gian Antonio
Undesirable biases encoded in the data are key drivers of algorithmic discrimination. Their importance is widely recognized in the algorithmic fairness literature, as well as legislation and standards on anti-discrimination in AI. Despite this recognition, data biases remain understudied, hindering the development of computational best practices for their detection and mitigation. In this work, we present three common data biases and study their individual and joint effect on algorithmic discrimination across a variety of datasets, models, and fairness measures. We find that underrepresentation of vulnerable populations in training sets is less conducive to discrimination than conventionally affirmed, while combinations of proxies and label bias can be far more critical. Consequently, we develop dedicated mechanisms to detect specific types of bias, and combine them into a preliminary construct we refer to as the Data Bias Profile (DBP). This initial formulation serves as a proof of concept for how different bias signals can be systematically documented. Through a case study with popular fairness datasets, we demonstrate the effectiveness of the DBP in predicting the risk of discriminatory outcomes and the utility of fairness-enhancing interventions. Overall, this article bridges algorithmic fairness research and anti-discrimination policy through a data-centric lens.
- Europe > Austria > Vienna (0.14)
- South America > Brazil > Rio de Janeiro > Rio de Janeiro (0.04)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- (19 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
- Law (1.00)
- Health & Medicine > Therapeutic Area > Dermatology (1.00)
- (5 more...)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
- Information Technology > Artificial Intelligence > Issues > Social & Ethical Issues (1.00)
- Information Technology > Data Science > Data Mining (0.93)
- Information Technology > Artificial Intelligence > Natural Language (0.92)
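The abstract above combines bias signals into a Data Bias Profile. Two of the simplest signals it names, underrepresentation and label bias, can be sketched as per-group statistics; this is an illustrative reduction under assumed dictionary-shaped rows, not the paper's DBP construction.

```python
from collections import defaultdict

def bias_profile(rows, group_key, label_key):
    """Record two per-group signals: representation (share of the
    dataset) and positive-label rate. A small representation flags
    underrepresentation; large gaps in positive rate between groups
    can indicate label bias (or genuine base-rate differences)."""
    counts = defaultdict(int)
    positives = defaultdict(int)
    for r in rows:
        g = r[group_key]
        counts[g] += 1
        positives[g] += 1 if r[label_key] else 0
    n = len(rows)
    return {g: {"representation": counts[g] / n,
                "positive_rate": positives[g] / counts[g]}
            for g in counts}

# hypothetical toy dataset with a majority group A and minority group B
data = [{"group": "A", "y": 1}, {"group": "A", "y": 1},
        {"group": "A", "y": 0}, {"group": "B", "y": 0}]
profile = bias_profile(data, "group", "y")
```

As the paper stresses, neither signal alone predicts discrimination well; it is their combination (plus proxy detection) that the DBP documents.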
Explaining Concept Shift with Interpretable Feature Attribution
Lyu, Ruiqi, Turcan, Alistair, Wilder, Bryan
Regardless of the amount of data a machine learning (ML) model is trained on, there will inevitably be data that differs from its training set, lowering model performance. Concept shift occurs when the distribution of labels conditioned on the features changes, causing even a well-tuned ML model to have learned a fundamentally incorrect representation. Identifying these shifted features provides unique insight into how one dataset differs from another, since the difference may lie along a scientifically relevant dimension, such as time, disease status, or population. In this paper, we propose SGShift, a model for detecting concept shift in tabular data and attributing reduced model performance to a sparse set of shifted features. SGShift models concept shift with a Generalized Additive Model (GAM) and performs subsequent feature selection to identify shifted features. We propose further extensions of SGShift that incorporate knockoffs to control false discoveries and an absorption term to account for models with poor fit to the data. We conduct extensive experiments on synthetic and real data across various ML models and find SGShift can identify shifted features with AUC $>0.9$ and recall $>90\%$, often 2 or 3 times as high as baseline methods.
- North America > United States > California (0.14)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
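The core intuition behind attributing concept shift to specific features can be sketched very simply: if the conditional distribution of labels has shifted along feature j, the source model's residuals on target data remain predictable from feature j. The helper below is an illustrative reduction of that idea (a covariance screen), not SGShift's GAM-based estimator; the model and data are invented.

```python
def shifted_feature_scores(model, target_X, target_y):
    """Score each feature by |covariance| between that feature and
    the source model's residuals on target data. Under concept
    shift, shifted features keep explaining the residuals."""
    preds = [model(x) for x in target_X]
    resid = [y - p for y, p in zip(target_y, preds)]
    d = len(target_X[0])
    scores = []
    for j in range(d):
        col = [x[j] for x in target_X]
        mc = sum(col) / len(col)
        mr = sum(resid) / len(resid)
        cov = sum((c - mc) * (r - mr)
                  for c, r in zip(col, resid)) / len(col)
        scores.append(abs(cov))
    return scores

# toy shift: source model learned y = x0, target truth is y = x0 + 2*x1
model = lambda x: x[0]
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
y = [x[0] + 2 * x[1] for x in X]
scores = shifted_feature_scores(model, X, y)
```

SGShift replaces this screen with a sparse GAM fit to the residual structure, adding knockoffs to control the false discovery rate among selected features.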
Sepsyn-OLCP: An Online Learning-based Framework for Early Sepsis Prediction with Uncertainty Quantification using Conformal Prediction
Zhou, Anni, Beyah, Raheem, Kamaleswaran, Rishikesan, Xie, Yao
Sepsis is a life-threatening syndrome with high morbidity and mortality in hospitals. Early prediction of sepsis plays a crucial role in facilitating early interventions for septic patients. However, early sepsis prediction systems with uncertainty quantification and adaptive learning are scarce. This paper proposes Sepsyn-OLCP, a novel online learning algorithm for early sepsis prediction that integrates conformal prediction for uncertainty quantification and Bayesian bandits for adaptive decision-making. By combining the robustness of Bayesian models with the statistical uncertainty guarantees of conformal prediction methodologies, this algorithm delivers accurate and trustworthy predictions, addressing the critical need for reliable and adaptive systems in high-stakes healthcare applications such as early sepsis prediction. We evaluate the performance of Sepsyn-OLCP in terms of regret in a stochastic bandit setting, the area under the receiver operating characteristic curve (AUROC), and F-measure. Our results show that Sepsyn-OLCP outperforms existing individual models, increasing the AUROC of a neural network from 0.64 to 0.73 without retraining or high computational costs, and that the model-selection policy converges to the optimal strategy in the long run. In summary, we propose a novel reinforcement learning-based framework integrated with conformal prediction techniques to provide uncertainty quantification for early sepsis prediction, a critical need in high-stakes healthcare applications.
- North America > United States (0.46)
- Europe > Iceland > Capital Region > Reykjavik (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.88)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.48)
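The conformal-prediction ingredient the abstract describes can be illustrated with a standard split-conformal sketch for binary risk scores; this is a simplified textbook version, not Sepsyn-OLCP itself, and the calibration scores below are invented.

```python
import math

def conformal_threshold(cal_scores, cal_labels, alpha=0.1):
    """Split conformal calibration: pick a nonconformity threshold q
    so that, with probability >= 1 - alpha, a new patient's true
    label has nonconformity <= q. Nonconformity here is 1 - score
    for the positive class and the raw score for the negative."""
    nonconf = sorted(1 - s if y == 1 else s
                     for s, y in zip(cal_scores, cal_labels))
    n = len(nonconf)
    k = min(n - 1, math.ceil((n + 1) * (1 - alpha)) - 1)
    return nonconf[k]

def prediction_set(score, q):
    """Conformal prediction set for a new risk score: each label is
    kept if its nonconformity is within the calibrated threshold."""
    labels = []
    if 1 - score <= q: labels.append(1)  # sepsis plausible
    if score <= q:     labels.append(0)  # no-sepsis plausible
    return labels

cal_scores = [0.9, 0.8, 0.95, 0.1, 0.2, 0.15, 0.85, 0.05]
cal_labels = [1, 1, 1, 0, 0, 0, 1, 0]
q = conformal_threshold(cal_scores, cal_labels, alpha=0.1)
```

Ambiguous scores yield larger (or empty) sets, which is precisely the uncertainty signal an adaptive clinical system can act on.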
Evaluating link prediction: New perspectives and recommendations
Kalyani, Bhargavi I, Mathi, A Rama Prasad, Sett, Niladri
Link prediction (LP) is an important problem in network science and machine learning research. The state-of-the-art LP methods are usually evaluated in a uniform setup, ignoring several factors associated with the data and application-specific needs. We identify a number of such factors, such as network type, problem type, geodesic distance between the end nodes and its distribution over the classes, the nature and applicability of LP methods, class imbalance and its impact on early retrieval, and the choice of evaluation metric, and present an experimental setup which allows us to evaluate LP methods in a rigorous and controlled manner. We perform extensive experiments with a variety of LP methods over real network datasets in this controlled setup, and gather valuable insights on the interactions of these factors with the performance of LP through an array of carefully designed hypotheses. Following the insights, we provide recommendations to be followed as best practice for evaluating LP methods.
- North America > United States > California > Orange County > Irvine (0.04)
- Europe > Greece > Attica > Athens (0.04)
- Asia > India > Telangana (0.04)
- Asia > India > Andhra Pradesh (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Education (0.70)
- Energy > Power Industry (0.68)
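One of the factors the LP evaluation paper highlights is class imbalance and its impact on early retrieval: with vastly more non-edges than edges, ranking-head metrics matter. A minimal precision-at-k helper illustrates the kind of metric involved (an illustrative helper under toy data, not the paper's exact protocol).

```python
def precision_at_k(scored_pairs, true_edges, k):
    """Early-retrieval metric: of the top-k node pairs ranked by an
    LP method's score, what fraction are actual edges?"""
    top = sorted(scored_pairs, key=lambda p: p[1], reverse=True)[:k]
    hits = sum(1 for pair, _ in top if pair in true_edges)
    return hits / k

# hypothetical LP scores over four candidate pairs
scored = [(("a", "b"), 0.9), (("a", "c"), 0.7),
          (("b", "d"), 0.4), (("c", "d"), 0.2)]
truth = {("a", "b"), ("c", "d")}
p_at_2 = precision_at_k(scored, truth, 2)
```

Unlike AUC, which averages over all pairs, precision@k is dominated by the extreme head of the ranking and so responds very differently to class imbalance -- the interaction the paper's controlled setup is designed to expose.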