ProteinNPT: Improving Protein Property Prediction and Design with Non-Parametric Transformers. Computer Science, University of Oxford; Debora S. Marks

Neural Information Processing Systems

Protein design holds immense potential for optimizing naturally occurring proteins, with broad applications in drug discovery, material design, and sustainability. However, computational methods for protein engineering are confronted with significant challenges, such as an expansive design space, sparse functional regions, and a scarcity of available labels. These issues are further exacerbated in practice by the fact that most real-life design scenarios necessitate the simultaneous optimization of multiple properties. In this work, we introduce ProteinNPT, a non-parametric transformer variant tailored to protein sequences and particularly suited to label-scarce and multi-task learning settings. We first focus on the supervised fitness prediction setting and develop several cross-validation schemes that support robust performance assessment. We subsequently reimplement prior top-performing baselines, introduce several extensions of these baselines by integrating diverse branches of the protein engineering literature, and demonstrate that ProteinNPT consistently outperforms all of them across a diverse set of protein property prediction tasks. Finally, we demonstrate the value of our approach for iterative protein design across extensive in silico Bayesian optimization and conditional sampling experiments.
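
As a rough, hedged illustration of the non-parametric transformer idea this abstract builds on (alternating attention over positions within each sequence and over the batch of labelled examples), the PyTorch sketch below is a minimal stand-in rather than the ProteinNPT architecture itself; the module name, dimensions, and use of nn.MultiheadAttention are illustrative assumptions.

```python
# Minimal sketch of axial (column + row) attention, NOT the ProteinNPT architecture.
import torch
import torch.nn as nn

class AxialAttentionBlock(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.col_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.row_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_examples, seq_len, dim) -- embeddings of labelled sequences.
        # Column attention mixes information along each sequence independently.
        x = x + self.col_attn(x, x, x, need_weights=False)[0]
        # Row attention runs across the labelled examples at each position.
        xt = x.transpose(0, 1)                      # (seq_len, num_examples, dim)
        xt = xt + self.row_attn(xt, xt, xt, need_weights=False)[0]
        return xt.transpose(0, 1)                   # (num_examples, seq_len, dim)

# Example: 8 labelled sequences of length 50 with 64-dimensional embeddings.
out = AxialAttentionBlock(dim=64)(torch.randn(8, 50, 64))
print(out.shape)  # torch.Size([8, 50, 64])
```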


BanditPAM: Almost Linear Time k-Medoids Clustering via Multi-Armed Bandits. Mo Tiwari, Sebastian Thrun; Department of Computer Science, Stanford University

Neural Information Processing Systems

Clustering is a ubiquitous task in data science. Compared to the commonly used k-means clustering, k-medoids clustering requires the cluster centers to be actual data points and supports arbitrary distance metrics, which permits greater interpretability and the clustering of structured objects. Current state-of-the-art k-medoids clustering algorithms, such as Partitioning Around Medoids (PAM), are iterative and are quadratic in the dataset size n for each iteration, which makes them prohibitively expensive for large datasets.
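
The intuition behind the bandit-based speedup can be sketched independently of the full algorithm: a candidate medoid's total loss can be estimated from a random subsample of reference points instead of all n, so candidates can be compared far more cheaply than with exhaustive pairwise distance computations. The snippet below is a deliberately simplified, fixed-sample illustration of that idea, not BanditPAM itself (which adaptively tightens per-candidate confidence intervals); the L1 metric and sample size are arbitrary choices.

```python
# Simplified subsampling illustration of the idea behind bandit-based medoid search.
import numpy as np

def estimate_medoid_losses(X: np.ndarray, sample_size: int = 100, seed: int = 0) -> np.ndarray:
    """Estimate, for every point, its mean distance to a random reference sample."""
    rng = np.random.default_rng(seed)
    ref = X[rng.choice(len(X), size=min(sample_size, len(X)), replace=False)]
    # Pairwise L1 distances between all candidate medoids and the sampled references.
    d = np.abs(X[:, None, :] - ref[None, :, :]).sum(axis=-1)
    return d.mean(axis=1)

X = np.random.default_rng(1).normal(size=(5000, 10))
losses = estimate_medoid_losses(X)
print("estimated best first medoid:", int(np.argmin(losses)))
```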


Curriculum Design for Teaching via Demonstrations: Theory and Applications

Neural Information Processing Systems

We consider the problem of teaching via demonstrations in sequential decision-making settings. In particular, we study how to design a personalized curriculum over demonstrations to speed up the learner's convergence. We provide a unified curriculum strategy for two popular learner models: Maximum Causal Entropy Inverse Reinforcement Learning (MaxEnt-IRL) and Cross-Entropy Behavioral Cloning (CrossEnt-BC). Our unified strategy induces a ranking over demonstrations based on a notion of difficulty scores computed w.r.t. the teacher's optimal policy and the learner's current policy. Compared to the state of the art, our strategy does not require access to the learner's internal dynamics and still enjoys similar convergence guarantees under mild technical conditions. Furthermore, we adapt our curriculum strategy to the setting where no teacher agent is present using task-specific difficulty scores. Experiments on a synthetic car driving environment and navigation-based environments demonstrate the effectiveness of our curriculum strategy.
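
As a toy illustration of ordering demonstrations by difficulty, the sketch below ranks demonstrations by the gap between the teacher's and the learner's log-likelihood of the demonstrated actions. This gap score is a hypothetical stand-in chosen for illustration, not the paper's exact difficulty score, and the dummy policies are purely synthetic.

```python
# Hypothetical curriculum ordering by a teacher-vs-learner log-likelihood gap.
import numpy as np

def difficulty_score(states, actions, teacher_logp, learner_logp):
    """Gap between teacher and learner log-likelihood of a demonstration."""
    teacher = sum(teacher_logp(s, a) for s, a in zip(states, actions))
    learner = sum(learner_logp(s, a) for s, a in zip(states, actions))
    return teacher - learner

def order_curriculum(demos, teacher_logp, learner_logp):
    """Present demonstrations with the smallest teacher-learner gap first."""
    scores = [difficulty_score(s, a, teacher_logp, learner_logp) for s, a in demos]
    return [demos[i] for i in np.argsort(scores)]

# Dummy tabular log-probabilities (purely illustrative).
teacher = lambda s, a: 0.0 if a == "right" else -2.0
learner = lambda s, a: -0.5 if s < 3 else -3.0
demos = [([0, 1], ["right", "right"]), ([4, 5], ["right", "left"])]
print(order_curriculum(demos, teacher, learner))  # easier demonstration comes first
```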


Machine Learning-Based Research on the Adaptability of Adolescents to Online Education

arXiv.org Artificial Intelligence

With the rapid advancement of internet technology, the adaptability of adolescents to online learning has emerged as a focal point of interest within the educational sphere. However, the academic community's efforts to develop predictive models for adolescent online learning adaptability require further refinement and expansion. Utilizing data from the "Chinese Adolescent Online Education Survey" spanning the years 2014 to 2016, this study implements five machine learning algorithms (logistic regression, K-nearest neighbors, random forest, XGBoost, and CatBoost) to analyze the factors influencing adolescent online learning adaptability and to determine the model best suited for prediction. The research reveals that the duration of courses, the financial status of the family, and age are the primary factors affecting students' adaptability in online learning environments, with age having a particularly strong influence on adaptive capacity. Among the predictive models, the random forest, XGBoost, and CatBoost algorithms demonstrate superior forecasting capabilities, with the random forest model being particularly adept at capturing the characteristics of students' adaptability.
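
A minimal sketch of this kind of model comparison with scikit-learn is given below; the file name and feature columns are hypothetical placeholders, and the XGBoost/CatBoost models as well as the paper's actual preprocessing are omitted.

```python
# Sketch of cross-validated model comparison on a hypothetical adaptability dataset.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

df = pd.read_csv("survey.csv")                        # hypothetical file
X = df[["course_duration", "family_income", "age"]]   # hypothetical column names
y = df["adaptability"]

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=15),
    "random_forest": RandomForestClassifier(n_estimators=300, random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```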


Time Series Analysis for Education: Methods, Applications, and Future Directions

arXiv.org Artificial Intelligence

Recent advancements in the collection and analysis of sequential educational data have brought time series analysis to a pivotal position in educational research, highlighting its essential role in facilitating data-driven decision-making. However, there is a lack of comprehensive summaries that consolidate these advancements. To the best of our knowledge, this paper is the first to provide a comprehensive review of time series analysis techniques specifically within the educational context. We begin by exploring the landscape of educational data analytics, categorizing various data sources and types relevant to education. We then review four prominent time series methods (forecasting, classification, clustering, and anomaly detection), illustrating their specific application points in educational settings. Subsequently, we present a range of educational scenarios and applications, focusing on how these methods are employed to address diverse educational tasks, which highlights the practical integration of multiple time series methods to solve complex educational problems. Finally, we conclude with a discussion on future directions, including personalized learning analytics, multimodal data fusion, and the role of large language models (LLMs) in educational time series. The contributions of this paper include a detailed taxonomy of educational data, a synthesis of time series techniques with specific educational applications, and a forward-looking perspective on emerging trends and future research opportunities in educational analysis. The related papers and resources are available and regularly updated at the project page.
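
As a toy illustration of how two of these method families (forecasting and anomaly detection) look in code, the sketch below uses a hypothetical weekly activity-count series; it is not drawn from any system surveyed in the paper.

```python
# Naive moving-average forecast and z-score anomaly flags on hypothetical weekly counts.
import numpy as np

activity = np.array([12, 14, 13, 15, 16, 14, 40, 15, 17, 16, 18, 17], dtype=float)

window = 4
forecast = activity[-window:].mean()            # naive forecast for next week
z = (activity - activity.mean()) / activity.std()
anomalies = np.where(np.abs(z) > 2.0)[0]        # weeks with unusual engagement

print(f"next-week forecast: {forecast:.1f}")
print("anomalous weeks:", anomalies.tolist())   # flags the spike in week 6
```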


Multimodal Methods for Analyzing Learning and Training Environments: A Systematic Literature Review

arXiv.org Artificial Intelligence

Recent technological advancements have enhanced our ability to collect and analyze rich multimodal data (e.g., speech, video, and eye gaze) to better inform learning and training experiences. While previous reviews have focused on parts of the multimodal pipeline (e.g., conceptual models and data fusion), a comprehensive literature review on the methods informing multimodal learning and training environments has not been conducted. This literature review provides an in-depth analysis of research methods in these environments, proposing a taxonomy and framework that encapsulates recent methodological advances in this field and characterizes the multimodal domain in terms of five modality groups: Natural Language, Video, Sensors, Human-Centered, and Environment Logs. We introduce a novel data fusion category, mid fusion, and a graph-based technique for refining literature reviews, termed citation graph pruning. Our analysis reveals that leveraging multiple modalities offers a more holistic understanding of the behaviors and outcomes of learners and trainees. Even when multimodality does not enhance predictive accuracy, it often uncovers patterns that contextualize and elucidate unimodal data, revealing subtleties that a single modality may miss. However, there remains a need for further research to bridge the divide between multimodal learning and training studies and foundational AI research.
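
A schematic sketch of what mid fusion refers to is shown below, assuming two hypothetical modality encoders (speech features and gaze features) whose intermediate representations are concatenated before a shared prediction head; the dimensions and architecture are illustrative and not taken from the review. Contrast this with early fusion of raw inputs and late fusion of per-modality predictions.

```python
# Hypothetical mid-fusion model: fuse intermediate features, not raw inputs or outputs.
import torch
import torch.nn as nn

class MidFusionModel(nn.Module):
    def __init__(self, speech_dim=40, gaze_dim=6, hidden=64, n_classes=3):
        super().__init__()
        self.speech_enc = nn.Sequential(nn.Linear(speech_dim, hidden), nn.ReLU())
        self.gaze_enc = nn.Sequential(nn.Linear(gaze_dim, hidden), nn.ReLU())
        self.head = nn.Linear(2 * hidden, n_classes)    # fusion happens here

    def forward(self, speech, gaze):
        fused = torch.cat([self.speech_enc(speech), self.gaze_enc(gaze)], dim=-1)
        return self.head(fused)

logits = MidFusionModel()(torch.randn(8, 40), torch.randn(8, 6))
print(logits.shape)  # torch.Size([8, 3])
```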


AI-Powered Dynamic Fault Detection and Performance Assessment in Photovoltaic Systems

arXiv.org Artificial Intelligence

The intermittent nature of photovoltaic (PV) solar energy, driven by variable weather, leads to power losses of 10-70% and an average energy production decrease of 25%. Accurate loss characterization and fault detection, with this data integrated into control signal monitoring systems, are crucial for reliable PV system performance and efficiency. Computational modeling of PV systems supports technological, economic, and performance analyses, but current models are often rigid, limiting advanced performance optimization and innovation. Conventional fault detection strategies are costly and often yield unreliable results due to complex data signal profiles. Artificial intelligence (AI), especially machine learning algorithms, offers improved fault detection by analyzing relationships between input parameters (e.g., meteorological and electrical) and output metrics (e.g., production). Once trained, these models can effectively identify faults by detecting deviations from expected performance. This research presents a computational model using the PVlib library in Python, incorporating a dynamic loss quantification algorithm that processes meteorological, operational, and technical data. An artificial neural network (ANN) trained on synthetic datasets with a five-minute resolution simulates real-world PV system faults. A dynamic threshold definition for fault detection is based on historical data from a PV system at Universidad de los Andes. Key contributions include: (i) a PV system model with a mean absolute error of 6.0% in daily energy estimation; (ii) dynamic loss quantification without specialized equipment; (iii) an AI-based algorithm for technical parameter estimation, avoiding special monitoring devices; and (iv) a fault detection model achieving 82.2% mean accuracy and 92.6% maximum accuracy.
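
A simplified sketch of residual-based fault flagging with a rolling, data-driven threshold is shown below, in the spirit of the dynamic-threshold idea described above; the window length, multiplier, synthetic signals, and assumption of five-minute sampling are illustrative, and this is not the paper's ANN-based model.

```python
# Flag faults when measured power falls below predicted power by more than a rolling threshold.
import numpy as np

def detect_faults(p_measured: np.ndarray, p_predicted: np.ndarray,
                  window: int = 288, k: float = 3.0) -> np.ndarray:
    """Boolean fault flags from residuals compared against a rolling threshold.

    window=288 corresponds to one day of five-minute samples; k scales the
    rolling residual spread used as the dynamic threshold.
    """
    residual = p_predicted - p_measured            # positive when underproducing
    flags = np.zeros_like(residual, dtype=bool)
    for i in range(window, len(residual)):
        hist = residual[i - window:i]
        flags[i] = residual[i] > hist.mean() + k * hist.std()
    return flags

t = np.arange(2000)
p_pred = np.clip(np.sin(t / 288 * 2 * np.pi), 0, None)   # idealized daily production shape
p_meas = p_pred * 0.98
p_meas[1500:1550] *= 0.4                                  # injected fault period
print("flagged samples:", int(detect_faults(p_meas, p_pred).sum()))
```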


Generalized Encouragement-Based Instrumental Variables for Counterfactual Regression

arXiv.org Machine Learning

In causal inference, encouragement designs (EDs) are widely used to analyze causal effects when randomized controlled trials (RCTs) are impractical or compliance to treatment cannot be perfectly enforced. Unlike RCTs, which directly allocate treatments, EDs randomly assign encouragement policies that positively motivate individuals to engage in a specific treatment. These random encouragements act as instrumental variables (IVs), facilitating the identification of causal effects through leveraging exogenous perturbations in discrete treatment scenarios. However, real-world applications of encouragement designs often face challenges such as incomplete randomization, limited experimental data, and significantly fewer encouragements compared to treatments, hindering precise causal effect estimation. To address these challenges, this paper introduces novel theories and algorithms for identifying the Conditional Average Treatment Effect (CATE) using variations in encouragement. Further, by leveraging both observational and encouragement data, we propose a generalized IV estimator, named Encouragement-based Counterfactual Regression (EnCounteR), to effectively estimate the causal effects. Extensive experiments on both synthetic and real-world datasets demonstrate the superiority of EnCounteR over existing methods.
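
For orientation, the classic baseline that such generalized IV estimators build on is two-stage least squares (2SLS) with the randomized encouragement as the instrument for the treatment. The synthetic-data sketch below shows only that standard baseline, not the EnCounteR estimator.

```python
# Classic 2SLS with a random encouragement Z as the instrument (baseline, not EnCounteR).
import numpy as np

rng = np.random.default_rng(0)
n = 5000
u = rng.normal(size=n)                       # unobserved confounder
z = rng.binomial(1, 0.5, size=n)             # randomized encouragement
t = (0.3 + 0.4 * z + 0.5 * u + rng.normal(scale=0.5, size=n) > 0.5).astype(float)
y = 2.0 * t + u + rng.normal(size=n)         # true treatment effect = 2.0

# Stage 1: regress treatment on the instrument; Stage 2: regress outcome on fitted T.
Z = np.column_stack([np.ones(n), z])
t_hat = Z @ np.linalg.lstsq(Z, t, rcond=None)[0]
X = np.column_stack([np.ones(n), t_hat])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
print(f"2SLS treatment-effect estimate: {beta[1]:.2f}")  # should be near 2.0, up to noise
```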