 Instructional Material


Missing Data in Signal Processing and Machine Learning: Models, Methods and Modern Approaches

arXiv.org Machine Learning

Missing data appears when parts of the data are not available for a given variable or a given observation. It is a ubiquitous problem in a wide range of scientific disciplines, including sensor networks, geophysical data analysis, radar and image processing, remote sensing, ecological statistics and biomedical studies, just to name a few [1]-[5]. Signal processing is no exception to the rule, where missing data mainly come from sensor malfunction, hidden or impossible measurements, human errors and natural hazards, all of which can hinder a thorough understanding, analysis, and interpretation of the signal. One of the earliest works on missing data was published in 1932 by Wilks, who mentioned the need to extract as much information as possible from fragmentary answers of questionnaires in social sciences and government statistics. Therefore, it is not surprising that the first discipline to witness this issue was mathematical statistics. This led Wilks to derive efficient estimators for the parameters of a bivariate normal distribution when the data contain missing values [6]. This work was extended to the multivariate case by Lord in 1955 [7]. Since the early 1970s, the literature on missing data has flourished with the development of computational capacity, leading to major developments in signal processing and its related fields, such as statistical inference [2], data analysis [8] and machine learning [9]. In particular, the formulation of a missing-data theory framework by Rubin in [10], which describes the relation between missingness and data values in the so-called missing-data mechanisms, has allowed tremendous advancements in statistical analysis.
Therefore, a tutorial paper aiming to summarize the existing and novel strategies in the SP & ML literature addressing various problems related to missing data, such as parameter estimation, matrix completion, missing data imputation and learning with missing values, as well as showing their potential applications, is an urgent desideratum. This tutorial aims to provide practitioners with vital tools, in an accessible way, to answer the question: How to deal with missing data? There are many strategies to handle incomplete signals.
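
The missing-data mechanisms mentioned above are commonly divided into three cases: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). The following is a minimal sketch of how they differ, not code from the tutorial. The function names `mask_values` and `observed_mean` and all missingness probabilities are hypothetical choices made for illustration.

```python
import random


def mask_values(pairs, mechanism, rng):
    """Return the y values of (x, y) pairs, with None where y is missing.

    x is assumed to be always observed; only y can go missing.
    The probabilities below are illustrative, not taken from any paper.
    """
    out = []
    for x, y in pairs:
        if mechanism == "MCAR":
            p_missing = 0.3                    # independent of both x and y
        elif mechanism == "MAR":
            p_missing = 0.6 if x > 0 else 0.1  # depends only on the observed x
        elif mechanism == "MNAR":
            p_missing = 0.6 if y > 0 else 0.1  # depends on the unobserved y itself
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        out.append(None if rng.random() < p_missing else y)
    return out


def observed_mean(values):
    """Complete-case mean: average over the entries that were observed."""
    seen = [v for v in values if v is not None]
    return sum(seen) / len(seen)
```

With correlated `x` and `y`, the complete-case mean stays close to the full-data mean under MCAR but drifts under MAR and MNAR. Under MAR the drift can in principle be corrected using the observed `x`, whereas under MNAR the missingness depends on values that are never seen.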


On the Benefits of Accelerated Optimization in Robust and Private Estimation

arXiv.org Machine Learning

We study the advantages of accelerated gradient methods, specifically based on the Frank-Wolfe method and projected gradient descent, for privacy and heavy-tailed robustness. Our approaches are as follows: For the Frank-Wolfe method, our technique is based on a tailored learning rate and a uniform lower bound on the gradient of the $\ell_2$-norm over the constraint set. For accelerating projected gradient descent, we use the popular variant based on Nesterov's momentum, and we optimize our objective over $\mathbb{R}^p$. These accelerations reduce iteration complexity, translating into stronger statistical guarantees for empirical and population risk minimization. Our analysis covers three settings: non-random data, random model-free data, and parametric models (linear regression and generalized linear models). Methodologically, we approach both privacy and robustness based on noisy gradients. We ensure differential privacy via the Gaussian mechanism and advanced composition, and we achieve heavy-tailed robustness using a geometric median-of-means estimator, which also sharpens the dependency on the dimension of the covariates. Finally, we compare our rates to existing bounds and identify scenarios where our methods attain optimal convergence.
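
As an illustration of the robustness ingredient named in this abstract, the sketch below computes a geometric median-of-means estimate from a set of per-sample gradients. It is an assumption-laden sketch rather than the authors' procedure: the abstract does not say how the geometric median is computed, so Weiszfeld's iteration is used here as one standard choice, and the function names, block count, and iteration budget are hypothetical.

```python
import math


def block_means(points, num_blocks):
    """Split points into num_blocks contiguous blocks and average each block.

    Any remainder that does not fill a whole block is dropped.
    """
    size = len(points) // num_blocks
    dim = len(points[0])
    means = []
    for b in range(num_blocks):
        block = points[b * size:(b + 1) * size]
        means.append([sum(p[j] for p in block) / len(block) for j in range(dim)])
    return means


def geometric_median(points, iters=200, eps=1e-9):
    """Approximate the geometric median with Weiszfeld's fixed-point iteration."""
    dim = len(points[0])
    # Start from the coordinate-wise mean of the points.
    z = [sum(p[j] for p in points) / len(points) for j in range(dim)]
    for _ in range(iters):
        # Each point is weighted by the inverse of its distance to the current
        # estimate; eps guards against division by zero.
        weights = [1.0 / max(math.dist(p, z), eps) for p in points]
        total = sum(weights)
        z = [sum(w * p[j] for w, p in zip(weights, points)) / total
             for j in range(dim)]
    return z


def median_of_means(points, num_blocks):
    """Geometric median of the block means of the given points."""
    return geometric_median(block_means(points, num_blocks))
```

A few grossly corrupted samples can spoil only the blocks they fall into, and the geometric median is insensitive to a minority of spoiled block means, whereas the plain average of all samples is pulled arbitrarily far.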


Student Perspectives on the Benefits and Risks of AI in Education

arXiv.org Artificial Intelligence

The use of chatbots equipped with artificial intelligence (AI) in educational settings has increased in recent years, showing potential to support teaching and learning. However, the adoption of these technologies has raised concerns about their impact on academic integrity, students' ability to problem-solve independently, and potential underlying biases. To better understand students' perspectives and experiences with these tools, a survey was conducted at a large public university in the United States. Through thematic analysis, 262 undergraduate students' responses regarding their perceived benefits and risks of AI chatbots in education were identified and categorized into themes. The results discuss several benefits identified by the students, with feedback and study support, instruction capabilities, and access to information being the most cited. Their primary concerns included risks to academic integrity, accuracy of information, loss of critical thinking skills, the potential development of overreliance, and ethical considerations such as data privacy, system bias, environmental impact, and preservation of human elements in education. While student perceptions align with previously discussed benefits and risks of AI in education, they show heightened concerns about distinguishing between human- and AI-generated work, particularly in cases where authentic work is flagged as AI-generated. To address students' concerns, institutions can establish clear policies regarding AI use and develop curriculum around AI literacy. With these in place, practitioners can effectively develop and implement educational systems that leverage AI's potential in areas such as immediate feedback and personalized learning support. This approach can enhance the quality of students' educational experiences while preserving the integrity of the learning process with AI.


Stereotypical gender actions can be extracted from Web text

arXiv.org Artificial Intelligence

Online social networks and micro-blogging services are no longer limited to the followers of the latest technologies or teenagers, as might once have been expected. Such technology and services are becoming widely adopted by the mainstream population as an integral part of their daily lives (Fox et al., 2009). A very prominent example of such an application is Twitter, a micro-blogging service. Twitter lets its users post very short (at most 140-character) messages, called tweets, about what they have been doing or thinking, or what they want to share with their friends and other people. Every day, tens of millions of tweets are posted by users worldwide. The proliferation of publicly available, user-generated content is a vast source of social data and is already shaping the field of computational social science (Lazer et al., 2009; Thelwall et al., 2010a). Another field which enjoys the abundance of Web-based text is knowledge extraction and automated ontology building. An example application is KNEXT (Knowledge Extraction from Text), a system proposed for extracting "general world knowledge from miscellaneous texts, including fiction" (Schubert and Tong, 2003). Web-based text is increasingly used as a source for everyday knowledge (frequently referred to as commonsense knowledge).


Machine vs Machine: Using AI to Tackle Generative AI Threats in Assessment

arXiv.org Artificial Intelligence

This paper presents a theoretical framework for addressing the challenges posed by generative artificial intelligence (AI) in higher education assessment through a machine-versus-machine approach. As large language models like GPT-4, Claude, and Llama increasingly demonstrate the ability to produce sophisticated academic content, traditional assessment methods face an existential threat, with surveys indicating 74-92% of students experimenting with these tools for academic purposes. Current responses, ranging from detection software to manual assessment redesign, show significant limitations: detection tools demonstrate bias against non-native English writers and can be easily circumvented, while manual frameworks rely heavily on subjective judgment and assume static AI capabilities. This paper introduces a dual strategy paradigm combining static analysis and dynamic testing to create a comprehensive theoretical framework for assessment vulnerability evaluation. The static analysis component comprises eight theoretically justified elements: specificity and contextualization, temporal relevance, process visibility requirements, personalization elements, resource accessibility, multimodal integration, ethical reasoning requirements, and collaborative elements. Each element addresses specific limitations in generative AI capabilities, creating barriers that distinguish authentic human learning from AI-generated simulation. The dynamic testing component provides a complementary approach through simulation-based vulnerability assessment, addressing limitations in pattern-based analysis. The paper presents a theoretical framework for vulnerability scoring, including the conceptual basis for quantitative assessment, weighting frameworks, and threshold determination theory.


KRISTEVA: Close Reading as a Novel Task for Benchmarking Interpretive Reasoning

arXiv.org Artificial Intelligence

Each year, tens of millions of essays are written and graded in college-level English courses. Students are asked to analyze literary and cultural texts through a process known as close reading, in which they gather textual details to formulate evidence-based arguments. Despite being viewed as a basis for critical thinking and widely adopted as a required element of university coursework, close reading has never been evaluated on large language models (LLMs), and multi-discipline benchmarks like MMLU do not include literature as a subject. To fill this gap, we present KRISTEVA, the first close reading benchmark for evaluating interpretive reasoning, consisting of 1331 multiple-choice questions adapted from classroom data. With KRISTEVA, we propose three progressively more difficult sets of tasks to approximate different elements of the close reading process, which we use to test how well LLMs may seem to understand and reason about literary works: 1) extracting stylistic features, 2) retrieving relevant contextual information from parametric knowledge, and 3) multi-hop reasoning between style and external contexts. Our baseline results find that, while state-of-the-art LLMs possess some college-level close reading competency (accuracy 49.7% - 69.7%), their performances still trail those of experienced human evaluators on 10 out of our 11 tasks.


Hierarchical Bayesian Knowledge Tracing in Undergraduate Engineering Education

arXiv.org Machine Learning

Educators teaching entry-level university engineering modules face the challenge of identifying which topics students find most difficult and how to support diverse student needs effectively. This study demonstrates a rigorous yet interpretable statistical approach -- hierarchical Bayesian modeling -- that leverages detailed student response data to quantify both skill difficulty and individual student abilities. Using a large-scale dataset from an undergraduate Statics course, we identified clear patterns of skill mastery and uncovered distinct student subgroups based on their learning trajectories. Our analysis reveals that certain concepts consistently present challenges, requiring targeted instructional support, while others are readily mastered and may benefit from enrichment activities. Importantly, the hierarchical Bayesian method provides educators with intuitive, reliable metrics without sacrificing predictive accuracy. This approach allows for data-informed decisions, enabling personalized teaching strategies to improve student engagement and success. By combining robust statistical methods with clear interpretability, this study equips educators with actionable insights to better support diverse learner populations.


Agnostic Reinforcement Learning: Foundations and Algorithms

arXiv.org Machine Learning

Reinforcement Learning (RL) has demonstrated tremendous empirical success across numerous challenging domains. However, we lack a strong theoretical understanding of the statistical complexity of RL in environments with large state spaces, where function approximation is required for sample-efficient learning. This thesis addresses this gap by rigorously examining the statistical complexity of RL with function approximation from a learning theoretic perspective. Departing from a long history of prior work, we consider the weakest form of function approximation, called agnostic policy learning, in which the learner seeks to find the best policy in a given class $\Pi$, with no guarantee that $\Pi$ contains an optimal policy for the underlying task. We systematically explore agnostic policy learning along three key axes: environment access -- how a learner collects data from the environment; coverage conditions -- intrinsic properties of the underlying MDP measuring the expansiveness of state-occupancy measures for policies in the class $\Pi$; and representational conditions -- structural assumptions on the class $\Pi$ itself. Within this comprehensive framework, we (1) design new learning algorithms with theoretical guarantees and (2) characterize fundamental performance bounds of any algorithm. Our results reveal significant statistical separations that highlight the power and limitations of agnostic policy learning.
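
One common way to write the agnostic objective described above is given here as an illustration. The abstract states no formula, so the notation is assumed: $V^{\pi}$ denotes the expected return of policy $\pi$, $\hat{\pi}$ is the learner's output, and $\varepsilon > 0$ is an accuracy parameter. The learner is asked to return $\hat{\pi}$ such that

$$
V^{\hat{\pi}} \;\ge\; \max_{\pi \in \Pi} V^{\pi} - \varepsilon .
$$

The benchmark is the best policy inside the class, not an optimal policy $\pi^{\star}$ of the MDP, so the guarantee remains meaningful even when $\pi^{\star}$ does not belong to the class.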


Zero-Shot Streaming Text to Speech Synthesis with Transducer and Auto-Regressive Modeling

arXiv.org Artificial Intelligence

Zero-shot streaming text-to-speech is an important research topic in human-computer interaction. Existing methods primarily use a lookahead mechanism, relying on future text to achieve natural streaming speech synthesis, which introduces high processing latency. To address this issue, we propose SMLLE, a streaming framework for generating high-quality speech frame-by-frame. SMLLE employs a Transducer to convert text into semantic tokens in real time while simultaneously obtaining duration alignment information. The combined outputs are then fed into a fully autoregressive (AR) streaming model to reconstruct mel-spectrograms. To further stabilize the generation process, we design a Delete <Bos> Mechanism that allows the AR model to access future text while introducing as little delay as possible. Experimental results suggest that SMLLE outperforms current streaming TTS methods and achieves performance comparable to sentence-level TTS systems. Samples are available on shy-98.github.io/SMLLE_demo_page/.


FormFactory: An Interactive Benchmarking Suite for Multimodal Form-Filling Agents

arXiv.org Artificial Intelligence

Online form filling is a common yet labor-intensive task involving extensive keyboard and mouse interactions. Despite the long-standing vision of automating this process with "one click", existing tools remain largely rule-based and lack generalizable, generative capabilities. Recent advances in Multimodal Large Language Models (MLLMs) have enabled promising agents for GUI-related tasks in general-purpose scenarios. However, they struggle with the unique challenges of form filling, such as flexible layouts and the difficulty of aligning textual instructions with on-screen fields. To bridge this gap, we formally define the form-filling task and propose FormFactory, an interactive benchmarking suite comprising a web-based interface, backend evaluation module, and carefully constructed dataset. Our benchmark covers diverse real-world scenarios, incorporates various field formats, and simulates high-fidelity form interactions. We conduct a comprehensive evaluation of state-of-the-art MLLMs and observe that no model surpasses 5% accuracy, underscoring the inherent difficulty of the task. These findings also reveal significant limitations in current models' visual layout reasoning and field-value alignment abilities. We hope our benchmark can serve as a stepping stone for further research into robust, practical form-filling agents.