Collaborating Authors

 Kuang, Wei


Online Covariance Matrix Estimation in Sketched Newton Methods

arXiv.org Machine Learning

Given the ubiquity of streaming data, online algorithms have been widely used for parameter estimation, with second-order methods particularly standing out for their efficiency and robustness. In this paper, we study an online sketched Newton method that leverages a randomized sketching technique to perform an approximate Newton step in each iteration, thereby eliminating the computational bottleneck of second-order methods. While existing studies have established the asymptotic normality of sketched Newton methods, a consistent estimator of the limiting covariance matrix remains an open problem. We propose a fully online covariance matrix estimator that is constructed entirely from the Newton iterates and requires no matrix factorization. Compared to covariance estimators for first-order online methods, our estimator for second-order methods is batch-free. We establish the consistency and convergence rate of our estimator, and coupled with asymptotic normality results, we can then perform online statistical inference for the model parameters based on sketched Newton methods. We also discuss the extension of our estimator to constrained problems, and demonstrate its superior performance on regression problems as well as benchmark problems in the CUTEst set.
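As a rough illustration of the idea of an approximate Newton step computed by sketching, the following minimal sketch runs an online Newton-type iteration on streaming least-squares data. It is not the paper's algorithm and does not include the proposed covariance estimator. Randomized Kaczmarz (single-row sketches) stands in for the sketching solver, and the 1/t step size, the identity initialization of the Hessian average, and the names `kaczmarz_solve` and `online_sketched_newton` are all assumptions made for this example.

```python
import numpy as np


def kaczmarz_solve(B, g, num_steps, rng):
    """Approximately solve B d = -g with randomized Kaczmarz (row sketches)."""
    d = np.zeros_like(g, dtype=float)
    row_norms = np.sum(B * B, axis=1)      # squared norm of each row of B
    probs = row_norms / row_norms.sum()    # sample rows proportionally to their norm
    for _ in range(num_steps):
        i = rng.choice(len(g), p=probs)
        residual = B[i] @ d + g[i]         # violation of the i-th equation
        d -= (residual / row_norms[i]) * B[i]  # project onto the i-th equation
    return d


def online_sketched_newton(samples, dim, num_sketch_steps=20, seed=0):
    """Online Newton-type iteration for least squares with an inexact solve.

    `samples` yields (a, b) pairs with per-sample loss 0.5 * (a @ x - b) ** 2.
    """
    rng = np.random.default_rng(seed)
    x = np.zeros(dim)
    B = np.eye(dim)  # running Hessian average; identity start is an assumption
    for t, (a, b) in enumerate(samples, start=1):
        g = a * (a @ x - b)                  # stochastic gradient at x
        B += (np.outer(a, a) - B) / (t + 1)  # update the Hessian average
        d = kaczmarz_solve(B, g, num_sketch_steps, rng)  # approximate Newton direction
        x = x + d / t                        # decaying step size 1/t (assumption)
    return x


# Usage on synthetic streaming regression data (hypothetical setup).
data_rng = np.random.default_rng(1)
x_true = np.array([1.0, -2.0, 0.5])
stream = []
for _ in range(5000):
    a = data_rng.normal(size=3)
    stream.append((a, a @ x_true + 0.1 * data_rng.normal()))
x_hat = online_sketched_newton(stream, dim=3)
print(x_hat)
```

The inner solver touches only one row of the Hessian estimate per step and never factorizes the matrix, which is the cost saving the abstract attributes to sketching.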


Automated Root Cause Analysis System for Complex Data Products

arXiv.org Artificial Intelligence

We present ARCAS (Automated Root Cause Analysis System), a diagnostic platform based on a Domain Specific Language (DSL) built for fast diagnostic implementation and a shallow learning curve. ARCAS is composed of a constellation of automated troubleshooting guides (Auto-TSGs) that can execute in parallel to detect issues using product telemetry and apply mitigation in near-real-time. The DSL is tailored specifically to ensure that subject matter experts can deliver highly curated and relevant Auto-TSGs in a short time without having to understand how they will interact with the rest of the diagnostic platform, thus reducing time-to-mitigate and saving crucial engineering cycles when they matter most. This contrasts with platforms like Datadog and New Relic, which primarily focus on monitoring and require manual intervention for mitigation. ARCAS uses a Large Language Model (LLM) to prioritize Auto-TSG outputs and take appropriate actions, thus removing the costly requirement of understanding the general behavior of the system. We explain the key concepts behind ARCAS and demonstrate how it has been successfully used for multiple products across Azure Synapse Analytics and Microsoft Fabric Synapse Data Warehouse.


Transfer Learning-Based Co-Run Scheduling for Heterogeneous Datacenters

AAAI Conferences

Today’s data centers are built with multi-core CPUs, so multiple virtual machines (VMs) can be co-located on one physical machine, or multiple computing tasks can be distributed onto one physical machine. The result is co-tenancy, resource sharing, and competition. Modeling and predicting such co-run interference becomes crucial for job scheduling and Quality of Service assurance. Co-location interference can be characterized by two components, sensitivity and pressure, where sensitivity characterizes how an application’s own performance is affected by a co-run application, and pressure characterizes how much contention an application exerts on the memory subsystem. Previous studies show that with simple models, sensitivity and pressure can be accurately characterized for a single machine. We extend the models to consider cross-architecture sensitivity (across different machines).