elt
Towards a Unified Analysis of Kernel-based Methods Under Covariate Shift
Covariate shift, where the input distributions of the source and target data differ substantially, is prevalent in practice. Despite its practical importance in various learning problems, most existing methods focus only on specific learning tasks and are not well validated theoretically or numerically. To tackle this problem, we propose a unified analysis of general nonparametric methods in a reproducing kernel Hilbert space (RKHS) under covariate shift. Our theoretical results are established for a general loss belonging to a rich loss function family, which includes many commonly used methods as special cases, such as mean regression, quantile regression, likelihood-based classification, and margin-based classification. This paper focuses on two types of covariate shift problems and establishes sharp convergence rates for a general loss function, providing a unified theoretical analysis that concurs with the optimal results in the literature where the squared loss is used. Extensive numerical studies on synthetic and real examples confirm our theoretical findings and further illustrate the effectiveness of our proposed method.
Sample-Efficient Learning of Correlated Equilibria in Extensive-Form Games
Imperfect-Information Extensive-Form Games (IIEFGs) are a prevalent model for real-world games involving imperfect information and sequential play. The Extensive-Form Correlated Equilibrium (EFCE) has been proposed as a natural solution concept for multi-player general-sum IIEFGs. However, existing algorithms for finding an EFCE require full feedback from the game, and it remains open how to efficiently learn the EFCE in the more challenging bandit feedback setting, where the game can only be learned from observations of repeated play. This paper presents the first sample-efficient algorithm for learning the EFCE from bandit feedback. We begin by proposing K-EFCE, a generalized definition that allows players to observe and deviate from the recommended actions K times. The K-EFCE includes the EFCE as a special case at K = 1, and is an increasingly strict notion of equilibrium as K increases.
CLT-Optimal Parameter Error Bounds for Linear System Identification
There has been remarkable progress over the past decade in establishing finite-sample, non-asymptotic bounds on recovering unknown system parameters from observed system behavior. Surprisingly, however, we show that the current state-of-the-art bounds do not accurately capture the statistical complexity of system identification, even in the most fundamental setting of estimating a discrete-time linear dynamical system (LDS) via ordinary least-squares regression (OLS). Specifically, we utilize asymptotic normality to identify classes of problem instances for which current bounds overstate the squared parameter error, in both spectral and Frobenius norm, by a factor of the state-dimension of the system. Informed by this discrepancy, we then sharpen the OLS parameter error bounds via a novel second-order decomposition of the parameter error, where crucially the lower-order term is a matrix-valued martingale that we show correctly captures the CLT scaling. From our analysis we obtain finite-sample bounds for both (i) stable systems and (ii) the many-trajectories setting that match the instance-specific optimal rates up to constant factors in Frobenius norm, and polylogarithmic state-dimension factors in spectral norm.
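As a rough illustration of the setting in the abstract above (estimating a discrete-time linear dynamical system by ordinary least squares), here is a minimal sketch for the scalar case x_{t+1} = a x_t + w_t. It is not the paper's analysis or code: the function names, the noise model, and the chosen values of a and the trajectory length are assumptions made for illustration, and with state dimension 1 the dimension factors discussed in the abstract do not arise.

```python
import random


def simulate_scalar_lds(a, steps, noise_std=1.0, seed=0):
    """Simulate x_{t+1} = a * x_t + w_t with Gaussian noise, starting from x_0 = 0."""
    rng = random.Random(seed)
    xs = [0.0]
    for _ in range(steps):
        xs.append(a * xs[-1] + rng.gauss(0.0, noise_std))
    return xs


def ols_estimate(xs):
    """Scalar OLS: minimise sum_t (x_{t+1} - a * x_t)^2 over a."""
    num = sum(x_next * x for x, x_next in zip(xs, xs[1:]))
    den = sum(x * x for x in xs[:-1])
    return num / den


# Illustrative values only (assumptions, not taken from the paper).
true_a = 0.8
trajectory = simulate_scalar_lds(true_a, steps=5000)
a_hat = ols_estimate(trajectory)
print(abs(a_hat - true_a))  # parameter error for this single trajectory
```

Rerunning with longer trajectories or different seeds gives a feel for how the parameter error shrinks with the amount of data in this simple stable case.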
MELT: Materials-aware Continued Pre-training for Language Model Adaptation to Materials Science
Kim, Junho, Kim, Yeachan, Park, Jun-Hyung, Oh, Yerim, Kim, Suho, Lee, SangKeun
We introduce a novel continued pre-training method, MELT (MatEriaLs-aware continued pre-Training), specifically designed to efficiently adapt pre-trained language models (PLMs) to materials science. Unlike previous adaptation strategies that focus solely on constructing a domain-specific corpus, MELT comprehensively considers both the corpus and the training strategy, given that the materials science corpus has distinct characteristics from those of other domains. To this end, we first construct a comprehensive materials knowledge base from the scientific corpus by building semantic graphs. Leveraging this extracted knowledge, we integrate a curriculum into the adaptation process that begins with familiar and generalized concepts and progressively moves toward more specialized terms. We conduct extensive experiments across diverse benchmarks to verify the effectiveness and generality of MELT. A comprehensive evaluation supports the strength of MELT, demonstrating superior performance compared to existing continued pre-training methods. The in-depth analysis also shows that MELT enables PLMs to represent materials entities more effectively than existing adaptation methods, thereby highlighting its broad applicability across a wide spectrum of materials science.
MELT: Mining Effective Lightweight Transformations from Pull Requests
Ramos, Daniel, Mitchell, Hailie, Lynce, Inês, Manquinho, Vasco, Martins, Ruben, Goues, Claire Le
Software developers often struggle to update APIs, leading to manual, time-consuming, and error-prone processes. We introduce MELT, a new approach that generates lightweight API migration rules directly from pull requests in popular library repositories. Our key insight is that pull requests merged into open-source libraries are a rich source of information sufficient to mine API migration rules. By leveraging code examples mined from the library source and automatically generated code examples based on the pull requests, we infer transformation rules in Comby, a language for structural code search and replace. Since rules inferred from single code examples may be too specific, we propose a generalization procedure to make the rules more applicable to client projects. MELT rules are syntax-driven, interpretable, and easily adaptable. Moreover, unlike previous work, our approach enables rule inference to seamlessly integrate into the library workflow, removing the need to wait for client code migrations. We evaluated MELT on pull requests from four popular libraries, successfully mining 461 migration rules from code examples in pull requests and 114 rules from auto-generated code examples. Our generalization procedure increases the number of matches for mined rules by 9x. We applied these rules to client projects and ran their tests, which led to an overall decrease in the number of warnings and fixed some test cases, demonstrating MELT's effectiveness in real-world scenarios.
ETL vs ELT: Which One is Right for Your Data Pipeline? - KDnuggets
ETL and ELT are data integration pipelines that transfer data from multiple sources to a single centralized source and perform some transformation and processing steps on it. The difference between the two is that ETL transforms the data before loading, while ELT transforms the data after loading. But before diving deeply into them, let's first understand the meaning of E, L, and T. T for Transform - Transforming the data is the process of cleaning and modifying the data into a format that can be used for business analysis. L for Loading - It involves loading data into a target system, which may be a data warehouse or a database. ETL was the first standardized data integration method; it emerged in the 1970s due to the evolution of disk storage.
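The ordering difference described above can be sketched with Python's built-in sqlite3 module standing in for the target system. This is only an illustrative sketch, not code from the article: the sample rows, table names, and function names are hypothetical. In the ETL path the cleaning happens in application code before loading; in the ELT path the raw rows are loaded first and the cleaning is done afterwards with SQL inside the database.

```python
import sqlite3

# Hypothetical raw records standing in for data extracted from a source.
RAW_ROWS = [(" Alice ", "10.5"), ("BOB", "7"), ("  carol", "3.25")]


def run_etl(conn):
    """ETL: transform in application code, then load the cleaned rows."""
    conn.execute("CREATE TABLE etl_sales (name TEXT, amount REAL)")
    cleaned = [(name.strip().lower(), float(amount)) for name, amount in RAW_ROWS]
    conn.executemany("INSERT INTO etl_sales VALUES (?, ?)", cleaned)
    return conn.execute("SELECT name, amount FROM etl_sales ORDER BY name").fetchall()


def run_elt(conn):
    """ELT: load raw rows unchanged, then transform inside the target with SQL."""
    conn.execute("CREATE TABLE raw_sales (name TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_sales VALUES (?, ?)", RAW_ROWS)
    conn.execute(
        "CREATE TABLE elt_sales AS "
        "SELECT lower(trim(name)) AS name, CAST(amount AS REAL) AS amount "
        "FROM raw_sales"
    )
    return conn.execute("SELECT name, amount FROM elt_sales ORDER BY name").fetchall()


conn = sqlite3.connect(":memory:")
etl_rows = run_etl(conn)
elt_rows = run_elt(conn)
conn.close()
print(etl_rows == elt_rows)
```

Both paths end with the same cleaned table here; the practical difference is that the ELT path also keeps the untransformed `raw_sales` table in the target system, so the transformation can be revised and rerun without extracting the data again.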
ETL or ELT? The Big Data age calls for the right integration strategy - ET CIO
By Vikram Labhe. It is a truism at this point to talk of the centrality of data for organisations. According to IDC, the global datasphere will rise at a compound annual growth rate (CAGR) of 23% between 2020 and 2025, highlighting the importance of responding to the surge in storage demand. For businesses to leverage data insights and drive growth, they must coordinate the dependencies and execute the different tasks on their data journey in the desired order, all while ensuring minimal impact from potential errors. Whether an organisation favours extract, transform, load (ETL) or extract, load, transform (ELT) will depend on its specific needs. Orchestration is fundamental for modern data processes, but for many businesses a modern data stack makes specific orchestration tools redundant.
Cloud turns data transformation on its head
The traditional data transformation procedure of extract, transform and load (ETL) is rapidly being turned on its head in a modern twist enabled by cloud technologies. The cloud's lower costs, its flexibility and scalability, and the huge processing capability of cloud data warehouses have driven a major change: the ability to load all data into the cloud before transforming it. This trend means that ETL itself has been transformed into extract, load and transform, or ELT. ELT offers several advantages, including retention of data granularity, reduced need for expensive software engineers and significantly reduced project turnaround times. Data is vital for organizations, which use it to understand their customers, identify new opportunities and support decision-makers with mission-critical and up-to-date information.
Why the Future of ETL Is Not ELT, But EL(T) - KDnuggets
How we store and manage data has completely changed over the last decade. We moved from an ETL world to an ELT world, with companies like Fivetran pushing the trend. However, we don't think it is going to stop there; in our minds, ELT is a transition towards EL(T) (with EL decoupled from T). To understand this, we need to discern the underlying reasons for this trend, as they might show what's in store for the future. This is what we will be doing in this article. Historically, the data pipeline process consisted of extracting, transforming, and loading data into a warehouse or a data lake.
ETL & ELT, a comparison
When designing and building data pipelines to load data into data warehouses, you might have heard of the common ETL and ELT paradigms. This post goes over what they mean, their differences, and which paradigm you might want to choose. ELT is very similar, but the data is loaded into a table before being transformed into a final table, which is used by users. As you can see, it has fewer components compared to the ETL approach.