Chaudhuri, Surajit
Table-LLM-Specialist: Language Model Specialists for Tables using Iterative Generator-Validator Fine-tuning
Xing, Junjie, He, Yeye, Zhou, Mengyu, Dong, Haoyu, Han, Shi, Zhang, Dongmei, Chaudhuri, Surajit
In this work, we propose Table-LLM-Specialist, or Table-Specialist for short, as a new self-trained fine-tuning paradigm specifically designed for table tasks. Our insight is that for each table task, there often exist two dual versions of the same task, one generative and one classification in nature. Leveraging their duality, we propose a Generator-Validator paradigm to iteratively generate-then-validate training data from language models, in order to fine-tune stronger Table-Specialist models that specialize in a given task, without requiring manually-labeled data. Our extensive evaluations suggest that Table-Specialist has (1) strong performance on diverse table tasks over vanilla language models -- for example, Table-Specialist fine-tuned on GPT-3.5 not only outperforms vanilla GPT-3.5, but can often match or surpass GPT-4 level quality; (2) lower deployment cost, because when Table-Specialist fine-tuned on GPT-3.5 achieves GPT-4 level quality, it becomes possible to deploy smaller models with lower latency and inference cost at comparable quality; and (3) better generalizability when evaluated across multiple benchmarks, since Table-Specialist is fine-tuned on a broad range of training data systematically generated from diverse real tables. Our code and data will be available at https://github.com/microsoft/Table-LLM-Specialist.
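To make the generate-then-validate idea concrete, the following is a minimal, self-contained sketch of such an iterative loop. All names here (propose_answer, validate_answer, fine_tune) are hypothetical stand-ins for language-model and fine-tuning calls; this is not the paper's implementation, only the shape of the paradigm.

```python
# Minimal sketch of an iterative Generator-Validator loop. The helpers
# below simulate language-model calls; they are hypothetical stand-ins,
# not the paper's implementation.
import random

def propose_answer(model, table):
    # Generative side of the task, e.g. filling in a missing value.
    # Simulate a model that answers correctly with some probability.
    truth = table["missing_value"]
    return truth if random.random() < model["accuracy"] else "wrong"

def validate_answer(table, answer):
    # Dual classification side: judge whether a candidate answer is
    # consistent with the table. A real validator would itself be a
    # language-model prompt; here an oracle stands in.
    return answer == table["missing_value"]

def fine_tune(model, training_data):
    # Stand-in for a fine-tuning call: pretend validated data nudges
    # the model's task accuracy upward each round.
    if training_data:
        model["accuracy"] = min(0.99, model["accuracy"] + 0.05)
    return model

model = {"accuracy": 0.7}
tables = [{"missing_value": f"v{i}"} for i in range(100)]

for round_id in range(3):                        # iterative self-training
    validated = []
    for table in tables:
        answer = propose_answer(model, table)    # generator proposes
        if validate_answer(table, answer):       # validator filters noise
            validated.append((table, answer))
    model = fine_tune(model, validated)          # train on validated data only
    print(f"round {round_id}: kept {len(validated)} examples, "
          f"simulated accuracy now {model['accuracy']:.2f}")
```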
Auto-Formula: Recommend Formulas in Spreadsheets using Contrastive Learning for Table Representations
Chen, Sibei, He, Yeye, Cui, Weiwei, Fan, Ju, Ge, Song, Zhang, Haidong, Zhang, Dongmei, Chaudhuri, Surajit
Spreadsheets are widely recognized as the most popular end-user programming tools, which blend the power of formula-based computation with an intuitive table-based interface. Today, spreadsheets are used by billions of users to manipulate tables, most of whom are neither database experts nor professional programmers. Despite the success of spreadsheets, authoring complex formulas remains challenging, as non-technical users need to look up and understand non-trivial formula syntax. To address this pain point, we leverage the observation that there is often an abundance of similar-looking spreadsheets in the same organization, which not only have similar data, but also share similar computation logic encoded as formulas. We develop an Auto-Formula system that can accurately predict formulas that users want to author in a target spreadsheet cell, by learning and adapting formulas that already exist in similar spreadsheets, using contrastive-learning techniques inspired by "similar-face recognition" from computer vision. Extensive evaluations on over 2K test formulas extracted from real enterprise spreadsheets show the effectiveness of Auto-Formula over alternatives. Our benchmark data is available at https://github.com/microsoft/Auto-Formula to facilitate future research.
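One way to picture the retrieval step is as nearest-neighbor search over table representations: embed the context of the target cell, embed the contexts of formula cells in similar spreadsheets, and adapt the best match. The sketch below uses a toy bag-of-words "embedding" and invented data purely for illustration; Auto-Formula instead learns its representations with contrastive training.

```python
# Toy sketch of nearest-neighbor formula recommendation over table
# representations. The bag-of-words "embedding" and the corpus below are
# invented for illustration; Auto-Formula learns embeddings contrastively.
from collections import Counter
import math

def embed(headers):
    # Stand-in embedding: bag of lowercase tokens from column headers.
    return Counter(tok for h in headers for tok in h.lower().split())

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values()))
    norm *= math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Formulas observed in similar spreadsheets from the same organization,
# keyed by the column headers surrounding the cell they were authored in.
corpus = [
    (["revenue", "cost", "profit"], "=B2-C2"),
    (["q1", "q2", "q3", "total"], "=SUM(B2:D2)"),
    (["price", "qty", "amount"], "=B2*C2"),
]

target_context = ["unit price", "quantity", "amount"]
query = embed(target_context)
best = max(corpus, key=lambda item: cosine(query, embed(item[0])))
print("recommended formula:", best[1])   # -> =B2*C2
```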
Table-GPT: Table-tuned GPT for Diverse Table Tasks
Li, Peng, He, Yeye, Yashar, Dror, Cui, Weiwei, Ge, Song, Zhang, Haidong, Fainman, Danielle Rifinski, Zhang, Dongmei, Chaudhuri, Surajit
Language models, such as GPT-3.5 and ChatGPT, demonstrate remarkable abilities to follow diverse human instructions and perform a wide range of tasks. However, when probing language models with a range of basic table-understanding tasks, we observe that today's language models are still sub-optimal in many table-related tasks, likely because they are pre-trained predominantly on one-dimensional natural-language texts, whereas relational tables are two-dimensional objects. In this work, we propose a new "table-tuning" paradigm, where we continue to train/fine-tune language models like GPT-3.5 and ChatGPT, using diverse table-tasks synthesized from real tables as training data, with the goal of enhancing language models' ability to understand tables and perform table tasks. We show that our resulting Table-GPT models demonstrate (1) better table-understanding capabilities, by consistently outperforming the vanilla GPT-3.5 and ChatGPT on a wide range of table tasks, including held-out unseen tasks, and (2) strong generalizability, in their ability to respond to diverse human instructions to perform new table-tasks, in a manner similar to GPT-3.5 and ChatGPT.
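As an illustration of what "table-tasks synthesized from real tables" can look like, the sketch below synthesizes one toy training example for a missing-value-identification task. The prompt template and helper names are invented for this example; Table-GPT's actual task set and instruction formats are described in the paper.

```python
# Toy sketch of synthesizing one "missing-value identification" training
# example from a real table. The prompt template is invented; Table-GPT's
# actual task set and instruction formats differ.
import random

def synthesize_example(table, rng):
    # Blank out one random cell; the (instruction, completion) pair asks
    # the model to locate the missing value.
    rows = [list(r) for r in table["rows"]]       # copy before mutating
    i = rng.randrange(len(rows))
    j = rng.randrange(len(table["columns"]))
    removed = rows[i][j]
    rows[i][j] = ""
    serialized = " | ".join(table["columns"]) + "\n" + \
                 "\n".join(" | ".join(r) for r in rows)
    instruction = ("Identify the row and column of the missing value "
                   "in the table below.\n" + serialized)
    completion = f"row {i + 1}, column '{table['columns'][j]}', value '{removed}'"
    return instruction, completion

table = {
    "columns": ["country", "capital"],
    "rows": [["France", "Paris"], ["Japan", "Tokyo"], ["Kenya", "Nairobi"]],
}
prompt, answer = synthesize_example(table, random.Random(0))
print(prompt)
print("---")
print(answer)
```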
Auto-Tables: Synthesizing Multi-Step Transformations to Relationalize Tables without Using Examples
Li, Peng, He, Yeye, Yan, Cong, Wang, Yue, Chaudhuri, Surajit
Relational tables, where each row corresponds to an entity and each column corresponds to an attribute, have been the standard for tables in relational databases. However, such a standard cannot be taken for granted when dealing with tables "in the wild". Our survey of real spreadsheet-tables and web-tables shows that over 30% of such tables do not conform to the relational standard, and complex table-restructuring transformations are needed before these tables can be queried easily using SQL-based analytics tools. Unfortunately, the required transformations are non-trivial to program, which has become a substantial pain point for technical and non-technical users alike, as evidenced by the large number of forum questions in places like StackOverflow and Excel/Power-BI/Tableau forums. We develop an Auto-Tables system that can automatically synthesize pipelines with multi-step transformations (in Python or other languages) to transform non-relational tables into standard relational forms for downstream analytics, obviating the need for users to manually program transformations. We compile an extensive benchmark for this new task by collecting 244 real test cases from user spreadsheets and online forums. Our evaluation suggests that Auto-Tables can successfully synthesize transformations for over 70% of test cases at interactive speeds, without requiring any input from users, making this an effective tool for both technical and non-technical users to prepare data for analytics.
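To suggest the shape of the underlying search problem, here is a toy brute-force enumeration over a two-operator space (unpivot and transpose, via pandas) that relationalizes a simple "wide" table. Auto-Tables predicts operators and their parameters with a trained model rather than enumerating, and uses a much richer operator set; the relational-form check below is a crude stand-in.

```python
# Toy brute-force sketch of searching over table-restructuring operators
# to relationalize a table. Auto-Tables predicts the operator sequence
# with a trained model; this only illustrates the search space. Requires
# pandas.
import pandas as pd

def op_wide_to_long(df):
    # "Unpivot": fold all but the first column into (variable, value) rows.
    return df.melt(id_vars=df.columns[:1].tolist(),
                   var_name="variable", value_name="value")

def op_transpose(df):
    return df.T.reset_index()

OPERATORS = {"wide_to_long": op_wide_to_long, "transpose": op_transpose}

def looks_relational(df):
    # Crude proxy: headers should name attributes, not hold data values
    # (a header that parses as a number, like "2023", is a giveaway).
    return all(not str(c).isdigit() for c in df.columns)

def search(df, max_steps=2):
    # Enumerate operator sequences up to max_steps; return first success.
    frontier = [([], df)]
    for _ in range(max_steps):
        next_frontier = []
        for steps, cur in frontier:
            for name, op in OPERATORS.items():
                try:
                    out = op(cur)
                except Exception:
                    continue
                if looks_relational(out):
                    return steps + [name], out
                next_frontier.append((steps + [name], out))
        frontier = next_frontier
    return None, df

# A non-relational "wide" table: one column per year.
wide = pd.DataFrame({"city": ["Paris", "Tokyo"],
                     "2022": [11, 14], "2023": [12, 15]})
pipeline, tidy = search(wide)
print("pipeline:", pipeline)   # -> ['wide_to_long']
print(tidy)
```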
Auto-Validate by-History: Auto-Program Data Quality Constraints to Validate Recurring Data Pipelines
Tu, Dezhan, He, Yeye, Cui, Weiwei, Ge, Song, Zhang, Haidong, Han, Shi, Zhang, Dongmei, Chaudhuri, Surajit
Data pipelines are widely employed in modern enterprises to power a variety of Machine-Learning (ML) and Business-Intelligence (BI) applications. Crucially, these pipelines are recurring (e.g., daily or hourly) in production settings, to keep data updated so that ML models can be re-trained regularly and BI dashboards refreshed frequently. However, data quality (DQ) issues can often creep into recurring pipelines because of upstream schema and data drift over time. As modern enterprises operate thousands of recurring pipelines, data engineers today have to spend substantial effort manually monitoring and resolving DQ issues as part of their DataOps and MLOps practices. Given the high human cost of managing large-scale pipeline operations, it is imperative to automate as much of this work as possible. In this work, we propose Auto-Validate-by-History (AVH), which can automatically detect DQ issues in recurring pipelines by leveraging rich statistics from historical executions. We formalize this as an optimization problem, and develop constant-factor approximation algorithms with provable precision guarantees. Extensive evaluations using 2000 production data pipelines at Microsoft demonstrate the effectiveness and efficiency of AVH.
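A minimal sketch of the history-based idea, assuming a simple z-score test on per-run column statistics: collect statistics from past executions, then flag today's run when a statistic deviates sharply from its historical distribution. AVH's actual formulation, which selects constraints by solving an optimization problem with precision guarantees, is considerably more refined; the statistics and threshold below are made up.

```python
# Toy sketch of history-based validation for a recurring pipeline:
# compare today's column statistics against their distribution over past
# runs. AVH's constraint selection and guarantees are more involved.
import random
import statistics

def column_stats(values):
    # Summary statistics of one column from one pipeline run.
    non_null = [v for v in values if v is not None]
    return {
        "row_count": len(values),
        "null_rate": 1 - len(non_null) / len(values),
        "mean": statistics.fmean(non_null),
    }

def validate_against_history(today, history, z_threshold=3.0):
    # Flag any statistic more than z_threshold stdevs from its history.
    alerts = []
    for stat, value in today.items():
        past = [run[stat] for run in history]
        mu, sigma = statistics.fmean(past), statistics.pstdev(past)
        if sigma > 0 and abs(value - mu) / sigma > z_threshold:
            alerts.append((stat, value, mu))
    return alerts

rng = random.Random(0)
history = []
for _ in range(30):                   # 30 past daily runs of the pipeline
    n_null = rng.randint(3, 7)
    vals = [rng.gauss(2.0, 0.5) for _ in range(100 - n_null)] + [None] * n_null
    history.append(column_stats(vals))

# Today's run has drifted upstream: 40% of the column is suddenly null.
today = column_stats([rng.gauss(2.0, 0.5) for _ in range(60)] + [None] * 40)
for stat, value, expected in validate_against_history(today, history):
    print(f"DQ alert: {stat}={value:.3f}, historical mean {expected:.3f}")
```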
Efficient Identification of Approximate Best Configuration of Training in Large Datasets
Huang, Silu, Wang, Chi, Ding, Bolin, Chaudhuri, Surajit
A configuration of training refers to a combination of feature engineering, learner, and the learner's associated hyperparameters. Given a set of configurations and a large dataset randomly split into training and testing sets, we study how to efficiently identify the configuration with approximately the highest testing accuracy when trained on the training set. To guarantee small accuracy loss, we develop a solution using a confidence-interval (CI)-based progressive sampling and pruning strategy. Compared to using the full data to find the exact best configuration, our solution achieves more than two orders of magnitude speedup, while the returned top configuration has identical or close test accuracy.
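A minimal sketch of CI-based progressive sampling and pruning, using a Hoeffding confidence interval as the CI: each configuration is evaluated on progressively larger samples, and a configuration is pruned once its upper confidence bound falls below the best configuration's lower bound. The simulated evaluate step and all constants below are illustrative stand-ins, not the paper's algorithm.

```python
# Toy sketch of CI-based progressive sampling and pruning. Each
# configuration is scored on growing samples; Hoeffding intervals decide
# when a configuration is provably (w.h.p.) worse and can be pruned.
import math
import random

def hoeffding_radius(n, delta=0.05):
    # CI half-width for the mean of n values bounded in [0, 1].
    return math.sqrt(math.log(2 / delta) / (2 * n))

def evaluate(config_accuracy, n, rng):
    # Stand-in for training on a sample of size n and measuring test
    # accuracy: draws n Bernoulli outcomes with the config's true accuracy.
    return sum(rng.random() < config_accuracy for _ in range(n)) / n

true_accuracies = {"cfg_a": 0.82, "cfg_b": 0.80, "cfg_c": 0.65}
rng = random.Random(0)
alive = set(true_accuracies)
n = 100

while len(alive) > 1 and n <= 100_000:
    means = {c: evaluate(true_accuracies[c], n, rng) for c in alive}
    radius = hoeffding_radius(n)
    best_lower = max(means.values()) - radius
    pruned = {c for c in alive if means[c] + radius < best_lower}
    alive -= pruned
    print(f"n={n}: " + ", ".join(f"{c}={means[c]:.3f}" for c in sorted(means))
          + (f", pruned {sorted(pruned)}" if pruned else ""))
    n *= 4                              # progressively larger samples

print("selected:", sorted(alive))
```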