AITopics | Chan, Jun Shern

Collaborating Authors

Chan, Jun Shern

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

Chan, Jun Shern, Chowdhury, Neil, Jaffe, Oliver, Aung, James, Sherburn, Dane, Mays, Evan, Starace, Giulio, Liu, Kevin, Maksin, Leon, Patwardhan, Tejal, Weng, Lilian, Mądry, Aleksander

arXiv.org Artificial IntelligenceDec-20-2024

We introduce MLE-bench, a benchmark for measuring how well AI agents perform at machine learning engineering. To this end, we curate 75 ML engineering-related competitions from Kaggle, creating a diverse set of challenging tasks that test real-world ML engineering skills such as training models, preparing datasets, and running experiments. We establish human baselines for each competition using Kaggle's publicly available leaderboards. We use open-source agent scaffolds to evaluate several frontier language models on our benchmark, finding that the best-performing setup--OpenAI's o1-preview with AIDE scaffolding--achieves at least the level of a Kaggle bronze medal in 16.9% of competitions. In addition to our main results, we investigate various forms of resource scaling for AI agents and the impact of contamination from pre-training. We open-source our benchmark code (github.com/openai/mle-bench/) to facilitate future research in understanding the ML engineering capabilities of AI agents.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2410.07095

Genre:

Personal > Honors (1.00)
Research Report > Experimental Study (0.93)
Research Report > New Finding (0.68)

Industry:

Materials > Chemicals > Industrial Gases > Liquified Gas (1.00)
Materials > Chemicals > Commodity Chemicals > Petrochemicals > LNG (1.00)
Energy > Oil & Gas > Midstream (1.00)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.45)

Add feedback

Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark

Pan, Alexander, Chan, Jun Shern, Zou, Andy, Li, Nathaniel, Basart, Steven, Woodside, Thomas, Ng, Jonathan, Zhang, Hanlin, Emmons, Scott, Hendrycks, Dan

arXiv.org Artificial IntelligenceJun-12-2023

Artificial agents have traditionally been trained to maximize reward, which may incentivize power-seeking and deception, analogous to how next-token prediction in language models (LMs) may incentivize toxicity. So do agents naturally learn to be Machiavellian? And how do we measure these behaviors in general-purpose models such as GPT-4? Towards answering these questions, we introduce MACHIAVELLI, a benchmark of 134 Choose-Your-Own-Adventure games containing over half a million rich, diverse scenarios that center on social decision-making. Scenario labeling is automated with LMs, which are more performant than human annotators. We mathematize dozens of harmful behaviors and use our annotations to evaluate agents' tendencies to be power-seeking, cause disutility, and commit ethical violations. We observe some tension between maximizing reward and behaving ethically. To improve this trade-off, we investigate LM-based methods to steer agents' towards less harmful behaviors. Our results show that agents can both act competently and morally, so concrete progress can currently be made in machine ethics--designing agents that are Pareto improvements in both safety and capabilities.

machine learning, natural language, reinforcement learning, (20 more...)

arXiv.org Artificial Intelligence

2304.03279

Country: North America > United States > California (0.28)

Genre: Research Report > New Finding (0.86)

Industry:

Law (1.00)
Government (1.00)
Education (1.00)
(3 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.50)

Add feedback

Training Language Models with Language Feedback at Scale

Scheurer, Jérémy, Campos, Jon Ander, Korbak, Tomasz, Chan, Jun Shern, Chen, Angelica, Cho, Kyunghyun, Perez, Ethan

arXiv.org Artificial IntelligenceApr-9-2023

Pretrained language models often generate outputs that are not in line with human preferences, such as harmful text or factually incorrect summaries. Recent work approaches the above issues by learning from a simple form of human feedback: comparisons between pairs of model-generated outputs. However, comparison feedback only conveys limited information about human preferences. In this paper, we introduce Imitation learning from Language Feedback (ILF), a new approach that utilizes more informative language feedback. ILF consists of three steps that are applied iteratively: first, conditioning the language model on the input, an initial LM output, and feedback to generate refinements. Second, selecting the refinement incorporating the most feedback. Third, finetuning the language model to maximize the likelihood of the chosen refinement given the input. We show theoretically that ILF can be viewed as Bayesian Inference, similar to Reinforcement Learning from human feedback. We evaluate ILF's effectiveness on a carefully-controlled toy task and a realistic summarization task. Our experiments demonstrate that large language models accurately incorporate feedback and that finetuning with ILF scales well with the dataset size, even outperforming finetuning on human summaries. Learning from both language and comparison feedback outperforms learning from each alone, achieving human-level summarization performance.

artificial intelligence, natural language, refinement, (17 more...)

arXiv.org Artificial Intelligence

2303.16755

Country:

North America (0.28)
Europe (0.27)

Genre: Research Report > New Finding (1.00)

Industry: Education (0.45)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.45)

Add feedback

Improving Code Generation by Training with Natural Language Feedback

Chen, Angelica, Scheurer, Jérémy, Korbak, Tomasz, Campos, Jon Ander, Chan, Jun Shern, Bowman, Samuel R., Cho, Kyunghyun, Perez, Ethan

arXiv.org Artificial IntelligenceMar-28-2023

The potential for pre-trained large language models (LLMs) to use natural language feedback at inference time has been an exciting recent development. We build upon this observation by formalizing an algorithm for learning from natural language feedback at training time instead, which we call Imitation learning from Language Feedback (ILF). ILF requires only a small amount of human-written feedback during training and does not require the same feedback at test time, making it both user-friendly and sample-efficient. We further show that ILF can be seen as a form of minimizing the KL divergence to the ground truth distribution and demonstrate a proof-of-concept on a neural program synthesis task. We use ILF to improve a Codegen-Mono 6.1B model's pass@1 rate by 38% relative (and 10% absolute) on the Mostly Basic Python Problems (MBPP) benchmark, outperforming both fine-tuning on MBPP and fine-tuning on repaired programs written by humans. Overall, our results suggest that learning from human-written natural language feedback is both more effective and sample-efficient than training exclusively on demonstrations for improving an LLM's performance on code generation tasks.

artificial intelligence, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2303.16749

Country: North America > United States (0.46)

Genre: Research Report > New Finding (0.54)

Industry: Education > Educational Setting > Online (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Few-shot Adaptation Works with UnpredicTable Data

Chan, Jun Shern, Pieler, Michael, Jao, Jonathan, Scheurer, Jérémy, Perez, Ethan

arXiv.org Artificial IntelligenceAug-7-2022

Prior work on language models (LMs) shows that training on a large number of diverse tasks improves few-shot learning (FSL) performance on new tasks. We take this to the extreme, automatically extracting 413,299 tasks from internet tables - orders of magnitude more than the next-largest public datasets. Finetuning on the resulting dataset leads to improved FSL performance on Natural Language Processing (NLP) tasks, but not proportionally to dataset scale. In fact, we find that narrow subsets of our dataset sometimes outperform more diverse datasets. For example, finetuning on software documentation from support.google.com raises FSL performance by a mean of +7.5% on 52 downstream tasks, which beats training on 40 human-curated NLP datasets (+6.7%). Finetuning on various narrow datasets leads to similar broad improvements across test tasks, suggesting that the gains are not from domain adaptation but adapting to FSL in general. We do not observe clear patterns between the datasets that lead to FSL gains, leaving open questions about why certain data helps with FSL.

artificial intelligence, natural language, text processing, (20 more...)

arXiv.org Artificial Intelligence

2208.01009

Country: North America > United States (1.00)

Genre: Research Report > New Finding (0.93)

Industry:

Leisure & Entertainment > Sports (1.00)
Health & Medicine > Therapeutic Area (1.00)
Education (1.00)
(4 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Commonsense Reasoning (0.93)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.93)

Add feedback