Chevalier, Alexis
TEDDY: A Family Of Foundation Models For Understanding Single Cell Biology
Chevalier, Alexis, Ghosh, Soumya, Awasthi, Urvi, Watkins, James, Bieniewska, Julia, Mitrea, Nichita, Kotova, Olga, Shkura, Kirill, Noble, Andrew, Steinbaugh, Michael, Delile, Julien, Meier, Christoph, Zhukov, Leonid, Khalil, Iya, Mukherjee, Srayanta, Mueller, Judith
The complexity of cell biology and the mechanisms of disease pathogenesis are driven by an intricate regulatory network of genes [Chatterjee and Ahituv, 2017, Theodoris et al., 2015, 2021]. A higher-resolution view of this complex interactome would enhance our ability to design drugs that target the causal mechanisms of disease rather than interventions that merely modulate downstream effects [Ding et al., 2022]. However, accurate inference of gene regulatory networks is challenging. The space of possible genetic interactions is vast [Bunne et al., 2024], and the networks to be inferred are highly context-dependent: different cell types and tissue types exhibit different regulatory networks, with significant variation across donors [Chen and Dahl, 2024]. Moreover, the data required to study gene regulatory networks for a specific disease are usually limited and highly specialized, and often plagued by experimental artifacts [Hicks et al., 2018]. Yet a confluence of recent technological advances promises to make this challenging problem more tractable. Accurate single-cell sequencing technologies remove the artifacts of bulk-cell data, better reflect natural variability, and provide signals at higher resolution. This, together with the increasing availability of atlas-scale scRNAseq datasets spanning an extensive range of diseases, cell types, tissue types, and donors, provides an unprecedented opportunity for studying disease mechanisms at scale.
Language Models as Science Tutors
Chevalier, Alexis, Geng, Jiayi, Wettig, Alexander, Chen, Howard, Mizera, Sebastian, Annala, Toni, Aragon, Max Jameson, Fanlo, Arturo Rodríguez, Frieder, Simon, Machado, Simon, Prabhakar, Akshara, Thieu, Ellie, Wang, Jiachen T., Wang, Zirui, Wu, Xindi, Xia, Mengzhou, Jia, Wenhan, Yu, Jiatong, Zhu, Jun-Jie, Ren, Zhiyong Jason, Arora, Sanjeev, Chen, Danqi
NLP has recently made exciting progress toward training language models (LMs) with strong scientific problem-solving skills. However, model development has not focused on real-life use cases of LMs for science, including applications in education that require processing long scientific documents. To address this, we introduce TutorEval and TutorChat. TutorEval is a diverse question-answering benchmark consisting of expert-written questions about long chapters from STEM textbooks. TutorEval helps measure the real-life usability of LMs as scientific assistants, and it is the first benchmark combining long contexts, free-form generation, and multi-disciplinary scientific knowledge. Moreover, we show that fine-tuning base models with existing dialogue datasets leads to poor performance on TutorEval. Therefore, we create TutorChat, a dataset of 80,000 long synthetic dialogues about textbooks. We use TutorChat to fine-tune Llemma models with 7B and 34B parameters. These math-specialized LM tutors have a 32K-token context window, and they excel at TutorEval while performing strongly on GSM8K and MATH. Our datasets build on open-source materials, and we release our models, data, and evaluations.
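To make the intended use concrete, here is a minimal sketch of how a long-context LM tutor of this kind might be queried on a TutorEval-style example: the full textbook chapter is placed in the 32K-token context window, followed by the question, and the model generates a free-form answer. The model identifier, prompt template, and generation settings below are illustrative assumptions, not the paper's released configuration.

```python
# Minimal sketch of querying a long-context LM tutor on a TutorEval-style
# example. MODEL_ID and the prompt template are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "your-org/lm-tutor-7b-32k"  # hypothetical model identifier

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def tutor_answer(chapter: str, question: str, max_new_tokens: int = 512) -> str:
    # The entire chapter fits in the long context window, so no retrieval
    # or chunking is needed; the question refers back to the chapter.
    prompt = (
        "You are a knowledgeable science tutor. Read the textbook chapter "
        "below, then answer the student's question.\n\n"
        f"### Chapter\n{chapter}\n\n### Question\n{question}\n\n### Answer\n"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Return only the newly generated tokens, decoded as the free-form answer.
    return tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
```

Grading the resulting free-form answers against expert-written reference material is a separate evaluation step, omitted here.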
Adapting Language Models to Compress Contexts
Chevalier, Alexis, Wettig, Alexander, Ajith, Anirudh, Chen, Danqi
Transformer-based language models (LMs) are powerful and widely applicable tools, but their usefulness is constrained by a finite context window and the high computational cost of processing long text documents. We propose to adapt pre-trained LMs into AutoCompressors. These language models are capable of compressing long contexts into compact summary vectors, which are then accessible to the model as soft prompts. Summary vectors are trained with an unsupervised objective, whereby long documents are processed in segments and summary vectors from all previous segments are used in language modeling. We fine-tune OPT and Llama-2 models on sequences of up to 30,720 tokens and show that AutoCompressors can utilize long contexts to improve perplexity. We evaluate AutoCompressors on in-context learning by compressing task demonstrations and find that summary vectors are good substitutes for plain-text demonstrations, increasing accuracy while reducing inference costs. Finally, we explore the benefits of pre-computing summary vectors for large corpora by applying summary vectors to retrieval-augmented language modeling and a passage re-ranking task. Overall, AutoCompressors emerge as a simple and inexpensive solution for extending the context window of LMs while speeding up inference over long contexts.
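A minimal sketch may make the summary-vector recurrence concrete. It assumes a HuggingFace-style causal LM that accepts inputs_embeds and labels; the class name, the number of summary tokens, and the gradient handling are illustrative assumptions rather than the paper's exact training recipe.

```python
import torch
import torch.nn as nn

class AutoCompressorSketch(nn.Module):
    """Illustrative summary-vector recurrence; names and defaults are assumptions."""

    def __init__(self, base_lm, num_summary=50):
        super().__init__()
        self.lm = base_lm  # a HuggingFace-style causal LM (e.g., OPT or Llama-2)
        hidden = base_lm.get_input_embeddings().embedding_dim
        # Learned embeddings marking the positions whose final hidden states
        # become the summary vectors for the current segment.
        self.summary_embeds = nn.Parameter(0.02 * torch.randn(num_summary, hidden))

    def forward(self, segments):
        """segments: list of (batch, seg_len) LongTensors covering one long document."""
        summaries = None   # summary vectors accumulated from previous segments
        total_loss = 0.0
        for seg in segments:
            tok_embeds = self.lm.get_input_embeddings()(seg)  # (B, T, H)
            batch, seg_len = seg.shape
            placeholders = self.summary_embeds.unsqueeze(0).expand(batch, -1, -1)
            parts, n_prefix = [tok_embeds, placeholders], 0
            if summaries is not None:
                # Summary vectors from all previous segments act as soft prompts.
                parts, n_prefix = [summaries] + parts, summaries.size(1)
            inputs = torch.cat(parts, dim=1)
            # Unsupervised objective: language modeling on the real tokens only.
            labels = torch.full(inputs.shape[:2], -100, dtype=torch.long, device=seg.device)
            labels[:, n_prefix:n_prefix + seg_len] = seg
            out = self.lm(inputs_embeds=inputs, labels=labels, output_hidden_states=True)
            total_loss = total_loss + out.loss
            # Read off the new summary vectors at the placeholder positions.
            new_summ = out.hidden_states[-1][:, -placeholders.size(1):, :]
            # The paper limits backpropagation to a few compression steps;
            # this sketch simply keeps the full graph for clarity.
            summaries = new_summ if summaries is None else torch.cat([summaries, new_summ], dim=1)
        return total_loss / len(segments)
```

At inference time, summary vectors can be pre-computed once for a corpus and prepended as soft prompts, which is what makes the retrieval-augmented language modeling and re-ranking applications described above inexpensive.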
Mathematical Capabilities of ChatGPT
Frieder, Simon, Pinchetti, Luca, Chevalier, Alexis, Griffiths, Ryan-Rhys, Salvatori, Tommaso, Lukasiewicz, Thomas, Petersen, Philipp Christian, Berner, Julius
We investigate the mathematical capabilities of two iterations of ChatGPT (released 9-January-2023 and 30-January-2023) and of GPT-4 by testing them on publicly available datasets, as well as hand-crafted ones, using a novel methodology. In contrast to formal mathematics, where large databases of formal proofs are available (e.g., the Lean Mathematical Library), current datasets of natural-language mathematics, used to benchmark language models, either cover only elementary mathematics or are very small. We address this by publicly releasing two new datasets: GHOSTS and miniGHOSTS. These are the first natural-language datasets curated by working researchers in mathematics that (1) aim to cover graduate-level mathematics, (2) provide a holistic overview of the mathematical capabilities of language models, and (3) distinguish multiple dimensions of mathematical reasoning. These datasets also test whether ChatGPT and GPT-4 can be helpful assistants to professional mathematicians by emulating use cases that arise in the daily professional activities of mathematicians. We benchmark the models on a range of fine-grained performance metrics. For advanced mathematics, this is the most detailed evaluation effort to date. We find that ChatGPT can be used most successfully as a mathematical assistant for querying facts, acting as a mathematical search engine and knowledge base interface. GPT-4 can additionally be used for undergraduate-level mathematics but fails at graduate-level difficulty. Contrary to many positive reports in the media about GPT-4 and ChatGPT's exam-solving abilities (a potential case of selection bias), their overall mathematical performance is well below the level of a graduate student. Hence, if your goal is to use ChatGPT to pass a graduate-level math exam, you would be better off copying from your average peer!