Large Language Model
How Old is GPT?: The HumBEL Framework for Evaluating Language Models using Human Demographic Data
Sicilia, Anthony, Gates, Jennifer C., Alikhani, Malihe
While large pre-trained language models (LMs) find greater use across NLP, existing evaluation protocols do not consider how LM language use aligns with particular human demographic groups, which can be an important consideration in conversational AI applications. To remedy this gap, we consider how LM language skills can be measured and compared to human sub-populations. We suggest clinical techniques from Speech Language Pathology, which has well-established norms for acquisition of language skills, organized by (human) age. We conduct evaluation with a domain expert (i.e., a clinically licensed speech language pathologist), and also propose automated techniques to substitute clinical evaluation at scale. We find LM capability varies widely depending on task with GPT-3.5 mimicking the ability of a typical 6-9 year old at tasks requiring inference about word meanings and simultaneously outperforming a typical 21 year old at memorization. GPT-3.5 (InstructGPT) also has trouble with social language use, exhibiting less than 50\% of the tested pragmatic skills. It shows errors in understanding particular word parts-of-speech and associative word relations, among other lexical features. Ultimately, findings reiterate the importance of considering demographic alignment and conversational goals when using these models as public-facing tools. Our framework will be publicly available via code, data, and a python package.
TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale
Zeng, Ziyun, Ge, Yixiao, Tong, Zhan, Liu, Xihui, Xia, Shu-Tao, Shan, Ying
The ultimate goal for foundation models is realizing task-agnostic, i.e., supporting out-of-the-box usage without task-specific fine-tuning. Although breakthroughs have been made in natural language processing and image representation learning, it is still challenging for video models to reach it due to the increasing uncertainty of spatiotemporal signals. To ease training, existing works leverage image foundation models' prior knowledge and equip them with efficient temporal modules. Despite the satisfactory fine-tuning performance, we empirically find they fall short of out-of-the-box usage, given the even degraded performance in zero-shot/linear protocols compared to their baseline counterparts. In this work, we analyze the factor that leads to degradation from the perspective of language supervision distortion. We argue that tuning a text encoder end-to-end, as done in previous work, is suboptimal since it may overfit in terms of styles, thereby losing its original generalization ability to capture the semantics of various language registers. The overfitted text encoder, in turn, provides a harmful supervision signal, degrading the video representation. To tackle this issue, we propose a degradation-free pre-training strategy to retain the generalization ability of the text encoder via freezing shallow layers while enabling the task-related semantics capturing in tunable deep layers. As for the training objective, we adopted the transcript sorting task in TVTS incorporated with masking techniques to enable scalable training. As a result, we produce a series of models, dubbed TVTSv2, with up to one billion parameters. We achieve new state-of-the-arts on various video benchmarks with a frozen backbone, surpassing the recent ImageBind, InternVideo, etc. Code is available at https://github.com/TencentARC/TVTS.
Automated Metrics for Medical Multi-Document Summarization Disagree with Human Evaluations
Wang, Lucy Lu, Otmakhova, Yulia, DeYoung, Jay, Truong, Thinh Hung, Kuehl, Bailey E., Bransom, Erin, Wallace, Byron C.
Evaluating multi-document summarization (MDS) quality is difficult. This is especially true in the case of MDS for biomedical literature reviews, where models must synthesize contradicting evidence reported across different documents. Prior work has shown that rather than performing the task, models may exploit shortcuts that are difficult to detect using standard n-gram similarity metrics such as ROUGE. Better automated evaluation metrics are needed, but few resources exist to assess metrics when they are proposed. Therefore, we introduce a dataset of human-assessed summary quality facets and pairwise preferences to encourage and support the development of better automated evaluation methods for literature review MDS. We take advantage of community submissions to the Multi-document Summarization for Literature Review (MSLR) shared task to compile a diverse and representative sample of generated summaries. We analyze how automated summarization evaluation metrics correlate with lexical features of generated summaries, to other automated metrics including several we propose in this work, and to aspects of human-assessed summary quality. We find that not only do automated metrics fail to capture aspects of quality as assessed by humans, in many cases the system rankings produced by these metrics are anti-correlated with rankings according to human annotators.
In the Rush to AI, We Can't Afford to Trust Big Tech
Gary Marcus delivered these remarks to the U.S. Senate Judiciary Subcommittee on Privacy, Technology and the Law on May 16. I am profoundly grateful to be here. I come as a scientist, as someone who has founded AI companies, and as someone who genuinely loves AI -- but who is increasingly worried. There are benefits; we don't yet know whether they will outweigh the risks. Fundamentally, these new systems are going to be destabilizing.
Meta's open-source speech AI recognizes over 4,000 spoken languages
Meta has created an AI language model that (in a refreshing change of pace) isn't a ChatGPT clone. The company's Massively Multilingual Speech (MMS) project can recognize over 4,000 spoken languages and produce speech (text-to-speech) in over 1,100. Like most of its other publicly announced AI projects, Meta is open-sourcing MMS today to help preserve language diversity and encourage researchers to build on its foundation. "Today, we are publicly sharing our models and code so that others in the research community can build upon our work," the company wrote. "Through this work, we hope to make a small contribution to preserve the incredible language diversity of the world."
The historic chess showdown between man and AI, decades before ChatGPT
With the eyes of the world on Philadelphia in February 1996 for Kasparov versus Deep Blue, a different sort of chess struggle was taking place elsewhere in the city. The North Philadelphia middle school named for Roberts Vaux had once been home to one of the country's best chess teams. From 1977 to 1983, the school claimed seven national championships. But then funding dried up and the program had difficulty attracting skilled coaches and players.
AI went to Washington and here's what you need to know about this mind-blowing technology
CEO says OpenAI CEO Sam Altman said language and cultural inclusivity is "very important" to his company's mission as it builds and trains powerful artificial intelligence systems. On Tuesday, May 16, Mr. Altman went to Washington. And today, the world feels a little scarier. There's rarely a day when we don't hear some new report about the groundbreaking impact – and potential danger – of this technology. Large learning models like ChatGPT have caught the world by surprise based on the speed of their learning and what they are now able to do.
Fears of AI hitting black market stir concerns of criminals evading government regulations: Expert
Dr. Harvey Castro said he's less concerned about AI being developed by big corporations because there are safeguards, but it can be created without safeguards and sold. Artificial intelligence – specifically large language models like ChatGPT – can theoretically give criminals information needed to cover their tracks before and after a crime, then erase that evidence, an expert warns. Large language models, or LLMs, make up a segment of AI technology that uses algorithms that can recognize, summarize, translate, predict and generate text and other content based on knowledge gained from massive datasets. ChatGPT is the most well known LLM, and its successful, rapid development has created unease among some experts and sparked a Senate hearing to hear from Sam Altman, the CEO of ChatGPT maker OpenAI, who pushed for oversight. Corporations like Google and Microsoft are developing AI at a fast pace. But when it comes to crime, that's not what scares Dr. Harvey Castro, a board-certified emergency medicine physician and national speaker on artificial intelligence who created his own LLM called "Sherlock."
LM-Switch: Lightweight Language Model Conditioning in Word Embedding Space
Han, Chi, Xu, Jialiang, Li, Manling, Fung, Yi, Sun, Chenkai, Jiang, Nan, Abdelzaher, Tarek, Ji, Heng
In recent years, large language models (LMs) have achieved remarkable progress across various natural language processing tasks. As pre-training and fine-tuning are costly and might negatively impact model performance, it is desired to efficiently adapt an existing model to different conditions such as styles, sentiments or narratives, when facing different audiences or scenarios. However, efficient adaptation of a language model to diverse conditions remains an open challenge. This work is inspired by the observation that text conditions are often associated with selection of certain words in a context. Therefore we introduce LM-Switch, a theoretically grounded, lightweight and simple method for generative language model conditioning. We begin by investigating the effect of conditions in Hidden Markov Models (HMMs), and establish a theoretical connection with language model. Our finding suggests that condition shifts in HMMs are associated with linear transformations in word embeddings. LM-Switch is then designed to deploy a learnable linear factor in the word embedding space for language model conditioning. We show that LM-Switch can model diverse tasks, and achieves comparable or better performance compared with state-of-the-art baselines in LM detoxification and generation control, despite requiring no more than 1% of parameters compared with baselines and little extra time overhead compared with base LMs. It is also able to learn from as few as a few sentences or one document. Moreover, a learned LM-Switch can be transferred to other LMs of different sizes, achieving a detoxification performance similar to the best baseline. We will make our code available to the research community following publication.
Cognitive network science reveals bias in GPT-3, ChatGPT, and GPT-4 mirroring math anxiety in high-school students
Abramski, Katherine, Citraro, Salvatore, Lombardi, Luigi, Rossetti, Giulio, Stella, Massimo
Large language models are becoming increasingly integrated into our lives. Hence, it is important to understand the biases present in their outputs in order to avoid perpetuating harmful stereotypes, which originate in our own flawed ways of thinking. This challenge requires developing new benchmarks and methods for quantifying affective and semantic bias, keeping in mind that LLMs act as psycho-social mirrors that reflect the views and tendencies that are prevalent in society. One such tendency that has harmful negative effects is the global phenomenon of anxiety toward math and STEM subjects. Here, we investigate perceptions of math and STEM fields provided by cutting-edge language models, namely GPT-3, Chat-GPT, and GPT-4, by applying an approach from network science and cognitive psychology. Specifically, we use behavioral forma mentis networks (BFMNs) to understand how these LLMs frame math and STEM disciplines in relation to other concepts. We use data obtained by probing the three LLMs in a language generation task that has previously been applied to humans. Our findings indicate that LLMs have an overall negative perception of math and STEM fields, with math being perceived most negatively. We observe significant differences across the three LLMs. We observe that newer versions (i.e. GPT-4) produce richer, more complex perceptions as well as less negative perceptions compared to older versions and N=159 high-school students. These findings suggest that advances in the architecture of LLMs may lead to increasingly less biased models that could even perhaps someday aid in reducing harmful stereotypes in society rather than perpetuating them.