Large Language Model
Anthropic says its Claude AI can now read a whole book in under a minute
Anthropic says it has vastly expanded the amount of information its generative AI, Claude, is able to process. Claude has gone from having a limit of 9,000 tokens to 100,000 tokens, which corresponds to roughly 75,000 words. To put that into perspective, Claude now has the ability to easily read and finish Ernest Hemingway's A Farewell to Arms (74,240 words), Mary Shelley's Frankenstein (74,800 words) and Mark Twain's The Adventures of Tom Sawyer (69,000 words). And, as The Verge notes, the company says Claude can read and analyze information from each book in under a minute. Generative AIs like Claude are still limited by the number of "tokens" they can process.
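The token-to-word arithmetic above can be checked with a short sketch. It assumes the article's rough ratio (100,000 tokens to about 75,000 words, i.e. 0.75 words per token); that ratio is a rule of thumb, not a property of any particular tokenizer, and the function names are hypothetical.

```python
# Rule of thumb implied by the article: 100,000 tokens ~ 75,000 words.
# This ratio is an approximation; real tokenizers vary with the text.
WORDS_PER_TOKEN = 75_000 / 100_000  # 0.75

# Word counts as quoted in the article.
BOOKS = {
    "A Farewell to Arms": 74_240,
    "Frankenstein": 74_800,
    "The Adventures of Tom Sawyer": 69_000,
}


def approx_word_capacity(context_tokens: int) -> int:
    """Estimate how many words fit in a context window of the given size."""
    return int(context_tokens * WORDS_PER_TOKEN)


def fits_in_context(word_count: int, context_tokens: int) -> bool:
    """Return True if a text of word_count words fits under the estimate."""
    return word_count <= approx_word_capacity(context_tokens)


if __name__ == "__main__":
    for title, words in BOOKS.items():
        old = fits_in_context(words, 9_000)
        new = fits_in_context(words, 100_000)
        print(f"{title}: fits in 9k={old}, fits in 100k={new}")
```

Under this estimate, all three books fit within the 100,000-token limit and none fits within the earlier 9,000-token limit.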
The open-source AI boom is built on Big Tech's handouts. How long will it last?
Companies like Google--which revealed at its annual product showcase this week that it is throwing generative AI at everything it has, from Gmail to Photos to Maps--were too busy looking over their shoulders to see the real competition coming, writes Sernau: "While we've been squabbling, a third faction has been quietly eating our lunch." Greater access to these models has helped drive innovation--it can also help catch their flaws. AI won't thrive if just a few mega-rich companies get to gatekeep this technology or decide how it is used. But this open-source boom is precarious. Most open-source releases still stand on the shoulders of giant models put out by big firms with deep pockets.
'Why would we employ people?' Experts on five ways AI will change work
In 1965, the political scientist and Nobel laureate Herbert Simon declared: "Machines will be capable, within 20 years, of doing any work a man can do." Today, in what is increasingly referred to as the fourth industrial revolution, the arrival of artificial intelligence (AI) in the workplace is igniting similar concerns. The European parliament's forthcoming Artificial Intelligence Act is likely to deem the use of AI across education, law enforcement and worker management to be "high risk". Geoffrey Hinton, known as the "godfather of AI", recently resigned from his position at Google, citing concerns about the technology's impact on the job market. And, in early May, striking members of the Writers Guild of America promised executives: "AI will replace you before it replaces us."
Hollywood writers' strike highlights AI: Industry creatives 'should be concerned' for future, expert says
Veritone CEO Ryan Steelberg says the Writers Guild of America needs to make sure its writers are protected as AI becomes more popular. Nearly two weeks into the national writers' strike spearheaded by the Writers Guild of America (WGA), little progress has been made between the two sides. The WGA has a litany of requests for the Alliance of Motion Picture and Television Producers (AMPTP). Per its website, the WGA has specific proposals with regard to artificial intelligence, including the "regulation of AI on minimum basic agreement (MBA)-covered projects; AI can't write or rewrite literary material; can't be used as source material; and MBA-covered material can't be used to train AI." On these AI provisions, studios have put the kibosh on writers' requests, instead suggesting annual meetings to review evolving technology.
MedGPTEval: A Dataset and Benchmark to Evaluate Responses of Large Language Models in Medicine
Xu, Jie, Lu, Lu, Yang, Sen, Liang, Bilin, Peng, Xinwei, Pang, Jiali, Ding, Jinru, Shi, Xiaoming, Yang, Lingrui, Song, Huan, Li, Kang, Sun, Xin, Zhang, Shaoting
METHODS: First, a set of evaluation criteria is designed based on a comprehensive literature review. Second, the candidate criteria are refined using a Delphi method by five experts in medicine and engineering. Third, three clinical experts design a set of medical datasets to interact with LLMs. Finally, benchmarking experiments are conducted on the datasets. The responses generated by chatbots based on LLMs are recorded for blind evaluations by five licensed medical experts. RESULTS: The obtained evaluation criteria cover medical professional capabilities, social comprehensive capabilities, contextual capabilities, and computational robustness, with sixteen detailed indicators. The medical datasets include twenty-seven medical dialogues and seven case reports in Chinese. Three chatbots are evaluated: ChatGPT by OpenAI, ERNIE Bot by Baidu Inc., and Doctor PuJiang (Dr. PJ) by Shanghai Artificial Intelligence Laboratory. Experimental results show that Dr. PJ outperforms ChatGPT and ERNIE Bot in both multiple-turn medical dialogue and case report scenarios.
Learning to Reason over Scene Graphs: A Case Study of Finetuning GPT-2 into a Robot Language Model for Grounded Task Planning
Chalvatzaki, Georgia, Younes, Ali, Nandha, Daljeet, Le, An, Ribeiro, Leonardo F. R., Gurevych, Iryna
Long-horizon task planning is essential for the development of intelligent assistive and service robots. In this work, we investigate the applicability of a smaller class of large language models (LLMs), specifically GPT-2, in robotic task planning by learning to decompose tasks into subgoal specifications for a planner to execute sequentially. Our method grounds the input of the LLM on the domain that is represented as a scene graph, enabling it to translate human requests into executable robot plans, thereby learning to reason over long-horizon tasks, as encountered in the ALFRED benchmark. We compare our approach with classical planning and baseline methods to examine the applicability and generalizability of LLM-based planners. Our findings suggest that the knowledge stored in an LLM can be effectively grounded to perform long-horizon task planning, demonstrating the promising potential for the future application of neuro-symbolic planning methods in robotics.
NL2TL: Transforming Natural Languages to Temporal Logics using Large Language Models
Chen, Yongchao, Gandhi, Rujul, Zhang, Yang, Fan, Chuchu
Temporal Logic (TL) can be used to rigorously specify complex high-level specifications for systems in many engineering applications. The translation between natural language (NL) and TL has been under-explored due to the lack of datasets and of models that generalize across different application domains. In this paper, we propose an accurate and generalizable transformation framework of English instructions from NL to TL, exploring the use of Large Language Models (LLMs) at multiple stages. Our contributions are twofold. First, we develop a framework to create a dataset of NL-TL pairs combining LLMs and human annotation. We publish a dataset with 28K NL-TL pairs. Then, we finetune T5 models on the lifted versions (i.e., the specific Atomic Propositions (AP) are hidden) of the NL and TL. The enhanced generalizability originates from two aspects: 1) Usage of lifted NL-TL characterizes common logical structures, without constraints of specific domains. 2) Application of LLMs in dataset creation largely enhances corpus richness. We test the generalization of trained models on five varied domains. To achieve full NL-TL transformation, we either combine the lifted model with an AP recognition task or perform further finetuning on each specific domain. During the further finetuning, our model achieves higher accuracy (>95%) using only <10% of the training data, compared with the baseline sequence-to-sequence (Seq2Seq) model.
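The idea of a "lifted" NL-TL pair, in which the specific atomic propositions are hidden, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `prop_N` placeholder naming, the `lift` and `ground` helpers, the example sentence, and the formula syntax are hypothetical and are not taken from the paper's dataset or code. The sketch uses naive string replacement, which would need more care if one AP name were a substring of another.

```python
from typing import Dict, List, Tuple

# Each atomic proposition is given as (phrase in the NL text, symbol in the TL formula).
AtomicProp = Tuple[str, str]


def lift(text: str, formula: str, aps: List[AtomicProp]):
    """Hide domain-specific APs behind generic placeholders (hypothetical scheme)."""
    mapping: Dict[str, AtomicProp] = {}
    for i, (phrase, symbol) in enumerate(aps, start=1):
        placeholder = f"prop_{i}"
        text = text.replace(phrase, placeholder)
        formula = formula.replace(symbol, placeholder)
        mapping[placeholder] = (phrase, symbol)
    return text, formula, mapping


def ground(lifted_formula: str, mapping: Dict[str, AtomicProp]) -> str:
    """Substitute the original AP symbols back into a lifted formula."""
    # Longest placeholders first, so prop_10 is not clobbered by prop_1.
    for placeholder in sorted(mapping, key=len, reverse=True):
        lifted_formula = lifted_formula.replace(placeholder, mapping[placeholder][1])
    return lifted_formula


if __name__ == "__main__":
    nl = "Eventually reach the kitchen, and always avoid the stairs."
    tl = "F(kitchen) & G(!stairs)"  # illustrative syntax: F = eventually, G = always
    aps = [("reach the kitchen", "kitchen"), ("the stairs", "stairs")]

    lifted_nl, lifted_tl, mapping = lift(nl, tl, aps)
    print(lifted_nl)
    print(lifted_tl)
    print(ground(lifted_tl, mapping))
```

The lifted pair keeps only the logical structure ("eventually X, and always not Y"), which is what would let a model trained on it transfer across domains; grounding restores the domain-specific symbols afterwards.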
Surfacing Biases in Large Language Models using Contrastive Input Decoding
Yona, Gal, Honovich, Or, Laish, Itay, Aharoni, Roee
Ensuring that large language models (LMs) are fair, robust and useful requires an understanding of how different modifications to their inputs impact the model's behaviour. In the context of open-text generation tasks, however, such an evaluation is not trivial. For example, when presenting a model with an input text and a perturbed, "contrastive" version of it, meaningful differences in the next-token predictions may not be revealed with standard decoding strategies. With this motivation in mind, we propose Contrastive Input Decoding (CID): a decoding algorithm to generate text given two inputs, where the generated text is likely given one input but unlikely given the other. In this way, the contrastive generations can highlight potentially subtle differences in the LM output for the two inputs in a simple and interpretable manner. We use CID to highlight context-specific biases that are hard to detect with standard decoding strategies and quantify the effect of different input perturbations.
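A toy sketch can show why standard decoding may hide a difference that a contrastive rule reveals. The scoring rule below (log-probability under the main input minus a weighted log-probability under the contrastive input) is a stand-in chosen for illustration and is not claimed to be the exact CID formula; the distributions, the `lam` weight, the `min_prob` filter, and the function name are all hypothetical.

```python
import math
from typing import Dict, Tuple


def contrastive_pick(
    p_main: Dict[str, float],
    p_contrast: Dict[str, float],
    lam: float = 1.0,
    min_prob: float = 0.05,
    eps: float = 1e-9,
) -> Tuple[str, Dict[str, float]]:
    """Pick the next token that is likely under p_main but unlikely under p_contrast.

    Illustrative rule: score(t) = log p_main(t) - lam * log p_contrast(t),
    restricted to tokens with p_main(t) >= min_prob so that implausible
    tokens are not promoted merely for being rare under the contrast input.
    """
    scores = {
        tok: math.log(p + eps) - lam * math.log(p_contrast.get(tok, 0.0) + eps)
        for tok, p in p_main.items()
        if p >= min_prob
    }
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    # Made-up next-token distributions for an input and its perturbed version.
    p_main = {"doctor": 0.40, "nurse": 0.35, "teacher": 0.25}
    p_contrast = {"doctor": 0.45, "nurse": 0.15, "teacher": 0.40}

    # Greedy decoding picks the same token for both inputs.
    print(max(p_main, key=p_main.get), max(p_contrast, key=p_contrast.get))

    # The contrastive rule surfaces the token whose probability shifted most.
    print(contrastive_pick(p_main, p_contrast)[0])
```

In this made-up example, greedy decoding returns the same top token for both inputs, while the contrastive score selects the token whose probability differs most between them. Setting `lam=0.0` recovers ordinary greedy decoding on the main input.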
Regulating ChatGPT and other Large Generative AI Models
Hacker, Philipp, Engel, Andreas, Mauer, Marco
Large generative AI models (LGAIMs), such as ChatGPT, GPT-4 or Stable Diffusion, are rapidly transforming the way we communicate, illustrate, and create. However, AI regulation, in the EU and beyond, has primarily focused on conventional AI models, not LGAIMs. This paper will situate these new generative models in the current debate on trustworthy AI regulation, and ask how the law can be tailored to their capabilities. After laying technical foundations, the legal part of the paper proceeds in four steps, covering (1) direct regulation, (2) data protection, (3) content moderation, and (4) policy proposals. It suggests a novel terminology to capture the AI value chain in LGAIM settings by differentiating between LGAIM developers, deployers, professional and non-professional users, as well as recipients of LGAIM output. We tailor regulatory duties to these different actors along the value chain and suggest strategies to ensure that LGAIMs are trustworthy and deployed for the benefit of society at large. Rules in the AI Act and other direct regulation must match the specificities of pre-trained models. The paper argues for three layers of obligations concerning LGAIMs (minimum standards for all LGAIMs; high-risk obligations for high-risk use cases; collaborations along the AI value chain). In general, regulation should focus on concrete high-risk applications, and not the pre-trained model itself, and should include (i) obligations regarding transparency and (ii) risk management. Non-discrimination provisions (iii) may, however, apply to LGAIM developers. Lastly, (iv) the core of the DSA content moderation rules should be expanded to cover LGAIMs. This includes notice and action mechanisms, and trusted flaggers. In all areas, regulators and lawmakers need to act fast to keep pace with the dynamics of ChatGPT et al.
Zero-shot Item-based Recommendation via Multi-task Product Knowledge Graph Pre-Training
Fan, Ziwei, Liu, Zhiwei, Heinecke, Shelby, Zhang, Jianguo, Wang, Huan, Xiong, Caiming, Yu, Philip S.
Existing recommender systems face difficulties with zero-shot items, i.e. items that have no historical interactions with users during the training stage. Though recent works extract universal item representations via pre-trained language models (PLMs), they ignore the crucial item relationships. This paper presents a novel paradigm for the Zero-Shot Item-based Recommendation (ZSIR) task, which pre-trains a model on a product knowledge graph (PKG) to refine the item features from PLMs. We identify three challenges for pre-training on a PKG: multi-type relations in the PKG, semantic divergence between generic item information and relations, and domain discrepancy between the PKG and the downstream ZSIR task. We address the challenges by proposing four pre-training tasks and novel task-oriented adaptation (ToA) layers. Moreover, this paper discusses how to fine-tune the model on a new recommendation task such that the ToA layers are adapted to the ZSIR task. Comprehensive experiments on an 18-market dataset are conducted to verify the effectiveness of the proposed model in both knowledge prediction and the ZSIR task.