Large Language Model
Artificial general intelligence in the wrong hands could do 'really dangerous stuff,' experts warn
AGI, while powerful, could have negative consequences, warned Diveplane CEO Mike Capps and Liberty Blockchain CCO Christopher Alexander. Artificial general intelligence, the kind of AI that has capabilities similar to humans, may be far off and offer new opportunities, but experts warn it could be potentially dangerous, and have drastic implications for white-collar workers. "I'm about as excited about AGI as I am about nuclear fission," Diveplane CEO Dr. Michael Capps told Fox News Digital. "It's really amazing what we can do with it, it can power our society, but in the wrong hands, it can do some really dangerous stuff." While there is no one definition of AGI, a 2020 report from consulting giant McKinsey said such a machine would need to master human-like skills, such as fine motor skills and natural language processing.
OpenAI leaders call for regulation to prevent AI destroying humanity
The leaders of the ChatGPT developer OpenAI have called for the regulation of "superintelligent" AIs, arguing that an equivalent to the International Atomic Energy Agency is needed to protect humanity from the risk of accidentally creating something with the power to destroy it. In a short note published to the company's website, co-founders Greg Brockman and Ilya Sutskever and the chief executive, Sam Altman, call for an international regulator to begin working on how to "inspect systems, require audits, test for compliance with safety standards, [and] place restrictions on degrees of deployment and levels of security" in order to reduce the "existential risk" such systems could pose. "It's conceivable that within the next 10 years, AI systems will exceed expert skill level in most domains, and carry out as much productive activity as one of today's largest corporations," they write. "In terms of both potential upsides and downsides, superintelligence will be more powerful than other technologies humanity has had to contend with in the past. We can have a dramatically more prosperous future; but we have to manage risk to get there. Given the possibility of existential risk, we can't just be reactive."
M4: Multi-generator, Multi-domain, and Multi-lingual Black-Box Machine-Generated Text Detection
Wang, Yuxia, Mansurov, Jonibek, Ivanov, Petar, Su, Jinyan, Shelmanov, Artem, Tsvigun, Akim, Whitehouse, Chenxi, Afzal, Osama Mohammed, Mahmoud, Tarek, Aji, Alham Fikri, Nakov, Preslav
Large language models (LLMs) have demonstrated remarkable capability to generate fluent responses to a wide variety of user queries, but this has also resulted in concerns regarding the potential misuse of such texts in journalistic, educational, and academic contexts. In this work, we aim to develop automatic systems to identify machine-generated text and to detect potential misuse. We first introduce a large-scale benchmark, M4, which is a multi-generator, multi-domain, and multi-lingual corpus for machine-generated text detection. Using the dataset, we experiment with a number of methods and we show that it is challenging for detectors to generalize well on unseen examples if they are either from different domains or are generated by different large language models. In such cases, detectors tend to misclassify machine-generated text as human-written. These results show that the problem is far from solved and there is a lot of room for improvement. We believe that our dataset M4, which covers different generators, domains and languages, will enable future research towards more robust approaches for this pressing societal problem. The M4 dataset is available at https://github.com/mbzuai-nlp/M4.
DialogVCS: Robust Natural Language Understanding in Dialogue System Upgrade
Cai, Zefan, Zheng, Xin, Liu, Tianyu, Wang, Xu, Meng, Haoran, Han, Jiaqi, Yuan, Gang, Lin, Binghuai, Chang, Baobao, Cao, Yunbo
In the constant updates of product dialogue systems, we need to retrain the natural language understanding (NLU) model as new data from real users is merged into the existing data accumulated in previous updates. Within the newly added data, new intents would emerge and might have semantic entanglement with the existing intents, e.g. new intents that are semantically too specific or generic are actually subsets or supersets of some existing intents in the semantic space, thus impairing the robustness of the NLU model. As the first attempt to solve this problem, we set up a new benchmark consisting of 4 Dialogue Version Control dataSets (DialogVCS). We formulate the intent detection with imperfect data in the system update as a multi-label classification task with positive but unlabeled intents, which asks the models to recognize all the proper intents, including the ones with semantic entanglement, in the inference. We also propose comprehensive baseline models and conduct in-depth analyses for the benchmark, showing that the semantically entangled intents can be effectively recognized with an automatic workflow.
Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions
Chen, Jun, Zhu, Deyao, Haydarov, Kilichbek, Li, Xiang, Elhoseiny, Mohamed
Video captioning aims to convey dynamic scenes from videos using natural language, facilitating the understanding of spatiotemporal information within our environment. Although there have been recent advances, generating detailed and enriched video descriptions continues to be a substantial challenge. In this work, we introduce Video ChatCaptioner, an approach for creating more comprehensive spatiotemporal video descriptions. Our method employs a ChatGPT model as a controller, specifically designed to select frames for posing video content-driven questions. Subsequently, BLIP-2 is utilized to answer these visual queries. This question-answer framework effectively uncovers intricate video details and shows promise as a method for enhancing video content. Following multiple conversational rounds, ChatGPT can summarize enriched video content based on previous conversations. Through the human evaluation experiments, we found that nearly 62.5% of participants agree that Video ChatCaptioner can cover more visual information compared to ground-truth captions.
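The question-answer loop described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `chat_caption`, `ask`, `answer`, and `summarize` are hypothetical names, and the stub functions stand in for the ChatGPT controller and the BLIP-2 answerer, which are not called here.

```python
from typing import Callable, List, Tuple

# One conversational round: (frame index, question, answer).
Turn = Tuple[int, str, str]


def chat_caption(
    num_frames: int,
    ask: Callable[[List[Turn], int], Tuple[int, str]],
    answer: Callable[[int, str], str],
    summarize: Callable[[List[Turn]], str],
    rounds: int = 3,
) -> str:
    """Run a controller/answerer loop and return a summary of the dialogue.

    All three callables are hypothetical placeholders:
    ask       -- picks a frame and a question, given the history so far
    answer    -- answers the question about the chosen frame
    summarize -- turns the full history into a description
    """
    history: List[Turn] = []
    for _ in range(rounds):
        frame_id, question = ask(history, num_frames)
        if not 0 <= frame_id < num_frames:
            raise ValueError(f"controller chose invalid frame {frame_id}")
        history.append((frame_id, question, answer(frame_id, question)))
    return summarize(history)


# Stub components so the sketch runs without any model.
def stub_ask(history: List[Turn], num_frames: int) -> Tuple[int, str]:
    # Visit frames in order, wrapping around if there are more rounds than frames.
    return len(history) % num_frames, "What is happening in this frame?"


def stub_answer(frame_id: int, question: str) -> str:
    return f"description of frame {frame_id}"


def stub_summarize(history: List[Turn]) -> str:
    return "; ".join(reply for _, _, reply in history)


if __name__ == "__main__":
    print(chat_caption(4, stub_ask, stub_answer, stub_summarize, rounds=3))
```

Passing the three components in as callables keeps the loop itself independent of any particular model or API.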
Waiting, Banning, and Embracing: An Empirical Analysis of Adapting Policies for Generative AI in Higher Education
Xiao, Ping, Chen, Yuanyuan, Bao, Weining
Generative AI tools such as ChatGPT have recently gained significant attention in higher education. This study aims to understand how universities establish policies regarding the use of AI tools and explore the factors that influence their decisions. Our study examines ChatGPT policies implemented at universities around the world, including their existence, content, and issuance dates. Specifically, we analyzed the top 500 universities according to the 2022 QS World University Rankings. Our findings indicate that there is significant variation in university policies. Less than one-third of the universities included in the study had implemented ChatGPT policies. Of the universities with ChatGPT policies, approximately 67 percent embraced ChatGPT in teaching and learning, more than twice the number of universities that banned it. The majority of the universities that ban the use of ChatGPT in assessments allow individual instructors to deviate from this restrictive policy. Our empirical analysis identifies several factors that are significantly and positively correlated with a university's likelihood of having a ChatGPT policy, including the university's academic reputation score, being in an English-speaking country, and the general public attitudes toward ChatGPT. In addition, we found that a university's likelihood of having a ban policy is positively associated with faculty-student ratio, citations, and the English-speaking country dummy, while negatively associated with the number of peer universities within the same country that have banned ChatGPT. We discuss the challenges faced by universities based on our empirical findings.
Peek Across: Improving Multi-Document Modeling via Cross-Document Question-Answering
Caciularu, Avi, Peters, Matthew E., Goldberger, Jacob, Dagan, Ido, Cohan, Arman
The integration of multi-document pre-training objectives into language models has resulted in remarkable improvements in multi-document downstream tasks. In this work, we propose extending this idea by pre-training a generic multi-document model from a novel cross-document question answering pre-training objective. To that end, given a set (or cluster) of topically-related documents, we systematically generate semantically-oriented questions from a salient sentence in one document and challenge the model, during pre-training, to answer these questions while "peeking" into other topically-related documents. In a similar manner, the model is also challenged to recover the sentence from which the question was generated, again while leveraging cross-document information. This novel multi-document QA formulation directs the model to better recover cross-text informational relations, and introduces a natural augmentation that artificially increases the pre-training data. Further, unlike prior multi-document models that focus on either classification or summarization tasks, our pre-training objective formulation enables the model to perform tasks that involve both short text generation (e.g., QA) and long text generation (e.g., summarization). Following this scheme, we pre-train our model -- termed QAmden -- and evaluate its performance across several multi-document tasks, including multi-document QA, summarization, and query-focused summarization, yielding improvements of up to 7% and significantly outperforming zero-shot GPT-3.5 and GPT-4.
Bias-to-Text: Debiasing Unknown Visual Biases through Language Interpretation
Kim, Younghyun, Mo, Sangwoo, Kim, Minkyu, Lee, Kyungmin, Lee, Jaeho, Shin, Jinwoo
Biases in models pose a critical issue when deploying machine learning systems, but diagnosing them in an explainable manner can be challenging. To address this, we introduce the bias-to-text (B2T) framework, which uses language interpretation to identify and mitigate biases in vision models, such as image classifiers and text-to-image generative models. Our language descriptions of visual biases provide explainable forms that enable the discovery of novel biases and effective model debiasing. To achieve this, we analyze common keywords in the captions of mispredicted or generated images. Here, we propose novel score functions to avoid biases in captions by comparing the similarities between bias keywords and those images. Additionally, we present strategies to debias zero-shot classifiers and text-to-image diffusion models using the bias keywords from the B2T framework. We demonstrate the effectiveness of our framework on various image classification and generation tasks. For classifiers, we discover a new spurious correlation between the keywords "(sports) player" and "female" in Kaggle Face and improve the worst-group accuracy on Waterbirds by 11% through debiasing, compared to the baseline. For generative models, we detect and effectively prevent unfair (e.g., gender-biased) and unsafe (e.g., "naked") image generation.
Inverse scaling can become U-shaped
Wei, Jason, Kim, Najoung, Tay, Yi, Le, Quoc V.
Scaling up language models has been empirically shown to improve performance on a wide range of downstream tasks. However, if we were to observe worse performance as a function of scale ("inverse scaling") on certain tasks, this would indicate that scaling can also encourage behaviors that are misaligned with human preferences. The Inverse Scaling Prize (McKenzie et al. 2022) identified eleven such inverse scaling tasks, evaluated on models of up to 280B parameters and up to 500 zettaFLOPs of training compute. This paper takes a closer look at these inverse scaling tasks. We evaluate models of up to 540B parameters, trained on five times more compute than those evaluated in the Inverse Scaling Prize. With this increased range of model sizes and training compute, only four out of the eleven tasks remain inverse scaling. Six out of the eleven tasks exhibit "U-shaped scaling", where performance decreases up to a certain size, and then increases again up to the largest model evaluated (the one remaining task displays positive scaling). In addition, we find that 1-shot examples and chain-of-thought can help mitigate undesirable scaling patterns even further. U-shaped scaling suggests that the inverse scaling trend observed in McKenzie et al. (2022) may not continue to hold for larger models, which we attribute to the presence of distractor tasks that only sufficiently large models can avoid.
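The three scaling patterns named in this abstract (inverse, U-shaped, positive) can be told apart with a simple check on a performance curve. The sketch below is a crude heuristic written for illustration, not the criterion used by the authors: the function name `classify_scaling`, the `tol` parameter, the extra "flat" label, and the example scores are all assumptions.

```python
from typing import Sequence


def classify_scaling(scores: Sequence[float], tol: float = 0.0) -> str:
    """Label a performance curve whose entries are ordered from the
    smallest to the largest model.

    Heuristic (an assumption of this sketch): locate the lowest score.
    - lowest at the smallest model and the curve ends higher -> "positive"
    - lowest at the largest model and the curve ends lower   -> "inverse"
    - lowest in the interior, below the start, with a later
      recovery at the largest model                          -> "u-shaped"
    - anything else                                          -> "flat"
    """
    if len(scores) < 2:
        raise ValueError("need scores for at least two model sizes")

    lowest = min(range(len(scores)), key=lambda i: scores[i])
    first, last = scores[0], scores[-1]

    if lowest == 0 and last > first + tol:
        return "positive"
    if lowest == len(scores) - 1 and last < first - tol:
        return "inverse"
    if (
        0 < lowest < len(scores) - 1
        and scores[lowest] < first - tol
        and last > scores[lowest] + tol
    ):
        return "u-shaped"
    return "flat"


if __name__ == "__main__":
    # Made-up accuracies, for illustration only.
    print(classify_scaling([0.60, 0.50, 0.40]))  # inverse
    print(classify_scaling([0.60, 0.40, 0.70]))  # u-shaped
    print(classify_scaling([0.30, 0.50, 0.80]))  # positive
```

A curve labelled "inverse" over a limited range of sizes can become "u-shaped" once a larger model is appended, which is the effect the abstract reports for six of the eleven tasks.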
EvEval: A Comprehensive Evaluation of Event Semantics for Large Language Models
Tao, Zhengwei, Jin, Zhi, Bai, Xiaoying, Zhao, Haiyan, Feng, Yanlin, Li, Jia, Hu, Wenpeng
Events serve as fundamental units of occurrence within various contexts. The processing of event semantics in textual information forms the basis of numerous natural language processing (NLP) applications. Recent studies have begun leveraging large language models (LLMs) to address event semantic processing. However, the extent to which LLMs can effectively tackle these challenges remains uncertain. Furthermore, the lack of a comprehensive evaluation framework for event semantic processing poses a significant challenge in evaluating these capabilities. In this paper, we propose an overarching framework for event semantic processing, encompassing understanding, reasoning, and prediction, along with their fine-grained aspects. To comprehensively evaluate the event semantic processing abilities of models, we introduce a novel benchmark called EVEVAL. We collect 8 datasets that cover all aspects of event semantic processing. Extensive experiments are conducted on EVEVAL, leading to several noteworthy findings based on the obtained results.