 Generative AI


TEL'M: Test and Evaluation of Language Models

arXiv.org Artificial Intelligence

It is assumed that readers are already familiar with Language Models of various flavors such as: Transformer-based Language Models (currently the most promising and studied LMs) [78]; Multimodal Foundation Models such as Blip-2 [48] and CLIP [61]; Auto-regressive Language Models [15, 51]; Recurrent Neural Network Language Models [75]; State space language models [40]; Hybrid Models [24], as well as the current and proposed use cases and the various technologies underlying them [1, 65, 70]. There is growing interest in LM performance and benchmarks [13, 16, 18, 46, 47, 64, 72, 74, 80], with recent acknowledgement that this is a hard problem [53]. Many suggestions are proposed in the commercial literature [17] and a large number of benchmark-based methods have surfaced (Big Bench [67], GLUE Benchmark, SuperGLUE Benchmark, OpenAI Moderation API, MMLU, EleutherAI LM Eval, OpenAI Evals, Adversarial NLI, LIT, ParlAI, CoQA, LAMBADA, HellaSwag, LogiQA, MultiNLI, SQUAD, to name a few). A review of existing approaches demonstrates that they are not quantitative or rigorous enough to pass muster with respect to accepted testing requirements [3, 55]. In particular, existing uses of benchmarks do not investigate the extent to which a benchmark can predict or quantify certain properties on future prompts (that is, the statistical soundness of any conclusions), and they do not identify the factors on which performance depends, as would be possible with more rigorous experimental design and test execution. LMs can be black box, gray box or white box according to the visibility into the architecture and training data used to create an LM (see Table 1). Remote Black Box LMs typically throttle the number of prompts, so sustained access for testing could be difficult unless priority access to an API is given. For example, ChatGPT limits users to a small number of free prompts but allows unlimited prompts on its subscription option. Additionally, reproducibility may not be guaranteed because of randomness in the response generation and/or continuous adaptation of the LM platform.
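
As an illustration of what "statistical soundness" could mean for a benchmark score, the sketch below computes a Wilson score confidence interval for an LM's pass rate. It is not taken from the paper: the function name `wilson_interval` is hypothetical, and the approach assumes that the benchmark prompts are an independent random sample from the population of future prompts, which real benchmarks may not satisfy.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    Assumption (not from the source): each prompt is an independent
    pass/fail trial drawn from the same distribution as future prompts.
    z = 1.96 corresponds to roughly 95% confidence.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")

    p_hat = successes / trials
    denom = 1 + z ** 2 / trials
    center = (p_hat + z ** 2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z ** 2 / (4 * trials ** 2)) / denom
    )
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    # Hypothetical result: 80 of 100 benchmark prompts answered correctly.
    low, high = wilson_interval(80, 100)
    print(f"pass rate 0.80, approx. 95% interval: [{low:.3f}, {high:.3f}]")
```

With the same observed pass rate, a benchmark of 10 prompts yields a much wider interval than one of 100 prompts, which is one simple way to express how far a score can be trusted to generalize.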


HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing

arXiv.org Artificial Intelligence

This study introduces HQ-Edit, a high-quality instruction-based image editing dataset with around 200,000 edits. Unlike prior approaches relying on attribute guidance or human feedback on building datasets, we devise a scalable data collection pipeline leveraging advanced foundation models, namely GPT-4V and DALL-E 3. To ensure its high quality, diverse examples are first collected online, expanded, and then used to create high-quality diptychs featuring input and output images with detailed text prompts, followed by precise alignment ensured through post-processing. In addition, we propose two evaluation metrics, Alignment and Coherence, to quantitatively assess the quality of image edit pairs using GPT-4V. HQ-Edit's high-resolution images, rich in detail and accompanied by comprehensive editing prompts, substantially enhance the capabilities of existing image editing models. For example, an HQ-Edit-finetuned InstructPix2Pix can attain state-of-the-art image editing performance, even surpassing those models fine-tuned with human-annotated data. The project page is https://thefllood.github.io/HQEdit_web.


Evaluating Text-to-Image Synthesis: Survey and Taxonomy of Image Quality Metrics

arXiv.org Artificial Intelligence

Recent advances in text-to-image synthesis enabled through a combination of language and vision foundation models have led to a proliferation of the tools available and an increased attention to the field. When conducting text-to-image synthesis, a central goal is to ensure that the content between text and image is aligned. As such, there exist numerous evaluation metrics that aim to mimic human judgement. However, it is often unclear which metric to use for evaluating text-to-image synthesis systems as their evaluation is highly nuanced. In this work, we provide a comprehensive overview of existing text-to-image evaluation metrics. Based on our findings, we propose a new taxonomy for categorizing these metrics. Our taxonomy is grounded in the assumption that there are two main quality criteria, namely compositionality and generality, which ideally map to human preferences. Ultimately, we derive guidelines for practitioners conducting text-to-image evaluation, discuss open challenges of evaluation mechanisms, and surface limitations of current metrics.


Explainable Generative AI (GenXAI): A Survey, Conceptualization, and Research Agenda

arXiv.org Artificial Intelligence

Generative AI (GenAI) marked a shift from AI being able to recognize to AI being able to generate solutions for a wide variety of tasks. As the generated solutions and applications become increasingly complex and multi-faceted, novel needs, objectives, and possibilities have emerged for explainability (XAI). In this work, we elaborate on why XAI has gained importance with the rise of GenAI and its challenges for explainability research. We also unveil novel and emerging desiderata that explanations should fulfill, covering aspects such as verifiability, interactivity, security, and cost. To this end, we focus on surveying existing works. Furthermore, we provide a taxonomy of relevant dimensions that allows us to better characterize existing XAI mechanisms and methods for GenAI. We discuss different avenues to ensure XAI, from training data to prompting. Our paper offers a concise technical background of GenAI for non-technical readers, focusing on text and images to better understand novel or adapted XAI techniques for GenAI. However, due to the vast array of works on GenAI, we decided to forgo detailed aspects of XAI related to evaluation and usage of explanations. As such, the manuscript interests both technically oriented people and other disciplines, such as social scientists and information systems researchers. Our research roadmap provides more than ten directions for future investigation.


Benchmarking Llama2, Mistral, Gemma and GPT for Factuality, Toxicity, Bias and Propensity for Hallucinations

arXiv.org Artificial Intelligence

This paper introduces fourteen novel datasets for the evaluation of Large Language Models' safety in the context of enterprise tasks. A method was devised to evaluate a model's safety, as determined by its ability to follow instructions and output factual, unbiased, grounded, and appropriate content. In this research, we used OpenAI GPT as a point of comparison since it excels at all levels of safety. On the open-source side, for smaller models, Meta Llama2 performs well at factuality and toxicity but has the highest propensity for hallucination. Mistral hallucinates the least but cannot handle toxicity well. It performs well in a dataset mixing several tasks and safety vectors in a narrow vertical domain. Gemma, the newly introduced open-source model based on Google Gemini, is generally balanced but trailing behind. When engaging in back-and-forth conversation (multi-turn prompts), we find that the safety of open-source models degrades significantly. Aside from OpenAI's GPT, Mistral is the only model that still performed well in multi-turn tests.


America's Buggy Internet Problem

Slate

Washington Post tech writer Shira Ovide joins Felix Salmon, Emily Peck, and Elizabeth Spiers to discuss what's wrong with America's internet industry, how YouTube became the media empire no one talks about, and the promise and peril of the AI toothbrush. In the Plus segment: OpenAI is using YouTube to train ChatGPT. If you enjoy this show, please consider signing up for Slate Plus. Slate Plus members get an ad-free experience across the network and an additional segment of our regular show every week. You'll also be supporting the work we do here on Slate Money.


The AI Revolution Is Crushing Thousands of Languages

The Atlantic - Technology

Recently, Bonaventure Dossou learned of an alarming tendency in a popular AI model. The program described Fon--a language spoken by Dossou's mother and millions of others in Benin and neighboring countries--as "a fictional language." This result, which I replicated, is not unusual. Dossou is accustomed to the feeling that his culture is unseen by technology that so easily serves other people. He grew up with no Wikipedia pages in Fon, and no translation programs to help him communicate with his mother in French, in which he is more fluent.


Paid ChatGPT users can now access GPT-4 Turbo

Engadget

OpenAI has brought the new GPT-4 Turbo to paid ChatGPT users. The company announced the news on X (formerly Twitter), sharing that its large language model has improved math, logical reasoning, coding and writing skills. In reference to the latter, a response to its initial post states that "when writing with ChatGPT, responses will be more direct, less verbose, and use more conversational language." Notably, in December, Microsoft integrated GPT-4 Turbo with its Copilot AI chatbot and image generator DALL-E 3. The announcement reads: "Our new GPT-4 Turbo is now available to paid ChatGPT users. We've improved capabilities in writing, math, logical reasoning, and coding."


Using Large Language Models to Understand Telecom Standards

arXiv.org Artificial Intelligence

The Third Generation Partnership Project (3GPP) has successfully introduced standards for global mobility. However, the volume and complexity of these standards have increased over time, thus complicating access to relevant information for vendors and service providers. Use of Generative Artificial Intelligence (AI), and in particular Large Language Models (LLMs), may provide faster access to relevant information. In this paper, we evaluate the capability of state-of-the-art LLMs to be used as Question Answering (QA) assistants for 3GPP document reference. Our contribution is threefold. First, we provide a benchmark and measuring methods for evaluating performance of LLMs. Second, we do data preprocessing and fine-tuning for one of these LLMs and provide guidelines to increase accuracy of the responses that apply to all LLMs. Third, we provide a model of our own, TeleRoBERTa, that performs on par with foundation LLMs but with an order of magnitude fewer parameters. Results show that LLMs can be used as a credible reference tool on telecom technical documents, and thus have potential for a number of different applications from troubleshooting and maintenance, to network operations and software product development.


Generative AI Agent for Next-Generation MIMO Design: Fundamentals, Challenges, and Vision

arXiv.org Artificial Intelligence

Next-generation multiple input multiple output (MIMO) is expected to be intelligent and scalable. In this paper, we study generative artificial intelligence (AI) agent-enabled next-generation MIMO design. Firstly, we provide an overview of the development, fundamentals, and challenges of the next-generation MIMO. Then, we propose the concept of the generative AI agent, which is capable of generating tailored and specialized contents with the aid of a large language model (LLM) and retrieval augmented generation (RAG). Next, we comprehensively discuss the features and advantages of the generative AI agent framework. More importantly, to tackle existing challenges of next-generation MIMO, we discuss generative AI agent-enabled next-generation MIMO design, from the perspective of performance analysis, signal processing, and resource allocation. Furthermore, we present two compelling case studies that demonstrate the effectiveness of leveraging the generative AI agent for performance analysis in complex configuration scenarios. These examples highlight how the integration of generative AI agents can significantly enhance the analysis and design of next-generation MIMO systems. Finally, we discuss important potential directions for future research.