MMMG: a Comprehensive and Reliable Evaluation Suite for Multitask Multimodal Generation

arXiv.org Artificial Intelligence

Automatically evaluating multimodal generation presents a significant challenge, as automated metrics often struggle to align reliably with human evaluation, especially for complex tasks that involve multiple modalities. To address this, we present MMMG, a comprehensive and human-aligned benchmark for multimodal generation across 4 modality combinations (image, audio, interleaved text and image, interleaved text and audio), with a focus on tasks that present significant challenges for generation models, while still enabling reliable automatic evaluation through a combination of models and programs. MMMG encompasses 49 tasks (including 29 newly developed ones), each with a carefully designed evaluation pipeline, and 937 instructions to systematically assess reasoning, controllability, and other key capabilities of multimodal generation models. Extensive validation demonstrates that MMMG is highly aligned with human evaluation, achieving an average agreement of 94.3%. Benchmarking results on 24 multimodal generation models reveal that even though the state-of-the-art model, GPT Image, achieves 78.3% accuracy for image generation, it falls short on multimodal reasoning and interleaved generation. Furthermore, results suggest considerable headroom for improvement in audio generation, highlighting an important direction for future research.


The application of GPT-4 in grading design university students' assignment and providing feedback: An exploratory study

arXiv.org Artificial Intelligence

This study investigates whether GPT-4 can effectively grade assignments for university design students and provide useful feedback. In design education, assignments do not have a single correct answer and often involve solving an open-ended design problem. This subjective nature of design projects often leads to grading problems, as grades can vary between raters, for instance between instructors with an engineering background and those with an architecture background. This study employs an iterative research approach to develop a Custom GPT, with the aim of achieving more reliable results and testing whether it can provide design students with constructive feedback. The findings are as follows. First, through several rounds of iteration, the inter-rater reliability between GPT and human raters reached a level that is generally accepted by educators. This indicates that, by providing accurate prompts to GPT and continuously iterating to build a Custom GPT, it can be used to effectively grade students' design assignments, serving as a reliable complement to human raters. Second, the intra-rater reliability of GPT's scoring at different times is between 0.65 and 0.78. This indicates that, with adequate instructions, a Custom GPT gives consistent results, which is a precondition for grading students. As consistency and comparability are the two main rules for ensuring the reliability of educational assessment, this study examined whether a Custom GPT can be developed that adheres to these two rules. We finish the paper by testing whether a Custom GPT can provide students with useful feedback and reflecting on how educators can develop and iterate a Custom GPT to serve as a complementary rater.


Testing a Bayesian Measure of Representativeness Using a Large Image Database

Neural Information Processing Systems

How do people determine which elements of a set are most representative of that set? We extend an existing Bayesian measure of representativeness, which indicates the representativeness of a sample from a distribution, to define a measure of the representativeness of an item to a set. We show that this measure is formally related to a machine learning method known as Bayesian Sets. Building on this connection, we derive an analytic expression for the representativeness of objects described by a sparse vector of binary features. We then apply this measure to a large database of images, using it to determine which images are the most representative members of different sets. Comparing the resulting predictions to human judgments of representativeness provides a test of this measure with naturalistic stimuli, and illustrates how databases that are more commonly used in computer vision and machine learning can be used to evaluate psychological theories.
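The abstract above mentions Bayesian Sets and objects described by sparse binary features. As a rough illustration only, the sketch below computes the standard Bayesian Sets log-score under a Beta-Bernoulli model. It is not the paper's representativeness measure, which the abstract describes only as formally related; the function name, the toy data, and the prior hyperparameters `alpha` and `beta` are all assumptions made for this example.

```python
import math

def bayesian_sets_score(query_set, item, alpha, beta):
    """Log of p(item | query_set) / p(item) under independent Beta-Bernoulli features.

    query_set: list of binary feature vectors (lists of 0/1) defining the set
    item:      binary feature vector to score against the set
    alpha, beta: per-feature Beta prior hyperparameters (assumed, not from the source)
    """
    n = len(query_set)
    score = 0.0
    for j in range(len(item)):
        count = sum(x[j] for x in query_set)   # how often feature j is on in the set
        alpha_post = alpha[j] + count          # posterior Beta parameters
        beta_post = beta[j] + n - count
        # Normalising term shared by both feature values
        score += math.log(alpha[j] + beta[j]) - math.log(alpha[j] + beta[j] + n)
        if item[j]:
            score += math.log(alpha_post) - math.log(alpha[j])
        else:
            score += math.log(beta_post) - math.log(beta[j])
    return score

# Toy data: two set members that share feature 0
toy_set = [[1, 1, 0], [1, 0, 0]]
prior = [1.0, 1.0, 1.0]
typical = bayesian_sets_score(toy_set, [1, 0, 0], prior, prior)
atypical = bayesian_sets_score(toy_set, [0, 1, 1], prior, prior)
```

In this toy setting the item that shares the set's common feature receives the higher score, which is the sense in which such a score can rank items by how well they fit a set.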


What's The Difference Between Artificial Intelligence In Film and its Limitations in Real Life?

#artificialintelligence

Artificial Intelligence is one of the most misunderstood technological innovations to ever be presented to the general public. Movies like The Terminator represent AI as monstrous killing machines that take pleasure in wiping out all of humanity because it's the "logical" thing to do. Star Trek Nemesis represents the Borg as a sentient species of AI that hijacks the human body and bends its victims to its will. Ultron was presented in the Avengers as nothing more than a destructive force that saw humanity as evil. The problem with these portrayals of Artificial Intelligence is that nothing could be further from the truth.


Generative AI (2/2): what will the future look like?

#artificialintelligence

In my previous article about Generative AI, I tried to set out the basics by giving a definition of this new technology trend and explaining its use cases and how the underlying algorithms work. Now I want to expand a little on which types of actors will emerge from this trend, what the opportunities for entrepreneurs are, and what challenges they will face. Most industries could be impacted by Generative AI, but some more than others. Here I'll discuss some of the most important ones from my point of view. Copywriting is the most obvious and notorious usage of Generative AI.


AutoReply: Detecting Nonsense in Dialogue Introspectively with Discriminative Replies

arXiv.org Artificial Intelligence

Existing approaches build separate classifiers to detect nonsense in dialogues. In this paper, we show that without external classifiers, dialogue models can detect errors in their own messages introspectively, by calculating the likelihood of replies that are indicative of poor messages. For example, if an agent believes its partner is likely to respond "I don't understand" to a candidate message, that message may not make sense, so an alternative message should be chosen. We evaluate our approach on a dataset from the game Diplomacy, which contains long dialogues richly grounded in the game state, on which existing models make many errors. We first show that hand-crafted replies can be effective for the task of detecting nonsense in applications as complex as Diplomacy. We then design AutoReply, an algorithm to search for such discriminative replies automatically, given a small number of annotated dialogue examples. We find that AutoReply-generated replies outperform hand-crafted replies and perform on par with carefully fine-tuned large supervised models. Results also show that a single reply, without much computational overhead, can detect dialogue nonsense reasonably well.
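The idea in the abstract above, scoring a candidate message by how likely the partner is to answer with a reply such as "I don't understand", can be sketched roughly as follows. This is a minimal illustration, not the AutoReply algorithm: the `reply_logprob` callable stands in for a dialogue model, and the max-over-replies rule, the threshold, and all names are assumptions made for this example.

```python
from typing import Callable, Optional, Sequence

# Hypothetical signature: log-probability of `reply` following `candidate` in `context`
ReplyLogProb = Callable[[str, str, str], float]

def flag_nonsense(context: str, candidate: str, reply_logprob: ReplyLogProb,
                  discriminative_replies: Sequence[str], threshold: float) -> bool:
    """Flag the candidate if any discriminative reply is too likely to follow it."""
    highest = max(reply_logprob(context, candidate, r) for r in discriminative_replies)
    return highest > threshold

def choose_message(context: str, candidates: Sequence[str], reply_logprob: ReplyLogProb,
                   discriminative_replies: Sequence[str], threshold: float) -> Optional[str]:
    """Return the first candidate that is not flagged, or None if all are flagged."""
    for candidate in candidates:
        if not flag_nonsense(context, candidate, reply_logprob,
                             discriminative_replies, threshold):
            return candidate
    return None

def toy_reply_logprob(context: str, candidate: str, reply: str) -> float:
    """Stand-in for a real dialogue model: treats messages mentioning 'banana' as confusing."""
    return -1.0 if "banana" in candidate else -10.0

replies = ["I don't understand"]
chosen = choose_message("Shall we ally against France?",
                        ["The banana fleet moves to Paris.", "Yes, I will support you."],
                        toy_reply_logprob, replies, threshold=-5.0)
```

With a real model, `toy_reply_logprob` would be replaced by the model's own likelihood of the reply given the dialogue so far, so no separate classifier is needed.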


Ubitec Is A Good Example Of A NLU Agnostic Platform

#artificialintelligence

Ubitec is an Austria-based chatbot development framework with voice capabilities. Their focus is on on-premise installation and language-specific implementations. I first came across Ubitec while gleaning insights from the Gartner Peer Reviews Of Conversational AI Platforms. In the peer reviews, there are some fairly unknown frameworks including Ubitec, Laiyle Chatbot and others. Currently the Ubitec Bot Framework is doing most of its work for government organisations and the like, and Ubitec does virtually all of its work in the EMEA region.


Robots? Some Companies Find Only Humans Can Do the Job

WSJ.com: WSJD - Technology

Among the disenchanted, FedEx Corp. said last month it was powering down Roxo, its last-mile delivery robot, to prioritize several "nearer-term opportunities," a spokeswoman said. Amazon.com Inc. said it was ending field tests of Scout, its home-delivery robot, after learning that some aspects of its "unique delivery experience" weren't "meeting customers' needs," a company spokeswoman said. And over the summer, DoorDash Inc. said it was shutting down its Chowbotics business--best known for Sally, the salad-making robot--roughly 18 months after buying it. "While we gained valuable insights into how to better serve this market, we concluded our current approach was not meeting our very high thresholds for continued investment," a DoorDash spokesman said. Companies have entertained hopes that the growing variety of robots could help them not only weather the worker shortage, but speed up labor-intensive tasks, improve customer service by reducing the number of things the human workers have to do, and as an added bonus, position their brands as innovative and forward-leaning.


How Brands Can Drive Personalization at Scale

#artificialintelligence

In year three of a global pandemic, consumers want businesses to be more empathetic. In response, brands are aligning more closely with changing consumer preferences. To deliver a superior customer experience, several B2C brands are investing in technologies that enable personalization. But doing this at scale, with millions of consumers, is quite a challenge. Conversational AI helps brands navigate this.


5 Ways AI is Changing Public Relations

#artificialintelligence

The application of AI technologies has driven a revolution in business. In the public relations industry, AI applications are also changing the established workflow. Nowadays, to be a good PR practitioner, an employee needs not only well-practiced communication skills but also the ability to collaborate with AI-based platforms. This article presents this new vision of public relations, exploring the 5 ways AI is changing the industry. With the digital transformation of business, social media has now become a critical battlefield for business operations.