AITopics | Wang, Daling

Plotting

Wang, Daling

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Evaluate What You Can't Evaluate: Unassessable Quality for Generated Response

Liu, Yongkang, Feng, Shi, Wang, Daling, Zhang, Yifei, Schütze, Hinrich

arXiv.org Artificial IntelligenceNov-14-2023

LLMs (large language models) such as ChatGPT have shown remarkable language understanding and generation capabilities. Although reference-free evaluators based on LLMs show better human alignment than traditional reference-based evaluators, there are many challenges in using reference-free evaluators based on LLMs. Reference-free evaluators are more suitable for open-ended examples with different semantics responses. But not all examples are open-ended. For closed-ended examples with unique correct semantic response, reference-free evaluators will still consider it high quality when giving a response that is inconsistent with the facts and the semantic of reference. In order to comprehensively evaluate the reliability of evaluators based on LLMs, we construct two adversarial meta-evaluation dialogue generation datasets KdConv-ADV and DSTC7-ADV based on KdConv and DSTC7-AVSD, respectively. Compared to previous meta-evaluation benchmarks, KdConv-ADV and DSTC7-ADV are much more challenging since they requires evaluators to be able to reasonably evaluate closed-ended examples with the help of external knowledge or even its own knowledge. Empirical results show that the ability of LLMs to identify unreasonable responses is insufficient. There are risks in using eference-free evaluators based on LLMs to evaluate the quality of dialogue responses.

large language model, machine learning, natural language, (22 more...)

arXiv.org Artificial Intelligence

2305.14658

Country:

Asia > Japan > Honshū (0.14)
Asia > Japan > Shikoku > Ehime Prefecture (0.14)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.72)

Add feedback

MM-BigBench: Evaluating Multimodal Models on Multimodal Content Comprehension Tasks

Yang, Xiaocui, Wu, Wenfang, Feng, Shi, Wang, Ming, Wang, Daling, Li, Yang, Sun, Qi, Zhang, Yifei, Fu, Xiaoming, Poria, Soujanya

arXiv.org Artificial IntelligenceOct-13-2023

The popularity of multimodal large language models (MLLMs) has triggered a recent surge in research efforts dedicated to evaluating these models. Nevertheless, existing evaluation studies of MLLMs primarily focus on the comprehension and reasoning of unimodal (vision) content, neglecting performance evaluations in the domain of multimodal (vision-language) content understanding. Beyond multimodal reasoning, tasks related to multimodal content comprehension necessitate a profound understanding of multimodal contexts, achieved through the multimodal interaction to obtain a final answer. In this paper, we introduce a comprehensive assessment framework called MM-BigBench, which incorporates a diverse range of metrics to offer an extensive evaluation of the performance of various models and instructions across a wide spectrum of diverse multimodal content comprehension tasks. Consequently, our work complements research on the performance of MLLMs in multimodal comprehension tasks, achieving a more comprehensive and holistic evaluation of MLLMs. To begin, we employ the Best Performance metric to ascertain each model's performance upper bound on different datasets. Subsequently, the Mean Relative Gain metric offers an assessment of the overall performance of various models and instructions, while the Stability metric measures their sensitivity. Furthermore, previous research centers on evaluating models independently or solely assessing instructions, neglecting the adaptability between models and instructions. We propose the Adaptability metric to quantify the adaptability between models and instructions. Our paper evaluates a total of 20 language models (14 MLLMs) on 14 multimodal datasets spanning 6 tasks, with 10 instructions for each task, and derives novel insights. Our code will be released at https://github.com/declare-lab/MM-BigBench.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2310.09036

Country:

North America > United States > Louisiana (0.14)
Europe > Switzerland > Zürich > Zürich (0.14)
Europe > Germany > Lower Saxony > Gottingen (0.14)
Asia > Middle East > UAE (0.14)

Genre:

Overview (0.67)
Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Add feedback

T-COL: Generating Counterfactual Explanations for General User Preferences on Variable Machine Learning Systems

Wang, Ming, Wang, Daling, Wu, Wenfang, Feng, Shi, Zhang, Yifei

arXiv.org Artificial IntelligenceSep-27-2023

Machine learning (ML) based systems have been suffering a lack of interpretability. To address this problem, counterfactual explanations (CEs) have been proposed. CEs are unique as they provide workable suggestions to users, in addition to explaining why a certain outcome was predicted. However, the application of CEs has been hindered by two main challenges, namely general user preferences and variable ML systems. User preferences, in particular, tend to be general rather than specific feature values. Additionally, CEs need to be customized to suit the variability of ML models, while also maintaining robustness even when these validation models change. To overcome these challenges, we propose several possible general user preferences that have been validated by user research and map them to the properties of CEs. We also introduce a new method called \uline{T}ree-based \uline{C}onditions \uline{O}ptional \uline{L}inks (T-COL), which has two optional structures and several groups of conditions for generating CEs that can be adapted to general user preferences. Meanwhile, a group of conditions lead T-COL to generate more robust CEs that have higher validity when the ML model is replaced. We compared the properties of CEs generated by T-COL experimentally under different user preferences and demonstrated that T-COL is better suited for accommodating user preferences and variable ML systems compared to baseline methods including Large Language Models.

artificial intelligence, generating counterfactual explanation, natural language, (3 more...)

arXiv.org Artificial Intelligence

2309.16146

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Few-shot Multimodal Sentiment Analysis based on Multimodal Probabilistic Fusion Prompts

Yang, Xiaocui, Feng, Shi, Wang, Daling, Hong, Pengfei, Poria, Soujanya

arXiv.org Artificial IntelligenceAug-1-2023

Multimodal sentiment analysis has gained significant attention due to the proliferation of multimodal content on social media. However, existing studies in this area rely heavily on large-scale supervised data, which is time-consuming and labor-intensive to collect. Thus, there is a need to address the challenge of few-shot multimodal sentiment analysis. To tackle this problem, we propose a novel method called Multimodal Probabilistic Fusion Prompts (MultiPoint) that leverages diverse cues from different modalities for multimodal sentiment detection in the few-shot scenario. Specifically, we start by introducing a Consistently Distributed Sampling approach called CDS, which ensures that the few-shot dataset has the same category distribution as the full dataset. Unlike previous approaches primarily using prompts based on the text modality, we design unified multimodal prompts to reduce discrepancies between different modalities and dynamically incorporate multimodal demonstrations into the context of each multimodal instance. To enhance the model's robustness, we introduce a probabilistic fusion method to fuse output predictions from multiple diverse prompts for each input. Our extensive experiments on six datasets demonstrate the effectiveness of our approach. First, our method outperforms strong baselines in the multimodal few-shot setting. Furthermore, under the same amount of data (1% of the full dataset), our CDS-based experimental results significantly outperform those based on previously sampled datasets constructed from the same number of instances of each class.

artificial intelligence, natural language, social media, (18 more...)

arXiv.org Artificial Intelligence

doi: 10.1145/3581783.3612181

2211.06607

Country:

Europe (0.67)
North America > United States > Louisiana (0.14)
Asia > Middle East > UAE (0.14)

Genre: Research Report > Promising Solution (0.66)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Extraction (0.94)
Information Technology > Artificial Intelligence > Natural Language > Discourse & Dialogue (0.94)

Add feedback

RoCar: A Relationship Network-based Evaluation Method to Large Language Models

Wang, Ming, Wu, Wenfang, Gao, Chongyun, Wang, Daling, Feng, Shi, Zhang, Yifei

arXiv.org Artificial IntelligenceJul-29-2023

Pre-trained Models have become the dominant approach in the field of deep learning since Transformer [1]. Buy now, the Large Language Models (LLMs) represented by ChatGPT [2] have received the widest attention from researchers in the field of Artificial Intelligence (AI), especially Natural Language Processing (NLP). Like LLaMA [3], many open-source LLMs [4, 5, 6, 3, 7, 8] have been published. Due to the strong reasoning, generative and memory abilities acquired by LLMs during training, they are able to operate a variety of traditional tasks based on specific prompts and achieve great performance. As a result, LLMs have gained widespread interest and applications, such as in the financial [9], emotional [10, 11], legal [12], medical [13, 14, 15] and educational [16] fields. To evaluate the capability of LLMs and to guide the selection of more appropriate LLMs in applications, many evaluation approaches [17] for LLMs have been proposed by researchers. C-Eval [18] constructed a reasoning test set of 13,948 questions in 52 subjects ranging from junior school to postgraduate university and vocational exams to evaluate LLM's problem-solving skills. Gaokao-Bench [19] collected questions from the 2010-2022 Chinese national college entrance examination papers, including 1,781 objective questions and 1,030 subjective questions, and constructed a framework for assessing the language comprehension and logical reasoning ability of LLMs. Microsoft has released a new benchmark test, AGIEval [20], by selecting 20 official, public, high-standard exams, including general university entrance exams (Chinese national college entrance examination and the U.S. SAT), law school entrance exams, maths competitions, bar exams, national civil service exams, and more.

artificial intelligence, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2307.15997

Country: Asia > China (0.14)

Genre: Research Report (0.66)

Industry:

Education > Assessment & Standards (0.94)
Education > Educational Setting > Higher Education (0.54)
Education > Curriculum > Subject-Specific Education (0.54)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

PVGRU: Generating Diverse and Relevant Dialogue Responses via Pseudo-Variational Mechanism

Liu, Yongkang, Feng, Shi, Wang, Daling, Zhang, Yifei, Schütze, Hinrich

arXiv.org Artificial IntelligenceMay-16-2023

We investigate response generation for multi-turn dialogue in generative-based chatbots. Existing generative models based on RNNs (Recurrent Neural Networks) usually employ the last hidden state to summarize the sequences, which makes models unable to capture the subtle variability observed in different dialogues and cannot distinguish the differences between dialogues that are similar in composition. In this paper, we propose a Pseudo-Variational Gated Recurrent Unit (PVGRU) component without posterior knowledge through introducing a recurrent summarizing variable into the GRU, which can aggregate the accumulated distribution variations of subsequences. PVGRU can perceive the subtle semantic variability through summarizing variables that are optimized by the devised distribution consistency and reconstruction objectives. In addition, we build a Pseudo-Variational Hierarchical Dialogue (PVHD) model based on PVGRU. Experimental results demonstrate that PVGRU can broadly improve the diversity and relevance of responses on two benchmark datasets.

artificial intelligence, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2212.09086

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

A Graph Reasoning Network for Multi-turn Response Selection via Customized Pre-training

Liu, Yongkang, Feng, Shi, Wang, Daling, Song, Kaisong, Ren, Feiliang, Zhang, Yifei

arXiv.org Artificial IntelligenceJan-14-2021

We investigate response selection for multi-turn conversation in retrieval-based chatbots. Existing studies pay more attention to the matching between utterances and responses by calculating the matching score based on learned features, leading to insufficient model reasoning ability. In this paper, we propose a graph-reasoning network (GRN) to address the problem. GRN first conducts pre-training based on ALBERT using next utterance prediction and utterance order prediction tasks specifically devised for response selection. These two customized pre-training tasks can endow our model with the ability of capturing semantical and chronological dependency between utterances. We then fine-tune the model on an integrated network with sequence reasoning and graph reasoning structures. The sequence reasoning module conducts inference based on the highly summarized context vector of utterance-response pairs from the global perspective. The graph reasoning module conducts the reasoning on the utterance-level graph neural network from the local perspective. Experiments on two conversational reasoning datasets show that our model can dramatically outperform the strong baseline methods and can achieve performance which is close to human-level.

neural network, survey article, utterance, (17 more...)

arXiv.org Artificial Intelligence

2012.11099

Country: Asia > China (0.28)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)

Add feedback