Fang, Qixiang
PATCH -- Psychometrics-AssisTed benCHmarking of Large Language Models: A Case Study of Mathematics Proficiency
Fang, Qixiang, Oberski, Daniel L., Nguyen, Dong
Many existing benchmarks of large (multimodal) language models (LLMs) focus on measuring LLMs' academic proficiency, often with an additional interest in comparing model performance with that of human test takers. While these benchmarks have proven key to the development of LLMs, they suffer from several limitations, including questionable measurement quality (e.g., Do they measure what they are supposed to, in a reliable way?), lack of quality assessment at the item level (e.g., Are some items more important or difficult than others?) and unclear human population reference (e.g., To whom can the model be compared?). In response to these challenges, we propose bringing knowledge from psychometrics - a field dedicated to the measurement of latent variables like academic proficiency - into LLM benchmarking. We make three primary contributions. First, we introduce PATCH: a novel framework for Psychometrics-AssisTed benCHmarking of LLMs. PATCH addresses the aforementioned limitations, presenting a new direction for LLM benchmark research. Second, we implement PATCH by measuring GPT-4 and Gemini-Pro-Vision's proficiency in 8th grade mathematics against 56 human populations. We show that adopting a psychometrics-based approach yields evaluation outcomes that diverge from those based on existing benchmarking practices. Third, we release 4 datasets to support measuring and comparing LLM proficiency in grade school mathematics and science against human populations.
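To make the psychometric angle concrete, here is a minimal sketch of how an LLM's binary item scores could be turned into an ability estimate and located within a human population using a 2PL item response theory model. The item parameters, the response vector, and the standard-normal population distribution are illustrative assumptions, not the paper's actual data or implementation.

```python
# Minimal sketch (not the paper's implementation): score an LLM's item
# responses with a 2PL IRT model and place the ability estimate within an
# assumed human population. All numbers below are illustrative.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

def ability_estimate(responses, discrimination, difficulty):
    """Maximum-likelihood ability (theta) under a 2PL model."""
    responses = np.asarray(responses, dtype=float)

    def neg_log_likelihood(theta):
        p = 1.0 / (1.0 + np.exp(-discrimination * (theta - difficulty)))
        p = np.clip(p, 1e-9, 1 - 1e-9)
        return -np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))

    return minimize_scalar(neg_log_likelihood, bounds=(-4, 4), method="bounded").x

# Hypothetical calibrated item parameters and LLM item scores (1 = correct).
a = np.array([1.2, 0.8, 1.5, 1.0])    # discrimination
b = np.array([-0.5, 0.3, 1.1, 2.0])   # difficulty
llm_responses = [1, 1, 1, 0]

theta = ability_estimate(llm_responses, a, b)
# Percentile relative to an assumed human population with mean 0, sd 1.
print(f"theta = {theta:.2f}, percentile = {norm.cdf(theta):.1%}")
```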
USE: Dynamic User Modeling with Stateful Sequence Models
Zhou, Zhihan, Fang, Qixiang, Neves, Leonardo, Barbieri, Francesco, Liu, Yozen, Liu, Han, Bos, Maarten W., Dotsch, Ron
User embeddings play a crucial role in user engagement forecasting and personalized services. Recent advances in sequence modeling have sparked interest in learning user embeddings from behavioral data. Yet behavior-based user embedding learning faces the unique challenge of dynamic user modeling: as users continuously interact with an app, user embeddings should be periodically updated to account for their recent and long-term behavior patterns. Existing methods rely heavily on stateless sequence models that lack memory of historical behavior; they must either discard historical data and use only the most recent data, or reprocess the old and new data jointly. Both options incur substantial computational overhead. To address this limitation, we introduce User Stateful Embedding (USE). USE generates user embeddings that reflect users' evolving behavior without exhaustive reprocessing, by storing previous model states and revisiting them when new behavior arrives. Furthermore, we introduce a novel training objective named future W-behavior prediction, which transcends the limitations of next-token prediction by forecasting a broader horizon of upcoming user behaviors. Combining it with Same User Prediction, a contrastive learning-based objective that predicts whether different segments of behavior sequences belong to the same user, further improves the embeddings' distinctiveness and representativeness. We conducted experiments on 8 downstream tasks using Snapchat users' behavioral logs in both static (i.e., fixed user behavior sequences) and dynamic (i.e., periodically updated user behavior sequences) settings, and demonstrate USE's superior performance over established baselines. The results underscore USE's effectiveness and efficiency in integrating historical and recent user behavior sequences into user embeddings in dynamic user modeling.
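The stateful-update idea can be illustrated with a small recurrent encoder that caches its hidden state between updates, so new behaviors are folded into the user embedding without reprocessing old ones. This is only a sketch of the general mechanism; the class name, dimensions, and use of a GRU are assumptions, not the paper's architecture.

```python
# Illustrative sketch of stateful user-embedding updates (not USE itself):
# cache the encoder state after each batch of behaviors so later behaviors
# can be incorporated without reprocessing the full history.
import torch
import torch.nn as nn

class StatefulUserEncoder(nn.Module):
    def __init__(self, num_events=1000, dim=128):
        super().__init__()
        self.embed = nn.Embedding(num_events, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, event_ids, state=None):
        # event_ids: (batch, seq_len) behavior tokens; state: cached GRU state.
        x = self.embed(event_ids)
        out, new_state = self.rnn(x, state)
        user_embedding = out[:, -1]        # embedding after the latest behavior
        return user_embedding, new_state   # persist new_state for the next update

encoder = StatefulUserEncoder()
day1 = torch.randint(0, 1000, (4, 50))      # hypothetical behavior sequences
emb, state = encoder(day1)                  # initial embedding + cached state
day2 = torch.randint(0, 1000, (4, 20))      # only the new behaviors
emb, state = encoder(day2, state.detach())  # update without reprocessing day1
```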
Missing Information, Unresponsive Authors, Experimental Flaws: The Impossibility of Assessing the Reproducibility of Previous Human Evaluations in NLP
Belz, Anya, Thomson, Craig, Reiter, Ehud, Abercrombie, Gavin, Alonso-Moral, Jose M., Arvan, Mohammad, Braggaar, Anouck, Cieliebak, Mark, Clark, Elizabeth, van Deemter, Kees, Dinkar, Tanvi, Dušek, Ondřej, Eger, Steffen, Fang, Qixiang, Gao, Mingqi, Gatt, Albert, Gkatzia, Dimitra, González-Corbelle, Javier, Hovy, Dirk, Hürlimann, Manuela, Ito, Takumi, Kelleher, John D., Klubicka, Filip, Krahmer, Emiel, Lai, Huiyuan, van der Lee, Chris, Li, Yiru, Mahamood, Saad, Mieskes, Margot, van Miltenburg, Emiel, Mosteiro, Pablo, Nissim, Malvina, Parde, Natalie, Plátek, Ondřej, Rieser, Verena, Ruan, Jie, Tetreault, Joel, Toral, Antonio, Wan, Xiaojun, Wanner, Leo, Watson, Lewis, Yang, Diyi
We report our efforts in identifying a set of previous human evaluations in NLP that would be suitable for a coordinated study examining what makes human evaluations in NLP more/less reproducible. We present our results and findings, which include that just 13% of papers had (i) sufficiently low barriers to reproduction, and (ii) enough obtainable information, to be considered for reproduction, and that all but one of the experiments we selected for reproduction were found to have flaws that made the meaningfulness of conducting a reproduction questionable. As a result, we had to change our coordinated study design from a reproduce approach to a standardise-then-reproduce-twice approach. Our overall (negative) finding, that the great majority of human evaluations in NLP are not repeatable and/or not reproducible and/or too flawed to justify reproduction, paints a dire picture, but presents an opportunity for a rethink about how to design and report human evaluations in NLP.
On Text-based Personality Computing: Challenges and Future Directions
Fang, Qixiang, Giachanou, Anastasia, Bagheri, Ayoub, Boeschoten, Laura, van Kesteren, Erik-Jan, Kamalabad, Mahdi Shafiee, Oberski, Daniel L.
Text-based personality computing (TPC) has gained considerable research interest in NLP. In this paper, we describe 15 challenges that we consider deserving of the research community's attention. These challenges are organized around the following topics: personality taxonomies, measurement quality, datasets, performance evaluation, modelling choices, as well as ethics and fairness. For each challenge, we combine perspectives from both NLP and the social sciences and offer concrete suggestions. We hope to inspire more valid and reliable TPC research.
Epicurus at SemEval-2023 Task 4: Improving Prediction of Human Values behind Arguments by Leveraging Their Definitions
Fang, Christian, Fang, Qixiang, Nguyen, Dong
We describe our experiments for SemEval-2023 Task 4 on the identification of human values behind arguments (ValueEval). Because human values are subjective concepts which require precise definitions, we hypothesize that incorporating the definitions of human values (in the form of annotation instructions and validated survey items) during model training can yield better prediction performance. We explore this idea and show that our proposed models perform better than the challenge organizers' baselines, with improvements in macro F1 scores of up to 18%.
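The core idea of pairing an argument with a value definition can be sketched as a sequence-pair classification problem, where the model predicts whether the defined value lies behind the argument. The model name, example argument, and definition below are hypothetical; the paper's exact setup (annotation instructions, validated survey items, fine-tuning details) differs.

```python
# Illustrative sketch: encode (argument, value definition) as a sequence pair
# and predict whether the value is present. The checkpoint here is an
# untuned base model; in practice it would be fine-tuned on ValueEval data.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

argument = "We should subsidize public transport to reduce emissions."
value_definition = "Universalism: concern for the welfare of all people and for nature."

# The definition is given to the model alongside the argument itself.
inputs = tokenizer(argument, value_definition, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
prob_value_present = torch.softmax(logits, dim=-1)[0, 1].item()
print(prob_value_present)
```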
Modelling Stance Detection as Textual Entailment Recognition and Leveraging Measurement Knowledge from Social Sciences
Fang, Qixiang, Giachanou, Anastasia, Bagheri, Ayoub
Stance detection (SD) can be considered a special case of textual entailment recognition (TER), a generic natural language task. Modelling SD as TER may offer benefits like more training data and a more general learning scheme. In this paper, we present an initial empirical analysis of this approach. We apply it to a difficult but relevant test case where no existing labelled SD dataset is available, because this is where modelling SD as TER may be especially helpful. We also leverage measurement knowledge from social sciences to improve model performance. We discuss our findings and suggest future research directions.
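As a rough illustration of casting stance detection as entailment recognition, an off-the-shelf NLI model can score stance-bearing hypotheses against the input text. The hypothesis wording, labels, and model choice below are illustrative assumptions, not the setup used in the paper.

```python
# Sketch: stance detection via textual entailment, using the zero-shot
# classification pipeline (which is built on NLI entailment scoring).
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

text = "Wearing masks in public should remain mandatory."
target = "mask mandates"

result = classifier(
    text,
    candidate_labels=["favor", "against", "neutral"],
    hypothesis_template=f"The author's stance towards {target} is {{}}.",
)
print(result["labels"][0], round(result["scores"][0], 3))
```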
Evaluating the Construct Validity of Text Embeddings with Application to Survey Questions
Fang, Qixiang, Nguyen, Dong, Oberski, Daniel L.
Text embedding models from Natural Language Processing can map text data (e.g. words, sentences, documents) to supposedly meaningful numerical representations (a.k.a. text embeddings). While such models are increasingly applied in social science research, one important issue is often not addressed: the extent to which these embeddings are valid representations of constructs relevant for social science research. We therefore propose the use of the classic construct validity framework to evaluate the validity of text embeddings. We show how this framework can be adapted to the opaque and high-dimensional nature of text embeddings, with application to survey questions. We include several popular text embedding methods (e.g. fastText, GloVe, BERT, Sentence-BERT, Universal Sentence Encoder) in our construct validity analyses. We find evidence of convergent and discriminant validity in some cases. We also show that embeddings can be used to predict respondents' answers to completely new survey questions. Furthermore, BERT-based embedding techniques and the Universal Sentence Encoder provide more valid representations of survey questions than do others. Our results thus highlight the necessity to examine the construct validity of text embeddings before deploying them in social science research.
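A bare-bones version of a convergent/discriminant validity check on embeddings: items written to measure the same construct should be more similar to one another than to items measuring a different construct. The survey items and embedding model below are illustrative assumptions; the paper's analyses are considerably more involved.

```python
# Sketch of a convergent/discriminant validity check on text embeddings
# (illustrative items and model, not the paper's full analysis).
from itertools import combinations
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")

constructs = {
    "life_satisfaction": [
        "In most ways my life is close to my ideal.",
        "I am satisfied with my life.",
    ],
    "political_trust": [
        "I trust the national parliament.",
        "I trust politicians to act in the public interest.",
    ],
}
emb = {c: model.encode(items) for c, items in constructs.items()}

# Convergent validity: similarity of items within the same construct.
for c, e in emb.items():
    print(c, "within-construct similarity:", cosine_similarity(e)[0, 1].round(2))

# Discriminant validity: similarity of items across different constructs.
for (c1, e1), (c2, e2) in combinations(emb.items(), 2):
    print(c1, "vs", c2, "cross-construct similarity:",
          cosine_similarity(e1, e2).mean().round(2))
```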