AITopics | Li, Zhaolin

Collaborating Authors

Li, Zhaolin

SciEx: Benchmarking Large Language Models on Scientific Exams with Human Expert Grading and Automatic Grading

Dinh, Tu Anh, Mullov, Carlos, Bärmann, Leonard, Li, Zhaolin, Liu, Danni, Reiß, Simon, Lee, Jueun, Lerzer, Nathan, Ternava, Fabian, Gao, Jianfeng, Röddiger, Tobias, Waibel, Alexander, Asfour, Tamim, Beigl, Michael, Stiefelhagen, Rainer, Dachsbacher, Carsten, Böhm, Klemens, Niehues, Jan

arXiv.org Artificial Intelligence, Jul-12-2024

With the rapid development of Large Language Models (LLMs), it is crucial to have benchmarks that can evaluate the ability of LLMs in different domains. One common use of LLMs is performing tasks on scientific topics, such as writing algorithms, querying databases, or giving mathematical proofs. Inspired by the way university students are evaluated on such tasks, in this paper we propose SciEx, a benchmark consisting of university computer science exam questions, to evaluate LLMs' ability to solve scientific tasks. SciEx is (1) multilingual, containing both English and German exams, (2) multi-modal, containing questions that involve images, and (3) varied, containing different types of free-form questions with different difficulty levels, due to the nature of university exams. We evaluate the performance of various state-of-the-art LLMs on our new benchmark. Since SciEx questions are free-form, it is not straightforward to evaluate LLM performance. Therefore, we provide human expert grading of the LLM outputs on SciEx. We show that the free-form exams in SciEx remain challenging for current LLMs: the best LLM achieves an average exam grade of only 59.4%. We also provide detailed comparisons between LLM performance and student performance on SciEx. To enable future evaluation of new LLMs, we propose using LLM-as-a-judge to grade the LLM answers on SciEx. Our experiments show that, although they do not solve the exams perfectly, LLMs are decent graders, achieving a Pearson correlation of 0.948 with expert grading.
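The agreement measure quoted in the abstract, Pearson correlation between LLM-judge grades and expert grades, can be illustrated with a minimal sketch. The function name `pearson` and the two grade lists are hypothetical and invented for illustration only; they are not data or code from the paper, and the paper's actual grading pipeline is not described here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical grades for five exam answers (illustrative values only).
expert_grades = [4.0, 7.5, 2.0, 9.0, 6.0]
judge_grades = [4.5, 7.0, 2.5, 8.5, 6.5]

r = pearson(expert_grades, judge_grades)
print(f"Pearson r between judge and expert grades: {r:.3f}")
```

A value close to 1 means the judge's grades rise and fall with the expert's grades; it does not by itself show that the two assign the same absolute scores.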

large language model, machine learning, natural language

2406.10421

Country:

Europe (0.93)
North America > United States > Pennsylvania (0.14)

Genre: Research Report (0.50)

Industry:

Education > Assessment & Standards > Student Performance (0.69)
Education > Educational Setting (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)

End-to-End Evaluation for Low-Latency Simultaneous Speech Translation

Huber, Christian, Dinh, Tu Anh, Mullov, Carlos, Pham, Ngoc Quan, Nguyen, Thai Binh, Retkowski, Fabian, Constantin, Stefan, Ugan, Enes Yavuz, Liu, Danni, Li, Zhaolin, Koneru, Sai, Niehues, Jan, Waibel, Alexander

arXiv.org Artificial Intelligence, Oct-23-2023

The challenge of low-latency speech translation has recently drawn significant interest in the research community, as shown by several publications and shared tasks. Therefore, it is essential to evaluate these different approaches in realistic scenarios. However, currently only specific aspects of the systems are evaluated, and it is often not possible to compare different approaches. In this work, we propose the first framework to perform and evaluate the various aspects of low-latency speech translation under realistic conditions. The evaluation is carried out in an end-to-end fashion, which includes the segmentation of the audio as well as the run-time of the different components. Using this framework, we then compare different approaches to low-latency speech translation. We evaluate models with the option to revise the output as well as methods with fixed output. Furthermore, we directly compare state-of-the-art cascaded and end-to-end systems. Finally, the framework automatically evaluates translation quality as well as latency, and also provides a web interface to show the low-latency model outputs to the user.

artificial intelligence, natural language, translation

2308.03415

Country:

Europe (0.93)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.28)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
