 Instructional Material


Hands-On Tutorial: Labeling with LLM and Human-in-the-Loop

arXiv.org Artificial Intelligence

Training and deploying machine learning models relies on a large amount of human-annotated data. As human labeling becomes increasingly expensive and time-consuming, recent research has developed multiple strategies to speed up annotation and reduce costs and human workload: generating synthetic training data, active learning, and hybrid labeling. This tutorial is oriented toward practical applications: we will present the basics of each strategy, highlight their benefits and limitations, and discuss in detail real-life case studies. Additionally, we will walk through best practices for managing human annotators and controlling the quality of the final dataset. The tutorial includes a hands-on workshop, where attendees will be guided in implementing a hybrid annotation setup. This tutorial is designed for NLP practitioners from both research and industry backgrounds who are involved in or interested in optimizing data labeling projects.
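As a rough illustration of the hybrid labeling idea mentioned above, the sketch below accepts model-generated labels when the model is confident and sends the rest to human annotators. It is a minimal sketch and not the tutorial's own setup: the `Prediction` class, the `route_for_review` function, and the 0.8 threshold are all hypothetical, and it assumes the labeling model exposes a confidence score per item.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    text: str
    label: str
    confidence: float  # assumed: a model-reported score in [0, 1]


def route_for_review(predictions, threshold=0.8):
    """Split predictions into auto-accepted labels and items for human review."""
    accepted, needs_human = [], []
    for p in predictions:
        # Confident labels are kept as-is; uncertain ones go to annotators.
        (accepted if p.confidence >= threshold else needs_human).append(p)
    return accepted, needs_human


# Hypothetical example data, for illustration only.
preds = [
    Prediction("great product", "positive", 0.97),
    Prediction("it arrived", "neutral", 0.55),
    Prediction("never again", "negative", 0.91),
]
accepted, needs_human = route_for_review(preds, threshold=0.8)
print(len(accepted), len(needs_human))
```

In practice the threshold would likely be tuned against a small human-labeled audit set, trading annotation cost against label quality.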


Trustworthy and Efficient LLMs Meet Databases

arXiv.org Artificial Intelligence

In the rapidly evolving AI era with large language models (LLMs) at the core, making LLMs more trustworthy and efficient, especially in output generation (inference), has gained significant attention. The goals are to reduce plausible but faulty LLM outputs (a.k.a. hallucinations) and to meet sharply increased inference demands. This tutorial explores such efforts and makes them transparent to the database community. Understanding these efforts is essential in harnessing LLMs in database tasks and adapting database techniques to LLMs. Furthermore, we delve into the synergy between LLMs and databases, highlighting new opportunities and challenges in their intersection. This tutorial aims to share with database researchers and practitioners essential concepts and strategies around LLMs, make LLMs less unfamiliar, and inspire work at the intersection between LLMs and databases.


Is ChatGPT Massively Used by Students Nowadays? A Survey on the Use of Large Language Models such as ChatGPT in Educational Settings

arXiv.org Artificial Intelligence

Few inventions and innovations have genuinely transformed education at large, particularly by enhancing access to knowledge. Notable among these are the advent of writing around 3300 BCE, which facilitated the transmission of knowledge across generations and cultures; the Gutenberg printing press in approximately 1440 CE, which greatly simplified the duplication and dissemination of ideas and knowledge, thereby encouraging wider literacy and education; the large-scale deployment of the World Wide Web in the late 1990s and early 2000s, which allowed for rapid, affordable, and accessible information sharing via the Internet, especially through online encyclopedias such as Wikipedia; and, more recently, the public emergence of Large Language Models (LLMs) [1] in 2022, such as ChatGPT (Chat Generative Pre-Trained Transformer) [2], which have made information access even more straightforward. However, LLMs differ from previous inventions that facilitated the spread of information and knowledge in several key ways [3, 4]. While writing, the printing press, and the Internet primarily made information more accessible, LLMs provide an array of additional functions, such as multi-language translation, summarisation, simplification of complex information, and advanced writing capabilities to structure and organise content. In other words, LLMs assist people not only with accessing information but also with tasks traditionally considered cognitive.


WarriorCoder: Learning from Expert Battles to Augment Code Large Language Models

arXiv.org Artificial Intelligence

Despite recent progress achieved by code large language models (LLMs), their remarkable abilities are largely dependent on fine-tuning on high-quality data, posing challenges for data collection and annotation. To address this, current methods often design various data flywheels to gather complex code instructions, enabling models to handle more intricate tasks. However, these approaches typically rely on off-the-shelf datasets and data augmentation from the limited pool of proprietary LLMs (e.g., Claude, GPT4, and so on), which limits the diversity of the constructed data and makes it prone to systemic biases. In this paper, we propose WarriorCoder, which learns from expert battles to address these limitations. Specifically, we create an arena for current expert code LLMs, where each model challenges and responds to others' challenges, with evaluations conducted by uninvolved judge models. This competitive framework generates novel training data constructed from scratch, harnessing the strengths of all participants. Experimental results demonstrate that WarriorCoder achieves competitive performance compared to previous methods, even without relying on proprietary LLMs.


"From Unseen Needs to Classroom Solutions": Exploring AI Literacy Challenges & Opportunities with Project-based Learning Toolkit in K-12 Education

arXiv.org Artificial Intelligence

Hanqi Li * 1, Ruiwei Xiao * 2, Hsuan Nieu 3, Ying-Jui Tseng 2, Guanze Liao 3
1 New York University, 2 Carnegie Mellon University, 3 Taiwan National Tsing Hua University
hl4893@nyu.edu

Abstract: As artificial intelligence (AI) becomes increasingly central to various fields, there is a growing need to equip K-12 students with AI literacy skills that extend beyond computer science. This paper explores the integration of a Project-Based Learning (PBL) AI toolkit into diverse subject areas, aimed at helping educators teach AI concepts more effectively. Through interviews and co-design sessions with K-12 teachers, we examined their current AI literacy levels and how these teachers adapt AI tools like the AI Art Lab, AI Music Studio, and AI Chatbot into their course designs. While teachers appreciated the potential of AI tools to foster creativity and critical thinking, they also expressed concerns about the accuracy, trustworthiness, and ethical implications of AI-generated content. Our findings reveal the challenges teachers face, including limited resources, varying student and instructor skill levels, and the need for scalable, adaptable AI tools. This research contributes insights that can inform the development of AI curricula tailored to diverse educational contexts.

Introduction: Accessible Artificial Intelligence (AI) tools have sparked increasing interest among K-12 educators in incorporating AI literacy into their classrooms. K-12 educators recognize the need to teach students about AI's capabilities and limitations (Ng et al. 2023a). Existing AI education efforts focus on dedicated curricula and professional learning for teachers (Amplo and Butler 2023; Lee and Perret 2022).


Hierarchically Gated Experts for Efficient Online Continual Learning

arXiv.org Artificial Intelligence

Continual Learning models aim to learn a set of tasks under the constraint that the tasks arrive sequentially with no way to access data from previous tasks. The Online Continual Learning framework poses a further challenge where the tasks are unknown and instead the data arrives as a single stream. Building on existing work, we propose a method for identifying these underlying tasks: the Gated Experts (GE) algorithm, where a dynamically growing set of experts allows for new knowledge to be acquired without catastrophic forgetting. Furthermore, we extend GE to Hierarchically Gated Experts (HGE), a method which is able to efficiently select the best expert for each data sample by organising the experts into a hierarchical structure. On standard Continual Learning benchmarks, GE and HGE are able to achieve results comparable with current methods, with HGE doing so more efficiently.
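The abstract does not give the GE algorithm's details, so the following is only a toy sketch of the general idea of a dynamically growing expert set: each incoming sample is routed to the closest existing expert, and a new expert is created when none is close enough. The class name, the centroid-based gate, and the `new_expert_distance` threshold are assumptions made for illustration, not the authors' method.

```python
import math


class GatedExperts:
    """Toy gate: each expert is summarised by a running-mean centroid."""

    def __init__(self, new_expert_distance=2.0):
        # Hypothetical threshold: beyond this distance a new expert is spawned.
        self.new_expert_distance = new_expert_distance
        self.centroids = []
        self.counts = []

    def route(self, x):
        """Return the index of the expert handling x, creating one if needed."""
        if self.centroids:
            dists = [math.dist(x, c) for c in self.centroids]
            best = min(range(len(dists)), key=dists.__getitem__)
            if dists[best] <= self.new_expert_distance:
                self._update(best, x)
                return best
        # No expert is close enough: grow the set, leaving existing experts untouched.
        self.centroids.append(list(x))
        self.counts.append(1)
        return len(self.centroids) - 1

    def _update(self, i, x):
        # Incremental mean update of the chosen expert's centroid.
        self.counts[i] += 1
        n = self.counts[i]
        self.centroids[i] = [c + (v - c) / n for c, v in zip(self.centroids[i], x)]


gate = GatedExperts(new_expert_distance=2.0)
stream = [(0.0, 0.0), (0.5, 0.2), (10.0, 10.0), (0.1, -0.3), (9.5, 10.4)]
assignments = [gate.route(x) for x in stream]
print(assignments)
```

Because adding an expert never modifies the others, earlier knowledge is preserved in this toy setting. The flat scan over all experts is the cost that a hierarchical arrangement such as HGE aims to reduce.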


Generative Diffusion Modeling: A Practical Handbook

arXiv.org Artificial Intelligence

This handbook offers a unified perspective on diffusion models, encompassing diffusion probabilistic models, score-based generative models, consistency models, rectified flow, and related methods. By standardizing notations and aligning them with code implementations, it aims to bridge the "paper-to-code" gap and facilitate robust implementations and fair comparisons. The content encompasses the fundamentals of diffusion models, the pre-training process, and various post-training methods. Post-training techniques include model distillation and reward-based fine-tuning. Designed as a practical guide, it emphasizes clarity and usability over theoretical depth, focusing on widely adopted approaches in generative modeling with diffusion models.


From Creation to Curriculum: Examining the role of generative AI in Arts Universities

arXiv.org Artificial Intelligence

The age of Artificial Intelligence (AI) is marked by its transformative "generative" capabilities, distinguishing it from prior iterations. This burgeoning characteristic of AI has enabled it to produce new and original content, inherently showcasing its creative prowess. This shift challenges and requires a recalibration in the realm of arts education, urging a departure from established pedagogies centered on human-driven image creation. The paper meticulously addresses the integration of AI tools, with a spotlight on Stable Diffusion (SD), into university arts curricula. Drawing from practical insights gathered from workshops conducted in July 2023, which culminated in an exhibition of AI-driven artworks, the paper aims to provide a roadmap for seamlessly infusing these tools into academic settings. Given their recent emergence, the paper delves into a comprehensive overview of such tools, emphasizing the intricate dance between artists, developers, and researchers in the open-source AI art world. This discourse extends to the challenges and imperatives faced by educational institutions. It presents a compelling case for the swift adoption of these avant-garde tools, underscoring the paramount importance of equipping students with the competencies required to thrive in an AI-augmented artistic landscape.


A Similarity-Based Oversampling Method for Multi-label Imbalanced Text Data

arXiv.org Artificial Intelligence

In real-world applications, as data availability increases, obtaining labeled data for machine learning (ML) projects remains challenging due to the high costs and intensive efforts required for data annotation. Many ML projects, particularly those focused on multi-label classification, also grapple with data imbalance issues, where certain classes may lack sufficient data to train effective classifiers. This study introduces and examines a novel oversampling method for multi-label text classification, designed to address performance challenges associated with data imbalance. The proposed method identifies potential new samples from unlabeled data by leveraging similarity measures between instances. By iteratively searching the unlabeled dataset, the method locates instances similar to those in underrepresented classes and evaluates their contribution to classifier performance enhancement. Instances that demonstrate performance improvement are then added to the labeled dataset. Experimental results indicate that the proposed approach effectively enhances classifier performance post-oversampling.
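The similarity search step described above might look roughly like the following. This is a minimal sketch under stated assumptions: instances are already represented as numeric vectors, cosine similarity is the measure, and the function names and the 0.9 cutoff are hypothetical. It covers only candidate selection; the paper's subsequent check that a candidate improves classifier performance is not shown.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def find_candidates(minority, unlabeled, min_similarity=0.9):
    """Return indices of unlabeled vectors close to any minority-class vector."""
    picked = []
    for i, u in enumerate(unlabeled):
        best = max(cosine(u, m) for m in minority)
        if best >= min_similarity:
            picked.append(i)
    return picked


# Hypothetical embeddings, for illustration only.
minority = [(1.0, 0.0, 1.0), (0.9, 0.1, 1.1)]
unlabeled = [(1.0, 0.0, 0.9), (0.0, 1.0, 0.0), (2.0, 0.1, 2.0)]
candidates = find_candidates(minority, unlabeled)
print(candidates)
```

In a fuller pipeline, each selected candidate would be added to the labeled set only if retraining with it improves a held-out metric, as the abstract describes.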


FairREAD: Re-fusing Demographic Attributes after Disentanglement for Fair Medical Image Classification

arXiv.org Artificial Intelligence

Recent advancements in deep learning have shown transformative potential in medical imaging, yet concerns about fairness persist due to performance disparities across demographic subgroups. Existing methods aim to address these biases by mitigating sensitive attributes in image data; however, these attributes often carry clinically relevant information, and their removal can compromise model performance, a highly undesirable outcome. To address this challenge, we propose Fair Re-fusion After Disentanglement (FairREAD), a novel, simple, and efficient framework that mitigates unfairness by re-integrating sensitive demographic attributes into fair image representations. FairREAD employs orthogonality constraints and adversarial training to disentangle demographic information while using a controlled re-fusion mechanism to preserve clinically relevant details. Additionally, subgroup-specific threshold adjustments ensure equitable performance across demographic groups. Comprehensive evaluations on a large-scale clinical X-ray dataset demonstrate that FairREAD significantly reduces unfairness metrics while maintaining diagnostic accuracy, establishing a new benchmark for fairness and performance in medical image classification.