Enhancing Data Quality in Federated Fine-Tuning of Foundation Models

Zhao, Wanru, Du, Yaxin, Lane, Nicholas Donald, Chen, Siheng, Wang, Yanfeng

Mar-7-2024, arXiv.org Artificial Intelligence

The PubMedQA task asks the model to answer research questions with one of three responses (yes, no, or maybe), which effectively frames it as a multiple-choice problem. The dataset is divided into three subsets: 1,000 manually labeled question-answer pairs (PQA-L), 61,200 unlabeled pairs (PQA-U), and 211,300 artificially generated pairs (PQA-A). Consistent with previous studies (Diao et al., 2023; Singhal et al., 2023), we use the PQA-L subset as the test set for evaluating the model's performance.

USMLE (Jin et al., 2021) consists of multiple-choice questions, with 4 choices per question, based on the United States Medical Licensing Exams. The dataset was compiled from questions used in professional medical board examinations and is unusual in being multilingual, with English, Simplified Chinese, and Traditional Chinese versions. It contains 12,724 questions in English, 34,251 in Simplified Chinese, and 14,123 in Traditional Chinese. We use only the English portion, which follows the official split of 10,178 training questions, 1,273 validation questions, and 1,273 test questions.
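To make the evaluation setup above concrete, the following is a minimal Python sketch, not the authors' code. It checks that the quoted English USMLE split sizes add up to the stated total, and shows one possible way to render a PubMedQA item as a three-way multiple-choice prompt and to score predictions by accuracy. The function names, the prompt layout, and the example question are illustrative assumptions; the paper does not specify a prompt template.

```python
# Split sizes and label set as quoted in the text above.
USMLE_EN_SPLITS = {"train": 10_178, "validation": 1_273, "test": 1_273}
USMLE_EN_TOTAL = 12_724
PUBMEDQA_LABELS = ("yes", "no", "maybe")


def format_pubmedqa_prompt(question: str, context: str) -> str:
    """Render a PubMedQA item as a multiple-choice prompt.

    The layout here is a hypothetical template chosen for illustration.
    """
    options = "\n".join(
        f"({chr(ord('A') + i)}) {label}"
        for i, label in enumerate(PUBMEDQA_LABELS)
    )
    return (
        f"Context: {context}\n"
        f"Question: {question}\n"
        f"Options:\n{options}\n"
        f"Answer:"
    )


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that exactly match the reference labels."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


if __name__ == "__main__":
    # The three English splits should account for every English question.
    assert sum(USMLE_EN_SPLITS.values()) == USMLE_EN_TOTAL

    # Made-up example item, used only to show the prompt shape.
    print(format_pubmedqa_prompt(
        question="Does drug X reduce symptom Y?",
        context="A placeholder abstract would go here.",
    ))
```

Exact-match accuracy is used here because both tasks have a closed answer set; any other metric would need to be taken from the paper itself.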
