dataverse
Dataverse: Open-Source ETL (Extract, Transform, Load) Pipeline for Large Language Models
Park, Hyunbyung, Lee, Sukyung, Gim, Gyoungjin, Kim, Yungi, Kim, Dahyun, Park, Chanjun
To address the challenges associated with data processing at scale, we propose Dataverse, a unified open-source Extract-Transform-Load (ETL) pipeline for large language models (LLMs) with a user-friendly design at its core. Its block-based interface makes it easy to add custom processors, so users can readily and efficiently build their own ETL pipelines. We hope that Dataverse will serve as a vital tool for LLM development, and we open-source the entire library to welcome community contributions. Additionally, we provide a concise, two-minute video demonstration of the system, illustrating its capabilities and implementation.
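As a rough illustration of what a block-based ETL design with pluggable custom processors can look like, here is a minimal Python sketch. It does not use Dataverse's actual API: the registry, the `register` decorator, `run_pipeline`, and all block names are hypothetical and chosen only for this example.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]
Processor = Callable[[List[Record]], List[Record]]

# Hypothetical registry of named blocks; NOT Dataverse's real API.
REGISTRY: Dict[str, Processor] = {}


def register(name: str) -> Callable[[Processor], Processor]:
    """Register a processor block under a name so pipelines can refer to it."""
    def decorator(fn: Processor) -> Processor:
        REGISTRY[name] = fn
        return fn
    return decorator


@register("extract_inline")
def extract_inline(_: List[Record]) -> List[Record]:
    # Extract step: stands in for reading from a real data source.
    return [{"text": "  Hello World  "}, {"text": ""}, {"text": "hello world"}]


@register("strip_whitespace")
def strip_whitespace(records: List[Record]) -> List[Record]:
    # Transform step: trim leading and trailing whitespace.
    return [{**r, "text": r["text"].strip()} for r in records]


@register("drop_empty")
def drop_empty(records: List[Record]) -> List[Record]:
    # Transform step: remove records with no text left.
    return [r for r in records if r["text"]]


@register("dedup_casefold")
def dedup_casefold(records: List[Record]) -> List[Record]:
    # Transform step: keep the first record for each case-insensitive text.
    seen = set()
    unique = []
    for r in records:
        key = r["text"].casefold()
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


def run_pipeline(block_names: Iterable[str]) -> List[Record]:
    """Run the named blocks in order, feeding each block's output to the next."""
    data: List[Record] = []
    for name in block_names:
        data = REGISTRY[name](data)
    return data


result = run_pipeline(
    ["extract_inline", "strip_whitespace", "drop_empty", "dedup_casefold"]
)
```

In this sketch, adding a custom processor only requires writing one decorated function; the pipeline itself is just an ordered list of block names.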
Identifying epidemic related Tweets using noisy learning
Tekumalla, Ramya, Banda, Juan M.
Supervised learning algorithms rely heavily on annotated datasets to train machine learning models. However, curating annotated datasets is laborious and time-consuming because of the manual effort involved, and it has become a major bottleneck in supervised learning. In this work, we apply the theory of noisy learning to generate weak supervision signals instead of manual annotation. We curate a noisily labeled dataset using a labeling heuristic to identify epidemic-related tweets. We evaluated performance on a large epidemic corpus, and our results demonstrate that models trained with noisy data in a class-imbalanced, multi-class weak supervision setting achieved performance greater than 90%.
- North America > United States (1.00)
- North America > Puerto Rico (0.04)
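The abstract above does not say which labeling heuristic was used. As one possible illustration of weak supervision by heuristic, the following Python sketch assigns a noisy class label from keyword matches and abstains when the evidence is missing or ambiguous. The keyword lists, class names, and the `weak_label` function are assumptions made for this example, not the authors' method.

```python
import re
from typing import Dict, List, Optional

# Hypothetical keyword lists; the abstract does not state the real heuristic.
KEYWORDS: Dict[str, List[str]] = {
    "covid": ["covid", "coronavirus"],
    "flu": ["flu", "influenza"],
}


def weak_label(tweet: str) -> Optional[str]:
    """Return a noisy class label, or None to abstain."""
    tokens = set(re.findall(r"[a-z0-9]+", tweet.lower()))
    matches = [
        label for label, words in KEYWORDS.items() if tokens & set(words)
    ]
    # Abstain when no class matches or when more than one class matches.
    return matches[0] if len(matches) == 1 else None


tweets = [
    "Got my flu shot today!",
    "COVID and flu season is here",
    "Nice weather this morning",
]
labels = [weak_label(t) for t in tweets]
```

Labels produced this way are noisy by construction, which is why a downstream model trained on them is usually evaluated against a separately annotated corpus.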
Into the Dataverse!
Industry 4.0, the fourth industrial revolution, is upon us. Artificial intelligence (AI) is permanently changing the way information is used across all business lines of both the Government and the private sector. An important DoD priority is to use AI to improve maintenance activities. However, AI depends on the quality of data, so the DoD must first be able to capture maintenance data that is complete, structured, and readily accessible. DoD maintenance faces growing challenges that threaten the strategic advantage the United States Military has long held in both combat and humanitarian missions.
- Government > Regional Government > North America Government > United States Government (1.00)
- Government > Military (1.00)