ArcheType: A Novel Framework for Open-Source Column Type Annotation using Large Language Models
Benjamin Feuer, Yurong Liu, Chinmay Hegde, Juliana Freire
arXiv.org Artificial Intelligence
Existing deep-learning approaches to semantic column type annotation (CTA) have important shortcomings: they rely on semantic types that are fixed at training time; they require a large number of training samples per type and incur large run-time inference costs; and their performance can degrade when evaluated on novel datasets, even when types remain constant. Large language models have exhibited strong zero-shot classification performance on a wide range of tasks, and in this paper we explore their use for CTA. We introduce ArcheType, a simple, practical method for context sampling, prompt serialization, model querying, and label remapping, which enables large language models to solve CTA problems in a fully zero-shot manner. We ablate each component of our method separately and establish that improvements to context sampling and label remapping provide the most consistent gains. ArcheType establishes new state-of-the-art performance on zero-shot CTA benchmarks (including three new domain-specific benchmarks which we release along with this paper), and when used in conjunction with classical CTA techniques, it outperforms a SOTA DoDuo model on the fine-tuned SOTAB benchmark. Our code is available at https://github.com/penfever/ArcheType.
Nov-6-2023
- Country:
  - Asia
    - Armenia (0.04)
    - Middle East > Israel (0.04)
  - Europe
    - Austria (0.04)
    - Germany > Berlin (0.04)
    - Liechtenstein (0.04)
    - Ukraine > Donetsk Oblast
      - Donetsk (0.04)
  - North America
    - Canada > Ontario
      - Toronto (0.04)
    - Jamaica (0.04)
    - United States
      - Kentucky (0.04)
      - Colorado (0.04)
      - New Jersey (0.04)
      - California > Los Angeles County
        - Long Beach (0.04)
      - Pennsylvania > Philadelphia County
        - Philadelphia (0.04)
      - Arizona (0.04)
      - New York
        - Bronx County > New York City (0.04)
        - New York County > New York City (0.04)
        - Richmond County > New York City (0.04)
      - Nevada (0.04)
      - Arkansas (0.04)
      - Alaska (0.04)
  - South America > Brazil
    - Rio de Janeiro > Rio de Janeiro (0.04)
- Genre:
  - Research Report > New Finding (0.93)
- Industry:
  - Education (0.68)
  - Health & Medicine (0.67)
- Technology: