IEPile: Unearthing Large-Scale Schema-Based Information Extraction Corpus
Gui, Honghao, Yuan, Lin, Ye, Hongbin, Zhang, Ningyu, Sun, Mengshu, Liang, Lei, Chen, Huajun
arXiv.org Artificial Intelligence
Large Language Models (LLMs) demonstrate remarkable potential across various domains; however, they exhibit a significant performance gap in Information Extraction (IE). High-quality instruction data is key to enhancing the specific capabilities of LLMs, yet current IE datasets tend to be small in scale, fragmented, and lacking a standardized schema. To this end, we introduce IEPile, a comprehensive bilingual (English and Chinese) IE instruction corpus containing approximately 0.32B tokens. We construct IEPile by collecting and cleaning 33 existing IE datasets and introduce schema-based instruction generation to unearth a large-scale corpus. Experimentally, IEPile enhances the performance of LLMs on IE, with notable improvements in zero-shot generalization. We open-source the resource and pre-trained models, hoping to provide valuable support to the NLP community.
May-26-2024
- Country:
- Oceania
- Palau (0.04)
- Australia
- Victoria > Melbourne (0.04)
- Queensland > Brisbane (0.04)
- New South Wales > Sydney (0.04)
- North America
- United States
- Massachusetts (0.04)
- Washington > King County
- Seattle (0.04)
- New York > New York County
- New York City (0.04)
- Hawaii > Honolulu County
- Honolulu (0.04)
- Colorado > Boulder County
- Boulder (0.04)
- California > Los Angeles County
- Long Beach (0.04)
- Canada
- Europe
- Italy (0.04)
- Moldova (0.04)
- Spain
- Valencian Community > Valencia Province
- Valencia (0.04)
- Catalonia > Barcelona Province
- Barcelona (0.04)
- Sweden > Uppsala County
- Uppsala (0.04)
- Portugal > Lisbon
- Lisbon (0.04)
- France
- Provence-Alpes-Côte d'Azur > Bouches-du-Rhône
- Marseille (0.04)
- Auvergne-Rhône-Alpes > Lyon
- Lyon (0.04)
- Ukraine > Kyiv Oblast
- Kyiv (0.04)
- Ireland > Leinster
- County Dublin > Dublin (0.04)
- Belgium > Brussels-Capital Region
- Brussels (0.04)
- Asia
- Singapore (0.04)
- Middle East
- Palestine (0.04)
- UAE > Abu Dhabi Emirate
- Abu Dhabi (0.04)
- China
- Henan Province > Zhengzhou (0.04)
- Hong Kong (0.04)
- Africa > Rwanda
- Genre:
- Research Report (0.81)
- Industry:
- Health & Medicine (0.46)
- Law (0.46)
- Technology: