Cuckoo: An IE Free Rider Hatched by Massive Nutrition in LLM's Nest
Peng, Letian, Wang, Zilong, Yao, Feng, Shang, Jingbo
arXiv.org Artificial Intelligence
Massive high-quality data, both pre-training raw texts and post-training annotations, have been carefully prepared to incubate advanced large language models (LLMs). In contrast, pre-training data for information extraction (IE), such as BIO-tagged sequences, are hard to scale up. We show that IE models can act as free riders on LLM resources by reframing next-token prediction as extraction of tokens already present in the context. Specifically, our proposed next tokens extraction (NTE) paradigm learns a versatile IE model, Cuckoo, from 102.6M extractive data instances converted from LLM pre-training and post-training data. In the few-shot setting, Cuckoo adapts effectively to both traditional and complex instruction-following IE, outperforming existing pre-trained IE models. As a free rider, Cuckoo can naturally evolve with ongoing advances in LLM data preparation, benefiting from improvements in LLM training pipelines without additional manual effort.
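The reframing of prediction as extraction can be illustrated with a small sketch. It is not the authors' conversion pipeline: the whitespace tokenization, the longest-match rule, the `max_len` cap, and all function names are assumptions made for illustration. The idea shown is only that when the tokens following a position already occur earlier in the context, the next-token target can be rewritten as BIO tags over that context.

```python
from typing import List, Optional, Tuple


def find_occurrences(context: List[str], span: List[str]) -> List[int]:
    """Start indices at which `span` occurs inside `context`."""
    n = len(span)
    return [s for s in range(len(context) - n + 1) if context[s:s + n] == span]


def longest_repeated_continuation(tokens: List[str], i: int, max_len: int = 8) -> List[str]:
    """Longest span tokens[i:i+n] that already occurs in tokens[:i] (assumed rule)."""
    context = tokens[:i]
    best: List[str] = []
    for n in range(1, min(max_len, len(tokens) - i) + 1):
        span = tokens[i:i + n]
        if find_occurrences(context, span):
            best = span
        else:
            break  # a longer span cannot match if this prefix does not
    return best


def bio_tags(context: List[str], span: List[str]) -> List[str]:
    """Tag every non-overlapping occurrence of `span` in `context` with B/I, else O."""
    tags = ["O"] * len(context)
    n = len(span)
    for s in find_occurrences(context, span):
        if all(t == "O" for t in tags[s:s + n]):
            tags[s] = "B"
            for k in range(s + 1, s + n):
                tags[k] = "I"
    return tags


def nte_instance(tokens: List[str], i: int) -> Optional[Tuple[List[str], List[str], List[str]]]:
    """Turn position i into (context, extracted span, BIO tags), or None if nothing repeats."""
    span = longest_repeated_continuation(tokens, i)
    if not span:
        return None
    context = tokens[:i]
    return context, span, bio_tags(context, span)


if __name__ == "__main__":
    text = "Tom lives in Paris . Who lives in Paris ? Tom"
    tokens = text.split()
    context, span, tags = nte_instance(tokens, len(tokens) - 1)
    for tok, tag in zip(context, tags):
        print(f"{tok}\t{tag}")
    print("target span:", span)
```

In this toy input the final token "Tom" answers the question and already appears at the start of the context, so the instance tags that earlier mention instead of generating the token. Positions whose continuation does not appear in the context yield no instance under this rule.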
Feb-16-2025