Extracting O*NET Features from the NLx Corpus to Build Public Use Aggregate Labor Market Data
Meisenbacher, Stephen; Nestorov, Svetlozar; Norlander, Peter
arXiv.org Artificial Intelligence
Data from online job postings are difficult to access and are not built in a standard or transparent manner. Data included in the standard taxonomy and occupational information database (O*NET) are updated infrequently and are based on small survey samples. We adopt O*NET as a framework for building natural language processing tools that extract structured information from job postings. We publish the Job Ad Analysis Toolkit (JAAT), a collection of open-source tools built for this purpose, and demonstrate its reliability and accuracy in out-of-sample and LLM-as-a-Judge testing. We extract more than 10 billion data points from more than 155 million online job ads provided by the National Labor Exchange (NLx) Research Hub, including O*NET tasks, occupation codes, tools, and technologies, as well as wages, skills, industry, and other features. We describe the construction of a dataset of occupation-, state-, and industry-level features aggregated by monthly active jobs from 2015 to 2025. We illustrate the potential for research and future uses in education and workforce development.
Oct-3-2025
- Country:
  - Asia > Vietnam (0.04)
  - Europe > Germany
    - Bavaria > Upper Bavaria > Munich (0.04)
  - North America > United States
    - New York > New York County
      - New York City (0.04)
    - California
      - Los Angeles County > Los Angeles (0.14)
      - Ventura County > Thousand Oaks (0.04)
    - District of Columbia > Washington (0.04)
    - Colorado (0.04)
    - Washington > King County
      - Seattle (0.04)
    - North Carolina > Mecklenburg County
      - Charlotte (0.04)
    - Virginia (0.04)
    - Illinois > Cook County
      - Chicago (0.04)
    - Massachusetts > Middlesex County
      - Cambridge (0.14)
    - Wisconsin (0.04)
    - Texas (0.04)
    - Minnesota (0.04)
- Genre:
  - Research Report (1.00)