Extracting O*NET Features from the NLx Corpus to Build Public Use Aggregate Labor Market Data

Meisenbacher, Stephen, Nestorov, Svetlozar, Norlander, Peter

Oct-3-2025–arXiv.org Artificial Intelligence

Data from online job postings are difficult to access and are not built in a standard or transparent manner. Data included in the standard taxonomy and occupational information database (O*NET) are updated infrequently and based on small survey samples. We adopt O*NET as a framework for building natural language processing tools that extract structured information from job postings. We publish the Job Ad Analysis Toolkit (JAAT), a collection of open-source tools built for this purpose, and demonstrate its reliability and accuracy in out-of-sample and LLM-as-a-Judge testing. We extract more than 10 billion data points from more than 155 million online job ads provided by the National Labor Exchange (NLx) Research Hub, including O*NET tasks, occupation codes, tools, and technologies, as well as wages, skills, industry, and other features. We describe the construction of a dataset of occupation-, state-, and industry-level features aggregated by monthly active jobs from 2015 to 2025. We illustrate the dataset's potential for research and for future uses in education and workforce development.
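To make "aggregated by monthly active jobs" concrete, the following is a minimal sketch, not the authors' pipeline. It assumes that each posting carries a posting date and a closing date and that a posting counts as active in every calendar month its open interval touches; the abstract does not state the paper's actual definition. The field names (`posted`, `closed`, `soc`, `state`), the helper names, and the sample records are all hypothetical.

```python
from collections import Counter
from datetime import date


def months_active(start: date, end: date):
    """Yield (year, month) for every calendar month overlapping [start, end]."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield year, month
        month += 1
        if month > 12:
            year, month = year + 1, 1


def monthly_active_jobs(postings):
    """Count active postings per (year, month, occupation code, state).

    Assumption: a posting is "active" in a month if it was open on at
    least one day of that month. This is an illustrative definition only.
    """
    counts = Counter()
    for p in postings:
        for year, month in months_active(p["posted"], p["closed"]):
            counts[(year, month, p["soc"], p["state"])] += 1
    return counts


# Made-up records for illustration; the codes follow the O*NET-SOC format.
postings = [
    {"posted": date(2024, 11, 20), "closed": date(2025, 1, 5),
     "soc": "15-1252.00", "state": "IL"},
    {"posted": date(2025, 1, 10), "closed": date(2025, 1, 25),
     "soc": "15-1252.00", "state": "IL"},
    {"posted": date(2025, 1, 3), "closed": date(2025, 2, 1),
     "soc": "29-1141.00", "state": "CA"},
]

counts = monthly_active_jobs(postings)
for key in sorted(counts):
    print(key, counts[key])
```

Keying the counter on a tuple keeps the sketch dependency-free; the same grouping could be extended with an industry code to cover the third aggregation level the abstract mentions.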

large language model, machine learning, natural language

Country:
- North America > United States > California (0.45)

Genre:
- Research Report (1.00)

Industry:
- Law Enforcement & Public Safety > Fire & Emergency Services (1.00)
- Law > Labor & Employment Law (1.00)
- Education (1.00)
- Government > Regional Government
  - North America Government > United States Government (1.00)
- Banking & Finance
  - Insurance (1.00)
  - Economy (1.00)

Technology:
- Information Technology
  - Data Science (1.00)
  - Artificial Intelligence
    - Natural Language
      - Text Processing (1.00)
      - Large Language Model (1.00)
    - Machine Learning
      - Performance Analysis > Accuracy (1.00)
      - Neural Networks > Deep Learning (0.93)
