A Computational Framework for Interpretable Text-Based Personality Assessment from Social Media

Oct-6-2025–arXiv.org Artificial Intelligence

Personality refers to individual differences in behavior, thinking, and feeling. With the growing availability of digital footprints, especially from social media, automated methods for personality assessment have become increasingly important. Natural language processing (NLP) enables the analysis of unstructured text data to identify personality indicators. However, two challenges are central to this thesis: the scarcity of large, personality-labeled datasets, and the disconnect between personality psychology and NLP, which restricts model validity and interpretability. To address these challenges, this thesis presents two datasets, MBTI9k and PANDORA, collected from Reddit, a platform known for user anonymity and diverse discussions. The PANDORA dataset contains 17 million comments from over 10,000 users and integrates the MBTI and Big Five personality models with demographic information, overcoming limitations in data size, quality, and label coverage. Experiments on these datasets show that demographic variables influence model validity. In response, the thesis develops SIMPA (Statement-to-Item Matching Personality Assessment), a computational framework for interpretable personality assessment that matches user-generated statements with validated questionnaire items. By using machine learning and semantic similarity, SIMPA delivers personality assessments comparable to human evaluations while maintaining high interpretability and efficiency. Although focused on personality assessment, SIMPA's versatility extends beyond this domain. Its model-agnostic design, layered cue detection, and scalability make it suitable for research and practical applications that involve complex label taxonomies and variable cue associations with target concepts.

large language model, machine learning, myers-briggs type indicator, (24 more...)

Country:
- Asia (1.00)
- North America > United States
  - California (0.45)
  - Minnesota (0.27)
- Europe > United Kingdom
  - England (0.27)

Genre:
- Workflow (1.00)
- Overview (1.00)
- Questionnaire & Opinion Survey (0.88)
- Research Report
  - New Finding (1.00)
  - Experimental Study (0.93)
  - Promising Solution (0.67)

Industry:
- Health & Medicine (1.00)
- Education (1.00)
- Media > News (0.89)
- Information Technology
  - Security & Privacy (0.92)
  - Services (0.67)
- Government > Regional Government
  - North America Government > United States Government (0.45)

Technology:
- Information Technology
  - Communications > Social Media (1.00)
  - Artificial Intelligence
    - Natural Language
      - Text Processing (1.00)
      - Large Language Model (1.00)
      - Grammars & Parsing (1.00)
      - Chatbot (1.00)
    - Machine Learning
      - Statistical Learning (1.00)
      - Performance Analysis > Accuracy (1.00)
      - Neural Networks > Deep Learning (1.00)
