A document processing pipeline for the construction of a dataset for topic modeling based on the judgments of the Italian Supreme Court

Marulli, Matteo, Panattoni, Glauco, Bertini, Marco

May 14, 2025, arXiv.org Artificial Intelligence

Topic modeling in Italian legal research is hindered by the lack of public datasets, which limits the analysis of legal themes in Supreme Court judgments. To address this, we developed a document processing pipeline that produces an anonymized dataset optimized for topic modeling. The pipeline integrates document layout analysis (DLA) with YOLOv8x, optical character recognition (OCR), and text anonymization. The DLA module achieved a mAP@50 of 0.964 and a mAP@50-95 of 0.800. The OCR detector reached a mAP@50-95 of 0.9022, and the text recognizer (TrOCR) obtained a character error rate (CER) of 0.0047 and a word error rate (WER) of 0.0248. Compared with OCR-only methods, our dataset improved topic modeling, with a diversity score of 0.6198 and a coherence score of 0.6638. We applied BERTopic to extract topics and used large language models to generate labels and summaries, which were evaluated against domain expert interpretations. Claude Sonnet 3.7 achieved a BERTScore F1 of 0.8119 for labeling and 0.9130 for summarization.
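For readers unfamiliar with the two text recognition metrics quoted above, the following is a minimal sketch of how character error rate and word error rate are commonly computed: the Levenshtein edit distance between a reference and a recognized string, divided by the reference length in characters or words. The function names (`edit_distance`, `cer`, `wer`) and the sample strings are illustrative assumptions, not the authors' evaluation code, and the abstract does not state which text normalization was applied before scoring.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))  # distances from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Hypothetical example: one wrong character in a five-word sentence.
reference = "la corte rigetta il ricorso"
hypothesis = "la corte rigetta il ricorzo"
print(f"CER = {cer(reference, hypothesis):.4f}")
print(f"WER = {wer(reference, hypothesis):.4f}")
```

A single wrong character makes a whole word wrong, so WER is usually higher than CER on the same output, which is consistent with the 0.0047 and 0.0248 figures reported above.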


