SMCLM: Semantically Meaningful Causal Language Modeling for Autoregressive Paraphrase Generation

Perełkiewicz, Michał, Dadas, Sławomir, Poświata, Rafał

Jul-8-2025, arXiv.org Artificial Intelligence

This article introduces semantically meaningful causal language modeling (SMCLM), a self-supervised method of training autoregressive models to generate semantically equivalent text. Our approach uses a semantically meaningful text representation as the initial embedding in both the autoregressive training and generation processes. An extensive empirical study demonstrates that SMCLM enables autoregressive models to learn robust, high-quality paraphrase generation. The proposed method is competitive with supervised methods and achieves state-of-the-art results among unsupervised approaches. This article also presents a comprehensive set of automatic metrics that cover a wide range of aspects of evaluating automatically generated paraphrases. At the same time, it highlights the low reliability of metrics that are widely used in paraphrase generation evaluation, including BLEU, ROUGE, and BERTScore.
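
The idea of using a text representation as the initial embedding can be illustrated with a minimal sketch. The abstract does not describe the architecture, so everything below is an assumption made for illustration: `toy_vector` is a hash-based stand-in for learned embeddings, `sentence_embedding` is a hypothetical encoder (a real setup would presumably use a pretrained sentence-embedding model), and `build_training_example` only shows one plausible way to lay out inputs and next-token targets when a sentence vector occupies position 0.

```python
import hashlib

DIM = 4  # toy embedding width (assumption, chosen only for readability)


def toy_vector(text, dim=DIM):
    """Deterministic stand-in for a learned embedding: hash text to floats in [0, 1]."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [digest[i] / 255.0 for i in range(dim)]


def sentence_embedding(tokens):
    """Hypothetical sentence encoder: the mean of the toy token vectors."""
    vectors = [toy_vector(tok) for tok in tokens]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def build_training_example(tokens, eos="<eos>"):
    """Return (inputs, targets) for next-token prediction.

    inputs[0] is the sentence embedding; inputs[i] for i >= 1 is the
    embedding of tokens[i - 1]. targets[i] is the token to predict at
    position i, so the sentence has to be reconstructed from its
    embedding plus the tokens seen so far.
    """
    inputs = [sentence_embedding(tokens)] + [toy_vector(t) for t in tokens]
    targets = list(tokens) + [eos]
    return inputs, targets


tokens = "the cat sat on the mat".split()
inputs, targets = build_training_example(tokens)
```

Under this layout no paired paraphrase data is needed, since the sentence itself supplies both the conditioning vector and the targets; at generation time the same vector would be fed first and tokens sampled one by one. Whether the authors' models follow this exact layout is not stated in the abstract.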

computational linguistics, large language model, machine learning

Genre:
- Overview (1.00)
- Research Report > New Finding (0.93)

Technology:
- Information Technology > Artificial Intelligence
  - Representation & Reasoning (1.00)
  - Natural Language
    - Text Processing (1.00)
    - Large Language Model (1.00)
    - Chatbot (1.00)
  - Machine Learning > Neural Networks
    - Deep Learning (1.00)
