Text Embeddings Reveal (Almost) As Much As Text
Morris, John X., Kuleshov, Volodymyr, Shmatikov, Vitaly, Rush, Alexander M.
arXiv.org Artificial Intelligence
How much private information do text embeddings reveal about the original text? We investigate the problem of embedding inversion: reconstructing the full text represented in dense text embeddings. We frame the problem as controlled generation: generating text that, when re-embedded, is close to a fixed point in latent space. We find that although a naïve model conditioned on the embedding performs poorly, a multi-step method that iteratively corrects and re-embeds text is able to recover 92% of 32-token text inputs exactly. We train our model to decode text embeddings from two state-of-the-art embedding models, and also show that our model can recover important personal information (full names) from a dataset of clinical notes. Our code is available on GitHub: github.com/jxmorris12/vec2text.
Oct-10-2023
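The abstract frames inversion as controlled generation: iteratively correct a hypothesis and re-embed it until its embedding lands near the fixed target point in latent space. The sketch below illustrates only that outer loop with toy stand-ins; the paper trains a conditional language model as the corrector and targets neural encoders, whereas here a character-count embedder and a brute-force single-edit proposer (both hypothetical, chosen only so the example runs) play those roles.

```python
# Illustrative sketch of the iterative correct-and-re-embed loop.
# The embedder and proposal step are toy stand-ins, NOT the paper's models.
from collections import Counter
import math

VOCAB = "abcdefghijklmnopqrstuvwxyz "

def embed(text):
    """Toy embedder: L2-normalized character-count vector over VOCAB."""
    counts = Counter(text)
    vec = [counts.get(ch, 0) for ch in VOCAB]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(u, v):
    """Cosine similarity of two already-normalized vectors."""
    return sum(a * b for a, b in zip(u, v))

def propose_corrections(hypothesis):
    """Toy proposal step: all single-character insertions, substitutions,
    and deletions. (The paper uses a trained conditional LM instead.)"""
    candidates = set()
    for i in range(len(hypothesis) + 1):
        for ch in VOCAB:
            candidates.add(hypothesis[:i] + ch + hypothesis[i:])        # insert
            if i < len(hypothesis):
                candidates.add(hypothesis[:i] + ch + hypothesis[i+1:])  # substitute
        if i < len(hypothesis):
            candidates.add(hypothesis[:i] + hypothesis[i+1:])           # delete
    return candidates

def invert(target_embedding, steps=50):
    """Iteratively correct and re-embed a hypothesis, keeping the candidate
    whose embedding is closest to the fixed target point in latent space."""
    hypothesis = ""
    for _ in range(steps):
        best = hypothesis
        best_score = cosine(embed(hypothesis), target_embedding)
        for cand in propose_corrections(hypothesis):
            score = cosine(embed(cand), target_embedding)
            if score > best_score:
                best, best_score = cand, score
        if best == hypothesis:  # no candidate improved: converged
            break
        hypothesis = best
    return hypothesis
```

Because the toy embedder ignores character order, this loop recovers the character multiset of the input rather than the exact string; the point is only the search structure, with closeness in embedding space as the sole training-free signal guiding generation.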