Habernal, Ivan
Transparent NLP: Using RAG and LLM Alignment for Privacy Q&A
Leschanowsky, Anna, Kolagar, Zahra, Çano, Erion, Habernal, Ivan, Hallinan, Dara, Habets, Emanuël A. P., Popp, Birgit
The transparency principle of the General Data Protection Regulation (GDPR) requires data processing information to be clear, precise, and accessible. While language models show promise in this context, their probabilistic nature complicates truthfulness and comprehensibility. This paper examines state-of-the-art Retrieval Augmented Generation (RAG) systems enhanced with alignment techniques to fulfill GDPR obligations. We evaluate RAG systems incorporating an alignment module like Rewindable Auto-regressive Inference (RAIN) and our proposed multidimensional extension, MultiRAIN, using a Privacy Q&A dataset. Responses are optimized for preciseness and comprehensibility and are assessed through 21 metrics, including deterministic and large language model-based evaluations. Our results show that RAG systems with an alignment module outperform baseline RAG systems on most metrics, though none fully match human answers. Principal component analysis of the results reveals complex interactions between metrics, highlighting the need to refine metrics. This study provides a foundation for integrating advanced natural language processing systems into legal compliance frameworks.
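For illustration only (not part of the paper): a minimal sketch of a retrieve-then-generate privacy Q&A pipeline of the kind the abstract describes. The policy snippets, the bag-of-words retriever, and the prompt template are hypothetical stand-ins; the actual system relies on a state-of-the-art retriever and an aligned LLM with a RAIN-style inference loop.

```python
from collections import Counter
import math

# Hypothetical mini knowledge base of privacy-policy snippets (illustrative only).
POLICY_SNIPPETS = [
    "We store your contact data for 12 months to handle support requests.",
    "Audio recordings are processed on-device and never leave your phone.",
    "You may request deletion of your account data at any time via the settings page.",
]

def bow(text):
    """Tiny bag-of-words representation, standing in for a dense retriever."""
    return Counter(text.lower().split())

def cosine(a, b):
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def retrieve(question, k=1):
    """Return the k snippets most similar to the question (the 'R' in RAG)."""
    q = bow(question)
    return sorted(POLICY_SNIPPETS, key=lambda s: cosine(q, bow(s)), reverse=True)[:k]

def build_prompt(question):
    """Ground the answer in retrieved context; an aligned LLM (e.g., with a RAIN-style
    self-evaluation loop) would consume this prompt and re-rank candidate continuations."""
    context = "\n".join(retrieve(question))
    return f"Context:\n{context}\n\nAnswer the privacy question precisely and comprehensibly:\n{question}"

print(build_prompt("How long do you keep my contact data?"))
```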
A Comprehensive Survey on Legal Summarization: Challenges and Future Directions
Akter, Mousumi, Çano, Erion, Weber, Erik, Dobler, Dennis, Habernal, Ivan
For legal professionals such as lawyers and judges, constant engagement with extensive written materials is fundamental to the work and immensely time-consuming [104]. They often spend hours, if not days, combing through documents to find precedents or relevant cases that could be pivotal to the matter at hand. This laborious process makes up a significant part of their workload, consuming time that could be invested elsewhere. Automatic summarization tools can help condense lengthy legal documents into concise summaries, saving both time and costs. Moreover, integrating advanced Natural Language Processing (NLP) techniques into legal research holds significant promise for democratizing access to legal information. Figure 1 shows the general pipeline for legal summarization. Legal texts, however, present unique challenges that distinguish them from other document types: legal documents tend to be longer and more detailed than those from most other domains.
Private Synthetic Text Generation with Diffusion Models
Ochs, Sebastian, Habernal, Ivan
How capable are diffusion models of generating synthetic texts? Recent research shows their strengths, with performance reaching that of auto-regressive LLMs. But are they also good at generating synthetic data when trained under differential privacy? Here the evidence is missing, although the promise shown by private image generation is strong. In this paper we address this open question through extensive experiments. At the same time, we critically assess (and reimplement) previous work on synthetic private text generation with LLMs and reveal some unmet assumptions that might have led to violations of the differential privacy guarantees. Our results partly contradict previous non-private findings and show that fully open-source LLMs outperform diffusion models in the privacy regime. Our complete source code, datasets, and experimental setup are publicly available to foster future research.
The Impact of Inference Acceleration Strategies on Bias of LLMs
Kirsten, Elisabeth, Habernal, Ivan, Nanda, Vedant, Zafar, Muhammad Bilal
The last few years have seen unprecedented advances in the capabilities of Large Language Models (LLMs). These advancements promise to deeply benefit a vast array of application domains. However, due to their immense size, performing inference with LLMs is both costly and slow. Consequently, a plethora of recent work has proposed strategies to enhance inference efficiency, e.g., quantization, pruning, and caching. These acceleration strategies reduce inference cost and latency, often by several factors, while maintaining much of the predictive performance measured via common benchmarks. In this work, we explore another critical aspect of LLM performance: demographic bias in model generations due to inference acceleration optimizations. Using a wide range of metrics, we probe bias in model outputs from a number of angles. Analysis of outputs before and after inference acceleration shows significant changes in bias. Worryingly, these bias effects are complex and unpredictable: a given combination of acceleration strategy and bias type may show little bias change in one model but lead to a large effect in another. Our results highlight the need for in-depth, case-by-case evaluation of model bias after a model has been modified to accelerate inference.
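As a toy illustration of the kind of before/after comparison the paper motivates (not the paper's own metric suite), the sketch below contrasts greedy generations from a full-precision model and its 4-bit quantized counterpart on a pair of demographically contrasted prompts. It assumes a CUDA machine with the transformers, accelerate, and bitsandbytes packages installed; the model checkpoint and prompts are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL = "meta-llama/Llama-3.2-1B-Instruct"  # placeholder checkpoint
PROMPTS = [
    "The nurse said that he",
    "The nurse said that she",
]

tok = AutoTokenizer.from_pretrained(MODEL)

def generate_all(model):
    outs = []
    for p in PROMPTS:
        ids = tok(p, return_tensors="pt").to(model.device)
        gen = model.generate(**ids, max_new_tokens=30, do_sample=False)
        outs.append(tok.decode(gen[0], skip_special_tokens=True))
    return outs

# Full-precision baseline vs. 4-bit quantized variant of the same checkpoint.
full = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16, device_map="auto")
quant = AutoModelForCausalLM.from_pretrained(
    MODEL, quantization_config=BitsAndBytesConfig(load_in_4bit=True), device_map="auto"
)

# Any divergence between the two outputs is a candidate bias shift worth auditing properly.
for before, after in zip(generate_all(full), generate_all(quant)):
    print("fp16 :", before)
    print("4-bit:", after)
    print()
```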
Private Language Models via Truncated Laplacian Mechanism
Huang, Tianhao, Yang, Tao, Habernal, Ivan, Hu, Lijie, Wang, Di
Deep learning models for NLP tasks are prone to variants of privacy attacks. To prevent privacy leakage, researchers have investigated word-level perturbations, relying on the formal guarantees of differential privacy (DP) in the embedding space. However, many existing approaches either achieve unsatisfactory performance in the high privacy regime when using the Laplacian or Gaussian mechanism, or resort to weaker relaxations of DP that are inferior to canonical DP in terms of privacy strength. This raises the question of whether a new method for private word embedding can be designed to overcome these limitations. In this paper, we propose a novel private embedding method called the high-dimensional truncated Laplacian mechanism. Specifically, we introduce a non-trivial extension of the truncated Laplacian mechanism, which had previously been investigated only in the one-dimensional case. Theoretically, we show that our method has a lower variance than previous private word embedding methods. To further validate its effectiveness, we conduct comprehensive experiments on private embedding and downstream tasks using three datasets. Remarkably, even in the high privacy regime, our approach incurs only a slight decrease in utility compared to the non-private scenario.
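A minimal numerical sketch (not the authors' implementation) of the core idea: add per-coordinate noise drawn from a Laplace distribution truncated to a bounded interval, so the perturbation can never be unbounded. The scale and truncation bound below are placeholders and would have to be calibrated to the embedding sensitivity and the target privacy budget.

```python
import numpy as np

def sample_truncated_laplace(size, scale, bound, rng):
    """Inverse-CDF sampling from Laplace(0, scale) restricted to [-bound, bound]."""
    def laplace_cdf(x):
        return 0.5 + 0.5 * np.sign(x) * (1.0 - np.exp(-np.abs(x) / scale))
    lo, hi = laplace_cdf(-bound), laplace_cdf(bound)
    u = rng.uniform(lo, hi, size)  # uniform mass inside the truncation window
    # Inverse Laplace CDF; |result| <= bound by construction.
    return -scale * np.sign(u - 0.5) * np.log1p(-2.0 * np.abs(u - 0.5))

rng = np.random.default_rng(0)
embedding = rng.normal(size=300)                       # stand-in for a word vector
noise = sample_truncated_laplace(embedding.shape, scale=0.5, bound=2.0, rng=rng)
private_embedding = embedding + noise

print("max |noise| =", np.abs(noise).max())            # never exceeds the truncation bound
```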
LaCour!: Enabling Research on Argumentation in Hearings of the European Court of Human Rights
Held, Lena, Habernal, Ivan
What can we learn about law and legal argumentation from court judgments alone? Contemporary research addresses empirical legal questions (e.g., which arguments are used) or legal NLP questions (e.g., predicting case outcomes) by relying on the availability of the final 'products' of each case, the court decisions (Habernal et al., 2023; Medvedeva et al., 2020). The European Court of Human Rights (ECHR) is a prominent data source, as its decisions are freely available in large numbers, along with metadata on the violated articles and other attributes. This makes the ECHR a popular choice among NLP researchers (Aletras et al., 2016; Chalkidis et al., 2020). However, whether the legal arguments in ECHR cases are created as part of legal deliberation or post hoc after a decision has been reached remains an open (and partly controversial) question. In order to better understand the mechanics of legal argumentation, that is, which of the parties' arguments were presented, discussed, or questioned, and thus might have influenced the case outcome, we must take the oral hearings into account. The availability of oral hearing transcripts from the U.S. Supreme Court has already enabled such legal research (Ashley et al., 2007). However, empirical research into the interplay between arguments at court hearings and the final judgments has so far been impossible for the ECHR, as no hearing transcripts have been available.
DP-NMT: Scalable Differentially-Private Machine Translation
Igamberdiev, Timour, Vu, Doan Nam Long, Künnecke, Felix, Yu, Zhuo, Holmer, Jannik, Habernal, Ivan
Neural machine translation (NMT) is a widely popular text generation task, yet there is a considerable research gap in the development of privacy-preserving NMT models, despite significant data privacy concerns for NMT systems. Differentially private stochastic gradient descent (DP-SGD) is a popular method for training machine learning models with concrete privacy guarantees; however, the implementation specifics of training a model with DP-SGD are not always clarified in existing work, with differing software libraries used and code bases not always being public, leading to reproducibility issues. To tackle this, we introduce DP-NMT, an open-source framework for carrying out research on privacy-preserving NMT with DP-SGD, bringing together numerous models, datasets, and evaluation metrics in one systematic software package. Our goal is to provide a platform for researchers to advance the development of privacy-preserving NMT systems, keeping the specific details of the DP-SGD algorithm transparent and intuitive to implement. We run a set of experiments on datasets from both general and privacy-related domains to demonstrate our framework in use. We make our framework publicly available and welcome feedback from the community.
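For readers unfamiliar with the algorithm at the heart of the framework, here is a minimal sketch of a single DP-SGD step, i.e., per-example gradient clipping followed by Gaussian noise. It is not the DP-NMT code itself; the per-example gradients are random placeholders standing in for the gradients of an NMT model, and the clip norm and noise multiplier are illustrative.

```python
import numpy as np

def dp_sgd_step(per_example_grads, clip_norm=1.0, noise_multiplier=1.1, lr=0.1, rng=None):
    rng = rng or np.random.default_rng(0)
    # 1. Clip each example's gradient to bound its L2 sensitivity.
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    # 2. Sum and add Gaussian noise with std = noise_multiplier * clip_norm.
    noisy_sum = clipped.sum(axis=0) + rng.normal(
        scale=noise_multiplier * clip_norm, size=clipped.shape[1]
    )
    # 3. Average over the batch and take a gradient step.
    return -lr * noisy_sum / per_example_grads.shape[0]

grads = np.random.default_rng(1).normal(size=(32, 10))  # 32 examples, 10 parameters
print(dp_sgd_step(grads))
```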
Differentially Private Natural Language Models: Recent Advances and Future Directions
Hu, Lijie, Habernal, Ivan, Shen, Lei, Wang, Di
Recent developments in deep learning have led to great success in various natural language processing (NLP) tasks. However, these applications may involve data that contain sensitive information. Therefore, how to achieve good performance while also protecting the privacy of sensitive data is a crucial challenge in NLP. To preserve privacy, Differential Privacy (DP), which can prevent reconstruction attacks and protect against potential side knowledge, is becoming a de facto technique for private data analysis. In recent years, DP in NLP models (DP-NLP) has been studied from different perspectives, which deserves a comprehensive review. In this paper, we provide the first systematic review of recent advances in DP deep learning models in NLP. In particular, we first discuss the differences and additional challenges of DP-NLP compared with standard DP deep learning. Then, we investigate existing work on DP-NLP and present its recent developments from three aspects: gradient perturbation based methods, embedding vector perturbation based methods, and ensemble model based methods. We also discuss open challenges and future directions.
To share or not to share: What risks would laypeople accept to give sensitive data to differentially-private NLP systems?
Weiss, Christopher, Kreuter, Frauke, Habernal, Ivan
Although the NLP community has adopted central differential privacy as a go-to framework for privacy-preserving model training or data sharing, the choice and interpretation of the key parameter, the privacy budget $\varepsilon$ that governs the strength of privacy protection, remain largely arbitrary. We argue that determining the $\varepsilon$ value should not be solely in the hands of researchers or system developers, but must also take into account the actual people who share their potentially sensitive data. In other words: Would you share your instant messages for an $\varepsilon$ of 10? We address this research gap by designing, implementing, and conducting a behavioral experiment (311 lay participants) to study how people behave in uncertain, privacy-threatening decision-making situations. Framing the risk perception in terms of two realistic NLP scenarios and using a vignette behavioral study help us determine which $\varepsilon$ thresholds would make laypeople willing to share sensitive textual data - to our knowledge, the first study of its kind.
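A back-of-the-envelope illustration (not taken from the study) of why the $\varepsilon$ value is hard to communicate: under $\varepsilon$-DP, including an individual's data can increase the odds of any inference about them by at most a factor of $e^{\varepsilon}$, and this factor grows very quickly.

```python
import math

# Worst-case multiplicative change in the odds of any inference about an individual
# whose data is included in an epsilon-DP computation.
for eps in (0.1, 1.0, 5.0, 10.0):
    print(f"epsilon = {eps:4.1f} -> worst-case odds factor exp(eps) = {math.exp(eps):,.1f}")
```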
DP-BART for Privatized Text Rewriting under Local Differential Privacy
Igamberdiev, Timour, Habernal, Ivan
Privatized text rewriting with local differential privacy (LDP) is a recent approach that enables sharing of sensitive textual documents while formally guaranteeing privacy protection to individuals. However, existing systems face several issues, such as formal mathematical flaws, unrealistic privacy guarantees, privatization of only individual words, as well as a lack of transparency and reproducibility. In this paper, we propose a new system, DP-BART, that largely outperforms existing LDP systems. Our approach uses a novel clipping method, iterative pruning, and further training of internal representations, which together drastically reduce the amount of noise required for DP guarantees. We run experiments on five textual datasets of varying sizes, rewriting them at different privacy guarantees and evaluating the rewritten texts on downstream text classification tasks. Finally, we thoroughly discuss the privatized text rewriting approach and its limitations, including the problem of the strict text adjacency constraint in the LDP paradigm that leads to the high noise requirement.
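A minimal sketch of the generic clip-and-noise recipe behind representation-level LDP text rewriting, not the DP-BART system itself: the encoder's latent vector is clipped so that any two documents' representations lie at bounded distance, and noise calibrated to that bound is added before decoding. A Laplace mechanism on an L1-clipped latent is used here purely for illustration; all values are placeholders.

```python
import numpy as np

def privatize_latent(z, clip=10.0, epsilon=50.0, rng=None):
    rng = rng or np.random.default_rng(0)
    # Clip the L1 norm to `clip`; any two clipped latents then differ by at most 2*clip in L1.
    z = z * min(1.0, clip / max(np.abs(z).sum(), 1e-12))
    # Laplace noise with scale (2*clip)/epsilon yields epsilon-LDP for the latent vector.
    return z + rng.laplace(scale=2.0 * clip / epsilon, size=z.shape)

latent = np.random.default_rng(1).normal(size=768)   # stand-in for an encoder output
noisy_latent = privatize_latent(latent)              # a decoder would rewrite text from this
print(np.linalg.norm(noisy_latent - latent))
```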