Anonymisation


Anonymising Elderly and Pathological Speech: Voice Conversion Using DDSP and Query-by-Example

Ghosh, Suhita, Jouaiti, Melanie, Das, Arnab, Sinha, Yamini, Polzehl, Tim, Siegert, Ingo, Stober, Sebastian

arXiv.org Artificial Intelligence

Speech anonymisation aims to protect speaker identity by changing personal identifiers in speech while retaining linguistic content. Current methods fail to retain prosody and unique speech patterns found in elderly and pathological speech domains, which is essential for remote health monitoring. To address this gap, we propose a voice conversion-based method (DDSP-QbE) using differentiable digital signal processing and query-by-example. The proposed method, trained with novel losses, aids in disentangling linguistic, prosodic, and domain representations, enabling the model to adapt to uncommon speech patterns. Objective and subjective evaluations show that DDSP-QbE significantly outperforms the voice conversion state-of-the-art concerning intelligibility, prosody, and domain preservation across diverse datasets, pathologies, and speakers while maintaining quality and speaker anonymity. Experts validate domain preservation by analysing twelve clinically pertinent domain attributes.
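The synthesis side of DDSP can be illustrated with a toy oscillator: a waveform rendered directly from a frame-level F0 contour by phase accumulation. This stdlib sketch is only illustrative; the paper's model is differentiable and learned end-to-end, and the sample rate and hop size below are arbitrary stand-ins.

```python
import math

def synthesise_f0(f0_contour, sr=16000, hop=200):
    """Render a mono waveform from a frame-level F0 contour (Hz)
    by phase accumulation -- the core oscillator idea behind DDSP.
    Illustrative only: the actual model adds learned harmonics and
    filtered noise, and is differentiable, which this version is not."""
    samples, phase = [], 0.0
    for f0 in f0_contour:
        for _ in range(hop):  # upsample frame rate to sample rate
            phase += 2 * math.pi * f0 / sr
            samples.append(math.sin(phase))
    return samples

# Four frames: an octave jump from A3 (220 Hz) to A4 (440 Hz).
wave = synthesise_f0([220.0, 220.0, 440.0, 440.0])
```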


Privacy-Preserving Synthetically Augmented Knowledge Graphs with Semantic Utility

Bellomarini, Luigi, Catalano, Costanza, Coletta, Andrea, Iezzi, Michela, Samarati, Pierangela

arXiv.org Artificial Intelligence

Knowledge Graphs (KGs) have recently gained considerable attention in many application domains, from healthcare to biotechnology, from logistics to finance. Financial organisations, central banks, economic research entities, and national supervision authorities apply ontological reasoning on KGs to address crucial business tasks, such as economic policymaking, banking supervision, anti-money laundering, and economic research. Reasoning allows for the generation of derived knowledge capturing complex business semantics and the setup of effective business processes. A major obstacle to KG sharing is privacy: the identity of the data subjects and their sensitive or company-confidential information may be improperly exposed. In this paper, we propose a novel framework to enable KG sharing while ensuring that information that should remain private is neither directly released nor indirectly exposed via derived knowledge, while maintaining the embedded knowledge of the KGs to support downstream business tasks. Our approach produces a privacy-preserving synthetic KG as an augmentation of the input one via the introduction of structural anonymisation. We introduce a novel privacy measure for KGs that considers derived knowledge, define a new utility metric that captures the business semantics we want to preserve, and propose two novel anonymisation algorithms. Our extensive experimental evaluation, with both synthetic graphs and real-world datasets, confirms the effectiveness of our approach, achieving up to a 70% improvement in the privacy of entities compared to existing methods not specifically designed for KGs.
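One classic notion behind structural anonymisation is degree k-anonymity: no node should be singled out by its degree. The sketch below is a deliberately simplified stand-in, assuming an adjacency-dict graph and a greedy edge-adding heuristic; the paper's algorithms are different and additionally account for derived knowledge, which this toy ignores.

```python
from collections import Counter

def is_degree_k_anonymous(adj, k):
    """A graph is degree k-anonymous if every degree value occurring
    in it is shared by at least k nodes."""
    degrees = Counter(len(nbrs) for nbrs in adj.values())
    return all(count >= k for count in degrees.values())

def augment(adj, k):
    """Greedy toy heuristic: keep linking the two lowest-degree
    non-adjacent nodes with a synthetic edge until the graph is
    degree k-anonymous (or complete)."""
    while not is_degree_k_anonymous(adj, k):
        nodes = sorted(adj, key=lambda n: len(adj[n]))
        added = False
        for i, u in enumerate(nodes):
            for v in nodes[i + 1:]:
                if v not in adj[u]:
                    adj[u].add(v)
                    adj[v].add(u)
                    added = True
                    break
            if added:
                break
        if not added:  # graph is complete; nothing more to add
            break
    return adj

# A path graph 1-2-3: degree 2 occurs only once, so not 2-anonymous.
g = {1: {2}, 2: {1, 3}, 3: {2}}
augment(g, 2)  # adds the synthetic edge 1-3, making all degrees equal
```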


Evaluating the Efficacy of AI Techniques in Textual Anonymization: A Comparative Study

Asimopoulos, Dimitris, Siniosoglou, Ilias, Argyriou, Vasileios, Goudos, Sotirios K., Psannis, Konstantinos E., Karditsioti, Nikoleta, Saoulidis, Theocharis, Sarigiannidis, Panagiotis

arXiv.org Artificial Intelligence

In the digital era, with escalating privacy concerns, it is imperative to devise robust strategies that protect private data while maintaining the intrinsic value of textual information. This research embarks on a comprehensive examination of text anonymisation methods, focusing on Conditional Random Fields (CRF), Long Short-Term Memory (LSTM), Embeddings from Language Models (ELMo), and the transformative capabilities of the Transformer architecture. Each model presents unique strengths: LSTM models long-term dependencies, CRF captures dependencies among word sequences, ELMo delivers contextual word representations via deep bidirectional language models, and Transformers introduce self-attention mechanisms that provide enhanced scalability. Our study is positioned as a comparative analysis of these models, emphasising their synergistic potential in addressing text anonymisation challenges. Preliminary results indicate that CRF, LSTM, and ELMo individually outperform traditional methods. The inclusion of Transformers, compared alongside the other models, offers a broader perspective on achieving optimal text anonymisation in contemporary settings.
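Whichever tagger performs best, the downstream anonymisation step is the same: replace each detected entity span with a typed placeholder. A minimal sketch of that step, where the character spans are hand-written stand-ins for the output of any of the compared models (CRF, LSTM, ELMo, or a Transformer):

```python
def anonymise(text, entities):
    """Replace detected entity spans with typed placeholders.
    `entities` is a list of (start, end, label) character spans.
    Spans are applied right-to-left so earlier offsets stay valid."""
    for start, end, label in sorted(entities, key=lambda e: e[0], reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text

sample = "Alice Smith moved to Berlin in 2020."
spans = [(0, 11, "PERSON"), (21, 27, "LOCATION")]  # hand-written, not model output
print(anonymise(sample, spans))  # [PERSON] moved to [LOCATION] in 2020.
```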


Benchmarking Advanced Text Anonymisation Methods: A Comparative Study on Novel and Traditional Approaches

Asimopoulos, Dimitris, Siniosoglou, Ilias, Argyriou, Vasileios, Karamitsou, Thomai, Fountoukidis, Eleftherios, Goudos, Sotirios K., Moscholios, Ioannis D., Psannis, Konstantinos E., Sarigiannidis, Panagiotis

arXiv.org Artificial Intelligence

In the realm of data privacy, the ability to effectively anonymise text is paramount. With the proliferation of deep learning and, in particular, transformer architectures, there is a burgeoning interest in leveraging these advanced models for text anonymisation tasks. This paper presents a comprehensive benchmarking study comparing the performance of transformer-based models and Large Language Models (LLMs) against traditional architectures for text anonymisation. Utilising the CoNLL-2003 dataset, known for its robustness and diversity, we evaluate several models. Our results showcase the strengths and weaknesses of each approach, offering a clear perspective on the efficacy of modern versus traditional methods. Notably, while modern models exhibit advanced capabilities in capturing contextual nuances, certain traditional architectures still achieve high performance. This work aims to guide researchers in selecting the most suitable model for their anonymisation needs, while also shedding light on potential paths for future advancements in the field.


Vocoder drift compensation by x-vector alignment in speaker anonymisation

Panariello, Michele, Todisco, Massimiliano, Evans, Nicholas

arXiv.org Artificial Intelligence

For the most popular x-vector-based approaches to speaker anonymisation, the bulk of the anonymisation can stem from vocoding rather than from the core anonymisation function which is used to substitute an original speaker x-vector with that of a fictitious pseudo-speaker. This phenomenon can impede the design of better anonymisation systems since there is a lack of fine-grained control over the x-vector space. The work reported in this paper explores the origin of so-called vocoder drift and shows that it is due to the mismatch between the substituted x-vector and the original representations of the linguistic content, intonation and prosody. Also reported is an original approach to vocoder drift compensation. While anonymisation performance degrades as expected, compensation reduces vocoder drift substantially, offers improved control over the x-vector space and lays a foundation for the design of better anonymisation functions in the future.
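Drift of this kind is naturally quantified in x-vector space, for example as one minus the cosine similarity between the pseudo-speaker x-vector that was requested and the x-vector re-extracted from the vocoded output. The sketch below uses 3-dimensional toy vectors (real x-vectors are typically 512-dimensional) and is an illustration of the general idea, not the paper's exact measurement protocol.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dim stand-ins: the x-vector handed to the vocoder vs. the one
# re-extracted from the synthesised speech.
requested = [1.0, 0.0, 0.0]
re_extracted = [0.8, 0.6, 0.0]
drift = 1.0 - cosine_similarity(requested, re_extracted)  # larger = more drift
```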


Natural Language Processing for low-resource languages

AIHub

Clearly, such an imbalance in language resources is undesirable, putting those who do not use English at a disadvantage. In this article, we highlight some of the work and initiatives being carried out on low-resource languages. Africa is one of the most linguistically diverse regions in the world. Despite this, African languages are barely represented in technology and research. Lanfrica aims to mitigate the difficulty encountered in the discovery of African language resources by creating a centralised hub.


Textwash -- automated open-source text anonymisation

Kleinberg, Bennett, Davies, Toby, Mozes, Maximilian

arXiv.org Artificial Intelligence

With the increasing digitisation of society and human communication, text data are becoming more important for research in the social and behavioural sciences (Gentzkow, Kelly, and Taddy 2019; Salganik 2019). Advances made in natural language processing (NLP) in particular have led to exciting insights derived from text data (e.g., on emotional responses to the pandemic (Kleinberg, Vegt, and Mozes 2020) or on the rhetoric around immigration in political speeches (Card et al. 2022); for an overview, see (Boyd and Schwartz 2021)). Importantly, the use of computational techniques to quantify and analyse text data has triggered a demand for large datasets (often of several tens of thousands of documents) that can be harnessed for machine learning approaches (e.g., (Socher et al. 2013; Lewis et al. 2020)). This need for larger datasets, combined with an appetite to use text data for the study of social science phenomena, has resulted in a dilemma: many of the important questions require targeted, primary data collection or access to potentially sensitive data. However, such data are hard to obtain, not because they do not exist but because sharing them is constrained by data protection regulations and ethical concerns. One potential consequence is that research activity may be biased toward topics for which suitable data are readily available rather than those that are most important. One of the few viable solutions to this dilemma is automated text anonymisation; that is, the large-scale processing of text data so that individuals cannot be identified from the resulting output. Such a method would allow sensitive data to flow so that the staggering potential of text data can be exploited for scientific progress. With this paper and the tool it introduces, we seek to enable researchers to work with such sensitive data in a way that protects the privacy of individuals whilst retaining the usefulness of anonymised data for computational text analysis.


The MAPA toolkit: sharing your data privately

#artificialintelligence

Think of all the data sources within public administration services that include your personal information: bank account details, financial or medical records, tax information, and so on. We often take it for granted that our data is safe and protected. However, what happens when this information is shared among different public administration entities? In reality, the General Data Protection Regulation (GDPR) safeguards the general public by limiting what data can be shared among entities, requiring that the data be anonymised before it is shared among different entities, including those within the public administration. The Multilingual Anonymisation for Public Administration (MAPA) Project is a European-funded project developing an open-source toolkit that enables effective and reliable text anonymisation, focusing on the medical and legal domains.


The Difficulty of Graph Anonymisation - KDnuggets

#artificialintelligence

This article is written in response to the recent TraceTogether privacy saga. For the non-Singaporeans out there, TraceTogether is Singapore's contact tracing initiative in response to the COVID-19 pandemic in Singapore. The objective of the programme was to quickly identify people who might have been in close contact with anyone who has tested positive for the virus. It comprises an app or a physical token which uses Bluetooth signals to store proximity records. As of the end of December 2020, 70% of Singapore residents were supposedly on the programme.


Ethical aspects of Artificial Intelligence, part 2/2: Differential privacy - Datascience.aero

#artificialintelligence

As the second installment in this series of posts, I will touch upon the topic of privacy in data science and algorithms. In particular, I'm going to discuss a relatively novel concept called differential privacy, which promises, similarly to algorithmic fairness, a way of quantifying the privacy of AI algorithms. When we, as humans, talk about privacy, we mostly refer to a desire not to be observed by others. However, what does privacy mean in the context of algorithms that "observe" us by using data that holds information about us? In a very general sense, we could say that privacy is preserved if, after analysis, the algorithm that used our data (e.g. an application on our smartphones) doesn't know anything about us.
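The standard construction that makes this quantifiable is the Laplace mechanism: add noise drawn from a Laplace distribution with scale equal to the query's sensitivity divided by epsilon. A minimal sketch, where the count query, sensitivity of 1, and epsilon value are illustrative choices rather than anything prescribed by the article:

```python
import math
import random

def laplace_mechanism(true_value, sensitivity, epsilon, rng=random.Random(0)):
    """Return an epsilon-differentially-private answer to a numeric
    query by adding Laplace(0, sensitivity/epsilon) noise. Toy sketch:
    real deployments also need careful sensitivity analysis and
    privacy-budget accounting across repeated queries."""
    scale = sensitivity / epsilon
    # Sample Laplace(0, scale) via the inverse CDF of a uniform draw.
    u = rng.random() - 0.5
    noise = -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))
    return true_value + noise

# "How many users are in the dataset?" -- adding or removing one
# person changes a count by at most 1, so the sensitivity is 1.
noisy_count = laplace_mechanism(true_value=1000, sensitivity=1, epsilon=0.5)
```

A smaller epsilon means a larger noise scale and therefore stronger privacy at the cost of accuracy.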