

 Kawamata, Yuji


Anomaly Detection in Double-entry Bookkeeping Data by Federated Learning System with Non-model Sharing Approach

arXiv.org Artificial Intelligence

Anomaly detection is crucial in financial auditing, and effective detection often requires obtaining large volumes of data from multiple organizations. However, confidentiality concerns hinder data sharing among audit firms. Although the federated learning (FL)-based approach, FedAvg, has been proposed to address this challenge, its reliance on multiple communication rounds increases overhead and limits its practicality. In this study, we propose a novel framework employing Data Collaboration (DC) analysis -- a non-model share-type FL method -- to streamline model training into a single communication round. Our method first encodes journal entry data via dimensionality reduction to obtain secure intermediate representations, then transforms them into collaboration representations for building an autoencoder that detects anomalies. We evaluate our approach on a synthetic dataset and real journal entry data from multiple organizations. The results show that our method not only outperforms single-organization baselines but also exceeds FedAvg in non-i.i.d. experiments on real journal entry data that closely mirror real-world conditions. By preserving data confidentiality and reducing iterative communication, this study addresses a key auditing challenge -- ensuring data confidentiality while integrating knowledge from multiple audit firms. Our findings represent a significant advance in artificial intelligence-driven auditing and underscore the potential of FL methods in high-security domains.
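The pipeline described in this abstract (private dimensionality reduction, alignment into collaboration representations, reconstruction-based anomaly scoring) can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the paper's implementation: the toy data are noise-free, the shared random anchor matrix is a common device in data collaboration analysis, the private map is local PCA followed by a secret rotation, and a linear autoencoder (PCA reconstruction error) stands in for the neural autoencoder.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, k = 20, 8, 3           # features, intermediate dim, "normal" latent dim
n, n_anom, r = 500, 10, 200  # rows per party, anomalies per party, anchor rows

# Toy data (assumption): all entries lie in a shared 8-dim subspace of R^20;
# normal entries use only its first 3 directions, anomalies use all 8.
Q, _ = np.linalg.qr(rng.normal(size=(d, m)))

def make_party():
    factors = np.zeros((n, m))
    factors[:, :k] = 2.0 * rng.normal(size=(n, k))
    factors[:n_anom] = rng.normal(size=(n_anom, m))  # first rows are anomalies
    return factors @ Q.T

def private_map(X):
    """Local dimensionality reduction: top-m PCA directions, privately rotated."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    R, _ = np.linalg.qr(rng.normal(size=(m, m)))     # secret random rotation
    return Vt[:m].T @ R

anchor = rng.normal(size=(r, d))                      # shareable dummy data
parties = [make_party() for _ in range(2)]
maps = [private_map(X) for X in parties]              # never leave the party

# Single communication round: parties send only intermediate representations.
inter = [X @ F for X, F in zip(parties, maps)]
inter_anchor = [anchor @ F for F in maps]

# Analyst: build a common target from the anchor representations, then fit
# one linear alignment map per party by least squares.
U, _, _ = np.linalg.svd(np.hstack(inter_anchor), full_matrices=False)
Z = U[:, :m]
G = [np.linalg.lstsq(Aa, Z, rcond=None)[0] for Aa in inter_anchor]
collab = np.vstack([Xi @ Gi for Xi, Gi in zip(inter, G)])

def linear_autoencoder_scores(Y, bottleneck):
    """Reconstruction error of a PCA 'autoencoder' (stand-in for a neural one)."""
    Yc = Y - Y.mean(axis=0)
    _, _, Vt = np.linalg.svd(Yc, full_matrices=False)
    P = Vt[:bottleneck].T
    return np.sum((Yc - Yc @ P @ P.T) ** 2, axis=1)

scores = linear_autoencoder_scores(collab, bottleneck=k)
is_anom = np.tile(np.arange(n) < n_anom, 2)
print("mean score, normal :", scores[~is_anom].mean())
print("mean score, anomaly:", scores[is_anom].mean())
```

In this idealized setting both parties' private maps span the same subspace, so the two aligned anchor representations coincide and the pooled data can be scored by one model. With real data the alignment is only approximate.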


Estimation of conditional average treatment effects on distributed data: A privacy-preserving approach

arXiv.org Artificial Intelligence

Estimation of conditional average treatment effects (CATEs) is an important topic in various fields such as medical and social sciences. CATEs can be estimated with high accuracy if distributed data across multiple parties can be centralized. However, it is difficult to aggregate such data if they contain private information. To address this issue, we proposed data collaboration double machine learning (DC-DML), a method that can estimate CATE models with privacy preservation of distributed data, and evaluated the method through numerical experiments. Our contributions are summarized in the following three points. First, our method enables estimation and testing of semi-parametric CATE models without iterative communication on distributed data. Semi-parametric or non-parametric CATE models enable estimation and testing that is more robust to model mis-specification than parametric models. However, to our knowledge, no communication-efficient method has been proposed for estimating and testing semi-parametric or non-parametric CATE models on distributed data. Second, our method enables collaborative estimation between different parties as well as multiple time points because the dimensionality-reduced intermediate representations can be accumulated. Third, our method performed as well as or better than other methods in evaluation experiments using synthetic, semi-synthetic and real-world datasets.
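For readers unfamiliar with the double machine learning step that DC-DML builds on, the following is a minimal single-site sketch of residual-on-residual CATE estimation with two-fold cross-fitting. It is not the authors' procedure. It works on raw covariates rather than collaboration representations, uses plain linear regressions as nuisance models, and assumes a CATE of the form a + b * x0. The data-generating process and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4000
X = rng.normal(size=(n, 3))
t = 0.8 * X[:, 1] + rng.normal(size=n)           # treatment depends on covariates
theta_true = 1.0 + 0.5 * X[:, 0]                 # true CATE (assumed linear in x0)
y = theta_true * t + X[:, 2] + rng.normal(size=n)

def ols_predict(X_fit, y_fit, X_new):
    """Fit a linear model with intercept and predict on new rows."""
    def add_const(M):
        return np.column_stack([np.ones(len(M)), M])
    coef, *_ = np.linalg.lstsq(add_const(X_fit), y_fit, rcond=None)
    return add_const(X_new) @ coef

# Cross-fitting: nuisance predictions for each fold come from the other fold.
folds = np.arange(n) % 2
y_res, t_res = np.empty(n), np.empty(n)
for f in (0, 1):
    held_out, train = folds == f, folds != f
    y_res[held_out] = y[held_out] - ols_predict(X[train], y[train], X[held_out])
    t_res[held_out] = t[held_out] - ols_predict(X[train], t[train], X[held_out])

# Final stage: regress outcome residuals on treatment residuals times the
# CATE basis [1, x0]; the coefficients estimate (a, b) in theta(x) = a + b*x0.
basis = np.column_stack([t_res, t_res * X[:, 0]])
cate_coef, *_ = np.linalg.lstsq(basis, y_res, rcond=None)
print("estimated (a, b):", cate_coef, " true: (1.0, 0.5)")
```

Because the final regression uses only residuals, the error in the outcome nuisance model does not bias the CATE coefficients as long as the treatment residuals are well estimated. This robustness is the usual motivation for the double machine learning construction.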


Collaborative causal inference on distributed data

arXiv.org Artificial Intelligence

In recent years, the development of technologies for causal inference with privacy preservation of distributed data has gained considerable attention. Many existing methods for distributed data focus on resolving the lack of subjects (samples) and can only reduce random errors in estimating treatment effects. In this study, we propose a data collaboration quasi-experiment (DC-QE) that resolves the lack of both subjects and covariates, reducing random errors and biases in the estimation. Our method involves constructing dimensionality-reduced intermediate representations from private data held by local parties, sharing intermediate representations instead of private data for privacy preservation, estimating propensity scores from the shared intermediate representations, and finally, estimating the treatment effects from propensity scores. Through numerical experiments on both artificial and real-world data, we confirm that our method leads to better estimation results than individual analyses. While dimensionality reduction loses some information in the private data and causes performance degradation, we observe that sharing intermediate representations with many parties to resolve the lack of subjects and covariates sufficiently improves performance to overcome the degradation caused by dimensionality reduction. Although external validity is not necessarily guaranteed, our results suggest that DC-QE is a promising method. With the widespread use of our method, intermediate representations can be published as open data to help researchers discover causal relationships and accumulate a knowledge base.
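The last two steps listed in this abstract, estimating propensity scores from intermediate representations and then estimating the treatment effect, can be sketched as follows. This is an illustrative simplification, not the DC-QE implementation. It uses a single party, so the multi-party alignment step is omitted. The covariates are exactly low-rank, so the reduction loses no information. The propensity model is a logistic regression fitted by Newton iterations, and the effect estimator is normalized inverse probability weighting, chosen here only as one common propensity-based estimator.

```python
import numpy as np

rng = np.random.default_rng(2)
n, true_ate = 5000, 2.0

# Toy data (assumption): two latent confounders drive six observed covariates.
u = rng.normal(size=(n, 2))
B = np.linalg.qr(rng.normal(size=(6, 2)))[0].T       # 2 x 6, orthonormal rows
X = u @ B
t = (rng.random(n) < 1.0 / (1.0 + np.exp(-u[:, 0]))).astype(float)
y = true_ate * t + 3.0 * u[:, 0] + rng.normal(size=n)

# Private intermediate representation: local PCA plus a secret rotation.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
R, _ = np.linalg.qr(rng.normal(size=(2, 2)))
Z = X @ (Vt[:2].T @ R)                               # only Z, t, y are shared

def fit_propensity(Z, t, iters=25):
    """Logistic regression by Newton's method; returns fitted probabilities."""
    Z1 = np.column_stack([np.ones(len(Z)), Z])
    w = np.zeros(Z1.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Z1 @ w))
        grad = Z1.T @ (t - p)
        hess = (Z1 * (p * (1.0 - p))[:, None]).T @ Z1
        w += np.linalg.solve(hess + 1e-8 * np.eye(len(w)), grad)
    return 1.0 / (1.0 + np.exp(-Z1 @ w))

def normalized_ipw(y, t, e):
    """Inverse probability weighting with weights normalized within each arm."""
    w1, w0 = t / e, (1.0 - t) / (1.0 - e)
    return np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)

e_hat = fit_propensity(Z, t)
ate_naive = y[t == 1].mean() - y[t == 0].mean()
ate_ipw = normalized_ipw(y, t, e_hat)
print("naive difference in means:", ate_naive)
print("IPW on intermediate repr.:", ate_ipw, " (true effect: 2.0)")
```

The naive difference in means is biased because the confounder raises both the treatment probability and the outcome. Weighting by a propensity score estimated from the representation Z removes most of that bias without the raw covariates X being shared.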