Ayed, Samy-Safwan
FLamby: Datasets and Benchmarks for Cross-Silo Federated Learning in Realistic Healthcare Settings
Terrail, Jean Ogier du, Ayed, Samy-Safwan, Cyffers, Edwige, Grimberg, Felix, He, Chaoyang, Loeb, Regis, Mangold, Paul, Marchand, Tanguy, Marfoq, Othmane, Mushtaq, Erum, Muzellec, Boris, Philippenko, Constantin, Silva, Santiago, Teleńczuk, Maria, Albarqouni, Shadi, Avestimehr, Salman, Bellet, Aurélien, Dieuleveut, Aymeric, Jaggi, Martin, Karimireddy, Sai Praneeth, Lorenzi, Marco, Neglia, Giovanni, Tommasi, Marc, Andreux, Mathieu
Federated Learning (FL) is a novel approach enabling several clients holding sensitive data to collaboratively train machine learning models without centralizing data. The cross-silo FL setting corresponds to the case of few ($2$--$50$) reliable clients, each holding medium to large datasets, and is typically found in applications such as healthcare, finance, or industry. While previous works have proposed representative datasets for cross-device FL, few realistic healthcare cross-silo FL datasets exist, thereby slowing algorithmic research in this critical application. In this work, we propose a novel cross-silo dataset suite focused on healthcare, FLamby (Federated Learning AMple Benchmark of Your cross-silo strategies), to bridge the gap between theory and practice of cross-silo FL. FLamby encompasses 7 healthcare datasets with natural splits, covering multiple tasks, modalities, and data volumes, each accompanied by baseline training code. As an illustration, we additionally benchmark standard FL algorithms on all datasets. Our flexible and modular suite allows researchers to easily download datasets, reproduce results, and re-use the different components for their research. FLamby is available at~\url{www.github.com/owkin/flamby}.
Fed-BioMed: Open, Transparent and Trusted Federated Learning for Real-world Healthcare Applications
Cremonesi, Francesco, Vesin, Marc, Cansiz, Sergen, Bouillard, Yannick, Balelli, Irene, Innocenti, Lucia, Silva, Santiago, Ayed, Samy-Safwan, Taiello, Riccardo, Kameni, Laetitia, Vidal, Richard, Orlhac, Fanny, Nioche, Christophe, Lapel, Nathan, Houis, Bastien, Modzelewski, Romain, Humbert, Olivier, Önen, Melek, Lorenzi, Marco
The need for large amounts of data to develop Artificial Intelligence (AI) in healthcare has motivated a number of national and international initiatives aimed at creating medical data lakes accessible to researchers, such as the French Health Data Hub [10], the UK BioBank [59], the US ADNI [26] and TCGA [60], among many others [58, 40, 7]. In spite of these initiatives, major bottlenecks still prevent the widespread availability of large centralized repositories of healthcare information [63]. To overcome these limitations, Federated Learning (FL) has been proposed as a working paradigm to enable the training of ML models on large datasets from diverse sources while respecting data privacy and governance. The basic paradigm of FL consists of iterating the following steps: i) model training is performed locally in the hospitals starting from a common initialization; ii) the resulting model parameters (instead of the data) are subsequently shared and aggregated to define a global model; iii) the global model is transmitted back to the hospitals to initiate a new local training step. Under certain conditions [39], this procedure is guaranteed to converge to a final global model representing an optimal consensus among the hospitals participating in the experiment. FL is particularly suited for applications in sensitive domains, such as healthcare and biomedical research [48, 9, 13].
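The iterative procedure described above (local training from a common initialization, sharing and aggregating parameters, and broadcasting the global model back) can be sketched as a minimal FedAvg-style loop. This is an illustrative toy example on a linear least-squares problem with NumPy, not the Fed-BioMed implementation; the function names, the sample-size-weighted aggregation rule, and the synthetic "hospital" datasets are all assumptions made for the sketch.

```python
import numpy as np

def local_update(weights, data, lr=0.3, epochs=1):
    # Step i): local training in one hospital, starting from the
    # common initialization `weights` (gradient steps on least squares).
    X, y = data
    w = weights.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def federated_round(global_w, client_data):
    # Step ii): each client shares its updated parameters (not its data);
    # the server aggregates them, here weighted by local sample counts.
    local_ws = [local_update(global_w, data) for data in client_data]
    sizes = np.array([len(y) for _, y in client_data], dtype=float)
    # Step iii): the aggregated model becomes the next common initialization.
    return np.average(local_ws, axis=0, weights=sizes)

# Toy cross-silo setup: 3 "hospitals" sharing the true model w* = [2, -1].
rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])
clients = []
for n in (40, 60, 80):  # few, reliable silos with medium-sized datasets
    X = rng.normal(size=(n, 2))
    clients.append((X, X @ w_true))

w = np.zeros(2)
for _ in range(50):
    w = federated_round(w, clients)
# After enough rounds, the global model approaches the shared optimum w*.
```

The weighted average mirrors the standard FedAvg aggregation, where each client's contribution is proportional to its local dataset size; convergence here reflects the consensus guarantee mentioned in the abstract for well-behaved objectives.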