Sharma, Divya
Mitigating Bad Ground Truth in Supervised Machine Learning based Crop Classification: A Multi-Level Framework with Sentinel-2 Images
A, Sanayya, Shetty, Amoolya, Sharma, Abhijeet, Ravichandran, Venkatesh, Gosuvarapalli, Masthan Wali, Jain, Sarthak, Nanjundiah, Priyamvada, Dutta, Ujjal Kr, Sharma, Divya
In agricultural management, precise Ground Truth (GT) data is crucial for accurate Machine Learning (ML) based crop classification. Yet, issues like crop mislabeling and incorrect land identification are common. We propose a multi-level GT cleaning framework while utilizing multi-temporal Sentinel-2 data to address these issues. Specifically, this framework utilizes generating embeddings for farmland, clustering similar crop profiles, and identification of outliers indicating GT errors. We validated clusters with False Colour Composite (FCC) checks and used distance-based metrics to scale and automate this verification process. The importance of cleaning the GT data became apparent when the models were trained on the clean and unclean data. For instance, when we trained a Random Forest model with the clean GT data, we achieved upto 70\% absolute percentage points higher for the F1 score metric. This approach advances crop classification methodologies, with potential for applications towards improving loan underwriting and agricultural decision-making.
OBELiX: A Curated Dataset of Crystal Structures and Experimentally Measured Ionic Conductivities for Lithium Solid-State Electrolytes
Therrien, Félix, Haibeh, Jamal Abou, Sharma, Divya, Hendley, Rhiannon, Hernández-García, Alex, Sun, Sun, Tchagang, Alain, Su, Jiang, Huberman, Samuel, Bengio, Yoshua, Guo, Hongyu, Shin, Homin
Solid-state electrolyte batteries are expected to replace liquid electrolyte lithium-ion batteries in the near future thanks to their higher theoretical energy density and improved safety. However, their adoption is currently hindered by their lower effective ionic conductivity, a quantity that governs charge and discharge rates. Identifying highly ion-conductive materials using conventional theoretical calculations and experimental validation is both time-consuming and resource-intensive. While machine learning holds the promise to expedite this process, relevant ionic conductivity and structural data is scarce. Here, we present OBELiX, a domain-expert-curated database of $\sim$600 synthesized solid electrolyte materials and their experimentally measured room temperature ionic conductivities gathered from literature. Each material is described by their measured composition, space group and lattice parameters. A full-crystal description in the form of a crystallographic information file (CIF) is provided for ~320 structures for which atomic positions were available. We discuss various statistics and features of the dataset and provide training and testing splits that avoid data leakage. Finally, we benchmark seven existing ML models on the task of predicting ionic conductivity and discuss their performance. The goal of this work is to facilitate the use of machine learning for solid-state electrolyte materials discovery.
SAModified: A Foundation Model-Based Zero-Shot Approach for Refining Noisy Land-Use Land-Cover Maps
Pekhale, Sparsh, Sathish, Rakshith, Basavaraju, Sathisha, Sharma, Divya
Land-use and land cover (LULC) analysis is critical in remote sensing, with wide-ranging applications across diverse fields such as agriculture, utilities, and urban planning. However, automating LULC map generation using machine learning is rendered challenging due to noisy labels. Typically, the ground truths (e.g. ESRI LULC, MapBioMass) have noisy labels that hamper the model's ability to learn to accurately classify the pixels. Further, these erroneous labels can significantly distort the performance metrics of a model, leading to misleading evaluations. Traditionally, the ambiguous labels are rectified using unsupervised algorithms. These algorithms struggle not only with scalability but also with generalization across different geographies. To overcome these challenges, we propose a zero-shot approach using the foundation model, Segment Anything Model (SAM), to automatically delineate different land parcels/regions and leverage them to relabel the unsure pixels by using the local label statistics within each detected region. We achieve a significant reduction in label noise and an improvement in the performance of the downstream segmentation model by $\approx 5\%$ when trained with denoised labels.
Crystal-GFN: sampling crystals with desirable properties and constraints
AI4Science, Mila, Hernandez-Garcia, Alex, Duval, Alexandre, Volokhova, Alexandra, Bengio, Yoshua, Sharma, Divya, Carrier, Pierre Luc, Benabed, Yasmine, Koziarski, Michał, Schmidt, Victor
Accelerating material discovery holds the potential to greatly help mitigate the climate crisis. Discovering new solid-state materials such as electrocatalysts, super-ionic conductors or photovoltaic materials can have a crucial impact, for instance, in improving the efficiency of renewable energy production and storage. In this paper, we introduce Crystal-GFN, a generative model of crystal structures that sequentially samples structural properties of crystalline materials, namely the space group, composition and lattice parameters. This domain-inspired approach enables the flexible incorporation of physical and structural hard constraints, as well as the use of any available predictive model of a desired physicochemical property as an objective function. To design stable materials, one must target the candidates with the lowest formation energy. Here, we use as objective the formation energy per atom of a crystal structure predicted by a new proxy machine learning model trained on MatBench. The results demonstrate that Crystal-GFN is able to sample highly diverse crystals with low (median -3.1 eV/atom) predicted formation energy.