 diversity index


Jet Image Generation in High Energy Physics Using Diffusion Models

arXiv.org Artificial Intelligence

This article presents, for the first time, the application of diffusion models for generating jet images corresponding to proton-proton collision events at the Large Hadron Collider (LHC). The kinematic variables of quark, gluon, W-boson, Z-boson, and top quark jets from the JetNet simulation dataset are mapped to two-dimensional image representations. Diffusion models are trained on these images to learn the spatial distribution of jet constituents. We compare the performance of score-based diffusion models and consistency models in accurately generating class-conditional jet images. Unlike approaches based on latent distributions, our method operates directly in image space. The fidelity of the generated images is evaluated using several metrics, including the Fréchet Inception Distance (FID), which demonstrates that consistency models achieve higher fidelity and generation stability compared to score-based diffusion models. These advancements offer significant improvements in computational efficiency and generation accuracy, providing valuable tools for High Energy Physics (HEP) research. Diffusion models have been used for a wide range of image generation tasks, including grayscale images, RGB color images, hyperspectral images, and physics-based images. Grayscale and color image generation using diffusion models have demonstrated significant advancements in capturing details and color distributions. In grayscale image generation, these models effectively reproduce variations in intensity and texture, as shown in recent studies [1], [2].


Scito2M: A 2 Million, 30-Year Cross-disciplinary Dataset for Temporal Scientometric Analysis

arXiv.org Artificial Intelligence

Understanding the creation, evolution, and dissemination of scientific knowledge is crucial for bridging diverse subject areas and addressing complex global challenges such as pandemics, climate change, and ethical AI. Scientometrics, the quantitative and qualitative study of scientific literature, provides valuable insights into these processes. We introduce Scito2M, a longitudinal scientometric dataset with over two million academic publications, providing comprehensive contents information and citation graphs to support cross-disciplinary analyses. Using Scito2M, we conduct a temporal study spanning over 30 years to explore key questions in scientometrics: the evolution of academic terminology, citation patterns, and interdisciplinary knowledge exchange. Our findings reveal critical insights, such as disparities in epistemic cultures, knowledge production modes, and citation practices. For example, rapidly developing, application-driven fields like LLMs exhibit significantly shorter citation age (2.48 years) compared to traditional theoretical disciplines like oral history (9.71 years).


Diversity and Inclusion Index with Networks and Similarity: Analysis and its Application

arXiv.org Artificial Intelligence

In recent years, the concepts of "diversity" and "inclusion" have attracted considerable attention across a range of fields, encompassing both social and biological disciplines. To fully understand these concepts, it is critical to not only examine the number of categories but also the similarities and relationships among them. In this study, I introduce a novel index for diversity and inclusion that considers similarities and network connections. I analyzed the properties of these indices and investigated their mathematical relationships using established measures of diversity and networks. Moreover, I developed a methodology for estimating similarities based on the utility of diversity. I also created a method for visualizing proportions, similarities, and network connections. Finally, I evaluated the correlation with external metrics using real-world data, confirming that both the proposed indices and our index can be effectively utilized. This study contributes to a more nuanced understanding of diversity and inclusion analysis.


Analyzing Diversity in Healthcare LLM Research: A Scientometric Perspective

arXiv.org Artificial Intelligence

The deployment of large language models (LLMs) in healthcare has demonstrated substantial potential for enhancing clinical decision-making, administrative efficiency, and patient outcomes. However, the underrepresentation of diverse groups in the development and application of these models can perpetuate biases, leading to inequitable healthcare delivery. This paper presents a comprehensive scientometric analysis of LLM research for healthcare, including data from January 1, 2021, to June 16, 2024. By analyzing metadata from PubMed and Dimensions, including author affiliations, countries, and funding sources, we assess the diversity of contributors to LLM research. Our findings highlight significant gender and geographic disparities, with a predominance of male authors and contributions primarily from high-income countries (HICs). We introduce a novel journal diversity index based on Gini impurity to measure the inclusiveness of scientific publications. Our results underscore the necessity for greater representation in order to ensure the equitable application of LLMs in healthcare. We propose actionable strategies to enhance diversity and inclusivity in artificial intelligence research, with the ultimate goal of fostering a more inclusive and equitable future in healthcare innovation.
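The abstract does not spell out how the journal diversity index is computed, so the following is only a minimal sketch of plain Gini impurity, 1 - Σ p_i², over category proportions. The function name `gini_impurity` and the sample income-group labels are assumptions made for illustration, not the authors' implementation or data.

```python
from collections import Counter


def gini_impurity(labels):
    """Return 1 - sum(p_i^2), where p_i is the share of each category.

    0.0 means every item belongs to one category; values closer to 1.0
    mean the items are spread more evenly over many categories.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("labels must not be empty")
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


# Hypothetical author labels for one journal (illustrative only).
example_labels = ["HIC", "HIC", "HIC", "LMIC"]
example_score = gini_impurity(example_labels)
```

Under this definition, a journal whose authors all fall in one category scores 0, and the score rises as contributions are spread more evenly.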


Machine Translationese: Effects of Algorithmic Bias on Linguistic Complexity in Machine Translation

arXiv.org Artificial Intelligence

Recent studies in the field of Machine Translation (MT) and Natural Language Processing (NLP) have shown that existing models amplify biases observed in the training data. The amplification of biases in language technology has mainly been examined with respect to specific phenomena, such as gender bias. In this work, we go beyond the study of gender in MT and investigate how bias amplification might affect language in a broader sense. We hypothesize that the 'algorithmic bias', i.e. an exacerbation of frequently observed patterns in combination with a loss of less frequent ones, not only exacerbates societal biases present in current datasets but could also lead to an artificially impoverished language: 'machine translationese'. We assess the linguistic richness (on a lexical and morphological level) of translations created by different data-driven MT paradigms - phrase-based statistical (PB-SMT) and neural MT (NMT). Our experiments show that there is a loss of lexical and morphological richness in the translations produced by all investigated MT paradigms for two language pairs (EN<=>FR and EN<=>ES).


Ensemble Deep Learning on Large, Mixed-Site fMRI Datasets in Autism and Other Tasks

arXiv.org Machine Learning

Deep learning models for MRI classification face two recurring problems: they are typically limited by low sample size, and are abstracted by their own complexity (the "black box problem"). In this paper, we train a convolutional neural network (CNN) with the largest multi-source, functional MRI (fMRI) connectomic dataset ever compiled, consisting of 43,858 datapoints. We apply this model to a cross-sectional comparison of autism (ASD) vs typically developing (TD) controls that has proved difficult to characterise with inferential statistics. To contextualise these findings, we additionally perform classifications of gender and task vs rest. Employing class-balancing to build a training set, we trained 3×300 modified CNNs in an ensemble model to classify fMRI connectivity matrices with overall AUROCs of 0.6774, 0.7680, and 0.9222 for ASD vs TD, gender, and task vs rest, respectively. Additionally, we aim to address the black box problem in this context using two visualization methods. First, class activation maps show which functional connections of the brain our models focus on when performing classification. Second, by analyzing maximal activations of the hidden layers, we were also able to explore how the model organizes a large and mixed-centre dataset, finding that it dedicates specific areas of its hidden layers to processing different covariates of data (depending on the independent variable analyzed), and other areas to mix data from different sources. Our study finds that deep learning models that distinguish ASD from TD controls focus broadly on temporal and cerebellar connections, with a particularly high focus on the right caudate nucleus and paracentral sulcus.


Measuring Diversity of Artificial Intelligence Conferences

arXiv.org Artificial Intelligence

The lack of diversity of the Artificial Intelligence (AI) field is nowadays a concern, and several initiatives such as funding schemes and mentoring programs have been designed to fight against it. However, there is no indication on how these initiatives actually impact AI diversity in the short and long term. This work studies the concept of diversity in this particular context and proposes a small set of diversity indicators (i.e. indexes) of AI scientific events. These indicators are designed to quantify the lack of diversity of the AI field and monitor its evolution. We consider diversity in terms of gender, geographical location and business (understood as the presence of academia versus industry). We compute these indicators for the different communities of a conference: authors, keynote speakers and organizing committee. From these components we compute a summarized diversity indicator for each AI event. We evaluate the proposed indexes for a set of recent major AI conferences and we discuss their values and limitations.


The statistical Minkowski distances: Closed-form formula for Gaussian Mixture Models

arXiv.org Machine Learning

The traditional Minkowski distances are induced by the corresponding Minkowski norms in real-valued vector spaces. In this work, we propose novel statistical symmetric distances based on the Minkowski's inequality for probability densities belonging to Lebesgue spaces. These statistical Minkowski distances admit closed-form formula for Gaussian mixture models when parameterized by integer exponents. This result extends to arbitrary mixtures of exponential families with natural parameter spaces being cones: This includes the binomial, the multinomial, the zero-centered Laplacian, the Gaussian and the Wishart mixtures, among others. We also derive a Minkowski's diversity index of a normalized weighted set of probability distributions from Minkowski's inequality.


Assessment of LDAT as a Grammatical Diversity Assessment Tool

AAAI Conferences

The purpose of this study is to evaluate the validity of measuring grammatical diversity with a specifically designed Lexical Diversity Assessment Tool (LDAT). A secondary objective is to use LDAT to determine if the level of difficulty assigned to English as a Second Language (ESL) texts corresponds to increases in grammatical, lexical, and temporal diversity. Other methods of lexical diversity assessment, such as type-token ratio (TTR), have been used with varying accuracy in an effort to determine the complexity or level of texts. We analyzed 120 ESL texts independently assigned by their sources to one of four levels (Beginner, Lower-intermediate, Upper-intermediate, and Advanced). We demonstrated that LDAT significantly reflected the grammatical diversity within these texts. While the findings conflicted with the prediction that grammatical and lexical diversity would increase with assigned level, we concluded that the implementation of LDAT in text design could provide reliable assessments of grammatical diversity.
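For readers unfamiliar with the type-token ratio mentioned above, here is a minimal sketch: TTR is the number of distinct word forms (types) divided by the total number of words (tokens). The tokenizer below (lowercasing, then matching runs of letters and apostrophes) is a simplifying assumption, and this is not the LDAT tool itself.

```python
import re


def type_token_ratio(text):
    """Return distinct tokens / total tokens for a piece of text."""
    # Assumed tokenization: lowercase, then runs of letters or apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        raise ValueError("text contains no word tokens")
    return len(set(tokens)) / len(tokens)


# "the" occurs twice, so there are 5 types among 6 tokens.
sample_ttr = type_token_ratio("The cat sat on the mat.")
```

TTR tends to fall as texts get longer, because common words repeat. This may be one reason it has been reported to work with varying accuracy.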


Relationship between Diversity and Performance of Multiple Classifiers for Decision Support

arXiv.org Artificial Intelligence

The paper presents an investigation and implementation of the relationship between the diversity of multiple classifiers and their performance, measured by classification accuracy. The study is critical for building classifiers that are strong and generalize better. The parameters of the neural networks within the committee were varied to induce diversity; hence structural diversity is the focus of this study. The hidden nodes and the activation function are the parameters that were varied. Diversity measures adopted from ecology, such as Shannon and Simpson, were used to quantify diversity. A genetic algorithm is used to find the optimal ensemble, using accuracy as the cost function. The results show that there is a relationship between structural diversity and accuracy: the classification accuracy of an ensemble increases as its diversity increases. There was an increase of 3%-6% in the classification accuracy.
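As a rough illustration of the ecological measures named above, the sketch below computes the Shannon index, -Σ p_i ln p_i, and one common form of the Simpson index, 1 - Σ p_i² (often called Gini-Simpson), over the structural configurations of an ensemble. Treating each (hidden nodes, activation) pair as a "species" is an assumption, as are the function names and the sample ensemble. The abstract does not say which variant of the Simpson index was used.

```python
import math
from collections import Counter


def _proportions(labels):
    """Return the share of each distinct label in the input."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("labels must not be empty")
    return [n / total for n in counts.values()]


def shannon_index(labels):
    """Shannon diversity: -sum(p_i * ln(p_i))."""
    return -sum(p * math.log(p) for p in _proportions(labels))


def simpson_index(labels):
    """Gini-Simpson diversity: 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in _proportions(labels))


# Hypothetical committee: (hidden nodes, activation function) per network.
ensemble = [(5, "tanh"), (5, "tanh"), (10, "sigmoid"), (20, "linear")]
ensemble_shannon = shannon_index(ensemble)
ensemble_simpson = simpson_index(ensemble)
```

Both measures are 0 when every network shares one configuration and grow as configurations become more numerous and more evenly represented.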