
Collaborating Authors: Aghazadeh, Amirali


AstroAgents: A Multi-Agent AI for Hypothesis Generation from Mass Spectrometry Data

arXiv.org Artificial Intelligence

With upcoming sample return missions across the solar system and the increasing availability of mass spectrometry data, there is an urgent need for methods that analyze such data within the context of existing astrobiology literature and generate plausible hypotheses regarding the emergence of life on Earth. Hypothesis generation from mass spectrometry data is challenging due to factors such as environmental contaminants, the complexity of spectral peaks, and difficulties in cross-matching these peaks with prior studies. To address these challenges, we introduce AstroAgents, a large language model-based, multi-agent AI system for hypothesis generation from mass spectrometry data. AstroAgents is structured around eight collaborative agents: a data analyst, a planner, three domain scientists, an accumulator, a literature reviewer, and a critic. The system processes mass spectrometry data alongside user-provided research papers. The data analyst interprets the data, and the planner delegates specific segments to the scientist agents for in-depth exploration. The accumulator then collects and deduplicates the generated hypotheses, and the literature reviewer identifies relevant literature using Semantic Scholar. The critic evaluates the hypotheses, offering rigorous suggestions for improvement. To assess AstroAgents, an astrobiology expert evaluated the novelty and plausibility of more than a hundred hypotheses generated from data obtained from eight meteorites and ten soil samples. Of these hypotheses, 36% were identified as plausible, and among those, 66% were novel. Project website: https://astroagents.github.io/
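
The abstract describes a fixed agent topology, which can be sketched as a simple orchestration loop. The sketch below is illustrative only: the `llm` callable, the agent role names, and the data structures are assumptions for this sketch, not the project's actual API.

```python
# Minimal sketch of the eight-agent AstroAgents flow described above.
# `llm(role, **ctx)` is a hypothetical helper that prompts an LLM in a role.
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    text: str
    citations: list = field(default_factory=list)

def run_astroagents(spectra, papers, llm):
    analysis = llm("data_analyst", spectra=spectra, papers=papers)   # interpret the data
    segments = llm("planner", analysis=analysis)                     # delegate data segments
    hypotheses = []
    for seg, role in zip(segments, ("scientist_1", "scientist_2", "scientist_3")):
        hypotheses += llm(role, segment=seg)                         # in-depth exploration
    hypotheses = llm("accumulator", hypotheses=hypotheses)           # collect and deduplicate
    for h in hypotheses:
        h.citations = llm("literature_reviewer", query=h.text)      # Semantic Scholar search
    feedback = llm("critic", hypotheses=hypotheses)                  # suggestions for improvement
    return hypotheses, feedback
```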


Efficient Algorithm for Sparse Fourier Transform of Generalized q-ary Functions

arXiv.org Artificial Intelligence

Computing the Fourier transform of a $q$-ary function $f:\mathbb{Z}_{q}^n\rightarrow \mathbb{R}$, which maps $q$-ary sequences to real numbers, is an important problem in mathematics with wide-ranging applications in biology, signal processing, and machine learning. Previous studies have shown that, under the sparsity assumption, the Fourier transform can be computed efficiently using fast and sample-efficient algorithms. However, in many practical settings, the function is defined over a more general space -- the space of generalized $q$-ary sequences $\mathbb{Z}_{q_1} \times \mathbb{Z}_{q_2} \times \cdots \times \mathbb{Z}_{q_n}$ -- where each $\mathbb{Z}_{q_i}$ corresponds to integers modulo $q_i$. A naive approach involves setting $q=\max_i{q_i}$ and treating the function as $q$-ary, which results in heavy computational overheads. Herein, we develop GFast, an algorithm that computes the $S$-sparse Fourier transform of $f$ with a sample complexity of $O(Sn)$, computational complexity of $O(Sn \log N)$, and a failure probability that approaches zero as $N=\prod_{i=1}^n q_i \rightarrow \infty$ with $S = N^\delta$ for some $0 \leq \delta < 1$. In the presence of noise, we further demonstrate that a robust version of GFast computes the transform with a sample complexity of $O(Sn^2)$ and computational complexity of $O(Sn^2 \log N)$ under the same high probability guarantees. Using large-scale synthetic experiments, we demonstrate that GFast computes the sparse Fourier transform of generalized $q$-ary functions using $16\times$ fewer samples and running $8\times$ faster than existing algorithms. In real-world protein fitness datasets, GFast explains the predictive interactions of a neural network with $>25\%$ smaller normalized mean-squared error compared to existing algorithms.
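
For reference, the transform GFast targets is the classical Fourier transform over the product group $\mathbb{Z}_{q_1} \times \cdots \times \mathbb{Z}_{q_n}$, i.e., a mixed-radix multidimensional DFT. The dense baseline below evaluates $f$ at all $N$ points for comparison; GFast's contribution is recovering the $S$ dominant coefficients from only $O(Sn)$ samples. The $1/N$ normalization is an assumption of this sketch.

```python
# Dense reference transform for a generalized q-ary function, for comparison
# only: it needs all N = prod(q_i) evaluations, whereas GFast needs O(Sn).
import numpy as np

def dense_generalized_ft(f, qs):
    """f maps tuples x (with x[i] in range(qs[i])) to numbers."""
    grid = np.array([f(x) for x in np.ndindex(*qs)]).reshape(qs)
    return np.fft.fftn(grid) / np.prod(qs)   # 1/N normalization: an assumption

# A 1-sparse example over Z_2 x Z_3 x Z_4 (N = 24): one coefficient at k.
qs, k = (2, 3, 4), (1, 2, 3)
omega = [np.exp(2j * np.pi / q) for q in qs]
f = lambda x: np.prod([omega[i] ** (k[i] * x[i]) for i in range(len(qs))])
F = dense_generalized_ft(f, qs)              # F[k] ~ 1, all other entries ~ 0
```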


SHAP zero Explains Genomic Models with Near-zero Marginal Cost for Future Queried Sequences

arXiv.org Artificial Intelligence

With the rapid growth of large-scale machine learning models in genomics, Shapley values have emerged as a popular method for model explanation due to their theoretical guarantees. While Shapley values explain model predictions locally for an individual input query sequence, extracting biological knowledge requires global explanation across thousands of input sequences. This demands exponentially many model evaluations per sequence, resulting in significant computational cost and carbon footprint. Herein, we develop SHAP zero, a method that estimates Shapley values and interactions with a near-zero marginal cost for future queried sequences after paying a one-time fee for model sketching. SHAP zero achieves this by establishing a surprisingly underexplored connection between the Fourier transform of the model and its Shapley values and interactions. Explaining two genomic models, one trained to predict guide RNA binding and the other to predict DNA repair outcome, we demonstrate that SHAP zero achieves orders of magnitude reduction in amortized computational cost compared to state-of-the-art algorithms, revealing almost all predictive motifs -- a finding previously inaccessible due to the combinatorial space of possible interactions.
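
The abstract does not spell out the estimator, but the underlying principle is standard: once a model is written in a sparse interaction basis, Shapley values follow in closed form. The sketch below uses the Mobius (interaction) representation and the classical identity $\phi_i = \sum_{T \ni i} a_T / |T|$; it illustrates the idea only, not SHAP zero's actual algorithm, which works with the Fourier transform of the model.

```python
# Shapley values in closed form from a sparse interaction (Mobius) model:
# f(S) = sum over T subset of S of a[T], and classically
# phi_i = sum over T containing i of a[T] / |T|. Illustrative only.
from collections import defaultdict

def shapley_from_interactions(a):
    """a: dict mapping frozenset of feature indices -> coefficient."""
    phi = defaultdict(float)
    for T, coef in a.items():
        if not T:                      # constant term carries no attribution
            continue
        for i in T:
            phi[i] += coef / len(T)    # members of T split its synergy equally
    return dict(phi)

# f(S) = 2*[0 in S] + 3*[{1,2} subset of S]
a = {frozenset({0}): 2.0, frozenset({1, 2}): 3.0}
print(shapley_from_interactions(a))    # {0: 2.0, 1: 1.5, 2: 1.5}
```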


On Recovering Higher-order Interactions from Protein Language Models

arXiv.org Artificial Intelligence

Protein language models leverage evolutionary information to perform state-of-the-art 3D structure and zero-shot variant prediction. Yet, extracting and explaining all the mutational interactions that govern model predictions remains difficult, as it requires querying the entire amino acid space for $n$ sites using $20^n$ sequences, which is computationally expensive even for moderate values of $n$ (e.g., $n\sim10$). Although approaches to lower the sample complexity exist, they often limit the interpretability of the model to just single and pairwise interactions. Recently, computationally scalable algorithms relying on the assumption of sparsity in the Fourier domain have emerged to learn interactions from experimental data. However, extracting interactions from language models poses unique challenges: it is unclear whether sparsity is always present or whether it is the only metric needed to assess the utility of Fourier algorithms. Herein, we develop a framework to perform a systematic Fourier analysis of the protein language model ESM2 applied to three proteins, green fluorescent protein (GFP), tumor protein P53 (TP53), and G domain B1 (GB1), across various sites for 228 experiments. We demonstrate that ESM2 is dominated by three regions in the sparsity-ruggedness plane, two of which are better suited for sparse Fourier transforms. Validations on two sample proteins demonstrate recovery of all interactions with $R^2=0.72$ in the sparser region and $R^2=0.66$ in the denser region, using only 7 million out of $20^{10}\sim10^{13}$ ESM2 samples, reducing the computational time by a staggering factor of 15,000. All code and data are available in our GitHub repository: https://github.com/amirgroup-codes/InteractionRecovery.
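
On a landscape small enough to enumerate, both axes of that plane can be computed directly from the dense transform. The metric definitions below are assumptions made for this sketch (the abstract does not define them): sparsity as the number of coefficients capturing 90% of the non-constant energy, and ruggedness as the energy fraction in order-$\geq 2$ (epistatic) terms.

```python
# Toy diagnostics for placing a fitness landscape on a sparsity-ruggedness
# plane. Metric definitions here are assumptions, not the paper's.
import numpy as np
from itertools import product

def fourier_diagnostics(scores, q, n):
    """scores: model outputs for all q**n sequences, in np.ndindex order."""
    F = np.fft.fftn(np.asarray(scores, dtype=complex).reshape((q,) * n)) / q**n
    energy = np.abs(F) ** 2
    energy.flat[0] = 0.0                              # ignore the constant term
    order = np.array([sum(k != 0 for k in idx)        # interaction order of each freq
                      for idx in product(range(q), repeat=n)]).reshape((q,) * n)
    total = energy.sum()
    ruggedness = energy[order >= 2].sum() / total     # epistatic energy fraction
    tail = np.sort(energy.ravel())[::-1]
    sparsity = int(np.searchsorted(np.cumsum(tail), 0.9 * total)) + 1
    return sparsity, ruggedness
```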


ProtiGeno: a prokaryotic short gene finder using protein language models

arXiv.org Artificial Intelligence

Prokaryotic gene prediction plays an important role in understanding the biology of organisms and their function, with applications in medicine and biotechnology. Although current gene finders are highly sensitive in finding long genes, their sensitivity decreases noticeably for shorter genes (<180 nt). The culprit is insufficient annotated gene data from which to identify distinguishing features of short open reading frames (ORFs). We develop a deep learning-based method, ProtiGeno, that specifically targets short prokaryotic genes using a protein language model trained on millions of evolved proteins. In systematic large-scale experiments on 4,288 prokaryotic genomes, we demonstrate that ProtiGeno predicts short coding and noncoding genes with higher accuracy and recall than the current state-of-the-art gene finders. We discuss the predictive features of ProtiGeno and its possible limitations by visualizing the three-dimensional structures of the predicted short genes.
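
A short-gene finder of this kind needs a candidate-generation step before the language model can score anything. The sketch below enumerates forward-strand ORFs under 180 nt; it illustrates only that step, not ProtiGeno's actual code (which would also handle the reverse strand and translate the candidates for the protein language model).

```python
# Enumerate short forward-strand ORFs (ATG ... stop) as scoring candidates.
STOP_CODONS = {"TAA", "TAG", "TGA"}

def short_orfs(dna, min_nt=60, max_nt=180):
    dna = dna.upper()
    for frame in range(3):                       # three forward reading frames
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if codon == "ATG" and start is None:
                start = i                        # first start codon of this ORF
            elif codon in STOP_CODONS and start is not None:
                if min_nt <= i + 3 - start <= max_nt:
                    yield dna[start:i + 3]
                start = None                     # reset after the stop codon

print(list(short_orfs("CCATGAAATTTGGGTAACC", min_nt=9)))  # ['ATGAAATTTGGGTAA']
```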


Efficiently Computing Sparse Fourier Transforms of $q$-ary Functions

arXiv.org Artificial Intelligence

Fourier transformations of pseudo-Boolean functions are popular tools for analyzing functions of binary sequences. Real-world functions often have structures that manifest in a sparse Fourier transform, and previous works have shown that under the assumption of sparsity the transform can be computed efficiently. But what if we want to compute the Fourier transform of functions defined over a $q$-ary alphabet? These types of functions arise naturally in many areas, including biology. A typical workaround is to encode the $q$-ary sequence in binary; however, this approach is computationally inefficient and fundamentally incompatible with the existing sparse Fourier transform techniques. Herein, we develop a sparse Fourier transform algorithm specifically for $q$-ary functions of length-$n$ sequences, dubbed $q$-SFT, which provably computes an $S$-sparse transform with vanishing error as $q^n \rightarrow \infty$ in $O(Sn)$ function evaluations and $O(S n^2 \log q)$ computations, where $S = q^{n\delta}$ for some $\delta < 1$. Under certain assumptions, we show that for fixed $q$, a robust version of $q$-SFT has a sample complexity of $O(Sn^2)$ and a computational complexity of $O(Sn^3)$ with the same asymptotic guarantees. We present numerical simulations on synthetic and real-world RNA data, demonstrating the scalability of $q$-SFT to massively high-dimensional $q$-ary functions.
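
The engine behind sparse transforms of this family is an aliasing identity: subsampling $f$ along a line costs only $q$ evaluations, yet hashes all $q^n$ Fourier coefficients into $q$ bins, and a sparse spectrum leaves most bins with a single coefficient that can be peeled off. A self-contained numerical check, using the standard $q$-ary Fourier basis $\omega^{\langle k, x \rangle}$ with $\omega = e^{2\pi i/q}$:

```python
# Numerical check of the aliasing identity behind sparse q-ary transforms:
# with u(l) = f(l*m), the q-point DFT of u sums F[k] over all k that hash
# to the same bin <m, k> mod q -- so q samples hash q**n coefficients.
import numpy as np

q, n = 3, 4
omega = np.exp(2j * np.pi / q)
rng = np.random.default_rng(0)

freqs = [tuple(rng.integers(q, size=n)) for _ in range(3)]   # 3-sparse support
coefs = rng.normal(size=3)

def f(x):
    return sum(c * omega ** np.dot(k, x) for k, c in zip(freqs, coefs))

m = rng.integers(q, size=n)                          # random hashing direction
u = np.array([f(l * m % q) for l in range(q)])       # q evaluations of f
bins = np.fft.fft(u) / q                             # bins[j] = sum of F[k] with <m,k> = j (mod q)

for k, c in zip(freqs, coefs):
    print(int(np.dot(m, k)) % q, round(c, 4))        # compare with bins below
print(np.round(bins, 4))
```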


Group-Structured Adversarial Training

arXiv.org Machine Learning

Robust training methods against perturbations to the input data have received great attention in the machine learning literature. A standard approach in this direction is adversarial training, which learns a model using adversarially perturbed training samples. However, adversarial training performs suboptimally against perturbations structured across samples, such as the universal and group-sparse shifts commonly present in biological data such as gene expression levels of different tissues. In this work, we seek to close this optimality gap and introduce Group-Structured Adversarial Training (GSAT), which learns a model robust to perturbations structured across samples. We formulate GSAT as a nonconvex-concave minimax optimization problem that minimizes a group-structured optimal transport cost. Specifically, we focus on the applications of GSAT to group-sparse and rank-constrained perturbations modeled using group and nuclear norm penalties. To solve GSAT's non-smooth optimization problem in those cases, we propose GDADMM, a new minimax optimization algorithm combining Gradient Descent Ascent (GDA) and the Alternating Direction Method of Multipliers (ADMM). We present several applications of the GSAT framework to gain robustness against structured perturbations in image recognition and computational biology datasets.
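
A minimal instance of the minimax structure, kept deliberately simple: one universal shift $\delta$ shared across all samples (the simplest perturbation "structured across samples"), plain gradient descent ascent, and a smooth quadratic penalty in place of the group/nuclear norms that GSAT's GDADMM handles via ADMM. All hyperparameters here are assumptions.

```python
# Toy GDA for min_w max_delta mean_i loss(w; x_i + delta) - lam/2 * ||delta||^2
# with a logistic model and one sample-shared (universal) shift delta.
import numpy as np

def gsat_gda_toy(X, y, lam=1.0, eps=0.1, lr=0.05, steps=500):
    n, d = X.shape                           # labels y in {-1, +1}
    w, delta = np.zeros(d), np.zeros(d)
    for _ in range(steps):
        z = (X + delta) @ w
        g = -y / (1.0 + np.exp(y * z))       # d loss_i / d z_i for logistic loss
        w -= lr * ((X + delta).T @ g) / n    # descent step: model parameters
        delta += lr * (w * g.sum() / n - lam * delta)   # ascent step: adversary
        delta = np.clip(delta, -eps, eps)    # keep the universal shift bounded
    return w, delta
```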


MISSION: Ultra Large-Scale Feature Selection using Count-Sketches

arXiv.org Machine Learning

Feature selection is an important challenge in machine learning. It plays a crucial role in the explainability of machine-driven decisions that are rapidly permeating modern society. Unfortunately, the explosion in the size and dimensionality of real-world datasets poses a severe challenge to standard feature selection algorithms. Today, it is not uncommon for datasets to have billions of dimensions. At such a scale, even storing the feature vector is impossible, causing most existing feature selection methods to fail. Workarounds like feature hashing, a standard approach to large-scale machine learning, help with computational feasibility, but at the cost of losing the interpretability of features. In this paper, we present MISSION, a novel framework for ultra large-scale feature selection that performs stochastic gradient descent while maintaining an efficient representation of the features in memory using a Count-Sketch data structure. MISSION retains the simplicity of feature hashing without sacrificing the interpretability of the features, while using only $O(\log^2(p))$ working memory. We demonstrate that MISSION accurately and efficiently performs feature selection on real-world, large-scale datasets with billions of dimensions.
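
The core data structure is compact enough to sketch. The toy below shows the two operations such a system needs from a Count-Sketch: fold a per-feature gradient update into sublinear memory, and read a weight estimate back as a median of signed buckets. The hash functions are simple universal-hashing stand-ins, and the top-$k$ heavy-hitter heap the full MISSION system maintains is omitted.

```python
# Minimal Count-Sketch for sketched SGD weights: update() folds a gradient
# step for feature i into the table; query() reads the weight back as the
# median of the signed buckets. Hashes are illustrative stand-ins.
import numpy as np

class CountSketch:
    def __init__(self, rows=5, width=2**18, seed=0):
        rng = np.random.default_rng(seed)
        self.P = 2**31 - 1                                # Mersenne prime modulus
        self.table = np.zeros((rows, width))
        self.width = width
        self.ab = rng.integers(1, self.P, size=(2, rows))  # bucket hash params
        self.cd = rng.integers(1, self.P, size=(2, rows))  # sign hash params
        self._r = np.arange(rows)

    def _bucket(self, i):
        return (self.ab[0] * i + self.ab[1]) % self.P % self.width

    def _sign(self, i):
        return 1 - 2 * ((self.cd[0] * i + self.cd[1]) % self.P % 2)

    def update(self, i, delta):          # w[i] += delta, folded into the sketch
        self.table[self._r, self._bucket(i)] += self._sign(i) * delta

    def query(self, i):                  # noisy estimate of w[i]
        return float(np.median(self._sign(i) * self.table[self._r, self._bucket(i)]))

cs = CountSketch()
cs.update(12345, -0.5)                   # e.g., one SGD step on feature 12345
print(cs.query(12345))                   # approximately -0.5
```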