
Collaborating Authors

 Heidari, Alireza


DobLIX: A Dual-Objective Learned Index for Log-Structured Merge Trees

arXiv.org Artificial Intelligence

In this paper, we introduce DobLIX, a dual-objective learned index specifically designed for Log-Structured Merge (LSM) tree-based key-value stores. Although traditional learned indexes focus exclusively on optimizing index lookups, they often overlook the impact of data access from storage, resulting in performance bottlenecks. DobLIX addresses this by incorporating a second objective, data access optimization, into the learned index training process. This dual-objective approach ensures that both index lookup efficiency and data access costs are minimized, leading to significant improvements in read performance while maintaining write efficiency in real-world LSM-tree systems. Additionally, DobLIX features a reinforcement learning agent that dynamically tunes system parameters, allowing it to adapt to varying workloads in real time. Experimental results using real-world datasets demonstrate that DobLIX reduces indexing overhead and improves throughput by 1.19 to 2.21 times compared to state-of-the-art methods within RocksDB, a widely used LSM-tree-based storage engine.
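As a rough illustration of the dual-objective idea described above (a minimal sketch, not DobLIX's actual training procedure), the following Python snippet scores a linear index segment by a weighted sum of an index-lookup proxy (position prediction error) and a data-access proxy (storage blocks crossed); `BLOCK_SIZE`, `ALPHA`, and the function name are illustrative assumptions.

```python
# Sketch only: a dual-objective score for one learned-index segment,
# trading off lookup error against data-access cost.
import numpy as np

BLOCK_SIZE = 64   # assumed number of entries per storage block
ALPHA = 0.5       # assumed weight between the two objectives

def dual_objective_loss(keys, positions, slope, intercept):
    """Score a linear segment mapping key -> position under two objectives."""
    pred = slope * keys + intercept
    # Objective 1: index-lookup cost, proxied by position prediction error.
    lookup_cost = np.mean(np.abs(pred - positions))
    # Objective 2: data-access cost, proxied by how many storage blocks
    # separate the predicted and true positions.
    access_cost = np.mean(np.abs(pred // BLOCK_SIZE - positions // BLOCK_SIZE))
    return ALPHA * lookup_cost + (1.0 - ALPHA) * access_cost

# Toy usage: fit one segment over sorted keys and score it.
keys = np.sort(np.random.default_rng(0).integers(0, 10_000, size=1_000))
positions = np.arange(keys.size, dtype=float)
slope, intercept = np.polyfit(keys, positions, deg=1)
print(dual_objective_loss(keys, positions, slope, intercept))
```

A lower score here means the segment both predicts positions accurately and keeps searches within few storage blocks, which is the trade-off the abstract describes.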


Unlabeled Out-Of-Domain Data Improves Generalization

arXiv.org Machine Learning

We propose a novel framework for incorporating unlabeled data into semi-supervised classification problems, covering scenarios that minimize either (i) adversarially robust or (ii) non-robust loss functions. Notably, we allow the unlabeled samples to deviate slightly (in the total variation sense) from the in-domain distribution. The core idea behind our framework is to combine Distributionally Robust Optimization (DRO) with self-supervised training. As a result, we can also leverage efficient polynomial-time algorithms for the training stage. From a theoretical standpoint, we apply our framework to the classification problem of a mixture of two Gaussians in $\mathbb{R}^d$, where, in addition to the $m$ independent and labeled samples from the true distribution, a set of $n$ (usually with $n\gg m$) out-of-domain and unlabeled samples is given as well. Using only the labeled data, it is known that the generalization error can be bounded by $\propto\left(d/m\right)^{1/2}$. However, using our method on both isotropic and non-isotropic Gaussian mixture models, one can derive a new set of analytically explicit and non-asymptotic bounds that show a substantial improvement in generalization error compared to ERM. Our results underscore two significant insights: 1) out-of-domain samples, even when unlabeled, can be harnessed to narrow the generalization gap, provided that the true data distribution adheres to a form of the "cluster assumption", and 2) the semi-supervised learning paradigm can be regarded as a special case of our framework when there are no distributional shifts. We validate our claims through experiments conducted on a variety of synthetic and real-world datasets.
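For orientation, the labeled-only baseline mentioned in the abstract can be written out as follows; this is a restatement of the stated quantities (constants and the exact improved bounds are omitted), not the paper's theorem.

```latex
% Two-Gaussian mixture in R^d, m labeled in-domain samples,
% n >> m unlabeled samples within small total-variation distance
% of the in-domain distribution.
% Labeled-only ERM baseline quoted in the abstract:
\[
  \mathrm{Err}_{\mathrm{ERM}} \;\lesssim\; \sqrt{\frac{d}{m}} .
\]
% The DRO + self-supervised framework additionally exploits the n
% unlabeled out-of-domain samples to obtain explicit, non-asymptotic
% bounds that improve on this baseline.
```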


On sampling from data with duplicate records

arXiv.org Machine Learning

Data deduplication is the task of detecting records in a database that correspond to the same real-world entity. Our goal is to develop a procedure that samples uniformly from the set of entities present in the database in the presence of duplicates. We accomplish this with a two-stage process. In the first stage, we estimate the frequencies of all the entities in the database. In the second stage, we use rejection sampling to obtain an (approximately) uniform sample from the set of entities. However, efficiently estimating the frequency of all the entities is a non-trivial task and not attainable in the general case. Hence, we consider various natural properties of the data under which such frequency estimation (and consequently uniform sampling) is possible. Under each of these assumptions, we provide sampling algorithms and give proofs of the complexity (both statistical and computational) of our approach. We complement our study by conducting extensive experiments on both real and synthetic datasets.
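A minimal sketch of the two-stage idea, under the assumption that entity frequency estimates are already available; the record list, the `freq_of` map, and the function name below are illustrative, not the paper's interface.

```python
# Sketch: given (possibly approximate) per-entity frequency estimates,
# rejection sampling over records yields an approximately uniform
# sample over entities despite duplicates.
import random

def sample_entity_uniform(records, freq_of, rng=random.Random(0)):
    """records: list of entity ids (with duplicates);
    freq_of: dict mapping entity -> estimated multiplicity."""
    while True:
        rec = rng.choice(records)               # uniform over records
        if rng.random() < 1.0 / freq_of[rec]:   # accept with prob 1/frequency
            return rec                          # => (approx.) uniform over entities

# Toy usage: entity "a" appears 3x, "b" once; both come back ~equally often.
records = ["a", "a", "a", "b"]
freqs = {"a": 3, "b": 1}
draws = [sample_entity_uniform(records, freqs) for _ in range(10_000)]
print(draws.count("a") / len(draws))  # close to 0.5
```

The acceptance step cancels each entity's multiplicity, which is why the quality of the frequency estimates from the first stage determines how close to uniform the final sample is.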


Record fusion: A learning approach

arXiv.org Machine Learning

Record fusion is the task of aggregating multiple records that correspond to the same real-world entity in a database. We can view record fusion as a machine learning problem where the goal is to predict the "correct" value for each attribute of each entity. Given a database, we use a combination of attribute-level, record-level, and database-level signals to construct a feature vector for each cell (i.e., each (row, col) pair) of that database. We use this feature vector, along with the ground-truth information, to learn a classifier for each of the attributes of the database. Our learning algorithm uses a novel stagewise additive model. At each stage, we construct a new feature vector by combining a part of the original feature vector with features computed from the predictions of the previous stage. We then learn a softmax classifier over the new feature space. This greedy stagewise approach can be viewed as a deep model where, at each stage, we add more complicated non-linear transformations of the original feature vector. We show that our approach fuses records with an average precision of ~98% when source information of records is available, and ~94% without source information, across a diverse array of real-world datasets. We compare our approach to a comprehensive collection of data fusion and entity consolidation methods considered in the literature and show that it can achieve an average precision improvement of ~20%/~45% with/without source information, respectively.
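The stagewise construction can be illustrated with the following hedged sketch, which uses scikit-learn's multinomial (softmax) logistic regression as a stand-in classifier; the feature recombination, stage count, and function names are assumptions for illustration, not the paper's implementation.

```python
# Sketch: each stage concatenates the previous stage's class probabilities
# onto the original cell features and refits a softmax classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

def stagewise_fit(X, y, n_stages=3):
    """X: cell feature matrix; y: ground-truth attribute values (int labels)."""
    models, feats = [], X
    for _ in range(n_stages):
        clf = LogisticRegression(max_iter=1000)   # softmax over the label space
        clf.fit(feats, y)
        models.append(clf)
        # Next stage sees the original features plus this stage's predictions.
        feats = np.hstack([X, clf.predict_proba(feats)])
    return models

def stagewise_predict(models, X):
    feats = X
    for clf in models:
        probs = clf.predict_proba(feats)
        feats = np.hstack([X, probs])
    return probs.argmax(axis=1)

# Toy usage with random features and three candidate values per attribute.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 8)), rng.integers(0, 3, size=200)
print(stagewise_predict(stagewise_fit(X, y), X)[:10])
```

Feeding each stage's predicted probabilities back in as features is what gives the greedy procedure its "deep" flavor: later stages fit increasingly non-linear functions of the original feature vector.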


Approximate Inference in Structured Instances with Noisy Categorical Observations

arXiv.org Machine Learning

Statistical inference over structured instances of dependent variables (e.g., labeled sequences, trees, or general graphs) is a fundamental problem in many areas. Examples include computer vision (Nowozin et al., 2011; Dollár & Zitnick, 2013; Chen et al., 2018), natural language processing (Huang et al., 2015; Hu et al., 2016), and computational biology (Li et al., 2007). In many practical setups (Shin et al., 2015; Rekatsinas et al., 2017; Sa et al., 2019; Heidari et al., 2019b), inference problems involve noisy observations of discrete labels assigned to the nodes and edges of a given structured instance, and the goal is to infer a labeling of the vertices that achieves a low disagreement rate between the correct ground-truth labels Y and the predicted labels Ŷ, i.e., a low Hamming error. We refer to this problem as statistical recovery. Our motivation for studying statistical recovery stems from our recent work on data cleaning (Rekatsinas et al., 2017; Sa et al., 2019; Heidari et al., 2019b). That work introduces HoloClean, a state-of-the-art inference engine for data curation that casts data cleaning as a structured prediction problem (Sa et al., 2019): given a dataset as input, it associates each of its cells with a random variable and uses logical integrity constraints over the dataset (e.g., key constraints or functional dependencies) to introduce dependencies among these random variables. The labels that each random variable can take are determined by the domain of the attribute associated with the corresponding cell. Since we focus on data cleaning, the input dataset corresponds to a noisy version of the latent, clean dataset, and our goal is to recover the latter.
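For concreteness, the recovery objective referred to above can be written as the normalized Hamming distance between the ground-truth and predicted labelings (a restatement of the disagreement rate defined in the text).

```latex
% Ground-truth labels Y = (Y_1, ..., Y_n) and predictions \hat{Y} over the
% n vertices of the structured instance; statistical recovery seeks \hat{Y}
% with small (expected) Hamming error
\[
  d_H\!\left(Y, \hat{Y}\right)
  \;=\; \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\!\left[\, Y_i \neq \hat{Y}_i \,\right].
\]
```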