Amini, Arash
FaMTEB: Massive Text Embedding Benchmark in Persian Language
Zinvandi, Erfan, Alikhani, Morteza, Sarmadi, Mehran, Pourbahman, Zahra, Arvin, Sepehr, Kazemi, Reza, Amini, Arash
In this paper, we introduce a comprehensive benchmark for Persian (Farsi) text embeddings, built upon the Massive Text Embedding Benchmark (MTEB). Our benchmark includes 63 datasets spanning seven different tasks: classification, clustering, pair classification, reranking, retrieval, summary retrieval, and semantic textual similarity. The datasets are formed as a combination of existing, translated, and newly generated data, offering a diverse evaluation framework for Persian language models. Given the increasing use of text embedding models in chatbots, evaluation datasets are becoming inseparable ingredients in chatbot challenges and Retrieval-Augmented Generation systems. As a contribution, we include chatbot evaluation datasets in the MTEB benchmark for the first time. In addition, in this paper, we introduce the new task of summary retrieval which is not part of the tasks included in standard MTEB. Another contribution of this paper is the introduction of a substantial number of new Persian language NLP datasets suitable for training and evaluation, some of which have no previous counterparts in Persian. We evaluate the performance of several Persian and multilingual embedding models in a range of tasks. This work introduces an open-source benchmark with datasets, code and a public leaderboard.
Clustering Time Series Data with Gaussian Mixture Embeddings in a Graph Autoencoder Framework
Afzali, Amirabbas, Hosseini, Hesam, Mirzai, Mohmmadamin, Amini, Arash
Time series data analysis is prevalent across various domains, including finance, healthcare, and environmental monitoring. Traditional time series clustering methods often struggle to capture the complex temporal dependencies inherent in such data. In this paper, we propose the Variational Mixture Graph Autoencoder (VMGAE), a graph-based approach for time series clustering that leverages the structural advantages of graphs to capture enriched data relationships and produces Gaussian mixture embeddings for improved separability. Comparisons with baseline methods are included with experimental results, demonstrating that our method significantly outperforms state-of-the-art time-series clustering techniques. We further validate our method on real-world financial data, highlighting its practical applications in finance. By uncovering community structures in stock markets, our method provides deeper insights into stock relationships, benefiting market prediction, portfolio optimization, and risk management.
CrediRAG: Network-Augmented Credibility-Based Retrieval for Misinformation Detection in Reddit
Ram, Ashwin, Bayiz, Yigit Ege, Amini, Arash, Munir, Mustafa, Marculescu, Radu
Fake news threatens democracy and exacerbates the polarization and divisions in society; therefore, accurately detecting online misinformation is the foundation of addressing this issue. We present CrediRAG, the first fake news detection model that combines language models with access to a rich external political knowledge base with a dense social network to detect fake news across social media at scale. CrediRAG uses a news retriever to initially assign a misinformation score to each post based on the source credibility of similar news articles to the post title content. CrediRAG then improves the initial retrieval estimations through a novel weighted post-to-post network connected based on shared commenters and weighted by the average stance of all shared commenters across every pair of posts. We achieve 11% increase in the F1-score in detecting misinformative posts over state-of-the-art methods. Extensive experiments conducted on curated real-world Reddit data of over 200,000 posts demonstrate the superior performance of CrediRAG on existing baselines. Thus, our approach offers a more accurate and scalable solution to combat the spread of fake news across social media platforms.
Scalable Networked Feature Selection with Randomized Algorithm for Robot Navigation
Pandey, Vivek, Amini, Arash, Liu, Guangyi, Topcu, Ufuk, Sun, Qiyu, Daniilidis, Kostas, Motee, Nader
We address the problem of sparse selection of visual features for localizing a team of robots navigating an unknown environment, where robots can exchange relative position measurements with neighbors. We select a set of the most informative features by anticipating their importance in robots localization by simulating trajectories of robots over a prediction horizon. Through theoretical proofs, we establish a crucial connection between graph Laplacian and the importance of features. We show that strong network connectivity translates to uniformity in feature importance, which enables uniform random sampling of features and reduces the overall computational complexity. We leverage a scalable randomized algorithm for sparse sums of positive semidefinite matrices to efficiently select the set of the most informative features and significantly improve the probabilistic performance bounds. Finally, we support our findings with extensive simulations.
Joint Signal Recovery and Graph Learning from Incomplete Time-Series
Javaheri, Amirhossein, Amini, Arash, Marvasti, Farokh, Palomar, Daniel P.
Learning a graph from data is the key to taking advantage of graph signal processing tools. Most of the conventional algorithms for graph learning require complete data statistics, which might not be available in some scenarios. In this work, we aim to learn a graph from incomplete time-series observations. From another viewpoint, we consider the problem of semi-blind recovery of time-varying graph signals where the underlying graph model is unknown. We propose an algorithm based on the method of block successive upperbound minimization (BSUM), for simultaneous inference of the signal and the graph from incomplete data. Simulation results on synthetic and real time-series demonstrate the performance of the proposed method for graph learning and signal recovery.
Bayesian community detection for networks with covariates
Shen, Luyi, Amini, Arash, Josephs, Nathaniel, Lin, Lizhen
The increasing prevalence of network data in a vast variety of fields and the need to extract useful information out of them have spurred fast developments in related models and algorithms. Among the various learning tasks with network data, community detection, the discovery of node clusters or "communities," has arguably received the most attention in the scientific community. In many real-world applications, the network data often come with additional information in the form of node or edge covariates that should ideally be leveraged for inference. In this paper, we add to a limited literature on community detection for networks with covariates by proposing a Bayesian stochastic block model with a covariate-dependent random partition prior. Under our prior, the covariates are explicitly expressed in specifying the prior distribution on the cluster membership. Our model has the flexibility of modeling uncertainties of all the parameter estimates including the community membership. Importantly, and unlike the majority of existing methods, our model has the ability to learn the number of the communities via posterior inference without having to assume it to be known. Our model can be applied to community detection in both dense and sparse networks, with both categorical and continuous covariates, and our MCMC algorithm is very efficient with good mixing properties. We demonstrate the superior performance of our model over existing models in a comprehensive simulation study and an application to two real datasets.
RoboCup 2022 AdultSize Winner NimbRo: Upgraded Perception, Capture Steps Gait and Phase-based In-walk Kicks
Pavlichenko, Dmytro, Ficht, Grzegorz, Amini, Arash, Hosseini, Mojtaba, Memmesheimer, Raphael, Villar-Corrales, Angel, Schulz, Stefan M., Missura, Marcell, Bennewitz, Maren, Behnke, Sven
Beating the human world champions by 2050 is an ambitious goal of the Humanoid League that provides a strong incentive for RoboCup teams to further improve and develop their systems. In this paper, we present upgrades of our system which enabled our team NimbRo to win the Soccer Tournament, the Drop-in Games, and the Technical Challenges in the Humanoid AdultSize League of RoboCup 2022. Strong performance in these competitions resulted in the Best Humanoid award in the Humanoid League. The mentioned upgrades include: hardware upgrade of the vision module, balanced walking with Capture Steps, and the introduction of phase-based in-walk kicks.
Learning Nonlinear Couplings in Network of Agents from a Single Sample Trajectory
Amini, Arash, Sun, Qiyu, Motee, Nader
We consider a class of stochastic dynamical networks whose governing dynamics can be modeled using a coupling function. It is shown that the dynamics of such networks can generate geometrically ergodic trajectories under some reasonable assumptions. We show that a general class of coupling functions can be learned using only one sample trajectory from the network. This is practically plausible as in numerous applications it is desired to run an experiment only once but for a longer period of time, rather than repeating the same experiment multiple times from different initial conditions. Building upon ideas from the concentration inequalities for geometrically ergodic Markov chains, we formulate several results about the convergence of the empirical estimator to the true coupling function. Our theoretical findings are supported by extensive simulation results.
Variable Importance Using Decision Trees
Kazemitabar, Jalil, Amini, Arash, Bloniarz, Adam, Talwalkar, Ameet S.
Decision trees and random forests are well established models that not only offer good predictive performance, but also provide rich feature importance information. While practitioners often employ variable importance methods that rely on this impurity-based information, these methods remain poorly characterized from a theoretical perspective. We provide novel insights into the performance of these methods by deriving finite sample performance guarantees in a high-dimensional setting under various modeling assumptions. We further demonstrate the effectiveness of these impurity-based methods via an extensive set of simulations.
Fast Methods for Recovering Sparse Parameters in Linear Low Rank Models
Esmaeili, Ashkan, Amini, Arash, Marvasti, Farokh
In this paper, we investigate the recovery of a sparse weight vector (parameters vector) from a set of noisy linear combinations. However, only partial information about the matrix representing the linear combinations is available. Assuming a low-rank structure for the matrix, one natural solution would be to first apply a matrix completion on the data, and then to solve the resulting compressed sensing problem. In big data applications such as massive MIMO and medical data, the matrix completion step imposes a huge computational burden. Here, we propose to reduce the computational cost of the completion task by ignoring the columns corresponding to zero elements in the sparse vector. To this end, we employ a technique to initially approximate the support of the sparse vector. We further propose to unify the partial matrix completion and sparse vector recovery into an augmented four-step problem. Simulation results reveal that the augmented approach achieves the best performance, while both proposed methods outperform the natural two-step technique with substantially less computational requirements.