
 Pouransari, Hadi


SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding

arXiv.org Artificial Intelligence

The landscape of publicly available vision foundation models (VFMs), such as CLIP and Segment Anything Model (SAM), is expanding rapidly. VFMs are endowed with distinct capabilities stemming from their pre-training objectives. For instance, CLIP excels in semantic understanding, while SAM specializes in spatial understanding for segmentation. In this work, we introduce a simple recipe to efficiently merge VFMs into a unified model that absorbs their expertise. Our method integrates techniques of multi-task learning, continual learning, and distillation. Further, it demands significantly less computational cost compared to traditional multi-task training from scratch, and it only needs a small fraction of the pre-training datasets that were initially used to train individual models. By applying our method to SAM and CLIP, we obtain SAM-CLIP: a unified model that combines the capabilities of SAM and CLIP into a single vision transformer. Compared with deploying SAM and CLIP independently, our merged model, SAM-CLIP, reduces storage and compute costs for inference, making it well-suited for edge device applications. We show that SAM-CLIP not only retains the foundational strengths of SAM and CLIP, but also introduces synergistic functionalities, notably in zero-shot semantic segmentation, where SAM-CLIP establishes new state-of-the-art results on 5 benchmarks. It outperforms previous models that are specifically designed for this task by a large margin, including +6.8% and +5.9% mean IoU improvement on Pascal-VOC and COCO-Stuff datasets, respectively.
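As a concrete (and purely illustrative) reading of this merging recipe, the sketch below distills two frozen teachers into one shared image backbone: a CLIP-style pooled embedding head matched with a cosine loss and a SAM-style dense feature head matched with an MSE loss, each on a small rehearsal subset of the corresponding pre-training data. Module names, head shapes, and loss weights are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MergedBackbone(nn.Module):
    """One shared vision backbone with two lightweight heads (hypothetical)."""
    def __init__(self, dim=768, clip_dim=512, sam_dim=256):
        super().__init__()
        # stand-in backbone; in practice this would be a ViT initialized from one teacher
        self.patchify = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.clip_head = nn.Linear(dim, clip_dim)   # pooled semantic embedding
        self.sam_head = nn.Linear(dim, sam_dim)     # per-token spatial features

    def forward(self, images):
        tokens = self.patchify(images).flatten(2).transpose(1, 2)  # (B, N, dim)
        return self.clip_head(tokens.mean(dim=1)), self.sam_head(tokens)

def merge_distill_loss(model, clip_teacher, sam_teacher, clip_imgs, sam_imgs, w_sam=1.0):
    """Multi-task distillation step: match both frozen teachers at once."""
    clip_emb, _ = model(clip_imgs)
    _, sam_feat = model(sam_imgs)
    with torch.no_grad():                      # teachers stay frozen
        t_clip = clip_teacher(clip_imgs)       # (B, clip_dim)
        t_sam = sam_teacher(sam_imgs)          # (B, N, sam_dim)
    loss_clip = 1.0 - F.cosine_similarity(clip_emb, t_clip, dim=-1).mean()
    loss_sam = F.mse_loss(sam_feat, t_sam)
    return loss_clip + w_sam * loss_sam
```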


TiC-CLIP: Continual Training of CLIP Models

arXiv.org Artificial Intelligence

Keeping large foundation models up to date on the latest data is inherently expensive. To avoid the prohibitive costs of constantly retraining, it is imperative to continually train these models. This problem is exacerbated by the lack of any large-scale continual learning benchmarks or baselines. We introduce the first set of web-scale Time-Continual (TiC) benchmarks for training vision-language models: TiC-DataComp, TiC-YFCC, and TiC-RedCaps with over 12.7B timestamped image-text pairs spanning 9 years (2014--2022). We first use our benchmarks to curate various dynamic evaluations to measure temporal robustness of existing models. We show OpenAI's CLIP (trained on data up to 2020) loses $\approx 8\%$ zero-shot accuracy on our curated retrieval task from 2021--2022 compared with more recently trained models in the OpenCLIP repository. We then study how to efficiently train models on time-continuous data. We demonstrate that a simple rehearsal-based approach that continues training from the last checkpoint and replays old data reduces compute by $2.5\times$ when compared to the standard practice of retraining from scratch.
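A hedged sketch of the rehearsal baseline mentioned above: training resumes from the most recent checkpoint and each batch mixes newly arrived image-text pairs with a replay sample of older data. The function signature, the 50% replay ratio, and the externally supplied contrastive loss are assumptions for illustration.

```python
import random

def continual_update(model, optimizer, contrastive_loss, new_pairs, old_pairs,
                     steps, batch_size=256, replay_ratio=0.5):
    """One time step of continual training (e.g., one year of new data)."""
    n_replay = int(batch_size * replay_ratio)
    for _ in range(steps):
        # mix fresh data with replayed old data instead of retraining from scratch
        batch = random.sample(new_pairs, batch_size - n_replay) + \
                random.sample(old_pairs, n_replay)
        images, texts = zip(*batch)
        loss = contrastive_loss(model, images, texts)  # standard CLIP objective, assumed given
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    old_pairs.extend(new_pairs)  # the replay pool grows for future time steps
    return model
```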


CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement

arXiv.org Artificial Intelligence

Contrastive language image pretraining (CLIP) is a standard method for training vision-language models. While CLIP is scalable, promptable, and robust to distribution shifts on image classification tasks, it lacks object localization capabilities. This paper studies the following question: Can we augment CLIP training with task-specific vision models from model zoos to improve its visual representations? Towards this end, we leverage open-source task-specific vision models to generate pseudo-labels for an uncurated and noisy image-text dataset. Subsequently, we train CLIP models on these pseudo-labels in addition to the contrastive training on image and text pairs. This simple setup shows substantial improvements of up to 16.3% across different vision tasks, including segmentation, detection, depth estimation, and surface normal estimation. Importantly, these enhancements are achieved without compromising CLIP's existing capabilities, including its proficiency in promptable zero-shot classification.

Foundation Models (FMs) are revolutionizing different domains of artificial intelligence and machine learning, including computer vision (Radford et al., 2021; He et al., 2022; Kirillov et al., 2023b) and natural language processing (Devlin et al., 2018; Brown et al., 2020; Touvron et al., 2023). FMs can be trained on web-crawled data without relying on crowd or expert annotations, and yet they demonstrate strong generalization capabilities (Jia et al., 2021; Schuhmann et al., 2022).
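The sketch below illustrates one way such pseudo-supervision could be wired up, assuming a CLIP model that also exposes dense image features and a dictionary of frozen task-specific experts from a model zoo; every interface and weight here is an assumption, not the paper's code.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, logit_scale):
    """Standard symmetric InfoNCE loss over an image-text batch."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = logit_scale * img_emb @ txt_emb.t()
    targets = torch.arange(logits.shape[0], device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def joint_loss(clip_model, heads, experts, images, texts, w_pseudo=1.0):
    # contrastive term on noisy image-text pairs (interfaces assumed for illustration)
    img_emb, txt_emb, logit_scale = clip_model(images, texts)
    loss = clip_contrastive_loss(img_emb, txt_emb, logit_scale)
    # pseudo-supervision term: frozen zoo experts label the same images
    dense = clip_model.encode_dense(images)        # hypothetical dense-feature hook
    for task, expert in experts.items():
        with torch.no_grad():
            pseudo = expert(images)                # e.g., depth map or segmentation logits
        loss = loss + w_pseudo * F.mse_loss(heads[task](dense), pseudo)
    return loss
```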


Reinforce Data, Multiply Impact: Improved Model Accuracy and Robustness with Dataset Reinforcement

arXiv.org Artificial Intelligence

We propose Dataset Reinforcement, a strategy to improve a dataset once such that the accuracy of any model architecture trained on the reinforced dataset is improved at no additional training cost for users. Our Dataset Reinforcement strategy is based on data augmentation and knowledge distillation, and is designed from extensive analysis across CNN- and transformer-based models together with a large-scale study of distillation with state-of-the-art models and various data augmentations. We create a reinforced version of the ImageNet training dataset, called ImageNet+, as well as reinforced datasets CIFAR-100+, Flowers-102+, and Food-101+. Models trained with ImageNet+ are more accurate, robust, and calibrated, and transfer well to downstream tasks (e.g., segmentation and detection). As an example, the accuracy of ResNet-50 improves by 1.7% on the ImageNet validation set, 3.5% on ImageNetV2, and 10.0% on ImageNet-R. Expected Calibration Error (ECE) on the ImageNet validation set is also reduced by 9.9%. Using this backbone with Mask-RCNN for object detection on MS-COCO, the mean average precision improves by 0.8%. We reach similar gains for MobileNets, ViTs, and Swin-Transformers. For MobileNetV3 and Swin-Tiny, we observe improvements of up to 20% in robustness on ImageNet-R/A/C. Models pretrained on ImageNet+ and fine-tuned on CIFAR-100+, Flowers-102+, and Food-101+ reach up to 3.4% improved accuracy. The code, datasets, and pretrained models are available at https://github.com/apple/ml-dr.
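To make the "reinforce once, reuse for any architecture" idea concrete, here is a hypothetical sketch that stores, for each training image, a few augmentation seeds together with the teacher's soft predictions; the helper names and storage format are assumptions.

```python
import torch

@torch.no_grad()
def reinforce_dataset(teacher, dataset, augment, samples_per_image=3):
    """Run once with a strong teacher; the result is reused by every future student."""
    teacher.eval()
    reinforced = []
    for idx in range(len(dataset)):
        image, label = dataset[idx]
        entries = []
        for _ in range(samples_per_image):
            seed = torch.seed()                 # re-seed the RNG and remember the seed
            view = augment(image)               # augmentation is replayable from the seed
            probs = teacher(view.unsqueeze(0)).softmax(dim=-1).squeeze(0)
            entries.append((seed, probs.cpu())) # store seed + soft labels, not pixels
        reinforced.append((idx, label, entries))
    return reinforced
```

A student trained later would replay a stored seed via torch.manual_seed, regenerate the identical augmented view, and minimize a cross-entropy/KL term against the stored soft labels, so the teacher never has to be evaluated again.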


Frequency-Aware Masked Autoencoders for Multimodal Pretraining on Biosignals

arXiv.org Artificial Intelligence

Leveraging multimodal information from biosignals is vital for building a comprehensive representation of people's physical and mental states. However, multimodal biosignals often exhibit substantial distributional shifts between pretraining and inference datasets, stemming from changes in task specification or variations in modality compositions. To achieve effective pretraining in the presence of potential distributional shifts, we propose a frequency-aware masked autoencoder ($\texttt{bio}$FAME) that learns to parameterize the representation of biosignals in the frequency space. $\texttt{bio}$FAME incorporates a frequency-aware transformer, which leverages a fixed-size Fourier-based operator for global token mixing, independent of the length and sampling rate of inputs. To maintain the frequency components within each input channel, we further employ a frequency-maintain pretraining strategy that performs masked autoencoding in the latent space. The resulting architecture effectively utilizes multimodal information during pretraining, and can be seamlessly adapted to diverse tasks and modalities at test time, regardless of input size and order. We evaluated our approach on a diverse set of transfer experiments on unimodal time series, achieving an average of $\uparrow$5.5% improvement in classification accuracy over the previous state-of-the-art. Furthermore, we demonstrated that our architecture is robust in modality mismatch scenarios, including unpredicted modality dropout or substitution, proving its practical utility in real-world applications. Code will be available soon.
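One hedged way to picture the fixed-size Fourier-based token mixing is the layer below: an FFT along the time axis, a learned complex operator applied to the lowest num_bins frequencies, and an inverse FFT back, so the parameter count does not depend on input length or sampling rate. This is an illustrative reading of the abstract, not the actual bioFAME layer.

```python
import torch
import torch.nn as nn

class FrequencyMixer(nn.Module):
    """Global token mixing via a fixed-size learned operator in frequency space."""
    def __init__(self, dim, num_bins=32):
        super().__init__()
        self.num_bins = num_bins
        self.w_real = nn.Parameter(torch.ones(num_bins, dim))
        self.w_imag = nn.Parameter(torch.zeros(num_bins, dim))

    def forward(self, x):                          # x: (batch, time, dim)
        n = x.shape[1]
        spec = torch.fft.rfft(x, dim=1)            # (batch, n//2 + 1, dim), complex
        k = min(self.num_bins, spec.shape[1])
        op = torch.complex(self.w_real[:k], self.w_imag[:k])
        spec = torch.cat([spec[:, :k] * op, spec[:, k:]], dim=1)
        return torch.fft.irfft(spec, n=n, dim=1)   # back to (batch, time, dim)

# the same parameters handle any sequence length, e.g.:
# FrequencyMixer(dim=64)(torch.randn(8, 500, 64)).shape == (8, 500, 64)
```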


FastFill: Efficient Compatible Model Update

arXiv.org Artificial Intelligence

In many retrieval systems the original high-dimensional data (e.g., images) is mapped to a lower-dimensional feature through a learned embedding model. The task of retrieving the data most similar to a given query from a gallery set is performed through a similarity comparison on features. When the embedding model is updated, it might produce features that are not comparable/compatible with features already in the gallery computed with the old model. Subsequently, all features in the gallery need to be re-computed using the new embedding model, a computationally expensive process called backfilling. Recently, compatible representation learning methods have been proposed to avoid backfilling. Despite their relative success, there is an inherent trade-off between the new model performance and its compatibility with the old model. In this work, we introduce FastFill: a compatible model update process using feature alignment and policy-based partial backfilling to promptly elevate retrieval performance. We show that previous backfilling strategies suffer from decreased performance and demonstrate the importance of both the training objective and the ordering in online partial backfilling. We propose a new training method for feature alignment between old and new embedding models using uncertainty estimation. Compared to previous works, we obtain significantly improved backfilling results on a variety of datasets, including mAP on ImageNet (+4.4%). Further, we demonstrate that when updating a biased model with FastFill, the minority subgroup accuracy gap promptly vanishes with a small fraction of partial backfilling.

Retrieval problems have become increasingly popular for many real-life applications such as face recognition, voice recognition, image localization, and object identification. In an image retrieval setup, we have a large set of images called the gallery set with predicted labels and a set of unknown query images. The aim of image retrieval is to match query images to related images in the gallery set, ideally of the same class/identity.
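A rough sketch of how feature alignment and policy-based partial backfilling could interact, under stated assumptions: old gallery features are passed through a learned alignment module so they remain comparable with the new model, and the re-embedding budget is spent on the items an uncertainty score flags as least reliable. All names are illustrative, not the FastFill code.

```python
import torch

@torch.no_grad()
def partial_backfill(gallery_images, old_feats, new_model, align, uncertainty, budget):
    """Replace the `budget` most uncertain old features with new-model features."""
    feats = align(old_feats)                        # aligned old features, query-compatible
    scores = uncertainty(old_feats)                 # (N,), higher = less reliable
    order = torch.argsort(scores, descending=True)  # backfilling policy: worst first
    for i in order[:budget].tolist():
        feats[i] = new_model(gallery_images[i].unsqueeze(0)).squeeze(0)
    return feats                                    # gallery is usable at every stage
```

Queries embedded with the new model are then matched against this mixed gallery, which stays searchable while the remaining items are backfilled in the background.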


Extracurricular Learning: Knowledge Transfer Beyond Empirical Distribution

arXiv.org Machine Learning

Knowledge distillation has been used to transfer knowledge learned by a sophisticated model (teacher) to a simpler model (student). This technique is widely used to compress model complexity. However, in most applications the compressed student model suffers from an accuracy gap with its teacher. We propose extracurricular learning, a novel knowledge distillation method, that bridges this gap by (1) modeling student and teacher output distributions; (2) sampling examples from an approximation to the underlying data distribution; and (3) matching student and teacher output distributions over this extended set including uncertain samples. We conduct rigorous evaluations on regression and classification tasks and show that compared to the standard knowledge distillation, extracurricular learning reduces the gap by 46% to 68%.

For example, both the PyramidNet-110 model [23] and the larger PyramidNet-200 model achieve perfect accuracy on the CIFAR100 [32] training set, while the latter has 3% higher generalization accuracy. This motivated transferring the "knowledge" encoded in the more accurate larger model to the smaller one. Knowledge Distillation [8, 27] (KD) established an important mechanism through which one model (typically of higher capacity, called teacher) can train another model (typically a smaller model that satisfies the computational budget, called student). KD has been implemented in many machine learning tasks, for example image classification [27], object detection [12, 65], video labeling [74], natural language processing [60, 41, 57, 36, 61], and speech recognition [11, 59, 37]. The idea of KD is to encourage the student to imitate the teacher's behavior over a set of data points, called the transfer set.
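For a regression setting, the three ingredients above could be sketched as follows: student and teacher each output a Gaussian (mean and log-variance), extra inputs are drawn from a crude approximation of the data distribution (here, convex combinations of training points), and the student matches the teacher's output distribution on this extended set with a Gaussian KL term. The sampling scheme and interfaces are assumptions for illustration, and in practice this term would be combined with the usual supervised loss.

```python
import torch

def gaussian_kl(mu_t, logvar_t, mu_s, logvar_s):
    """KL( N(mu_t, var_t) || N(mu_s, var_s) ), averaged over the batch."""
    var_t, var_s = logvar_t.exp(), logvar_s.exp()
    return 0.5 * (logvar_s - logvar_t + (var_t + (mu_t - mu_s) ** 2) / var_s - 1.0).mean()

def extracurricular_loss(student, teacher, x):
    # (2) sample extra inputs by interpolating pairs of real examples
    perm = torch.randperm(x.shape[0], device=x.device)
    lam = torch.rand(x.shape[0], *([1] * (x.dim() - 1)), device=x.device)
    x_ext = torch.cat([x, lam * x + (1 - lam) * x[perm]], dim=0)
    # (1) both models predict an output distribution (mean, log-variance)
    mu_s, logvar_s = student(x_ext)
    with torch.no_grad():
        mu_t, logvar_t = teacher(x_ext)
    # (3) match distributions over the extended set, uncertain samples included
    return gaussian_kl(mu_t, logvar_t, mu_s, logvar_s)
```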