Foundational Models Defining a New Era in Vision: A Survey and Outlook

Awais, Muhammad, Naseer, Muzammal, Khan, Salman, Anwer, Rao Muhammad, Cholakkal, Hisham, Shah, Mubarak, Yang, Ming-Hsuan, Khan, Fahad Shahbaz

Jul-25-2023, arXiv.org Artificial Intelligence

Vision systems that see and reason about the compositional nature of visual scenes are fundamental to understanding our world. The complex relations between objects and their locations, ambiguities, and variations in real-world environments can be better described in human language, which is naturally governed by grammatical rules, and in other modalities such as audio and depth. Models that learn to bridge the gap between such modalities, coupled with large-scale training data, facilitate contextual reasoning, generalization, and prompting capabilities at test time. These models are referred to as foundational models. The output of such models can be modified through human-provided prompts without retraining, e.g., segmenting a particular object by providing a bounding box, holding an interactive dialogue by asking questions about an image or video scene, or manipulating a robot's behavior through language instructions. In this survey, we provide a comprehensive review of such emerging foundational models, including typical architecture designs for combining different modalities (vision, text, audio, etc.), training objectives (contrastive, generative), pre-training datasets, fine-tuning mechanisms, and the common prompting patterns: textual, visual, and heterogeneous. We discuss the open challenges and research directions for foundational models in computer vision, including difficulties in their evaluation and benchmarking, gaps in their real-world understanding, limitations of their contextual understanding, biases, vulnerability to adversarial attacks, and interpretability issues. We review recent developments in this field, covering a wide range of applications of foundational models systematically and comprehensively. A comprehensive list of the foundational models studied in this work is available at https://github.com/awaisrauf/Awesome-CV-Foundational-Models.



Country:
- Oceania > Australia
  - Western Australia > Perth (0.04)
  - Australian Capital Territory > Canberra (0.04)
- North America > United States
  - Florida > Orange County
    - Orlando (0.14)
  - California > Merced County
    - Merced (0.14)
- Europe
  - Poland (0.04)
  - Switzerland > Zürich
    - Zürich (0.14)
  - Sweden > Östergötland County
    - Linköping (0.04)
  - Netherlands > North Holland
    - Amsterdam (0.04)
  - Italy > Tuscany
    - Florence (0.04)
  - Germany > Bavaria
    - Upper Bavaria > Munich (0.04)
  - France > Grand Est
    - Bas-Rhin > Strasbourg (0.04)
- Asia
  - South Korea > Daejeon
    - Daejeon (0.04)
  - Middle East
    - Jordan (0.04)
    - UAE > Abu Dhabi Emirate
      - Abu Dhabi (0.14)
    - Israel > Tel Aviv District
      - Tel Aviv (0.04)
  - Japan > Shikoku
    - Kagawa Prefecture > Takamatsu (0.04)
- Africa > Rwanda
  - Kigali > Kigali (0.04)

Genre:
- Overview (1.00)
- Research Report > New Finding (0.45)

Industry:
- Education (1.00)
- Information Technology > Security & Privacy (0.65)
- Health & Medicine > Diagnostic Medicine
  - Imaging (1.00)

Technology:
- Information Technology
  - Sensing and Signal Processing > Image Processing (1.00)
  - Artificial Intelligence
    - Vision (1.00)
    - Robots (1.00)
    - Representation & Reasoning (1.00)
    - Cognitive Science > Problem Solving (0.65)
    - Natural Language
      - Large Language Model (1.00)
      - Chatbot (1.00)
      - Text Processing (0.92)
    - Machine Learning > Neural Networks
      - Deep Learning (1.00)
