
Collaborating Authors

 Ghosh, Soham


Pixtral 12B

arXiv.org Artificial Intelligence

We introduce Pixtral-12B, a 12-billion-parameter multimodal language model. Pixtral-12B is trained to understand both natural images and documents, achieving leading performance on various multimodal benchmarks and surpassing a number of larger models. Unlike many open-source models, Pixtral is also a cutting-edge text model for its size, and does not compromise on natural language performance to excel in multimodal tasks. Pixtral uses a new vision encoder trained from scratch, which allows it to ingest images at their natural resolution and aspect ratio. This gives users flexibility in the number of tokens used to process an image. Pixtral is also able to process any number of images in its long context window of 128K tokens. Pixtral 12B substantially outperforms other open models of similar sizes (Llama-3.2 11B & Qwen-2-VL 7B). It also outperforms much larger open models like Llama-3.2 90B while being 7x smaller. We further contribute an open-source benchmark, MM-MT-Bench, for evaluating vision-language models in practical scenarios, and provide detailed analysis and code for standardized evaluation protocols for multimodal LLMs. Pixtral-12B is released under the Apache 2.0 license.
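
The architectural point in this abstract is that the vision encoder tokenizes images at their native resolution and aspect ratio, so the number of image tokens scales with image size rather than being fixed. Below is a minimal sketch of that size-to-token mapping; the 16-pixel patch size and the rounding-up behaviour at image edges are illustrative assumptions, not details taken from the abstract.

```python
import math

# Minimal sketch (not Pixtral's actual code): how a vision encoder that accepts
# images at their native resolution and aspect ratio might map image size to a
# variable number of patch tokens. PATCH_SIZE is a hypothetical value.
PATCH_SIZE = 16  # assumed patch edge length in pixels


def num_image_tokens(width: int, height: int, patch_size: int = PATCH_SIZE) -> int:
    """Number of patch tokens for an image kept at its natural resolution.

    The image is tiled into patch_size x patch_size patches; partial patches at
    the right/bottom edges are assumed to be padded to a full patch (round up).
    """
    cols = math.ceil(width / patch_size)
    rows = math.ceil(height / patch_size)
    return rows * cols


# A large document page and a small thumbnail yield very different token
# budgets, which is the flexibility the abstract refers to.
for w, h in [(1024, 1024), (1680, 720), (320, 240)]:
    print(f"{w}x{h} image -> {num_image_tokens(w, h)} patch tokens")
```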


VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners

arXiv.org Artificial Intelligence

Given a well-pretrained image-text foundation model, it is natural to question whether any heavy video-specific adaptor or much video-specific data is needed when transferring to video-text modelling. In this paper, we explore an efficient approach to establish a foundational video-text model for tasks including open-vocabulary video classification, text-to-video retrieval, video captioning and video question-answering. We present VideoCoCa, a minimalist approach that extends the image-text contrastive captioners (CoCa) [68] to video-text tasks. The design principle of VideoCoCa is to maximally reuse a pretrained image-text contrastive captioner (CoCa) model and adapt it to video-text tasks with minimal extra training. While previous works adapt image-text models with various cross-frame fusion modules, we find that the generative attentional pooling and contrastive attentional pooling layers in CoCa are instantly adaptable to flattened frame embeddings, yielding state-of-the-art results on zero-shot video classification and zero-shot text-to-video retrieval. Furthermore, we explore lightweight finetuning on top of VideoCoCa, and achieve strong results on video question-answering and video captioning.
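
The core mechanism described above is feeding the flattened per-frame token embeddings to CoCa's existing attentional pooling layers. A minimal PyTorch sketch of that idea follows; the embedding dimension, query counts, and the use of nn.MultiheadAttention are illustrative assumptions, not the CoCa/VideoCoCa implementation.

```python
import torch
import torch.nn as nn

# Sketch of "flatten frames, then attentionally pool": learned queries
# cross-attend over the concatenated per-frame token embeddings, producing
# either one video-level embedding (contrastive head) or a fixed set of tokens
# for a text decoder (generative head).


class AttentionalPooler(nn.Module):
    """Learned queries cross-attend over a variable-length set of tokens."""

    def __init__(self, dim: int = 512, num_queries: int = 1, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim)
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)
        return pooled  # (batch, num_queries, dim)


batch, frames, tokens_per_frame, dim = 2, 8, 196, 512
frame_tokens = torch.randn(batch, frames, tokens_per_frame, dim)  # per-frame encoder outputs

# Flatten the temporal axis into the token axis: (batch, frames * tokens, dim).
flat = frame_tokens.reshape(batch, frames * tokens_per_frame, dim)

contrastive_pooler = AttentionalPooler(dim, num_queries=1)    # one video-level embedding
generative_pooler = AttentionalPooler(dim, num_queries=256)   # tokens for a text decoder

video_embedding = contrastive_pooler(flat)  # shape (2, 1, 512)
decoder_tokens = generative_pooler(flat)    # shape (2, 256, 512)
print(video_embedding.shape, decoder_tokens.shape)
```

The poolers themselves are unchanged by the flattening; only their input grows from one image's tokens to all frames' tokens, which is why (per the abstract) no heavy video-specific adaptor is required.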


Concurrent Meta Reinforcement Learning

arXiv.org Artificial Intelligence

State-of-the-art meta reinforcement learning algorithms typically assume the setting of a single agent interacting with its environment in a sequential manner. A negative side-effect of this sequential execution paradigm is that, as the environment becomes more challenging and thus requires more interaction episodes from the meta-learner, the agent must reason over longer and longer time-scales. To combat the difficulty of long time-scale credit assignment, we propose an alternative parallel framework, which we name "Concurrent Meta-Reinforcement Learning" (CMRL), that transforms the temporal credit assignment problem into a multi-agent reinforcement learning one. In this multi-agent setting, a set of parallel agents is executed in the same environment, and each of these "rollout" agents is given the means to communicate with the others. The goal of the communication is to coordinate, in a collaborative manner, the most efficient exploration of the shared task the agents are currently assigned. This coordination therefore represents the meta-learning aspect of the framework, as each agent can be assigned, or assign itself, a particular section of the current task's state space. This framework contrasts with standard RL methods, which assume that each parallel rollout occurs independently and can therefore waste computation if many of the rollouts end up sampling the same part of the state space. Furthermore, the parallel setting enables us to define several reward-sharing functions and auxiliary losses that are non-trivial to apply in the sequential setting. We demonstrate the effectiveness of the proposed CMRL at improving over sequential methods in a variety of challenging tasks.
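
The sketch below illustrates the coordination intuition from this abstract: several rollout agents act in the same task, and a shared reward term credits each agent only for states no other agent visited, so agents that split up the state space score higher than agents that all sample the same region. The grid environment, the coverage bonus, and all hyperparameters are illustrative assumptions, not the CMRL algorithm itself.

```python
import random

NUM_AGENTS = 4
GRID_SIZE = 10
STEPS = 50


def rollout(seed: int) -> set[tuple[int, int]]:
    """Random-walk rollout on a grid; returns the set of visited cells."""
    rng = random.Random(seed)
    x = y = 0
    visited = {(x, y)}
    for _ in range(STEPS):
        dx, dy = rng.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
        x = min(max(x + dx, 0), GRID_SIZE - 1)
        y = min(max(y + dy, 0), GRID_SIZE - 1)
        visited.add((x, y))
    return visited


# All agents explore the *same* environment concurrently.
trajectories = [rollout(seed) for seed in range(NUM_AGENTS)]

# Shared "coverage" reward: each agent is credited only for cells that no other
# agent visited, penalizing redundant exploration of the same region.
for i, visited in enumerate(trajectories):
    others = set().union(*(t for j, t in enumerate(trajectories) if j != i))
    unique = visited - others
    print(f"agent {i}: visited {len(visited)} cells, "
          f"{len(unique)} unique -> shared reward {len(unique)}")
```

Such a non-independent, overlap-aware reward is an example of the reward-sharing functions that only make sense in the parallel setting; in a purely sequential setup there is no concurrent trajectory to compare against.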