voken


Revolutionizing Text-to-Image Retrieval as Autoregressive Token-to-Voken Generation

Li, Yongqi, Cai, Hongru, Wang, Wenjie, Qu, Leigang, Wei, Yinwei, Li, Wenjie, Nie, Liqiang, Chua, Tat-Seng

arXiv.org Artificial Intelligence

Text-to-image retrieval is a fundamental task in multimedia processing, aiming to retrieve semantically relevant cross-modal content. Traditional studies have typically approached this task as a discriminative problem, matching the text and image either via a cross-attention mechanism (one-tower framework) or in a common embedding space (two-tower framework). Recently, generative cross-modal retrieval has emerged as a new research line: it assigns each image a unique string identifier and performs retrieval by generating the target image's identifier. Despite its great potential, existing generative approaches are limited by the following issues: insufficient visual information in identifiers, misalignment with high-level semantics, and a learning gap with respect to the retrieval target. To address these issues, we propose an autoregressive voken generation method, named AVG. AVG tokenizes images into vokens, i.e., visual tokens, and innovatively formulates the text-to-image retrieval task as a token-to-voken generation problem. AVG discretizes an image into a sequence of vokens that serves as the image's identifier, while maintaining alignment with both the visual information and the high-level semantics of the image. Additionally, to bridge the learning gap between generative training and the retrieval target, we incorporate discriminative training to adjust the learning direction during token-to-voken training. Extensive experiments demonstrate that AVG achieves superior results in both effectiveness and efficiency.
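As a rough illustration of the token-to-voken formulation, the sketch below discretizes an image feature into a voken-ID sequence by nearest-neighbour lookup in a small codebook. All names, shapes, and the codebook itself are hypothetical; AVG's actual image tokenizer is learned, not a fixed nearest-neighbour quantizer.

```python
# Toy sketch of turning an image feature into a voken-ID sequence
# (hypothetical codebook; AVG's real tokenizer is learned end-to-end).

def image_to_vokens(image_feat, codebook):
    """Discretize an image feature into a sequence of voken IDs.

    Each fixed-size chunk of the feature is mapped to its nearest
    codebook entry; the resulting ID sequence acts as the image's
    identifier, which a text-conditioned model can then generate
    autoregressively as the retrieval target.
    """
    chunk = len(codebook[0])
    vokens = []
    for i in range(0, len(image_feat), chunk):
        piece = image_feat[i:i + chunk]
        # nearest codebook entry by squared L2 distance
        dists = [sum((p - c) ** 2 for p, c in zip(piece, code))
                 for code in codebook]
        vokens.append(dists.index(min(dists)))
    return vokens

codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]   # 3 vokens, dim 2
image_feat = [0.9, 1.1, 0.1, 0.05]                # two chunks
print(image_to_vokens(image_feat, codebook))       # -> [1, 0]
```

In the full method, a generative (next-voken prediction) loss would be combined with a discriminative loss over candidate images, which is the paper's device for closing the gap between generation training and the retrieval objective.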


MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens

Zheng, Kaizhi, He, Xuehai, Wang, Xin Eric

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have garnered significant attention for their advancements in natural language processing, demonstrating unparalleled prowess in text comprehension and generation. Yet, the simultaneous generation of images with coherent textual narratives remains an evolving frontier. In response, we introduce an innovative interleaved vision-and-language generation technique anchored by the concept of "generative vokens", acting as the bridge for harmonized image-text outputs. Our approach is characterized by a distinctive two-stage training strategy focusing on description-free multimodal generation, where the training requires no comprehensive descriptions of images. To bolster model integrity, classifier-free guidance is incorporated, enhancing the effectiveness of vokens on image generation. Our model, MiniGPT-5, exhibits substantial improvement over the baseline Divter model on the MMDialog dataset and consistently delivers superior or comparable multimodal outputs in human evaluations on the VIST dataset, highlighting its efficacy across diverse benchmarks. In the recent development of larger-scale vision-and-language models, multimodal feature integration is not just an evolving trend but a critical advancement shaping a wide array of applications, from multimodal dialogue agents to cutting-edge content creation tools. With the surge in research and development in this domain, vision-and-language models such as (Wu et al., 2023a; Li et al., 2023b; Tsimpoukelli et al., 2021; Alayrac et al., 2022) are on the brink of an era where they are expected to comprehend and generate both text and image content seamlessly. This multi-faceted ability is crucial, as it fosters enhanced interactions across various domains like virtual reality, media, and e-commerce.
Essentially, the task is to enable models to coherently synthesize, recognize, and respond using both visual and textual modalities, harmonizing the information flow and creating cohesive narratives.
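The classifier-free guidance step that the abstract mentions can be sketched as the standard CFG combination rule: push the conditional prediction away from an unconditional one to strengthen the influence of the conditioning, here the generative vokens. The scale value and toy vectors below are illustrative, not taken from MiniGPT-5.

```python
# Minimal sketch of classifier-free guidance as applied to
# voken-conditioned image generation (illustrative values only).

def classifier_free_guidance(cond_pred, uncond_pred, scale=2.0):
    """Standard CFG rule: guided = uncond + scale * (cond - uncond).

    With scale > 1, the output moves further from the unconditional
    prediction, amplifying the effect of the conditioning signal
    (the generative vokens) on the generated image.
    """
    return [u + scale * (c - u) for c, u in zip(cond_pred, uncond_pred)]

cond = [1.0, 0.0]     # prediction conditioned on the voken features
uncond = [0.5, 0.5]   # prediction with the conditioning dropped
print(classifier_free_guidance(cond, uncond))  # -> [1.5, -0.5]
```

Dropping the conditioning during training (so the model also learns the unconditional prediction) is what makes this combination possible at inference time.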


Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision

Tan, Hao, Bansal, Mohit

arXiv.org Artificial Intelligence

Humans learn language by listening, speaking, writing, reading, and also via interaction with the multimodal real world. Existing language pre-training frameworks show the effectiveness of text-only self-supervision, whereas in this paper we explore the idea of a visually-supervised language model. We find that the main reason hindering this exploration is the large divergence in magnitude and distribution between visually-grounded language datasets and pure-language corpora. Therefore, we develop a technique named "vokenization" that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images (which we call "vokens"). The "vokenizer" is trained on relatively small image captioning datasets, and we then apply it to generate vokens for large language corpora. Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks such as GLUE, SQuAD, and SWAG. Code and pre-trained models are publicly available at https://github.com/airsplay/vokenization
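The contextual token-to-voken mapping can be caricatured as a retrieval step: score each token, combined with its sentence context, against a pool of image embeddings and keep the best match. This is a toy sketch with a made-up dot-product relevance function; the actual vokenizer is a trained cross-modal matching model applied over large corpora.

```python
# Toy sketch of "vokenization": assign each language token a voken,
# i.e. the index of its most relevant image (hypothetical scoring).

def vokenize(token_embs, context_emb, image_embs):
    """Map each token to a voken ID via context-aware retrieval.

    Relevance here is the dot product of (token + sentence context)
    with each image embedding; the argmax image becomes the token's
    voken, a visual supervision signal for language pre-training.
    """
    vokens = []
    for tok in token_embs:
        query = [t + c for t, c in zip(tok, context_emb)]
        scores = [sum(q * v for q, v in zip(query, img))
                  for img in image_embs]
        vokens.append(scores.index(max(scores)))
    return vokens

image_embs = [[1.0, 0.0], [0.0, 1.0]]   # two candidate images
context = [0.1, 0.1]                    # pooled sentence embedding
tokens = [[1.0, 0.0], [0.0, 1.0]]       # two token embeddings
print(vokenize(tokens, context, image_embs))  # -> [0, 1]
```

Because the same token can map to different vokens in different sentences, the supervision stays contextual, which is the point of the technique over static word-to-image dictionaries.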