AITopics | learnable query

54c67d3db2df24a31cf045525f9460b9-Paper-Conference.pdf

Neural Information Processing Systems | Feb-13-2026, 12:56:42 GMT

artificial intelligence, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Country:

Europe > Netherlands > North Holland > Amsterdam (0.04)
Asia > China > Heilongjiang Province > Daqing (0.04)
Asia > China > Shaanxi Province > Xi'an (0.04)
(2 more...)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
(2 more...)

TransMatcher: Deep Image Matching Through Transformers for Generalizable Person Re-identification: Appendix

Neural Information Processing Systems | Feb-7-2026, 12:37:28 GMT

Some algorithms perform unstably across different runs, thus the average among several runs is a more stable measure. Using a unified measure is convenient, concise, and space-saving for ablation study and parameter analysis. Here H = h and W = w, but to be clear, let's denote them differently. Then in Eq. (7), GMP is applied along the last dimension of hw elements, resulting in a vector of size HW. Third, the proposed method has already considered the efficiency, with its simplified decoder and balanced parameter selection, and thus it is the most efficient one in cross-matching Transformers as shown in Table 2 of the main paper.
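
The pooling step described for Eq. (7) can be illustrated with a minimal sketch. It assumes the matching scores are stored as an HW x hw array (one row per query location, one column per gallery location). The function name `global_max_pool_last` and the toy values are hypothetical and are not taken from the paper.

```python
from typing import List


def global_max_pool_last(scores: List[List[float]]) -> List[float]:
    """Apply global max pooling (GMP) along the last dimension.

    `scores` is assumed to have shape (HW, hw): one row per query
    location and one column per gallery location. The result has HW
    entries, the best score found for each query location.
    """
    return [max(row) for row in scores]


# Toy example with H = W = 2 (HW = 4 rows) and h = w = 2 (hw = 4 columns).
scores = [
    [0.1, 0.7, 0.3, 0.2],
    [0.9, 0.4, 0.0, 0.5],
    [0.2, 0.2, 0.6, 0.1],
    [0.3, 0.8, 0.8, 0.4],
]
pooled = global_max_pool_last(scores)
print(pooled)  # [0.7, 0.9, 0.6, 0.8]
```

Each row of hw scores collapses to a single value, so the output length depends only on HW.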

artificial intelligence, machine learning, transmatcher, (17 more...)

Neural Information Processing Systems

Country: Asia > Middle East > UAE (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

54c67d3db2df24a31cf045525f9460b9-Paper-Conference.pdf

Neural Information Processing Systems | Oct-10-2025, 02:49:13 GMT

decoder, proceedings, query, (16 more...)

Neural Information Processing Systems

Country:

Europe > Netherlands > North Holland > Amsterdam (0.04)
Asia > China > Heilongjiang Province > Daqing (0.04)
Asia > China > Shaanxi Province > Xi'an (0.04)
(2 more...)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
(2 more...)

Turbocharging Web Automation: The Impact of Compressed History States

Zhu, Xiyue, Tang, Peng, Liao, Haofu, Appalaraju, Srikar

arXiv.org Artificial Intelligence | Jul-30-2025

Language models have led to a leap forward in web automation. The current web automation approaches take the current web state, history actions, and language instruction as inputs to predict the next action, overlooking the importance of history states. However, the highly verbose nature of web page states can result in long input sequences and sparse information, hampering the effective utilization of history states. In this paper, we propose a novel web history compressor approach to turbocharge web automation using history states. Our approach employs a history compressor module that distills the most task-relevant information from each history state into a fixed-length short representation, mitigating the challenges posed by the highly verbose history states. Experiments are conducted on the Mind2Web and WebLINX datasets to evaluate the effectiveness of our approach. Results show that our approach obtains 1.2-5.4% absolute accuracy improvements compared to the baseline approach without history inputs.
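
One way to picture how a verbose history state could be reduced to a fixed-length representation is cross-attention from a small set of query vectors. The abstract does not spell out the compressor's internals, so the sketch below is an assumption: it uses single-head attention without learned projections, random vectors stand in for trained parameters, and `compress_state` is a hypothetical name.

```python
import math
import random
from typing import List

Vector = List[float]


def softmax(xs: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def compress_state(queries: List[Vector], state_tokens: List[Vector]) -> List[Vector]:
    """Cross-attend K query vectors over N state tokens.

    Returns K vectors regardless of N, so every history state ends up
    with the same short length. Keys and values are the raw tokens
    here; a real module would add learned projections.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    compressed = []
    for q in queries:
        scores = [scale * sum(qi * ti for qi, ti in zip(q, tok)) for tok in state_tokens]
        weights = softmax(scores)
        compressed.append(
            [sum(w * tok[d] for w, tok in zip(weights, state_tokens)) for d in range(dim)]
        )
    return compressed


random.seed(0)
DIM, NUM_QUERIES = 8, 4
# Stand-ins for trained parameters: in practice these would be learned.
queries = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]

# Two history states of very different lengths (e.g. verbose web pages).
short_state = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(30)]
long_state = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(500)]

for state in (short_state, long_state):
    out = compress_state(queries, state)
    print(len(state), "tokens ->", len(out), "vectors of size", len(out[0]))
```

Both the 30-token and the 500-token state come out as 4 vectors of size 8, which is the property that keeps the total input length manageable when several history states are concatenated.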

history input, large language model, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2507.21369

Country:

North America > United States > Illinois (0.04)
Asia > Singapore (0.04)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.73)

BREEN: Bridge Data-Efficient Encoder-Free Multimodal Learning with Learnable Queries

Li, Tianle, Rao, Yongming, Hu, Winston, Cheng, Yu

arXiv.org Artificial Intelligence | Mar-16-2025

Encoder-free multimodal large language models (MLLMs) eliminate the need for a well-trained vision encoder by directly processing image tokens before the language model. While this approach reduces computational overhead and model complexity, it often requires large amounts of training data to effectively capture the visual knowledge typically encoded by vision models like CLIP. The absence of a vision encoder implies that the model is likely to rely on substantial data to learn the necessary visual-semantic alignments. In this work, we present BREEN, a data-efficient encoder-free multimodal architecture that mitigates this issue. BREEN leverages a learnable query and image experts to achieve comparable performance with significantly less training data. The learnable query, positioned between image and text tokens, is supervised by the output of a pretrained CLIP model to distill visual knowledge, bridging the gap between visual and textual modalities. Additionally, the image expert processes image tokens and learnable queries independently, improving efficiency and reducing interference with the LLM's textual capabilities. BREEN achieves comparable performance to prior encoder-free state-of-the-art models like Mono-InternVL, using only 13 million text-image pairs in training, about one percent of the data required by existing methods. Our work highlights a promising direction for data-efficient encoder-free multimodal learning, offering an alternative to traditional encoder-based approaches.
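
The idea of supervising learnable-query outputs with CLIP features can be sketched as a distillation loss. The abstract does not state which loss is used, so the mean cosine distance below is an assumption chosen for illustration, and the names `distillation_loss`, `teacher`, `aligned`, and `misaligned` are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def distillation_loss(query_outputs: List[Vector], clip_features: List[Vector]) -> float:
    """Mean (1 - cosine similarity) between student and teacher vectors.

    `query_outputs` stands for the model's outputs at the learnable-query
    positions; `clip_features` stands for the frozen teacher's outputs.
    The loss is 0 when every pair points in the same direction.
    """
    assert len(query_outputs) == len(clip_features)
    distances = [1.0 - cosine_similarity(q, c) for q, c in zip(query_outputs, clip_features)]
    return sum(distances) / len(distances)


teacher = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
aligned = [[3.0, 0.0, 0.0], [0.0, 1.0, 0.0]]     # same directions as the teacher
misaligned = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]  # orthogonal to the teacher

print(distillation_loss(aligned, teacher))     # 0.0
print(distillation_loss(misaligned, teacher))  # 1.0
```

Because the cosine distance ignores vector length, this variant only asks the query outputs to match the teacher's direction, not its scale.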

large language model, learnable query, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2503.12446

Country:

Europe > Monaco (0.04)
Asia > Middle East > Saudi Arabia > Asir Province > Abha (0.04)
Asia > Middle East > Jordan (0.04)
(2 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Scaling Capability in Token Space: An Analysis of Large Vision Language Model

Li, Tenghui, Zhou, Guoxu, Zhao, Xuyang, Zhao, Qibin

arXiv.org Artificial Intelligence | Dec-30-2024

The scaling capability has been widely validated in neural language models with respect to the number of parameters and the size of training data. One important question is whether a similar scaling capability also exists with respect to the number of vision tokens in large vision language models. This study fills the gap by investigating the relationship between the number of vision tokens and the performance of vision-language models. Our theoretical analysis and empirical evaluations demonstrate that the model exhibits scalable performance \(S(N_l)\) with respect to the number of vision tokens \(N_l\), characterized by the relationship \(S(N_l) \approx (c/N_l)^{\alpha}\). Furthermore, we also investigate the impact of a fusion mechanism that integrates the user's question with vision tokens. The results reveal two key findings. First, the scaling capability remains intact with the incorporation of the fusion mechanism. Second, the fusion mechanism enhances model performance, particularly when the user's question is task-specific and relevant. The analysis, conducted on fifteen diverse benchmarks spanning a broad range of tasks and domains, validates the effectiveness of the proposed approach.
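
A relationship of the form \(S(N_l) \approx (c/N_l)^{\alpha}\) becomes a straight line after taking logarithms, so \(c\) and \(\alpha\) can be estimated with ordinary least squares. The sketch below is illustrative only: the data are synthetic, generated from assumed values, and `fit_scaling_law` is a hypothetical helper rather than the authors' procedure.

```python
import math
from typing import List, Tuple


def fit_scaling_law(num_tokens: List[float], scores: List[float]) -> Tuple[float, float]:
    """Fit S(N) = (c / N) ** alpha by least squares in log-log space.

    Taking logs gives log S = alpha * log c - alpha * log N, a straight
    line in log N with slope -alpha and intercept alpha * log c.
    Requires positive inputs and a non-zero fitted slope.
    """
    xs = [math.log(n) for n in num_tokens]
    ys = [math.log(s) for s in scores]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum(
        (x - x_mean) ** 2 for x in xs
    )
    intercept = y_mean - slope * x_mean
    alpha = -slope
    c = math.exp(intercept / alpha)
    return c, alpha


# Synthetic, noise-free data generated from assumed values c = 8, alpha = 0.5.
tokens = [16.0, 64.0, 256.0, 1024.0]
synthetic = [(8.0 / n) ** 0.5 for n in tokens]

c_hat, alpha_hat = fit_scaling_law(tokens, synthetic)
print(round(c_hat, 6), round(alpha_hat, 6))
```

On noise-free data the fit recovers the generating values up to floating-point error; with real benchmark measurements the estimates would carry uncertainty that this sketch does not quantify.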

language model, sequence, vision token, (16 more...)

arXiv.org Artificial Intelligence

2412.18387

Country:

Europe > Austria > Vienna (0.14)
North America > United States > Florida > Miami-Dade County > Miami (0.14)
Europe > Switzerland > Zürich > Zürich (0.14)
(15 more...)

Genre: Research Report > New Finding (0.92)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Referencing Where to Focus: Improving Visual Grounding with Referential Query

Wang, Yabing, Tian, Zhuotao, Guo, Qingpei, Qin, Zheng, Zhou, Sanping, Yang, Ming, Wang, Le

arXiv.org Artificial Intelligence | Dec-26-2024

Visual Grounding aims to localize the referring object in an image given a natural language expression. Recent advancements in DETR-based visual grounding methods have attracted considerable attention, as they directly predict the coordinates of the target object without relying on additional efforts, such as pre-generated proposal candidates or pre-defined anchor boxes. However, existing research primarily focuses on designing stronger multi-modal decoders, which typically generate learnable queries by random initialization or by using linguistic embeddings. This vanilla query generation approach inevitably increases the learning difficulty for the model, as it does not involve any target-related information at the beginning of decoding. Furthermore, existing methods only use the deepest image feature during the query learning process, overlooking the importance of features from other levels. To address these issues, we propose a novel approach, called RefFormer. It consists of a query adaption module that can be seamlessly integrated into CLIP and generates the referential query to provide prior context for the decoder, along with a task-specific decoder. By incorporating the referential query into the decoder, we can effectively mitigate the learning difficulty of the decoder and accurately concentrate on the target object. Additionally, our proposed query adaption module can also act as an adapter, preserving the rich knowledge within CLIP without the need to tune the parameters of the backbone network. Extensive experiments demonstrate the effectiveness and efficiency of our proposed method, outperforming state-of-the-art approaches on five visual grounding benchmarks.
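
The two vanilla query-generation options named in the abstract, random initialization and linguistic embeddings, can be contrasted in a short sketch. Everything concrete here is an assumption made for illustration: the function names are hypothetical, and mean pooling is just one simple way to turn token embeddings into a query. RefFormer's own query adaption module is not reproduced.

```python
import random
from typing import List

Vector = List[float]


def random_queries(num_queries: int, dim: int, seed: int = 0) -> List[Vector]:
    """Vanilla option 1: queries start as random vectors (then learned)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_queries)]


def linguistic_queries(token_embeddings: List[Vector], num_queries: int) -> List[Vector]:
    """Vanilla option 2: queries start from the text embeddings.

    Mean pooling over tokens is an assumption made for this sketch.
    """
    dim = len(token_embeddings[0])
    pooled = [
        sum(tok[d] for tok in token_embeddings) / len(token_embeddings) for d in range(dim)
    ]
    return [list(pooled) for _ in range(num_queries)]


text = [[1.0, 0.0, 2.0], [3.0, 4.0, 0.0]]  # toy embeddings for a 2-token expression
print(random_queries(2, 3))
print(linguistic_queries(text, 2))  # [[2.0, 2.0, 1.0], [2.0, 2.0, 1.0]]
```

Neither function takes image features as input, which matches the limitation the abstract points out: these queries carry no target-related visual information when decoding begins.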

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2412.19155

Country:

Europe > Netherlands > North Holland > Amsterdam (0.04)
Asia > China > Heilongjiang Province > Daqing (0.04)
Asia > China > Shaanxi Province > Xi'an (0.04)
(2 more...)

Genre:

Research Report > Experimental Study (1.00)
Research Report > Promising Solution (0.68)
Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.67)
