Vazquez, David
StarFlow: Generating Structured Workflow Outputs From Sketch Images
Bechard, Patrice, Wang, Chao, Abaskohi, Amirhossein, Rodriguez, Juan, Pal, Christopher, Vazquez, David, Gella, Spandana, Rajeswar, Sai, Taslakian, Perouz
Workflows are a fundamental component of automation in enterprise platforms, enabling the orchestration of tasks, data processing, and system integrations. Despite being widely used, building workflows can be complex, often requiring manual configuration through low-code platforms or visual programming tools. To simplify this process, we explore the use of generative foundation models, particularly vision-language models (VLMs), to automatically generate structured workflows from visual inputs. Translating hand-drawn sketches or computer-generated diagrams into executable workflows is challenging due to the ambiguity of free-form drawings, variations in diagram styles, and the difficulty of inferring execution logic from visual elements. To address this, we introduce StarFlow, a framework for generating structured workflow outputs from sketches using vision-language models. We curate a diverse dataset of workflow diagrams -- including synthetic, manually annotated, and real-world samples -- to enable robust training and evaluation. We finetune and benchmark multiple vision-language models, conducting a series of ablation studies to analyze the strengths and limitations of our approach. Our results show that finetuning significantly enhances structured workflow generation, with our finetuned models outperforming large general-purpose vision-language models on this task.
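To make "structured workflow output" concrete, below is a minimal, hypothetical Python sketch that parses a model's JSON response into a small step graph and validates its links. The schema (fields such as `steps`, `name`, `action`, and `next`) is invented for illustration and is not the StarFlow output format.

```python
import json
from dataclasses import dataclass, field

@dataclass
class WorkflowStep:
    name: str
    action: str
    next_steps: list = field(default_factory=list)

def parse_workflow(model_output: str) -> dict:
    """Parse a VLM's JSON output into a step graph and check its links.

    The field names ("steps", "name", "action", "next") are hypothetical,
    chosen for illustration; the actual StarFlow schema may differ.
    """
    data = json.loads(model_output)
    steps = {
        s["name"]: WorkflowStep(s["name"], s["action"], s.get("next", []))
        for s in data["steps"]
    }
    # Basic structural check: every referenced successor must exist.
    for step in steps.values():
        missing = [n for n in step.next_steps if n not in steps]
        if missing:
            raise ValueError(f"step {step.name!r} points to unknown steps {missing}")
    return steps

example = ('{"steps": [{"name": "start", "action": "trigger", "next": ["notify"]}, '
           '{"name": "notify", "action": "send_email", "next": []}]}')
print(list(parse_workflow(example)))  # ['start', 'notify']
```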
UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction
Nayak, Shravan, Jian, Xiangru, Lin, Kevin Qinghong, Rodriguez, Juan A., Kalsi, Montek, Awal, Rabiul, Chapados, Nicolas, Özsu, M. Tamer, Agrawal, Aishwarya, Vazquez, David, Pal, Christopher, Taslakian, Perouz, Gella, Spandana, Rajeswar, Sai
Autonomous agents that navigate Graphical User Interfaces (GUIs) to automate tasks like document editing and file management can greatly enhance computer workflows. While existing research focuses on online settings, desktop environments, critical for many professional and everyday tasks, remain underexplored due to data collection challenges and licensing issues. We introduce UI-Vision, the first comprehensive, license-permissive benchmark for offline, fine-grained evaluation of computer use agents in real-world desktop environments. Unlike online benchmarks, UI-Vision provides: (i) dense, high-quality annotations of human demonstrations, including bounding boxes, UI labels, and action trajectories (clicks, drags, and keyboard inputs) across 83 software applications, and (ii) three fine-to-coarse-grained tasks (Element Grounding, Layout Grounding, and Action Prediction) with well-defined metrics to rigorously evaluate agents' performance in desktop environments. Our evaluation reveals critical limitations in state-of-the-art models like UI-TARS-72B, including issues with understanding professional software, spatial reasoning, and complex actions like drag-and-drop. These findings highlight the challenges in developing fully autonomous computer use agents. By releasing UI-Vision as open-source, we aim to advance the development of more capable agents for real-world desktop tasks.
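As an illustration of the element grounding setting, the following sketch scores predicted click points against ground-truth bounding boxes. This is a generic point-in-box accuracy computation assumed for illustration, not the benchmark's released evaluation code.

```python
from dataclasses import dataclass

@dataclass
class BBox:
    """Element bounding box in pixel coordinates: (left, top, right, bottom)."""
    x1: float
    y1: float
    x2: float
    y2: float

    def contains(self, x: float, y: float) -> bool:
        return self.x1 <= x <= self.x2 and self.y1 <= y <= self.y2

def grounding_accuracy(predicted_points, target_boxes) -> float:
    """Fraction of predicted click points that land inside the target element."""
    hits = sum(box.contains(x, y)
               for (x, y), box in zip(predicted_points, target_boxes))
    return hits / max(len(target_boxes), 1)

# Toy example: the first prediction hits its element, the second misses.
print(grounding_accuracy([(120, 48), (300, 900)],
                         [BBox(100, 30, 180, 60), BBox(0, 0, 50, 50)]))  # 0.5
```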
AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding
Masry, Ahmed, Rodriguez, Juan A., Zhang, Tianyu, Wang, Suyuchen, Wang, Chao, Feizi, Aarash, Suresh, Akshay Kalkunte, Puri, Abhay, Jian, Xiangru, Noël, Pierre-André, Madhusudhan, Sathwik Tejaswi, Pedersoli, Marco, Liu, Bang, Chapados, Nicolas, Bengio, Yoshua, Hoque, Enamul, Pal, Christopher, Laradji, Issam H., Vazquez, David, Taslakian, Perouz, Gella, Spandana, Rajeswar, Sai
Aligning visual features with language embeddings is a key challenge in vision-language models (VLMs). The performance of such models hinges on having a good connector that maps visual features generated by a vision encoder to a shared embedding space with the LLM while preserving semantic similarity. Existing connectors, such as multilayer perceptrons (MLPs), often produce out-of-distribution or noisy inputs, leading to misalignment between the modalities. In this work, we propose a novel vision-text alignment method, AlignVLM, that maps visual features to a weighted average of LLM text embeddings. Our approach leverages the linguistic priors encoded by the LLM to ensure that visual features are mapped to regions of the space that the LLM can effectively interpret. AlignVLM is particularly effective for document understanding tasks, where scanned document images must be accurately mapped to their textual content. Our extensive experiments show that AlignVLM achieves state-of-the-art performance compared to prior alignment methods. We provide further analysis demonstrating improved vision-text feature alignment and robustness to noise.
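The connector idea described above can be sketched in a few lines of PyTorch. This is a hypothetical re-implementation for illustration (not the authors' released code): visual features are projected to a distribution over the LLM vocabulary, and the output is the corresponding convex combination of the frozen LLM text embeddings, so the mapped features stay within the region the LLM already interprets.

```python
import torch
import torch.nn as nn

class AlignConnectorSketch(nn.Module):
    """Hypothetical sketch of a connector that maps visual features to a
    weighted average of LLM text embeddings (illustration only)."""

    def __init__(self, vision_dim: int, llm_embeddings: torch.Tensor):
        super().__init__()
        vocab_size, _ = llm_embeddings.shape
        # Frozen LLM input-embedding table: the "linguistic prior".
        self.register_buffer("text_embeddings", llm_embeddings)
        # Learned projection from vision features to vocabulary logits.
        self.to_vocab_logits = nn.Linear(vision_dim, vocab_size)

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (batch, num_patches, vision_dim)
        logits = self.to_vocab_logits(visual_feats)   # (B, P, vocab)
        weights = torch.softmax(logits, dim=-1)       # convex weights over tokens
        # Convex combination of text embeddings keeps the result inside the
        # distribution of inputs the LLM was trained to interpret.
        return weights @ self.text_embeddings         # (B, P, llm_dim)

# Toy usage with stand-in sizes (a real LLM embedding table would be much larger).
llm_emb = torch.randn(1000, 256)
connector = AlignConnectorSketch(vision_dim=64, llm_embeddings=llm_emb)
print(connector(torch.randn(2, 196, 64)).shape)  # torch.Size([2, 196, 256])
```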
BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks
Rodriguez, Juan, Jian, Xiangru, Panigrahi, Siba Smarak, Zhang, Tianyu, Feizi, Aarash, Puri, Abhay, Kalkunte, Akshay, Savard, François, Masry, Ahmed, Nayak, Shravan, Awal, Rabiul, Massoud, Mahsa, Abaskohi, Amirhossein, Li, Zichao, Wang, Suyuchen, Noël, Pierre-André, Richter, Mats Leon, Vadacchino, Saverio, Agarwal, Shubham, Biswas, Sanket, Shanian, Sara, Zhang, Ying, Bolger, Noah, MacDonald, Kurt, Fauvel, Simon, Tejaswi, Sathwik, Sunkara, Srinivas, Monteiro, Joao, Dvijotham, Krishnamurthy DJ, Scholak, Torsten, Chapados, Nicolas, Kharagani, Sepideh, Hughes, Sean, Özsu, M. Tamer, Reddy, Siva, Pedersoli, Marco, Bengio, Yoshua, Pal, Christopher, Laradji, Issam, Gella, Spandana, Taslakian, Perouz, Vazquez, David, Rajeswar, Sai
Multimodal AI has the potential to significantly enhance document-understanding tasks, such as processing receipts, understanding workflows, extracting data from documents, and summarizing reports. Code generation tasks that require long-structured outputs can also be enhanced by multimodality. Despite this, the use of such models in commercial applications is often constrained by limited access to training data and restrictive licensing, which hinder open access. To address these limitations, we introduce BigDocs-7.5M, a high-quality, open-access dataset comprising 7.5 million multimodal documents across 30 tasks. We use an efficient data curation process to ensure our data is high-quality and license-permissive. Our process emphasizes accountability, responsibility, and transparency through filtering rules, traceable metadata, and careful content analysis. Additionally, we introduce BigDocs-Bench, a benchmark suite with 10 novel tasks where we create datasets that reflect real-world use cases involving reasoning over Graphical User Interfaces (GUIs) and code generation from images. Our experiments show that training with BigDocs-Bench improves average performance by up to 25.8% over closed-source GPT-4o in document reasoning and structured output tasks such as Screenshot2HTML or Image2Latex generation. Finally, human evaluations showed a preference for outputs from models trained on BigDocs over GPT-4o. This suggests that BigDocs can help both academics and the open-source community utilize and improve AI tools to enhance multimodal capabilities and document reasoning. The project is hosted at https://bigdocs.github.io.
IntentGPT: Few-shot Intent Discovery with Large Language Models
Rodriguez, Juan A., Botzer, Nicholas, Vazquez, David, Pal, Christopher, Pedersoli, Marco, Laradji, Issam
In today's digitally driven world, dialogue systems play a pivotal role in enhancing user interactions, from customer service to virtual assistants. In these dialogues, it is important to identify users' goals automatically to resolve their needs promptly. This has necessitated the integration of models that perform Intent Detection. However, users' intents are diverse and dynamic, making it challenging to maintain a fixed set of predefined intents. As a result, a more practical approach is to develop a model capable of identifying new intents as they emerge. We address the challenge of Intent Discovery, an area that has drawn significant attention in recent research efforts. Existing methods need to train on a substantial amount of data to correctly identify new intents, demanding significant human effort. To overcome this, we introduce IntentGPT, a novel training-free method that effectively prompts Large Language Models (LLMs) such as GPT-4 to discover new intents with minimal labeled data. IntentGPT comprises an In-Context Prompt Generator, which generates informative prompts for In-Context Learning; an Intent Predictor for classifying and discovering user intents from utterances; and a Semantic Few-Shot Sampler that selects relevant few-shot examples and a set of known intents to be injected into the prompt. Our experiments on popular benchmarks, including CLINC and BANKING, show that IntentGPT outperforms previous methods that require extensive domain-specific data and fine-tuning.
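A rough sketch of the prompting loop described above is shown below. The prompt wording, the `embed` and `call_llm` stubs, and the retrieval details are assumptions for illustration, not the paper's exact implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-9)

def build_intent_prompt(utterance, known_intents, fewshot_examples):
    """Assemble an in-context prompt from known intents and retrieved examples."""
    lines = [
        "You are an intent discovery system.",
        "Known intents: " + ", ".join(sorted(known_intents)),
        "If no known intent fits the utterance, propose a new intent name.",
        "",
    ]
    for ex_utterance, ex_intent in fewshot_examples:
        lines.append(f"Utterance: {ex_utterance}\nIntent: {ex_intent}")
    lines.append(f"Utterance: {utterance}\nIntent:")
    return "\n".join(lines)

def discover_intent(utterance, known_intents, labeled_pool, embed, call_llm, k=4):
    """One discovery step: retrieve similar examples, prompt the LLM, update intents.

    `embed` maps a string to a vector and `call_llm` maps a prompt to a completion;
    both are user-supplied stand-ins for whatever models are actually used.
    """
    query = embed(utterance)
    # Semantic few-shot sampling: the k labeled examples most similar to the query.
    nearest = sorted(labeled_pool, key=lambda ex: -cosine(query, embed(ex[0])))[:k]
    prediction = call_llm(build_intent_prompt(utterance, known_intents, nearest)).strip()
    known_intents.add(prediction)  # newly discovered intents join the known set
    return prediction
```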
InsightBench: Evaluating Business Analytics Agents Through Multi-Step Insight Generation
Sahu, Gaurav, Puri, Abhay, Rodriguez, Juan, Drouin, Alexandre, Taslakian, Perouz, Zantedeschi, Valentina, Lacoste, Alexandre, Vazquez, David, Chapados, Nicolas, Pal, Christopher, Mudumba, Sai Rajeswar, Laradji, Issam Hadj
Data analytics is essential for extracting valuable insights from data that can assist organizations in making effective decisions. We introduce InsightBench, a benchmark dataset with three key features. First, it consists of 31 datasets representing diverse business use cases such as finance and incident management, each accompanied by a carefully curated set of insights planted in the datasets. Second, unlike existing benchmarks focusing on answering single queries, InsightBench evaluates agents based on their ability to perform end-to-end data analytics, including formulating questions, interpreting answers, and generating a summary of insights and actionable steps. Third, we conduct comprehensive quality assurance to ensure that each dataset in the benchmark has clear goals and includes relevant and meaningful questions and analysis. Furthermore, we implement a two-way evaluation mechanism using LLaMA-3-Eval as an effective, open-source evaluator to assess agents' ability to extract insights. We also propose AgentPoirot, our baseline data analysis agent capable of performing end-to-end data analytics. Our evaluation on InsightBench shows that AgentPoirot outperforms existing approaches (such as Pandas Agent) that focus on resolving single queries. We also compare the performance of open- and closed-source LLMs and various evaluation strategies. Overall, this benchmark serves as a testbed to motivate further development in comprehensive data analytics and can be accessed here: https://github.com/ServiceNow/insight-bench.
RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content
Monteiro, Joao, Noel, Pierre-Andre, Marcotte, Etienne, Rajeswar, Sai, Zantedeschi, Valentina, Vazquez, David, Chapados, Nicolas, Pal, Christopher, Taslakian, Perouz
Large Language Models (LLMs) are trained on vast amounts of data, most of which is automatically scraped from the internet. This data includes encyclopedic documents that harbor a vast amount of general knowledge (e.g., Wikipedia) but also potentially overlap with benchmark datasets used for evaluating LLMs. Consequently, evaluating models on test splits that might have leaked into the training set is prone to misleading conclusions. To foster sound evaluation of language models, we introduce a new test dataset named RepLiQA, suited for question-answering and topic retrieval tasks. RepLiQA is a collection of five splits of test sets, four of which have not been released to the internet or exposed to LLM APIs prior to this publication. Each sample in RepLiQA comprises (1) a reference document crafted by a human annotator and depicting an imaginary scenario (e.g., a news article) absent from the internet; (2) a question about the document's topic; (3) a ground-truth answer derived directly from the information in the document; and (4) the paragraph extracted from the reference document containing the answer. As such, accurate answers can only be generated if a model can find relevant content within the provided document. We run a large-scale benchmark comprising several state-of-the-art LLMs to uncover differences in performance across models of various types and sizes in a context-conditional language modeling setting. Released splits of RepLiQA can be found here: https://huggingface.co/datasets/ServiceNow/repliqa.
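For reference, the released splits can be loaded with the Hugging Face `datasets` library using the repository id from the URL above. The split handling and field names in this sketch are assumptions for illustration; the dataset card documents the actual schema.

```python
from datasets import load_dataset

# Repository id taken from the URL in the abstract; split and field names below
# are assumptions -- inspect the dataset card for the real schema.
repliqa = load_dataset("ServiceNow/repliqa")
split = list(repliqa.keys())[0]
sample = repliqa[split][0]
print(sample.keys())  # inspect the actual field names

# Context-conditional QA: the reference document must be passed to the model,
# since its content was written by annotators and does not exist on the internet.
document = sample.get("document_extracted", "")   # hypothetical field name
question = sample.get("question", "")             # hypothetical field name
prompt = (
    "Answer the question using only the document below.\n\n"
    f"Document:\n{document}\n\n"
    f"Question: {question}\nAnswer:"
)
```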
WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?
Drouin, Alexandre, Gasse, Maxime, Caccia, Massimo, Laradji, Issam H., Del Verme, Manuel, Marty, Tom, Boisvert, Léo, Thakkar, Megh, Cappart, Quentin, Vazquez, David, Chapados, Nicolas, Lacoste, Alexandre
We study the use of large language model-based agents for interacting with software via web browsers. Unlike prior work, we focus on measuring the agents' ability to perform tasks that span the typical daily work of knowledge workers utilizing enterprise software systems. To this end, we propose WorkArena, a remote-hosted benchmark of 33 tasks based on the widely-used ServiceNow platform. We also introduce BrowserGym, an environment for the design and evaluation of such agents, offering a rich set of actions as well as multimodal observations. Our empirical evaluation reveals that while current agents show promise on WorkArena, there remains a considerable gap towards achieving full task automation. Notably, our analysis uncovers a significant performance disparity between open and closed-source LLMs, highlighting a critical area for future exploration and development in the field.
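The agent-environment setup can be pictured as a standard observation-action loop. The sketch below shows a generic gymnasium-style rollout under that assumption; it is illustrative only, and the real task ids, observation contents, and action space are defined by BrowserGym itself.

```python
import gymnasium as gym

def run_episode(env_id: str, agent, max_steps: int = 50) -> float:
    """Roll out one agent episode in a gymnasium-style browser environment.

    Hypothetical sketch of the interaction pattern only: `env_id` should be a
    task id registered by the environment library (see the BrowserGym docs for
    real WorkArena ids), and `agent` maps an observation to a browser action.
    """
    env = gym.make(env_id)
    obs, info = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        # Observations typically bundle multimodal signals (e.g. page structure,
        # a screenshot, chat messages); the agent turns them into an action.
        action = agent(obs)
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += float(reward)
        if terminated or truncated:
            break
    env.close()
    return total_reward
```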
GEO-Bench: Toward Foundation Models for Earth Monitoring
Lacoste, Alexandre, Lehmann, Nils, Rodriguez, Pau, Sherwin, Evan David, Kerner, Hannah, Lütjens, Björn, Irvin, Jeremy Andrew, Dao, David, Alemohammad, Hamed, Drouin, Alexandre, Gunturkun, Mehmet, Huang, Gabriel, Vazquez, David, Newman, Dava, Bengio, Yoshua, Ermon, Stefano, Zhu, Xiao Xiang
Recent progress in self-supervision has shown that pre-training large neural networks on vast amounts of unsupervised data can lead to substantial increases in generalization to downstream tasks. Such models, recently coined foundation models, have been transformational to the field of natural language processing. Variants have also been proposed for image data, but their applicability to remote sensing tasks is limited. To stimulate the development of foundation models for Earth monitoring, we propose a benchmark comprising six classification and six segmentation tasks, which were carefully curated and adapted to be both relevant to the field and well-suited for model evaluation. We accompany this benchmark with a robust methodology for evaluating models and reporting aggregated results to enable a reliable assessment of progress. Finally, we report results for 20 baselines to gain information about the performance of existing models. We believe that this benchmark will be a driver of progress across a variety of Earth monitoring tasks.
Capture the Flag: Uncovering Data Insights with Large Language Models
Laradji, Issam, Taslakian, Perouz, Rajeswar, Sai, Zantedeschi, Valentina, Lacoste, Alexandre, Chapados, Nicolas, Vazquez, David, Pal, Christopher, Drouin, Alexandre
The extraction of a small number of relevant insights from vast amounts of data is a crucial component of data-driven decision-making. However, accomplishing this task requires considerable technical skills, domain expertise, and human labor. This study explores the potential of using Large Language Models (LLMs) to automate the discovery of insights in data, leveraging recent advances in reasoning and code generation techniques. We propose a new evaluation methodology based on a "capture the flag" principle, measuring the ability of such models to recognize meaningful and pertinent information (flags) in a dataset. We further propose two proof-of-concept agents, with different inner workings, and compare their ability to capture such flags in a real-world sales dataset. While the work reported here is preliminary, our results are sufficiently interesting to mandate future exploration by the community.
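A minimal way to picture "capture the flag" scoring is flag recall: the fraction of planted insights that an agent's reported findings recover. The token-overlap matcher below is a crude stand-in for whatever judging procedure is actually used; it is illustrative only.

```python
def _tokens(text: str) -> set:
    return set(text.lower().split())

def flag_captured(flag: str, insight: str, threshold: float = 0.5) -> bool:
    """Crude token-overlap matcher; a stand-in for a proper (e.g. LLM-based) judge."""
    flag_toks = _tokens(flag)
    if not flag_toks:
        return False
    overlap = len(flag_toks & _tokens(insight)) / len(flag_toks)
    return overlap >= threshold

def flag_recall(flags, insights) -> float:
    """Fraction of planted flags recovered by at least one reported insight."""
    captured = sum(any(flag_captured(f, i) for i in insights) for f in flags)
    return captured / max(len(flags), 1)

# Toy example.
flags = ["sales dropped sharply in the APAC region in Q3"]
insights = ["Q3 shows a sharp drop in APAC sales", "revenue grew in EMEA"]
print(flag_recall(flags, insights))  # 1.0
```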