AITopics | Rajeswar, Sai

Collaborating Authors

Rajeswar, Sai


StarFlow: Generating Structured Workflow Outputs From Sketch Images

Bechard, Patrice, Wang, Chao, Abaskohi, Amirhossein, Rodriguez, Juan, Pal, Christopher, Vazquez, David, Gella, Spandana, Rajeswar, Sai, Taslakian, Perouz

arXiv.org Artificial Intelligence, Mar-27-2025

Workflows are a fundamental component of automation in enterprise platforms, enabling the orchestration of tasks, data processing, and system integrations. Despite being widely used, building workflows can be complex, often requiring manual configuration through low-code platforms or visual programming tools. To simplify this process, we explore the use of generative foundation models, particularly vision-language models (VLMs), to automatically generate structured workflows from visual inputs. Translating hand-drawn sketches or computer-generated diagrams into executable workflows is challenging due to the ambiguity of free-form drawings, variations in diagram styles, and the difficulty of inferring execution logic from visual elements. To address this, we introduce StarFlow, a framework for generating structured workflow outputs from sketches using vision-language models. We curate a diverse dataset of workflow diagrams -- including synthetic, manually annotated, and real-world samples -- to enable robust training and evaluation. We finetune and benchmark multiple vision-language models, conducting a series of ablation studies to analyze the strengths and limitations of our approach. Our results show that finetuning significantly enhances structured workflow generation, outperforming large vision-language models on this task.

large language model, machine learning, natural language, (15 more...)

arXiv:2503.21889

Country:

Europe > Switzerland (0.28)
Europe > Austria (0.28)

Genre:

Workflow (1.00)
Research Report > New Finding (0.86)

Industry: Information Technology > Software (0.48)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)


UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction

Nayak, Shravan, Jian, Xiangru, Lin, Kevin Qinghong, Rodriguez, Juan A., Kalsi, Montek, Awal, Rabiul, Chapados, Nicolas, Özsu, M. Tamer, Agrawal, Aishwarya, Vazquez, David, Pal, Christopher, Taslakian, Perouz, Gella, Spandana, Rajeswar, Sai

arXiv.org Artificial Intelligence, Mar-19-2025

Autonomous agents that navigate Graphical User Interfaces (GUIs) to automate tasks like document editing and file management can greatly enhance computer workflows. While existing research focuses on online settings, desktop environments, critical for many professional and everyday tasks, remain underexplored due to data collection challenges and licensing issues. We introduce UI-Vision, the first comprehensive, license-permissive benchmark for offline, fine-grained evaluation of computer use agents in real-world desktop environments. Unlike online benchmarks, UI-Vision provides: (i) dense, high-quality annotations of human demonstrations, including bounding boxes, UI labels, and action trajectories (clicks, drags, and keyboard inputs) across 83 software applications, and (ii) three fine-to-coarse grained tasks (Element Grounding, Layout Grounding, and Action Prediction) with well-defined metrics to rigorously evaluate agents' performance in desktop environments. Our evaluation reveals critical limitations in state-of-the-art models like UI-TARS-72B, including issues with understanding professional software, spatial reasoning, and complex actions like drag-and-drop. These findings highlight the challenges in developing fully autonomous computer use agents. By releasing UI-Vision as open-source, we aim to advance the development of more capable agents for real-world desktop tasks.

agent, platform, ui element, (15 more...)

arXiv:2503.15661

Country:

Asia (0.28)
North America > Canada > Quebec (0.14)

Industry: Information Technology (0.93)

Technology:

Information Technology > Software (1.00)
Information Technology > Graphics (1.00)
Information Technology > Communications (1.00)
(5 more...)


PairBench: A Systematic Framework for Selecting Reliable Judge VLMs

Feizi, Aarash, Rajeswar, Sai, Romero-Soriano, Adriana, Rabbany, Reihaneh, Gella, Spandana, Zantedeschi, Valentina, Monteiro, João

arXiv.org Artificial Intelligence, Feb-24-2025

As large vision language models (VLMs) are increasingly used as automated evaluators, understanding their ability to effectively compare data pairs as instructed in the prompt becomes essential. To address this, we present PairBench, a low-cost framework that systematically evaluates VLMs as customizable similarity tools across various modalities and scenarios. Through PairBench, we introduce four metrics that represent key desiderata of similarity scores: alignment with human annotations, consistency for data pairs irrespective of their order, smoothness of similarity distributions, and controllability through prompting. Our analysis demonstrates that no model, whether closed- or open-source, is superior on all metrics; the optimal choice depends on an auto evaluator's desired behavior (e.g., a smooth vs. a sharp judge), highlighting risks of widespread adoption of VLMs as evaluators without thorough assessment. For instance, the majority of VLMs struggle with maintaining symmetric similarity scores regardless of order. Additionally, our results show that the performance of VLMs on the metrics in PairBench closely correlates with popular benchmarks, showcasing its predictive power in ranking models.
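The order-consistency desideratum mentioned in the abstract can be illustrated with a small sketch. This is not the PairBench metric itself: `order_gap`, `symmetric_judge`, and `biased_judge` are hypothetical names, and the judges are toy text-overlap functions standing in for a VLM. The sketch only shows one plausible way to quantify how much a similarity score changes when the pair is presented in the opposite order.

```python
def order_gap(score, pairs):
    """Mean absolute difference between score(a, b) and score(b, a).

    A perfectly order-consistent judge yields 0.0 (hypothetical measure,
    not the definition used in the paper).
    """
    gaps = [abs(score(a, b) - score(b, a)) for a, b in pairs]
    return sum(gaps) / len(gaps)


def symmetric_judge(a, b):
    # Toy stand-in for a judge: Jaccard overlap of word sets, symmetric by construction.
    words_a, words_b = set(a.split()), set(b.split())
    return len(words_a & words_b) / len(words_a | words_b)


def biased_judge(a, b):
    # Same judge with an assumed position bias: a bonus when the first item is longer.
    return symmetric_judge(a, b) + (0.1 if len(a) > len(b) else 0.0)


pairs = [("a cat", "a cat on a mat"), ("red car", "blue bicycle")]
print(order_gap(symmetric_judge, pairs))
print(order_gap(biased_judge, pairs))
```

The symmetric judge has no gap, while the biased judge shows a gap equal to its position bonus on every pair whose items differ in length.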

large language model, machine learning, natural language, (18 more...)

arXiv:2502.15210

Country: Europe > Switzerland > Zürich > Zürich (0.14)

Genre: Research Report > New Finding (0.86)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding

Masry, Ahmed, Rodriguez, Juan A., Zhang, Tianyu, Wang, Suyuchen, Wang, Chao, Feizi, Aarash, Suresh, Akshay Kalkunte, Puri, Abhay, Jian, Xiangru, Noël, Pierre-André, Madhusudhan, Sathwik Tejaswi, Pedersoli, Marco, Liu, Bang, Chapados, Nicolas, Bengio, Yoshua, Hoque, Enamul, Pal, Christopher, Laradji, Issam H., Vazquez, David, Taslakian, Perouz, Gella, Spandana, Rajeswar, Sai

arXiv.org Artificial Intelligence, Feb-3-2025

Aligning visual features with language embeddings is a key challenge in vision-language models (VLMs). The performance of such models hinges on having a good connector that maps visual features generated by a vision encoder to a shared embedding space with the LLM while preserving semantic similarity. Existing connectors, such as multilayer perceptrons (MLPs), often produce out-of-distribution or noisy inputs, leading to misalignment between the modalities. In this work, we propose a novel vision-text alignment method, AlignVLM, that maps visual features to a weighted average of LLM text embeddings. Our approach leverages the linguistic priors encoded by the LLM to ensure that visual features are mapped to regions of the space that the LLM can effectively interpret. AlignVLM is particularly effective for document understanding tasks, where scanned document images must be accurately mapped to their textual content. Our extensive experiments show that AlignVLM achieves state-of-the-art performance compared to prior alignment methods. We provide further analysis demonstrating improved vision-text feature alignment and robustness to noise.
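The idea of mapping a visual feature to a weighted average of LLM text embeddings can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the abstract does not say how the weights are produced, so the linear projection followed by a softmax is an assumption, and `align_to_text_space`, `projection`, and the toy embedding table are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def align_to_text_space(visual_feature, projection, text_embeddings):
    """Map one visual feature to a convex combination of text embeddings.

    projection: one weight row per vocabulary entry (assumed learned layer).
    text_embeddings: one embedding vector per vocabulary entry.
    """
    # Score each vocabulary entry against the visual feature.
    logits = [sum(w * v for w, v in zip(row, visual_feature)) for row in projection]
    weights = softmax(logits)  # non-negative, sums to 1
    dim = len(text_embeddings[0])
    # Weighted average of the embedding rows.
    return [sum(p * emb[d] for p, emb in zip(weights, text_embeddings))
            for d in range(dim)]


# Toy vocabulary of three tokens with 2-dimensional embeddings.
text_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
projection = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
aligned = align_to_text_space([0.5, -0.2], projection, text_embeddings)
print(aligned)
```

Because the weights are non-negative and sum to one, the output always lies inside the convex hull of the text embeddings, which is one way to read the abstract's claim that visual features land in regions the LLM can interpret.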

large language model, machine learning, natural language, (17 more...)

arXiv:2502.01341

Country:

North America > United States (0.28)
North America > Canada > Quebec > Montreal (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.72)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Perceptrons (0.68)


BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks

Rodriguez, Juan, Jian, Xiangru, Panigrahi, Siba Smarak, Zhang, Tianyu, Feizi, Aarash, Puri, Abhay, Kalkunte, Akshay, Savard, François, Masry, Ahmed, Nayak, Shravan, Awal, Rabiul, Massoud, Mahsa, Abaskohi, Amirhossein, Li, Zichao, Wang, Suyuchen, Noël, Pierre-André, Richter, Mats Leon, Vadacchino, Saverio, Agarwal, Shubbam, Biswas, Sanket, Shanian, Sara, Zhang, Ying, Bolger, Noah, MacDonald, Kurt, Fauvel, Simon, Tejaswi, Sathwik, Sunkara, Srinivas, Monteiro, Joao, Dvijotham, Krishnamurthy DJ, Scholak, Torsten, Chapados, Nicolas, Kharagani, Sepideh, Hughes, Sean, Özsu, M., Reddy, Siva, Pedersoli, Marco, Bengio, Yoshua, Pal, Christopher, Laradji, Issam, Gella, Spandana, Taslakian, Perouz, Vazquez, David, Rajeswar, Sai

arXiv.org Artificial Intelligence, Dec-5-2024

Multimodal AI has the potential to significantly enhance document-understanding tasks, such as processing receipts, understanding workflows, extracting data from documents, and summarizing reports. Code generation tasks that require long-structured outputs can also be enhanced by multimodality. Despite this, their use in commercial applications is often limited due to limited access to training data and restrictive licensing, which hinders open access. To address these limitations, we introduce BigDocs-7.5M, a high-quality, open-access dataset comprising 7.5 million multimodal documents across 30 tasks. We use an efficient data curation process to ensure our data is high-quality and license-permissive. Our process emphasizes accountability, responsibility, and transparency through filtering rules, traceable metadata, and careful content analysis. Additionally, we introduce BigDocs-Bench, a benchmark suite with 10 novel tasks where we create datasets that reflect real-world use cases involving reasoning over Graphical User Interfaces (GUI) and code generation from images. Our experiments show that training with BigDocs-Bench improves average performance up to 25.8% over closed-source GPT-4o in document reasoning and structured output tasks such as Screenshot2HTML or Image2Latex generation. Finally, human evaluations showed a preference for outputs from models trained on BigDocs over GPT-4o. This suggests that BigDocs can help both academics and the open-source community utilize and improve AI tools to enhance multimodal capabilities and document reasoning. The project is hosted at https://bigdocs.github.io .

data mining, large language model, machine learning, (20 more...)

arXiv:2412.04626

Country:

North America > United States (0.92)
Europe > France (0.68)

Genre:

Workflow (1.00)
Overview (1.00)
Research Report > New Finding (0.67)

Industry:

Law (1.00)
Information Technology (1.00)
Government (1.00)
(2 more...)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


Multimodal foundation world models for generalist embodied agents

Mazzaglia, Pietro, Verbelen, Tim, Dhoedt, Bart, Courville, Aaron, Rajeswar, Sai

arXiv.org Artificial Intelligence, Jun-25-2024

Learning generalist embodied agents, able to solve multitudes of tasks in different domains, is a long-standing problem. Reinforcement learning (RL) is hard to scale up as it requires a complex reward design for each task. In contrast, language can specify tasks in a more natural way. Current foundation vision-language models (VLMs) generally require fine-tuning or other adaptations to be functional, due to the significant domain gap. However, the lack of multimodal data in such domains represents an obstacle toward developing foundation models for embodied applications. In this work, we overcome these problems by presenting multimodal foundation world models, able to connect and align the representation of foundation VLMs with the latent space of generative world models for RL, without any language annotations. The resulting agent learning framework, GenRL, allows one to specify tasks through vision and/or language prompts, ground them in the embodied domain's dynamics, and learn the corresponding behaviors in imagination. As assessed through large-scale multi-task benchmarking, GenRL exhibits strong multi-task generalization performance in several locomotion and manipulation domains. Furthermore, by introducing a data-free RL strategy, it lays the groundwork for foundation model-based RL for generalist embodied agents.

large language model, machine learning, natural language, (19 more...)

arXiv:2406.18043

Country: Europe (0.14)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.85)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)


RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content

Monteiro, Joao, Noel, Pierre-Andre, Marcotte, Etienne, Rajeswar, Sai, Zantedeschi, Valentina, Vazquez, David, Chapados, Nicolas, Pal, Christopher, Taslakian, Perouz

arXiv.org Artificial Intelligence, Jun-17-2024

Large Language Models (LLMs) are trained on vast amounts of data, most of which is automatically scraped from the internet. This data includes encyclopedic documents that harbor a vast amount of general knowledge (e.g., Wikipedia) but also potentially overlap with benchmark datasets used for evaluating LLMs. Consequently, evaluating models on test splits that might have leaked into the training set is prone to misleading conclusions. To foster sound evaluation of language models, we introduce a new test dataset named RepLiQA, suited for question-answering and topic retrieval tasks. RepLiQA is a collection of five splits of test sets, four of which have not been released to the internet or exposed to LLM APIs prior to this publication. Each sample in RepLiQA comprises (1) a reference document crafted by a human annotator and depicting an imaginary scenario (e.g., a news article) absent from the internet; (2) a question about the document's topic; (3) a ground-truth answer derived directly from the information in the document; and (4) the paragraph extracted from the reference document containing the answer. As such, accurate answers can only be generated if a model can find relevant content within the provided document. We run a large-scale benchmark comprising several state-of-the-art LLMs to uncover differences in performance across models of various types and sizes in a context-conditional language modeling setting. Released splits of RepLiQA can be found here: https://huggingface.co/datasets/ServiceNow/repliqa.

large language model, machine learning, natural language, (17 more...)

arXiv:2406.11811

Country: North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)

Genre: Research Report (1.00)

Industry:

Education (1.00)
Government (0.93)
Information Technology > Security & Privacy (0.68)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


Capture the Flag: Uncovering Data Insights with Large Language Models

Laradji, Issam, Taslakian, Perouz, Rajeswar, Sai, Zantedeschi, Valentina, Lacoste, Alexandre, Chapados, Nicolas, Vazquez, David, Pal, Christopher, Drouin, Alexandre

arXiv.org Machine Learning, Dec-21-2023

The extraction of a small number of relevant insights from vast amounts of data is a crucial component of data-driven decision-making. However, accomplishing this task requires considerable technical skills, domain expertise, and human labor. This study explores the potential of using Large Language Models (LLMs) to automate the discovery of insights in data, leveraging recent advances in reasoning and code generation techniques. We propose a new evaluation methodology based on a "capture the flag" principle, measuring the ability of such models to recognize meaningful and pertinent information (flags) in a dataset. We further propose two proof-of-concept agents, with different inner workings, and compare their ability to capture such flags in a real-world sales dataset. While the work reported here is preliminary, our results are sufficiently interesting to mandate future exploration by the community.

artificial intelligence, machine learning, natural language, (17 more...)

arXiv:2312.13876

Country:

North America > United States > Alaska (0.14)
North America > United States > California (0.14)
North America > United States > Illinois (0.14)

Genre: Research Report (1.00)

Industry:

Retail (0.97)
Information Technology (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


Equivariant Adaptation of Large Pretrained Models

Mondal, Arnab Kumar, Panigrahi, Siba Smarak, Kaba, Sékou-Oumar, Rajeswar, Sai, Ravanbakhsh, Siamak

arXiv.org Artificial Intelligence, Oct-29-2023

Equivariant networks are specifically designed to ensure consistent behavior with respect to a set of input transformations, leading to higher sample efficiency and more accurate and robust predictions. However, redesigning each component of prevalent deep neural network architectures to achieve chosen equivariance is a difficult problem and can result in a computationally expensive network during both training and inference. A recently proposed alternative towards equivariance that removes the architectural constraints is to use a simple canonicalization network that transforms the input to a canonical form before feeding it to an unconstrained prediction network. We show here that this approach can effectively be used to make a large pretrained network equivariant. However, we observe that the produced canonical orientations can be misaligned with those of the training distribution, hindering performance. Using dataset-dependent priors to inform the canonicalization function, we are able to make large pretrained models equivariant while maintaining their performance. This significantly improves the robustness of these models to deterministic transformations of the data, such as rotations. We believe this equivariant adaptation of large pretrained models can help their domain-specific applications with known symmetry priors.
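The canonicalization idea described in the abstract can be shown on a toy symmetry group. The sketch below is an assumption-laden illustration, not the paper's method: the paper uses a learned canonicalization network on large pretrained models, whereas here the group is cyclic shifts of a list, the canonical form is picked by a fixed rule (the lexicographically smallest rotation), and `canonicalize`, `predictor`, and `invariant_predict` are hypothetical names. It covers the invariant case, where the output does not need to transform with the input.

```python
def rotations(x):
    """All cyclic shifts of a list (the orbit of x under the cyclic group)."""
    return [x[i:] + x[:i] for i in range(len(x))]


def canonicalize(x):
    # Fixed rule standing in for a learned canonicalization network:
    # pick the lexicographically smallest rotation as the canonical form.
    return min(rotations(x))


def predictor(x):
    # Unconstrained predictor: a position-weighted sum, which is NOT
    # shift-invariant on its own.
    return sum(i * v for i, v in enumerate(x))


def invariant_predict(x):
    # Canonicalize first, then apply the unconstrained predictor.
    return predictor(canonicalize(x))


x = [3, 1, 2]
print({predictor(r) for r in rotations(x)})          # varies across rotations
print({invariant_predict(r) for r in rotations(x)})  # a single value
```

Every rotation of the input reaches the same canonical form, so the wrapped predictor gives one answer for the whole orbit without any change to the predictor itself.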

artificial intelligence, canonicalization function, machine learning, (17 more...)

arXiv:2310.01647

Country:

North America > Canada > Quebec (0.14)
North America > United States > New York (0.14)
Europe > Switzerland > Zürich > Zürich (0.14)

Genre: Research Report (1.00)

Industry: Health & Medicine (0.68)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


Efficient Dynamics Modeling in Interactive Environments with Koopman Theory

Mondal, Arnab Kumar, Panigrahi, Siba Smarak, Rajeswar, Sai, Siddiqi, Kaleem, Ravanbakhsh, Siamak

arXiv.org Artificial Intelligence, Aug-26-2023

The accurate modeling of dynamics in interactive environments is critical for successful long-range prediction. Such a capability could advance Reinforcement Learning (RL) and Planning algorithms, but achieving it is challenging. Inaccuracies in model estimates can compound, resulting in increased errors over long horizons. We approach this problem from the lens of Koopman theory, where the nonlinear dynamics of the environment can be linearized in a high-dimensional latent space. This allows us to efficiently parallelize the sequential problem of long-range prediction using convolution while accounting for the agent's action at every time step. Our approach also enables stability analysis and better control over gradients through time. Taken together, these advantages result in significant improvement over the existing approaches, both in the efficiency and the accuracy of modeling dynamics over extended horizons. We also show that this model can be easily incorporated into dynamics modeling for model-based planning and model-free RL and report promising experimental results.
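Why linear latent dynamics allow long-range prediction to be written as a convolution can be seen in a scalar sketch. This is an illustration under stated assumptions, not the paper's model: the latent state is one-dimensional, the dynamics are z[t+1] = a * z[t] + b * u[t] with hypothetical constants `a` and `b`, and the function names are invented here. Unrolling the recursion gives z[t] = a**t * z0 + sum over k < t of a**(t-1-k) * b * u[k], which is a convolution of the action sequence with the kernel a**j * b.

```python
def rollout_sequential(a, b, z0, actions):
    """Step-by-step rollout of z[t+1] = a * z[t] + b * u[t]."""
    z, states = z0, []
    for u in actions:
        z = a * z + b * u
        states.append(z)
    return states


def rollout_convolution(a, b, z0, actions):
    """Same states from the unrolled form; each step is computed independently."""
    kernel = [a ** j * b for j in range(len(actions))]  # kernel[j] = a^j * b
    states = []
    for t in range(1, len(actions) + 1):
        free = a ** t * z0  # contribution of the initial state
        forced = sum(kernel[t - 1 - k] * actions[k] for k in range(t))  # action convolution
        states.append(free + forced)
    return states


actions = [1.0, 0.0, -0.5, 2.0]
seq = rollout_sequential(0.9, 0.5, 1.0, actions)
conv = rollout_convolution(0.9, 0.5, 1.0, actions)
print(seq)
print(conv)
```

In the convolution form no step depends on the previous step's result, which is what makes the computation parallelizable across time steps. With a matrix-valued latent transition the same unrolling holds with matrix powers in place of `a ** j`.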

dynamic model, machine learning, reinforcement learning, (16 more...)

arXiv:2306.11941

Country:

North America > Canada > Quebec (0.14)
North America > Canada > Ontario > Toronto (0.14)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
