AITopics | Jian, Xiangru

Collaborating Authors

Jian, Xiangru

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction

Nayak, Shravan, Jian, Xiangru, Lin, Kevin Qinghong, Rodriguez, Juan A., Kalsi, Montek, Awal, Rabiul, Chapados, Nicolas, Özsu, M. Tamer, Agrawal, Aishwarya, Vazquez, David, Pal, Christopher, Taslakian, Perouz, Gella, Spandana, Rajeswar, Sai

arXiv.org Artificial IntelligenceMar-19-2025

Autonomous agents that navigate Graphical User Interfaces (GUIs) to automate tasks like document editing and file management can greatly enhance computer workflows. While existing research focuses on online settings, desktop environments, critical for many professional and everyday tasks, remain underexplored due to data collection challenges and licensing issues. We introduce UI-Vision, the first comprehensive, license-permissive benchmark for offline, fine-grained evaluation of computer use agents in real-world desktop environments. Unlike online benchmarks, UI-Vision provides: (i) dense, high-quality annotations of human demonstrations, including bounding boxes, UI labels, and action trajectories (clicks, drags, and keyboard inputs) across 83 software applications, and (ii) three fine-to-coarse grained tasks-Element Grounding, Layout Grounding, and Action Prediction-with well-defined metrics to rigorously evaluate agents' performance in desktop environments. Our evaluation reveals critical limitations in state-of-the-art models like UI-TARS-72B, including issues with understanding professional software, spatial reasoning, and complex actions like drag-and-drop. These findings highlight the challenges in developing fully autonomous computer use agents. By releasing UI-Vision as open-source, we aim to advance the development of more capable agents for real-world desktop tasks.

agent, platform, ui element, (15 more...)

arXiv.org Artificial Intelligence

2503.15661

Country:

Asia (0.28)
North America > Canada > Quebec (0.14)

Industry: Information Technology (0.93)

Technology:

Information Technology > Software (1.00)
Information Technology > Graphics (1.00)
Information Technology > Communications (1.00)
(5 more...)

Add feedback

AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding

Masry, Ahmed, Rodriguez, Juan A., Zhang, Tianyu, Wang, Suyuchen, Wang, Chao, Feizi, Aarash, Suresh, Akshay Kalkunte, Puri, Abhay, Jian, Xiangru, Noël, Pierre-André, Madhusudhan, Sathwik Tejaswi, Pedersoli, Marco, Liu, Bang, Chapados, Nicolas, Bengio, Yoshua, Hoque, Enamul, Pal, Christopher, Laradji, Issam H., Vazquez, David, Taslakian, Perouz, Gella, Spandana, Rajeswar, Sai

arXiv.org Artificial IntelligenceFeb-3-2025

Aligning visual features with language embeddings is a key challenge in vision-language models (VLMs). The performance of such models hinges on having a good connector that maps visual features generated by a vision encoder to a shared embedding space with the LLM while preserving semantic similarity. Existing connectors, such as multilayer perceptrons (MLPs), often produce out-of-distribution or noisy inputs, leading to misalignment between the modalities. In this work, we propose a novel vision-text alignment method, AlignVLM, that maps visual features to a weighted average of LLM text embeddings. Our approach leverages the linguistic priors encoded by the LLM to ensure that visual features are mapped to regions of the space that the LLM can effectively interpret. AlignVLM is particularly effective for document understanding tasks, where scanned document images must be accurately mapped to their textual content. Our extensive experiments show that AlignVLM achieves state-of-the-art performance compared to prior alignment methods. We provide further analysis demonstrating improved vision-text feature alignment and robustness to noise.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2502.01341

Country:

North America > United States (0.28)
North America > Canada > Quebec > Montreal (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.72)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Perceptrons (0.68)

Add feedback

BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks

Rodriguez, Juan, Jian, Xiangru, Panigrahi, Siba Smarak, Zhang, Tianyu, Feizi, Aarash, Puri, Abhay, Kalkunte, Akshay, Savard, François, Masry, Ahmed, Nayak, Shravan, Awal, Rabiul, Massoud, Mahsa, Abaskohi, Amirhossein, Li, Zichao, Wang, Suyuchen, Noël, Pierre-André, Richter, Mats Leon, Vadacchino, Saverio, Agarwal, Shubbam, Biswas, Sanket, Shanian, Sara, Zhang, Ying, Bolger, Noah, MacDonald, Kurt, Fauvel, Simon, Tejaswi, Sathwik, Sunkara, Srinivas, Monteiro, Joao, Dvijotham, Krishnamurthy DJ, Scholak, Torsten, Chapados, Nicolas, Kharagani, Sepideh, Hughes, Sean, Özsu, M., Reddy, Siva, Pedersoli, Marco, Bengio, Yoshua, Pal, Christopher, Laradji, Issam, Gella, Spandanna, Taslakian, Perouz, Vazquez, David, Rajeswar, Sai

arXiv.org Artificial IntelligenceDec-5-2024

Multimodal AI has the potential to significantly enhance document-understanding tasks, such as processing receipts, understanding workflows, extracting data from documents, and summarizing reports. Code generation tasks that require long-structured outputs can also be enhanced by multimodality. Despite this, their use in commercial applications is often limited due to limited access to training data and restrictive licensing, which hinders open access. To address these limitations, we introduce BigDocs-7.5M, a high-quality, open-access dataset comprising 7.5 million multimodal documents across 30 tasks. We use an efficient data curation process to ensure our data is high-quality and license-permissive. Our process emphasizes accountability, responsibility, and transparency through filtering rules, traceable metadata, and careful content analysis. Additionally, we introduce BigDocs-Bench, a benchmark suite with 10 novel tasks where we create datasets that reflect real-world use cases involving reasoning over Graphical User Interfaces (GUI) and code generation from images. Our experiments show that training with BigDocs-Bench improves average performance up to 25.8% over closed-source GPT-4o in document reasoning and structured output tasks such as Screenshot2HTML or Image2Latex generation. Finally, human evaluations showed a preference for outputs from models trained on BigDocs over GPT-4o. This suggests that BigDocs can help both academics and the open-source community utilize and improve AI tools to enhance multimodal capabilities and document reasoning. The project is hosted at https://bigdocs.github.io .

data mining, large language model, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2412.04626

Country:

Europe > France (0.68)
North America > Canada > Quebec > Montreal (0.14)
North America > United States > Texas (0.14)

Genre:

Workflow (1.00)
Overview (1.00)
Research Report > New Finding (0.67)

Industry:

Law (1.00)
Information Technology (1.00)
Government (1.00)
(2 more...)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Enhancing Graph Self-Supervised Learning with Graph Interplay

Zhao, Xinjian, Pang, Wei, Jian, Xiangru, Xu, Yaoyao, Ying, Chaolong, Yu, Tianshu

arXiv.org Machine LearningOct-8-2024

Graph self-supervised learning (GSSL) has emerged as a compelling framework for extracting informative representations from graph-structured data without extensive reliance on labeled inputs. In this study, we introduce Graph Interplay (GIP), an innovative and versatile approach that significantly enhances the performance equipped with various existing GSSL methods. To this end, GIP advocates direct graph-level communications by introducing random inter-graph edges within standard batches. Against GIP's simplicity, we further theoretically show that \textsc{GIP} essentially performs a principled manifold separation via combining inter-graph message passing and GSSL, bringing about more structured embedding manifolds and thus benefits a series of downstream tasks. Our empirical study demonstrates that GIP surpasses the performance of prevailing GSSL methods across multiple benchmarks by significant margins, highlighting its potential as a breakthrough approach. Besides, GIP can be readily integrated into a series of GSSL methods and consistently offers additional performance gain. This advancement not only amplifies the capability of GSSL but also potentially sets the stage for a novel graph learning paradigm in a broader sense.

artificial intelligence, graph, machine learning, (14 more...)

arXiv.org Machine Learning

2410.04061

Country: Asia > China (0.14)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.72)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.67)

Add feedback

Do spectral cues matter in contrast-based graph self-supervised learning?

Jian, Xiangru, Zhao, Xinjian, Pang, Wei, Ying, Chaolong, Wang, Yimu, Xu, Yaoyao, Yu, Tianshu

arXiv.org Artificial IntelligenceMay-29-2024

The recent surge in contrast-based graph self-supervised learning has prominently featured an intensified exploration of spectral cues. However, an intriguing paradox emerges, as methods grounded in seemingly conflicting assumptions or heuristic approaches regarding the spectral domain demonstrate notable enhancements in learning performance. This paradox prompts a critical inquiry into the genuine contribution of spectral information to contrast-based graph self-supervised learning. This study undertakes an extensive investigation into this inquiry, conducting a thorough study of the relationship between spectral characteristics and the learning outcomes of contemporary methodologies. Based on this analysis, we claim that the effectiveness and significance of spectral information need to be questioned. Instead, we revisit simple edge perturbation: random edge dropping designed for node-level self-supervised learning and random edge adding intended for graph-level self-supervised learning. Compelling evidence is presented that these simple yet effective strategies consistently yield superior performance while demanding significantly fewer computational resources compared to all prior spectral augmentation methods. The proposed insights represent a significant leap forward in the field, potentially reshaping the understanding and implementation of graph self-supervised learning.

artificial intelligence, graph, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2405.196

Country: Asia > China (0.14)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.67)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (1.00)

Add feedback

HaVTR: Improving Video-Text Retrieval Through Augmentation Using Large Foundation Models

Wang, Yimu, Yuan, Shuai, Jian, Xiangru, Pang, Wei, Wang, Mushi, Yu, Ning

arXiv.org Artificial IntelligenceApr-7-2024

While recent progress in video-text retrieval has been driven by the exploration of powerful model architectures and training strategies, the representation learning ability of video-text retrieval models is still limited due to low-quality and scarce training data annotations. To address this issue, we present a novel video-text learning paradigm, HaVTR, which augments video and text data to learn more generalized features. Specifically, we first adopt a simple augmentation method, which generates self-similar data by randomly duplicating or dropping subwords and frames. In addition, inspired by the recent advancement in visual and language generative models, we propose a more powerful augmentation method through textual paraphrasing and video stylization using large language models (LLMs) and visual generative models (VGMs). Further, to bring richer information into video and text, we propose a hallucination-based augmentation method, where we use LLMs and VGMs to generate and add new relevant information to the original data. Benefiting from the enriched data, extensive experiments on several video-text retrieval benchmarks demonstrate the superiority of HaVTR over existing methods.

computer vision, large language model, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2404.05083

Country:

Europe (0.93)
North America > United States > New York (0.14)
Asia > Middle East > Israel (0.14)

Genre: Research Report (1.00)

Industry:

Leisure & Entertainment (1.00)
Media > Film (0.92)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback