AITopics | Ma, Lin


Ma, Lin

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

RoboCAS: A Benchmark for Robotic Manipulation in Complex Object Arrangement Scenarios

Zheng, Liming, Yan, Feng, Liu, Fanfan, Feng, Chengjian, Kang, Zhuoliang, Ma, Lin

arXiv.org Artificial Intelligence, Jul-9-2024

Foundation models hold significant potential for enabling robots to perform long-horizon general manipulation tasks. However, the simplicity of tasks and the uniformity of environments in existing benchmarks restrict their effective deployment in complex scenarios. To address this limitation, this paper introduces the RoboCAS benchmark, the first benchmark specifically designed for complex object arrangement scenarios in robotic manipulation. This benchmark employs flexible and concise scripted policies to efficiently collect a diverse array of demonstrations, showcasing scattered, orderly, and stacked object arrangements within a highly realistic physical simulation environment. It includes complex processes such as target retrieval, obstacle clearance, and robot manipulation, testing agents' abilities to perform long-horizon planning for spatial reasoning and predicting chain reactions under ambiguous instructions. Extensive experiments on multiple baseline models reveal their limitations in managing complex object arrangement scenarios, underscoring the urgent need for intelligent agents capable of performing long-horizon operations in practical deployments and providing valuable insights for future research directions. Project website: https://github.com/notFoundThisPerson/RoboCAS-v0

artificial intelligence, arxiv preprint arxiv, dataset, (14 more...)

arXiv.org Artificial Intelligence

2407.06951

Country: Asia > China (0.29)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Robots (1.00)


Corki: Enabling Real-time Embodied AI Robots via Algorithm-Architecture Co-Design

Huang, Yiyang, Hao, Yuhui, Yu, Bo, Yan, Feng, Yang, Yuxin, Min, Feng, Han, Yinhe, Ma, Lin, Liu, Shaoshan, Liu, Qiang, Gan, Yiming

arXiv.org Artificial Intelligence, Jul-5-2024

Embodied AI robots have the potential to fundamentally improve the way human beings live and manufacture. Continued progress in the burgeoning field of using large language models to control robots depends critically on an efficient computing substrate. In particular, today's computing systems for embodied AI robots are designed purely based on the interest of algorithm developers, where robot actions are divided into a discrete frame-basis. Such an execution pipeline creates high latency and energy consumption. This paper proposes Corki, an algorithm-architecture co-design framework for real-time embodied AI robot control. Our idea is to decouple LLM inference, robotic control and data communication in the embodied AI robots compute pipeline. Instead of predicting action for one single frame, Corki predicts the trajectory for the near future to reduce the frequency of LLM inference. The algorithm is coupled with a hardware that accelerates transforming trajectory into actual torque signals used to control robots and an execution pipeline that parallels data communication with computation. Corki largely reduces LLM inference frequency by up to 8.0x, resulting in up to 3.6x speed up. The success rate improvement can be up to 17.3%. Code is provided for re-implementation. https://github.com/hyy0613/Corki

large language model, machine learning, trajectory, (21 more...)

arXiv.org Artificial Intelligence

2407.04292

Country:

Asia > China (0.29)
North America > United States > Massachusetts (0.14)

Genre: Research Report (1.00)

Industry: Energy (0.67)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


Fewer Tokens and Fewer Videos: Extending Video Understanding Abilities in Large Vision-Language Models

Chen, Shimin, Yuan, Yitian, Chen, Shaoxiang, Jie, Zequn, Ma, Lin

arXiv.org Artificial Intelligence, Jun-12-2024

Amidst the advancements in image-based Large Vision-Language Models (image-LVLM), the transition to video-based models (video-LVLM) is hindered by the limited availability of quality video data. This paper addresses the challenge by leveraging the visual commonalities between images and videos to efficiently evolve image-LVLMs into video-LVLMs. We present a cost-effective video-LVLM that enhances model architecture, introduces innovative training strategies, and identifies the most effective types of video instruction data. Our innovative weighted token sampler significantly compresses the visual token numbers of each video frame, effectively cutting computational expenses. We also find that judiciously using just 10% of the video data, compared to prior video-LVLMs, yields impressive results during various training phases. Moreover, we delve into the influence of video instruction data in limited-resource settings, highlighting the significance of incorporating video training data that emphasizes temporal understanding to enhance model performance. The resulting Fewer Tokens and Fewer Videos LVLM (FTFV-LVLM) exhibits exceptional performance across video and image benchmarks, validating our model's design and training approaches.

large language model, machine learning, video data, (16 more...)

arXiv.org Artificial Intelligence

2406.08024

Genre: Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.81)
Information Technology > Artificial Intelligence > Vision > Video Understanding (0.51)


Methodology and Real-World Applications of Dynamic Uncertain Causality Graph for Clinical Diagnosis with Explainability and Invariance

Zhang, Zhan, Zhang, Qin, Jiao, Yang, Lu, Lin, Ma, Lin, Liu, Aihua, Liu, Xiao, Zhao, Juan, Xue, Yajun, Wei, Bing, Zhang, Mingxia, Gao, Ru, Zhao, Hong, Lu, Jie, Li, Fan, Zhang, Yang, Wang, Yiming, Zhang, Lei, Tian, Fengwei, Hu, Jie, Gou, Xin

arXiv.org Artificial Intelligence, Jun-9-2024

AI-aided clinical diagnosis is desired in medical care. Existing deep learning models lack explainability and mainly focus on image analysis. The recently developed Dynamic Uncertain Causality Graph (DUCG) approach is causality-driven, explainable, and invariant across different application scenarios, without problems of data collection, labeling, fitting, privacy, bias, generalization, high cost and high energy consumption. Through close collaboration between clinical experts and DUCG technicians, 46 DUCG models covering 54 chief complaints were constructed. Over 1,000 diseases can be diagnosed without triage. Before being applied in real-world, the 46 DUCG models were retrospectively verified by third-party hospitals. The verified diagnostic precisions were no less than 95%, in which the diagnostic precision for every disease including uncommon ones was no less than 80%. After verifications, the 46 DUCG models were applied in the real-world in China. Over one million real diagnosis cases have been performed, with only 17 incorrect diagnoses identified. Due to DUCG's transparency, the mistakes causing the incorrect diagnoses were found and corrected. The diagnostic abilities of the clinicians who applied DUCG frequently were improved significantly. Following the introduction to the earlier presented DUCG methodology, the recommendation algorithm for potential medical checks is presented and the key idea of DUCG is extracted.

artificial intelligence, ducg, machine learning, (17 more...)

arXiv.org Artificial Intelligence

doi: 10.1007/s10462-024-10763-w

2406.05746

Country: Asia > China (0.93)

Genre: Research Report (1.00)

Industry:

Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)
Energy (1.00)
Health & Medicine > Diagnostic Medicine (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)


Tables as Texts or Images: Evaluating the Table Reasoning Ability of LLMs and MLLMs

Deng, Naihao, Sun, Zhenjie, He, Ruiqi, Sikka, Aman, Chen, Yulong, Ma, Lin, Zhang, Yue, Mihalcea, Rada

arXiv.org Artificial Intelligence, Jun-5-2024

Specifically, we investigate several research questions, including the effectiveness of image-based representation of tabular data and how different text-based or image-based prompt methods affect LLMs' performance on table-related tasks. In addition, we provide analysis and hypothesis of LLMs' behaviors. Our findings include: LLMs maintain decent performance when we use image-based table representations. Sometimes, image-based table representations can make LLMs perform better.

Recent years have witnessed an explosion of Large Language Models (LLMs), with impressive performance on various Natural Language Processing (NLP) tasks (Brown et al., 2020; Touvron et al., 2023; Team et al., 2023). Research to date has examined the performance of LLMs for various aspects and abilities (Bang et al., 2023b; Bubeck et al., 2023; Akter et al., 2023), but their effectiveness on structured data such as tables is less explored. Unlike unstructured text, tables are systematically organized structures of a large amount of information. This characteristic makes tabular …

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2402.12424

Country:

North America > United States (1.00)
Europe (0.67)
Asia > Middle East > UAE (0.14)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (0.66)

Industry: Consumer Products & Services (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


Integer Scale: A Free Lunch for Faster Fine-grained Quantization of LLMs

Li, Qingyuan, Meng, Ran, Li, Yiduo, Zhang, Bo, Lu, Yifan, Sun, Yerui, Ma, Lin, Xie, Yuchen

arXiv.org Artificial Intelligence, May-28-2024

We introduce Integer Scale, a novel post-training quantization scheme for large language models that effectively resolves the inference bottleneck in current fine-grained quantization approaches while maintaining similar accuracies. Integer Scale is a free lunch as it requires no extra calibration or fine-tuning which will otherwise incur additional costs. It can be used plug-and-play for most fine-grained quantization methods. Its integration results in at most 1.85x end-to-end speed boost over the original counterpart with comparable accuracy. Additionally, due to the orchestration of the proposed Integer Scale and fine-grained quantization, we resolved the quantization difficulty for Mixtral-8x7B and LLaMA-3 models with negligible performance degradation, and it comes with an end-to-end speed boost of 2.13x, and 2.31x compared with their FP16 versions respectively.

large language model, machine learning, quantization, (17 more...)

arXiv.org Artificial Intelligence

2405.14597

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)


Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts

Li, Yunxin, Jiang, Shenyuan, Hu, Baotian, Wang, Longyue, Zhong, Wanqi, Luo, Wenhan, Ma, Lin, Zhang, Min

arXiv.org Artificial Intelligence, May-18-2024

Recent advancements in Multimodal Large Language Models (MLLMs) underscore the significance of scalable models and data to boost performance, yet this often incurs substantial computational costs. Although the Mixture of Experts (MoE) architecture has been employed to efficiently scale large language and image-text models, these efforts typically involve fewer experts and limited modalities. To address this, our work presents the pioneering attempt to develop a unified MLLM with the MoE architecture, named Uni-MoE that can handle a wide array of modalities. Specifically, it features modality-specific encoders with connectors for a unified multimodal representation. We also implement a sparse MoE architecture within the LLMs to enable efficient training and inference through modality-level data parallelism and expert-level model parallelism. To enhance the multi-expert collaboration and generalization, we present a progressive training strategy: 1) Cross-modality alignment using various connectors with different cross-modality data, 2) Training modality-specific experts with cross-modality instruction data to activate experts' preferences, and 3) Tuning the Uni-MoE framework utilizing Low-Rank Adaptation (LoRA) on mixed multimodal instruction data. We evaluate the instruction-tuned Uni-MoE on a comprehensive set of multimodal datasets. The extensive experimental results demonstrate Uni-MoE's principal advantage of significantly reducing performance bias in handling mixed multimodal datasets, alongside improved multi-expert collaboration and generalization.

Additionally, there is a growing trend [6], [7], [8], [9] toward building a unified MLLM that could comprehend more modalities such as video, audio, and speech, moving beyond the traditional image-text paradigm. To catch up with superior closed-source MLLMs like GPT-4V [10] and Gemini [11], the main efforts of open-source community contain enlarging model sizes [12], as seen with the expansion of vision foundation models to 6 billion parameters [12] and the integration with 70B Large Language models (LLMs) [13], [14], and enhancing instruction tuning with diverse multimodal datasets [3], [15], [16]. These developments underscore the increasing ability of MLLMs to process and reason across multiple modalities, showing the importance of both model scalability and the expansion of multimodal instructional data.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2405.11273

Country:

Asia > China (0.69)
North America > United States > Hawaii (0.14)

Genre: Research Report > New Finding (1.00)

Industry:

Education > Educational Setting > K-12 Education (0.46)
Education > Educational Setting > Higher Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)


Aux-NAS: Exploiting Auxiliary Labels with Negligibly Extra Inference Cost

Gao, Yuan, Zhang, Weizhong, Luo, Wenhan, Ma, Lin, Yu, Jin-Gang, Xia, Gui-Song, Ma, Jiayi

arXiv.org Machine Learning, May-9-2024

We aim at exploiting additional auxiliary labels from an independent (auxiliary) task to boost the primary task performance which we focus on, while preserving a single task inference cost of the primary task. While most existing auxiliary learning methods are optimization-based relying on loss weights/gradients manipulation, our method is architecture-based with a flexible asymmetric structure for the primary and auxiliary tasks, which produces different networks for training and inference. Specifically, starting from two single task networks/branches (each representing a task), we propose a novel method with evolving networks where only primary-to-auxiliary links exist as the cross-task connections after convergence. These connections can be removed during the primary task inference, resulting in a single-task inference cost. We achieve this by formulating a Neural Architecture Search (NAS) problem, where we initialize bi-directional connections in the search space and guide the NAS optimization converging to an architecture with only the single-side primary-to-auxiliary connections. Moreover, our method can be incorporated with optimization-based auxiliary learning approaches. In this paper, we tackle the practical issue of auxiliary learning, which involves improving the performance of a specific task (i.e., the primary task) while incorporating additional auxiliary labels from different tasks (i.e., the auxiliary tasks). We aim to efficiently leverage these auxiliary labels to enhance the primary task's performance while maintaining a comparable computational and parameter cost to a single-task network when evaluating the primary task.

artificial intelligence, auxiliary task, machine learning, (17 more...)

arXiv.org Machine Learning

2405.05695

Country: Asia > China > Hubei Province (0.14)

Genre: Research Report > Promising Solution (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.89)


A Multimodal In-Context Tuning Approach for E-Commerce Product Description Generation

Li, Yunxin, Hu, Baotian, Luo, Wenhan, Ma, Lin, Ding, Yuxin, Zhang, Min

arXiv.org Artificial Intelligence, Mar-7-2024

In this paper, we propose a new setting for generating product descriptions from images, augmented by marketing keywords. It leverages the combined power of visual and textual information to create descriptions that are more tailored to the unique features of products. For this setting, previous methods utilize visual and textual encoders to encode the image and keywords and employ a language model-based decoder to generate the product description. However, the generated description is often inaccurate and generic since same-category products have similar copy-writings, and optimizing the overall framework on large-scale samples makes models concentrate on common words yet ignore the product features. To alleviate the issue, we present a simple and effective Multimodal In-Context Tuning approach, named ModICT, which introduces a similar product sample as the reference and utilizes the in-context learning capability of language models to produce the description. During training, we keep the visual encoder and language model frozen, focusing on optimizing the modules responsible for creating multimodal in-context references and dynamic prompts. This approach preserves the language generation prowess of large language models (LLMs), facilitating a substantial increase in description diversity. To assess the effectiveness of ModICT across various language model scales and types, we collect data from three distinct product categories within the E-commerce domain. Extensive experiments demonstrate that ModICT significantly improves the accuracy (by up to 3.3% on Rouge-L) and diversity (by up to 9.4% on D-5) of generated results compared to conventional methods. Our findings underscore the potential of ModICT as a valuable tool for enhancing automatic generation of product descriptions in a wide range of applications. Code is at: https://github.com/HITsz-TMG/Multimodal-In-Context-Tuning

large language model, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2402.13587

Country:

North America (0.93)
Asia > China (0.68)
Europe (0.68)

Genre: Research Report > New Finding (0.86)

Industry: Information Technology > Services > e-Commerce Services (0.72)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)


Computing in the Era of Large Generative Models: From Cloud-Native to AI-Native

Lu, Yao, Bian, Song, Chen, Lequn, He, Yongjun, Hui, Yulong, Lentz, Matthew, Li, Beibin, Liu, Fei, Li, Jialin, Liu, Qi, Liu, Rui, Liu, Xiaoxuan, Ma, Lin, Rong, Kexin, Wang, Jianguo, Wu, Yingjun, Wu, Yongji, Zhang, Huanchen, Zhang, Minjia, Zhang, Qizhen, Zhou, Tianyi, Zhuo, Danyang

arXiv.org Artificial Intelligence, Jan-17-2024

In this paper, we investigate the intersection of large generative AI models and cloud-native computing architectures. Recent large models such as ChatGPT, while revolutionary in their capabilities, face challenges like escalating costs and demand for high-end GPUs. Drawing analogies between large-model-as-a-service (LMaaS) and cloud database-as-a-service (DBaaS), we describe an AI-native computing paradigm that harnesses the power of both cloud-native technologies (e.g., multi-tenancy and serverless computing) and advanced machine learning runtime (e.g., batched LoRA inference). These joint efforts aim to optimize costs-of-goods-sold (COGS) and improve resource accessibility. The journey of merging these two domains is just at the beginning and we hope to stimulate future research and development in this area.

arxiv preprint arxiv, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2401.12230

Country:

North America > United States > Illinois (0.28)
North America > Canada > Ontario > Toronto (0.14)

Genre: Research Report (0.64)

Industry: Information Technology > Services (1.00)

Technology:

Information Technology > Cloud Computing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
(2 more...)
