
Collaborating Authors

 Du, Yang


Reversed in Time: A Novel Temporal-Emphasized Benchmark for Cross-Modal Video-Text Retrieval

arXiv.org Artificial Intelligence

Cross-modal (e.g., image-text, video-text) retrieval is an important task in information retrieval and multimodal vision-language understanding. Temporal understanding makes video-text retrieval more challenging than image-text retrieval. However, we find that the widely used video-text benchmarks have shortcomings in comprehensively assessing model abilities, especially in temporal understanding, so that large-scale image-text pre-trained models can already achieve zero-shot performance comparable to video-text pre-trained models. In this paper, we introduce RTime, a novel temporal-emphasized video-text retrieval dataset. We first obtain videos of actions or events with significant temporality, and then reverse these videos to create harder negative samples. We then recruit annotators to judge the significance and reversibility of candidate videos, and to write captions for qualified videos. We further adopt GPT-4 to extend more captions based on the human-written captions. Our RTime dataset currently consists of 21k videos with 10 captions per video, totalling about 122 hours. Based on RTime, we propose three retrieval benchmark tasks: RTime-Origin, RTime-Hard, and RTime-Binary. We further enhance the use of harder negatives in model training, and benchmark a variety of video-text models on RTime. Extensive experimental analysis shows that RTime indeed poses new and higher challenges to video-text retrieval. We release our RTime dataset (https://github.com/qyr0403/Reversed-in-Time) to further advance video-text retrieval and multimodal understanding research.
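
The central construction is easy to picture: reverse a clip's frame order to obtain a temporally hard negative, then penalize a retrieval model that scores the reversed clip as highly as the original against the same caption. The sketch below is a minimal, hypothetical illustration in PyTorch; the tensor shapes, the video_encoder/text_encoder modules, and the margin value are assumptions, not the paper's actual training recipe.

import torch
import torch.nn.functional as F

def reverse_clip(frames: torch.Tensor) -> torch.Tensor:
    """Reverse a clip along its temporal axis. frames: (T, C, H, W)."""
    return torch.flip(frames, dims=[0])

def hard_negative_loss(video_encoder, text_encoder, frames, caption_tokens, margin=0.2):
    """Margin loss pushing the original clip above its reversed copy
    for the same caption (illustrative formulation, assumed encoders)."""
    v_orig = F.normalize(video_encoder(frames.unsqueeze(0)), dim=-1)                 # (1, D)
    v_rev = F.normalize(video_encoder(reverse_clip(frames).unsqueeze(0)), dim=-1)    # (1, D)
    t = F.normalize(text_encoder(caption_tokens), dim=-1)                            # (1, D)
    sim_pos = (v_orig * t).sum(-1)   # caption vs. original clip
    sim_neg = (v_rev * t).sum(-1)    # caption vs. temporally reversed clip
    return F.relu(margin - sim_pos + sim_neg).mean()

In practice such a term would be added to the usual cross-modal contrastive objective, so the reversed clip acts as an extra in-batch hard negative rather than replacing the standard loss.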


MiMoTable: A Multi-scale Spreadsheet Benchmark with Meta Operations for Table Reasoning

arXiv.org Artificial Intelligence

Extensive research has explored the capability of Large Language Models (LLMs) for table reasoning and has significantly improved performance on existing benchmarks. However, tables and user questions in real-world applications are more complex and diverse, leaving a non-negligible gap to the existing benchmarks. To fill this gap, we propose a Multi-scale spreadsheet benchmark with Meta operations for Table reasoning, named MiMoTable. Specifically, MiMoTable incorporates two key features. First, the tables in MiMoTable are all spreadsheets used in real-world scenarios, covering seven domains and containing different table types. Second, we define a new criterion with six categories of meta operations for measuring the difficulty of each question in MiMoTable, which also serves as a new perspective for measuring the difficulty of existing benchmarks. Experimental results show that Claude-3.5-Sonnet achieves the best performance with 77.4% accuracy, indicating that there is still significant room for LLMs to improve on MiMoTable. Furthermore, we grade the difficulty of existing benchmarks according to our new criterion. Experiments show that LLM performance decreases as benchmark difficulty increases, demonstrating the effectiveness of the proposed criterion.
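
The difficulty criterion can be thought of as counting which and how many distinct meta operations a question requires. The sketch below is a hypothetical Python illustration; the operation names and the grading thresholds are placeholders, since the abstract does not enumerate the six categories or the grading rule.

# Hypothetical sketch: grade a table-reasoning question by the distinct
# meta operations it needs. Labels and thresholds are placeholders.
META_OPERATIONS = {
    "lookup", "comparison", "arithmetic",
    "aggregation", "sorting", "multi_table",  # placeholder labels, not the paper's
}

def grade_difficulty(required_ops: set) -> str:
    unknown = required_ops - META_OPERATIONS
    if unknown:
        raise ValueError(f"unrecognized meta operations: {unknown}")
    n = len(required_ops)
    if n <= 1:
        return "easy"
    if n <= 3:
        return "medium"
    return "hard"

# Example: a question needing a lookup plus an aggregation would be "medium".
print(grade_difficulty({"lookup", "aggregation"}))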


Hierarchical Nonlinear Orthogonal Adaptive-Subspace Self-Organizing Map Based Feature Extraction for Human Action Recognition

AAAI Conferences

Feature extraction is a critical step in action recognition. Hand-crafted features are often restricted by their fixed forms, and deep learning features are more effective but need large-scale labeled data for training. In this paper, we propose a new hierarchical Nonlinear Orthogonal Adaptive-Subspace Self-Organizing Map (NOASSOM) to adaptively learn effective features from data without supervision. NOASSOM extends the Adaptive-Subspace Self-Organizing Map (ASSOM), which only deals with linear data and is trained with supervision on labeled data. Firstly, by adding a nonlinear orthogonal map layer, NOASSOM is able to handle nonlinear input data, and a kernel trick avoids defining the specific form of the nonlinear orthogonal map. Secondly, we modify the loss function of ASSOM so that every input sample is used to train the model individually; in this way, NOASSOM effectively learns statistical patterns from data without supervision. Thirdly, we propose a hierarchical NOASSOM to extract more representative features. Finally, we apply the proposed hierarchical NOASSOM to efficiently describe the appearance and motion information around trajectories for action recognition. Experimental results on widely used datasets show that our method outperforms many state-of-the-art methods based on hand-crafted and deep learning features.
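
To make the subspace-competition idea concrete, the sketch below illustrates the linear ASSOM mechanism that NOASSOM builds on: each map node maintains an orthonormal basis, the winner is the node whose subspace captures the most energy of an input, and the winner's basis is rotated toward the input and re-orthonormalized. This is a minimal NumPy illustration under assumed details (single-sample updates, no neighborhood function, no kernel/nonlinear layer), not the authors' hierarchical NOASSOM.

import numpy as np

class SubspaceSOM:
    """Simplified adaptive-subspace SOM: each node holds an orthonormal
    basis and competes by how much input energy its subspace captures."""

    def __init__(self, n_nodes, dim, subspace_dim, seed=0):
        rng = np.random.default_rng(seed)
        # One random orthonormal (dim x subspace_dim) basis per node.
        self.bases = [np.linalg.qr(rng.standard_normal((dim, subspace_dim)))[0]
                      for _ in range(n_nodes)]

    def winner(self, x):
        # Node whose subspace captures the largest projection energy ||B^T x||^2.
        energies = [np.sum((B.T @ x) ** 2) for B in self.bases]
        return int(np.argmax(energies))

    def update(self, x, lr=0.1):
        # Rotate the winner's basis toward the input, then re-orthonormalize.
        w = self.winner(x)
        B = self.bases[w]
        B = B + lr * np.outer(x, B.T @ x)   # pull basis vectors toward x
        self.bases[w] = np.linalg.qr(B)[0]  # keep the basis orthonormal
        return w

The nonlinear extension described in the abstract would, roughly speaking, replace the explicit dot products above with kernel evaluations, which is what removes the need to specify the nonlinear orthogonal map explicitly.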