Lu, Su
InfiR : Crafting Effective Small Language Models and Multimodal Small Language Models in Reasoning
Xie, Congkai, Cai, Shuo, Wang, Wenjun, Li, Pengxiang, Sang, Zhijie, Yang, Kejing, Zhang, Yiming, Li, Zhen, Zhu, Guanghao, Liu, Zeyu, Yu, Yang, Liu, Yuhang, Lu, Su, He, Baoyi, Zhou, Qi, Han, Xiaotian, Yuan, Jianbo, Zhang, Shengyu, Wu, Fei, Yang, Hongxia
Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have made significant advances in reasoning capabilities. However, they still face challenges such as high computational demands and privacy concerns. This paper focuses on developing efficient Small Language Models (SLMs) and Multimodal Small Language Models (MSLMs) that retain competitive reasoning abilities. We introduce a novel training pipeline that enhances reasoning capabilities and facilitates deployment on edge devices, achieving state-of-the-art performance while minimizing development costs. InfiR aims to advance AI systems by improving reasoning, reducing adoption barriers, and addressing privacy concerns through smaller model sizes. Resources are available at https://github.com/Reallm-Labs/InfiR.
Support-Target Protocol for Meta-Learning
Lu, Su, Ye, Han-Jia, Zhan, De-Chuan
The support/query (S/Q) training protocol is widely used in meta-learning. The S/Q protocol trains a task-specific model on the support set S and then evaluates it on the query set Q, optimizing the meta-model with the query loss, which depends on the size and quality of Q. In this paper, we study a new support/target (S/T) protocol for meta-learning. Assuming we have access to the theoretically optimal model T for a task, we can directly match the task-specific model trained on S to T. The S/T protocol offers a more accurate evaluation since it does not rely on possibly biased and noisy query instances. Two challenges arise in putting the S/T protocol into practice. First, we must determine how to match the task-specific model to T. To this end, we minimize the discrepancy between them on a fictitious dataset generated by adversarial learning, distilling the prediction ability of T into the task-specific model. Second, ready-made optimal models are usually unavailable. As an alternative, we construct surrogate target models by fine-tuning the globally pre-trained meta-model on local tasks, maintaining both efficiency and veracity.
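The matching step above can be illustrated with a minimal linear-regression sketch: a surrogate target T is obtained by fitting the full local task (standing in for fine-tuning the meta-model), a task-specific student is fit on the small support set S, and the student is then matched to T by minimizing their prediction discrepancy on a probe set. All function names here are hypothetical, and fixed random probes replace the adversarially generated fictitious dataset for brevity; this is a toy analogue, not the paper's implementation.

```python
import numpy as np

def fit_ridge(x, y, lam=1e-3):
    """Ridge least-squares fit; stands in for training or fine-tuning a model."""
    d = x.shape[1]
    return np.linalg.solve(x.T @ x + lam * np.eye(d), x.T @ y)

def distill_to_target(w_student, w_target, probe_x, lr=0.1, steps=200):
    """Match the task-specific (student) model to the target T by gradient
    descent on the mean squared prediction gap over a probe set.
    (The paper generates this set adversarially; random probes here.)"""
    w = w_student.copy()
    n = len(probe_x)
    for _ in range(steps):
        gap = probe_x @ (w - w_target)        # prediction differences on probes
        w -= lr * 2 * probe_x.T @ gap / n     # gradient of the mean squared gap
    return w

# Toy demo: T = model fit on the full local task; student = fit on S only.
rng = np.random.default_rng(0)
w_true = rng.normal(size=5)
x_task = rng.normal(size=(100, 5))
y_task = x_task @ w_true
x_support, y_support = x_task[:3], y_task[:3]   # tiny support set S

w_target = fit_ridge(x_task, y_task)            # surrogate optimal model T
w_student = fit_ridge(x_support, y_support)     # task-specific model on S
probe = rng.normal(size=(64, 5))                # fictitious matching inputs
w_matched = distill_to_target(w_student, w_target, probe)
```

Because the probe matrix is full rank, the discrepancy objective is strongly convex and the matched student converges toward T's predictions.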
Few-Shot Action Recognition with Compromised Metric via Optimal Transport
Lu, Su, Ye, Han-Jia, Zhan, De-Chuan
Although vital to computer vision systems, few-shot action recognition remains immature despite extensive research on few-shot image classification. Popular few-shot learning algorithms extract a transferable embedding from seen classes and reuse it on unseen classes by constructing a metric-based classifier. One main obstacle to applying these algorithms to action recognition is the complex structure of videos. Some existing solutions sample frames from a video and aggregate their embeddings into a video-level representation, neglecting important temporal relations. Others perform an explicit sequence matching between two videos and define their distance as the matching cost, imposing overly strong restrictions on sequence ordering. In this paper, we propose Compromised Metric via Optimal Transport (CMOT) to combine the advantages of these two solutions. CMOT simultaneously considers semantic and temporal information in videos under the optimal transport framework, and is discriminative for both content-sensitive and ordering-sensitive tasks. In detail, given two videos, we sample segments from them and cast the calculation of their distance as an optimal transport problem between the two segment sequences. To preserve the inherent temporal ordering information, we additionally amend the ground cost matrix by penalizing it with the positional distance between each pair of segments. Empirical results on benchmark datasets demonstrate the superiority of CMOT.
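The distance described above can be sketched as follows: build a ground cost from the semantic (cosine) distance between segment embeddings, add a positional penalty on each segment pair, and solve the resulting transport problem. This is a minimal illustration, not the paper's implementation; the weight `lam` on the positional term and the use of entropic (Sinkhorn) regularization with uniform marginals are assumptions made for brevity.

```python
import numpy as np

def cmot_distance(x, y, lam=0.5, reg=0.1, n_iter=200):
    """CMOT-style distance between two videos given as segment embeddings.

    x: (n, d) segment embeddings of the first video
    y: (m, d) segment embeddings of the second video
    lam: weight of the positional (temporal-order) penalty -- assumed knob
    reg, n_iter: entropic regularization and Sinkhorn iterations -- assumed
    """
    n, m = len(x), len(y)
    # Semantic ground cost: cosine distance between segment embeddings.
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    yn = y / np.linalg.norm(y, axis=1, keepdims=True)
    cost = 1.0 - xn @ yn.T
    # Amend the ground cost with the normalized positional distance,
    # penalizing matches that violate temporal ordering.
    pos = np.abs(np.arange(n)[:, None] / n - np.arange(m)[None, :] / m)
    cost = cost + lam * pos
    # Entropic optimal transport via Sinkhorn iterations, uniform marginals.
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-cost / reg)
    u = np.ones(n)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    plan = u[:, None] * K * v[None, :]   # transport plan between segments
    return float((plan * cost).sum())    # transported ground cost
```

With the positional term, a video compared against a temporally reversed copy of itself incurs a larger distance than against an identically ordered copy, which is the ordering sensitivity the abstract describes.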