Pyeongchang
A Decade of Action Quality Assessment: Largest Systematic Survey of Trends, Challenges, and Future Directions
Yin, Hao, Parmar, Paritosh, Xu, Daoliang, Zhang, Yang, Zheng, Tianyou, Fu, Weiwei
Action Quality Assessment (AQA) -- the ability to quantify the quality of human motion, actions, or skill levels and provide feedback -- has far-reaching implications in areas such as low-cost physiotherapy, sports training, and workforce development. As such, it has become a critical field in computer vision and video understanding over the past decade. Significant progress has been made in AQA methodologies, datasets, and applications, yet a pressing need remains for a comprehensive synthesis of this rapidly evolving field. In this paper, we present a thorough survey of the AQA landscape, systematically reviewing over 200 research papers using the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) framework. We begin by covering foundational concepts and definitions, then move to general frameworks and performance metrics, and finally discuss the latest advances in methodologies and datasets. This survey provides a detailed analysis of research trends, performance comparisons, challenges, and future directions. Through this work, we aim to offer a valuable resource for both newcomers and experienced researchers, promoting further exploration and progress in AQA. Data are available at https://haoyin116.github.io/Survey_of_AQA/
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
Abdin, Marah, Jacobs, Sam Ade, Awan, Ammar Ahmad, Aneja, Jyoti, Awadallah, Ahmed, Awadalla, Hany, Bach, Nguyen, Bahree, Amit, Bakhtiari, Arash, Bao, Jianmin, Behl, Harkirat, Benhaim, Alon, Bilenko, Misha, Bjorck, Johan, Bubeck, Sébastien, Cai, Qin, Cai, Martin, Mendes, Caio César Teodoro, Chen, Weizhu, Chaudhary, Vishrav, Chen, Dong, Chen, Dongdong, Chen, Yen-Chun, Chen, Yi-Ling, Chopra, Parul, Dai, Xiyang, Del Giorno, Allie, de Rosa, Gustavo, Dixon, Matthew, Eldan, Ronen, Fragoso, Victor, Iter, Dan, Gao, Mei, Gao, Min, Gao, Jianfeng, Garg, Amit, Goswami, Abhishek, Gunasekar, Suriya, Haider, Emman, Hao, Junheng, Hewett, Russell J., Huynh, Jamie, Javaheripi, Mojan, Jin, Xin, Kauffmann, Piero, Karampatziakis, Nikos, Kim, Dongwoo, Khademi, Mahoud, Kurilenko, Lev, Lee, James R., Lee, Yin Tat, Li, Yuanzhi, Li, Yunsheng, Liang, Chen, Liden, Lars, Liu, Ce, Liu, Mengchen, Liu, Weishung, Lin, Eric, Lin, Zeqi, Luo, Chong, Madan, Piyush, Mazzola, Matt, Mitra, Arindam, Modi, Hardik, Nguyen, Anh, Norick, Brandon, Patra, Barun, Perez-Becker, Daniel, Portet, Thomas, Pryzant, Reid, Qin, Heyang, Radmilac, Marko, Rosset, Corby, Roy, Sambudha, Ruwase, Olatunji, Saarikivi, Olli, Saied, Amin, Salim, Adil, Santacroce, Michael, Shah, Shital, Shang, Ning, Sharma, Hiteshi, Shukla, Swadheen, Song, Xia, Tanaka, Masahiro, Tupini, Andrea, Wang, Xin, Wang, Lijuan, Wang, Chunyu, Wang, Yu, Ward, Rachel, Wang, Guanhua, Witte, Philipp, Wu, Haiping, Wyatt, Michael, Xiao, Bin, Xu, Can, Xu, Jiahang, Xu, Weijian, Yadav, Sonali, Yang, Fan, Yang, Jianwei, Yang, Ziyi, Yang, Yifan, Yu, Donghan, Yuan, Lu, Zhang, Chengruidong, Zhang, Cyril, Zhang, Jianwen, Zhang, Li Lyna, Zhang, Yi, Zhang, Yue, Zhang, Yunan, Zhou, Xiren
We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone. The innovation lies entirely in our training dataset, a scaled-up version of the one used for phi-2, composed of heavily filtered publicly available web data and synthetic data. The model is also further aligned for robustness, safety, and chat format. We also provide some initial parameter-scaling results with 7B and 14B models trained for 4.8T tokens, called phi-3-small and phi-3-medium, both significantly more capable than phi-3-mini (e.g., respectively 75% and 78% on MMLU, and 8.7 and 8.9 on MT-bench). Moreover, we introduce phi-3-vision, a 4.2 billion parameter model based on phi-3-mini with strong reasoning capabilities for image and text prompts.
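To give a rough sense of what "small enough to be deployed on a phone" might mean, the sketch below does back-of-the-envelope arithmetic on the two figures stated in the abstract (3.8 billion parameters, 3.3 trillion tokens). The bit widths are assumptions chosen for illustration; the abstract does not say which numeric precision is used on-device, and the estimate counts weights only, ignoring activations and runtime overhead.

```python
def tokens_per_parameter(tokens, params):
    """Training tokens seen per model parameter."""
    return tokens / params


def weight_memory_gb(params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes), weights only."""
    return params * bits_per_weight / 8 / 1e9


PHI3_MINI_PARAMS = 3.8e9   # stated in the abstract
PHI3_MINI_TOKENS = 3.3e12  # stated in the abstract

ratio = tokens_per_parameter(PHI3_MINI_TOKENS, PHI3_MINI_PARAMS)
print(f"tokens per parameter: {ratio:.0f}")

# Assumed precisions for illustration; not taken from the abstract.
for bits in (16, 8, 4):
    size = weight_memory_gb(PHI3_MINI_PARAMS, bits)
    print(f"{bits}-bit weights: {size:.1f} GB")
```

Under these assumptions the weights alone would take about 7.6 GB at 16 bits and about 1.9 GB at 4 bits, and the model sees roughly 868 training tokens per parameter.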
CoUDA: Coherence Evaluation via Unified Data Augmentation
Zhu, Dawei, Wu, Wenhao, Song, Yifan, Zhu, Fangwei, Cao, Ziqiang, Li, Sujian
Coherence evaluation aims to assess the organization and structure of a discourse, which remains challenging even in the era of large language models. Due to the scarcity of annotated data, data augmentation is commonly used for training coherence evaluation models. However, previous augmentations for this task primarily rely on heuristic rules and lack design criteria as guidance. In this paper, we take inspiration from the linguistic theory of discourse structure and propose a data augmentation framework named CoUDA. CoUDA breaks down discourse coherence into global and local aspects and designs an augmentation strategy for each. For local coherence in particular, we propose a novel generative strategy for constructing augmentation samples, which involves post-pretraining a generative model and applying two mechanisms that control the difficulty of the generated samples. During inference, CoUDA jointly evaluates the global and local aspects to comprehensively assess the overall coherence of a discourse. Extensive experiments in coherence evaluation show that, with only 233M parameters, CoUDA achieves state-of-the-art performance in both pointwise scoring and pairwise ranking tasks, even surpassing recent GPT-3.5 and GPT-4 based metrics.
Gecko: Versatile Text Embeddings Distilled from Large Language Models
Lee, Jinhyuk, Dai, Zhuyun, Ren, Xiaoqi, Chen, Blair, Cer, Daniel, Cole, Jeremy R., Hui, Kai, Boratko, Michael, Kapadia, Rajvi, Ding, Wen, Luan, Yi, Duddu, Sai Meher Karthik, Abrego, Gustavo Hernandez, Shi, Weiqiang, Gupta, Nithi, Kusupati, Aditya, Jain, Prateek, Jonnalagadda, Siddhartha Reddy, Chang, Ming-Wei, Naim, Iftekhar
Text embedding models represent natural language as dense vectors, positioning semantically similar text near each other within the embedding space (Gao et al., 2021; Le and Mikolov, 2014; Reimers and Gurevych, 2019). These embeddings are commonly used for a wide range of downstream tasks including document retrieval, sentence similarity, classification, and clustering (Muennighoff et al., 2023). Instead of building separate embedding models for each downstream task, recent efforts seek to create a single embedding model supporting many tasks. The recent development of general-purpose text embedding models presents a challenge: these models require large amounts of training data to comprehensively cover desired domains and skills. Recent embedding efforts have focused on using extensive collections of training examples (Li et al., 2023; Wang et al., 2022).
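The idea that semantically similar texts sit near each other in the embedding space can be illustrated with a minimal sketch. It assumes cosine similarity as the closeness measure (a common choice, not stated in the abstract) and uses made-up three-dimensional vectors; a real embedding model would produce the vectors, and the document names here are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_vec, doc_vecs):
    """Return document ids sorted from most to least similar to the query."""
    scores = {doc_id: cosine_similarity(query_vec, vec)
              for doc_id, vec in doc_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy embeddings standing in for model output.
docs = {
    "cats": [0.9, 0.1, 0.0],
    "dogs": [0.8, 0.3, 0.1],
    "tax_law": [0.0, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]

ranking = rank_by_similarity(query, docs)
print(ranking)
```

The same ranking routine serves retrieval directly, and the pairwise scores it computes are the kind of signal that similarity, classification, and clustering tasks build on.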
IndiVec: An Exploration of Leveraging Large Language Models for Media Bias Detection with Fine-Grained Bias Indicators
Lin, Luyang, Wang, Lingzhi, Zhao, Xiaoyan, Li, Jing, Wong, Kam-Fai
This study focuses on media bias detection, which is crucial in an era when influential social media platforms shape individual attitudes and opinions. In contrast to prior work that primarily relies on training specific models tailored to particular datasets, resulting in limited adaptability and subpar performance on out-of-domain data, we introduce a general bias detection framework, IndiVec, built upon large language models. IndiVec begins by constructing a fine-grained media bias database, leveraging the robust instruction-following capabilities of large language models and vector database techniques. When confronted with new input for bias detection, our framework automatically selects the most relevant indicators from the vector database and employs majority voting to determine the input's bias label. IndiVec excels compared to previous methods due to its adaptability (demonstrating consistent performance across diverse datasets from various sources) and explainability (providing explicit top-k indicators to interpret bias predictions). Experimental results on four political bias datasets highlight IndiVec's significant superiority over baselines. Furthermore, additional experiments and analysis provide insights into the framework's effectiveness.
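The retrieve-then-vote step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the in-memory list standing in for the vector database, the cosine similarity measure, the two-dimensional vectors, the label names, and the function `predict_bias` are all assumptions made for the example.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def predict_bias(query_vec, indicator_db, k=3):
    """Retrieve the k nearest indicators and majority-vote their labels.

    indicator_db is a list of (vector, label, indicator_text) tuples.
    Returns the winning label and the retrieved indicator texts, which
    double as an explanation of the prediction.
    """
    ranked = sorted(indicator_db,
                    key=lambda item: cosine(query_vec, item[0]),
                    reverse=True)
    top_k = ranked[:k]
    votes = Counter(label for _, label, _ in top_k)
    label = votes.most_common(1)[0][0]
    return label, [text for _, _, text in top_k]


# Hypothetical indicator database with toy embeddings.
db = [
    ([1.0, 0.0], "biased", "uses loaded adjectives"),
    ([0.9, 0.1], "biased", "omits opposing view"),
    ([0.0, 1.0], "neutral", "attributes claims to sources"),
    ([0.1, 0.9], "neutral", "reports figures without commentary"),
]

label, indicators = predict_bias([0.8, 0.2], db, k=3)
print(label, indicators)
```

Using an odd k with two labels avoids ties; with more labels a tie-breaking rule would be needed, which this sketch leaves to the insertion order used by `Counter`.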
Vaccine: Perturbation-aware Alignment for Large Language Model
Huang, Tiansheng, Hu, Sihao, Liu, Ling
The new paradigm of finetuning-as-a-service introduces a new attack surface for Large Language Models (LLMs): a small amount of harmful data uploaded by users can easily trick the finetuning process into producing an alignment-broken model. We conduct an empirical analysis and uncover a harmful embedding drift phenomenon, which points to a probable cause of the alignment-broken effect. Inspired by our findings, we propose Vaccine, a perturbation-aware alignment technique to mitigate the security risk of user finetuning. The core idea of Vaccine is to produce invariant hidden embeddings by progressively adding crafted perturbation to them in the alignment phase. This enables the embeddings to withstand harmful perturbation from un-sanitized user data in the finetuning phase. Our results on mainstream open-source LLMs (e.g., Llama2, Opt, Vicuna) demonstrate that Vaccine can boost the robustness of alignment against embedding drift induced by harmful prompts while preserving reasoning ability on benign prompts. Our code is available at https://github.com/git-disl/Vaccine.
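The core idea of training on perturbed hidden embeddings can be sketched on a toy model. The abstract does not say how the perturbation is crafted, so this sketch assumes one plausible recipe: take the loss gradient with respect to the embedding, rescale it to a fixed L2 norm `rho`, add it to the embedding, and update the weights using the gradient at that perturbed point. The linear model, squared-error loss, and all names (`crafted_perturbation`, `alignment_step`, `rho`, `lr`) are hypothetical and are not taken from the Vaccine code base.

```python
import math


def matvec(W, x):
    """Hidden embedding e = W x for a list-of-rows matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def crafted_perturbation(grad, rho):
    """Rescale the gradient direction to L2 norm rho (zero if it vanishes)."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0 for _ in grad]
    return [rho * g / norm for g in grad]


def prediction_error(W, v, x, y):
    """Error of the toy model: output v . (W x) minus target y."""
    return sum(vi * ei for vi, ei in zip(v, matvec(W, x))) - y


def alignment_step(W, v, x, y, rho=0.1, lr=0.05):
    """One update of W on (x, y) using a perturbed hidden embedding."""
    # Pass 1: clean embedding and loss gradient with respect to it,
    # for the loss 0.5 * err**2.
    e = matvec(W, x)
    err = sum(vi * ei for vi, ei in zip(v, e)) - y
    grad_e = [err * vi for vi in v]
    # Assumed crafting rule: step of norm rho along the gradient direction.
    eps = crafted_perturbation(grad_e, rho)
    # Pass 2: loss gradient evaluated at the perturbed embedding.
    e_pert = [ei + di for ei, di in zip(e, eps)]
    err_pert = sum(vi * ei for vi, ei in zip(v, e_pert)) - y
    grad_e_pert = [err_pert * vi for vi in v]
    # Backpropagate to W (d e_i / d W_ij = x_j) and take a gradient step.
    return [[w - lr * g * xj for w, xj in zip(row, x)]
            for row, g in zip(W, grad_e_pert)]


W = [[0.0, 0.0], [0.0, 0.0]]
v, x, y = [1.0, 0.5], [1.0, 2.0], 1.0
initial_err = prediction_error(W, v, x, y)
for _ in range(50):  # perturbation is re-crafted at every step
    W = alignment_step(W, v, x, y)
final_err = prediction_error(W, v, x, y)
print(abs(initial_err), abs(final_err))
```

Because the perturbation is recomputed at each step, the weights are pushed toward a region where the output stays close to the target even when the embedding is shifted by up to `rho`, which is the intuition behind withstanding later embedding drift.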