Edmonton
Evaluation of OpenAI o1: Opportunities and Challenges of AGI
Zhong, Tianyang, Liu, Zhengliang, Pan, Yi, Zhang, Yutong, Zhou, Yifan, Liang, Shizhe, Wu, Zihao, Lyu, Yanjun, Shu, Peng, Yu, Xiaowei, Cao, Chao, Jiang, Hanqi, Chen, Hanxu, Li, Yiwei, Chen, Junhao, Hu, Huawen, Liu, Yihen, Zhao, Huaqin, Xu, Shaochen, Dai, Haixing, Zhao, Lin, Zhang, Ruidong, Zhao, Wei, Yang, Zhenyuan, Chen, Jingyuan, Wang, Peilong, Ruan, Wei, Wang, Hui, Zhao, Huan, Zhang, Jing, Ren, Yiming, Qin, Shihuan, Chen, Tong, Li, Jiaxi, Zidan, Arif Hassan, Jahin, Afrar, Chen, Minheng, Xia, Sichen, Holmes, Jason, Zhuang, Yan, Wang, Jiaqi, Xu, Bochen, Xia, Weiran, Yu, Jichao, Tang, Kaibo, Yang, Yaxuan, Sun, Bolun, Yang, Tao, Lu, Guoyu, Wang, Xianqiao, Chai, Lilong, Li, He, Lu, Jin, Sun, Lichao, Zhang, Xin, Ge, Bao, Hu, Xintao, Zhang, Lian, Zhou, Hua, Zhang, Lu, Zhang, Shu, Liu, Ninghao, Jiang, Bei, Kong, Linglong, Xiang, Zhen, Ren, Yudan, Liu, Jun, Jiang, Xi, Bao, Yu, Zhang, Wei, Li, Xiang, Li, Gang, Liu, Wei, Shen, Dinggang, Sikora, Andrea, Zhai, Xiaoming, Zhu, Dajiang, Liu, Tianming
This comprehensive study evaluates the performance of OpenAI's o1-preview large language model across a diverse array of complex reasoning tasks, spanning multiple domains, including computer science, mathematics, natural sciences, medicine, linguistics, and social sciences. Through rigorous testing, o1-preview demonstrated remarkable capabilities, often achieving human-level or superior performance in areas ranging from coding challenges to scientific reasoning and from language processing to creative problem-solving. Key findings include: (1) an 83.3% success rate in solving complex competitive programming problems, surpassing many human experts; (2) superior ability in generating coherent and accurate radiology reports, outperforming other evaluated models; (3) 100% accuracy in high school-level mathematical reasoning tasks, with detailed step-by-step solutions; (4) advanced natural language inference capabilities across general and specialized domains such as medicine; (5) impressive performance in chip design tasks, outperforming specialized models in areas such as EDA script generation and bug analysis; (6) remarkable proficiency in anthropology and geology, demonstrating deep understanding and reasoning in these specialized fields; (7) strong capabilities in quantitative investing, with comprehensive financial knowledge and statistical modeling skills; and (8) effective performance in social media analysis, including sentiment analysis and emotion recognition. The model excelled particularly in tasks requiring intricate reasoning and knowledge integration across various fields. While some limitations were observed, including occasional errors on simpler problems and challenges with certain highly specialized concepts, the overall results indicate significant progress towards artificial general intelligence.
Revolutionizing Payload Inspection: A Self-Supervised Journey to Precision with Few Shots
Stein, Kyle, Mahyari, Arash, Francia, Guillermo III, El-Sheikh, Eman
As networks continue to expand and become more interconnected, the need for novel malware detection methods becomes more pronounced. Traditional security measures are increasingly inadequate against the sophistication of modern cyber attacks. Deep Packet Inspection (DPI) has been pivotal in enhancing network security, offering an in-depth analysis of network traffic that surpasses conventional monitoring techniques. DPI not only examines the metadata of network packets, but also inspects the actual content carried within the packet payloads, providing a comprehensive view of the data flowing through networks. The integration of advanced deep learning techniques with DPI has introduced modern methodologies into malware detection. However, state-of-the-art supervised learning approaches generalize poorly to unseen attacks embedded in payloads, which prevents them from accurately detecting new attacks and from transferring knowledge learned from previous attacks to new attacks with small labeled sample sizes. This paper leverages recent advancements in self-supervised learning and few-shot learning. Our proposed self-supervised approach trains a transformer to learn embeddings of payloads from a vast amount of unlabeled data by masking portions of the payloads, leading to a learned representation that generalizes well to various downstream tasks. Once the representations are extracted from the payloads, they are used to train a malware detection algorithm. The representation obtained from the transformer is then used to adapt the malware detector to novel types of attacks using few-shot learning approaches. Our experimental results across several datasets show that the proposed approach performs well and generalizes to novel scenarios.
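The masking step described above can be pictured with a small sketch. This is only an illustration under stated assumptions, not the authors' implementation: the byte-level tokenization, the `MASK_TOKEN` and `IGNORE_LABEL` values, the 15% default masking rate, and the function name `mask_payload` are all hypothetical choices made for the example.

```python
import random
from typing import List, Optional, Tuple

MASK_TOKEN = 256     # hypothetical id, chosen outside the byte range 0..255
IGNORE_LABEL = -100  # hypothetical marker for positions that are not predicted


def mask_payload(payload: bytes, mask_prob: float = 0.15,
                 seed: Optional[int] = None) -> Tuple[List[int], List[int]]:
    """Randomly hide bytes of a payload for masked-token pre-training.

    Returns (inputs, labels): inputs holds the payload bytes with some
    positions replaced by MASK_TOKEN; labels holds the original byte at
    each masked position and IGNORE_LABEL everywhere else.
    """
    rng = random.Random(seed)
    inputs: List[int] = []
    labels: List[int] = []
    for byte in payload:
        if rng.random() < mask_prob:
            inputs.append(MASK_TOKEN)  # the model must reconstruct this byte
            labels.append(byte)
        else:
            inputs.append(byte)
            labels.append(IGNORE_LABEL)
    return inputs, labels
```

A model trained to predict the labels from the inputs never needs attack annotations, which is what allows pre-training on unlabeled traffic.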
Stress Detection from Photoplethysmography in a Virtual Reality Environment
Mahmoudi-Nejad, Athar, Boulanger, Pierre, Guzdial, Matthew
Personalized virtual reality exposure therapy is a therapeutic practice that can adapt to an individual patient, leading to better health outcomes. Measuring a patient's mental state to adjust the therapy is a critical but difficult task. Most published studies use subjective methods to estimate a patient's mental state, which can be inaccurate. This article proposes a virtual reality exposure therapy (VRET) platform capable of assessing a patient's mental state using non-intrusive and widely available physiological signals such as photoplethysmography (PPG). In a case study, we evaluate how PPG signals can be used to distinguish between two states: peaceful and stressful. Sixteen healthy subjects were exposed to two VR environments (relaxed and stressful). Using leave-one-subject-out (LOSO) cross-validation, our best classification model could predict the two states with 70.6% accuracy, which outperforms many more complex approaches.
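Leave-one-subject-out cross-validation holds out all samples of one subject per fold, so the classifier is always tested on a person it has never seen. A minimal sketch of the splitting logic follows; the function name `loso_splits` and the list-of-subject-ids input format are assumptions made for illustration and are not taken from the paper.

```python
from typing import Hashable, Iterator, List, Sequence, Tuple


def loso_splits(subject_ids: Sequence[Hashable]
                ) -> Iterator[Tuple[Hashable, List[int], List[int]]]:
    """Yield (held_out_subject, train_indices, test_indices) per fold.

    subject_ids[i] names the subject that produced sample i. Each fold
    tests on every sample of one subject and trains on all the others.
    """
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test
```

With sixteen subjects this produces sixteen folds, and a reported accuracy would typically be an aggregate over those folds.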
Website visits can predict angler presence using machine learning
Schmid, Julia S., Simmons, Sean, Lewis, Mark A., Poesch, Mark S., Ramazi, Pouria
Understanding and predicting recreational fishing activity is important for sustainable fisheries management. However, traditional methods of measuring fishing pressure, such as surveys, can be costly and limited in both time and spatial extent. Predictive models that relate fishing activity to environmental or economic factors typically rely on historical data, which often restricts their spatial applicability due to data scarcity. In this study, high-resolution angler-generated data from an online platform and easily accessible auxiliary data were tested to predict daily boat presence and aerial counts of boats at almost 200 lakes over five years in Ontario, Canada. Lake-information website visits alone enabled predicting daily angler boat presence with 78% accuracy. While incorporating additional environmental, socio-ecological, weather and angler-generated features into machine learning models did not markedly improve prediction performance for boat presence, these features were important for the prediction of boat counts. Models achieved an R² of up to 0.77 at known lakes included in the model training, but they performed poorly for unknown lakes (R² = 0.21). The results demonstrate the value of integrating angler-generated data from online platforms into predictive models and highlight the potential of machine learning models to enhance fisheries management.
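For readers unfamiliar with the R² values quoted above, the sketch below computes the usual coefficient of determination. It assumes the standard definition (one minus the ratio of residual to total sum of squares); the abstract does not state which variant the authors used, and the function name `r_squared` is illustrative.

```python
from typing import Sequence


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot.

    1.0 means perfect predictions; 0.0 means no better than always
    predicting the mean of y_true. Undefined when y_true is constant.
    """
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

Under this definition, a drop from 0.77 to 0.21 means the models explain far less of the variance in boat counts at lakes they were not trained on.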
Spiders Based on Anxiety: How Reinforcement Learning Can Deliver Desired User Experience in Virtual Reality Personalized Arachnophobia Treatment
Mahmoudi-Nejad, Athar, Guzdial, Matthew, Boulanger, Pierre
The need to generate a spider to provoke a desired anxiety response arises in the context of personalized virtual reality exposure therapy (VRET), a treatment approach for arachnophobia. This treatment involves patients observing virtual spiders in order to become desensitized and decrease their phobia, which requires that the spiders elicit specific anxiety responses. However, VRET approaches tend to require therapists to hand-select the appropriate spider for each patient, which is a time-consuming process and takes significant technical knowledge and patient insight. While automated methods exist, they tend to employ rules-based approaches with minimal ability to adapt to specific users. To address these challenges, we present a framework for VRET utilizing procedural content generation (PCG) and reinforcement learning (RL), which automatically adapts a spider to elicit a desired anxiety response. We demonstrate the superior performance of this system compared to a more common rules-based VRET method.
CANDERE-COACH: Reinforcement Learning from Noisy Feedback
Li, Yuxuan, Das, Srijita, Taylor, Matthew E.
In recent times, reinforcement learning (RL) has been widely applied to many challenging tasks. However, in order to perform well, it requires access to a good reward function, which is often sparse or manually engineered with scope for error. Introducing human prior knowledge, for example through imitation learning, learning from preferences, or inverse reinforcement learning, is often seen as a possible solution to this problem. Learning from feedback is another framework that enables an RL agent to learn from binary evaluative signals describing the teacher's (positive or negative) evaluation of the agent's action. However, these methods often assume that evaluative teacher feedback is perfect, which is a restrictive assumption. In practice, such feedback can be noisy due to limited teacher expertise or other exacerbating factors such as cognitive load, availability, and distraction. In this work, we propose the CANDERE-COACH algorithm, which is capable of learning from noisy feedback given by a non-optimal teacher. We propose a noise-filtering mechanism to de-noise online feedback data, thereby enabling the RL agent to successfully learn with up to 40% of the teacher feedback being incorrect. Experiments on three common domains demonstrate the effectiveness of the proposed approach.
Impedance Control for Manipulators Handling Heavy Payloads
Attaching a heavy payload to the wrist force/moment (F/M) sensor of a manipulator can cause conventional impedance controllers to fail in establishing the desired impedance due to the presence of non-contact forces; namely, the inertial and gravitational forces of the payload. This paper presents an impedance control scheme designed to accurately shape the force-response of such a manipulator without requiring acceleration measurements. As a result, neither wrist accelerometers nor dynamic estimators for compensating inertial load forces are necessary. The proposed controller employs an inner-outer loop feedback structure, which not only addresses uncertainties in the robot's dynamics but also enables the specification of a general target impedance model, including nonlinear models. Stability and convergence of the controller are analytically proven, with results showing that the control input remains bounded as long as the desired inertia differs from the payload inertia. Experimental results confirm that the proposed impedance controller effectively shapes the impedance of a manipulator carrying a heavy load according to the desired impedance model.
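As a point of reference for the term "target impedance model", the most common special case is the linear second-order model below. It is given here only as a textbook illustration: the symbols $M_d$, $D_d$, $K_d$, $x_d$, and $f_{\mathrm{ext}}$ are assumed notation, and the paper itself states that its controller admits more general, including nonlinear, target models.

```latex
% Linear target impedance (illustrative special case, assumed notation):
%   e        = x - x_d   deviation of the end-effector pose from its reference
%   M_d, D_d, K_d        desired inertia, damping, and stiffness matrices
%   f_ext                external contact force/moment
\[
  M_d\,\ddot{e} + D_d\,\dot{e} + K_d\,e = f_{\mathrm{ext}},
  \qquad e = x - x_d .
\]
```

In this reading, the wrist sensor measures contact forces together with the payload's inertial and gravitational forces, which is why a heavy payload distorts the achieved impedance unless those non-contact terms are accounted for.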
Towards Testing and Evaluating Vision-Language-Action Models for Robotic Manipulation: An Empirical Study
Wang, Zhijie, Zhou, Zhehua, Song, Jiayang, Huang, Yuheng, Shu, Zhan, Ma, Lei
Multi-modal foundation models and generative AI have demonstrated promising capabilities in applications across various domains. Recently, vision-language-action (VLA) models have attracted much attention regarding their potential to advance robotic manipulation. Despite the end-to-end perception-control loop offered by VLA models, there is a lack of comprehensive understanding of the capabilities of such models and of an automated testing platform to reveal their robustness and reliability across different robotic manipulation scenarios. To address these challenges, in this work, we present VLATest, a testing framework that automatically generates diverse robotic manipulation scenes to assess the performance of VLA models from various perspectives. Large-scale experiments are considered, including eight VLA models, four types of manipulation tasks, and over 18,604 testing scenes. The experimental results show that existing VLA models still lack the robustness required for practical applications. Specifically, the performance of VLA models can be significantly affected by several factors from the operation environments, such as camera poses, lighting conditions, and unseen objects. Our framework and the insights derived from the study are expected to pave the way for more advanced and reliable VLA-enabled robotic manipulation systems in practice.
FPMT: Enhanced Semi-Supervised Model for Traffic Incident Detection
For traffic incident detection, acquiring data and labels is notably resource-intensive, which makes semi-supervised traffic incident detection both a formidable and a consequential challenge. This paper therefore addresses traffic incident detection using semi-supervised learning. It proposes a semi-supervised learning model named FPMT within the framework of MixText. The data augmentation module introduces Generative Adversarial Networks to balance and expand the dataset. During the mix-up process in the hidden space, it employs a probabilistic pseudo-mixing mechanism to enhance regularization and improve model precision. The training strategy begins with unsupervised training on all data, followed by supervised fine-tuning on a subset of labeled data, thereby achieving the goal of semi-supervised training. In empirical validation on four real-world datasets, the FPMT model exhibits outstanding performance across various metrics. Particularly noteworthy is its robust performance even in scenarios with low label rates.
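The hidden-space mix-up that FPMT builds on can be illustrated with the plain interpolation used in mixup-style methods. The sketch below shows only that baseline interpolation, not FPMT's probabilistic pseudo-mixing mechanism, whose details the abstract does not give; the function name `mix_hidden`, the default `alpha`, and the use of Python lists for hidden vectors are assumptions made for the example.

```python
import random
from typing import List, Optional, Sequence, Tuple


def mix_hidden(h_a: Sequence[float], h_b: Sequence[float],
               alpha: float = 0.75,
               rng: Optional[random.Random] = None) -> Tuple[List[float], float]:
    """Interpolate two hidden vectors with a Beta-distributed weight.

    Draws lam ~ Beta(alpha, alpha), then takes max(lam, 1 - lam) so the
    result stays closer to h_a. The same lam would normally be applied
    to the two (pseudo-)labels as well.
    """
    rng = rng or random.Random()
    lam = rng.betavariate(alpha, alpha)
    lam = max(lam, 1.0 - lam)
    mixed = [lam * a + (1.0 - lam) * b for a, b in zip(h_a, h_b)]
    return mixed, lam
```

Training on such interpolated points acts as a regularizer, which is the role the abstract assigns to the mixing step.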
DSBench: How Far Are Data Science Agents to Becoming Data Science Experts?
Jing, Liqiang, Huang, Zhehui, Wang, Xiaoyang, Yao, Wenlin, Yu, Wenhao, Ma, Kaixin, Zhang, Hongming, Du, Xinya, Yu, Dong
Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) have demonstrated impressive language/vision reasoning abilities, igniting the recent trend of building agents for targeted applications such as shopping assistants or AI software engineers. Recently, many data science benchmarks have been proposed to investigate their performance in the data science domain. However, existing data science benchmarks still fall short when compared to real-world data science applications due to their simplified settings. To bridge this gap, we introduce DSBench, a comprehensive benchmark designed to evaluate data science agents with realistic tasks. This benchmark includes 466 data analysis tasks and 74 data modeling tasks, sourced from Eloquence and Kaggle competitions. DSBench offers a realistic setting by encompassing long contexts, multimodal task backgrounds, reasoning with large data files and multi-table structures, and performing end-to-end data modeling tasks. Our evaluation of state-of-the-art LLMs, LVLMs, and agents shows that they struggle with most tasks, with the best agent solving only 34.12% of data analysis tasks and achieving a 34.74% Relative Performance Gap (RPG). These findings underscore the need for further advancements in developing more practical, intelligent, and autonomous data science agents.