Effective human learning depends on a wide selection of educational materials that align with the learner's current understanding of the topic. While the Internet has revolutionized human learning or education, a substantial resource accessibility barrier still exists. Namely, the excess of online information can make it challenging to navigate and discover high-quality learning materials. In this paper, we propose the educational resource discovery (ERD) pipeline that automates web resource discovery for novel domains. The pipeline consists of three main steps: data collection, feature extraction, and resource classification. We start with a known source domain and conduct resource discovery on two unseen target domains via transfer learning. We first collect frequent queries from a set of seed documents and search on the web to obtain candidate resources, such as lecture slides and introductory blog posts. Then we introduce a novel pretrained information retrieval deep neural network model, query-document masked language modeling (QD-MLM), to extract deep features of these candidate resources. We apply a tree-based classifier to decide whether the candidate is a positive learning resource. The pipeline achieves F1 scores of 0.94 and 0.82 when evaluated on two similar but novel target domains. Finally, we demonstrate how this pipeline can benefit an application: leading paragraph generation for surveys. This is the first study that considers various web resources for survey generation, to the best of our knowledge. We also release a corpus of 39,728 manually labeled web resources and 659 queries from NLP, Computer Vision (CV), and Statistics (STATS).
Many machine learning models have been built to tackle information overload issues on Massive Open Online Courses (MOOC) platforms. These models rely on learning powerful representations of MOOC entities. However, they suffer from the problem of scarce expert label data. To overcome this problem, we propose to learn pre-trained representations of MOOC entities using abundant unlabeled data from the structure of MOOCs which can directly be applied to the downstream tasks. While existing pre-training methods have been successful in NLP areas as they learn powerful textual representation, their models do not leverage the richer information about MOOC entities. This richer information includes the graph relationship between the lectures, concepts, and courses along with the domain knowledge about the complexity of a concept. We develop MOOCRep, a novel method based on Transformer language model trained with two pre-training objectives : 1) graph-based objective to capture the powerful signal of entities and relations that exist in the graph, and 2) domain-oriented objective to effectively incorporate the complexity level of concepts. Our experiments reveal that MOOCRep's embeddings outperform state-of-the-art representation learning methods on two tasks important for education community, concept pre-requisite prediction and lecture recommendation.
The Internet has rich and rapidly increasing sources of high quality educational content. Inferring prerequisite relations between educational concepts is required for modern large-scale online educational technology applications such as personalized recommendations and automatic curriculum creation. We present PREREQ, a new supervised learning method for inferring concept prerequisite relations. PREREQ is designed using latent representations of concepts obtained from the Pairwise Latent Dirichlet Allocation model, and a neural network based on the Siamese network architecture. PREREQ can learn unknown concept prerequisites from course prerequisites and labeled concept prerequisite data. It outperforms state-of-the-art approaches on benchmark datasets and can effectively learn from very less training data. PREREQ can also use unlabeled video playlists, a steadily growing source of training data, to learn concept prerequisites, thus obviating the need for manual annotation of course prerequisites.
Liang, Chen (Pennsylvania State University) | Ye, Jianbo (Pennsylvania State University) | Wu, Zhaohui (Microsoft Corporation) | Pursel, Bart (Pennsylvania State University) | Giles, C. Lee (Pennsylvania State University)
Prerequisite relations among concepts play an important role in many educational applications such as intelligent tutoring system and curriculum planning. With the increasing amount of educational data available, automatic discovery of concept prerequisite relations has become both an emerging research opportunity and an open challenge. Here, we investigate how to recover concept prerequisite relations from course dependencies and propose an optimization based framework to address the problem. We create the first real dataset for empirically studying this problem, which consists of the listings of computer science courses from 11 U.S. universities and their concept pairs with prerequisite labels. Experiment results on a synthetic dataset and the real course dataset both show that our method outperforms existing baselines.
Liang, Chen (Pennsylvania State University) | Ye, Jianbo (Pennsylvania State University) | Wang, Shuting (Pennsylvania State University) | Pursel, Bart (Pennsylvania State University) | Giles, C. Lee (Pennsylvania State University)
Concept prerequisite learning focuses on machine learning methods for measuring the prerequisite relation among concepts. With the importance of prerequisites for education, it has recently become a promising research direction. A major obstacle to extracting prerequisites at scale is the lack of large-scale labels which will enable effective data-driven solutions. We investigate the applicability of active learning to concept prerequisite learning.We propose a novel set of features tailored for prerequisite classification and compare the effectiveness of four widely used query strategies. Experimental results for domains including data mining, geometry, physics, and precalculus show that active learning can be used to reduce the amount of training data required. Given the proposed features, the query-by-committee strategy outperforms other compared query strategies.