Goto

Collaborating Authors

HiGitClass: Keyword-Driven Hierarchical Classification of GitHub Repositories

arXiv.org Machine Learning

--GitHub has become an important platform for code sharing and scientific exchange. With the massive number of repositories available, there is a pressing need for topic-based search. Even though the topic label functionality has been introduced, the majority of GitHub repositories do not have any labels, impeding the utility of search and topic-based analysis. This work targets the automatic repository classification problem as keyword-driven hierarchical classification . Specifically, users only need to provide a label hierarchy with keywords to supply as supervision. This setting is flexible, adaptive to the users' needs, accounts for the different granularity of topic labels and requires minimal human effort. We identify three key challenges of this problem, namely (1) the presence of multi-modal signals; (2) supervision scarcity and bias; (3) supervision format mismatch. In recognition of these challenges, we propose the H IG ITC LASS framework, comprising of three modules: heterogeneous information network embedding; keyword enrichment; topic modeling and pseudo document generation. Experimental results on two GitHub repository collections confirm that H IG ITC LASS is superior to existing weakly-supervised and dataless hierarchical classification methods, especially in its ability to integrate both structured and unstructured data for repository classification. I NTRODUCTION For the computer science field, code repositories are an indispensable part of the knowledge dissemination process, containing valuable details for reproduction. For software engineers, sharing code also promotes the adoption of best practices and accelerates code development. The needs of the scientific community and that of software developers have facilitated the growth of online code collaboration platforms, the most popular of which is GitHub, with over 96 million repositories and 31 million users as of 2018. With the overwhelming number of repositories hosted on GitHub, there is a natural need to enable search functionality so that users can quickly target repositories of interest.


How Document Classification Can Improve Business Processes

#artificialintelligence

The process of labeling documents into categories based on the type of the content is known as document classification. It can also be defined as the process of assigning one or more classes or categories to a document (depending on the type of content) to make it easy to sort and manage images, texts, and videos. Document classification can be done using artificial intelligence, machine learning, and python. This classification can be done in two ways: manually or automatically. The former gives humans full authority over the classification.


Text Analysis 101; A Basic Understanding for Business Users: Document Classification

@machinelearnbot

This blog was originally posted as part of our Text Analysis 101 blog series. It aims to explain how the classification of text works as part of Natural Language Processing. The automatic classification of documents is an example of how Machine Learning (ML) and Natural Language Processing (NLP) can be leveraged to enable machines to better understand human language. By classifying text, we are aiming to assign one or more classes or categories to a document or piece of text, making it easier to manage and sort the documents. Manually categorizing and grouping text sources can be extremely laborious and time-consuming, especially for publishers, news sites, blogs or anyone who deals with a lot of content.


Text Analysis 101; A Basic Understanding for Business Users: Clustering and Unsupervised Methods

@machinelearnbot

This blog was originally posted as part of our Text Analysis 101 blog series. It aims to explain how the classification of text works as part of Natural Language Processing. It was the second blog on harnessing Machine Learning (ML) in the form of Natural Language Processing (NLP) for the Automatic Classification of documents. By classifying text, we aim to assign a document or piece of text to one or more classes or categories making it easier to manage or sort. A Document Classifier often returns or assigns a category "label" or "code" to a document or piece of text.


Text Analysis 101; A Basic Understanding for Business Users: Document Classification - AYLIEN

#artificialintelligence

The automatic classification of documents is an example of how Machine Learning (ML) and Natural Language Processing (NLP) can be leveraged to enable machines to better understand human language. By classifying text, we are aiming to assign one or more classes or categories to a document or piece of text, making it easier to manage and sort the documents. Manually categorizing and grouping text sources can be extremely laborious and time-consuming, especially for publishers, news sites, blogs or anyone who deals with a lot of content. Broadly speaking, there are two classes of ML techniques: supervised and unsupervised. In supervised methods, a model is created based on previous observations i.e. a training set.