GitHub, the popular code repository service, has to serve two masters. It's well known for hosting popular open-source projects, but it's also working to attract more business users, large and small, who want to store and manage their proprietary code privately. Those different constituencies sometimes need different things. But Chris Wanstrath, the company's co-founder and CEO, told the company's annual user conference this week that building new features into GitHub isn't a matter of helping only one or the other. That showed in the new features introduced at the conference in San Francisco on Wednesday.
This paper describes the autofeat Python library, which provides a scikit-learn style linear regression model with automatic feature engineering and selection capabilities. Complex non-linear machine learning models such as neural networks are in practice often difficult to train and even harder to explain to non-statisticians, who require transparent analysis results as a basis for important business decisions. While linear models are efficient and intuitive, they generally provide lower prediction accuracies. Our library provides a multi-step feature engineering and selection process: it first generates a large pool of non-linear features and then selects from that pool a small and robust set of meaningful features, which improve the prediction accuracy of a linear model while retaining its interpretability.
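The two-step idea in this abstract (generate many non-linear candidate features, then keep a few) can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not autofeat's API or its actual selection procedure: the function names `expand_features`, `pearson`, and `select_features`, the particular transforms, and the synthetic data are all hypothetical, and ranking by absolute Pearson correlation is a simple stand-in for whatever selection method the library uses.

```python
import math
import random
from itertools import combinations


def expand_features(rows, names):
    """Step 1: build a pool of candidate features from the raw columns.

    The transforms below are illustrative choices, not autofeat's list.
    """
    transforms = {
        "{}^2": lambda v: v * v,
        "sqrt(|{}|)": lambda v: math.sqrt(abs(v)),
        "log(1+|{}|)": lambda v: math.log1p(abs(v)),
    }
    cols = {n: [r[i] for r in rows] for i, n in enumerate(names)}
    pool = dict(cols)  # keep the raw columns as candidates too
    for n, vals in cols.items():
        for fmt, fn in transforms.items():
            pool[fmt.format(n)] = [fn(v) for v in vals]
    for a, b in combinations(names, 2):
        pool[f"{a}*{b}"] = [x * y for x, y in zip(cols[a], cols[b])]
    return pool


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def select_features(pool, target, k=2):
    """Step 2: keep the k candidates most correlated with the target."""
    ranked = sorted(
        pool, key=lambda name: abs(pearson(pool[name], target)), reverse=True
    )
    return ranked[:k]


# Synthetic data (an assumption for the demo): the target depends on x1 squared.
rng = random.Random(0)
rows = [(rng.uniform(-3, 3), rng.uniform(-3, 3)) for _ in range(200)]
target = [2.0 * x1 * x1 + 0.5 * x2 for x1, x2 in rows]

pool = expand_features(rows, ["x1", "x2"])
selected = select_features(pool, target, k=2)
print(selected)
```

On data like this, the raw column `x1` is nearly uncorrelated with the target while the engineered `x1^2` is strongly correlated, so a plain linear model fitted on the selected features can capture a relationship it would miss on the raw inputs, and each selected feature remains a readable formula.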
A recently announced SHA-1 collision attack has the potential to break code repositories that use the Subversion (SVN) revision control system. The first victim was the repository for the WebKit browser engine, which was corrupted after someone committed two different PDF files with the same SHA-1 hash to it. The incident happened hours after researchers from Google and Centrum Wiskunde & Informatica (CWI) in the Netherlands announced the first practical collision attack against the SHA-1 hash function on Thursday. Their demonstration consisted of creating two PDF files with different contents that had the same SHA-1 digest. This proved without a doubt that SHA-1 is cryptographically broken, because with a secure hash function it should be computationally infeasible to find two different pieces of data or files that produce the same digest (hash).
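Why a collision can damage a system that identifies content by its hash can be shown with a toy example. The sketch below is an illustration under stated assumptions, not a description of how Subversion stores data: `short_digest`, `find_collision`, and `DedupStore` are hypothetical names, and the digest is SHA-1 deliberately truncated to 16 bits so that a collision can be found by brute force in a fraction of a second (a real SHA-1 collision required the researchers' specialized attack).

```python
import hashlib


def short_digest(data: bytes, n_bytes: int = 2) -> str:
    """SHA-1 truncated to n_bytes: deliberately weak so collisions are cheap."""
    return hashlib.sha1(data).hexdigest()[: 2 * n_bytes]


def find_collision(n_bytes: int = 2):
    """Brute-force two different messages with the same truncated digest."""
    seen = {}
    i = 0
    while True:
        msg = f"file-{i}".encode()
        digest = short_digest(msg, n_bytes)
        if digest in seen:
            return seen[digest], msg
        seen[digest] = msg
        i += 1


class DedupStore:
    """Toy content store that treats the digest as a unique identity."""

    def __init__(self):
        self.blobs = {}

    def put(self, data: bytes) -> str:
        key = short_digest(data)
        # If the key already exists, the store assumes the content is
        # identical and keeps the first blob, silently dropping the new one.
        self.blobs.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        return self.blobs[key]


a, b = find_collision()
store = DedupStore()
key_a = store.put(a)
key_b = store.put(b)

print(a != b)                    # the two inputs differ
print(key_a == key_b)            # yet they share one digest
print(store.get(key_b) == a)     # asking for b's key returns a's content
```

Any system that assumes "same digest means same content" can end up returning or keeping the wrong data once two different inputs share a digest, which is why a practical collision matters for tools that rely on SHA-1 in this way.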
Last week we had the honor of participating in MSR'18, where two members of our team, Vadim Markovtsev and Waren Long, presented the research paper they wrote on our latest dataset: Public Git Archive. Public Git Archive is the result of months of effort curating a dataset suitable for training Machine Learning on Source Code (aka MLonCode) models. The dataset contains 3TB of repositories from GitHub, ready to download. This includes all of the contents (git metadata and file contents) of every repository on GitHub with 50 or more stars. The list of repositories was obtained from the GHTorrent project, specifically from the snapshot of January 1st, 2018.
This article introduces the challenge of digital preservation in the area of engineering design and manufacturing and presents a methodology for applying knowledge representation and semantic techniques to develop digital engineering archives. This work is part of an ongoing, multi-university effort to create cyberinfrastructure-based engineering repositories for undergraduates (CIBER-U) to support engineering design education. The technical approach is to use knowledge representation techniques to create formal models of engineering data elements, workflows, and processes. With these techniques, formal engineering knowledge and processes can be captured and preserved with some guarantee of long-term interpretability. The article presents examples of how the techniques can be used to encode specific engineering information packages and workflows.