generalization difficulty
On robust overfitting: adversarial training induced distribution matters
Despite their outstanding performance, deep neural networks (DNNs) are known to be vulnerable to adversarial attacks, in which a carefully designed perturbation may cause the network to make a wrong prediction [1, 2]. Many methods have been proposed to improve the robustness of DNNs against adversarial perturbations [3, 4, 5], among which Projected Gradient Descent based Adversarial Training (PGD-AT) [3] is arguably the most effective [6, 7]. A recent work [8], however, revealed a surprising phenomenon in PGD-AT: after training, even though the robust error (i.e., the error probability in the predicted label for adversarially perturbed instances) is nearly zero on the training set, it may remain very high on the testing set. For example, on the testing set of CIFAR10, the robust error of a PGD-AT trained model can be as large as 44.19%. This contrasts sharply with standard training: on CIFAR10, when the standard error (i.e., the error probability in the predicted label for non-perturbed instances) is nearly zero on the training set, its value on the testing set is only about 4%.
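The distinction between standard error and robust error can be illustrated with a minimal sketch. It is not the setup of the paper: it assumes a fixed toy linear classifier in place of a DNN, a logistic loss, and an L-infinity perturbation budget, and every name and value (`pgd_attack`, `eps`, `step`, `n_steps`, the weights, the four data points) is hypothetical and chosen only for illustration.

```python
import math

def score(w, b, x):
    """Linear score w.x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(w, b, x):
    """Predicted label in {0, 1}."""
    return 1 if score(w, b, x) > 0 else 0

def loss_grad_x(w, b, x, y):
    """Gradient of the logistic loss with respect to the input x."""
    p = 1.0 / (1.0 + math.exp(-score(w, b, x)))
    return [(p - y) * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def pgd_attack(w, b, x, y, eps, step, n_steps):
    """Projected gradient ascent on the loss inside an L-infinity ball of radius eps."""
    x_adv = list(x)
    for _ in range(n_steps):
        g = loss_grad_x(w, b, x_adv, y)
        # Ascent step in the direction of the gradient sign.
        x_adv = [xa + step * sign(gi) for xa, gi in zip(x_adv, g)]
        # Projection back onto the box [x - eps, x + eps].
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
    return x_adv

def standard_error(w, b, data):
    """Fraction of unperturbed instances that are misclassified."""
    return sum(predict(w, b, x) != y for x, y in data) / len(data)

def robust_error(w, b, data, eps, step, n_steps):
    """Fraction of instances misclassified after the PGD perturbation."""
    wrong = sum(
        predict(w, b, pgd_attack(w, b, x, y, eps, step, n_steps)) != y
        for x, y in data
    )
    return wrong / len(data)

# Assumed toy model and data: two points far from the decision boundary,
# two points close to it.
w, b = [1.0, -1.0], 0.0
data = [([2.0, 0.0], 1), ([0.2, 0.0], 1), ([0.0, 2.0], 0), ([0.0, 0.3], 0)]
eps, step, n_steps = 0.25, 0.1, 5

print(standard_error(w, b, data))                    # 0.0
print(robust_error(w, b, data, eps, step, n_steps))  # 0.5
```

In this toy case every clean point is classified correctly, yet the two points near the decision boundary are pushed across it by the perturbation, so the robust error is higher than the standard error. The sketch only evaluates the two error notions on a fixed model; it does not perform adversarial training.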
Towards A Measure Of General Machine Intelligence
Venkatasubramanian, Gautham, Kar, Sibesh, Singh, Abhimanyu, Mishra, Shubham, Yadav, Dushyant, Chandak, Shreyansh
To build increasingly general-purpose artificial intelligence systems that can deal with unknown variables across unknown domains, we need benchmarks that measure precisely how well these systems perform on tasks they have never seen before. A prerequisite for this is a measure of a task's generalization difficulty, or how dissimilar it is from the system's prior knowledge and experience. If the skill of an intelligence system in a particular domain is defined as its ability to consistently generate a set of instructions (or programs) to solve tasks in that domain, current benchmarks do not quantitatively measure the efficiency of acquiring new skills, making it possible to brute-force skill acquisition by training with unlimited amounts of data and compute power. With this in mind, we first propose a common language of instruction, i.e., a programming language that allows the expression of programs in the form of directed acyclic graphs across a wide variety of real-world domains and computing platforms. Using programs generated in this language, we demonstrate a match-based method to both score performance and calculate the generalization difficulty of any given set of tasks. We use these to define a numeric benchmark called the g-index to measure and compare the skill-acquisition efficiency of any intelligence system on a set of real-world tasks. Finally, we evaluate the suitability of some well-known models as general intelligence systems by calculating their g-index scores.
Empirically Measuring Transfer Distance for System Design and Operation
Cody, Tyler, Adams, Stephen, Beling, Peter A.
Classical machine learning approaches are sensitive to non-stationarity. Transfer learning can address non-stationarity by sharing knowledge from one system to another; however, in areas like machine prognostics and defense, data is fundamentally limited. Therefore, transfer learning algorithms have few, if any, examples from which to learn. Herein, we suggest that these constraints on algorithmic learning can be addressed by systems engineering. We formally define transfer distance in general terms and demonstrate its use in empirically quantifying the transferability of models. We consider the use of transfer distance in the design of machine rebuild procedures to allow for transferable prognostic models. We also consider the use of transfer distance in predicting operational performance in computer vision. Practitioners can use the presented methodology to design and operate systems with consideration for the learning theoretic challenges faced by component learning systems.
A thread written by @martin_gorner
"On the measure of intelligence" where he proposes a new benchmark for "intelligence" called the "Abstraction and Reasoning Corpus". Chess was considered the pinnacle of human intelligence, … until a computer beat Garry Kasparov in 1997. Today, it is hard to argue that a min-max algorithm with optimizations represents "intelligence". AlphaGo took this a step further. It became world champion at Go by using deep learning. Still, the program is narrowly focused on playing Go, and solving this task did not lead to breakthroughs in other fields.
On the Measure of Intelligence
To make deliberate progress towards more intelligent and more human-like artificial systems, we need to be following an appropriate feedback signal: we need to be able to define and evaluate intelligence in a way that enables comparisons between two systems, as well as comparisons with humans. Over the past hundred years, there has been an abundance of attempts to define and measure intelligence, across both the fields of psychology and AI. We summarize and critically assess these definitions and evaluation approaches, while making apparent the two historical conceptions of intelligence that have implicitly guided them. We note that in practice, the contemporary AI community still gravitates towards benchmarking intelligence by comparing the skill exhibited by AIs and humans at specific tasks such as board games and video games. We argue that solely measuring skill at any given task falls short of measuring intelligence, because skill is heavily modulated by prior knowledge and experience: unlimited priors or unlimited training data allow experimenters to "buy" arbitrary levels of skills for a system, in a way that masks the system's own generalization power. We then articulate a new formal definition of intelligence based on Algorithmic Information Theory, describing intelligence as skill-acquisition efficiency and highlighting the concepts of scope, generalization difficulty, priors, and experience. Using this definition, we propose a set of guidelines for what a general AI benchmark should look like. Finally, we present a benchmark closely following these guidelines, the Abstraction and Reasoning Corpus (ARC), built upon an explicit set of priors designed to be as close as possible to innate human priors. We argue that ARC can be used to measure a human-like form of general fluid intelligence and that it enables fair general intelligence comparisons between AI systems and humans.