Supervised Learning
Hunter Biden's sentencing date in gun case set for week after election
First son Hunter Biden will be sentenced on Nov. 13, the week after the general election, after he was found guilty on charges in the criminal case focused on his purchase of a handgun in 2018. Judge Maryellen Noreika, in a court order Friday, set the sentencing date for Wednesday, Nov. 13, at 10:00 a.m. at the J. Caleb Boggs Federal Building in Wilmington, Delaware. President Biden's son will learn his fate 8 days after the 2024 presidential election. Hunter Biden was found guilty in June of making a false statement in the purchase of a gun, making a false statement related to information required to be kept by a federally licensed gun dealer, and possession of a gun by a person who is an unlawful user of or addicted to a controlled substance. He faces a total maximum prison time of 25 years for the three charges.
Trustworthy Machine Learning under Social and Adversarial Data Sources
Machine learning has witnessed remarkable breakthroughs in recent years. As machine learning permeates various aspects of daily life, individuals and organizations increasingly interact with these systems, exhibiting a wide range of social and adversarial behaviors. These behaviors may have a notable impact on the behavior and performance of machine learning systems. Specifically, during these interactions, data may be generated by strategic individuals, collected by self-interested data collectors, possibly poisoned by adversarial attackers, and used to create predictors, models, and policies satisfying multiple objectives. As a result, the outputs of machine learning systems might degrade, as seen in the susceptibility of deep neural networks to adversarial examples (Shafahi et al., 2018; Szegedy et al., 2013) and the diminished performance of classic algorithms in the presence of strategic individuals (Ahmadi et al., 2021). Addressing these challenges is imperative for the success of machine learning in societal settings.
Deep Fréchet Regression
Iao, Su I, Zhou, Yidong, Müller, Hans-Georg
Advancements in modern science have led to the increasing availability of non-Euclidean data in metric spaces. This paper addresses the challenge of modeling relationships between non-Euclidean responses and multivariate Euclidean predictors. We propose a flexible regression model capable of handling high-dimensional predictors without imposing parametric assumptions. Two primary challenges are addressed: the curse of dimensionality in nonparametric regression and the absence of linear structure in general metric spaces. The former is tackled using deep neural networks, while for the latter we demonstrate the feasibility of mapping the metric space where responses reside to a low-dimensional Euclidean space using manifold learning. We introduce a reverse mapping approach, employing local Fréchet regression, to map the low-dimensional manifold representations back to objects in the original metric space. We develop a theoretical framework, investigating the convergence rate of deep neural networks under dependent sub-Gaussian noise with bias. The convergence rate of the proposed regression model is then obtained by expanding the scope of local Fréchet regression to accommodate multivariate predictors in the presence of errors in predictors. Simulations and case studies show that the proposed model outperforms existing methods for non-Euclidean responses, focusing on the special cases of probability measures and networks.
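For orientation, the regression target in this setting is usually the conditional Fréchet mean, which generalizes the conditional expectation to a metric space. The notation below is assumed for illustration and does not come from the abstract: $(\Omega, d)$ is the metric space of responses, $Y \in \Omega$ the response, $X \in \mathbb{R}^p$ the predictor, and $m_\oplus$ the regression function.

```latex
m_\oplus(x) \;=\; \operatorname*{arg\,min}_{\omega \in \Omega}
  \mathbb{E}\!\left[\, d^2(Y, \omega) \mid X = x \,\right]
```

When $\Omega = \mathbb{R}$ and $d$ is the absolute difference, the minimizer is the ordinary conditional mean $\mathbb{E}[Y \mid X = x]$, so classical regression is recovered as a special case.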
A Vectorization Method Induced By Maximal Margin Classification For Persistent Diagrams
Wu, An, Pan, Yu, Zhou, Fuqi, Yan, Jinghui, Liu, Chuanlu
Persistent homology is an effective method for extracting topological information, represented as persistent diagrams, of spatial structure data. Hence it is well-suited for the study of protein structures. Attempts to incorporate persistent homology in machine learning methods of protein function prediction have resulted in several techniques for vectorizing persistent diagrams. However, current vectorization methods are excessively artificial and cannot ensure the effective utilization of information or the rationality of the methods. To address this problem, we propose a more geometrical vectorization method of persistent diagrams based on maximal margin classification for Banach space, and additionally propose a framework that utilizes topological data analysis to identify proteins with specific functions. We evaluated our vectorization method using a binary classification task on proteins and compared it with the statistical methods that exhibit the best performance among thirteen commonly used vectorization methods. The experimental results indicate that our approach surpasses the statistical methods in both robustness and precision.
Artificial neural networks on graded vector spaces
We develop new artificial neural network models for graded vector spaces, which are suitable when different features in the data have different significance (weights). This is the first time that such models are designed mathematically and they are expected to perform better than neural networks over usual vector spaces, which are the special case when the gradings are all 1s.
Assessing In-context Learning and Fine-tuning for Topic Classification of German Web Data
Schelb, Julian, Ulloa, Roberto, Spitz, Andreas
Researchers in the political and social sciences often rely on classification models to analyze trends in information consumption by examining browsing histories of millions of webpages. Automated scalable methods are necessary due to the impracticality of manual labeling. In this paper, we model the detection of topic-related content as a binary classification task and compare the accuracy of fine-tuned pre-trained encoder models against in-context learning strategies. Using only a few hundred annotated data points per topic, we detect content related to three German policies in a database of scraped webpages. We compare multilingual and monolingual models, as well as zero- and few-shot approaches, and investigate the impact of negative sampling strategies and the combination of URL and content-based features. Our results show that a small sample of annotated data is sufficient to train an effective classifier. Fine-tuning encoder-based models yields better results than in-context learning. Classifiers using both URL and content-based features perform best, while using URLs alone provides adequate results when content is unavailable.
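The zero-shot versus few-shot distinction can be made concrete with a small prompt builder. This is only a sketch under assumptions: the function name `build_few_shot_prompt`, the prompt wording, the example URLs, and the topic string are all hypothetical and are not taken from the paper. Passing an empty example list yields a zero-shot prompt; passing labelled examples yields a few-shot prompt that combines URL and content features.

```python
def build_few_shot_prompt(topic, examples, url, text, max_chars=500):
    """Assemble a yes/no topic-classification prompt (hypothetical format).

    examples: list of (url, text, label) tuples; an empty list gives zero-shot.
    """
    lines = [
        f"Decide whether each webpage is related to the topic: {topic}.",
        "Answer with 'yes' or 'no'.",
        "",
    ]
    # Each labelled example contributes its URL, truncated content, and answer.
    for ex_url, ex_text, ex_label in examples:
        lines += [
            f"URL: {ex_url}",
            f"Content: {ex_text[:max_chars]}",
            f"Answer: {'yes' if ex_label else 'no'}",
            "",
        ]
    # The query page comes last, with the answer left open for the model.
    lines += [f"URL: {url}", f"Content: {text[:max_chars]}", "Answer:"]
    return "\n".join(lines)


# Hypothetical topic and pages, for illustration only.
topic = "a hypothetical policy topic"
examples = [
    ("https://example.org/policy-news", "Parliament debates the new policy.", True),
    ("https://example.org/recipes", "How to bake rye bread at home.", False),
]
few_shot = build_few_shot_prompt(topic, examples, "https://example.org/page", "Some page text.")
zero_shot = build_few_shot_prompt(topic, [], "https://example.org/page", "Some page text.")
```

The resulting string would be sent to whichever language model is being evaluated; the fine-tuning alternative described above trains an encoder on the same labelled pairs instead.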
Relational Composition in Neural Networks: A Survey and Call to Action
Wattenberg, Martin, Viégas, Fernanda B.
Many neural nets appear to represent data as linear combinations of "feature vectors." Algorithms for discovering these vectors have seen impressive recent success. However, we argue that this success is incomplete without an understanding of relational composition: how (or whether) neural nets combine feature vectors to represent more complicated relationships. To facilitate research in this area, this paper offers a guided tour of various relational mechanisms that have been proposed, along with preliminary analysis of how such mechanisms might affect the search for interpretable features. We end with a series of promising areas for empirical research, which may help determine how neural networks represent structured data.
Learning to Represent Surroundings, Anticipate Motion and Take Informed Actions in Unstructured Environments
Contemporary robots have become exceptionally skilled at achieving specific tasks in structured environments. However, they often fail when faced with the limitless permutations of real-world unstructured environments. This motivates robotics methods which learn from experience, rather than follow a pre-defined set of rules. In this thesis, we present a range of learning-based methods aimed at enabling robots, operating in dynamic and unstructured environments, to better understand their surroundings, anticipate the actions of others, and take informed actions accordingly. In the first part of the thesis, we investigate methods which leverage learning to represent the structure and motion in a robot's operating environment, in a continuous manner.
Automated Neural Patent Landscaping in the Small Data Regime
Erana, Tisa Islam, Finlayson, Mark A.
In its simplest form, patent landscaping is the process of identifying all patents that are related to a particular technology or technology area. Patent landscapes are useful for a number of activities: they are important for assessing the coverage, value, or context of particular pieces of intellectual property, or for understanding the direction, speed, or concentration of innovation in a particular industry (Hunt et al., 2007). For example, companies create patent landscapes to evaluate the risks posed by competitors in a particular technology space, or to decide whether and how much to invest in pursuing particular innovations. Patent offices and economic monitoring organizations use patent landscapes to evaluate how a particular technology is affecting or might affect the economy, for example, how much economic investment is underway in a technology, how much economic value has been generated, or how many industries or companies are supported by a particular technology. Governments, in turn, can use that information to implement technology policies, for example, deciding whether to steer investment or tax incentives to companies working in particular areas (e.g., AI or green technologies). While the simplest form of patent landscaping merely identifies which patents are related to a particular area, other more sophisticated forms of patent landscaping can seek to identify how different subareas of a technology area are related, which companies or inventor groups are the most prolific, what regions are involved, or what specific types of innovations are the focus of current development.
Positive-Unlabelled Learning for Improving Image-based Recommender System Explainability
Fernández-Campa-González, Álvaro, Paz-Ruza, Jorge, Alonso-Betanzos, Amparo, Guijarro-Berdiñas, Bertha
Among the existing approaches for visual-based Recommender System (RS) explainability, utilizing user-uploaded item images as efficient, trustable explanations is a promising option. However, current models following this paradigm assume that, for any user, all images uploaded by other users can be considered negative training examples (i.e. bad explanatory images), an inadvertently naive labelling assumption that contradicts the rationale of the approach. This work proposes a new explainer training pipeline by leveraging Positive-Unlabelled (PU) Learning techniques to train an image-based explainer with refined subsets of reliable negative examples for each user, selected through a novel user-personalized, two-step, similarity-based PU Learning algorithm. Computational experiments show this PU-based approach outperforms the state-of-the-art non-PU method in six popular real-world datasets, proving that an improvement of visual-based RS explainability can be achieved by maximizing training data quality rather than increasing model complexity.
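The general idea of two-step, similarity-based PU learning can be sketched as follows. This is a generic illustration, not the authors' algorithm: the function names, the cosine-similarity-to-centroid rule, the `fraction` parameter, the nearest-centroid classifier, and the toy 2-D embeddings are all assumptions made here. Step 1 picks the unlabelled items least similar to a user's positives as reliable negatives; step 2 trains a classifier on positives versus those negatives only.

```python
import numpy as np


def select_reliable_negatives(positives, unlabeled, fraction=0.5):
    """Step 1: indices of the unlabelled rows least similar to the positive centroid."""
    centroid = positives.mean(axis=0)
    # Cosine similarity between every unlabelled row and the positive centroid.
    norms = np.linalg.norm(unlabeled, axis=1) * np.linalg.norm(centroid)
    sims = unlabeled @ centroid / np.maximum(norms, 1e-12)
    k = max(1, int(round(fraction * len(unlabeled))))
    # argsort is ascending, so the first k indices are the least similar rows.
    return np.argsort(sims)[:k]


def fit_centroid_classifier(positives, negatives):
    """Step 2: a deliberately simple nearest-centroid classifier (1 = positive)."""
    pos_c, neg_c = positives.mean(axis=0), negatives.mean(axis=0)

    def predict(x):
        x = np.atleast_2d(x)
        d_pos = np.linalg.norm(x - pos_c, axis=1)
        d_neg = np.linalg.norm(x - neg_c, axis=1)
        return (d_pos < d_neg).astype(int)

    return predict


# Toy 2-D "image embeddings" for one user (hypothetical data).
positives = np.array([[1.0, 0.1], [0.9, -0.1], [1.1, 0.0]])
unlabeled = np.array([[1.0, 0.0], [-1.0, 0.1], [0.95, 0.05], [-0.9, -0.1]])

neg_idx = select_reliable_negatives(positives, unlabeled, fraction=0.5)
predict = fit_centroid_classifier(positives, unlabeled[neg_idx])
labels = predict(unlabeled)
```

In this toy data the two unlabelled points near the positives are left out of the negative set, which is the behaviour the naive "all other images are negative" assumption would miss.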