 Supervised Learning


Text Similarity in Vector Space Models: A Comparative Study

arXiv.org Machine Learning

Automatic measurement of semantic text similarity is an important task in natural language processing. In this paper, we evaluate the performance of different vector space models to perform this task. We address the real-world problem of modeling patent-to-patent similarity and compare TFIDF (and related extensions), topic models (e.g., latent semantic indexing), and neural models (e.g., paragraph vectors). Contrary to expectations, the added computational cost of text embedding methods is justified only when: 1) the target text is condensed; and 2) the similarity comparison is trivial. In other cases, TFIDF performs surprisingly well, in particular for longer and more technical texts or for making finer-grained distinctions between nearest neighbors. Unexpectedly, extensions to the TFIDF method, such as adding noun phrases or calculating term weights incrementally, were not helpful in our context.
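
As a rough illustration of the TFIDF baseline the abstract refers to, the sketch below builds sparse TF-IDF vectors and compares documents by cosine similarity. It is a minimal sketch and not the authors' pipeline: the whitespace tokenizer, the smoothed IDF variant, and the names `tfidf_vectors` and `cosine` are all assumptions made for this example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    # Assumption: lowercase whitespace tokenization; real systems do more.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            # Relative term frequency times a smoothed IDF (one of many variants).
            term: (count / len(tokens)) * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy "patent" snippets, invented for illustration only.
docs = [
    "rotor blade for wind turbine",
    "wind turbine rotor blade assembly",
    "method for brewing coffee",
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

With these toy inputs the two turbine snippets score higher against each other than either does against the coffee snippet, which is the nearest-neighbor behavior the study evaluates at scale.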


Efficient Structured Surrogate Loss and Regularization in Structured Prediction

arXiv.org Machine Learning

In this dissertation, we focus on several important problems in structured prediction. In structured prediction, the label has a rich intrinsic substructure, and the loss varies with respect to the predicted label and the true label pair. Structured SVM is an extension of binary SVM to adapt to such structured tasks. In the first part of the dissertation, we study surrogate losses and efficient methods for them. To minimize the empirical risk, a surrogate loss that upper-bounds the loss is used as a proxy to minimize the actual loss. Since the objective function is written in terms of the surrogate loss, the choice of the surrogate loss is important, and the performance depends on it. Another issue regarding the surrogate loss is the efficiency of the argmax label inference for the surrogate loss. Efficient inference is necessary for the optimization since it is often the most time-consuming step. We present a new class of surrogate losses named bi-criteria surrogate loss, which is a generalization of the popular surrogate losses. We first investigate an efficient method for a slack rescaling formulation as a starting point, utilizing decomposability of the model. Then, we extend the algorithm to the bi-criteria surrogate loss, which is very efficient and also shows performance improvements. In the second part of the dissertation, another important issue, regularization, is studied. Specifically, we investigate a problem of regularization in hierarchical classification when a structural imbalance exists in the label structure. We present a method to normalize the structure, as well as a new norm, namely the shared Frobenius norm. It is suitable for hierarchical classification that adapts to the data in addition to the label structure.
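
For context, the two "popular surrogate losses" for structured SVMs are usually written as margin rescaling and slack rescaling. The notation below is an assumption and is not taken from the dissertation: $f(x, y) = w^\top \phi(x, y)$ is the model score, $L(y, y_i)$ is the task loss, and $[a]_+ = \max(0, a)$. The bi-criteria loss itself is not reproduced here, since the abstract does not define it.

```latex
% Margin rescaling: the task loss is added to the score difference.
\ell_{\mathrm{MR}}(x_i, y_i)
  = \max_{y} \bigl[ L(y, y_i) + f(x_i, y) - f(x_i, y_i) \bigr]_+

% Slack rescaling: the task loss multiplies the margin violation.
\ell_{\mathrm{SR}}(x_i, y_i)
  = \max_{y} \, L(y, y_i) \bigl[ 1 + f(x_i, y) - f(x_i, y_i) \bigr]_+

% Both upper-bound the loss of the prediction
% \hat{y} = \arg\max_{y} f(x_i, y), because
% f(x_i, \hat{y}) - f(x_i, y_i) \ge 0 implies
\ell_{\mathrm{MR}}(x_i, y_i) \ge L(\hat{y}, y_i),
\qquad
\ell_{\mathrm{SR}}(x_i, y_i) \ge L(\hat{y}, y_i).
```

The inner maximization over $y$ is the argmax label inference that the abstract calls the most time-consuming step. In slack rescaling the loss and the score interact multiplicatively, so the maximization no longer decomposes as easily as in margin rescaling.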


Google case set to examine if EU data rules extend...

Daily Mail - Science & tech

Google is fighting in Europe's top court today to tighten the scope of an EU privacy law that grants citizens the 'right to be forgotten'. The rule allows people to demand Google remove search results that mention outdated or embarrassing information about them. This includes links to websites mentioning serious incidents - such as bankruptcy or criminal convictions - that may cause that person to be stigmatised. Google is battling with France's data privacy regulator over an order to extend the rule to remove search results worldwide upon request. The dispute pits data privacy concerns against the public's right to know, while also raising thorny questions about how to enforce differing legal jurisdictions when it comes to the borderless internet.


Google case set to examine if EU data rules extend globally

USATODAY - Tech Top Stories

Google employees reviewing the company appreciate the company's benefits and perks, which include free food and coffee made by baristas in every building. Other benefits include onsite gyms, free workout classes, and shuttles for free and easy commuting. Employees also appear confident in the company's leadership. Google CEO Sundar Pichai has a near-perfect 95% approval rating on Glassdoor. LONDON - Google is going to Europe's top court in its legal fight against an order requiring it to extend "right to be forgotten" rules to its search engines globally.


Google Case Set to Examine if EU Data Rules Extend Globally

U.S. News

Not all requests are waved through. In a related case that will also be heard Tuesday, the EU court will be asked to weigh in on a request by four people in France who want their search results to be purged of any information about their political beliefs and criminal records, without taking into account public interest. Google had rejected their request, which was ultimately referred to the ECJ.


Addressing the Fundamental Tension of PCGML with Discriminative Learning

arXiv.org Machine Learning

Abstract: Procedural content generation via machine learning (PCGML) is typically framed as the task of fitting a generative model to full-scale examples of a desired content distribution. This approach presents a fundamental tension: the more design effort expended to produce detailed training examples for shaping a generator, the lower the return on investment from applying PCGML in the first place. In response, we propose the use of discriminative models (which capture the validity of a design rather than the distribution of the content) trained on positive and negative examples. Through a modest modification of WaveFunctionCollapse, a commercially-adopted PCG approach that we characterize as using elementary machine learning, we demonstrate a new mode of control for learning-based generators. We demonstrate how an artist might craft a focused set of additional positive and negative examples by critique of the generator's previous outputs. This interaction mode bridges PCGML with mixed-initiative design assistance tools by working with a machine to define a space of valid designs rather than just one new design. Procedural Content Generation via Machine Learning (PCGML) is the recent term for the strategy of controlling content generators using examples [1]. Existing PCGML approaches train their statistical models based on preexisting artist-provided samples of the desired content. However, there is a fundamental tension here: machine learning often works better with more training data, but the effort to produce quality training data is frequently costly enough that the artists might be better off just making the content themselves. Rather than attempting to train a generative statistical model (capturing the distribution of desired content), we focus on applying discriminative learning. In discriminative learning, the model learns to judge whether a candidate content artifact would be valid or desirable, but it does not learn how to generate candidates. 
Pairing a discriminative model with a preexisting content generator, we realize example-driven generation that can be influenced by both positive and negative examples of valid design patterns.
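
To make the idea of steering a generator with positive and negative examples concrete, here is a deliberately crude sketch. It is not the authors' modification of WaveFunctionCollapse. It assumes one-dimensional rows of single-character tiles, and treats "validity" as every adjacent tile pair being allowed. The function names and the tile alphabet are hypothetical.

```python
def adjacent_pairs(row):
    """Set of (left, right) tile pairs that occur next to each other in a row."""
    return set(zip(row, row[1:]))


def learn_allowed_pairs(positives, negatives):
    """Allow every pair seen in a positive example, then forbid every pair
    seen in a negative example (assumption: negatives are small and focused)."""
    allowed = set()
    for row in positives:
        allowed |= adjacent_pairs(row)
    for row in negatives:
        allowed -= adjacent_pairs(row)
    return allowed


def is_valid(row, allowed):
    """Discriminative check: judge a candidate without knowing how to generate one."""
    return adjacent_pairs(row) <= allowed


# Hypothetical tiles: G = grass, W = water.
positives = ["GGWW", "WGG"]
allowed = learn_allowed_pairs(positives, negatives=[])
print(is_valid("GWG", allowed))   # accepted before any critique

# The artist critiques an output by marking "water directly left of grass" as bad.
allowed = learn_allowed_pairs(positives, negatives=["WG"])
print(is_valid("GWG", allowed), is_valid("GGW", allowed))
```

Any existing generator can then be paired with `is_valid` as a filter or constraint. The examples only shape the space of acceptable outputs, which is the division of labor the abstract describes.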


Local Linear Forests - Arxiv Vanity

#artificialintelligence

In order to address this weakness, we take the perspective of random forests as an adaptive kernel method. This interpretation follows work by Athey et al. (2018), Hothorn et al. (2004), and Meinshausen (2006), and complements the traditional view of forests as an ensemble method (i.e., an average of predictions made by individual trees). These types of adjustments are particularly important near boundaries, where neighborhoods are asymmetric by necessity, but with many covariates, the adjustments are also important away from boundaries given that local neighborhoods are often unbalanced due to sampling variation. The goal of this paper is to improve the accuracy of forests on smooth signals using regression adjustments, potentially in many dimensions. By using the local regression adjustment, it is possible to adjust for asymmetries and imbalances in the set of nearby points used for prediction, ensuring that the weighted average of the feature vector of neighboring points is approximately equal to the target feature vector, and that predictions are centered.
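
The effect of a local regression adjustment near a boundary can be seen in a one-dimensional toy case. The sketch below is an assumption-laden illustration, not the paper's estimator. The weights are simply given (in the paper's setting they would come from the forest), there is a single covariate, and the ridge penalty `lam` is applied to the slope only. It contrasts a plain weighted average with a weighted local linear fit centered at the target point `x0`.

```python
def weighted_average(ys, weights):
    """Kernel-style prediction: weighted mean of the neighbors' outcomes."""
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)


def local_linear_prediction(xs, ys, weights, x0, lam=0.0):
    """Minimize sum_i w_i * (y_i - mu - theta * (x_i - x0))**2 + lam * theta**2
    and return mu, the fitted value at x0 (closed form for one covariate)."""
    ds = [x - x0 for x in xs]                      # covariates centered at the target
    s0 = sum(weights)
    s1 = sum(w * d for w, d in zip(weights, ds))
    s2 = sum(w * d * d for w, d in zip(weights, ds))
    t0 = sum(w * y for w, y in zip(weights, ys))
    t1 = sum(w * d * y for w, d, y in zip(weights, ds, ys))
    det = s0 * (s2 + lam) - s1 * s1
    if det == 0.0:
        # Degenerate neighborhood (e.g. all x_i identical): fall back to the mean.
        return t0 / s0
    return (t0 * (s2 + lam) - s1 * t1) / det


# Toy data on a linear signal y = 2x + 1; the target x0 = 0 sits at the boundary,
# so every neighbor lies to its right and the neighborhood is asymmetric.
xs = [0.0, 0.1, 0.2, 0.3]
ys = [2.0 * x + 1.0 for x in xs]
weights = [0.25, 0.25, 0.25, 0.25]   # assumed given; stand-ins for forest weights

print(weighted_average(ys, weights))                  # biased upward, about 1.3
print(local_linear_prediction(xs, ys, weights, 0.0))  # about 1.0, the true value at x0
```

The weighted average inherits the one-sided neighborhood and overshoots, while the local linear fit corrects for the imbalance, which is the behavior the passage attributes to regression adjustments near boundaries.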


Story Disambiguation: Tracking Evolving News Stories across News and Social Streams

arXiv.org Machine Learning

Following a particular news story online is an important but difficult task, as the relevant information is often scattered across different domains/sources (e.g., news articles, blogs, comments, tweets), presented in various formats and language styles, and may overlap with thousands of other stories. In this work we join the areas of topic tracking and entity disambiguation, and propose a framework named Story Disambiguation - a cross-domain story tracking approach that builds on real-time entity disambiguation and a learning-to-rank framework to represent and update the rich semantic structure of news stories. Given a target news story, specified by a seed set of documents, the goal is to effectively select new story-relevant documents from an incoming document stream. We represent stories as entity graphs and we model the story tracking problem as a learning-to-rank task. This enables us to track content with high accuracy, from multiple domains, in real-time. We study a range of text, entity and graph based features to understand which types of features are most effective for representing stories. We further propose new semi-supervised learning techniques to automatically update the story representation over time. Our empirical study shows that we outperform the accuracy of state-of-the-art methods for tracking mixed-domain document streams, while requiring less labeled data to seed the tracked stories. This is particularly the case for local news stories that are easily overshadowed by other trending stories, and for complex news stories with ambiguous content in noisy stream environments.


Rainfall Records Set Across North Carolina During Soggy July

U.S. News

The weather service reported Cape Hatteras got 20.31 inches (50 centimeters) of rain last month, well above the normal of 4.99 inches (12.66 centimeters), based on a 30-year average. It's the wettest July on record and the second wettest month ever, trailing only the 21.40 inches (54 centimeters) that fell on Cape Hatteras in September 1999 due to Hurricane Floyd.


Tulane University: Fundraising Record Set With $150M Raised

U.S. News

Among the major donations: $25 million from the family of Dr. John Winton Deming to name the John W. Deming Department of Medicine; and a $10 million gift from Tulane alumni Steven and Jann Paul to build the Steven and Jann Paul Hall for Science and Engineering. There also was an anonymous lead gift and other donations to begin construction on a $55 million building to be called The Commons, which will include a new dining hall and meeting spaces.