Artificial intelligence (AI) is widely used in today's businesses, for tasks such as data analytics, natural language processing, and process automation. Its emergence rests on decades of research into difficult computer science problems, and it is now rapidly transforming business model innovation. Companies that do not adopt AI risk losing ground to competitors equipped with it. While companies like Google, Amazon, and Tesla have already innovated their business models with AI, small and medium-sized companies have limited budgets for building such capabilities. One high-effort task in creating AI services is pre-processing data and training machine learning models.
Entity matching (EM) finds data instances that refer to the same real-world entity. In 2015, we started the Magellan project at UW-Madison, jointly with industrial partners, to build EM systems. Most current EM systems are stand-alone monoliths. In contrast, Magellan borrows ideas from the field of data science (DS) to build a new kind of EM system: an ecosystem of interoperable tools for multiple execution environments, such as on-premise, cloud, and mobile. This paper describes Magellan, focusing on the system aspects. We argue why EM can be viewed as a special class of DS problems and thus can benefit from system-building ideas in DS. We discuss how these ideas have been adapted to build PyMatcher and CloudMatcher, sophisticated on-premise tools for power users and self-service cloud tools for lay users. These tools exploit techniques from the fields of machine learning, big data scaling, efficient user interaction, databases, and cloud systems. They have been successfully used in 13 companies and domain science groups, have been pushed into production for many customers, and are being commercialized. We discuss the lessons learned and explore applying the Magellan template to other tasks in data exploration, cleaning, and integration.

Entity matching (EM) finds data instances that refer to the same real-world entity, such as the tuples (David Smith, UW-Madison) and (D. Smith, UWM). This problem, also known as entity resolution, record linkage, deduplication, or data matching, has been a long-standing challenge in the database, AI, KDD, and Web communities.2,6 As data-driven applications proliferate, EM will become even more important. For example, to analyze raw data for insights, we often integrate multiple raw data sets into a single unified one before performing the analysis, and such integration often requires EM.
To build a knowledge graph, we often start with a small graph and then expand it with new data sets, and such expansion requires EM. When managing a data lake, we often use EM to establish semantic linkages among the disparate data sets in the lake.
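The matching task just described can be sketched as a two-stage pipeline of blocking followed by similarity scoring. The token-level Jaccard measure and the 0.3 threshold below are illustrative assumptions, not the actual Magellan workflow:

```python
from itertools import product

def tokens(s):
    # Lowercase, strip periods, and split into word tokens.
    return set(s.lower().replace(".", "").split())

def jaccard(a, b):
    # Jaccard similarity over token sets.
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

# Two tables whose tuples may refer to the same person.
A = [("David Smith", "UW-Madison"), ("Ann Lee", "MIT")]
B = [("D. Smith", "UWM"), ("A. Lee", "CMU")]

# Blocking: keep only pairs sharing at least one name token,
# avoiding the quadratic all-pairs comparison.
candidates = [(x, y) for x, y in product(A, B)
              if tokens(x[0]) & tokens(y[0])]

# Matching: score surviving pairs on name similarity.
matches = [(x, y) for x, y in candidates if jaccard(x[0], y[0]) >= 0.3]
print(matches)
```

Real EM systems replace the single threshold with learned matchers and far more sophisticated blocking, but the blocking-then-matching shape is the same.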
Citizen Science Salon is a partnership between Discover and SciStarter.org. In 2016, a team of Alzheimer's disease researchers at Cornell University hit a dead end. The scientists were studying mice, looking for links between Alzheimer's and changes in blood flow in the brain. For years, scientists have known that reduced blood flow in the brain is a symptom of Alzheimer's disease. More recent research has also shown that this reduced blood flow can be caused by clogged blood vessels -- or "stalls." And by reversing these stalls in mice, scientists were able to restore the animals' memory.
Throughout the research world, artificial intelligence is increasingly being applied to scan complicated scientific literature more quickly than humans alone can. At Utrecht University, Prof. Rens van de Schoot and his team are part of an international research community now applying that technology to COVID-19 publications. In an edited email exchange with Diane M. Fresquez of Science Business, van de Schoot talks about his work and his search for collaborators (have you got coding talent?) – initially while under lockdown with his three children, aged six and under, who played quietly (or not so quietly) underfoot. Q. Tell us about your COVID-19 project. With the increase in COVID-19 research literature, and the urgency to find cures and treatments, it is essential that data collection is done in real time.
The International Conference on Learning Representations (ICLR) and the Consultative Group on International Agricultural Research (CGIAR) jointly conducted a challenge in which over 800 data scientists worldwide competed to detect diseases in crops from close-up photographs. The objective of the challenge was to build a machine learning algorithm that correctly classifies whether a plant is healthy, has stem rust, or has leaf rust. Wheat rust is a devastating plant disease that affects many crops, reducing yields, threatening farmers' livelihoods, and decreasing food security across Africa. The disease is difficult to monitor at a large scale, making it difficult to control and eradicate. An accurate image recognition model that can detect wheat rust from any image would enable a crowd-sourced approach to monitoring crops. The imagery data came from a variety of sources.
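As a minimal illustration of the three-class setup, here is a toy nearest-centroid classifier over hand-made feature vectors. The features and numbers are fabricated for illustration only; competitive challenge entries would instead train convolutional networks on the actual imagery:

```python
import math

# Toy training "features" per class (think of them as color-histogram
# summaries of leaf photos); all numbers here are made up.
train = {
    "healthy":   [[0.9, 0.1], [0.8, 0.2]],
    "stem_rust": [[0.2, 0.9], [0.3, 0.8]],
    "leaf_rust": [[0.5, 0.5], [0.6, 0.4]],
}

def centroid(vecs):
    # Component-wise mean of a list of vectors.
    return [sum(col) / len(vecs) for col in zip(*vecs)]

centroids = {label: centroid(vecs) for label, vecs in train.items()}

def classify(vec):
    # Assign the class whose centroid is closest in Euclidean distance.
    return min(centroids, key=lambda label: math.dist(vec, centroids[label]))

print(classify([0.85, 0.15]))  # → healthy
```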
With the ever-growing computational power of mobile devices, mobile visual search has undergone an evolution in techniques and applications. A significant trend is low-bit-rate visual search, where compact visual descriptors are extracted directly on the mobile device and delivered as queries, rather than raw images, to reduce query transmission latency. In this article, we introduce our work on low-bit-rate mobile landmark search, in which a compact yet discriminative landmark image descriptor is extracted by using location context such as GPS, crowd-sourced hotspot WLAN, and cell tower locations. The compactness originates from the bag-of-words image representation, with offline learning from geotagged photos on online photo-sharing websites such as Flickr and Panoramio. The learning process involves segmenting the landmark photo collection into discrete geographical regions using a Gaussian mixture model, and then boosting a ranking-sensitive vocabulary within each region, with "entropy"-based descriptor compactness feedback to refine both phases iteratively.
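The bag-of-words representation mentioned above quantizes each local descriptor against a visual vocabulary and keeps only a histogram of codeword counts, which is what makes the descriptor compact. A minimal sketch, with a made-up two-dimensional vocabulary standing in for a learned, high-dimensional one:

```python
import math

# A tiny "visual vocabulary" of codewords (made-up 2-D descriptors;
# real vocabularies hold thousands of high-dimensional codewords).
vocab = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

def quantize(desc):
    # Map a local descriptor to the index of its nearest codeword.
    return min(range(len(vocab)), key=lambda i: math.dist(desc, vocab[i]))

def bag_of_words(descriptors):
    # The compact image signature: a histogram of codeword counts.
    hist = [0] * len(vocab)
    for d in descriptors:
        hist[quantize(d)] += 1
    return hist

print(bag_of_words([[0.1, 0.0], [0.9, 0.1], [0.1, 0.9], [0.05, 0.05]]))
# → [2, 1, 1]
```

Only the short histogram needs to travel over the network, which is the source of the bandwidth savings.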
FIDE CM Kingscrusher goes over a game featuring an imprisoned bishop: Highly Evolved Leela vs Mighty Stockfish, TCEC Season 17, Rd 34. Play turn-style chess at http://bit.ly/chessworld. FIDE CM Kingscrusher goes over amazing games of chess every day, with a recent focus on chess champions such as Magnus Carlsen, and even on games between neural networks, which are opening up new concepts for how chess could be played more effectively. The game qualities that Kingscrusher looks for are generally amazing games with some awesome or astonishing features to them. Many brilliant games are played every year in chess, and this channel helps find and explain them in a clear way. There are classic games, crushing and dynamic games, and exceptionally elegant games.
Alfredo joined Element AI as a Research Engineer in the AI for Good lab in London, working on applications that support NGOs and non-profits. He is one of the primary co-authors of the first technical report made in partnership with Amnesty International, a large-scale study of online abuse against women on Twitter based on crowd-sourced data. He has been a machine learning mentor at NASA's Frontier Development Lab, helping teams apply AI to scientific space problems. More recently, he led joint research with Mila Montreal on multi-frame super-resolution, which won an award from the European Space Agency for top performance on the PROBA-V Super-Resolution challenge. His research interests lie in computer vision for satellite imagery, probabilistic modeling, and AI for social good.
Pairwise comparison data arise in many domains with subjective assessment experiments, for example in image and video quality assessment. In these experiments observers are asked to express a preference between two conditions. However, many pairwise comparison protocols require a large number of comparisons to infer accurate scores, which may be infeasible when each comparison is time-consuming (e.g., videos) or expensive (e.g., medical imaging). This motivates the use of an active sampling algorithm that chooses only the most informative pairs for comparison. In this paper we propose ASAP, an active sampling algorithm based on approximate message passing and expected information gain maximization. Unlike most existing methods, which rely on partial updates of the posterior distribution, we are able to perform full updates and therefore greatly improve the accuracy of the inferred scores. The algorithm relies on three techniques for reducing computational cost: inference based on approximate message passing, selective evaluation of the information gain, and selecting pairs in a batch that forms a minimum spanning tree of the inverse of the information gain. We demonstrate, with real and synthetic data, that ASAP offers the highest accuracy of inferred scores compared to existing methods. We also provide an open-source GPU implementation of ASAP for large-scale experiments.
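The batch-selection idea can be illustrated as follows: treat conditions as graph nodes, weight each edge by the inverse of its information gain, and take a minimum spanning tree, so that high-gain pairs (small inverse weights) are favored while every condition stays connected. The gain values below are made up, and this Kruskal sketch is an illustration of the idea, not the authors' implementation:

```python
def kruskal_mst(n, edges):
    """edges: (weight, u, v) triples. Returns MST edges via union-find."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x
    mst = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:          # adding this edge creates no cycle
            parent[ru] = rv
            mst.append((u, v))
    return mst

# Made-up expected information gains for pairs of 4 conditions.
gain = {(0, 1): 5.0, (0, 2): 1.0, (1, 2): 4.0, (1, 3): 0.5, (2, 3): 3.0}

# Edge weight = 1 / gain, so the MST prefers the most informative pairs.
edges = [(1.0 / g, u, v) for (u, v), g in gain.items()]
batch = kruskal_mst(4, edges)
print(batch)  # → [(0, 1), (1, 2), (2, 3)]
```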
Social media, especially Twitter, is being increasingly used for research with predictive analytics. In social media studies, natural language processing (NLP) techniques are used in conjunction with expert-based, manual, and qualitative analyses. However, social media data are unstructured and must undergo complex manipulation before research use. Manual annotation, in which multiple expert raters must reach consensus on every item, is the most resource- and time-consuming step, but it is essential for creating gold-standard datasets to train NLP-based machine learning classifiers. To reduce the burden of manual annotation while maintaining its reliability, we devised a crowdsourcing pipeline combined with active learning strategies. We demonstrated its effectiveness through a case study that identifies job-loss events from individual tweets. We used the Amazon Mechanical Turk platform to recruit annotators from the Internet and designed a number of quality control measures to ensure annotation accuracy. We evaluated four active learning strategies (least confident, entropy, vote entropy, and Kullback-Leibler divergence), which aim to reduce the number of tweets needed to reach a desired level of automated classification performance. Results show that crowdsourcing is useful for creating high-quality annotations and that active learning helps reduce the number of required tweets, although there was no substantial difference among the strategies tested.
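Two of the uncertainty measures evaluated above, least confident and entropy, can be sketched in a few lines; the class probabilities below are made up for illustration:

```python
import math

def least_confident(probs):
    # Uncertainty = 1 - probability of the most likely class.
    return 1.0 - max(probs)

def entropy(probs):
    # Shannon entropy of the predicted class distribution.
    return -sum(p * math.log(p) for p in probs if p > 0)

# Pick the tweet the classifier is least sure about for annotation next.
predictions = {
    "tweet_a": [0.9, 0.1],    # confident prediction: low priority
    "tweet_b": [0.55, 0.45],  # uncertain prediction: annotate next
}
pick = max(predictions, key=lambda t: entropy(predictions[t]))
print(pick)  # → tweet_b
```

Both measures rank the same here; vote entropy and KL divergence extend the idea to disagreement among a committee of classifiers rather than a single model's distribution.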