 Information Fusion


(Almost) All of Entity Resolution

arXiv.org Machine Learning

Whether the goal is to estimate the number of people who live in a congressional district, to estimate the number of individuals who have died in an armed conflict, or to disambiguate individual authors using bibliographic data, all these applications share a common theme: integrating information from multiple sources. Before such questions can be answered, databases must be cleaned and integrated in a systematic and accurate way, a task commonly known as record linkage, de-duplication, or entity resolution. In this article, we review motivational applications and seminal papers that have led to the growth of this area. Specifically, we review the foundational work that began in the 1940s and 1950s and led to modern probabilistic record linkage. We review clustering approaches to entity resolution, semi- and fully supervised methods, and canonicalization, which are being used throughout industry and academia in applications such as human rights, official statistics, medicine, and citation networks, among others. Finally, we discuss current research topics of practical importance.
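To make "probabilistic record linkage" concrete, here is a minimal sketch of one classic scoring scheme, the Fellegi-Sunter style match weight. The abstract does not name this model, so treat it as an illustrative choice rather than the review's own method. The field names and the m- and u-probabilities below are hypothetical values picked for the example, not estimates from any dataset.

```python
import math

# Hypothetical per-field probabilities (assumptions for illustration only):
#   m = P(field agrees | records refer to the same entity)
#   u = P(field agrees | records refer to different entities)
FIELDS = {
    "last_name": (0.95, 0.01),
    "birth_year": (0.90, 0.05),
    "zip": (0.85, 0.10),
}


def match_weight(rec_a, rec_b, fields=FIELDS):
    """Sum log-likelihood-ratio weights over the compared fields.

    Agreement on a field adds log2(m / u); disagreement adds
    log2((1 - m) / (1 - u)). Higher totals favour a match.
    """
    total = 0.0
    for field, (m, u) in fields.items():
        if rec_a.get(field) == rec_b.get(field):
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


a = {"last_name": "Smith", "birth_year": 1980, "zip": "02139"}
b = {"last_name": "Smith", "birth_year": 1980, "zip": "02139"}
c = {"last_name": "Jones", "birth_year": 1975, "zip": "10001"}

same_score = match_weight(a, b)   # all fields agree, so the weight is positive
diff_score = match_weight(a, c)   # all fields disagree, so the weight is negative
```

In practice the pair would be declared a match, a non-match, or sent for clerical review by comparing the total against two thresholds; exact equality would usually be replaced by a string-similarity comparison.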


Multifidelity Data Fusion via Gradient-Enhanced Gaussian Process Regression

arXiv.org Machine Learning

We propose a data fusion method based on a multi-fidelity Gaussian process regression (GPR) framework. The method combines available data on the quantity of interest (QoI) and its gradients at different fidelity levels; that is, it is a Gradient-enhanced Cokriging method (GE-Cokriging). It approximates both the QoI and its gradients simultaneously, with uncertainty estimates. We compare this method with the conventional multi-fidelity Cokriging method, which does not use gradient information, and the results suggest that GE-Cokriging predicts both the QoI and its gradients more accurately. Moreover, GE-Cokriging generalizes better in some cases where Cokriging performs poorly because the covariance matrix is singular. We demonstrate GE-Cokriging in several practical cases, including simultaneously reconstructing the trajectory and velocity of an underdamped oscillator over time, and investigating the sensitivity of the power factor of a load bus to varying power inputs of a generator bus in a large-scale power system. We also show that although GE-Cokriging has a slightly higher computational cost than Cokriging, the accuracy comparison shows that this cost is usually worth it.


Python ETL Tools: Best 8 Options

#artificialintelligence

ETL is the process of fetching data from one or more systems and loading it into a target data warehouse after some intermediate transformations. The market offers various ETL tools that can carry out this process. Some tools provide a complete end-to-end ETL implementation out of the box, some help you build a custom ETL process from scratch, and a few fall somewhere in between. In this post, we look at some commonly used Python ETL tools and the situations in which each may be a good fit for your project. Before going through the list of Python ETL tools, let's first understand some essential features that any ETL tool should have.
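As a rough illustration of the extract, transform, and load steps described above, here is a minimal sketch using only the Python standard library. It does not represent any of the tools the post goes on to list. The inline CSV, the `payments` table, and the cleaning rules are all hypothetical stand-ins for a real source system and warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source data standing in for an export from an upstream system.
RAW_CSV = """id,name,amount
1, alice ,10.50
2,BOB,3
3,carol,not_a_number
"""


def extract(text):
    """Extract: read rows from a CSV source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: tidy names, parse amounts, and drop rows that fail to parse."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount is not numeric
        clean.append((int(row["id"]), row["name"].strip().title(), amount))
    return clean


def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments "
        "(id INTEGER PRIMARY KEY, name TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO payments VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(text=RAW_CSV):
    """Run the three steps against an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    load(transform(extract(text)), conn)
    return conn.execute(
        "SELECT id, name, amount FROM payments ORDER BY id"
    ).fetchall()


result = run_etl()
```

Keeping the three stages as separate functions makes each one testable on its own, which is roughly the structure that dedicated ETL tools formalize with scheduling, retries, and monitoring on top.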


New Machine Learning Features, Data Integrations, and Upgraded Classification Engines Available in Grooper Version 2.9

#artificialintelligence

Grooper, the leading intelligent document processing and digital data integration platform, announces the release of version 2.9. Included are fourteen new capabilities that enhance machine learning, classification, separation, data integration, and reporting. New Machine Learning Features: Machine learning is easier and more powerful. The new Rebuild Training features provide tuning and A/B testing using identical training sets and document training decisions. Integration with Box: Built-in integration with Box.com enables file import and export, metadata mapping, data lookups, and more.


Supercharge Content Intelligence with AI

#artificialintelligence

Artificial intelligence (AI) creates abundant opportunities for a wide range of intelligent, automated business operations. Two vital capabilities, metadata extraction and data enrichment, rank among the most valuable and commonly used functions for businesses seeking to harness immediate value from organizational data and content. AI-driven techniques for rapidly sorting, filtering, categorizing, and adding context to massive volumes of data can help deliver a distinct business advantage. By combining accessible, cloud-based AI services with customizable, specialized AI tools and training, businesses can shape data and content services to better meet their objectives. Despite the accelerating, never-ending spiral of accumulating content, most businesses are neither gaining the insights they need nor seeing visible operational benefits, as asserted in a Software Development Times article.


Data Engineer - IoT BigData Jobs

#artificialintelligence

Bachelor's degree in Computer Science or other quantitative field; advanced degree is a plus
3-5 years of relevant work experience
Seasoned data engineer with experience in data integration from multiple sources (through SQL, NoSQL, REST API)
Use of ETL development, methodology and tools
Extensive experience using SQL
Strong background in software development methodology around unit testing, performance tuning, integration testing, etc.; Git experience a plus
Strong coding skills; has written code that brings in data from different types of data sources and performs ETL
Experience with analytical tools supporting data analysis and reporting (MicroStrategy, Tableau, etc.)
Able to explain complex technical concepts to technical and non-technical colleagues alike
Ability to apply principles of logical thinking to a wide range of intellectual and practical problems
Physical/Mental Demands: The physical and mental demands described here are representative of those that must be met by an employee to successfully perform the essential functions of this job. Reasonable accommodations may be made to enable individuals with disabilities to perform the essential functions.


Information Fusion on Belief Networks

arXiv.org Artificial Intelligence

This paper will focus on the process of 'fusing' several observations or models of uncertainty into a single resultant model. Many existing approaches to fusion use subjective quantities such as 'strengths of belief' and process these quantities with heuristic algorithms. This paper argues in favor of quantities that can be objectively measured, as opposed to the subjective 'strength of belief' values. This paper will focus on probability distributions, and more importantly, structures that denote sets of probability distributions known as 'credal sets'. The novel aspect of this paper will be a taxonomy of models of fusion that use specific types of credal sets, namely probability interval distributions and Dempster-Shafer models. An objective requirement for information fusion algorithms is provided, and is satisfied by all models of fusion presented in this paper. Dempster's rule of combination is shown to not satisfy this requirement. This paper will also assess the computational challenges involved for the proposed fusion approaches.
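For readers unfamiliar with Dempster's rule of combination, which the abstract says fails the proposed requirement, here is a minimal sketch of the standard rule for two mass functions. It is not taken from the paper and does not implement the paper's own fusion models. The frame `{a, b}` and the mass values are hypothetical.

```python
from itertools import product


def dempster_combine(m1, m2):
    """Combine two mass functions with Dempster's rule.

    Each mass function maps frozenset focal elements to masses summing to 1.
    Mass on pairs with an empty intersection is treated as conflict and
    removed by renormalising the remaining mass.
    """
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict; the rule is undefined")
    return {k: v / (1.0 - conflict) for k, v in combined.items()}


# Hypothetical evidence over the frame {a, b}.
m1 = {frozenset("a"): 0.6, frozenset("ab"): 0.4}
m2 = {frozenset("b"): 0.5, frozenset("ab"): 0.5}

fused = dempster_combine(m1, m2)
```

Here the conflicting mass is 0.6 × 0.5 = 0.3, so the surviving products are divided by 0.7. This renormalisation step is the part of the rule most often criticised in the fusion literature.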


AWS Data Exchange Challenge

#artificialintelligence

AWS Data Exchange makes it easy to find, subscribe to, and use third-party data in the cloud. Data scientists, data analysts, and developers in nearly every industry use AWS Data Exchange for access to third-party data to drive analytics, train machine-learning models, and make data-driven decisions. Today, AWS Data Exchange contains over 2,300 data products from 120 providers across a broad range of domains including healthcare, financial services, retail, and more. The AWS Data Exchange Challenge is an opportunity for you to show off your skills, learn something new, collaborate with other developers, and get a shot at part of $35,700 in prizes. You're invited to build solutions to answer tough questions using third-party data products from AWS Data Exchange.


Pentaho Kettle Solutions - Programmer Books

#artificialintelligence

This practical book is a complete guide to installing, configuring, and managing Pentaho Kettle. If you're a database administrator or developer, you'll first get up to speed on Kettle basics and how to apply Kettle to create ETL solutions – before progressing to specialized concepts such as clustering, extensibility, and data vault models. Learn how to design and build every phase of an ETL solution. Get the most out of Pentaho Kettle and your data warehousing with this detailed guide – from simple single table data migration to complex multisystem clustered data integration tasks.


A Flexible Optimization Framework for Regularized Matrix-Tensor Factorizations with Linear Couplings

arXiv.org Machine Learning

In many areas of science, various sensing technologies are used to obtain information about a single system of interest. Often, none of the datasets alone contains a complete view of the system, but the data measured from different modalities can complement each other. For instance, brain activity patterns can be captured using both electroencephalography (EEG) and functional magnetic resonance imaging (fMRI) signals, which have complementary temporal and spatial resolutions. Similarly, in metabolomics, multiple analytical techniques such as LCMS (Liquid Chromatography - Mass Spectrometry) and NMR (Nuclear Magnetic Resonance) spectroscopy are used to measure chemical compounds in biological samples, providing a more complete picture of underlying biological processes. Joint analysis of datasets from multiple sources, also referred to as data fusion (or multi-modal data mining), exploits these complementary measurements, and allows for better interpretability and, potentially, more accurate recovery of patterns characterizing the underlying phenomena. Nevertheless, data fusion poses many challenges, and there is an emerging need for data fusion methods that can take into account different characteristics of data from multiple sources in many disciplines [1-4]. Data from multiple sources can often be represented in the form of matrices and higher-order tensors. Coupled matrix and tensor factorizations (CMTF) are an effective approach for joint analysis of such datasets in many domains including social network analysis [5-8], neuroscience [9-13], and chemometrics [2, 14]. In such coupled factorizations, each dataset is modelled by a low-rank approximation.
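As a point of reference, the simplest form of a coupled matrix and tensor factorization can be written as below. This is a sketch of the basic, commonly cited CMTF objective, assuming a third-order tensor $\mathcal{X}$ and a matrix $Y$ that share their first mode; the symbols $A$, $B$, $C$, $V$ are notation chosen here, and the regularization and linear couplings named in the paper's title are not shown.

```latex
\min_{A,\,B,\,C,\,V}\;
  \bigl\| \mathcal{X} - [\![ A, B, C ]\!] \bigr\|_F^2
  \;+\;
  \bigl\| Y - A V^{\top} \bigr\|_F^2
```

Here $[\![ A, B, C ]\!]$ denotes the low-rank CP model of the tensor, and the factor matrix $A$ appears in both terms, which is what couples the two datasets.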