 Information Fusion


Exploiting Semantics for Big Data Integration

AI Magazine

There is a great deal of interest in big data, focusing mostly on data set size. The use of semantics in this integration process is key to building an approach that scales to large numbers of heterogeneous sources. ... descriptions and then integrating the data within this unified framework. Finally, we conclude by ... and (4) integrate the data across sources using this model. Karma has been used on a variety of types of data, including biological data, mobile phone data, geospatial data, and cultural heritage data. In order to illustrate the approach to integrating data in Karma, we will use an example from the cultural heritage domain. One challenge in integrating diverse data sources is the ability to import different data formats into a ... For example, in our museum use case, we received data in spreadsheets (figure 1), comma-separated values (CSV), JSON (figure 3), XML, and relational databases (figure ...


A Robust and Extensible Tool for Data Integration Using Data Type Models

AAAI Conferences

Integrating heterogeneous data sets has been a significant barrier to many analytics tasks, due to the variety in structure and level of cleanliness of raw data sets requiring one-off ETL code. We propose HiperFuse, which significantly automates the data integration process by providing a declarative interface, robust type inference, extensible domain-specific data models, and a data integration planner which optimizes for plan completion time. The proposed tool is designed for schema-less data querying, code reuse within specific domains, and robustness in the face of messy unstructured data. To demonstrate the tool and its reference implementation, we show the requirements and execution steps for a use case in which IP addresses from a web clickstream log are joined with census data to obtain average income for particular site visitors (IPs), and offer preliminary performance results and qualitative comparisons to existing data integration and ETL tools.


On the Diagnosis of Cyber-Physical Production Systems

AAAI Conferences

Cyber-Physical Production Systems (CPPSs) are in the focus of research, industry and politics: By applying new IT and new computer science solutions, production systems will become more adaptable, more resource-efficient and more user-friendly. The analysis and diagnosis of such systems is a major part of this trend: Plants should automatically detect wear, faults and suboptimal configurations. This paper reflects the current state of the art in diagnosis against the requirements of CPPSs, identifies three main gaps and gives application scenarios to outline first ideas for potential solutions to close these gaps.


Data Fusion by Matrix Factorization

arXiv.org Artificial Intelligence

For most problems in science and engineering we can obtain data sets that describe the observed system from various perspectives and record the behavior of its individual components. Heterogeneous data sets can be collectively mined by data fusion. Fusion can focus on a specific target relation and exploit directly associated data together with contextual data and data about the system's constraints. In the paper we describe a data fusion approach with penalized matrix tri-factorization (DFMF) that simultaneously factorizes data matrices to reveal hidden associations. The approach can directly consider any data that can be expressed in a matrix, including those from feature-based representations, ontologies, associations and networks. We demonstrate the utility of DFMF for a gene function prediction task with eleven different data sources and for prediction of pharmacologic actions by fusing six data sources. Our data fusion algorithm compares favorably to alternative data integration approaches and achieves higher accuracy than can be obtained from any single data source alone.
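As a rough illustration of what penalized matrix tri-factorization means, the idea can be written in the following form. The notation is an assumption made for this sketch and is not taken from the abstract: $R_{ij}$ is a data matrix relating object types $i$ and $j$, $G_i$ is a nonnegative low-rank factor shared by every relation that involves type $i$, $S_{ij}$ is a small factor specific to that relation, and $\Theta_i$ is a hypothetical constraint matrix for type $i$ with penalty weight $\lambda$.

```latex
% Sketch only: symbols are assumed for illustration, not quoted from the paper.
\[
  R_{ij} \;\approx\; G_i \, S_{ij} \, G_j^{\top},
  \qquad G_i \ge 0,
\]
\[
  \min_{G \ge 0,\; S}\;
  \sum_{(i,j)} \bigl\lVert R_{ij} - G_i S_{ij} G_j^{\top} \bigr\rVert_F^{2}
  \;+\; \lambda \sum_{i} \operatorname{tr}\!\bigl( G_i^{\top} \Theta_i \, G_i \bigr)
\]
```

Because each $G_i$ appears in every term that involves object type $i$, all matrices are factorized simultaneously and information flows between data sources through the shared factors.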


Localized Data Fusion for Kernel k-Means Clustering with Application to Cancer Biology

Neural Information Processing Systems

In many modern applications from, for example, bioinformatics and computer vision, samples have multiple feature representations coming from different data sources. Multiview learning algorithms try to exploit all of this available information to obtain a better learner in such scenarios. In this paper, we propose a novel multiple kernel learning algorithm that extends kernel k-means clustering to the multiview setting, which combines kernels calculated on the views in a localized way to better capture sample-specific characteristics of the data. We demonstrate the better performance of our localized data fusion approach on a human colon and rectal cancer data set by clustering patients. Our method finds more relevant prognostic patient groups than global data fusion methods when we evaluate the results with respect to three commonly used clinical biomarkers.


A convex formulation for hyperspectral image superresolution via subspace-based regularization

arXiv.org Machine Learning

Hyperspectral remote sensing images (HSIs) usually have high spectral resolution and low spatial resolution. Conversely, multispectral images (MSIs) usually have low spectral and high spatial resolutions. The problem of inferring images which combine the high spectral and high spatial resolutions of HSIs and MSIs, respectively, is a data fusion problem that has been the focus of recent active research due to the increasing availability of HSIs and MSIs retrieved from the same geographical area. We formulate this problem as the minimization of a convex objective function containing two quadratic data-fitting terms and an edge-preserving regularizer. The data-fitting terms account for blur, different resolutions, and additive noise. The regularizer, a form of vector Total Variation, promotes piecewise-smooth solutions with discontinuities aligned across the hyperspectral bands. The downsampling operator accounting for the different spatial resolutions, the non-quadratic and non-smooth nature of the regularizer, and the very large size of the HSI to be estimated lead to a hard optimization problem. We deal with these difficulties by exploiting the fact that HSIs generally "live" in a low-dimensional subspace and by tailoring the Split Augmented Lagrangian Shrinkage Algorithm (SALSA), which is an instance of the Alternating Direction Method of Multipliers (ADMM), to this optimization problem, by means of a convenient variable splitting. The spatial blur and the spectral linear operators linked, respectively, with the HSI and MSI acquisition processes are also estimated, and we obtain an effective algorithm that outperforms the state-of-the-art, as illustrated in a series of experiments with simulated and real-life data.


Deterministic Bayesian Information Fusion and the Analysis of its Performance

arXiv.org Machine Learning

Sensor networks are ubiquitous across many different domains, including wireless communications, temperature and process control, area surveillance, object tracking and numerous other fields [2, 6]. Large performance gains can be achieved in such networks by performing data fusion between the sensors, or combining information from the individual sensors to reach system-level decisions [9, 16, 24, 26]. The sensors are typically connected by wireless links to either a separate information collector (centralized fusion) or to each other (distributed fusion). Elementary fusion rules based on Boolean logic are used in many contexts due to their simplicity and ease of implementation. On the other hand, in most situations we have some knowledge of the statistical properties of the sensors' outputs, and designing fusion rules that take this into account can provide much better performance [17, 24]. The fusion rule can be built to satisfy any of various statistical optimality criteria, such as achieving the maximum likelihood or the minimum Bayes risk, under any other constraints of the problem [17].
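The contrast between Boolean fusion rules and statistically informed ones can be sketched in a few lines. The example below is a minimal illustration, not the method of the paper: it assumes binary sensor decisions, conditionally independent sensors, and known per-sensor detection and false-alarm probabilities. All function and parameter names (`and_rule`, `llr_fusion`, `p_detect`, `p_false_alarm`, `prior_h1`) are hypothetical.

```python
import math
from typing import Sequence


def and_rule(decisions: Sequence[int]) -> int:
    """Boolean AND: declare the event only if every sensor reports it."""
    return int(all(decisions))


def or_rule(decisions: Sequence[int]) -> int:
    """Boolean OR: declare the event if any sensor reports it."""
    return int(any(decisions))


def majority_rule(decisions: Sequence[int]) -> int:
    """Declare the event if more than half of the sensors report it."""
    return int(sum(decisions) > len(decisions) / 2)


def llr_fusion(
    decisions: Sequence[int],
    p_detect: Sequence[float],
    p_false_alarm: Sequence[float],
    prior_h1: float = 0.5,
) -> int:
    """Fuse binary decisions by summing per-sensor log-likelihood ratios.

    Assumes conditionally independent sensors (an assumption of this sketch).
    Deciding 1 when the total is positive picks the more probable hypothesis.
    """
    llr = math.log(prior_h1 / (1.0 - prior_h1))
    for u, pd, pf in zip(decisions, p_detect, p_false_alarm):
        if u:
            llr += math.log(pd / pf)
        else:
            llr += math.log((1.0 - pd) / (1.0 - pf))
    return int(llr > 0.0)


# One very reliable sensor fires; two weak sensors stay silent.
decisions = [1, 0, 0]
p_detect = [0.99, 0.6, 0.6]
p_false_alarm = [0.01, 0.4, 0.4]

print(majority_rule(decisions))                       # Boolean rule ignores reliability
print(llr_fusion(decisions, p_detect, p_false_alarm)) # weighted rule trusts sensor 0
```

In this made-up scenario the majority rule is outvoted by the two weak sensors, while the likelihood-ratio rule weights the reliable sensor more heavily and reaches the opposite decision.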


Decentralized Data Fusion and Active Sensing with Mobile Sensors for Modeling and Predicting Spatiotemporal Traffic Phenomena

arXiv.org Artificial Intelligence

The problem of modeling and predicting spatiotemporal traffic phenomena over an urban road network is important to many traffic applications such as detecting and forecasting congestion hotspots. This paper presents a decentralized data fusion and active sensing (D2FAS) algorithm for mobile sensors to actively explore the road network to gather and assimilate the most informative data for predicting the traffic phenomenon. We analyze the time and communication complexity of D2FAS and demonstrate that it can scale well with a large number of observations and sensors. We provide a theoretical guarantee on its predictive performance to be equivalent to that of a sophisticated centralized sparse approximation for the Gaussian process (GP) model: The computation of such a sparse approximate GP model can thus be parallelized and distributed among the mobile sensors (in a Google-like MapReduce paradigm), thereby achieving efficient and scalable prediction. We also theoretically guarantee its active sensing performance that improves under various practical environmental conditions. Empirical evaluation on real-world urban road network data shows that our D2FAS algorithm is significantly more time-efficient and scalable than state-of-the-art centralized algorithms while achieving comparable predictive performance.


The D-SCRIBE Process for Building a Scalable Ontology

AAAI Conferences

In this paper, we describe the D-SCRIBE process used to build ontologies that are expected to have significant domain expansion after their initial introduction and whose coverage of concepts needs to be validated for a series of related applications. This process has been used to build SCRIBE, a very modular, ambitious ontology for the information about events triggered by either humans or nature, response activities by agencies that provide public services in cities by using resources and assets (land parcels, buildings, vehicles, equipment) and their communication (requests, work orders, sensor reports). SCRIBE reuses concepts from previously existing ontologies and data exchange standards, and D-SCRIBE retains traceability to these source influences.


XML Matchers: approaches and challenges

arXiv.org Artificial Intelligence

Schema Matching, i.e. the process of discovering semantic correspondences between concepts adopted in different data source schemas, has been a key topic in Database and Artificial Intelligence research areas for many years. In the past, it was investigated mostly for classical database models (e.g., E/R schemas, relational databases, etc.). However, in recent years, the widespread adoption of XML in the most disparate application fields pushed a growing number of researchers to design XML-specific Schema Matching approaches, called XML Matchers, aiming at finding semantic matchings between concepts defined in DTDs and XSDs. XML Matchers do not just take well-known techniques originally designed for other data models and apply them to DTDs/XSDs, but they exploit specific XML features (e.g., the hierarchical structure of a DTD/XSD) to improve the performance of the Schema Matching process. The design of XML Matchers is currently a well-established research area. The main goal of this paper is to provide a detailed description and classification of XML Matchers. We first describe to what extent the specificities of DTDs/XSDs impact the Schema Matching task. Then we introduce a template, called XML Matcher Template, that describes the main components of an XML Matcher, their role and behavior. We illustrate how each of these components has been implemented in some popular XML Matchers. We consider our XML Matcher Template as the baseline for objectively comparing approaches that, at first glance, might appear as unrelated. The introduction of this template can be useful in the design of future XML Matchers. Finally, we analyze commercial tools implementing XML Matchers and introduce two challenging issues strictly related to this topic, namely XML source clustering and uncertainty management in XML Matchers.
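To make the notion of discovering correspondences between schema concepts concrete, here is a deliberately simple sketch of a name-based matcher. It is not one of the XML Matchers surveyed in the paper and it ignores the hierarchical structure that those tools exploit. It only compares element names as strings with Python's standard `difflib`. The function names, the example element names, and the `threshold` value are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from typing import Dict, Sequence


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_schemas(
    source: Sequence[str],
    target: Sequence[str],
    threshold: float = 0.6,
) -> Dict[str, str]:
    """Map each source element name to its most similar target name.

    A source name is left unmatched when its best score is below
    the (hypothetical) threshold.
    """
    matches: Dict[str, str] = {}
    for s in source:
        best = max(target, key=lambda t: name_similarity(s, t))
        if name_similarity(s, best) >= threshold:
            matches[s] = best
    return matches


# Element names from two made-up schemas describing books.
source_elements = ["authorName", "title", "pubYear", "isbn"]
target_elements = ["author_name", "bookTitle", "year", "price"]

for src, tgt in match_schemas(source_elements, target_elements).items():
    print(f"{src} -> {tgt}")
```

A purely lexical matcher like this one misses synonyms and cannot use parent/child context, which is the kind of limitation that structure-aware XML Matchers are designed to address.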