Information Fusion
10 Open Source ETL Tools
Pentaho Data Integration (Kettle) is a Java (Swing) application and library. Kettle is an interpreter of procedures written in XML format. Its features and components are a little less comprehensive than Talend's; however, this does not restrict the complexity of the ETL procedures that can be implemented. Kettle provides a JavaScript engine (as well as a Java one) to fine-tune the data manipulation process. Kettle is also a good tool, with everything necessary to build even complex ETL procedures.
Personal Data Fusion and the End of Information Overload
As a high-value professional, you know that information overload is becoming worse. You're continuously bombarded by more information from ever more apps. Perhaps you have three thousand unread emails, or thirty thousand, or even three hundred thousand. And you expect your information overload to intensify in the future, because it's the price you pay for your professional success. You cannot afford to miss out on new tools for communicating with colleagues, new media for expanding your professional network, and new information sources providing updates relevant to your priorities.
Engineers Shouldn't Write ETL: A Guide to Building a High Functioning Data Science Department
"What is the relationship like between your team and the data scientists?" This is, without a doubt, the question I'm most frequently asked when conducting interviews for data platform engineers. It's a fine question – one that, given the state of engineering jobs in the data space, is essential to ask as part of doing due diligence in evaluating new opportunities. I'm always happy to answer. But I wish I didn't have to, because this is a question that is motivated by skepticism and fear. If you read the recruiting propaganda of data science and algorithm development departments in the valley, you might be convinced that the relationship between data scientists and engineers is highly collaborative, organic, and creative. However, it's not a well-kept secret that this is seldom the case. Most shops foster a relationship between engineers and scientists that lies somewhere in the spectrum between non-existent and highly dysfunctional. Data scientists: the folks who are "better engineers than statisticians and better statisticians than engineers".
Discriminative models for robust image classification
A variety of real-world tasks involve the classification of images into pre-determined categories. Designing image classification algorithms that exhibit robustness to acquisition noise and image distortions, particularly when the available training data are insufficient to learn accurate models, is a significant challenge. This dissertation explores the development of discriminative models for robust image classification that exploit underlying signal structure, via probabilistic graphical models and sparse signal representations. Probabilistic graphical models are widely used in many applications to approximate high-dimensional data in a reduced complexity set-up. Learning graphical structures to approximate probability distributions is an area of active research. Recent work has focused on learning graphs in a discriminative manner with the goal of minimizing classification error. In the first part of the dissertation, we develop a discriminative learning framework that exploits the complementary yet correlated information offered by multiple representations (or projections) of a given signal/image. Specifically, we propose a discriminative tree-based scheme for feature fusion by explicitly learning the conditional correlations among such multiple projections in an iterative manner. Experiments reveal the robustness of the resulting graphical model classifier to training insufficiency.
The interference immunity of the telemetric information data exchange with autonomous mobile robots
The purpose of this work is to determine the interference immunity of telemetric information data exchange with autonomous mobile robots using spread spectrum signals with variable entropy. The results have been obtained by theoretical investigation and confirmed by modeling experiments. The interference immunity is expressed as the dependence of bit error probability on the normalized signal/noise ratio. It has been proved that, under the condition of equal time complexity, the interference immunity factor (the required normalized signal/noise ratio) is at least 2 dB better than that of correlation processing methods for orthogonal signals. This dependence has been obtained for the first time for data exchange by spread spectrum signals with variable entropy.
Multimodal Task-Driven Dictionary Learning for Image Classification
Bahrampour, Soheil, Nasrabadi, Nasser M., Ray, Asok, Jenkins, W. Kenneth
Dictionary learning algorithms have been successfully used for both reconstructive and discriminative tasks, where an input signal is represented with a sparse linear combination of dictionary atoms. While these methods are mostly developed for single-modality scenarios, recent studies have demonstrated the advantages of feature-level fusion based on the joint sparse representation of the multimodal inputs. In this paper, we propose a multimodal task-driven dictionary learning algorithm under the joint sparsity constraint (prior) to enforce collaborations among multiple homogeneous/heterogeneous sources of information. In this task-driven formulation, the multimodal dictionaries are learned simultaneously with their corresponding classifiers. The resulting multimodal dictionaries can generate discriminative latent features (sparse codes) from the data that are optimized for a given task such as binary or multiclass classification. Moreover, we present an extension of the proposed formulation using a mixed joint and independent sparsity prior which facilitates more flexible fusion of the modalities at feature level. The efficacy of the proposed algorithms for multimodal classification is illustrated on four different applications -- multimodal face recognition, multi-view face recognition, multi-view action recognition, and multimodal biometric recognition. It is also shown that, compared to the counterpart reconstructive-based dictionary learning algorithms, the task-driven formulations are more computationally efficient in the sense that they can be equipped with more compact dictionaries and still achieve superior performance.
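The idea of representing an input signal as a sparse linear combination of dictionary atoms can be illustrated with a small sketch. The code below is not the multimodal task-driven algorithm of the paper: it assumes a single modality and a fixed, hand-made dictionary, and it uses orthogonal matching pursuit, a standard greedy sparse coder chosen here only for illustration. The function name `omp` and the toy dictionary are hypothetical.

```python
import numpy as np

def omp(dictionary, signal, n_nonzero):
    """Orthogonal matching pursuit (illustrative, single modality).

    Approximates `signal` with at most `n_nonzero` columns (atoms) of
    `dictionary` and returns the sparse coefficient vector.
    """
    signal = np.asarray(signal, dtype=float)
    residual = signal.copy()
    support = []
    coef = np.zeros(dictionary.shape[1])
    for _ in range(n_nonzero):
        if np.linalg.norm(residual) < 1e-12:
            break  # signal already explained by the chosen atoms
        # Pick the atom most correlated with the current residual.
        best = int(np.argmax(np.abs(dictionary.T @ residual)))
        if best in support:
            break  # no new atom improves the fit
        support.append(best)
        # Refit all selected atoms jointly by least squares.
        sub = dictionary[:, support]
        sol, *_ = np.linalg.lstsq(sub, signal, rcond=None)
        coef[:] = 0.0
        coef[support] = sol
        residual = signal - sub @ sol
    return coef

# Toy dictionary (assumption): three axis-aligned atoms plus one diagonal atom.
atoms = np.array([
    [1.0, 0.0, 0.0, 1.0 / np.sqrt(2.0)],
    [0.0, 1.0, 0.0, 1.0 / np.sqrt(2.0)],
    [0.0, 0.0, 1.0, 0.0],
])
x = np.array([2.0, 0.0, -3.0])  # equals 2 * atom 0 - 3 * atom 2
codes = omp(atoms, x, n_nonzero=2)
```

In this toy case the sparse code uses only atoms 0 and 2. A joint sparsity prior, as described in the abstract, would additionally encourage the codes of several modalities to share the same support, which this single-modality sketch does not model.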
Bidirectional Constraints for Exchanging Data: Beyond Monotone Queries
Arenas, Marcelo (Pontificia Universidad Católica de Chile) | Diéguez, Gabriel (Pontificia Universidad Católica de Chile) | Pérez, Jorge (Universidad de Chile)
In this paper, we propose to use the language of bidirectional constraints to specify schema mappings in the context of data exchange. These constraints impose restrictions over both the source and the target data, and have the potential to minimize the ambiguity in the description of the target data to be materialized. We start by making a case for the usefulness of bidirectional constraints to give a meaningful closed-world semantics for st-tgds, which is motivated by Clark's predicate completion and Reiter's formalization of the closed-world assumption of a logical theory. We then formally study the use of bidirectional constraints in data exchange. In particular, we pinpoint the complexity of the existence-of-solutions and the query evaluation problems in several different scenarios, including in the latter case both monotone and non-monotone queries.
Tensor Analysis and Fusion of Multimodal Brain Images
Karahan, Esin, Rojas-Lopez, Pedro A., Bringas-Vega, Maria L., Valdes-Hernandez, Pedro A., Valdes-Sosa, Pedro A.
Current high-throughput data acquisition technologies probe dynamical systems with different imaging modalities, generating massive data sets at different spatial and temporal resolutions posing challenging problems in multimodal data fusion. A case in point is the attempt to parse out the brain structures and networks that underpin human cognitive processes by analysis of different neuroimaging modalities (functional MRI, EEG, NIRS etc.). We emphasize that the multimodal, multi-scale nature of neuroimaging data is well reflected by a multi-way (tensor) structure where the underlying processes can be summarized by a relatively small number of components or "atoms". We introduce Markov-Penrose diagrams - an integration of Bayesian DAG and tensor network notation in order to analyze these models. These diagrams not only clarify matrix and tensor EEG and fMRI time/frequency analysis and inverse problems, but also help understand multimodal fusion via Multiway Partial Least Squares and Coupled Matrix-Tensor Factorization. We show here, for the first time, that Granger causal analysis of brain networks is a tensor regression problem, thus allowing the atomic decomposition of brain networks. Analysis of EEG and fMRI recordings shows the potential of the methods and suggests their use in other scientific domains.
Structured Matrix Completion with Applications to Genomic Data Integration
Cai, Tianxi, Cai, T. Tony, Zhang, Anru
Matrix completion has attracted significant recent attention in many fields including statistics, applied mathematics and electrical engineering. Current literature on matrix completion focuses primarily on independent sampling models under which the individual observed entries are sampled independently. Motivated by applications in genomic data integration, we propose a new framework of structured matrix completion (SMC) to treat structured missingness by design. Specifically, our proposed method aims at efficient matrix recovery when a subset of the rows and columns of an approximately low-rank matrix are observed. We provide theoretical justification for the proposed SMC method and derive a lower bound for the estimation errors, which together establish the optimal rate of recovery over certain classes of approximately low-rank matrices. Simulation studies show that the method performs well in finite samples under a variety of configurations. The method is applied to integrate several ovarian cancer genomic studies with different extents of genomic measurements, which enables us to construct more accurate prediction rules for ovarian cancer survival.
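The setting of observed rows and columns can be made concrete with a small sketch. The code below is not the SMC estimator of the paper, which targets approximately low-rank matrices. It only illustrates the idealized case, assuming the matrix A = [[A11, A12], [A21, A22]] is exactly low rank and that the observed block A11 has the same rank as A, in which case the missing block satisfies A22 = A21 A11^+ A12, with A11^+ the pseudoinverse. The function name `complete_block` and the toy factors are hypothetical.

```python
import numpy as np

def complete_block(a11, a12, a21, rcond=1e-8):
    """Recover the unobserved block A22 from the observed blocks.

    Assumes the full matrix is exactly low rank and rank(A11) == rank(A).
    Singular values of A11 below rcond * (largest singular value) are
    treated as zero when forming the pseudoinverse.
    """
    return a21 @ np.linalg.pinv(a11, rcond=rcond) @ a12

# Toy rank-2 matrix (assumption): A = U @ V with small integer factors.
u = np.array([[1, 0], [0, 1], [1, 1], [2, 1], [1, 3], [0, 2]], dtype=float)
v = np.array([[1, 2, 0, 1, 3], [0, 1, 1, 2, 1]], dtype=float)
a = u @ v  # 6 x 5, rank 2

# Observed: the first 3 rows and the first 3 columns. Missing: a[3:, 3:].
a11, a12 = a[:3, :3], a[:3, 3:]
a21, a22_true = a[3:, :3], a[3:, 3:]

a22_hat = complete_block(a11, a12, a21)
max_error = float(np.max(np.abs(a22_hat - a22_true)))
```

Here `max_error` is at the level of floating-point round-off, because the rank condition holds exactly. With noisy or only approximately low-rank data, the truncation threshold `rcond` would have to be chosen with care, which is the kind of question the abstract's theoretical analysis addresses.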