Information Fusion
Covariance-Generalized Matching Component Analysis for Data Fusion and Transfer Learning
Lorenzo, Nick, O'Rourke, Sean, Scarnati, Theresa
The matching component analysis (MCA) transfer learning technique was originally developed as a data augmentation strategy for building large, representative machine learning training sets within a data-limited environment [1]. Specifically, MCA maps a training domain and a testing domain into a low-dimensional, common domain using only a small number of matched train-test image pairs. These maps minimize the expected distance between train-test image pairs within the common domain, subject to an identity matrix covariance constraint and an affine linear structure. The training domain's optimal affine linear transformation - encoded with information from the matched train-test image pairs - is then applied to a large number of unmatched training images, resulting in a large number of common-domain image representations to be used as training inputs. We are interested in extending the MCA application space to the fusion of data acquired from two different modalities.
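The optimization described in this abstract can be written out as a rough sketch. All symbols here are assumptions introduced for illustration, not notation from the paper: X_1 and X_2 are a matched train-test pair, A_i and b_i define the affine maps, and k is the dimension of the common domain. The squared Euclidean norm is also an assumption; the abstract only says "expected distance".

```latex
\begin{aligned}
\min_{A_1,\, b_1,\, A_2,\, b_2} \quad
  & \mathbb{E}\,\bigl\| (A_1 X_1 + b_1) - (A_2 X_2 + b_2) \bigr\|_2^2 \\
\text{subject to} \quad
  & \operatorname{Cov}(A_i X_i + b_i) = I_k, \qquad i = 1, 2 .
\end{aligned}
```

Under this reading, the identity covariance constraint rules out the trivial solution in which both maps collapse every image to the same point.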
A Roadmap to Domain Knowledge Integration in Machine Learning
Gupta, Himel Das, Sheng, Victor S.
Many machine learning algorithms have been developed in recent years to enhance the performance of models in different areas of artificial intelligence. But problems persist due to inadequate data and resources. Integrating knowledge into a machine learning model can help to overcome these obstacles to a certain degree. Incorporating knowledge is a complex task, though, because knowledge can be represented in many forms. In this paper, we give a brief overview of these different forms of knowledge integration and their performance in certain machine learning tasks.
Are Data Silos Undermining Digital Transformation? - ReadWrite
At a time of seemingly ultrarapid digital disruption, digital transformation in an enterprise needs a bold vision and an intent to embrace change. With the global digital transformation market projected to reach $2.8 trillion in 2025, leaders are expediting the transition to digital across their organizations. And as enterprises course-correct and adapt their strategies along this journey, they need a sound understanding of their data to drive informed decisions. That understanding matters because high-quality data is at the heart of all digitalization initiatives, from delivering invaluable insights to uncovering latent operational efficiency strategies. That is why organizations must be careful about creating data silos. Today, 73.5% of leading companies are data-driven in their decision-making.
AI + OCR - A Key Ingredient To Digital
Countless human hours are required to manually extract data into a machine-readable format. This process is known as ETL (extract, transform, and load). Insurers that can maximize their ETL capabilities have a powerful competitive advantage. Optical character recognition (OCR), also known as text recognition, converts text from scanned paper documents, photos, books, and PDF files into a machine-readable format, and it isn't new. What is new is coupling OCR with AI and machine-learning algorithms to reliably generate text that can be processed, indexed, and retrieved.
Learning Instrumental Variable from Data Fusion for Treatment Effect Estimation
Wu, Anpeng, Kuang, Kun, Xiong, Ruoxuan, Zhu, Minqing, Liu, Yuxuan, Li, Bo, Liu, Furui, Wang, Zhihua, Wu, Fei
The advent of the big data era has brought new opportunities and challenges for estimating treatment effects in data fusion, that is, in a mixed dataset collected from multiple sources (each source with an independent treatment assignment mechanism). Due to possibly omitted source labels and unmeasured confounders, traditional methods cannot estimate individual treatment assignment probability and infer treatment effect effectively. Therefore, we propose to reconstruct the source label and model it as a Group Instrumental Variable (GIV) to implement IV-based regression for treatment effect estimation. In this paper, we conceptualize this line of thought and develop a unified framework (Meta-EM) to (1) map the raw data into a representation space to construct Linear Mixed Models for the assigned treatment variable; (2) estimate the distribution differences and model the GIV for the different treatment assignment mechanisms; and (3) adopt an alternating training strategy to iteratively optimize the representations and the joint distribution to model GIV for IV regression. Empirical results demonstrate the advantages of our Meta-EM compared with state-of-the-art methods.
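To make the role of a group instrument concrete, here is a minimal sketch of plain IV regression on simulated data. It is not Meta-EM: the source label is observed directly here, whereas the abstract describes reconstructing it. The data-generating process, the function names, and all coefficients are assumptions chosen for illustration. With a single instrument and no covariates, the ratio estimator below coincides with two-stage least squares.

```python
import random


def covariance(a, b):
    """Population covariance of two equally long sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)


def simulate(n=20000, true_effect=2.0, seed=0):
    """Hypothetical two-source dataset with an unmeasured confounder."""
    rng = random.Random(seed)
    sources, treatments, outcomes = [], [], []
    for _ in range(n):
        source = rng.randint(0, 1)      # source label, used as the instrument
        confounder = rng.gauss(0.0, 1.0)  # never shown to the estimators
        # Each source shifts treatment assignment differently.
        treatment = 1.0 * source + confounder + rng.gauss(0.0, 1.0)
        # The source affects the outcome only through the treatment.
        outcome = true_effect * treatment + 3.0 * confounder + rng.gauss(0.0, 1.0)
        sources.append(float(source))
        treatments.append(treatment)
        outcomes.append(outcome)
    return sources, treatments, outcomes


def ols_slope(treatments, outcomes):
    """Naive regression slope; biased when a confounder is omitted."""
    return covariance(treatments, outcomes) / covariance(treatments, treatments)


def iv_slope(sources, treatments, outcomes):
    """Ratio (Wald) IV estimator using the source label as the instrument."""
    return covariance(sources, outcomes) / covariance(sources, treatments)


sources, treatments, outcomes = simulate()
naive_estimate = ols_slope(treatments, outcomes)
iv_estimate = iv_slope(sources, treatments, outcomes)
print(f"naive OLS estimate: {naive_estimate:.2f}")
print(f"IV estimate:        {iv_estimate:.2f}")
```

In this simulation the naive slope absorbs the confounder's influence, while the IV estimate should land close to the simulated effect of 2.0, because the source label shifts the treatment without touching the confounder.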
How to Test PySpark ETL Data Pipeline
"Garbage in, garbage out" is a common expression used to emphasize the importance of data quality for tasks such as machine learning, data analytics, and business intelligence. With an increasing amount of data being created and stored, building high-quality data pipelines has never been more challenging. PySpark is a commonly used tool for building ETL pipelines for large datasets. A common question that arises while building a data pipeline is "How do we know that our data pipeline is transforming the data in the way that is intended?". To answer this question, we borrow the idea of unit testing from the software development paradigm.
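The unit-testing idea can be sketched as follows. To keep the example runnable without a Spark installation, it uses plain Python dictionaries as a stand-in for DataFrame rows; in a PySpark job the same pattern would apply to a function that takes and returns a DataFrame. The transform `add_total` and its column names are hypothetical and do not come from the article.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def add_total(rows: List[Row]) -> List[Row]:
    """Hypothetical transform step: drop rows with a missing price or
    quantity, then add a 'total' column to the remaining rows."""
    cleaned = []
    for row in rows:
        if row.get("price") is None or row.get("quantity") is None:
            continue
        # Build a new dict so the caller's input rows are left untouched.
        cleaned.append({**row, "total": row["price"] * row["quantity"]})
    return cleaned


def test_add_total_drops_incomplete_rows_and_computes_total():
    # Small hand-written input with a known expected output.
    source = [
        {"price": 2.0, "quantity": 3.0},
        {"price": None, "quantity": 1.0},
        {"price": 5.0, "quantity": None},
    ]
    expected = [{"price": 2.0, "quantity": 3.0, "total": 6.0}]
    assert add_total(source) == expected


def test_add_total_does_not_mutate_input():
    source = [{"price": 1.5, "quantity": 2.0}]
    add_total(source)
    assert source == [{"price": 1.5, "quantity": 2.0}]
```

The test functions follow the naming convention that pytest collects automatically. Keeping each transform a pure function of its input is what makes this kind of small, hand-checked fixture practical.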
Synatic Secures $2.5 Million in Seed Extension Funding
Synatic, a leader in data integration and automation, has secured an additional $2.5 million in a seed extension funding round led by Allan Gray E-Squared Ventures and UW Ventures. Synatic will use the additional funds to expand market reach in the United States in preparation for Series A funding early in 2023. Participating in the seed extension round are Allan Gray E-Squared Ventures (AGEV), UW Ventures, Adansonia PE Opportunities VCC, and the Endeavor Harvest Fund. AGEV and UW Ventures are leading investment management and venture firms based in South Africa. Adansonia PE Opportunities VCC (APEO) is an African opportunities permanent capital structure based in Singapore.
Multimodal Information Bottleneck: Learning Minimal Sufficient Unimodal and Multimodal Representations
Mai, Sijie, Zeng, Ying, Hu, Haifeng
Learning effective joint embedding for cross-modal data has always been a focus in the field of multimodal machine learning. We argue that during multimodal fusion, the generated multimodal embedding may be redundant, and the discriminative unimodal information may be ignored, which often interferes with accurate prediction and leads to a higher risk of overfitting. Moreover, unimodal representations also contain noisy information that negatively influences the learning of cross-modal dynamics. To this end, we introduce the multimodal information bottleneck (MIB), aiming to learn a powerful and sufficient multimodal representation that is free of redundancy and to filter out noisy information in unimodal representations. Specifically, inheriting from the general information bottleneck (IB), MIB aims to learn the minimal sufficient representation for a given task by maximizing the mutual information between the representation and the target and simultaneously constraining the mutual information between the representation and the input data. Different from general IB, our MIB regularizes both the multimodal and unimodal representations, which is a comprehensive and flexible framework that is compatible with any fusion methods. We develop three MIB variants, namely, early-fusion MIB, late-fusion MIB, and complete MIB, to focus on different perspectives of information constraints. Experimental results suggest that the proposed method reaches state-of-the-art performance on the tasks of multimodal sentiment analysis and multimodal emotion recognition across three widely used datasets. The code is available at https://github.com/TmacMai/Multimodal-Information-Bottleneck.
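The trade-off described in this abstract matches the standard information bottleneck objective, sketched below. The symbols are assumptions introduced here: X is the input, Y the target, Z the learned representation, and \beta > 0 a trade-off weight. The second expression is only a schematic reading of "regularizes both the multimodal and unimodal representations", with a hypothetical index r ranging over the multimodal representation and each unimodal one; it is not the paper's exact loss.

```latex
% General information bottleneck: keep what predicts Y, compress away the rest of X.
\max_{\theta} \; I(Z; Y) \;-\; \beta \, I(Z; X)

% Schematic multi-representation variant (an assumption, not the paper's formula):
\max_{\theta} \; \sum_{r} \Bigl[ I(Z_r; Y) \;-\; \beta_r \, I(Z_r; X_r) \Bigr]
```

A larger \beta pushes the representation toward discarding more input information, at the cost of possibly losing task-relevant signal.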
AWS re:Invent 2022 roundup: Data management, AI, compute take center stage
As businesses grapple with growing volumes of data collected and generated by a myriad of cloud-based applications, Amazon Web Services (AWS) unveiled a wide range of new applications and product enhancements this week at its annual re:Invent conference that are geared to optimize data analytics and governance, and bolster the computing infrastructure to do so. Over the last few days, the company launched new services and features across its storage, compute, analytics, machine learning, databases, and security services, and made its first foray into supply chain management. Here is a roundup of the major announcements, with links to articles containing more details about the updates. A major theme at re:Invent 2022 was Amazon's efforts to ease data management and analytics for enterprises, as the company announced a dozen updates to data services. The updates included the launch of two new capabilities--Amazon Aurora zero-ETL integration with Amazon Redshift and Amazon Redshift integration for Apache Spark--that it claims will make the extract, transform, load (ETL) process obsolete.