 Information Fusion


Data analytics practices plagued with inefficiencies

#artificialintelligence

Data analytics practices are plagued with inefficiencies, according to a new report from automated data integration provider Fivetran. Polling circa 500 data professionals, the firm uncovered "surprising" information surrounding how data analysts spend their working days and the challenges they face. According to Fivetran, most data analysts spend less than half of the day actually analysing data. Much of the rest of the day is wasted as a result of various bottlenecks. For example, more than 60 percent reported wasting time waiting for engineering resources, multiple times a month.


Fair Data Integration

arXiv.org Machine Learning

The use of machine learning (ML) in high-stakes societal decisions has encouraged the consideration of fairness throughout the ML lifecycle. Although data integration is one of the primary steps to generate high quality training data, most of the fairness literature ignores this stage. In this work, we consider fairness in the integration component of data management, aiming to identify features that improve prediction without adding any bias to the dataset. We work under the causal interventional fairness paradigm. Without requiring the underlying structural causal model a priori, we propose an approach to identify a sub-collection of features that ensure the fairness of the dataset by performing conditional independence tests between different subsets of features. We use group testing to improve the complexity of the approach. We theoretically prove the correctness of the proposed algorithm to identify features that ensure interventional fairness and show that sub-linear conditional independence tests are sufficient to identify these variables. A detailed empirical evaluation is performed on real-world datasets to demonstrate the efficacy and efficiency of our technique.
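The group-testing idea mentioned in the abstract can be illustrated with a generic adaptive binary-splitting sketch. This is not the paper's algorithm: `group_passes` is a hypothetical oracle standing in for a conditional independence test applied to a whole subset of features, and the feature names and the `bad` set are made up for the demonstration. The sketch also assumes that a test on a group is reliable, meaning it fails exactly when the group contains at least one problematic feature.

```python
def find_flagged(features, group_passes):
    """Return the features that fail the test, using adaptive binary splitting.

    group_passes(subset) is a hypothetical oracle: it returns True when no
    feature in the subset is problematic.
    """
    flagged = []
    stack = [list(features)]
    while stack:
        group = stack.pop()
        if not group or group_passes(group):
            continue  # the whole group is cleared with a single test
        if len(group) == 1:
            flagged.append(group[0])  # a failing singleton is a flagged feature
            continue
        mid = len(group) // 2
        # Push the right half first so the left half is examined first.
        stack.append(group[mid:])
        stack.append(group[:mid])
    return flagged


# Illustrative data: 64 hypothetical features, two of which are problematic.
features = [f"f{i}" for i in range(64)]
bad = {"f3", "f40"}
calls = {"n": 0}


def group_passes(subset):
    """Toy oracle that counts how many tests were run."""
    calls["n"] += 1
    return not any(name in bad for name in subset)


result = find_flagged(features, group_passes)
print(result, "found with", calls["n"], "tests for", len(features), "features")
```

When only a few features are problematic, whole groups are cleared with one test each, so the number of tests grows with the number of flagged features and the logarithm of the feature count rather than with the feature count itself.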


The Hateful Memes Challenge: Detecting Hate Speech in Multimodal Memes

arXiv.org Artificial Intelligence

This work proposes a new challenge set for multimodal classification, focusing on detecting hate speech in multimodal memes. It is constructed such that unimodal models struggle and only multimodal models can succeed: difficult examples ("benign confounders") are added to the dataset to make it hard to rely on unimodal signals. The task requires subtle reasoning, yet is straightforward to evaluate as a binary classification problem. We provide baseline performance numbers for unimodal models, as well as for multimodal models with various degrees of sophistication. We find that state-of-the-art methods perform poorly compared to humans (64.73% vs. 84.7%


NLP: Some useful notes about Text Processing

#artificialintelligence

We often hear about machine learning, but other pipelines that come before it also play a significant role in the study of Big Data. Some examples are ETL (extract, transform and load) and NLP (natural language processing). The NLP pipeline in particular is taking up more and more space in data science. So, what is natural language processing? It is a process that lets data scientists and data analysts extract important information from human language. For example, with NLP it is possible to find an important pattern by studying the text of posts or comments on a social network.
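As a minimal sketch of finding a pattern in social-network posts, the snippet below tokenizes a few made-up posts and counts the most frequent terms using only the Python standard library. The posts, the tiny `STOP_WORDS` set, and the function names are all illustrative assumptions; a real pipeline would typically use a dedicated NLP library and a fuller stop-word list.

```python
import re
from collections import Counter

# Hypothetical stop-word list, kept tiny for illustration.
STOP_WORDS = {"the", "a", "is", "and", "to", "of", "this", "i", "it", "but"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def top_terms(posts, n=3):
    """Count non-stop-word tokens across all posts and return the n most common."""
    counts = Counter(
        token
        for post in posts
        for token in tokenize(post)
        if token not in STOP_WORDS
    )
    return counts.most_common(n)


# Made-up example posts.
posts = [
    "The battery life of this phone is great",
    "Great camera, but the battery drains fast",
    "I love the camera and the battery",
]
print(top_terms(posts, 3))
```

Here the recurring terms ("battery", "great", "camera") hint at what users talk about most, which is the kind of simple pattern the paragraph describes.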


Speech Analytics Market Future Aspect Analysis and Current Trends by 2017 to 2025 - Distinct Analysis & Reports

#artificialintelligence

Speech analytics technologies are used to extract information at customer contact points across various channels such as voice, chat, email, social channels, and surveys. Across the world, voice and phone interaction is the most common mode of communication used by consumers. Therefore, speech analytics is used in Voice User Interface (VUI) to derive insights at different contact points. Organizations across various industry sectors are currently undertaking programs for transcribing and analyzing customer and organizational media, mainly to make sound decisions for customer and business management with the help of speech and text intelligence.


A deep "data lake" for coronavirus information

#artificialintelligence

An AI software provider has created a sprawling new "data lake" of information about the COVID-19 pandemic for researchers around the world. Why it matters: In just a few short months, researchers have generated an astounding amount of data about COVID-19. Putting much of that information in an easily readable source will enable researchers and policymakers to get the most out of big data. How it works: For all the rich data being produced about COVID-19, much of it is being compiled in separate silos by the government, academia and business, often in unreadable formats. Without an integrated data set, there's no easy way to produce the AI models used to analyze the many facets of the pandemic.


NVIDIA Accelerates Apache Spark, World's Leading Data Analytics Platform

#artificialintelligence

NVIDIA today announced that it is collaborating with the open-source community to bring end-to-end GPU acceleration to Apache Spark 3.0, an analytics engine for big data processing used by more than 500,000 data scientists worldwide. With the anticipated late spring release of Spark 3.0, data scientists and machine learning engineers will for the first time be able to apply revolutionary GPU acceleration to the ETL (extract, transform and load) data processing workloads widely conducted using SQL database operations. In another first, AI model training will be able to be processed on the same Spark cluster, instead of running the workloads as separate processes on separate infrastructure. This enables high-performance data analytics across the entire data science pipeline, accelerating tens to thousands of terabytes of data from data lake to model training, without changes to existing code used for Spark applications running on premises and in the cloud. "Data analytics is the greatest high performance computing challenge facing today's enterprises and researchers," said Manuvir Das, head of Enterprise Computing at NVIDIA.


Early soft and flexible fusion of EEG and fMRI via tensor decompositions

arXiv.org Machine Learning

Data fusion refers to the joint analysis of multiple datasets which provide complementary views of the same task. In this preprint, the problem of jointly analyzing electroencephalography (EEG) and functional Magnetic Resonance Imaging (fMRI) data is considered. Jointly analyzing EEG and fMRI measurements is highly beneficial for studying brain function because these modalities have complementary spatiotemporal resolution: EEG offers good temporal resolution while fMRI is better in its spatial resolution. The fusion methods reported so far ignore the underlying multi-way nature of the data in at least one of the modalities and/or rely on very strong assumptions about the relation of the two datasets. In this preprint, these two points are addressed by adopting for the first time tensor models in the two modalities while also exploring double coupled tensor decompositions and by following soft and flexible coupling approaches to implement the multi-modal analysis. To cope with the Event Related Potential (ERP) variability in EEG, the PARAFAC2 model is adopted. The results obtained are compared against those of parallel Independent Component Analysis (ICA) and hard coupling alternatives in both simulated and real data. Our results confirm the superiority of tensorial methods over methods based on ICA. In scenarios that do not meet the assumptions underlying hard coupling, the advantage of soft and flexible coupled decompositions is clearly demonstrated.


Industry 4.0 Is Leading IoT Adoption in 2020, Boosting Demand for Integrated Data

#artificialintelligence

Manufacturing and processing plants might not be at the front of anyone's minds when it comes to tech adoption, but as illustrated by a recent IDC report, The Worldwide Internet of Things Spending Guide, the manufacturing industry is transforming into industry 4.0 and spearheading the adoption of IoT. Industry 4.0 is the newest industrial revolution, bringing automation, big data and AI into plants and factories around the world. One of the building blocks of industry 4.0 is the internet of things, or IoT. A recent report forecast that spending on IoT platforms would see a 40% CAGR between 2019 and 2024, resulting in spending that exceeds $12.4 billion. In 2019, leading industry corporations were expected to invest almost $200 billion in IoT solutions.
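The compound annual growth rate (CAGR) arithmetic behind such a forecast can be sketched as follows. The teaser does not say which year the $12.4 billion figure refers to, so the snippet assumes it is the 2024 value reached after five compounding periods from 2019; the implied 2019 base it prints is therefore a derived illustration, not a number from the report. The function names are hypothetical.

```python
def compound(value, rate, years):
    """Grow a value at a constant annual rate for a number of years."""
    return value * (1 + rate) ** years


def implied_start(end_value, rate, years):
    """Back out the starting value implied by an end value and a CAGR."""
    return end_value / (1 + rate) ** years


# A 40% CAGR over five years multiplies the starting value by 1.4 ** 5.
growth_factor = compound(1.0, 0.40, 5)

# Assumption: 12.4 (billion USD) is the 2024 figure, five periods after 2019.
base_2019 = implied_start(12.4, 0.40, 5)

print(f"growth factor over 5 years: {growth_factor:.3f}")
print(f"implied 2019 base (billion USD): {base_2019:.2f}")
```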


NetApp working on Application-Integrated Data Management for Kubernetes - Express Computer

#artificialintelligence

NetApp, the leader in cloud data services, today introduced Project Astra, a vision for a software-defined platform that is currently in development with the Kubernetes community. Project Astra will deliver the industry's most robust, easy-to-consume, enterprise-class storage and data services platform for Kubernetes that enables both application and data portability for stateful applications. Although companies everywhere are rapidly adopting Kubernetes, many organizations lack reliable data and application services, and have difficulty making application data as portable as the applications themselves are in Kubernetes. Yet to meet the standards that CIOs expect, IT teams and site reliability engineers must find a way to store, govern, protect, and replicate the data for both stateless and stateful cloud-native applications with enterprise-class cloud storage and data services. Project Astra is being purpose-built for and in collaboration with Kubernetes developers and operations managers to help bridge the fundamental gap that exists between the popularity of containers today, the capabilities and user experience they require, and their ability to deliver true, comprehensive portability.