 Information Fusion


Apache Beam - Create Data Processing Pipelines

@machinelearnbot

At the Data Science Association, our members often complain about a major data engineering problem: finding the right tools and programming models to build both robust data processing pipelines and efficient ETL processes for data transformation and integration. Apache Beam (incubating) attempts to solve this problem by providing a unified programming model for creating data processing pipelines. The Apache Beam open source project is currently in incubation, and we invite you to join the community and help build it. You start by writing a program that defines the pipeline using one of the open source Beam SDKs. The pipeline is then executed by one of Beam's supported distributed processing back-ends, which include Apache Flink, Apache Spark, and Google Cloud Dataflow.
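The split described here, between a program that defines the pipeline and a back-end that executes it, can be illustrated with a toy sketch in plain Python. This is not the Beam API: the names Pipeline, LocalRunner and ThreadedRunner are hypothetical and exist only to show the same pipeline definition being handed to two interchangeable executors.

```python
from concurrent.futures import ThreadPoolExecutor


class Pipeline:
    """Hypothetical stand-in for an SDK pipeline: an ordered list of element-wise steps."""

    def __init__(self):
        self.steps = []

    def apply(self, fn):
        # Record the step only; nothing is executed at definition time.
        self.steps.append(fn)
        return self  # returning self allows chained definitions


def _run_steps(steps, element):
    # Push one element through every step in order.
    for fn in steps:
        element = fn(element)
    return element


class LocalRunner:
    """Hypothetical back-end that runs the pipeline sequentially in-process."""

    def run(self, pipeline, data):
        return [_run_steps(pipeline.steps, x) for x in data]


class ThreadedRunner:
    """Hypothetical back-end that runs the same pipeline on a thread pool."""

    def __init__(self, workers=4):
        self.workers = workers

    def run(self, pipeline, data):
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            # pool.map preserves input order in its results.
            return list(pool.map(lambda x: _run_steps(pipeline.steps, x), data))


# Define the pipeline once: trim, lowercase, then measure each line.
pipeline = Pipeline().apply(str.strip).apply(str.lower).apply(len)
lines = ["  Apache Beam ", "Flink", " SPARK  "]

# Execute the unchanged definition on two different back-ends.
local_result = LocalRunner().run(pipeline, lines)
threaded_result = ThreadedRunner().run(pipeline, lines)
print(local_result, threaded_result)
```

The point of the design is that the pipeline definition never mentions how it will be run, so swapping the back-end does not require touching the transformation logic.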


How to justify the purchase of a data integration tool

#artificialintelligence

The growing importance of business intelligence and data analytics applications in driving business decision making has made data integration's vital role in the enterprise crystal clear. From gathering data, to transforming it into useful information, to delivering it to the business users or processes that need it, data integration routines provide the crucial link between a variety of source and target systems. As the first article in this series examined, several types of packaged software have emerged to meet the challenges of data integration. The current generation of data integration tools consists of full-fledged suites that support extract, transform and load (ETL) processes, application integration, cloud-based and real-time integration, data virtualization, data cleansing and data profiling. How can you determine if your organization should invest in a data integration tool?


Is your data safe when it's at rest? MarkLogic 9 aims to make sure it is

PCWorld

The database landscape is much more diverse than it once was, thanks in large part to big data, and on Tuesday, one of today's newer contenders unveiled an upcoming release featuring a major boost in security. Version 9 of MarkLogic's namesake NoSQL database will be available at the end of this year, and one of its key new features is the inclusion of Cryptsoft's KMIP (Key Management Interoperability Protocol) technology. MarkLogic has placed its bets on companies' need to integrate data from dispersed enterprise silos -- a task that has often required the use of so-called ETL tools to extract, transform and load data into a traditional relational database. Aiming to offer an alternative approach, MarkLogic's technology combines the flexibility, scalability, and agility of NoSQL with enterprise-hardened features like government-grade security and high availability, it says. Now coming up in the next generation of the software will be a variety of improvements in data integration, manageability and security, the company says, but certainly most notable among them is the addition of Cryptsoft's KMIP.


Know Your Data: The Data Complexity Matrix

@machinelearnbot

Data environments are growing exponentially. IDC reports that compound annual growth in data through 2020 will be almost 50% per year. Not only is there more data, but there are more data sources. According to Ventana Research, 70% of organizations need to integrate more than 6 disparate data sources. At the same time, the value of unlocking that data and using it to make business decisions is also increasing.


Pentaho adds native Python integration

#artificialintelligence

Aiming to better support machine learning and analytical environments, Pentaho Labs yesterday announced that it has developed a native integration for the Python language through Pentaho Data Integration (PDI). PDI is essentially a portable "data machine" for ETL, which you can deploy as a stand-alone Pentaho cluster or inside a Hadoop cluster through MapReduce or YARN. Will Gorman, vice president of Pentaho Labs at Hitachi subsidiary Pentaho, says the integration means data scientists can now use one of the most popular and flexible open-source languages to increase productivity and data governance while supporting predictive analytics and machine learning. He says the integration will also make data science and predictive modeling more accessible to the developer community. "Python is the environment that is growing the fastest from a community perspective," Gorman says.


Data Warehouse Architecture

@machinelearnbot

According to Weisensee et al., data warehouse architecture follows these principles: the ETL process is the foundation of BI, and the success or failure of BI projects depends on it. ETL plays a vital role in integrating data and enhancing its worth. After extraction, cleansing and arrangement, the data is loaded into the data warehouse. In short, ETL is the process of transferring data from the data sources to the target data warehouse.
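The extract, cleanse and load sequence can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not a method taken from Weisensee et al.: the sample CSV, the orders table, and the cleansing rules (trim and lowercase names, drop rows with a missing customer or a non-numeric amount) are all hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source data standing in for an export from an operational system.
RAW_CSV = """order_id,customer,amount
1, alice ,19.99
2,BOB,5.00
3,,12.50
4,carol,not_a_number
"""


def extract(text):
    """Extract: read raw rows from the source (here, CSV text)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: cleanse and arrange rows, dropping those that cannot be repaired."""
    clean = []
    for row in rows:
        customer = row["customer"].strip().lower()
        if not customer:
            continue  # assumed rule: a row without a customer is unusable
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # assumed rule: a non-numeric amount is unusable
        clean.append((int(row["order_id"]), customer, amount))
    return clean


def load(rows, conn):
    """Load: write the cleansed rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(text, conn):
    """Run the three stages in order: extract, then transform, then load."""
    load(transform(extract(text)), conn)


# An in-memory SQLite database plays the role of the data warehouse.
warehouse = sqlite3.connect(":memory:")
run_etl(RAW_CSV, warehouse)
loaded = warehouse.execute(
    "SELECT order_id, customer, amount FROM orders ORDER BY order_id"
).fetchall()
print(loaded)
```

Keeping each stage as a separate function makes the cleansing rules easy to test on their own before anything is written to the target.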


Fortune Teller: Predicting Your Career Path

AAAI Conferences

People go to fortune tellers in hopes of learning things about their future. A future career path is one of the topics most frequently discussed. But rather than rely on "black arts" to make predictions, in this work we scientifically and systematically study the feasibility of career path prediction from social network data. In particular, we seamlessly fuse information from multiple social networks to comprehensively describe a user and characterize progressive properties of his or her career path. This is accomplished via a multi-source learning framework with fused lasso penalty, which jointly regularizes the source and career-stage relatedness. Extensive experiments on real-world data confirm the accuracy of our model.


The Way To A Successful Business With Data Integration And Management

@machinelearnbot

For a business to be successful in today's highly competitive world, it must go above and beyond to prove itself loyal to its customers and worthy of their relationship. There are many ways to do this, but it starts with the collection and analysis of various pieces of data. The data can provide an array of information that can be used to enhance the overall customer experience. The business greatly benefits as well. Today's technology is highly capable, and with it have come a number of cost-effective solutions.


Information Fusion and Data Integration: Fast vs. Batch

@machinelearnbot

The problem is that the most compelling applications associated with Big Data and the Industrial Internet of Things are often based on streaming sensor data. Information Fusion is capable of integrating, transforming and organizing all manner of data (structured, semi-structured, and unstructured), but specifically time-series data, for use by today's most demanding analytics applications to bridge the gap between Fast Data and Big Data. Put another way, data integration creates an integrated set of data in which the larger set is retained. By comparison, Information Fusion uses multiple techniques to reduce the amount of stateless data and provide only the stateful, valuable and relevant data, delivering improved confidence.


Overcoming the data integration challenge in hybrid and cloud-based environments

#artificialintelligence

Industry experts estimate that data volumes are doubling in size every two years. Managing all of this is a challenge for any enterprise, but it is not so much the volume of data as the variety of data that presents a problem. With SaaS and on-premises applications, machine data, and mobile apps all proliferating, we are seeing the rise of an increasingly complicated value-chain ecosystem. IT leaders need to incorporate a portfolio-based approach and combine cloud and on-premises deployment models to sustain competitive advantage. Improving the scale and flexibility of data integration across both environments to deliver a hybrid offering is necessary to provide the right data to the right people at the right time.