Data Integration

5 Most essential skills to become a data scientist in 2021


Data science emerged as one of the hottest job roles in 2020. With the increase in demand for skilled professionals, more and more people have started taking up data science courses. If you want to become a data scientist in 2021, you need to develop a set of skills. Here are the most essential skills to become a successful data scientist in the near future. The latest version, Python 3, has become the default choice of language for data science.

Pentaho Data Integration tool for ETL & Data warehousing


The ETL (extract, transform, load) process is the most popular method of collecting data from multiple sources and loading it into a centralized data warehouse. ETL is an essential component of data warehousing and analytics. Pentaho has phenomenal ETL, data analysis, metadata management and reporting capabilities. Pentaho is faster than other ETL tools (including Talend). Its GUI is easier and takes less time to learn.

Pandas on Steroids: End to End Data Science in Python with Dask - KDnuggets


As the saying goes, a data scientist spends 90% of their time cleaning data and 10% complaining about the data. Their complaints may range from data size, faulty data distributions, null values, data randomness, and systematic errors in data capture to differences between train and test sets, and the list goes on and on. One common bottleneck is the sheer size of the data: either the data doesn't fit into memory, or the processing time is so long (on the order of several minutes) that the underlying pattern analysis falls apart. Data scientists by nature are curious human beings who want to identify and interpret patterns normally hidden from a cursory drag-and-drop glance. Even after answering these questions, multiple sub-threads can emerge, for example: can we predict what the Covid-affected New Year is going to be like, how does the annual NY marathon shift taxi demand, and is a particular route more prone to carrying multiple passengers (party hub) or single passengers (airport to suburbs)?
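
The memory bottleneck described above is usually handled by processing data in partitions rather than all at once. The following is a minimal standard-library sketch of that idea, not Dask's API: the column names `passenger_count` and `fare`, the sample rows, and the helper names are all assumptions made up for illustration.

```python
import csv
import io
from collections import defaultdict
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size rows from any row iterator."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk


def mean_fare_by_passengers(csv_file, chunk_size=2):
    """Compute the mean fare per passenger count, one chunk at a time.

    Only running totals and counts are kept in memory, so the full
    file never has to be loaded at once.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    reader = csv.DictReader(csv_file)
    for chunk in iter_chunks(reader, chunk_size):
        for row in chunk:
            key = int(row["passenger_count"])
            totals[key] += float(row["fare"])
            counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Hypothetical sample data standing in for a large taxi-trip file.
SAMPLE = "passenger_count,fare\n1,10.0\n1,20.0\n3,9.0\n1,30.0\n3,11.0\n"
result = mean_fare_by_passengers(io.StringIO(SAMPLE))
```

A library such as Dask applies the same partition-then-combine pattern, while also scheduling the partitions in parallel.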

Services Australia has a JobKeeper data exchange arrangement with the ATO


Services Australia has a data exchange program underway with the Australian Taxation Office (ATO) that flags people who are on the federal government's JobKeeper scheme. "There are some people who haven't declared JobKeeper payments as income on their record," Services Australia deputy CEO, customer service delivery Michelle Lees said. "Based on the data exchange information, we're aware there are approximately 135,000 people who were receiving a social security payment who were identified by an employer as being eligible for JobKeeper. It doesn't necessarily mean, in some instances when we contact them, they might actually say they haven't received a JobKeeper payment, whereby we'd refer that back to the ATO to follow up." Lees said in the event that there was a recalculation of entitlement required, because someone has updated their details, the program could flag that there was a provisional debt.

ETL & ELT, a comparison


When designing and building data pipelines to load data into data warehouses, you might have heard of the common ETL and ELT paradigms. This post goes over what they mean, their differences, and which paradigm you might want to choose. ELT is very similar to ETL, but the data is loaded into a table before being transformed into a final table that is used by users. It has fewer components than the ETL approach.
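
The difference between the two paradigms can be sketched with an in-memory SQLite database standing in for the warehouse. This is an illustrative sketch only: the table names, the `RAW_ORDERS` records, and the cleaning rule are assumptions, not taken from the post.

```python
import sqlite3

# Hypothetical raw records from a source system; amounts arrive as strings.
RAW_ORDERS = [("a1", " 10.50 "), ("a2", "3.25"), ("a3", "bad")]


def parse_amount(text):
    """Return the amount as a float, or None if the text is not numeric."""
    try:
        return float(text.strip())
    except ValueError:
        return None


def run_etl(conn):
    """ETL: transform in application code, then load only the clean rows."""
    conn.execute("CREATE TABLE orders_etl (id TEXT, amount REAL)")
    parsed = [(oid, parse_amount(raw)) for oid, raw in RAW_ORDERS]
    clean = [row for row in parsed if row[1] is not None]
    conn.executemany("INSERT INTO orders_etl VALUES (?, ?)", clean)


def run_elt(conn):
    """ELT: load the raw rows first, then transform inside the database."""
    conn.execute("CREATE TABLE orders_raw (id TEXT, amount TEXT)")
    conn.executemany("INSERT INTO orders_raw VALUES (?, ?)", RAW_ORDERS)
    conn.execute(
        """
        CREATE TABLE orders_elt AS
        SELECT id, CAST(TRIM(amount) AS REAL) AS amount
        FROM orders_raw
        WHERE TRIM(amount) GLOB '[0-9]*'
        """
    )


conn = sqlite3.connect(":memory:")
run_etl(conn)
run_elt(conn)
etl_rows = conn.execute("SELECT id, amount FROM orders_etl ORDER BY id").fetchall()
elt_rows = conn.execute("SELECT id, amount FROM orders_elt ORDER BY id").fetchall()
```

Both paths end with the same final table here. In the ELT path the untransformed data also remains available in `orders_raw`, so the transformation can be rerun or changed later without going back to the source.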

Does Palantir See Too Much?


On a bright Tuesday afternoon in Paris last fall, Alex Karp was doing tai chi in the Luxembourg Gardens. He wore blue Nike sweatpants, a blue polo shirt, orange socks, charcoal-gray sneakers and white-framed sunglasses with red accents that inevitably drew attention to his most distinctive feature, a tangle of salt-and-pepper hair rising skyward from his head. Under a canopy of chestnut trees, Karp executed a series of elegant tai chi and qigong moves, shifting the pebbles and dirt gently under his feet as he twisted and turned. A group of teenagers watched in amusement. After 10 minutes or so, Karp walked to a nearby bench, where one of his bodyguards had placed a cooler and what looked like an instrument case. The cooler held several bottles of the nonalcoholic German beer that Karp drinks (he would crack one open on the way out of the park). The case contained a wooden sword, which he needed for the next part of his routine. "I brought a real sword the last time I was here, but the police stopped me," he said matter of factly as he began slashing the air with the sword. Those gendarmes evidently didn't know that Karp, far from being a public menace, was the chief executive of an American company whose software has been deployed on behalf of public safety in France. The company, Palantir Technologies, is named after the seeing stones in J.R.R. Tolkien's "The Lord of the Rings." Its two primary software programs, Gotham and Foundry, gather and process vast quantities of data in order to identify connections, patterns and trends that might elude human analysts. The stated goal of all this "data integration" is to help organizations make better decisions, and many of Palantir's customers consider its technology to be transformative. Karp claims a loftier ambition, however. "We built our company to support the West," he says. To that end, Palantir says it does not do business in countries that it considers adversarial to the U.S. and its allies, namely China and Russia. In the company's early days, Palantir employees, invoking Tolkien, described their mission as "saving the shire." The brainchild of Karp's friend and law-school classmate Peter Thiel, Palantir was founded in 2003. It was seeded in part by In-Q-Tel, the C.I.A.'s venture-capital arm, and the C.I.A. remains a client. Palantir's technology is rumored to have been used to track down Osama bin Laden -- a claim that has never been verified but one that has conferred an enduring mystique on the company. These days, Palantir is used for counterterrorism by a number of Western governments.

Global Artificial Intelligence in Energy Market


The global artificial intelligence in energy market size is poised to grow by USD 8.06 billion during 2020-2024, decelerating at a CAGR of almost 48% …

Data Engineer


At FutureLearn we work in short sprints and regularly share, reflect on and iterate on our work. This helps us focus on shipping small, iterative changes and responding quickly to changing business or user needs. We care about work/life balance and supporting learning at work. The Data Platform Team builds and maintains tooling and infrastructure that supports decision-making processes across the business and enables product improvements by providing a complete and consistent view of our business data. Our tech stack consists of an ETL process, written in Ruby and managed by Airflow, which sources data from our production database (MySQL), our email provider (Sendgrid), application logs, and other operational data sources.

Surgery on ROC Plots


This note is a little break from our model homotopy series. I have a neat example where one combines two classifiers to get a better classifier using a method I am calling "ROC surgery." In ROC surgery we look at multiple ROC plots and decide we want to cut out a section from one of the plots for use. It is a sensor fusion method that tries to combine the best parts of two classifiers.
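
To make the idea of comparing ROC plots concrete, here is a small sketch that computes ROC points for two classifiers and reports which one gives the better true positive rate under a false positive budget. It is a simplified stand-in and should not be read as the author's ROC surgery procedure, which the excerpt does not spell out. The scores, labels, and function names are invented for illustration.

```python
def roc_points(scores, labels):
    """Return (fpr, tpr) pairs, one per distinct threshold, highest first."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing flagged
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def best_tpr_at(points, max_fpr):
    """Highest TPR among operating points whose FPR is within max_fpr."""
    return max(tpr for fpr, tpr in points if fpr <= max_fpr)


def pick_classifier(points_a, points_b, max_fpr):
    """Name the classifier with the better TPR under the FPR budget."""
    tpr_a = best_tpr_at(points_a, max_fpr)
    tpr_b = best_tpr_at(points_b, max_fpr)
    return "A" if tpr_a >= tpr_b else "B"


# Hypothetical data: four positives followed by four negatives.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores_a = [0.9, 0.8, 0.2, 0.1, 0.7, 0.3, 0.25, 0.05]
scores_b = [0.6, 0.55, 0.5, 0.45, 0.9, 0.4, 0.3, 0.2]

roc_a = roc_points(scores_a, labels)
roc_b = roc_points(scores_b, labels)
choice_strict = pick_classifier(roc_a, roc_b, max_fpr=0.0)
choice_loose = pick_classifier(roc_a, roc_b, max_fpr=0.25)
```

In this made-up data, one classifier is preferable when no false positives are tolerated and the other when a quarter of the negatives may be flagged, which is the kind of situation where taking different sections from different ROC plots becomes attractive.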

How Supercomputers Help To Create The Next Generation of Fully Integrated Data Centres


"Data centre is an asset that needs to be protected" - Michael Kagan, CTO of NVIDIA. On the first day of the NVIDIA GPU Technology Conference, Jensen Huang, founder of NVIDIA, revealed the company's three-year DPU roadmap, which featured the new NVIDIA BlueField-2 family of DPUs and the NVIDIA DOCA software development kit for building applications on DPU-accelerated data centre infrastructure services. In a recent talk, Michael Kagan, CTO of NVIDIA, explained the next generation of fully integrated data centres and how supercomputers and edge AI help in augmenting such initiatives. Kagan stated that the state-of-the-art technologies from both NVIDIA and Mellanox created a great opportunity to build a new class of computers, i.e. the fully integrated cloud data centres that are designed to handle the workload of the 21st century. Historically, servers were the unit of computing. But Moore's law has slowed down, and the performance of CPUs could not keep up with workload demands. According to Kagan, with the revolution of cloud AI and edge computing, instead of a single server, the entire data centre has become the new unit of computing designed to handle parallel workloads.