Why does a secondary data store matter for AI? In my previous blog in this data store series, I discussed how the real selection criterion for an AI/ML data platform is striking the best balance between capacity (cost per GB stored) and performance (cost per GB of throughput). Indeed, to support enterprise AI programs, the data architecture must deliver both high performance (needed for AI training and validation) and high capacity (needed to store the huge volumes of data that AI training requires). These two capabilities can be hosted on the same system (an integrated data platform) or, in larger infrastructures, on two separate specialized systems (a two-tier architecture). This post continues the series of blogs dedicated to data stores for AI and advanced analytics.
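The capacity/performance trade-off can be made concrete with a little arithmetic. The sketch below compares the blended storage cost of an integrated platform against a two-tier architecture; the $/GB figures and the 10/90 hot/cold split are invented for illustration, not vendor pricing.

```python
# Illustrative cost comparison: integrated platform vs. two-tier
# architecture. All prices and data-split fractions are hypothetical.

def blended_cost_per_gb(tiers):
    """tiers: list of (fraction_of_data, cost_per_gb_stored) pairs."""
    assert abs(sum(f for f, _ in tiers) - 1.0) < 1e-9, "fractions must sum to 1"
    return sum(f * c for f, c in tiers)

# Integrated platform: all data lives on one high-performance system.
integrated = blended_cost_per_gb([(1.0, 0.20)])

# Two-tier: 10% hot data on a fast tier, 90% cold data on a capacity tier.
two_tier = blended_cost_per_gb([(0.10, 0.20), (0.90, 0.02)])

print(f"integrated: ${integrated:.3f}/GB  two-tier: ${two_tier:.3f}/GB")
```

Under these assumed prices the two-tier layout stores the same data at roughly a fifth of the blended cost, which is the economic argument for separating the performance tier from the capacity tier.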
In a year like no other in recent memory, the data ecosystem is showing not just remarkable resilience but exciting vibrancy. When COVID hit the world a few months ago, an extended period of gloom seemed all but inevitable. But cloud and data technologies (data infrastructure, machine learning / artificial intelligence, data-driven applications) are at the heart of digital transformation, and as a result, many companies in the data ecosystem have not just survived but thrived in an otherwise challenging political and economic context. Perhaps most emblematic of this is the blockbuster IPO of Snowflake, a data warehouse provider, which took place a couple of weeks ago and catapulted Snowflake to a $69B market cap at the time of writing – the biggest software IPO ever (see our S-1 teardown).
It was clear to the University of New South Wales (UNSW) at the end of 2018, when it was developing its data strategy, that it needed to improve the turnaround time to get information into the hands of decision makers. To do that, the university set up a cloud-based data warehouse, which it opted to host in Microsoft Azure. The cloud-based warehouse now operates alongside the university's legacy data warehouse, which is currently hosted on Amazon Web Services' (AWS) EC2. "Our legacy data warehouse has been around for 10 to 15 years. But we started looking at what platforms can let us do everything that we do now, but also allows us to move seamlessly into new things like machine learning and AI," UNSW chief data and insights officer and senior lecturer at the School of Computer Science and Engineering, Kate Carruthers, told ZDNet.
In reviewing this year's batch of announcements for MongoDB's online user conference, there's a lot that fills in the blanks left last year, as reported by Stephanie Condon. But the sleeper is the unification of a platform that has expanded over the past few years with mobile and edge processing capabilities, not to mention a search engine, and the reality that Atlas, its database-as-a-service (DBaaS) offering, now accounts for the majority of new installs. Last year, MongoDB announced previews of Atlas Data Lake, the piece of MongoDB's cloud service that lets you query data stored in Amazon S3 cloud storage; full-text search; plans to integrate the then recently acquired Realm mobile database platform with the Stitch serverless development environment; and autoscaling of the Atlas cloud service. This year, all those previews are going GA. Rounding it out is the announcement of the next release of MongoDB, version 4.4, which includes some modest enhancements to querying and sharding. The cloud is clearly MongoDB's future.
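Because Atlas Data Lake surfaces S3-backed data as ordinary MongoDB namespaces, the standard driver API applies unchanged. The sketch below builds an aggregation pipeline that could be run against such a collection; the connection URI, database, collection, and field names are hypothetical, and the actual connection is shown commented out since it requires a live Atlas federated instance.

```python
# A minimal sketch of querying S3-backed data through Atlas Data Lake
# with pymongo. URI, namespace, and field names are invented examples.

def build_big_spender_pipeline(min_total):
    """Standard aggregation pipeline, usable as-is on a Data Lake collection."""
    return [
        {"$match": {"total": {"$gte": min_total}}},          # filter orders
        {"$group": {"_id": "$customer_id",                   # per customer
                    "orders": {"$sum": 1}}},
        {"$sort": {"orders": -1}},                           # busiest first
    ]

pipeline = build_big_spender_pipeline(100)

# Against a real federated database instance (hypothetical URI):
# from pymongo import MongoClient
# client = MongoClient("mongodb+srv://user:pass@datalake.example.mongodb.net")
# for doc in client["sales"]["s3_orders"].aggregate(pipeline):
#     print(doc)
```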
"Supporting arrays, maps and structs allows customers to simplify data pipelines, unify more of their semi-structured data with their data warehouse, and maintain a better real-world representation of their data, from relationships between entities to customer orders with item-level detail. A good example is the group of cell phone towers used for a single call while driving on the highway."

I have interviewed Mark Lyons, Director of Product Management at Vertica. We talked about the new Vertica 10.0.

Q. What is your role at Vertica?

Mark Lyons: My role at Vertica is Director of Product Management. I have a team of five product managers covering analytics, security, storage integrations and cloud.
Credit scoring is without a doubt one of the oldest applications of analytics. In recent years, a multitude of sophisticated classification techniques have been developed to improve the statistical performance of credit scoring models. Instead of focusing on the techniques themselves, this paper leverages alternative data sources to enhance both statistical and economic model performance. Applying a profit measure and profit-based feature selection, the study demonstrates the added value, in terms of profit, of including call networks as a new Big Data source in the context of positive credit information. A unique combination of datasets, including call-detail records and the credit and debit account information of customers, is used to create scorecards for credit card applicants. Call-detail records are used to build call networks, and advanced social network analytics techniques are applied to propagate influence from prior defaulters throughout the network to produce influence scores. The results show that combining call-detail records with traditional data in credit scoring models significantly increases their performance as measured by AUC. In terms of profit, the best model is the one built with only calling behavior features. In addition, the calling behavior features are the most predictive in the other models, in terms of both statistical and economic performance. The results have implications for the ethical use of call-detail records, regulation, financial inclusion, and data sharing and privacy.
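The core idea of propagating influence from known defaulters through a call network can be sketched in a few lines. The damped spreading rule below (akin to personalized PageRank seeded on defaulters) and the toy call graph are illustrative assumptions, not the authors' exact algorithm.

```python
# Toy sketch: propagate default "influence" through a call graph.
# The propagation rule and graph are invented for illustration.

def propagate_influence(edges, defaulters, alpha=0.85, iters=50):
    nodes = {n for e in edges for n in e}
    nbrs = {n: [] for n in nodes}
    for a, b in edges:                     # undirected call links
        nbrs[a].append(b)
        nbrs[b].append(a)
    score = {n: (1.0 if n in defaulters else 0.0) for n in nodes}
    for _ in range(iters):
        new = {}
        for n in nodes:
            # each neighbor spreads its score evenly over its own links
            spread = sum(score[m] / len(nbrs[m]) for m in nbrs[n])
            base = 1.0 if n in defaulters else 0.0
            new[n] = (1 - alpha) * base + alpha * spread
        score = new
    return score

calls = [("ann", "bob"), ("bob", "cid"), ("cid", "dee")]
scores = propagate_influence(calls, defaulters={"ann"})
# Non-defaulters closer to the known defaulter tend to score higher.
```

Scores like these would then be fed as features into the scorecard alongside traditional credit variables.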
In the current age of cloud computing, there is now a multitude of mature services available -- offering security, scalability, and reliability for many business computing needs. What was once a colossal undertaking to build a data center, install server racks, and design storage arrays has given way to an entire marketplace of services that are always just a click away. One leader in that marketplace is Amazon Web Services, whose vast catalog of 175 products and services provides cloud storage, compute power, app deployment, user account management, data warehousing, tools for managing and controlling Internet of Things devices, and just about anything else a business might need. AWS has grown considerably in both popularity and capability over the last decade, in no small part because of its reputation for reliability and security.
Digital agriculture promises to transform agricultural productivity. It can do this by applying data science and engineering to map input factors to crop yield while bounding the available resources. In addition, as data volumes and varieties increase with the growth of sensor deployments in agricultural fields, data engineering techniques will be instrumental in collecting distributed data and processing it in a distributed fashion, all while satisfying the latency requirements of end users and applications. Understanding how farm technology and big data can improve farm productivity could significantly increase the world's food production by 2050 in the face of constrained arable land and receding water levels. While much has been written about digital agriculture's potential, little is known about the economic costs and benefits of these emergent systems. In particular, on-farm decision-making processes, both in terms of adoption and optimal implementation, have not been adequately addressed. For example, if an algorithm needs data from multiple data owners to be pooled together, that raises the question of data ownership. This paper is the first to bring together, under one umbrella, the important questions that will guide the end-to-end pipeline for a new generation of digital agricultural solutions, driving the next revolution in agriculture and sustainability.
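"Mapping input factors to crop output under bounded resources" is, at its simplest, a constrained allocation problem. The sketch below greedily spends a fixed water budget on whichever field currently offers the highest marginal yield; the field names, yield responses, and diminishing-returns rule are invented for illustration.

```python
# Toy sketch of resource-bounded input allocation in digital agriculture.
# Fields, yield responses, and the decay rule are hypothetical.

def allocate_water(fields, budget, step=1.0):
    """fields: {name: yield_per_unit_water}; returns water units per field."""
    alloc = {name: 0.0 for name in fields}
    remaining = budget
    while remaining >= step:
        # marginal yield decays as a field receives more water
        best = max(fields, key=lambda n: fields[n] / (1 + alloc[n]))
        alloc[best] += step
        remaining -= step
    return alloc

alloc = allocate_water({"north": 3.0, "south": 1.0}, budget=4)
```

A real system would replace the invented response curves with models fitted to the sensor data the paragraph describes, but the budget constraint and the marginal-value comparison carry over directly.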
Our thirst for data is great. Our desire to be connected at all times means that by 2025 the average connected person will interact with connected devices once every 18 seconds – nearly 4,800 times a day, a figure that includes using smart home security, smart TVs and more. But how is AI impacting surveillance data storage? As our world becomes increasingly connected, the vast amounts of data being created are also enabling us to refine and improve systems and processes.
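The cited figure is easy to sanity-check: one interaction every 18 seconds over a 24-hour day works out exactly to the "nearly 4,800 times a day" number.

```python
# Sanity-checking the cited statistic: 86,400 seconds per day / 18 = 4,800.
seconds_per_day = 24 * 60 * 60
interactions_per_day = seconds_per_day / 18
print(interactions_per_day)  # 4800.0
```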