

Oracle Database 21c spotlights in-memory processing and ML, adds new low-code APEX cloud service


Among the messages that Oracle is putting out for its flagship database, adding new access paths for developers has become just as important as adding new data types. This month, Oracle is launching the next version of Oracle Database, version 21c. In a session hosted by Andrew Mendelsohn, executive vice president of database server technologies, the company is also announcing a new cloud-based APEX Service designed to carve a new access path for low-code developers who traditionally thought that writing apps for Oracle was complex and expensive. To induce new developers, Oracle is throwing in a free tier to this new cloud service. As Oracle now numbers its releases according to calendar year, 21c is the next release, which was announced as generally available last month.

AWS starts gluing the gaps between its databases


For Amazon Web Services (AWS), the key to its data management strategy has been that you need the right tool for the job. And so, AWS has amassed a portfolio of 15 databases, and over the past few years, rarely did a re:Invent go by without the announcement of some new database. So maybe it's time to take a breath. And yes, in an audacious move, AWS is seeking to grab your SQL Server workloads courtesy of Babelfish for Aurora PostgreSQL. But to us, the highlight was the announcement of AWS Glue Elastic Views, which is entering preview.

Microsoft Releases .NET for Apache Spark 1.0


Last month, Microsoft released the first major version of .NET for Apache Spark, an open-source package that brings .NET development to the Apache Spark platform. The new release allows .NET developers to write Apache Spark applications using .NET user-defined functions, Spark SQL, and additional libraries such as Microsoft Hyperspace and ML.NET. Apache Spark is an open-source, general-purpose analytics engine for large-scale data processing, with built-in modules for streaming, SQL, machine learning, and graph processing. Initially developed by the AMPLab team at UC Berkeley, it can be used in conjunction with different data repositories, including the Hadoop Distributed File System, NoSQL databases, and relational data stores. Because Spark processes data in memory (RAM) wherever possible, it can be up to 100x faster than Hadoop MapReduce for some large-scale workloads.

5 Tricky SQL Queries Solved -- Part II


Given a table of students and their GRE test scores, write a query to return the two students with the closest test scores and their score difference. If more than one pair has the same smallest difference, sort the names in ascending order and return the first resulting pair. This requires some creative thinking in SQL. Since there is only one table with two columns, we need to join the table to itself, referencing it under two different aliases. We can solve these kinds of problems by visualizing two tables that hold the same values.
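
The self-join idea can be sketched as follows. This is a minimal illustration rather than the article's own solution: the table name `scores`, the columns `student` and `score`, the helper `closest_pair`, and the sample rows are all hypothetical, and student names are assumed to be unique. It uses Python's built-in `sqlite3` module so the query can be run as-is.

```python
import sqlite3

# Hypothetical schema: scores(student TEXT, score INTEGER).
# Joining the table to itself with a.student < b.student yields each
# unordered pair exactly once, with the two names in ascending order.
CLOSEST_PAIR_SQL = """
SELECT a.student AS one_student,
       b.student AS other_student,
       ABS(a.score - b.score) AS score_diff
FROM scores AS a
JOIN scores AS b
  ON a.student < b.student
ORDER BY score_diff, one_student, other_student
LIMIT 1;
"""


def closest_pair(rows):
    """Return (name_1, name_2, difference) for the closest pair of scores."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE scores (student TEXT, score INTEGER)")
        conn.executemany("INSERT INTO scores VALUES (?, ?)", rows)
        return conn.execute(CLOSEST_PAIR_SQL).fetchone()
    finally:
        conn.close()


# Made-up sample data: Alice/Cara and Dan/Eve both differ by 2 points,
# so the name ordering acts as the tie-breaker.
sample = [("Alice", 320), ("Bob", 315), ("Cara", 318), ("Dan", 330), ("Eve", 332)]
print(closest_pair(sample))
```

Ordering by the difference first and the names second means the tie-breaking rule from the problem statement is handled by the same `ORDER BY`, and `LIMIT 1` keeps only the first pair.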

Making Postgres stored procedures 9X faster in Citus


This post by Marco Slot about Postgres stored procedures in Citus was originally published on the Azure Postgres Blog on Microsoft TechCommunity. Stored procedures are widely used in commercial relational databases. You write most of your application logic in PL/SQL and achieve notable performance gains by pushing this logic into the database. As a result, customers who are looking to migrate from other databases to PostgreSQL usually make heavy use of stored procedures. When migrating from a large database, using the Citus extension to distribute your database can be an attractive option, because you will always have enough hardware capacity to power your workload.

FLAT: Fast, Lightweight and Accurate Method for Cardinality Estimation

Query optimizers rely on accurate cardinality estimation (CardEst) to produce good execution plans. The core problem of CardEst is how to model the rich joint distribution of attributes in an accurate and compact manner. Despite decades of research, existing methods either oversimplify the models by using only independent factorization, which leads to inaccurate estimates and suboptimal query plans, or overcomplicate them by using lossless conditional factorization without any independence assumption, which results in slow probability computation. In this paper, we propose FLAT, a CardEst method that is simultaneously fast in probability computation, lightweight in model size and accurate in estimation quality. The key idea of FLAT is a novel unsupervised graphical model, called FSPN. It utilizes both independent and conditional factorization to adaptively model different levels of attribute correlations, and thus subsumes all existing CardEst models and dovetails their advantages. FLAT supports efficient online probability computation in near-linear time on the underlying FSPN model, and provides effective offline model construction. It can estimate cardinality for both single-table queries and multi-table join queries. An extensive experimental study demonstrates the superiority of FLAT over existing CardEst methods on well-known benchmarks: FLAT achieves 1 to 5 orders of magnitude better accuracy, 1 to 3 orders of magnitude faster probability computation speed (around 0.2ms) and 1 to 2 orders of magnitude lower storage cost (only tens of KB).
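
To make the abstract's point about independent factorization concrete, here is a small sketch of why the independence assumption can misestimate cardinality on correlated attributes. It is not FLAT or FSPN; the function names, the toy table, and the equality-only predicates are assumptions made purely for illustration.

```python
def estimate_independent(rows, predicates):
    """Estimate the matching row count assuming attributes are independent.

    The per-column selectivities are multiplied together, which is the
    independent factorization that the abstract describes as inaccurate.
    """
    n = len(rows)
    selectivity = 1.0
    for column, value in predicates.items():
        matches = sum(1 for row in rows if row[column] == value)
        selectivity *= matches / n
    return selectivity * n


def true_cardinality(rows, predicates):
    """Count the rows that actually satisfy every equality predicate."""
    return sum(
        1 for row in rows
        if all(row[column] == value for column, value in predicates.items())
    )


# Toy table with perfectly correlated columns: city determines country.
rows = (
    [{"city": "Paris", "country": "FR"} for _ in range(50)]
    + [{"city": "Berlin", "country": "DE"} for _ in range(50)]
)

consistent = {"city": "Paris", "country": "FR"}
contradictory = {"city": "Paris", "country": "DE"}

for query in (consistent, contradictory):
    print(query, estimate_independent(rows, query), true_cardinality(rows, query))
```

On this toy table the independence-based estimate is 25 rows for both queries, while the true counts are 50 and 0. Modeling the correlation between the two columns is what closes that gap, at the price of a more complex model.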

ColloQL: Robust Cross-Domain Text-to-SQL Over Search Queries

Translating natural language utterances to executable queries is a helpful technique in making the vast amount of data stored in relational databases accessible to a wider range of non-tech-savvy end users. Prior work in this area has largely focused on textual input that is linguistically correct and semantically unambiguous. However, real-world user queries are often succinct, colloquial, and noisy, resembling the input of a search engine. In this work, we introduce data augmentation techniques and a sampling-based content-aware BERT model (ColloQL) to achieve robust text-to-SQL modeling over natural language search (NLS) questions. Due to the lack of evaluation data, we curate a new dataset of NLS questions and demonstrate the efficacy of our approach. ColloQL's superior performance extends to well-formed text, achieving 84.9% (logical) and 90.7% (execution) accuracy on the WikiSQL dataset, making it, to the best of our knowledge, the highest performing model that does not use execution guided decoding.

Mastering Presto: Hands-On Learning


Learn Presto, a distributed SQL query engine for big data. Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes. Presto was designed and written from the ground up for interactive analytics and approaches the speed of commercial data warehouses while scaling to the size of organisations like Facebook. In the first part of the course I will talk about Presto's theory, including Presto's architecture and components: coordinator, worker, connector, query execution model, etc. Additionally, I will explain how Kafka, Cassandra, Hive, PostgreSQL and Redshift work before covering the specifics of their connectors.

PrestoDB: The Ultimate PrestoDB Course


In this course, you'll learn and understand PrestoDB, a distributed SQL query engine for big data. Presto is an open-source software project to develop a database query engine using the standard Structured Query Language (SQL). It is a distributed system that runs on a cluster of machines. A full installation includes a coordinator and multiple workers.

Alation revamps UX, adds analytics to its data catalog platform


Data catalog juggernaut Alation is announcing a new release that brings major changes in look, feel and functionality to its platform. Its goal is to cater to business users -- rather than just the data scientists, data engineers, data stewards and analysts it has served until now. By creating what it calls a "consumer-grade" user interface, featuring a search engine-like experience, Alation believes it will increase its user population by a factor of 10. The new release also enables custom branding, adds important analytics functionality and rearchitects the platform for the cloud. Alation's CEO and co-founder, Satyen Sangani, briefed ZDNet last week on the new 2020.3