postgre
Improving DBMS Scheduling Decisions with Fine-grained Performance Prediction on Concurrent Queries -- Extended
Wu, Ziniu, Markakis, Markos, Liu, Chunwei, Chen, Peter Baile, Narayanaswamy, Balakrishnan, Kraska, Tim, Madden, Samuel
Query scheduling is a critical task that directly impacts query performance in database management systems (DBMS). Deeply integrated schedulers, which require changes to DBMS internals, are usually customized for a specific engine and can take months to implement. In contrast, non-intrusive schedulers make coarse-grained decisions, such as controlling query admission and re-ordering query execution, without requiring modifications to DBMS internals. They require much less engineering effort and can be applied across a wide range of DBMS engines, offering immediate benefits to end users. However, most existing non-intrusive scheduling systems rely on simplified cost models and heuristics that cannot accurately model query interactions under concurrency and different system states, possibly leading to suboptimal scheduling decisions. This work introduces IconqSched, a new, principled non-intrusive scheduler that optimizes the execution order and timing of queries to improve total end-to-end runtime as experienced by the user, that is, query queuing time plus system runtime. Unlike previous approaches, IconqSched features a novel fine-grained predictor, Iconq, which treats the DBMS as a black box and accurately estimates the system runtime of concurrently executed queries under different system states. Using these predictions, IconqSched is able to capture system runtime variations across different query mixes and system loads. It then employs a greedy scheduling algorithm to effectively determine which queries to submit and when to submit them. We compare IconqSched to other schedulers in terms of end-to-end runtime using real workload traces. On Postgres, IconqSched reduces end-to-end runtime by 16.2%-28.2% on average and 33.6%-38.9% in the tail. Similarly, on Redshift, it reduces end-to-end runtime by 10.3%-14.1% on average and 14.9%-22.2% in the tail.
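As a rough illustration of the idea in this abstract, and not the IconqSched algorithm itself, the Python sketch below shows a greedy admission step that consults a runtime predictor before submitting a query. Everything in it is an assumption made for illustration: the `Query` fields, the `predict_runtime` formula (a 50% slowdown per running query of the same kind), and the `max_slowdown` threshold are hypothetical stand-ins for the learned Iconq predictor and the paper's scheduling policy.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Query:
    name: str
    kind: str            # hypothetical resource class, e.g. "cpu" or "io"
    solo_runtime: float  # assumed runtime in seconds when run alone (> 0)


def predict_runtime(query: Query, running: List[Query]) -> float:
    """Hypothetical stand-in for a learned predictor: every running query
    of the same kind is assumed to add a 50% slowdown."""
    contenders = sum(1 for r in running if r.kind == query.kind)
    return query.solo_runtime * (1.0 + 0.5 * contenders)


def pick_next(queue: List[Query], running: List[Query],
              max_slowdown: float = 2.0) -> Optional[Query]:
    """Greedy step: submit the queued query with the smallest predicted
    slowdown (ties broken by shortest predicted runtime), or return None
    to keep everything queued when all candidates would slow down too much."""
    candidates = []
    for q in queue:
        predicted = predict_runtime(q, running)
        slowdown = predicted / q.solo_runtime
        if slowdown <= max_slowdown:
            candidates.append((slowdown, predicted, q.name, q))
    if not candidates:
        return None
    return min(candidates, key=lambda c: c[:3])[3]
```

A caller would invoke `pick_next` whenever a query arrives or finishes, which is how a non-intrusive scheduler can control both the order and the timing of submissions without touching DBMS internals.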
FactorJoin: A New Cardinality Estimation Framework for Join Queries
Wu, Ziniu, Negi, Parimarjan, Alizadeh, Mohammad, Kraska, Tim, Madden, Samuel
Cardinality estimation is one of the most fundamental and challenging problems in query optimization. Neither classical nor learning-based methods yield satisfactory performance when estimating the cardinality of join queries. They either rely on simplified assumptions, leading to ineffective cardinality estimates, or build large models to understand the data distributions, leading to long planning times and a lack of generalizability across queries. In this paper, we propose a new framework, FactorJoin, for estimating join queries. FactorJoin combines the idea behind the classical join-histogram method, to efficiently handle joins, with learning-based methods, to accurately capture attribute correlation. Specifically, FactorJoin scans every table in a DB and builds single-table conditional distributions during an offline preparation phase. When a join query arrives, FactorJoin translates it into a factor graph model over the learned distributions to effectively and efficiently estimate its cardinality. Unlike existing learning-based methods, FactorJoin does not need to de-normalize joins upfront or require executed query workloads to train the model. Since it only relies on single-table statistics, FactorJoin has small space overhead and is extremely easy to train and maintain. In our evaluation, FactorJoin can produce more effective estimates than the previous state-of-the-art learning-based methods, with 40x less estimation latency, 100x smaller model size, and 100x faster training speed at comparable or better accuracy. In addition, FactorJoin can estimate 10,000 sub-plan queries within one second to optimize the query plan, which is very close to the traditional cardinality estimators in commercial DBMS.
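To make the join-histogram idea mentioned in this abstract concrete, here is a minimal Python sketch that derives the size of a two-table equi-join purely from per-table statistics on the join key. It is not FactorJoin: it uses one bin per key value (so the result is exact rather than an estimate) and it does no factor-graph inference. The table contents, column names, and function names are hypothetical.

```python
from collections import Counter


def key_counts(rows, key, predicate=lambda row: True):
    """Single-table statistic: number of rows per join-key value,
    restricted to the rows that satisfy the filter predicate."""
    return Counter(row[key] for row in rows if predicate(row))


def join_size(counts_a, counts_b):
    """Equi-join size from per-table key counts: each key value contributes
    (rows in A with that value) * (rows in B with that value)."""
    return sum(n * counts_b.get(value, 0) for value, n in counts_a.items())


# Hypothetical toy tables.
orders = [
    {"cid": 1, "amount": 5},
    {"cid": 1, "amount": 50},
    {"cid": 2, "amount": 70},
    {"cid": 3, "amount": 20},
]
customers = [
    {"cid": 1, "country": "US"},
    {"cid": 2, "country": "DE"},
    {"cid": 4, "country": "US"},
]

unfiltered = join_size(key_counts(orders, "cid"), key_counts(customers, "cid"))
filtered = join_size(
    key_counts(orders, "cid", lambda row: row["amount"] > 30),
    key_counts(customers, "cid"),
)
print(unfiltered, filtered)
```

A real histogram groups many key values into each bin and keeps only summary statistics per bin, which trades this exactness for a much smaller footprint. Note that neither table is joined or de-normalized to compute the result.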
Machine Learning Streaming with Kafka, Debezium, and BentoML
Bringing a machine learning project to life is not a simple task and, just like any other software product, it requires many different kinds of knowledge: infrastructure, business, data science, etc. I must confess that, for a long time, I simply neglected the infrastructure part, leaving my projects to rest in peace inside Jupyter notebooks. But as soon as I started learning it, I realized that it is a very interesting topic. Machine learning is still a growing field and, in comparison with other IT-related areas like web development, the community still has a lot to learn. Luckily, in the last few years we have seen a lot of new technologies arise to help us build ML applications, like MLflow, Apache Spark's MLlib, and BentoML, which is explored in this post. In this post, a machine learning architecture is explored with some of these technologies to build a real-time price recommender system. To bring this concept to life, we needed not only ML-related tools (BentoML and scikit-learn) but also other software pieces (Postgres, Debezium, Kafka). Of course, this is a simple project that doesn't even have a user interface, but the concepts explored in this post could easily be extended to many cases and real scenarios. I hope this post helped you somehow. I am not an expert in any of the subjects discussed, and I strongly recommend further reading (see some references below).
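In a pipeline like this one, Debezium publishes each Postgres row change to Kafka as a JSON envelope with `before`, `after`, and `op` fields. The Python sketch below shows one way a consumer might pull the new row state out of such a message before handing it to a model. It is only a sketch under assumptions: the function names, the `cost` column, and the `predict` callable are hypothetical, and neither the Kafka consumer loop nor the call to the BentoML service is shown.

```python
import json
from typing import Callable, Optional


def extract_row(message_value: Optional[bytes]) -> Optional[dict]:
    """Return the new row state from a Debezium change event, or None."""
    if message_value is None:  # tombstone record that can follow a delete
        return None
    event = json.loads(message_value)
    # With schemas enabled the envelope sits under "payload"; otherwise
    # the envelope is the top-level object.
    payload = event.get("payload", event)
    if payload.get("op") in ("c", "u", "r"):  # create, update, snapshot read
        return payload.get("after")
    return None  # deletes carry no new row state


def handle_event(message_value: Optional[bytes],
                 predict: Callable[[dict], float]) -> Optional[float]:
    """Run the (hypothetical) price model on inserted or updated rows only."""
    row = extract_row(message_value)
    return None if row is None else predict(row)
```

In practice `predict` would wrap a request to the model-serving endpoint, and the result would be written back to the database or to another topic.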
Let's Continue Bundling into the Database
A very silly blog post came out a couple months ago about The Unbundling of Airflow. I didn't fully read the article, but I saw its title and skimmed it enough to think that it might've been too thin of an argument to hold water but just thick enough to clickbait the VC world with the word "unbundling" while simultaneously Cunningham's Law-ing the data world. There was certainly a Twitter discourse. They say imitation is the sincerest form of flattery, but I don't know if that applies here. Nevertheless, you're currently reading a blog post about data things that's probably wrong and has the word "unbundling" in it.
Building a recommendation engine inside Postgres with Python and Pandas
Just because you can do something doesn't always mean you should. Embedding all of your application logic directly in the database can make tracking migrations and releases difficult. At the same time, a complex pipeline that takes a nightly extract, loads it into Spark, generates results, and then feeds them back into the database isn't exactly lightweight. In the case of plpython3u and pandas, scheduling something like the above to run daily with pg_cron could be a much simpler solution. With a mix of SciPy, NumPy and Pandas there is a lot of interesting potential here, and I'd love to hear what practical uses others come up with @crunchydata. Or give it a try yourself: our database-as-a-service, Crunchy Bridge, comes preconfigured with plpython3u, SciPy, NumPy, and Pandas.
Software Developer (Machine Learning)
Must have a Master's Degree (or equivalent) in Computer Science, Engineering (any), Mathematics, or related field, plus one (1) year of IT experience. The one (1) year of IT experience must include experience with: Python, Machine Learning, Deep Learning, Natural Language Processing (NLP), AWS, Docker, Hive, Presto, Postgres, Neo4j, PowerBI, and Spotfire. In the alternative, we will accept a Bachelor's Degree (or equivalent) in Computer Science, Engineering (any), Mathematics, or related field, plus five (5) years of progressive post-baccalaureate IT experience. One (1) year of the five (5) years of progressive post-baccalaureate IT experience must include experience using Python, Machine Learning, Deep Learning, Natural Language Processing (NLP), AWS, Docker, Hive, Presto, Postgres, Neo4j, PowerBI, and Spotfire. All experience may be acquired concurrently.
Build a fully production ready machine learning app with Python Django, React and Docker
We are going to create a simple machine learning application with Django REST framework that predicts the species of a sample flower based on measurements of its features, namely the length and width of its sepals and petals. We have already covered this in great detail in a previous article. Please familiarize yourself with that article. We will use the same Django application here and make some modifications as required. In the previous article, the Django application was connected to a SQLite database.
PostgreSQL as the substructure for IoT and the next wave of computing
How PostgreSQL accidentally became the ideal platform for IoT applications and services. From mainframes (1950s-1970s), to Personal Computers (1980s-1990s), to smartphones (2000s-now), each wave brought us smaller, yet more powerful machines, that were increasingly plentiful and pervasive throughout business and society. We are now sitting on the cusp of another inflection point, or major release if you will, with computing so small and so common that it is becoming nearly as pervading as the air we breathe. With each wave, software developers and businesses initially struggle to identify the appropriate software infrastructure on which to develop their applications. But soon common platforms emerge: Unix; Windows; the LAMP stack; iOS/Android.
Making Postgres stored procedures 9X faster in Citus
This post by Marco Slot about Postgres stored procedures in Citus was originally published on the Azure Postgres Blog on Microsoft TechCommunity. Stored procedures are widely used in commercial relational databases. You write most of your application logic in PL/SQL and achieve notable performance gains by pushing this logic into the database. As a result, customers who are looking to migrate from other databases to PostgreSQL usually make heavy use of stored procedures. When migrating from a large database, using the Citus extension to distribute your database can be an attractive option, because you will always have enough hardware capacity to power your workload.
Senior Data Engineer ai-jobs.net
If you are ready to unleash your potential, it's time to start your career with Manulife/John Hancock. Manulife Financial Corporation is a leading international financial services group that helps people make their decisions easier and lives better. We operate primarily as John Hancock in the United States and Manulife elsewhere. We provide financial advice, insurance, as well as wealth and asset management solutions for individuals, groups and institutions. At the end of 2018, we had more than 34,000 employees, over 82,000 agents, and thousands of distribution partners, serving almost 28 million customers.