Redshift Blog Post With a Performance and Cost Lean...

Some of this is a repeat for my own value from another of my blog posts – excuse my dust… this blog post will eventually be a very large Redshift only blog post – that I use for my own reference…

Nice Redshift Slide Deck on Performance – Shows Example of Redshift Ephemeral Cluster Use (this concept alone is worth the read)

(BDT401) Amazon Redshift Deep Dive: Tuning and Best Practices from Amazon Web Services

Diagnostic Queries For Query Analysis

Views
STL_QUERY
https://docs.aws.amazon.com/redshift/latest/dg/r_STL_QUERY.html
If you infrequently load data and query it, and consider your queries to be “adhoc”, you might want to capture a count of repeated statements across a long period of time to determine what is adhoc and what is not. Since the window of SQL statements preserved in STL_QUERY can vary based on load, you may want to store this data periodically in another table – possibly outside Redshift – another factor is the expense of executing a particular SQL statement.
STL_QUERY_METRICS
https://docs.aws.amazon.com/redshift/latest/dg/r_STL_QUERY_METRICS.html
Use the Amazon Redshift table_inspector.sql utility to see how database blocks in a distribution key map to the slices in a cluster and to see how data is distributed across a cluster.

select distkey, count(*) from public.distribution_skew group by distkey having count(*) > 1 order by 2 desc;

As documented in the link here and much more (don’t miss this post as an entry point to more advanced techniques):

https://aws.amazon.com/premiumsupport/knowledge-center/redshift-high-disk-usage/

https://docs.aws.amazon.com/redshift/latest/dg/diagnostic-queries-for-query-tuning.html

Most excellent five part Redshift blogs plus

https://aws.amazon.com/blogs/big-data/amazon-redshift-engineerings-advanced-table-design-playbook-preamble-prerequisites-and-prioritization/
https://aws.amazon.com/blogs/big-data/optimizing-for-star-schemas-and-interleaved-sorting-on-amazon-redshift/
https://docs.aws.amazon.com/redshift/latest/dg/tutorial-configuring-workload-management.html

Redshift cost containment and troubleshooting

Note – Redshift clusters cannot be “started and shutdown” they can only be “created and deleted”. They can be rebooted, for parameter group changes that require a restart and there is a “Modify Cluster and Resize Cluster” for network and size related changes – but no clean shutdown and start back up.
So, if you want to “shut” one down and keep the data, take a manual snapshot BEFORE you delete it. Then you’ll need to create a new cluster with the snapshot – recreating assigned roles etc to get it back. You pay for that manual snapshot but (and I don’t know exactly off the top of my head) but it has to be quite a bit less than the hourly charge for an on-demand cluster.
The restore time for a small database was about the same as spinning up a new cluster – your mileage will vary…
https://blog.atomdata.io/the-hitchhikers-guide-to-aws-redshift-part-2-let-s-cut-costs-cfdf2d67293b
https://blog.atomdata.io/the-hitchhikers-guide-to-redshift-part-1-with-great-power-comes-performance-issues-748d293e0b18

Redshift Backups – Automated and Manual

Redshift by default automatically backs up your database cluster on creation, and once every 8 hours or every 5GB of new data whichever comes first and this backup is retained for 1 day.
What does retained for 1 day mean exactly? Well the good news is it means you will always have a backup of your database if you use automated backup. The bad news is you cannot go back further than your “retained days” if you restore with your automated backup further than the “retention window” which is by default one day.
One day of automated backups is free. If you extend the retention period past the default one day, you have to pay and for the incremental storage past 1 day…
And oh yes, these automated backups are incremental and only Amazon has access to the physical files, you don’t have to worry about how they are keeping all the “incrementals” around, it just happens – and it is a good thang ;-)…
When you restore a snapshot – whether it is an automated snapshot or a manual snapshot – you are creating a new cluster. You can do an individual table restore to an existing cluster.
You can create manual snapshots (yes you are charged for the storage of these) either from the console, or via a script and manual snapshots stay around until you manually delete them. Manual snapshots are not incremental.
Snapshot Manager to create snapshots using scheduled Lambda (I think) – In Redshift Utils Github HERE
This blog explains a similar approach pretty well.
If you are not using a large Redshift database very frequently, and don’t want to go thru the time consuming act of restoring it every time you want the database “up”, here’s another possible approach or two:
1.) keep your data in Spectrum (which is S3), and partition it in S3… The metadata will reside in Redshift, but the DB will be small, and backups and restores will be fast and cheap… I haven’t tried this but, it sounds reasonable to me.
Here’s another backup approach:
https://www.codementor.io/petrhecko/backup-your-aws-redshift-cluster-properly-and-sleep-better-mlergt4xr
https://aws.amazon.com/blogs/big-data/manage-query-workloads-with-query-monitoring-rules-in-amazon-redshift/
https://www.periscopedata.com/blog/helpful-redshift-admin-queries

Redshift Loading Data

When Using Staging Tables For Logical Updates

https://docs.aws.amazon.com/redshift/latest/dg/t_updating-inserting-using-staging-tables-.html

Post Views: 9

Redshift Blog Post With a Performance and Cost Lean…

Nice Redshift Slide Deck on Performance – Shows Example of Redshift Ephemeral Cluster Use (this concept alone is worth the read)

Top 10 Performance Tuning Techniques For Amazon Redshift

Improving Redshift Query Performance Through Effective Use Of Zone Maps

Improving Redshift Query Performance by Decreasing Network Traffic