Redshift data ingestion deduplication / upsert (delete / insert) using a staging table – and a little tuning

Why would you do this?  In Redshift, blocks are immutable – any change rewrites the affected blocks completely (there are no partial block writes) – so a bulk delete / insert through a staging table is far cheaper than updating rows one at a time.
This example assumes that the distribution style, sort keys, and encoding / compression are already defined on the “real_table”, so a “create temp table (like)” puts the same encoding, distribution style, and sort keys on the temp table, and its data lands on adjacent slices on the same compute node.  Temp tables are also faster (only one write – no remote mirror write) and go away when the session ends, so they’re a good fit here.
Note: The use of “delete using” – the “using” clause lets you join a second table into the delete statement.
The “using” clause is specific to PostgreSQL / Redshift, to my knowledge.

-- create a transaction with the begin / end
begin;
  create temp table my_staging (like real_table);
  copy my_staging from 's3://bucket/my.csv' credentials '<credentials>' csv compupdate off;
  delete from real_table
  using my_staging
  where real_table.pk_field = my_staging.pk_field;
  insert into real_table select * from my_staging;
  drop table my_staging;
  -- no explicit commit needed - end transaction commits
  -- see the AWS Redshift command reference for more info:
  -- https://docs.aws.amazon.com/redshift/latest/dg/c_SQL_commands.html
end transaction;

BTW – there is no exception handling in the SQL above – if any statement fails, the whole transaction is automatically rolled back. The COPY command gives you a bit more control over errors, but plain PostgreSQL / Redshift SQL is simplistic and very basic here. I find it lacking – it would be nice to trap certain types of errors – which is probably quite doable with a more sophisticated scripting language like Python / boto3 driving Redshift.
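That said, when a COPY does blow up you can at least see why from SQL. A rough sketch (stl_load_errors is the Redshift system table that records load errors; the filters and columns shown are just one way to slice it):

-- inspect the most recent COPY failures
-- (stl_load_errors only shows your own errors unless you are a superuser)
select starttime, filename, line_number, colname, err_reason
from stl_load_errors
order by starttime desc
limit 10;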
Check out “alter table append” for moving large amounts of data from one table to another – more research necessary.
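Roughly, that looks like the following (hypothetical table names – note that alter table append moves the rows rather than copying them, leaving the source table empty, and I believe it can’t run inside a begin / end transaction block):

-- move all rows from a staging table into the target as a metadata operation
-- (see the Redshift ALTER TABLE APPEND docs for IGNOREEXTRA / FILLTARGET
--  when the column lists don't match exactly)
alter table real_table append from my_staging;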

Tuning this stuff

Make sure you use the distkey (user_id in this example) in your SQL – otherwise you’ll get
queries that span your entire cluster instead of staying on a single compute node.
Bad SQL will show DS_BCAST_INNER in the plan – BCAST meaning the inner table is broadcast across the cluster…

EXPLAIN
UPDATE all_events
SET user_id = test_table.user_id FROM test_table
WHERE all_events.event_id = test_table.event_id;
XN Hash Join DS_BCAST_INNER  (cost=12.50..418648749.25 rows=11015 width=71)
  Hash Cond: ("outer".event_id = "inner".event_id)
  ->  XN Seq Scan on all_events  (cost=0.00..1764430.88 rows=176443088 width=63)
  ->  XN Hash  (cost=10.00..10.00 rows=1000 width=16)
        ->  XN Seq Scan on test_table  (cost=0.00..10.00 rows=1000 width=16)

Versus – using the distkey – notice there is no BCAST:

EXPLAIN
UPDATE all_events
SET user_id = test_table.user_id FROM test_table
WHERE all_events.event_id = test_table.event_id and all_events.user_id = test_table.user_id;
QUERY PLAN
----------------------------------------------------------------------------------------------
XN Hash Join DS_DIST_NONE  (cost=15.00..355532837.33 rows=855 width=71)
  Hash Cond: (("outer".event_id = "inner".event_id) AND ("outer".user_id = "inner".user_id))
  ->  XN Seq Scan on all_events  (cost=0.00..1764430.88 rows=176443088 width=71)
  ->  XN Hash  (cost=10.00..10.00 rows=1000 width=16)
        ->  XN Seq Scan on test_table  (cost=0.00..10.00 rows=1000 width=16)
In this case the plan says DS_DIST_NONE – no redistribution at all – a very good thing for performance.
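If you’re not sure which column actually is the distkey on a table, you can check the catalog. A rough sketch (pg_table_def is the Redshift catalog view; it only shows tables in schemas on your search_path):

-- list columns and flag the distkey / sortkey for the example table
select "column", type, distkey, sortkey
from pg_table_def
where tablename = 'all_events';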
Reference:
https://heapanalytics.com/blog/engineering/redshift-pitfalls-avoid
Excellent re:Invent 2017 YouTube talk on all this and more:
https://www.youtube.com/watch?v=Q_K3qH5OYaM

