Create Volume Test Data Using TPCH-KIT or TPCDS (tpcds-kit) on Linux and Copy the Data Into Redshift

General description of the process

  • Generate a fairly large volume of test data using the tpch-kit (setup described below)
  • Move the data to AWS S3 (same region as your Redshift cluster)
  • Set up an IAM role for the S3-to-Redshift copy and create an AWS Redshift cluster (assigning the IAM role)
  • Download, configure, and connect a desktop client (Aginity Workbench) to Redshift to issue the copy command
  • Issue the S3-to-Redshift copy command
  • View the results

Generating the test data

How to generate volume data using https://github.com/gregrahn/tpch-kit for big data testing.
Note: I generated this test data on an on-premises RHEL 6 VM (my machine/VM "octobeast virtualbox o69122"), but you could do exactly the same thing on an EC2 instance, and then moving the data to S3 is much quicker.

cd $HOME
yum install -y make git gcc
yum install -y bison flex byacc
git clone https://github.com/gregrahn/tpch-kit
# create a directory which will hold the generated data
mkdir $HOME/redshiftdata
# set the environment variable that dbgen will write its output to
export DSS_PATH=$HOME/redshiftdata
# cd into the directory that contains the data generator source and build it
cd tpch-kit/dbgen
make OS=LINUX
# -T o generates the orders and lineitem tables; -s 10 sets scale factor 10
./dbgen -v -T o -s 10
cd $HOME/redshiftdata
ls
lineitem.tbl  orders.tbl
[root@ol69stby redshiftdata]# ls -al
total 9228472
drwxr-xr-x   2 root root       4096 Jun 30 15:51 .
dr-xr-x---. 27 root root       4096 Jun 30 10:49 ..
-rw-r--r--   1 root root 7715741636 Jun 30 15:56 lineitem.tbl
-rw-r--r--   1 root root 1734195031 Jun 30 15:56 orders.tbl
[root@ol69stby redshiftdata]#

Split the data files into smaller files – Redshift loads many smaller files in parallel, so the load is faster

wc -l lineitem.tbl
48000000 lineitem.tbl
# divide the line count by 4 – in this case 48000000 / 4 = 12000000 lines per output file
split -d -l 12000000 -a 4 lineitem.tbl lineitem.tbl.
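
The split produces four pipe-delimited chunks with four-digit numeric suffixes (this follows from the -d and -a 4 options above); a quick check:

ls lineitem.tbl.0*
lineitem.tbl.0000  lineitem.tbl.0001  lineitem.tbl.0002  lineitem.tbl.0003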

Copy your test data into S3

Create your bucket and prefix (lonzodbredshift/redshiftdata) using the AWS Console or CLI before issuing the copy below

aws s3 cp /root/redshiftdata s3://lonzodbredshift/redshiftdata --recursive

Note: the command above uses the AWS CLI – if you need to know how to set it up, look here.
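
To confirm that everything landed in S3 before loading, a listing like the following works (same example bucket as above):

aws s3 ls s3://lonzodbredshift/redshiftdata/ --human-readable --summarize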

Load S3 Data Into Redshift

If you have a Redshift cluster and an IAM role that grants Redshift access to S3, you can simply use the COPY command below from a client connected to the Redshift cluster:

copy lineitem from 's3://lonzodbredshift/redshiftdata/lineitem.tbl' credentials 'aws_iam_role=arn:aws:iam::xxxxxxx:role/redshiftLoad'
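
A slightly more explicit form of the same command is sketched below. The IAM_ROLE, DELIMITER, and REGION parameters are standard COPY options (the defaults already match TPC-H's pipe-delimited .tbl files, the bucket and role ARN are the placeholders used above, and the region shown is only an example); because the S3 key acts as a prefix, the split files lineitem.tbl.0000 onwards are all picked up:

copy lineitem
from 's3://lonzodbredshift/redshiftdata/lineitem.tbl'
iam_role 'arn:aws:iam::xxxxxxx:role/redshiftLoad'
delimiter '|'
region 'us-east-1';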

Note: I used the free, Windows-based Aginity Workbench desktop client to connect to the cluster, as shown here.
Here is the DDL to create the Redshift table to be loaded (yes, do this before running the COPY command above; note the table here is named LINEITEMWPRIM – if you keep that name, adjust the table name in the COPY command to match):

CREATE TABLE LINEITEMWPRIM (
L_ORDERKEY BIGINT NOT NULL, -- references O_ORDERKEY
L_PARTKEY BIGINT NOT NULL, -- references P_PARTKEY (compound fk to PARTSUPP)
L_SUPPKEY BIGINT NOT NULL, -- references S_SUPPKEY (compound fk to PARTSUPP)
L_LINENUMBER INTEGER,
L_QUANTITY DECIMAL,
L_EXTENDEDPRICE DECIMAL,
L_DISCOUNT DECIMAL,
L_TAX DECIMAL,
L_RETURNFLAG CHAR(1),
L_LINESTATUS CHAR(1),
L_SHIPDATE DATE,
L_COMMITDATE DATE,
L_RECEIPTDATE DATE,
L_SHIPINSTRUCT CHAR(25),
L_SHIPMODE CHAR(10),
L_COMMENT VARCHAR(44),
PRIMARY KEY (L_ORDERKEY, L_LINENUMBER)
);

By default, Redshift applies automatic compression analysis when a COPY command is run against an empty table: it samples the incoming data, alters the table to set a compression encoding on each column, and then loads the data compressed.
Displaying the DDL afterwards shows the automatic change, as below.

CREATE TABLE public.lineitem
(
l_orderkey BIGINT NOT NULL ENCODE delta,
l_partkey BIGINT NOT NULL ENCODE lzo,
l_suppkey BIGINT NOT NULL ENCODE lzo,
l_linenumber INTEGER ENCODE delta,
l_quantity NUMERIC(18, 0) ENCODE delta,
l_extendedprice NUMERIC(18, 0) ENCODE lzo,
l_discount NUMERIC(18, 0) ENCODE lzo,
l_tax NUMERIC(18, 0) ENCODE lzo,
l_returnflag CHAR(1) ENCODE lzo,
l_linestatus CHAR(1) ENCODE lzo,
l_shipdate DATE ENCODE delta32k,
l_commitdate DATE ENCODE delta32k,
l_receiptdate DATE ENCODE delta32k,
l_shipinstruct CHAR(25) ENCODE bytedict,
l_shipmode CHAR(10) ENCODE bytedict,
l_comment VARCHAR(44) ENCODE text255
)
DISTSTYLE EVEN;
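
You can also confirm the per-column encodings without regenerating the DDL by querying the pg_table_def catalog view (a quick check; the table name matches the example above):

select "column", type, encoding
from pg_table_def
where tablename = 'lineitem';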

If you do not have an IAM role that grants the S3 access policies, here is how to set one up.

Generally the process is:

  • Use AWS IAM to create a role that can be assigned to Redshift, allowing Redshift to read from S3
  • Attach S3 access policies to the role
  • Assign the role to the Redshift cluster (easiest at cluster creation time, but it can be done afterwards) – see the CLI sketch below
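
If you prefer the CLI to the console, the same three steps look roughly like this (the role name, cluster identifier, and the broad AmazonS3ReadOnlyAccess managed policy are illustrative choices, not from the original post):

# trust policy letting Redshift assume the role
cat > redshift-trust.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "Service": "redshift.amazonaws.com" },
    "Action": "sts:AssumeRole"
  }]
}
EOF
aws iam create-role --role-name redshiftLoad --assume-role-policy-document file://redshift-trust.json
aws iam attach-role-policy --role-name redshiftLoad --policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
# attach the role to an existing cluster
aws redshift modify-cluster-iam-roles --cluster-identifier my-dw-instance --add-iam-roles arn:aws:iam::xxxxxxx:role/redshiftLoad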

Generating an Alternative Test Data Set with tpcds-kit – a much larger dataset – requires a 200 GB EBS volume

cd $HOME
yum install -y make git gcc flex bison byacc
git clone https://github.com/gregrahn/tpcds-kit
cd tpcds-kit/tools
# Note: the directory above contains scripts to generate large test data files - you must customize them to set the scale size, and the scale size cannot be too small
make OS=LINUX   # or just "make" likely works
cd $HOME
git clone https://github.com/sko71/hands-on-with-redshift.git

# the git clone above is from an A Cloud Guru big data course and you will have to customize the scripts to fit your EC2 instance and directory paths
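
Under the hood those scripts call dsdgen from tpcds-kit; a minimal manual invocation looks roughly like the sketch below (the scale factor, output directory, and number of parallel streams are illustrative choices, using dsdgen's usual flag names):

cd $HOME/tpcds-kit/tools
mkdir -p /bigdata/tpcds
# generate roughly 100 GB of raw data in 4 parallel streams
for i in 1 2 3 4; do
  ./dsdgen -SCALE 100 -DIR /bigdata/tpcds -PARALLEL 4 -CHILD $i -FORCE &
done
wait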

Installing PostgreSQL on EC2 so you can use the psql client to load data into Redshift

FYI – this worked on an Amazon Linux 2 AMI, which I cloned and then customized (Amazon Linux 2 is RHEL 7 compatible, which is why the RHEL 7 repo below works). You can verify the release as follows:

cat /etc/system-release
Amazon Linux release

Now set up a repo and install postgresql10 to get the psql client. Unfortunately there is no client-only install, so the entire database software is installed, but no database is initialized. (By the way, you obviously don't have to use sudo if you are doing this as root – I used root.)

sudo yum install -y https://download.postgresql.org/pub/repos/yum/10/redhat/rhel-7-x86_64/pgdg-redhat10-10-2.noarch.rpm
sudo sed -i "s/rhel-\$releasever-\$basearch/rhel-latest-x86_64/g" "/etc/yum.repos.d/pgdg-10-redhat.repo"
sudo yum install -y postgresql10
psql --version
psql (PostgreSQL) 10.4
# connect to redshift
psql -h <redshift-endpoint-without-port> -p <port> -U <username> -d <databaseName>
psql -h my-dw-instance.XXXXXX.redshift.amazonaws.com -p 5439 --username mf --dbname mydb
# setup / create the DDL / tables in redshift for the copy command
\i /root/hands-on-with-redshift/tpcdsddl.sql
# output would be CREATE TABLE... 8 times
# list the tables created
\d
# Ctrl-D (or \q) exits the session
# a sample connection looks like this:
Password for user mfoster:
psql (10.4, server 8.0.2)
SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 256, compression: off)
Type "help" for help.
mydb=# help
You are using psql, the command-line interface to PostgreSQL.
Type: \copyright for distribution terms
\h for help with SQL commands
\? for help with psql commands
\g or terminate with semicolon to execute query
\q to quit

To pass the password into psql it may be possible to:

  • use export PGPASSWORD=password in a bash script – this worked for me to connect to Redshift, though it is very insecure
  • put the password in an encrypted file
  • use the .pgpass file (see the sketch after the note below)

Note: I don’t have the details on the last two of these techniques – more research necessary
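
For the .pgpass route, the standard libpq format should apply to Redshift connections as well (a sketch – the endpoint, database, user, and password below are just the placeholders used earlier):

# ~/.pgpass holds one line per connection: hostname:port:database:username:password
echo 'my-dw-instance.XXXXXX.redshift.amazonaws.com:5439:mydb:mf:MySecretPassword' >> ~/.pgpass
# libpq ignores the file unless its permissions are restrictive
chmod 600 ~/.pgpass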

GITHUB AWSLABS / amazon-redshift-utils

https://github.com/awslabs/amazon-redshift-utils
  • Scroll down to the admin utilities
  • Select the "collection of utilities"
  • Pick table_info and run the script – a sketch of running it with psql follows
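
Running one of those admin scripts generally just means executing the SQL file against the cluster with psql; a sketch, reusing the earlier connection placeholders (the exact path inside the repo may differ between versions):

git clone https://github.com/awslabs/amazon-redshift-utils.git
psql -h my-dw-instance.XXXXXX.redshift.amazonaws.com -p 5439 -U mf -d mydb \
     -f amazon-redshift-utils/src/AdminScripts/table_info.sql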
I believe I had to reboot the Redshift cluster after assigning the "redshiftLoad" role to it before the role association took effect.
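
If you want to do that reboot from the CLI rather than the console, the standard command is (the cluster identifier is the example used earlier):

aws redshift reboot-cluster --cluster-identifier my-dw-instance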

Error Handling on the Copy

Error handling can be accomplished in several ways; one is to check the following system tables:

STL_LOAD_ERRORS
STL_LOADERROR_DETAIL
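
A typical first look after a failed COPY is to pull the most recent rows from STL_LOAD_ERRORS (the columns referenced below are part of that system table):

select starttime, filename, line_number, colname, err_code, err_reason
from stl_load_errors
order by starttime desc
limit 10;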