Big Data

Big Data Blog Posts That I Found Interesting – I am sure there are many others

Best Of Amazon AWS Big Data Blog
https://aws.amazon.com/blogs/big-data/orchestrate-multiple-etl-jobs-using-aws-step-functions-and-aws-lambda/
https://github.com/aws-samples/aws-etl-orchestrator
https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/best-practices.html#organizingstacks
https://aws.amazon.com/blogs/big-data/build-a-data-lake-foundation-with-aws-glue-and-amazon-s3/
https://aws.amazon.com/blogs/big-data/orchestrate-apache-spark-applications-using-aws-step-functions-and-apache-livy/
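As a rough illustration of the idea in the first post above (orchestrating ETL jobs with Step Functions and Lambda), here is a minimal hypothetical sketch; the function name, account id, and role ARN are placeholders, not resources from the post:

# define a one-state machine whose Task invokes an ETL Lambda (ARNs are placeholders)
cat > etl-state-machine.json <<'EOF'
{
  "Comment": "Minimal ETL orchestration sketch",
  "StartAt": "RunEtlJob",
  "States": {
    "RunEtlJob": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:run-etl-job",
      "End": true
    }
  }
}
EOF
# register the state machine with an execution role that can invoke the Lambda
aws stepfunctions create-state-machine \
  --name etl-orchestrator \
  --definition file://etl-state-machine.json \
  --role-arn arn:aws:iam::123456789012:role/StepFunctionsExecutionRole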

Preparing For The Amazon AWS Big Data Specialty Certification Test

Here is some of the material I used to prepare for the Amazon AWS Big Data Specialty Certification test. Let's start with a very high-level Hadoop 2.0 diagram. Note: Spark, Tez, and other Big Data products can and do use some Hadoop 2.0 components, like YARN and HDFS2, while replacing others, like MapReduce (a minimal sketch of this follows below).  Hadoop …

Preparing For The Amazon AWS Big Data Specialty Certification Test Read More »
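To make the note above concrete, here is a minimal, hypothetical sketch of running Spark on a Hadoop 2.0 cluster: Spark schedules through YARN and reads from HDFS, with MapReduce not involved at all. The class name, jar path, and HDFS path are placeholders.

# Spark uses YARN for scheduling and HDFS2 for storage, replacing MapReduce
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.WordCount \
  /opt/jobs/wordcount.jar hdfs:///data/input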

Installing Scala on RHEL 7 or Clones, and Do We Really Need This With Spark? No!

# you can install this Scala anywhere - I install it under /opt/hadoop with sudo / root
cd /opt/hadoop
sudo wget https://www.scala-lang.org/files/archive/scala-2.13.0-M5.tgz
sudo tar xzvf scala-2.13.0-M5.tgz
# move the extracted directory (not the tarball) to a shorter name
sudo mv scala-2.13.0-M5 scala213
# point alternatives at the scala binary itself, not the install directory
sudo alternatives --install /usr/bin/scala scala /opt/hadoop/scala213/bin/scala 1
# just verify where it is and that there are no other versions - at least that alternatives can see
alternatives --config scala
There is …

Installing Scala on RHEL 7 or Clones, and Do We Really Need This With Spark? No! Read More »
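Why the answer is No: Spark ships with its own Scala libraries on its classpath, so a separate system-wide Scala install is not needed just to run Spark. A quick way to confirm this (exact versions vary by Spark release):

# Spark reports the Scala version it was built with - no system scala required
spark-shell --version
# the banner includes a line like "Using Scala version 2.11.x",
# which comes from Spark's bundled Scala jars, not from /usr/bin/scala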

Using AWS EMR and Spark to Perform ETL

https://www.rittmanmead.com/blog/2016/12/etl-offload-with-spark-and-amazon-emr-part-1/
https://www.rittmanmead.com/blog/2016/12/etl-offload-with-spark-and-amazon-emr-part-2-code-development-with-notebooks-and-docker/
https://www.rittmanmead.com/blog/2016/12/etl-offload-with-spark-and-amazon-emr-part-3-running-pyspark-on-emr/
https://www.rittmanmead.com/blog/2016/12/etl-offload-with-spark-and-amazon-emr-part-4-analysing-the-data/
https://www.rittmanmead.com/blog/2016/12/etl-offload-with-spark-and-amazon-emr-part-5/
http://spark.apache.org/docs/latest/index.html
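The series above covers the full setup end to end; as a hedged sketch of just the submission step, here is one way to run a PySpark ETL script on an existing EMR cluster. The cluster id, script name, and S3 paths are placeholders, not values from the series.

# submit a PySpark script as a Spark step on a running EMR cluster
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
  --steps 'Type=Spark,Name=pyspark-etl,ActionOnFailure=CONTINUE,Args=[s3://my-bucket/scripts/etl_job.py,--input,s3://my-bucket/raw/,--output,s3://my-bucket/curated/]'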

Create Volume Test Data Using TPCH-KIT or TPCDS (tpcds-kit) on Linux and Copy the Data Into Redshift

General description of the process (a hedged sketch of these steps appears below):
1. Generate a fairly large volume of test data using the tpch-kit (setup described below)
2. Move the data to AWS S3 (same region as your Redshift cluster)
3. Set up an IAM role for the later S3-to-Redshift copy and create an AWS Redshift cluster (assigning the IAM role)
4. Download, configure …

Create Volume Test Data Using TPCH-KIT or TPCDS (tpcds-kit) on Linux and Copy the Data Into Redshift Read More »
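A hedged sketch of the steps listed above; the scale factor, bucket, table name, and role ARN are placeholders, and the COPY statement runs in a SQL client connected to the cluster:

# 1. generate test data with the tpch-kit dbgen tool (scale factor 10 is roughly 10 GB)
./dbgen -s 10
# 2. move a generated table file to S3, in the same region as the Redshift cluster
#    (dbgen's .tbl output is pipe-delimited; you may need to strip the trailing '|' it appends)
aws s3 cp lineitem.tbl s3://my-bucket/tpch/lineitem/lineitem.tbl
# 3. copy from S3 into Redshift using the IAM role assigned to the cluster:
#      COPY lineitem FROM 's3://my-bucket/tpch/lineitem/'
#      IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
#      DELIMITER '|';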