Installing and using Hadoop and Spark from scratch – single node and cluster

While this post is a bit backwards (the installation is at the bottom), it has served me well for getting an initial install of Hadoop and its prerequisites up and running on RHEL 7, and for running a very basic MapReduce job and some basic Spark work – at least for a single-node install. It is a work in progress; please excuse the dust.

Installing Hive – easier said than done… no wonder Cloudera is so popular…

The links below cover a prerequisite that is not clearly stated at apache.org: it appears one database product must be pre-installed and then configured before Hive will run.
Installing JDBC drivers for Hive and MySQL/MariaDB:
http://backtobazics.com/big-data/4-steps-to-configure-hive-with-mysql-metastore-on-centos/
https://www.tutorialspoint.com/hive/hive_installation.htm
http://dbaclass.com/article/hive-installation-derby/
This one may be an incomplete Hive install – a YouTube video titled “It is simple”
Better yet – Create and load a Hive table – another YouTube video
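
The heart of the MySQL-metastore setup in those links is four JDBC properties in hive-site.xml – a minimal sketch, where metastore, hiveuser and hivepassword are placeholder values you would create in MySQL/MariaDB first (the Connector/J jar must also be on Hive’s classpath):

<!-- hive-site.xml (sketch) – point the metastore at MySQL/MariaDB -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost/metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepassword</value>
</property>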

BTW, here is how to get rid of an annoyance

I don’t want to see this warning message anymore when running Hadoop jobs…

2018-08-28 13:31:48,579 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

# so… add this line to your hadoop user’s .bashrc

export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib/native"

and “voila” – no more warning… at least it worked for me 😉

A very simple Hadoop MapReduce job / example using Hadoop Streaming – cat as the mapper, wc as the reducer

bin/hadoop jar share/hadoop/tools/lib/hadoop-streaming-3.1.1.jar \
-input input \
-output output \
-mapper /bin/cat \
-reducer /bin/wc
...
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters
		Bytes Read=151240
	File Output Format Counters
		Bytes Written=25
2018-08-28 13:30:54,528 INFO streaming.StreamJob: Output directory: output
[hadoop@single-node hadoop]$ hdfs dfs -ls output
2018-08-28 13:31:12,612 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
-rw-r--r--   1 hadoop supergroup          0 2018-08-28 13:30 output/_SUCCESS
-rw-r--r--   1 hadoop supergroup         25 2018-08-28 13:30 output/part-00000
[hadoop@single-node hadoop]$ hdfs dfs -ls output/part-00000
2018-08-28 13:31:38,835 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
-rw-r--r--   1 hadoop supergroup         25 2018-08-28 13:30 output/part-00000
[hadoop@single-node hadoop]$ hdfs dfs -cat output/part-00000
2018-08-28 13:31:48,579 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
   2746   21463  149890
[hadoop@single-node hadoop]$ hdfs dfs -ls input
2018-08-28 13:32:43,802 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
-rw-r--r--   1 hadoop supergroup     147144 2018-08-24 12:54 input/LICENSE.txt
[hadoop@single-node hadoop]$

A nice little spark-shell tutorial

https://data-flair.training/blogs/scala-spark-shell-commands/
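
For a quick smoke test before working through the tutorial – a minimal sketch, assuming SPARK_HOME points at your Spark install and that the Hadoop LICENSE.txt is still under /opt/hadoop/hadoop (sc, the SparkContext, is predefined in the shell):

# launch the interactive Scala shell
$SPARK_HOME/bin/spark-shell
# then, at the scala> prompt, a classic word count over a local file:
#   val lines  = sc.textFile("file:///opt/hadoop/hadoop/LICENSE.txt")
#   val counts = lines.flatMap(_.split("\\s+")).map(w => (w, 1)).reduceByKey(_ + _)
#   counts.take(5).foreach(println)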

If you want to use Hadoop with Spark

There are references in the spark-master README.md on GitHub stating that Spark must be built specifically against the version of Hadoop you will run it with – the Spark documentation is as “clear as mud” – so there will hopefully be more info in this section later:
http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version
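
The gist of that page is passing your Hadoop version to Maven – a sketch, assuming a Spark source checkout and the Hadoop 3.1.1 used in this post (the exact profiles available vary by Spark release, so check the page for yours):

# from the top of the Spark source tree
./build/mvn -Pyarn -Dhadoop.version=3.1.1 -DskipTests clean package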

Basic measurement of the capability of your hard drives

hdparm -t /dev/sda1
# if the command is not found then
yum install hdparm -y
hdparm -t /dev/sda1
# your device names may differ – check with ls: Xen/paravirtual disks usually start with "xvd*",
# KVM virtio with "vd*", and standard SATA/SAS (including most VMware guests) with "sd*"
ls /dev/sd* /dev/xvd* /dev/vd* 2>/dev/null

The output from hdparm should be over 70 MB/sec, or you have a slug.
Note: I used Hadoop version 3.1.1 with Java 1.8 – Hadoop 3 requires Java 1.8 and will not work with earlier or later versions – I tried Java 10 and it failed.
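
A quick sanity check before going any further:

java -version   # should report "1.8.0_..." – per the note above, newer JDKs fail with Hadoop 3.1.1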

Tuning Hadoop – the Linux kernel and the Hadoop instance

https://community.hortonworks.com/questions/46841/all-os-settings-for-redhat-cluster.html
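
The gist of the linked thread – a sketch of the two most commonly recommended OS tweaks on RHEL/CentOS 7 (run as root; these do not persist across reboots, and the full list in the link goes much further):

# disable transparent hugepages – a near-universal Hadoop recommendation
echo never > /sys/kernel/mm/transparent_hugepage/enabled
# discourage the kernel from swapping out JVM heap pages
sysctl -w vm.swappiness=10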

Running MapReduce on my little single-node Hadoop installation

hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar wordcount input output
hdfs dfs -cat output/part-r-00000

Once you have run a MapReduce job, you can’t run it again unless your run script removes the output directory or you remove it manually – in my case the command is below.

hdfs dfs -rm -r -f output
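
Putting the two together, a small re-runnable sequence using only the commands above (run from $HADOOP_HOME):

hdfs dfs -rm -r -f output    # clear the previous run's output
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar wordcount input output
hdfs dfs -cat output/part-r-00000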

After the install detailed below – if you have problems with HADOOP_MAPRED_HOME

If you get an error like the one below while running MapReduce, simply do as the message suggests – add the three property sections to the file $HADOOP_HOME/etc/hadoop/mapred-site.xml,
substituting:

<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>

for

<value>HADOOP_MAPRED_HOME=${full path of your hadoop distribution directory}</value>
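
With HADOOP_HOME exported (as in the job transcript below), the three sections end up looking like this – a sketch using ${HADOOP_HOME} as suggested above, which works as long as HADOOP_HOME is set in the environment of the YARN daemons:

<property>
  <name>yarn.app.mapreduce.am.env</name>
  <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
  <name>mapreduce.map.env</name>
  <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
  <name>mapreduce.reduce.env</name>
  <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>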

Here is the hadoop job that generated the error:

cd /opt/hadoop/hadoop
echo $HADOOP_HOME
/opt/hadoop/hadoop
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar wordcount input output
[2018-08-25 07:07:37.698]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
Error: Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster
Please check whether your etc/hadoop/mapred-site.xml contains the below configuration:
<property>
  <name>yarn.app.mapreduce.am.env</name>
  <value>HADOOP_MAPRED_HOME=${full path of your hadoop distribution directory}</value>
</property>
<property>
  <name>mapreduce.map.env</name>
  <value>HADOOP_MAPRED_HOME=${full path of your hadoop distribution directory}</value>
</property>
<property>
  <name>mapreduce.reduce.env</name>
  <value>HADOOP_MAPRED_HOME=${full path of your hadoop distribution directory}</value>
</property>
https://stackoverflow.com/questions/50927577/could-not-find-or-load-main-class-org-apache-hadoop-mapreduce-v2-app-mrappmaster

After the install below – here’s a tutorial on MapReduce

Hadoop – Running a Wordcount Mapreduce Example

hadoop fs and hdfs dfs commands (used after installation)

16 Hadoop fs Commands Every Data Engineer Must Know

Top HDFS Commands
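
A few of the everyday commands from those lists, as a quick reference (the paths are just examples):

hdfs dfs -mkdir -p /user/hadoop/input    # create a directory in HDFS
hdfs dfs -put LICENSE.txt input/         # copy a local file into HDFS
hdfs dfs -ls input                       # list a directory
hdfs dfs -cat input/LICENSE.txt          # print a file to stdout
hdfs dfs -get input/LICENSE.txt /tmp/    # copy a file back to the local disk
hdfs dfs -rm -r -f output                # remove a directory recursively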

Hadoop install from a few sources

Before installing Hadoop 3.x you need Java 8 – make sure that is done first.

How to Install JAVA 8 on CentOS/RHEL 7/6 and Fedora 28-23
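
On RHEL/CentOS 7 the short version is a single yum install – a sketch, assuming the stock OpenJDK 8 packages are acceptable:

yum install -y java-1.8.0-openjdk java-1.8.0-openjdk-devel
java -version    # confirm it reports 1.8.0_x
# derive JAVA_HOME for .bashrc / hadoop-env.sh (yields the JRE path, which Hadoop accepts):
export JAVA_HOME=$(dirname $(dirname $(readlink -f $(which java))))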

How to Setup Hadoop 3.1 on CentOS & Fedora


https://www.linode.com/docs/databases/hadoop/how-to-install-and-set-up-hadoop-cluster
https://www.vultr.com/docs/how-to-install-hadoop-in-stand-alone-mode-on-centos-7
https://acadgild.com/blog/hadoop-3-x-installation-guide
