While this post is a bit backwards (the installation steps are at the bottom), it has served me well for getting an initial single-node install of Hadoop (and its prerequisites) up and running on RHEL 7, and for running a very basic MapReduce job and some basic Spark work. It is a work in progress, please excuse the dust.
Installing Hive – easier said than done… no wonder Cloudera is so popular…
The links below describe a prerequisite that is not clearly stated at apache.org: it appears one database product must be pre-installed and then configured as the Hive metastore (see the hive-site.xml sketch after these links).
Installing JDBC drivers for Hive and MySQL / MariaDB:
http://backtobazics.com/big-data/4-steps-to-configure-hive-with-mysql-metastore-on-centos/
https://www.tutorialspoint.com/hive/hive_installation.htm
http://dbaclass.com/article/hive-installation-derby/
This may be an incomplete Hive install – a YouTube video – “It is simple”
Better yet – “Create and load a Hive table” – another YouTube video
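For reference, here is a minimal hive-site.xml sketch for a MySQL/MariaDB-backed metastore, using the standard Hive metastore connection properties. The database name, user, and password are placeholders for your own setup, and the MySQL JDBC connector jar still needs to be dropped into $HIVE_HOME/lib:

<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hiveuser</value>    <!-- placeholder -->
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hivepassword</value>    <!-- placeholder -->
  </property>
</configuration>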
BTW here is how to get rid of an annoyance
I don’t want to see this warning message anymore when running hadoop jobs…
2018-08-28 13:31:48,579 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
# so… add this line to your hadoop user's .bashrc
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib/native"
and “voila” no more warning… at least it worked for me 😉
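If you want to confirm the native libraries are actually being picked up (rather than just silencing the warning), Hadoop ships a checknative tool – a quick way to verify once your environment variables are set:

source ~/.bashrc
hadoop checknative -a
# prints which native libraries (hadoop, zlib, snappy, ...) were found and loaded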
A very simple Hadoop MapReduce example using Hadoop streaming
bin/hadoop jar share/hadoop/tools/lib/hadoop-streaming-3.1.1.jar \
    -input input \
    -output output \
    -mapper /bin/cat \
    -reducer /bin/wc
...
Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
File Input Format Counters
        Bytes Read=151240
File Output Format Counters
        Bytes Written=25
2018-08-28 13:30:54,528 INFO streaming.StreamJob: Output directory: output

[hadoop@single-node hadoop]$ hdfs dfs -ls output
2018-08-28 13:31:12,612 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
-rw-r--r--   1 hadoop supergroup          0 2018-08-28 13:30 output/_SUCCESS
-rw-r--r--   1 hadoop supergroup         25 2018-08-28 13:30 output/part-00000

[hadoop@single-node hadoop]$ hdfs dfs -ls output/part-00000
2018-08-28 13:31:38,835 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
-rw-r--r--   1 hadoop supergroup         25 2018-08-28 13:30 output/part-00000

[hadoop@single-node hadoop]$ hdfs dfs -cat output/part-00000
2018-08-28 13:31:48,579 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
   2746  21463 149890

[hadoop@single-node hadoop]$ hdfs dfs -ls input
2018-08-28 13:32:43,802 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
-rw-r--r--   1 hadoop supergroup     147144 2018-08-24 12:54 input/LICENSE.txt
[hadoop@single-node hadoop]$
A nice little spark-shell tutorial
https://data-flair.training/blogs/scala-spark-shell-commands/
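As a taste of what that tutorial covers, here is a minimal word-count session in spark-shell. It reads the same LICENSE.txt loaded into HDFS earlier – if your Spark is not wired up to HDFS yet, point it at a local file:// path instead:

scala> val lines = sc.textFile("input/LICENSE.txt")
scala> val counts = lines.flatMap(_.split("\\s+")).map(word => (word, 1)).reduceByKey(_ + _)
scala> counts.take(5).foreach(println)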
If you want to use Hadoop with Spark
There are references in the git spark-master README.md stating that Spark must be built against the specific version of Hadoop you are running – the Spark documentation is as “clear as mud” – so there will hopefully be more info in this section later:
http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version
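For what it's worth, the build command from that page looks roughly like this for a specific Hadoop version – the exact profile flags vary by Spark release, so treat this as a sketch rather than a recipe:

./build/mvn -Pyarn -Dhadoop.version=3.1.1 -DskipTests clean package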
Basic measurement of the capability of your hard drives
hdparm -t /dev/sda1
# if the command is not found then:
yum install hdparm -y
hdparm -t /dev/sda1
# your device names could be different – do "ls"; VMWare devices usually start
# with "x*", standard SATA/SAS with "sd*"
ls /dev/sd*
The output from hdparm should be over 70 MB/sec or you have a slug.
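hdparm -t only measures reads; if you also want a rough write number, a dd test like the one below is a common companion. The file name and size are arbitrary, and oflag=direct bypasses the page cache so the result reflects the disk:

dd if=/dev/zero of=/tmp/ddtest bs=1M count=1024 oflag=direct
rm -f /tmp/ddtest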
Note: I used Hadoop version 3.1.1 with Java 1.8 – Hadoop 3 requires Java 1.8 and will not work with earlier or later versions – I tried Java 10 (1.10) and it failed.
Tuning Hadoop – the Linux kernel and the hadoop instance
https://community.hortonworks.com/questions/46841/all-os-settings-for-redhat-cluster.html
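A couple of the OS-level settings that page (and most Hadoop tuning guides) call out can be applied like this – these are commonly recommended values, not something I have benchmarked here:

# lower swappiness so the kernel avoids swapping Hadoop JVM heap pages
echo "vm.swappiness = 1" >> /etc/sysctl.d/99-hadoop.conf
sysctl -p /etc/sysctl.d/99-hadoop.conf
# disable transparent huge pages, a frequent recommendation for Hadoop nodes
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag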
Running MapReduce on my little single-node Hadoop installation.
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar wordcount input output
hdfs dfs -cat output/part-r-00000
Once you have run a MapReduce job you can't run it again unless your run script removes the output directory, or you remove it manually – in my case the command is below.
hdfs dfs -rm -r -f output
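To avoid typing the cleanup every time, a tiny wrapper script can remove the output directory and rerun the job – a sketch using my paths, so adjust to yours:

#!/bin/bash
# rerun-wordcount.sh – hypothetical helper; assumes $HADOOP_HOME is set
hdfs dfs -rm -r -f output
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar wordcount input output
hdfs dfs -cat output/part-r-00000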
After the install detailed below – if you have problems with HADOOP_MAPRED_HOME
If you get an error like the one below while running MapReduce, simply do as it suggests – add the three property sections to the file $HADOOP_HOME/etc/hadoop/mapred-site.xml,
substituting:
<value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
for
<value>HADOOP_MAPRED_HOME=${full path of your hadoop distribution directory}</value>
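Put together, the three properties in mapred-site.xml end up looking like this (a sketch of the result, using ${HADOOP_HOME} as above):

<property>
  <name>yarn.app.mapreduce.am.env</name>
  <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
  <name>mapreduce.map.env</name>
  <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
<property>
  <name>mapreduce.reduce.env</name>
  <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>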
Here is the hadoop job that generated the error:
cd /opt/hadoop/hadoop
echo $HADOOP_HOME
/opt/hadoop/hadoop
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar wordcount input output
[2018-08-25 07:07:37.698]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
Error: Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster

Please check whether your etc/hadoop/mapred-site.xml contains the below configuration:
<property>
  <name>yarn.app.mapreduce.am.env</name>
  <value>HADOOP_MAPRED_HOME=${full path of your hadoop distribution directory}</value>
</property>
<property>
  <name>mapreduce.map.env</name>
  <value>HADOOP_MAPRED_HOME=${full path of your hadoop distribution directory}</value>
</property>
<property>
  <name>mapreduce.reduce.env</name>
  <value>HADOOP_MAPRED_HOME=${full path of your hadoop distribution directory}</value>
</property>

https://stackoverflow.com/questions/50927577/could-not-find-or-load-main-class-org-apache-hadoop-mapreduce-v2-app-mrappmaster
After the install below – here's a tutorial on MapReduce
hadoop fs and hdfs dfs commands (used after installation)
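A few of the commands I use most often, for reference (the paths are just examples):

hdfs dfs -mkdir -p input                # create a directory under your HDFS home
hdfs dfs -put LICENSE.txt input/        # copy a local file into HDFS
hdfs dfs -ls input                      # list it
hdfs dfs -cat output/part-r-00000       # print a result file
hdfs dfs -get output/part-r-00000 .     # copy a file back to the local filesystem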
Hadoop install from a few sources
Before installing Hadoop 3.x you need Java 8 – make sure that is done first (see the snippet after these links)
https://www.linode.com/docs/databases/hadoop/how-to-install-and-set-up-hadoop-cluster
https://www.vultr.com/docs/how-to-install-hadoop-in-stand-alone-mode-on-centos-7
https://acadgild.com/blog/hadoop-3-x-installation-guide
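On RHEL / CentOS 7, installing and verifying Java 8 looks like this, assuming the OpenJDK packages (Oracle's JDK works too but installs differently):

yum install -y java-1.8.0-openjdk java-1.8.0-openjdk-devel
java -version        # should report "1.8.0_..."
# point JAVA_HOME at the JDK, e.g. in the hadoop user's .bashrc:
export JAVA_HOME=$(dirname $(dirname $(readlink -f $(which java))))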
More from LonzoDB on AWS