Hadoop Cluster And Hive Installation Guide
Server Architecture
With the move to Hadoop 2, the MR2 system YARN was introduced, and MR1's JobTracker was replaced by the improved ResourceManager. The ResourceManager is designed to manage only the individual NodeManagers; by distributing the tasks that used to pile up on the JobTracker across the NodeManagers, the old bottleneck is relieved.
[Figure 1] YARN Architecture
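Once the cluster is up ( see section 1 ), this relationship can be observed directly, since the ResourceManager only tracks the NodeManagers that register with it. As a quick illustration ( assuming the five-node layout used below ), listing the registered NodeManagers:
$ yarn node -list
should show cloud1 through cloud4 in the RUNNING state.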
Requirements
When building the distributed environment, installing a stable release is recommended so that no problems arise when integrating Hadoop ecosystem components such as Tajo and Spark; here we install the current stable version, 2.4.1.
1. System specification
hadoop : 2.4.1
zookeeper : 3.4.6
hive : 0.13.1
jvm : Oracle JVM 1.7
2. System layout
[Figure 2] System Layout
Install the Oracle JVM as the base JVM.
$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java7-installer
Do you accept the Oracle Binary Code license terms? -> <Yes>
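To verify the installation ( the exact update number will vary ):
$ java -version
java version "1.7.0_xx"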
1. Hadoop Cluster Mode Installation
1. Installation & configuration
Download
http://www.eu.apache.org/dist/hadoop/common/hadoop-2.4.1/hadoop-2.4.1.tar.gz
Set up the local directories needed for the distributed environment:
$ sudo mkdir -p /data/hadoop/tmp
$ sudo mkdir -p /data/hadoop/dfs/name
$ sudo mkdir -p /data/hadoop/dfs/data
$HADOOP_HOME/etc/hadoop/core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://cloud0:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/hadoop/tmp</value>
  </property>
</configuration>
$HADOOP_HOME/etc/hadoop/hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/data/hadoop/dfs/name</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/data/hadoop/dfs/data</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
</configuration>
$HADOOP_HOME/etc/hadoop/mapred-site.xml:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/data/hadoop/hdfs/mapred</value>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/data/hadoop/hdfs/mapred</value>
  </property>
</configuration>
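Note that /data/hadoop/hdfs/mapred referenced above is not among the directories created earlier; presumably it must be created the same way ( an assumption, mirroring the earlier mkdir steps ):
$ sudo mkdir -p /data/hadoop/hdfs/mapred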
$HADOOP_HOME/etc/hadoop/yarn-site.xml:
<configuration>
  <!-- Site specific YARN configuration properties -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>cloud0:8025</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>cloud0:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>cloud0:8035</value>
  </property>
</configuration>
$HADOOP_HOME/etc/hadoop/masters
cloud0
$HADOOP_HOME/etc/hadoop/slaves
cloud1
cloud2
cloud3
cloud4
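The same Hadoop build and configuration must exist on every node. One way to push the installation out to the slaves ( a sketch, assuming the same account and path on every host ):
$ for h in cloud1 cloud2 cloud3 cloud4; do
>   rsync -az ~/work/hadoop-2.4.1/ $h:~/work/hadoop-2.4.1/
> done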
2. Setup passphraseless ssh
Generate and register a public key so that ssh can log in without a password.
( Register the ssh key on every server in the cluster; see the sketch at the end of this step. )
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
Once this is set up, you can connect without being prompted for a password:
$ ssh localhost
Welcome to Ubuntu 12.10 (GNU/Linux 3.8.1-030801-generic x86_64)
* Documentation: https://help.ubuntu.com/
New release '14.04' available.
Run 'do-release-upgrade' to upgrade to it.
Last login: Thu Oct 2 11:18:13 2014 from 192.168.0.200
rocksea@rocksea:~$
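To register the key on the other cluster servers as well ( a sketch, assuming the same user account exists on every host ):
$ for h in cloud1 cloud2 cloud3 cloud4; do ssh-copy-id -i ~/.ssh/id_dsa.pub $h; done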
3. Set environment variables
$HADOOP_HOME/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
export HADOOP_HOME=/home/rocksea/work/hadoop-2.4.1
export HADOOP_PREFIX=/home/rocksea/work/hadoop-2.4.1
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
~/.bashrc
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
export HADOOP_HOME=/home/cloud/work/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
Apply the settings:
$ source ~/.bashrc
4. Run Hadoop
$ hdfs namenode -format
$ sbin/start-dfs.sh
$ sbin/start-yarn.sh
( start-all.sh still works but is deprecated in Hadoop 2; hadoop namenode -format is likewise deprecated in favor of hdfs namenode -format )
5. Check the result
Open http://cloud0:50070 in a browser and check the summary.
[Figure 3] Hadoop Admin Summary
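The daemons can also be checked from the shell with jps; roughly the following should appear ( process ids will differ ):
cloud0 $ jps
2101 NameNode
2345 SecondaryNameNode
2501 ResourceManager
2760 Jps
cloud1 $ jps
1840 DataNode
1952 NodeManager
2100 Jps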
2. Zookeeper Setup
Installed to manage HBase region servers. ( Not needed if you only use HDFS. )
1. Install zookeeper
$ sudo apt-get install zookeeper
$ sudo vi /etc/zookeeper/conf/zoo.cfg
2. Configure zookeeper
/etc/zookeeper/conf/zoo.cfg
server.1=cloud0:2888:3888
server.2=cloud1:2888:3888
server.3=cloud2:2888:3888
server.4=cloud3:2888:3888
server.5=cloud4:2888:3888
Set the id value on each server ( sudo does not apply to a shell redirection, so use tee ):
cloud0 $ echo "1" | sudo tee /etc/zookeeper/conf/myid
cloud1 $ echo "2" | sudo tee /etc/zookeeper/conf/myid
cloud2 $ echo "3" | sudo tee /etc/zookeeper/conf/myid
cloud3 $ echo "4" | sudo tee /etc/zookeeper/conf/myid
cloud4 $ echo "5" | sudo tee /etc/zookeeper/conf/myid
Start zookeeper on each server:
$ sudo /usr/share/zookeeper/bin/zkServer.sh start
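Each node can then be checked with the status command, or with ZooKeeper's four-letter ruok command ( the latter requires netcat ):
$ sudo /usr/share/zookeeper/bin/zkServer.sh status
$ echo ruok | nc localhost 2181
imok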
3. Hadoop + Hive Integration
1. Create hive directories and set permissions
$ hadoop fs -mkdir /tmp
$ hadoop fs -mkdir -p /user/hive/warehouse
$ hadoop fs -chmod g+w /tmp
$ hadoop fs -chmod g+w /user/hive/warehouse
2. Download
http://www.eu.apache.org/dist/hive/stable/apache-hive-0.13.1-bin.tar.gz
Installation is just a matter of extracting the tarball:
$ sudo tar xvfz apache-hive-0.13.1-bin.tar.gz
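HIVE_HOME below points at /home/cloud/work/hive, so the extracted directory presumably gets symlinked ( or renamed ) to match; an assumed step, mirroring the hadoop layout:
$ ln -s /home/cloud/work/apache-hive-0.13.1-bin /home/cloud/work/hive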
3. Set environment variables
~/.bashrc
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
export HADOOP_HOME=/home/cloud/work/hadoop
export HIVE_HOME=/home/cloud/work/hive
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin:$HIVE_HOME/bin
Install mysql for metastore integration:
$ sudo apt-get install mysql-server-5.5
Create the database & schema:
$ mysql -uroot -p
mysql> create database hive;
mysql> grant all privileges on *.* to hive@localhost identified by '1234' with grant option;
mysql> flush privileges;
mysql> use hive;
mysql> source /home/cloud/work/hive/scripts/metastore/upgrade/mysql/hive-schema-0.13.1.mysql.sql
( the mysql client does not expand shell variables like $HIVE_HOME, so give the actual path )
$HIVE_HOME/conf/hive-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <!-- fs.default.name : Hadoop NameNode server info -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://cloud0:9000</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://cloud0:3306/hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>1234</value>
  </property>
  <property>
    <name>datanucleus.autoCreateSchema</name>
    <value>false</value>
  </property>
  <property>
    <name>datanucleus.fixedDatastore</name>
    <value>true</value>
  </property>
</configuration>
4. Install the jar library for the mysql connection
Download
http://dev.mysql.com/downloads/connector/j/
$ cp mysql-connector-java-5.1.25-bin.jar $HIVE_HOME/lib/
$ $HIVE_HOME/bin/schematool -dbType mysql -initSchema
( skip this if the schema was already loaded with the mysql source command above; initSchema fails on an already-initialized metastore )
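As a quick sanity check that Hive can reach the metastore:
$ hive -e "show databases;"
OK
default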
5. Create a table in the metastore
$ hive
hive> CREATE TABLE `test_data`(
`name` string,
`age` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://cloud0:9000/user/hive/warehouse/test_data'
TBLPROPERTIES (
'numFiles'='1',
'transient_lastDdlTime'='1415199885',
'COLUMN_STATS_ACCURATE'='true',
'totalSize'='43',
'numRows'='0',
'rawDataSize'='0');
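The statement above appears to be SHOW CREATE TABLE output copied verbatim; the statistics in TBLPROPERTIES are not required when creating a fresh table. A minimal equivalent DDL ( STORED AS TEXTFILE expands to the same input/output formats ):
hive> CREATE TABLE test_data (
        name string,
        age string)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\t'
      STORED AS TEXTFILE;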
Create a test.log file ( tab-separated ):
crazia 40
rocksea 31
yodle 30
whitelife 27
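Since the table is tab-delimited, the fields must be separated by real tab characters. One way to create the file ( path taken from the LOAD statement below ):
$ printf 'crazia\t40\nrocksea\t31\nyodle\t30\nwhitelife\t27\n' > /home/cloud/tmp/test.log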
Add the file to hdfs and set up the metadata:
hive> LOAD DATA LOCAL INPATH '/home/cloud/tmp/test.log' OVERWRITE INTO TABLE test_data;
6. Run mapreduce
hive> select count(*) from test_data;
Kill Command = /home/cloud/work/hadoop/bin/hadoop job -kill job_1415239021836_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2014-11-06 11:47:05,062 Stage-1 map = 0%, reduce = 0%
2014-11-06 11:47:10,304 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.28 sec
2014-11-06 11:47:18,687 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.09 sec
MapReduce Total cumulative CPU time: 3 seconds 90 msec
Ended Job = job_1415239021836_0004
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 Cumulative CPU: 3.09 sec HDFS Read: 254 HDFS Write: 2 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 90 msec
OK
4
Time taken: 23.263 seconds, Fetched: 1 row(s)
[Figure 4] Map-reduce Status
An analysis example using hive:
http://rocksea.tistory.com/278
Pseudo-Distributed Mode installation example ( for simulated distributed testing on a single server ):
http://rocksea.tistory.com/277