
Hadoop Cluster and Hive Installation Guide

Server Architecture

With Hadoop 2, the MR2 system YARN was introduced: the ResourceManager replaces and improves on the JobTracker of the old MR1 system. The ResourceManager is designed to manage only the individual NodeManagers; by distributing the tasks that used to pile up on the single JobTracker across the NodeManagers, this design relieves the old bottleneck.


[Figure 1] Yarn Architecture

Requirements

To avoid problems when integrating Hadoop ecosystem components such as Tajo and Spark into the cluster, installing a stable release is recommended; this guide installs the current stable version, 2.4.1.


1. system specification

hadoop : 2.4.1
zookeeper : 3.4.6
hive : 0.13.1
jvm : SUN JVM 1.7


2. system layout


[Figure 2] System Layout


Install the Oracle JVM as the default JVM.

$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java7-installer
do you accept the oracle binary code license terms -> <YES>

1. Hadoop Cluster Mode 설치

1. install & configuration

Download

http://www.eu.apache.org/dist/hadoop/common/hadoop-2.4.1/hadoop-2.4.1.tar.gz

Set up the local directories needed for the distributed environment.

$ sudo mkdir -p /data/hadoop/tmp
$ sudo mkdir -p /data/hadoop/dfs/name
$ sudo mkdir -p /data/hadoop/dfs/data


$HADOOP_HOME/etc/hadoop/core-site.xml:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://cloud0:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/hadoop/tmp</value>
  </property>
</configuration>


$HADOOP_HOME/etc/hadoop/hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/data/hadoop/dfs/name</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/data/hadoop/dfs/data</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
</configuration>


$HADOOP_HOME/etc/hadoop/mapred-site.xml:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/data/hadoop/hdfs/mapred</value>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/data/hadoop/hdfs/mapred</value>
  </property>
</configuration>


$HADOOP_HOME/etc/hadoop/yarn-site.xml:

<configuration>
  <!-- Site specific YARN configuration properties -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>cloud0:8025</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>cloud0:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>cloud0:8035</value>
  </property>
</configuration>


$HADOOP_HOME/etc/hadoop/masters

cloud0


$HADOOP_HOME/etc/hadoop/slaves

cloud1
cloud2
cloud3
cloud4
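The slaves file simply lists one worker hostname per line. For a numbered naming scheme like the one above it can also be generated; a small sketch, assuming the hadoop-2.4.1 layout used in this guide:

```shell
# Generate the slaves file, one worker hostname per line (cloud1..cloud4 are
# the worker names used in this guide; adjust the range to your own cluster).
HADOOP_HOME="${HADOOP_HOME:-$HOME/work/hadoop-2.4.1}"
mkdir -p "$HADOOP_HOME/etc/hadoop"
printf 'cloud%d\n' 1 2 3 4 > "$HADOOP_HOME/etc/hadoop/slaves"
```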


2. Setup passphraseless ssh

Generate and register a public key so that ssh connections work without a password. (Register the ssh key on every server in the cluster.)

$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

Once this is configured, you can connect without being asked for a password.

$ ssh localhost
Welcome to Ubuntu 12.10 (GNU/Linux 3.8.1-030801-generic x86_64)
 * Documentation:  https://help.ubuntu.com/
New release '14.04' available.
Run 'do-release-upgrade' to upgrade to it.
Last login: Thu Oct 2 11:18:13 2014 from 192.168.0.200
rocksea@rocksea:~$


3. Environment Variable Setup

$HADOOP_HOME/etc/hadoop/hadoop-env.sh

export JAVA_HOME=/usr/java/latest
export HADOOP_HOME=/home/rocksea/work/hadoop-2.4.1
export HADOOP_PREFIX=/home/rocksea/work/hadoop-2.4.1
export HADOOP_COMMON_LIB_NATIVE_DIR=/home/rocksea/work/hadoop/lib/native/

~/.bashrc

export JAVA_HOME=/usr/lib/jvm/java-7-oracle

export HADOOP_HOME=/home/cloud/work/hadoop

export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native

export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
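The exports above do not put Hadoop's bin/ and sbin/ directories on the PATH, so commands such as hadoop and start-all.sh would otherwise have to be called by absolute path. A line like the following handles that; this is an addition to the guide's ~/.bashrc, assuming the same directory layout:

```shell
# Append Hadoop's command directories to the PATH (HADOOP_HOME as exported in
# ~/.bashrc above; the default here is only a fallback for illustration).
export HADOOP_HOME="${HADOOP_HOME:-/home/cloud/work/hadoop}"
export PATH="$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin"
```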

Apply the settings.

$ source ~/.bashrc



4. Starting Hadoop

$ hadoop namenode -format

$ sbin/start-all.sh

(Both commands still work in Hadoop 2 but print deprecation warnings; hdfs namenode -format, and start-dfs.sh followed by start-yarn.sh, are the current equivalents.)


5. Checking the Result

Open http://cloud0:50070 and check.

[Figure 3] Hadoop Admin Summary


4. Zookeeper Setup

Install Zookeeper to manage the HBase region servers. (Not needed if you only use HDFS.)


1. Installing zookeeper

$ sudo apt-get install zookeeper
$ sudo vi /etc/zookeeper/conf/zoo.cfg


2. Configuring zookeeper

/etc/zookeeper/conf/zoo.cfg

server.1=cloud0:2888:3888
server.2=cloud1:2888:3888
server.3=cloud2:2888:3888
server.4=cloud3:2888:3888
server.5=cloud4:2888:3888
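In each server.N entry, the first port (2888) is used by followers to connect to the quorum leader and the second (3888) for leader election. These lines sit alongside the basic settings already present in the packaged zoo.cfg; roughly the following, with the package defaults shown only for orientation (verify against your own file):

tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181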

Enter the id value on each server:

cloud0 $ echo "1" | sudo tee /etc/zookeeper/conf/myid
cloud1 $ echo "2" | sudo tee /etc/zookeeper/conf/myid
cloud2 $ echo "3" | sudo tee /etc/zookeeper/conf/myid
cloud3 $ echo "4" | sudo tee /etc/zookeeper/conf/myid
cloud4 $ echo "5" | sudo tee /etc/zookeeper/conf/myid

(echo is piped to sudo tee because with "sudo echo 1 > file" the output redirection runs in the invoking shell without root privileges.)

Start zookeeper on each server:

$ sudo /usr/share/zookeeper/bin/zkServer.sh start

5. hadoop + hive Integration

1. Creating the hive directories and setting permissions

$ hadoop fs -mkdir /tmp
$ hadoop fs -mkdir -p /user/hive/warehouse
$ hadoop fs -chmod g+w /tmp
$ hadoop fs -chmod g+w /user/hive/warehouse


2. Download

http://www.eu.apache.org/dist/hive/stable/apache-hive-0.13.1-bin.tar.gz

Installation is just a matter of extracting the tarball.

$ sudo tar xvfz apache-hive-0.13.1-bin.tar.gz


3. Environment Variable Setup

~/.bashrc

export JAVA_HOME=/usr/lib/jvm/java-7-oracle
export HADOOP_HOME=/home/cloud/work/hadoop
export HIVE_HOME=/home/cloud/work/hive
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin:$HIVE_HOME/bin


Install mysql for the metastore.

$ sudo apt-get install mysql-server-5.5


Creating the Database & Schema

$ mysql -uroot -p
mysql> create database hive;
mysql> grant all privileges on *.* to hive@localhost identified by '1234' with grant option;
mysql> flush privileges;
mysql> use hive;
mysql> source $HIVE_HOME/scripts/metastore/upgrade/mysql/hive-schema-0.13.1.mysql.sql

(The mysql client does not expand shell variables, so substitute the actual Hive installation path for $HIVE_HOME in the source command.)


$HIVE_HOME/conf/hive-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <!-- fs.default.name : Hadoop NameNode server address -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://cloud0:9000</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://cloud0:3306/hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>1234</value>
  </property>
  <property>
    <name>datanucleus.autoCreateSchema</name>
    <value>false</value>
  </property>
  <property>
    <name>datanucleus.fixedDatastore</name>
    <value>true</value>
  </property>
</configuration>


4. Installing the jar library for the mysql connection

Download

http://dev.mysql.com/downloads/connector/j/

$ cp mysql-connector-java-5.1.25-bin.jar $HIVE_HOME/lib/
$ $HIVE_HOME/bin/schematool -dbType mysql -initSchema

(If the metastore schema was already created by sourcing the SQL script earlier, the -initSchema step can be skipped.)


5. Creating a table in the metastore

$ hive
hive> CREATE TABLE `test_data`(
        `name` string,
        `age` string)
      ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '\t'
      STORED AS
        INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
        OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
      LOCATION 'hdfs://cloud0:9000/user/hive/warehouse/test_data'
      TBLPROPERTIES (
        'numFiles'='1',
        'transient_lastDdlTime'='1415199885',
        'COLUMN_STATS_ACCURATE'='true',
        'totalSize'='43',
        'numRows'='0',
        'rawDataSize'='0');


Create the test.log file:

crazia	40
rocksea	31
yodle	30
whitelife	27
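Since the table above declares FIELDS TERMINATED BY '\t', the columns in test.log must be separated by real tab characters; printf makes that explicit (written to the current directory here rather than the /home/cloud/tmp path used below):

```shell
# Write the four tab-separated sample rows used in this guide.
printf '%s\t%s\n' crazia 40 rocksea 31 yodle 30 whitelife 27 > test.log
```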


Adding the file to hdfs and setting the metadata:

LOAD DATA LOCAL INPATH '/home/cloud/tmp/test.log' OVERWRITE INTO TABLE test_data;


6. Running mapreduce

hive> select count(*) from test_data;
Kill Command = /home/cloud/work/hadoop/bin/hadoop job -kill job_1415239021836_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2014-11-06 11:47:05,062 Stage-1 map = 0%, reduce = 0%
2014-11-06 11:47:10,304 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.28 sec
2014-11-06 11:47:18,687 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.09 sec
MapReduce Total cumulative CPU time: 3 seconds 90 msec
Ended Job = job_1415239021836_0004
MapReduce Jobs Launched:
Job 0: Map: 1  Reduce: 1  Cumulative CPU: 3.09 sec  HDFS Read: 254  HDFS Write: 2  SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 90 msec
OK
4
Time taken: 23.263 seconds, Fetched: 1 row(s)


[Figure 4] Map-reduce Status


An analysis example using hive:

http://rocksea.tistory.com/278


Pseudo-Distributed Mode installation example (for testing a simulated distributed setup on a single server):

http://rocksea.tistory.com/277

