
[bigdata] A data analysis example using hadoop + hive

We stored data amounting to 30,000,000 records per month (360,000,000 per year) in Hive and measured the performance of SQL queries that run as MapReduce jobs.

Hive creates a meta table and parses files using a delimiter, so we generated the log files with a fixed delimiter (\t), matched the log file format to the meta table columns, and then ran the tests.
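For reference, a single log line would look like the following. Every value here is a made-up sample, shown only to illustrate how the 16 tab-separated fields line up with the meta table columns defined in step 1 (the separators are tab characters):

20141015	125224	I	1	10.0.0.2	10.0.0.1	WEB	H0001	TR000001	1	120	7	G000001	D000001	00:11:22:33:44:55	payload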

Test server configuration

5 ubuntu servers in total

server 0 (master) : NameNode, SecondaryNameNode, ResourceManager, QuorumPeerMain  

server 1 : DataNode, NodeManager, QuorumPeerMain

server 2 : DataNode, NodeManager, QuorumPeerMain

server 3 : DataNode, NodeManager, QuorumPeerMain

server 4 : DataNode, NodeManager, QuorumPeerMain


step 1. Creating the meta table

hive>

    > CREATE TABLE test_data (

    >   date INT,

    >   time INT,

    >   inout CHAR(1),

    >   lem INT,

    >   dest STRING,

    >   src STRING,

    >   media_code CHAR(3),

    >   handl STRING,

    >   tr_code CHAR(8),

    >   display INT,

    >   tr_time INT,

    >   operator_no INT,

    >   global_id STRING,

    >   device_id STRING,

    >   mac_addr STRING,

    >   data STRING

    > )

    > ROW FORMAT DELIMITED

    > FIELDS TERMINATED BY '\t'

    > STORED AS TEXTFILE;

OK

Time taken: 0.647 seconds

hive> desc test_data;

OK

date                    int

time                    int

inout                   char(1)

lem                     int

dest                    string

src                     string

media_code              char(3)

handl                   string

tr_code                 char(8)

display                 int

tr_time                 int

operator_no             int

global_id               string

device_id               string

mac_addr                string

data                    string

Time taken: 0.236 seconds, Fetched: 16 row(s)
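If the logs keep accumulating month after month, a partitioned variant of this table would let a query scan only the months it needs instead of the whole dataset. This was not part of the test; it is a sketch, and the partition column name log_month is an assumption:

hive>
    > CREATE TABLE test_data_by_month (
    >   date INT, time INT, inout CHAR(1), lem INT,
    >   dest STRING, src STRING, media_code CHAR(3), handl STRING,
    >   tr_code CHAR(8), display INT, tr_time INT, operator_no INT,
    >   global_id STRING, device_id STRING, mac_addr STRING, data STRING
    > )
    > PARTITIONED BY (log_month STRING)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY '\t'
    > STORED AS TEXTFILE;

Each monthly log file would then be loaded into its own partition, e.g. PARTITION (log_month = '2014-10').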


step 2. Loading the data

Load the test log file we prepared into the table created above.

hive> LOAD DATA LOCAL INPATH '/home/cloud/script/test1.log' OVERWRITE INTO TABLE test_data;

Copying data from file:/home/cloud/script/test1.log

Copying file: file:/home/cloud/script/test1.log

Loading data to table default.test_data

Table default.test_data stats: [numFiles=1, numRows=0, totalSize=5070000000, rawDataSize=0]

OK

Time taken: 454.08 seconds

Loading 30,000,000 records took 454 seconds.
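Most of those 454 seconds go into copying the 5GB file from the local filesystem into HDFS, since LOAD DATA LOCAL INPATH performs a copy. If the log file were already in HDFS, dropping the LOCAL keyword makes the load essentially a metadata-level move within HDFS. The path below is hypothetical:

hive> LOAD DATA INPATH '/data/logs/test1.log' INTO TABLE test_data;

Note that INTO TABLE without OVERWRITE appends to the existing data, which is what you would want when loading one file per month.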


step 3. MapReduce via SQL

We ran a SQL query that counts the rows; Hive executes it as a MapReduce job.

hive> select count(*) from test_data;

Total jobs = 1

Launching Job 1 out of 1

Number of reduce tasks determined at compile time: 1

In order to change the average load for a reducer (in bytes):

  set hive.exec.reducers.bytes.per.reducer=<number>

In order to limit the maximum number of reducers:

  set hive.exec.reducers.max=<number>

In order to set a constant number of reducers:

  set mapreduce.job.reduces=<number>

Starting Job = job_1413294422186_0006, Tracking URL = http://cloud4:8088/proxy/application_1413294422186_0006/

Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1413294422186_0006

Hadoop job information for Stage-1: number of mappers: 19; number of reducers: 1

2014-10-15 12:52:24,076 Stage-1 map = 0%,  reduce = 0%

2014-10-15 12:52:31,341 Stage-1 map = 5%,  reduce = 0%, Cumulative CPU 3.5 sec

2014-10-15 12:52:32,389 Stage-1 map = 26%,  reduce = 0%, Cumulative CPU 17.87 sec

2014-10-15 12:52:33,424 Stage-1 map = 42%,  reduce = 0%, Cumulative CPU 27.97 sec

2014-10-15 12:52:34,459 Stage-1 map = 74%,  reduce = 0%, Cumulative CPU 48.85 sec

2014-10-15 12:52:35,497 Stage-1 map = 84%,  reduce = 0%, Cumulative CPU 60.3 sec

2014-10-15 12:52:36,531 Stage-1 map = 89%,  reduce = 0%, Cumulative CPU 65.06 sec

2014-10-15 12:52:37,564 Stage-1 map = 95%,  reduce = 0%, Cumulative CPU 69.56 sec

2014-10-15 12:52:39,632 Stage-1 map = 97%,  reduce = 0%, Cumulative CPU 73.88 sec

2014-10-15 12:52:41,707 Stage-1 map = 97%,  reduce = 32%, Cumulative CPU 74.2 sec

2014-10-15 12:52:44,809 Stage-1 map = 100%,  reduce = 32%, Cumulative CPU 75.51 sec

2014-10-15 12:52:45,841 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 76.53 sec

MapReduce Total cumulative CPU time: 1 minutes 16 seconds 530 msec

Ended Job = job_1413294422186_0006

MapReduce Jobs Launched:

Stage-Stage-1: Map: 19  Reduce: 1   Cumulative CPU: 76.53 sec   HDFS Read: 5070156967 HDFS Write: 9 SUCCESS

Total MapReduce CPU Time Spent: 1 minutes 16 seconds 530 msec

OK

30000000

Time taken: 31.4 seconds, Fetched: 1 row(s)

hive>


Running MapReduce over the 30,000,000 records took about 31 seconds.
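The same mechanism applies to any aggregation; for example, counting records per media_code also compiles to a single MapReduce job. This query was not part of the measurement and is shown only as a usage sketch:

hive> SELECT media_code, COUNT(*) FROM test_data GROUP BY media_code;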


Report 

100Mbps Network

File Size    HDFS Write        MapReduce
5G           454.08 sec        30.23 sec
60G          5745.569 sec      221.277 sec


1000Mbps Network

File Size    HDFS Write        MapReduce
5G           41.18 sec         23.48 sec
60G          647.34 sec        170.87 sec
300G         3573.81 sec       655.50 sec

