[bigdata] Data analysis example using hadoop + hive
Using Hive, I stored data amounting to 30,000,000 records per month (360,000,000 per year) and measured the performance of SQL queries executed as MapReduce jobs.
Since Hive creates a meta table and parses files by delimiter, I generated log files with a fixed delimiter (\t) whose format matches the meta table columns, and then ran the tests.
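For reference, each record in the log file must carry 16 tab-separated fields that line up, in order, with the columns of the meta table created in step 1 below. The line here is a hypothetical illustration (not actual test data), with \t written out to mark the delimiter:

20141015\t125224\tI\t1\tdest.example.com\tsrc.example.com\tA01\thdl01\tTR000001\t1\t35\t7\tGID000001\tDEV000001\t00:1A:2B:3C:4D:5E\tpayload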
Test server configuration
5 Ubuntu servers in total
server 0 (master) : NameNode, SecondaryNameNode, ResourceManager, QuorumPeerMain
server 1 : DataNode, NodeManager, QuorumPeerMain
server 2 : DataNode, NodeManager, QuorumPeerMain
server 3 : DataNode, NodeManager, QuorumPeerMain
server 4 : DataNode, NodeManager, QuorumPeerMain
step 1. Create the meta table
hive>
> CREATE TABLE test_data (
> date INT,
> time INT,
> inout CHAR(1),
> lem INT,
> dest STRING,
> src STRING,
> media_code CHAR(3),
> handl STRING,
> tr_code CHAR(8),
> display INT,
> tr_time INT,
> operator_no INT,
> global_id STRING,
> device_id STRING,
> mac_addr STRING,
> data STRING
> )
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '\t'
> STORED AS TEXTFILE;
OK
Time taken: 0.647 seconds
hive> desc test_data;
OK
date int
time int
inout char(1)
lem int
dest string
src string
media_code char(3)
handl string
tr_code char(8)
display int
tr_time int
operator_no int
global_id string
device_id string
mac_addr string
data string
Time taken: 0.236 seconds, Fetched: 16 row(s)
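The table above is unpartitioned, so every query scans the full file set. For data that accumulates by date at this volume, a common variation (not part of the original test; log_date is a name I chose) is to partition the table so that date-bounded queries only read the matching partitions. A minimal sketch:

-- Hypothetical partitioned variant of the same schema.
-- The partition column replaces the plain date column
-- (the date field is dropped from the file itself, since the partition carries it).
CREATE TABLE test_data_part (
  time INT,
  inout CHAR(1),
  lem INT,
  dest STRING,
  src STRING,
  media_code CHAR(3),
  handl STRING,
  tr_code CHAR(8),
  display INT,
  tr_time INT,
  operator_no INT,
  global_id STRING,
  device_id STRING,
  mac_addr STRING,
  data STRING
)
PARTITIONED BY (log_date INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Each load then targets one partition (e.g. one day of logs):
LOAD DATA LOCAL INPATH '/home/cloud/script/test1.log'
OVERWRITE INTO TABLE test_data_part PARTITION (log_date=20141015);

With this layout, a query filtered on log_date skips non-matching partitions entirely instead of map-scanning everything.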
step 2. Load the data
To populate the table created above, load the test log file prepared in advance.
hive> LOAD DATA LOCAL INPATH '/home/cloud/script/test1.log' OVERWRITE INTO TABLE test_data;
Copying data from file:/home/cloud/script/test1.log
Copying file: file:/home/cloud/script/test1.log
Loading data to table default.test_data
Table default.test_data stats: [numFiles=1, numRows=0, totalSize=5070000000, rawDataSize=0]
OK
Time taken: 454.08 seconds
Loading 30,000,000 records took 454 seconds.
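Most of that time is the network copy itself: LOAD DATA LOCAL pushes the 5 GB file from the local filesystem into HDFS over the 100Mbps link (the report below shows the same load taking 41 seconds on 1000Mbps). If the file already sits in HDFS, the non-LOCAL form moves it into the table's directory instead of copying it in, so the load itself is near-instant. A sketch, with a hypothetical HDFS path:

-- Without LOCAL, the source path is in HDFS and the file is moved, not copied.
LOAD DATA INPATH '/user/cloud/logs/test1.log' OVERWRITE INTO TABLE test_data;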
step 3. MapReduce with SQL
I ran a count query, which Hive executes as a MapReduce job, to verify the data.
hive> select count(*) from test_data;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1413294422186_0006, Tracking URL = http://cloud4:8088/proxy/application_1413294422186_0006/
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1413294422186_0006
Hadoop job information for Stage-1: number of mappers: 19; number of reducers: 1
2014-10-15 12:52:24,076 Stage-1 map = 0%, reduce = 0%
2014-10-15 12:52:31,341 Stage-1 map = 5%, reduce = 0%, Cumulative CPU 3.5 sec
2014-10-15 12:52:32,389 Stage-1 map = 26%, reduce = 0%, Cumulative CPU 17.87 sec
2014-10-15 12:52:33,424 Stage-1 map = 42%, reduce = 0%, Cumulative CPU 27.97 sec
2014-10-15 12:52:34,459 Stage-1 map = 74%, reduce = 0%, Cumulative CPU 48.85 sec
2014-10-15 12:52:35,497 Stage-1 map = 84%, reduce = 0%, Cumulative CPU 60.3 sec
2014-10-15 12:52:36,531 Stage-1 map = 89%, reduce = 0%, Cumulative CPU 65.06 sec
2014-10-15 12:52:37,564 Stage-1 map = 95%, reduce = 0%, Cumulative CPU 69.56 sec
2014-10-15 12:52:39,632 Stage-1 map = 97%, reduce = 0%, Cumulative CPU 73.88 sec
2014-10-15 12:52:41,707 Stage-1 map = 97%, reduce = 32%, Cumulative CPU 74.2 sec
2014-10-15 12:52:44,809 Stage-1 map = 100%, reduce = 32%, Cumulative CPU 75.51 sec
2014-10-15 12:52:45,841 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 76.53 sec
MapReduce Total cumulative CPU time: 1 minutes 16 seconds 530 msec
Ended Job = job_1413294422186_0006
MapReduce Jobs Launched:
Stage-Stage-1: Map: 19 Reduce: 1 Cumulative CPU: 76.53 sec HDFS Read: 5070156967 HDFS Write: 9 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 16 seconds 530 msec
OK
30000000
Time taken: 31.4 seconds, Fetched: 1 row(s)
hive>
MapReducing over 30,000,000 records took about 31 seconds.
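count(*) funnels everything into a single reducer (Hive determined 1 at compile time, as the job log shows). For a shape closer to real analysis, a grouped aggregation over the same table also compiles to one MapReduce job, and the reducer count can be pinned with the knob Hive prints above. The query itself is my sketch, not from the original test:

-- Pin the number of reducers (same setting as shown in the job output above).
set mapreduce.job.reduces=4;

-- Hypothetical aggregation: record counts per day and direction.
SELECT date, inout, COUNT(*) AS cnt
FROM test_data
GROUP BY date, inout;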
Report
100Mbps Network
File Size | HDFS Write | map reduce |
5G | 454.08 sec | 30.23 sec |
60G | 5745.569 sec | 221.277 sec |
1000Mbps Network
File Size | HDFS Write | map reduce |
5G | 41.18 sec | 23.48 sec |
60G | 647.34 sec | 170.87 sec |
300G | 3573.81 sec | 655.50 sec |