
About Scrapy

Web crawlers have become increasingly important for collecting and preprocessing data for machine learning. Scrapy is a Python web-crawling framework that provides an interface for easily implementing the collection, processing, and storage of data.
Starting from a web URL, it lets you collect and clean documents in various formats (HTML, JSON, XML, etc.) and carry the process all the way through to storage (CSV file, JSON file, MySQL, etc.) via pipelines.

Technologies Used

Installation

Installing Scrapy for Python

$ pip install scrapy
$ pip install scrapy_user_agents  # used to generate random User-Agent headers
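
The files below assume a Scrapy project named crawler, which matches BOT_NAME and the crawler.spiders module referenced in settings.py. Starting from an empty directory, the project skeleton (settings.py, items.py, pipelines.py, and the spiders/ package) can be generated with Scrapy's CLI:

$ scrapy startproject crawler
$ cd crawler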

Writing a Spider

This section walks through the full flow: collecting and cleaning the data, converting it into a structured model (Item), and finally storing it in a file or database.

settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for crawler project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'crawler'

SPIDER_MODULES = ['crawler.spiders']
NEWSPIDER_MODULE = 'crawler.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'crawler (+http://www.yourdomain.com)'

# Obey robots.txt rules
# ROBOTSTXT_OBEY = True
# 2020.07.03 by rocksea
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# 2020.07.06 by rocksea
DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'crawler.middlewares.CrawlerSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'crawler.middlewares.CrawlerDownloaderMiddleware': 543,
#}
# 2020.07.03 by rocksea
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'crawler.pipelines.CrawlerPipeline': 300,
#}
# 2020.07.03 by rocksea
ITEM_PIPELINES = {'crawler.pipelines.CsvPipeline': 300,}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
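
The modifications above (marked "by rocksea") disable robots.txt enforcement, add a 3-second download delay, swap the stock UserAgentMiddleware for the random User-Agent middleware from scrapy_user_agents, and register the CsvPipeline defined later in pipelines.py. Scrapy also lets individual spiders override such settings through the custom_settings class attribute, which keeps the project-wide defaults untouched; a minimal sketch with a hypothetical ExampleSpider, reusing the same values as above:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    # Per-spider overrides; these take precedence over settings.py for this spider only
    custom_settings = {
        'DOWNLOAD_DELAY': 3,
        'ITEM_PIPELINES': {'crawler.pipelines.CsvPipeline': 300},
    }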

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class NaverKinItem(scrapy.Item):
    # Fields for one Naver KIN (Q&A) search result
    qTitle = scrapy.Field()    # question title
    qContent = scrapy.Field()  # question content
    aContent = scrapy.Field()  # answer content
    aDt = scrapy.Field()       # answer date (not populated by this spider)
    qDt = scrapy.Field()       # question date
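
A scrapy.Item behaves much like a dict with a fixed set of keys: only declared fields can be assigned, which catches typos early. A quick interactive sketch with hypothetical values:

>>> from crawler.items import NaverKinItem
>>> item = NaverKinItem(qTitle='sample question')
>>> item['qContent'] = 'sample body'
>>> item['unknown'] = 'oops'   # not declared above -> raises KeyError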

NaverKinSpider.py

# -*- coding: utf-8 -*-
import scrapy
import re
from crawler.items import NaverKinItem

class NaverKinSpider(scrapy.Spider):
    name = "NaverKinCrawler"

    def __init__(self, keyword='', **kwargs):
        self.keyword = keyword
        self.download_delay = 5
        super().__init__(**kwargs)

    def start_requests(self):
        # Naver KIN search results are paginated 10 per page; kin_start is the 1-based offset
        for page in range(10):
            params = {'kin_start': (page * 10) + 1}
            url = "https://search.naver.com/search.naver?where=kin&kin_display=10&query={}&sm=tab_pge&kin_start={}".format(self.keyword, params['kin_start'])
            request = scrapy.Request(url, self.parse_data)
            request.meta['params'] = params
            yield request

    def parse_data(self, response):
        #print("#### Resp Body: \n %s" % response.body)
        # Each <li> under the result area is one question/answer summary
        for resultArea in response.xpath(r'//*[@id="elThumbnailResultArea"]/li'):
            item = NaverKinItem()  # fresh item per result, so each yield carries its own data

            title = resultArea.xpath(r'dl/dt/a')[0].extract()
            title = re.sub(r'<.*?>', '', title)  # strip html tags
            print("#### title : %s" % title)

            content = resultArea.xpath(r'dl/dd[2]')[0].extract()
            content = re.sub(r'<.*?>', '', content)  # strip html tags
            print("#### content : %s" % content)

            qDt = resultArea.xpath(r'dl/dd[1]/text()')[0].extract()
            qDt = re.sub(r'([0-9]{4}\.[0-9]{2}\.[0-9]{2}).*', r'\1', qDt).replace('.', '-')  # keep leading date, format as YYYY-MM-DD
            print("#### question Date : %s" % qDt)

            aContent = resultArea.xpath(r'dl/dd[3]')[0].extract()
            aContent = re.sub(r'<.*?>', '', aContent)  # strip html tags
            print("#### answer : %s" % aContent)

            # Populate the item
            item['qTitle'] = title
            item['qContent'] = content
            item['aContent'] = aContent
            item['qDt'] = qDt
            item['aDt'] = ''

            yield item
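
If the XPath expressions stop matching (Naver's search result markup changes from time to time), Scrapy's interactive shell is a convenient way to test selectors against a live response before touching the spider. A rough example with a placeholder query:

$ scrapy shell "https://search.naver.com/search.naver?where=kin&kin_display=10&query=scrapy&kin_start=1"
>>> response.xpath('//*[@id="elThumbnailResultArea"]/li/dl/dt/a/text()').getall()[:3]
>>> exit()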

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

from scrapy.exporters import JsonItemExporter, CsvItemExporter

class JsonPipeline(object):
    def __init__(self):
        self.file = open("crawler_data.json", 'wb')
        self.exporter = JsonItemExporter(self.file, encoding='utf-8', ensure_ascii=False)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)  # without this call the JSON file stays empty
        return item

class CsvPipeline(object):
    def __init__(self):
        self.file = open("crawler_data.csv", 'wb')
        self.exporter = CsvItemExporter(self.file, encoding='utf-8')
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
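
As an aside, for plain file output Scrapy's built-in feed exports can do the same job without a custom pipeline: passing -o to scrapy crawl writes items to CSV or JSON, with the format inferred from the file extension. The custom pipelines above remain the right place for per-item logic or database storage. For example, with a placeholder keyword:

$ scrapy crawl NaverKinCrawler -a keyword=scrapy -o output.csv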

Run Crawler

The command for running a crawler is:

$ scrapy crawl [spider name]

Now run the spider written above, passing a search keyword:

$ scrapy crawl NaverKinCrawler -a keyword=<keyword>
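
The -a option is handed to the spider's __init__ as a keyword argument, so the value lands in self.keyword and is interpolated into the search URL in start_requests. Once the crawl finishes, the CsvPipeline registered in settings.py should have written the collected items to crawler_data.csv in the directory the command was run from:

$ head crawler_data.csv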

GitHub

https://github.com/rocksea/cralwer
