Python: Building a Web Crawler for News Articles Containing a Specific Keyword (Web Crawler) - Part 2

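2. Continuing from last time

Python: Building a Web Crawler for News Articles Containing a Specific Keyword (Web Crawler) - Part 1

Last time we looked at the imports and at how the target URL is put together.

This time, let's implement the main function.

The main function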

def main():
    keyword = "대통령선거"  # keyword to search for
    page_num = 5  # number of result pages to fetch
    output_file_name = "out.txt"  # name of the output file

    target_URL = TARGET_URL_BEFORE_PAGE_NUM + TARGET_URL_BEFORE_KEYWORD \
        + quote(keyword) + TARGET_URL_REST

    # The with block closes the file even if crawling raises an error
    with open(output_file_name, 'w', encoding="utf-8") as output_file:
        get_link_from_news_title(page_num, target_URL, output_file)


if __name__ == '__main__':
    main()

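The keyword is first used to build the template of the search URL. (The page number is added later, during crawling.)

The keyword is passed through quote because our search term is Korean text, encoded as UTF-8, while a URL may only contain ASCII characters. quote percent-encodes the keyword's UTF-8 bytes, which turns it into an ASCII-only string that is safe to put in the URL.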
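As a quick illustration of what quote does to the keyword, here is a minimal sketch using only the standard library. The variable name encoded is mine, not part of the crawler.

```python
from urllib.parse import quote, unquote

keyword = "대통령선거"

# quote() encodes the string as UTF-8 and percent-encodes every non-ASCII byte
encoded = quote(keyword)

print(encoded)                      # ASCII-only, percent-encoded form
print(encoded.isascii())            # True: safe to place in a URL
print(unquote(encoded) == keyword)  # True: the encoding is reversible
```

Each Korean character becomes three UTF-8 bytes, so each one shows up as three %XX groups in the encoded string.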

Then get_link_from_news_title crawls as many result pages as page_num specifies.

(Note: in Python 3, pass the "utf-8" encoding argument when calling open.)
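The reason for that note: when no encoding is given, Python 3 falls back to the platform's default encoding (cp949 on a typical Korean Windows setup, for example), which can raise UnicodeEncodeError for characters it cannot represent. The sketch below writes Korean text with an explicit encoding and reads it back. The temporary directory is only there to keep the example self-contained.

```python
import os
import tempfile

text = "대통령선거"
path = os.path.join(tempfile.mkdtemp(), "out.txt")

# Passing encoding explicitly makes the result independent of the OS locale
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

with open(path, "r", encoding="utf-8") as f:
    read_back = f.read()

print(read_back == text)
```

Writing encoding="utf-8" as a keyword argument is equivalent to passing it positionally after the buffering argument, and is easier to read.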

Now let's look at the get_link_from_news_title function.

def get_link_from_news_title(page_num, URL, output_file):
    for i in range(page_num):
        # Each results page lists 15 articles, so the offset runs 1, 16, 31, ...
        current_page_num = 1 + i * 15
        # Position of the first '=' in the URL (where the page offset goes)
        position = URL.index('=')
        # Rebuild the URL with the page offset inserted
        URL_with_page_num = URL[:position + 1] + str(current_page_num) \
            + URL[position + 1:]
        # Request the rebuilt URL
        source_code_from_URL = urllib.request.urlopen(URL_with_page_num)
        # Parse the response with BeautifulSoup so the articles can be extracted
        soup = BeautifulSoup(source_code_from_URL, 'lxml', from_encoding='UTF-8')
        # Find the URL of each article body and extract its text
        for title in soup.find_all('p', 'tit'):
            title_link = title.select('a')
            article_URL = title_link[0]['href']
            get_text(article_URL, output_file)

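The comments briefly note what each step of the code does.

In short, the function builds the URL for each results page, finds the article URLs on it, and has each article's body text extracted and written to output_file.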
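To make the page-URL step concrete, here is a minimal sketch of the same slicing applied to a made-up URL. The helper build_page_url, its per_page parameter, and the example.com address are hypothetical and exist only for this illustration.

```python
def build_page_url(url, page_index, per_page=15):
    """Insert the 1-based article offset right after the first '=' in url."""
    offset = 1 + page_index * per_page
    position = url.index('=')
    return url[:position + 1] + str(offset) + url[position + 1:]


# Placeholder URL with an empty 'p=' parameter, like the crawler's template
base = "http://example.com/search?p=&query=test"

page_urls = [build_page_url(base, i) for i in range(3)]
for page_url in page_urls:
    print(page_url)
```

This only works because the p parameter comes first in the URL. If the parameter order ever changed, index('=') would point at the wrong place.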

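The important part here is how the BeautifulSoup module is used.

The find_all call in the for loop is what locates the article URLs. To see how it works, press F12 on the search results page to open the developer tools,

and you can find the tags in question.

Looking more closely, the code finds each p tag with the class tit,

then takes the first <a> tag beneath it and reads title_link[0]['href'], the value of its href attribute, assigning it to the article_URL variable.

(I covered BeautifulSoup usage in an earlier post.)

http://blog.naver.com/rjs5730/220975908163 => Using the BeautifulSoup find and findAll functions.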
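The lookup described above can be tried on a small hand-written HTML snippet without touching the network. This is only a sketch: the markup and example.com links are invented for illustration, and it uses Python's built-in 'html.parser' instead of 'lxml' so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Invented markup that imitates the structure of a search results page
sample_html = """
<div>
  <p class="tit"><a href="http://example.com/article/1">First headline</a></p>
  <p class="tit"><a href="http://example.com/article/2">Second headline</a></p>
  <p class="desc"><a href="http://example.com/ignored">Not a headline</a></p>
</div>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

article_urls = []
# find_all('p', 'tit') matches <p> tags whose class is "tit"
for title in soup.find_all('p', 'tit'):
    title_link = title.select('a')
    if title_link:  # skip headlines that contain no link
        article_urls.append(title_link[0]['href'])

print(article_urls)
```

The if title_link check is an addition of mine. The crawler indexes title_link[0] directly, which would raise IndexError on a p.tit element that has no link.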

get_text is then called with each article URL that was found.

def get_text(URL, output_file):
    # Request the article page with urllib
    source_code_from_url = urllib.request.urlopen(URL)
    # Parse the page with BeautifulSoup and keep the result in soup
    soup = BeautifulSoup(source_code_from_url, 'lxml', from_encoding='UTF-8')
    # Select the element that holds the article body
    content_of_article = soup.select('div.article_txt')
    # If any article text was found, write it to the file
    for item in content_of_article:
        string_item = str(item.find_all(text=True))
        output_file.write(string_item)

This is how the article content is extracted and written to the file.

Full code

import sys
from bs4 import BeautifulSoup
import urllib.request
from urllib.parse import quote

TARGET_URL_BEFORE_PAGE_NUM = "http://news.donga.com/search?p="
TARGET_URL_BEFORE_KEYWORD = '&query='
TARGET_URL_REST = '&check_news=1&more=1&sorting=1&search_date=1&v1=&v2=&range=1'


def get_link_from_news_title(page_num, URL, output_file):
    for i in range(page_num):
        # Each results page lists 15 articles, so the offset runs 1, 16, 31, ...
        current_page_num = 1 + i * 15
        # Position of the first '=' in the URL (where the page offset goes)
        position = URL.index('=')
        # Rebuild the URL with the page offset inserted
        URL_with_page_num = URL[:position + 1] + str(current_page_num) \
            + URL[position + 1:]
        # Request the rebuilt URL
        source_code_from_URL = urllib.request.urlopen(URL_with_page_num)
        # Parse the response with BeautifulSoup so the articles can be extracted
        soup = BeautifulSoup(source_code_from_URL, 'lxml', from_encoding='UTF-8')
        # Find the URL of each article body and extract its text
        for title in soup.find_all('p', 'tit'):
            title_link = title.select('a')
            article_URL = title_link[0]['href']
            get_text(article_URL, output_file)


def get_text(URL, output_file):
    # Request the article page with urllib
    source_code_from_url = urllib.request.urlopen(URL)
    # Parse the page with BeautifulSoup and keep the result in soup
    soup = BeautifulSoup(source_code_from_url, 'lxml', from_encoding='UTF-8')
    # Select the element that holds the article body
    content_of_article = soup.select('div.article_txt')
    # If any article text was found, write it to the file
    for item in content_of_article:
        string_item = str(item.find_all(text=True))
        output_file.write(string_item)


def main():
    keyword = "대통령선거"  # keyword to search for
    page_num = 5  # number of result pages to fetch
    output_file_name = "out.txt"  # name of the output file

    target_URL = TARGET_URL_BEFORE_PAGE_NUM + TARGET_URL_BEFORE_KEYWORD \
        + quote(keyword) + TARGET_URL_REST

    # The with block closes the file even if crawling raises an error
    with open(output_file_name, 'w', encoding="utf-8") as output_file:
        get_link_from_news_title(page_num, target_URL, output_file)


if __name__ == '__main__':
    main()

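If you run it yourself, you'll find that an out.txt file has been created in the same directory.

Open out.txt and you'll see that it contains the article text.

The output looks a little messy, because \n and other stray strings are mixed in.

Such strings can be removed later with some cleanup code.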
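The stray characters come from calling str() on the list that find_all(text=True) returns: the file receives the list's printed form, with brackets, quotes, and escaped \n. The sketch below shows this on an invented snippet and, as one possible alternative that is not what this post's crawler does, uses BeautifulSoup's get_text method to join the text nodes directly. It uses 'html.parser' so that no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Invented article markup for illustration
sample_html = '<div class="article_txt">First line.<br/>\nSecond line.</div>'
soup = BeautifulSoup(sample_html, 'html.parser')
item = soup.select('div.article_txt')[0]

# What the crawler writes: the string form of a Python list of text nodes
raw = str(item.find_all(text=True))
print(raw)

# Alternative: strip each text node and join them with single spaces
clean = item.get_text(separator=' ', strip=True)
print(clean)
```

Cleaning the already-written file afterwards, as the next post does, works too. Extracting plain text up front simply leaves less to clean.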

Python: Removing Unnecessary Characters from a Crawled File (Web Crawler) - Part 3
