Web scraping with Python (urlopen): basics and practice

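0. Introduction

Professors tend to like it when you build a web crawler for a university assignment.

1. Basics

Last time we got as far as installing everything and running the example code.

First, import urlopen from urllib.request so that we can open a URL.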

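from urllib.request import urlopen
from bs4 import BeautifulSoup

# Fetch the page and parse the returned HTML with the lxml parser
html = urlopen("http://naver.com")
bsObj = BeautifulSoup(html.read(), "lxml")
# Print the first <h1> tag in the document
print(bsObj.h1)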

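We also import BeautifulSoup so that the HTML we read can be parsed.

The Naver home page read into the html variable is converted into a BeautifulSoup object, and print then outputs its h1 tag.

Running it prints the page's <h1> tag.

bsObj.h1 returns the first h1 tag found in the document, which here is the one at html -> body -> h1.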
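To see what bsObj.h1 is doing without hitting the network, here is a minimal sketch on a small inline HTML string. The markup, the variable names, and the use of the built-in "html.parser" (instead of lxml, so nothing extra has to be installed) are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page
markup = (
    "<html><head><title>Demo</title></head>"
    "<body><h1>Hello</h1><h1>Second</h1></body></html>"
)
bs_obj = BeautifulSoup(markup, "html.parser")

via_shortcut = bs_obj.h1          # shortcut: first <h1> anywhere in the document
via_path = bs_obj.html.body.h1    # the same tag, reached by spelling out the path
via_find = bs_obj.find("h1")      # the same tag, found with an explicit search

print(via_shortcut)  # <h1>Hello</h1>
```

All three expressions land on the same first h1 tag. The second h1 is skipped because the attribute shortcut only ever returns the first match.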

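When an error occurs while fetching the page

There are two kinds of error:

1. The server cannot find the page

2. The server itself cannot be found

In the first case the server answers with an HTTP error such as "404 Page Not Found" or "500 Internal Server Error".

In this case urlopen raises HTTPError, so the error has to be handled as an exception.

Don't forget to add the import: from urllib.error import HTTPError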

from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

try:
    html = urlopen("http://dud5730.cafe24.com/web/index2.html")
except HTTPError as e:
    # The server responded with an error status (e.g. 404), so just report it
    print(e)
else:
    # Runs only when urlopen succeeded
    bsObj = BeautifulSoup(html.read(), "lxml")
    print(bsObj.a)


Running this code prints the error HTTP Error 404: Not Found, which is the expected behavior.
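The code above only catches HTTPError, which covers the first kind of error. The second kind, where the server itself cannot be found, shows up as urllib.error.URLError instead. Below is a rough sketch that handles both. The fetch helper and its opener parameter are hypothetical names added here so the function can be tried without a network connection; they are not part of urllib.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch(url, opener=urlopen):
    """Return (body, error): body is bytes on success, error is a message on failure."""
    try:
        response = opener(url)
    except HTTPError as e:
        # The server answered, but with an error status such as 404 or 500.
        # HTTPError is a subclass of URLError, so it must be caught first.
        return None, f"server returned an error: {e}"
    except URLError as e:
        # No server could be reached (unknown host name, refused connection, ...).
        return None, f"could not reach the server: {e.reason}"
    else:
        return response.read(), None
```

Calling fetch("http://dud5730.cafe24.com/web/index2.html") would then give back None plus a message instead of raising, and the caller can decide whether to parse the body or skip the page.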

Handling the case where a tag may be missing

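from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

try:
    html = urlopen("http://dud5730.cafe24.com/web/index.html")
except HTTPError as e:
    print(e)
else:
    # Parse only when the request succeeded; otherwise html is undefined
    try:
        bsObj = BeautifulSoup(html.read(), "lxml")
        title = bsObj.body.h1
    except AttributeError as e:
        # Raised when bsObj.body is None, so .h1 cannot be looked up on it
        print(e)
    else:
        # A missing <h1> inside an existing <body> gives None, not an exception
        print(title)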


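If the page exists but the tag you want is missing, the lookup fails in one of two ways. Looking up a tag on a parent that is itself missing (for example when bsObj.body is None) raises AttributeError, which the except block reports. A missing h1 inside an existing body does not raise anything and simply returns None.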
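The missing-tag case can be tried out offline with a small helper. This is only a sketch: first_h1 is a hypothetical name, the HTML strings are made up, and it uses the built-in "html.parser" rather than lxml. That choice matters here, because lxml may wrap a fragment in html and body tags on its own, while html.parser leaves a missing body missing.

```python
from bs4 import BeautifulSoup


def first_h1(markup):
    """Return the first <h1> inside <body>, or None if either tag is missing."""
    bs_obj = BeautifulSoup(markup, "html.parser")
    try:
        title = bs_obj.body.h1
    except AttributeError:
        # bs_obj.body was None, so there was no <body> to look inside
        return None
    # title is already None when <body> exists but contains no <h1>
    return title
```

With this, first_h1("<html><body><h1>Hello</h1></body></html>") gives back the h1 tag, while a page with no h1, or with no body at all, gives back None, so the caller only has one condition to check.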
