Python BeautifulSoup: web crawling / HTML parsing.

2018. 4. 25. 15:47

Parse HTML and extract the parts you want.

+ Installing BeautifulSoup

pip install beautifulsoup4

This installs version 4.

+ Fetching HTML from a URL and parsing it

from urllib.request import Request, urlopen

from bs4 import BeautifulSoup

url="http://www.naver.com/"

html = urlopen(url).read()

soup = BeautifulSoup(html, 'html.parser')

rank = soup.find("dl", id="ranklist") # find the dl tag whose id is ranklist.

for i in rank.find_all("li", value=True, id=False): # li tags that have a value attribute and no id attribute.

    print( i.get_text(" ", strip=True) ) # get the text, joining pieces from nested tags with a space and stripping leading/trailing whitespace.
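
The markup of a live page changes over time, so the same search can be tried on a fixed HTML string. A minimal sketch, assuming a made-up ranklist fragment (sample_html and its li contents are illustrative, not taken from naver.com):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that imitates the structure the search above expects.
sample_html = """
<dl id="ranklist">
  <dd><ol>
    <li value="1"><a href="#">first keyword</a></li>
    <li value="2"><a href="#">second keyword</a></li>
    <li id="more">more</li>
    <li value="3" id="ad">sponsored</li>
  </ol></dd>
</dl>
"""

soup = BeautifulSoup(sample_html, "html.parser")
rank = soup.find("dl", id="ranklist")

keywords = []
if rank is not None:  # find returns None when nothing matches
    # value=True: the attribute must exist; id=False: the attribute must be absent.
    for li in rank.find_all("li", value=True, id=False):
        keywords.append(li.get_text(" ", strip=True))

print(keywords)
```

Checking rank for None first avoids an AttributeError when the page no longer contains the element.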

+ Adding a header

req = Request(url)

req.add_header('User-Agent', 'Mozilla/5.0')

html = urlopen(req).read()
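
Headers can also be given when the Request is created. A small sketch that only builds the request and does not send it; the Accept-Language value is an illustrative assumption:

```python
from urllib.request import Request

url = "http://www.naver.com/"

# Pass headers at construction time instead of calling add_header afterwards.
req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
req.add_header("Accept-Language", "ko-KR,ko;q=0.8")  # illustrative extra header

# urllib stores header names capitalized, e.g. "User-agent".
print(req.header_items())
```

Because of that capitalization, req.get_header("User-agent") is the spelling that finds the value.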

+ Searching HTML with BeautifulSoup

- Find every matching tag

html_as=soup.find_all("a") # find all a tags. Access them as html_as[0], html_as[1], ...; the first one is index 0.

soup("a") # same as above

- Find all strings inside the title tag

soup.title.find_all(string=True)

soup.title(string=True)

- Get only the first two a tags

soup.find_all("a", limit=2)

- Search by string

soup.find_all(string="elsie") # find strings that exactly match "elsie"

For an OR search, pass a list: ["aa", "bb"]

To use a regular expression, pass re.compile("pattern")
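
The string, list, regular-expression and limit variants can be compared side by side. A sketch on a made-up fragment (the names and hrefs are illustrative):

```python
import re
from bs4 import BeautifulSoup

html = """
<p class="title"><b>The story</b></p>
<p class="story">
<a class="sister" href="/elsie" id="link1">Elsie</a>,
<a class="sister" href="/lacie" id="link2">Lacie</a> and
<a class="sister" href="/tillie" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

exact = soup.find_all(string="Elsie")               # exact, case-sensitive match
either = soup.find_all(string=["Elsie", "Tillie"])  # OR search with a list
by_regex = soup.find_all(string=re.compile("^L"))   # strings that start with L
first_two = soup.find_all("a", limit=2)             # stop after two results

print(exact, either, by_regex, [a["id"] for a in first_two])
```

Note that string="elsie" in lower case would find nothing here, since the match is case-sensitive.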

- Search by tag and attribute (class) value

soup.find_all("p", "title")

soup.select('p[class="title"]')

ex)
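
The two calls are not quite equivalent when a tag has several classes. A small sketch on a made-up fragment:

```python
from bs4 import BeautifulSoup

html = '<p class="title">A</p><p class="title big">B</p><p class="story">C</p>'
soup = BeautifulSoup(html, "html.parser")

by_find = soup.find_all("p", "title")      # any p whose class list contains "title"
by_attr = soup.select('p[class="title"]')  # only when the whole attribute is exactly "title"
by_dot = soup.select("p.title")            # CSS class selector, also "contains title"

print([p.get_text() for p in by_find])
print([p.get_text() for p in by_attr])
print([p.get_text() for p in by_dot])
```

If tags may carry extra classes, "p.title" is the closer CSS counterpart of find_all("p", "title").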

- Find several tags at once (every a tag and every b tag)

soup.find_all(["a", "b"])

- Get an attribute (class) value

soup.p['class']

soup.p['id']

- Pretty-print

soup.b.prettify()

- Navigate by tag name

soup.body.b # the first b tag inside body

soup.a # the first a tag in the document

- Search by tag and class name

soup.find_all("a", class_="sister") # use class_ with a trailing underscore, because class is a reserved word in Python.

- Find by tag and ID

soup.find("div", id="articlebody")

soup.find("div", { "id":"articlebody" } )

soup.find(id="articlebody")

soup.select("#articlebody") # search by id.

soup.select("div#articlebody") # search by tag and id.

- Find by tag and class

soup.select('div ol[class="list1"]')

find_all and select return a list, so expect several results.

Use find (or select_one) when you want only one.

- find returns None when nothing matches (find_all and select return an empty list)
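
The difference between the single-result and list-returning calls, and the None case, can be sketched as follows (the fragment and the id "nope" are made up):

```python
from bs4 import BeautifulSoup

html = '<div id="articlebody"><ol class="list1"><li>x</li></ol></div>'
soup = BeautifulSoup(html, "html.parser")

one = soup.find("div", id="articlebody")          # a single Tag
many = soup.select("div#articlebody")             # a list, even for one match
first = soup.select_one('div ol[class="list1"]')  # CSS counterpart of find

missing = soup.find("div", id="nope")             # None
missing_all = soup.find_all("div", id="nope")     # empty list, not None

# Guard before using a result that may be None.
title = missing.get_text() if missing is not None else ""
print(one is many[0], first.name, missing, len(missing_all))
```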

- Get a tag's name

soup.find("div").name

- Get an attribute

soup.find("div")['class'] # raises KeyError if the attribute is missing

soup.find("div").get('class') # returns None if the attribute is missing
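
A short sketch of both access styles on a made-up fragment; note that class comes back as a list because it can hold several values:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div class="box wide" id="main"></div><div></div>', "html.parser")
first, second = soup.find_all("div")

print(first["class"])           # multi-valued attribute, returned as a list
print(first["id"])              # ordinary attribute, returned as a string
print(second.get("class"))      # missing attribute with get
print(second.get("class", []))  # get with a default value

try:
    second["class"]             # missing attribute with []
except KeyError:
    print("no class attribute")
```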

- Get the text between tags

Use the contents attribute.

<p>aaa<b>one</b>cc</p>

soup.b.string ==> one

soup.b.contents[0] ==> one

soup.p.contents ==> ['aaa', <b>one</b>, 'cc']

Index into the list to get the element you want.
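
A runnable version of the above, assuming the markup is <p>aaa<b>one</b>cc</p>:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>aaa<b>one</b>cc</p>", "html.parser")

print(soup.b.string)       # the only string inside b
print(soup.b.contents[0])  # same string, reached through contents
print(soup.p.contents)     # two strings with the b tag between them
print(soup.p.contents[2])  # pick one child by index
print(soup.p.string)       # None, because p has more than one child
```

When a tag has several children, .string is None, so contents or get_text() is the safer choice.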

- Next sibling node (this can be a whitespace string rather than a tag)

soup.p.next_sibling
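
When tags are separated by line breaks in the source, next_sibling returns that whitespace. A sketch on a made-up fragment, using find_next_sibling to skip it:

```python
from bs4 import BeautifulSoup

html = "<div><p>first</p>\n<p>second</p></div>"
soup = BeautifulSoup(html, "html.parser")

node = soup.p.next_sibling           # the newline string between the two tags
tag = soup.p.find_next_sibling("p")  # skips strings and returns the next p tag

print(repr(node), tag.get_text())
```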

- Drill down through nested tags

ptag = soup.p

ptag('table')[0]('tr')[0]('td')[0]

- All text inside a tag

ptag.get_text()
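
Calling a tag is shorthand for find_all, which is why the calls can be chained. A sketch on a made-up table; a div is used as the container here instead of p, and the cell values are illustrative:

```python
from bs4 import BeautifulSoup

html = """
<div id="wrap">
  <table>
    <tr><td>r1c1</td><td>r1 <b>c2</b></td></tr>
    <tr><td>r2c1</td><td>r2c2</td></tr>
  </table>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
box = soup.div

# Chained shorthand and the spelled-out form reach the same cell.
cell = box("table")[0]("tr")[0]("td")[1]
same = box.find("table").find("tr").find_all("td")[1]

text_plain = cell.get_text()               # text of the cell, nested tags included
text_clean = box.get_text(" ", strip=True) # all text, pieces joined by one space

print(text_plain)
print(text_clean)
```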

+ Reading from an HTML file

import codecs

page_html = codecs.open('a.html', 'r', 'utf-8')

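page_soup = BeautifulSoup(page_html, "html.parser") # call the BeautifulSoup class here; soup(...) on an already parsed document would run find_all instead.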
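
The built-in open with an encoding argument works as well and closes the file automatically. A sketch that writes a throwaway a.html into a temporary directory so it runs anywhere (the file content is made up):

```python
import os
import tempfile
from bs4 import BeautifulSoup

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "a.html")

    # Create a small UTF-8 file; the title is the Korean word for "hello".
    with open(path, "w", encoding="utf-8") as f:
        f.write("<html><head><title>\uc548\ub155</title></head><body><p>hi</p></body></html>")

    # BeautifulSoup accepts an open file object as well as a string.
    with open(path, "r", encoding="utf-8") as page_html:
        page_soup = BeautifulSoup(page_html, "html.parser")

print(page_soup.title.string)
```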

+ Walking through tags

Example:

# Find the span tag by its id first, then take the second table tag inside it.

testwebsite_container = page_soup.find("span", id="MainContent2_ctl00_lblContent").find_all("table")[1]

skiptrcnt=1 # skip first tr block

for i, record in enumerate(testwebsite_container.find_all('tr')): # iterate over every tr tag.

    if skiptrcnt > i:

        continue

    # the first skiptrcnt rows are skipped.

    tnum = record('td')[0].text # text of the first td

    desc = record('td')[1].text

    doclink = record('td')[2].text # text of the third td (for a link, this is the visible link text).

    alink = record('td')[2].find("a") # look for an a tag under the third td.

    if alink:

        doclinkurl = testwebsite + alink['href'] # read the href attribute. testwebsite is assumed to be the site's base URL string, defined elsewhere.

    closingdate = record('td')[3].text

    detail = record('td')[4].text # line breaks can appear when the td contains br tags.

    detail = detail.replace('\n', '') # remove line breaks.
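
The whole walk can be tried offline. A self-contained sketch, assuming a made-up page with the same span id and a hypothetical base URL (testwebsite, the table contents and the rows list are illustrative, not the original site):

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

testwebsite = "http://example.com/"  # hypothetical base URL

html = """
<span id="MainContent2_ctl00_lblContent">
<table><tr><td>layout table</td></tr></table>
<table>
<tr><th>No</th><th>Desc</th><th>Doc</th><th>Closing</th><th>Detail</th></tr>
<tr><td>1</td><td>Road work</td><td><a href="docs/1.pdf">PDF</a></td><td>2018-05-01</td><td>line one<br/>
line two</td></tr>
<tr><td>2</td><td>Bridge</td><td>none</td><td>2018-06-01</td><td>single</td></tr>
</table>
</span>
"""
page_soup = BeautifulSoup(html, "html.parser")
table = page_soup.find("span", id="MainContent2_ctl00_lblContent").find_all("table")[1]

rows = []
for record in table.find_all("tr")[1:]:  # slicing skips the header row
    cells = record.find_all("td")
    if len(cells) < 5:                   # ignore rows that do not have all five cells
        continue
    alink = cells[2].find("a")
    rows.append({
        "tnum": cells[0].get_text(strip=True),
        "desc": cells[1].get_text(strip=True),
        "doclink": cells[2].get_text(strip=True),
        # urljoin handles both relative and absolute href values.
        "doclinkurl": urljoin(testwebsite, alink["href"]) if alink else None,
        "closingdate": cells[3].get_text(strip=True),
        "detail": cells[4].get_text(" ", strip=True),  # joins the pieces around br with a space
    })

for row in rows:
    print(row)
```

Setting doclinkurl to None when there is no link keeps a value from the previous row from leaking into the current one.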


크레이지J의 탐구생활
