[Python] Mastering Web Crawling in Python: A Practical Guide and Tips

by YJ Dev 2024. 5. 11.

Web crawling is the process of collecting information from the internet, and Python has established itself as a powerful tool for the job. In this post, we'll look at how to crawl the web with Python.



1. Understanding Web Crawling 🤖

Web crawling means retrieving information from web pages: an HTTP request fetches the page's source code. Let's look at how to fetch a simple web page with the 'requests' package.

import requests  # HTTP client library for sending requests

url = 'https://www.naver.com/'  # URL of the web page to crawl
response = requests.get(url, timeout=10)  # send a GET request; give up after 10 seconds
response.raise_for_status()  # raise an exception on a 4xx/5xx response
print(response.text)  # the response body as text, i.e. the page's HTML source

2. Parsing Web Pages with BeautifulSoup 🌐

The BeautifulSoup library parses an HTML document so that you can extract the information you want.

  • find() method: returns the first tag that matches
  • find_all() method: returns every tag that matches

from bs4 import BeautifulSoup  # HTML parsing library

html = '''
<div>
    <a href="https://www.naver.com">네이버</a>  <!-- Naver link -->
    <a href="https://www.kakao.com">카카오</a>  <!-- Kakao link -->
</div>
'''
soup = BeautifulSoup(html, 'html.parser')  # parse the HTML into a navigable tree
print(soup.find('a').text)  # text of the first link
print(soup.find_all('a'))   # list of every <a> tag
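
Each tag returned by find_all() also exposes its attributes, so a common next step is to collect every link's URL together with its text. The following is a minimal sketch that reuses the sample HTML from this section; the variable name links is only illustrative.

```python
from bs4 import BeautifulSoup

html = '''
<div>
    <a href="https://www.naver.com">네이버</a>
    <a href="https://www.kakao.com">카카오</a>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# Collect (text, href) pairs; get() returns None when the attribute is missing
links = [(a.get_text(strip=True), a.get('href')) for a in soup.find_all('a')]
for text, href in links:
    print(text, href)
```

Using get('href') rather than a['href'] avoids a KeyError on anchors that have no href attribute.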


Below is example code that crawls news headlines from Daum News. The class name 'tit_g' is tied to the page's markup, so the selector needs updating if the site changes its HTML.

import requests  # HTTP client library
from bs4 import BeautifulSoup  # HTML parsing library

url = 'https://news.daum.net/'  # Daum News URL to crawl
response = requests.get(url, timeout=10)  # send a GET request and receive the response
response.raise_for_status()  # stop early on an HTTP error status
html = response.text  # response body as text
soup = BeautifulSoup(html, 'html.parser')  # parse the HTML into a navigable tree

result = soup.find_all('strong', class_='tit_g')  # every <strong> tag whose class is 'tit_g'
# print(result)  # uncomment to print the matched tags as a list
for title in result:
    print(title.text.strip())  # print only the text, with surrounding whitespace removed
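
Because a live page's markup can change, it may help to try the same class filter on a small fixed fragment first. The HTML below is a made-up stand-in, not Daum's actual markup; only the strong tag and the 'tit_g' class come from the example above.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment for illustration; not copied from the real page
sample_html = '''
<ul>
    <li><strong class="tit_g"> First headline </strong></li>
    <li><strong class="other">Not a headline</strong></li>
    <li><strong class="tit_g">Second headline</strong></li>
</ul>
'''
soup = BeautifulSoup(sample_html, 'html.parser')

# class_ has a trailing underscore because "class" is a reserved word in Python
titles = [tag.get_text(strip=True) for tag in soup.find_all('strong', class_='tit_g')]
print(titles)
```

If the list comes back empty on the real page, the class name has probably changed, or the headlines are rendered by JavaScript after the initial HTML is delivered.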

3. Downloading an Image File 🖥️

Let's look at how to download an image file from a web page and save it. We fetch the image with 'requests', keep the response in a variable, and write its bytes to a file.

import requests

image_url = 'https://example.com/image.jpg'  # URL of the image to download
imgdata = requests.get(image_url, timeout=10)  # send a GET request; the response body holds the image bytes
imgdata.raise_for_status()  # do not save an error page as an image

with open('image.jpg', 'wb') as f:  # open 'image.jpg' in binary write mode
    f.write(imgdata.content)  # write the raw image bytes to the file

4. Downloading All Images on a Web Page at Once 🗃️

This time we'll download several image files in one go. We find the image tags, extract their URLs, and use those URLs to download the files.

import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = 'https://example.com/page-with-images'  # URL of the page that contains the images
response = requests.get(url, timeout=10)  # send a GET request and receive the response
html = response.text  # response body as text
soup = BeautifulSoup(html, 'html.parser')  # parse the HTML into a navigable tree

os.makedirs('downloaded_images', exist_ok=True)  # create the output folder if it does not exist

for img_tag in soup.find_all('img'):  # every <img> tag in the page
    src = img_tag.get('src')  # the src attribute holds the image URL
    if not src:
        continue  # skip <img> tags that have no src attribute
    img_url = urljoin(url, src)  # resolve relative paths against the page URL
    img_data = requests.get(img_url, timeout=10)  # download the image bytes

    with open('downloaded_images/' + img_url.split('/')[-1], 'wb') as f:  # open the target file in binary write mode
        f.write(img_data.content)  # write the image bytes to the file
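
Taking the last path segment with split('/')[-1] works for simple URLs, but it keeps any query string and yields an empty name when the URL ends with a slash. One possible refinement uses only the standard library; the helper name filename_from_url and the fallback name 'image' are assumptions made for this sketch.

```python
import posixpath
from urllib.parse import urlparse, unquote


def filename_from_url(img_url, fallback='image'):
    """Return a file name derived from the path part of img_url."""
    path = urlparse(img_url).path               # drops the query string and fragment
    name = posixpath.basename(unquote(path))    # last path segment, percent-decoded
    return name or fallback                     # empty when the path ends with '/'


print(filename_from_url('https://example.com/img/photo.jpg?size=large'))
print(filename_from_url('https://example.com/gallery/'))
```

Different URLs can still map to the same file name, so a real crawler would likely add a counter or a hash to avoid overwriting files.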


Saving crawled data to a file, or processing it for analysis, matters just as much. For example, you can store crawled data in a CSV file or a database, then clean and transform it for analysis. If you'd like to learn more about this, see the post below.

[Python] Working with CSV and JSON Files in Python: An Input/Output Guide
creativevista.tistory.com
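
As a small illustration of the CSV option mentioned above, the sketch below writes a list of crawled titles to a file with the standard csv module and reads it back. The sample rows, the column names, and the file name titles.csv are placeholders, not real crawl results.

```python
import csv

# Placeholder rows standing in for crawled results
rows = [
    {'rank': 1, 'title': 'First headline'},
    {'rank': 2, 'title': 'Second headline, with a comma'},
]

# newline='' lets the csv module control line endings;
# utf-8-sig adds a BOM, which helps Excel detect the encoding of Korean text
with open('titles.csv', 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.DictWriter(f, fieldnames=['rank', 'title'])
    writer.writeheader()
    writer.writerows(rows)

# Read the file back to confirm what was written
with open('titles.csv', newline='', encoding='utf-8-sig') as f:
    saved = list(csv.DictReader(f))
print(saved)
```

Note that every value comes back as a string, and the csv module quotes the title that contains a comma so it survives the round trip intact.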


5. Precautions 🕵️‍♂️

When crawling, observe the following legal considerations and etiquette.

  1. Follow the Robots Exclusion Protocol (robots.txt)
    • Check the site's robots.txt to see which paths crawlers may access and which are restricted, and stay on the permitted paths.
    • If a website publishes a robots.txt file, comply with it.
  2. Follow the website's terms of service
    • Crawl only in ways that the site's terms of service allow.
    • In particular, respect any restrictions that the terms state explicitly.
  3. Keep a reasonable request rate
    • Take care not to send requests too often.
    • Keep the request rate low enough that it does not put load on the server.
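
The first and third points can be sketched with the standard library's urllib.robotparser and time.sleep. The rules below are written inline so the example runs without network access; the crawler name my-crawler, the example URLs, and the one-second default delay are all assumptions for illustration.

```python
import time
from urllib.robotparser import RobotFileParser

# Example rules, written inline so the sketch runs offline.
# For a real site, use parser.set_url('https://<site>/robots.txt') and parser.read().
robots_lines = [
    'User-agent: *',
    'Disallow: /private/',
    'Crawl-delay: 1',
]

parser = RobotFileParser()
parser.parse(robots_lines)

user_agent = 'my-crawler'  # hypothetical crawler name
urls = [
    'https://example.com/news/1',
    'https://example.com/private/2',
]

# Seconds to wait between requests; fall back to 1 when robots.txt sets no delay
delay = parser.crawl_delay(user_agent) or 1
allowed = []
for url in urls:
    if not parser.can_fetch(user_agent, url):
        print('skipped (disallowed):', url)
        continue
    allowed.append(url)  # a real crawler would call requests.get(url) here
    time.sleep(delay)    # pause so the server is not flooded with requests

print(allowed)
```

robots.txt does not cover a site's terms of service, so the second point still has to be checked by reading the terms themselves.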

6. Key Points 👀

[Image: web crawling]
