Python Tutorial: Easy Web Scraping with Beautiful Soup

 In the article "Pentingnya Web Crawling sebagai Cara Pengumpulan Data di Era Big Data" we discussed that data can be obtained in a number of ways, including:

  1. Direct input from customers, through surveys and questionnaires.
  2. Third-party APIs such as the Facebook API, Twitter API, and so on.
  3. Web server logs, such as those from Apache and Nginx.
  4. Web crawling or web scraping.

 This tutorial covers how to do web scraping in the Python programming language using the Beautiful Soup module.

As a first step, let's try scraping a very simple web page at this URL:

https://dataquestio.github.io/web-scraping-pages/simple.html

Oh, and besides the Beautiful Soup module, we will also use the Requests module to send HTTP requests to the web page we want to scrape.

import requests
from bs4 import BeautifulSoup

page = requests.get("https://dataquestio.github.io/web-scraping-pages/simple.html")
page.status_code
200
A status code of 200 means the target web page was downloaded successfully.
soup = BeautifulSoup(page.content, 'html.parser')
list(soup.children)
['html',
 '\n',
 <html>
 <head>
 <title>A simple example page</title>
 </head>
 <body>
 <p>Here is some simple content for this page.</p>
 </body>
 </html>]
[type(item) for item in list(soup.children)]
[bs4.element.Doctype, bs4.element.NavigableString, bs4.element.Tag]
html = list(soup.children)[2]
list(html.children)
['\n',
 <head>
 <title>A simple example page</title>
 </head>,
 '\n',
 <body>
 <p>Here is some simple content for this page.</p>
 </body>,
 '\n'] 
body = list(html.children)[3]
list(body.children) 
['\n', <p>Here is some simple content for this page.</p>, '\n'] 
p = list(body.children)[1]
p.get_text() 
'Here is some simple content for this page.' 
ps = soup.find_all('p') #find all p tags
for p in ps:
    print(p.get_text()) 
Here is some simple content for this page. 
soup.find('p') #find the first p tag 
<p>Here is some simple content for this page.</p> 
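As an aside, in real scrapers it is worth failing fast when a download does not return 200, rather than parsing an error page by accident. A minimal sketch using requests' raise_for_status (the fetch helper name is my own):

```python
import requests

def fetch(url: str) -> str:
    # raise_for_status() turns 4xx/5xx responses into an HTTPError,
    # so we never hand an error page to BeautifulSoup by mistake
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return resp.text
```
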

Next, let's try finding tags by class and id.

page = requests.get("https://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
soup = BeautifulSoup(page.content, 'html.parser')
soup
<html>
<head>
<title>A simple example page</title>
</head>
<body>
<div>
<p class="inner-text first-item" id="first">
                First paragraph.
            </p>
<p class="inner-text">
                Second paragraph.
            </p>
</div>
<p class="outer-text first-item" id="second">
<b>
                First outer paragraph.
            </b>
</p>
<p class="outer-text">
<b>
                Second outer paragraph.
            </b>
</p>
</body>
</html>
soup.find_all('p', class_='outer-text')
[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>,
 <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]
soup.find_all(class_='outer-text')
[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>,
 <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>] 
soup.find_all(id='first')
[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>] 

Menggunakan CSS Selector

  1. p a — finds all a tags inside of a p tag.
  2. body p a — finds all a tags inside of a p tag inside of a body tag.
  3. html body — finds all body tags inside of an html tag.
  4. p.outer-text — finds all p tags with a class of outer-text.
  5. p#first — finds all p tags with an id of first.
  6. body p.outer-text — finds any p tags with a class of outer-text inside of a body tag.
soup.select("div p")
[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>,
 <p class="inner-text">
                 Second paragraph.
             </p>] 
for p in soup.select("div p"):
    print(p.get_text().strip())
First paragraph.
Second paragraph. 
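The other selector patterns listed above work the same way. Here is a sketch that tries the id and class selectors against markup mirroring the ids_and_classes.html page shown earlier:

```python
from bs4 import BeautifulSoup

# Inline markup mirroring the ids_and_classes.html page above
html = """
<html><body>
<div>
<p class="inner-text first-item" id="first">First paragraph.</p>
<p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.select("p#first"))            # p tags with id="first"
print(soup.select("body p.outer-text"))  # p tags with class outer-text inside body
```
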

Scraping San Francisco Weather Data from forecast.weather.gov

Before we start scraping, we first need to know the structure of the target web page. Here we can use the Developer Tools in the Chrome browser.

View -> Developer -> Developer Tools

Then make sure the Elements panel is selected.

Right-click the text "Extended Forecast", then click "Inspect"; this opens the tag containing the "Extended Forecast" text in the Elements panel.

Scroll up in the Elements panel to find the outermost element that contains all the text related to the Extended Forecast. In this case it is the div tag with the id seven-day-forecast.

If you switch to the Console and explore that div, you will find that each forecast item (Today, Tonight, Tuesday, and so on) sits inside a div tag with the class tombstone-container.

OK, time to start scraping:

  1. Download the web page containing the forecast.
  2. Create a BeautifulSoup object to parse the page.
  3. Find the div with the id seven-day-forecast and assign it to the variable seven_day.
  4. Inside seven_day, find each forecast item.
  5. Extract and print the first forecast item.
page = requests.get("https://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168")
soup = BeautifulSoup(page.content, 'html.parser')
seven_day = soup.find(id="seven-day-forecast")

forecast_items = seven_day.find_all(class_="tombstone-container")
today = forecast_items[0]
print(today.prettify())
<div class="tombstone-container">
 <p class="period-name">
  Today
  <br/>
  <br/>
 </p>
 <p>
  <img alt="Today: Cloudy through mid morning, then gradual clearing, with a high near 67. Breezy, with a west southwest wind 8 to 13 mph increasing to 17 to 22 mph in the afternoon. Winds could gust as high as 28 mph. " class="forecast-icon" src="DualImage.php?i=bkn&amp;j=wind_few" title="Today: Cloudy through mid morning, then gradual clearing, with a high near 67. Breezy, with a west southwest wind 8 to 13 mph increasing to 17 to 22 mph in the afternoon. Winds could gust as high as 28 mph. "/>
 </p>
 <p class="short-desc">
  Mostly Cloudy
  <br/>
  then Sunny
  <br/>
  and Breezy
 </p>
 <p class="temp temp-high">
  High: 67 °F
 </p>
</div> 

 Extracting information from the target web page

  1. The forecast item name: in this case, Today.
  2. The weather condition description: stored in the title attribute of the img tag.
  3. A short description of the conditions: in this case, Mostly Cloudy.
  4. The high temperature: in this case, 67 degrees.
period = today.find(class_='period-name').get_text()
short_desc = today.find(class_='short-desc')
for br in short_desc.find_all('br'):
    br.replace_with('\n' + br.text)
short_desc = short_desc.get_text().replace('\n', ' ')
temp = today.find(class_='temp').get_text()

print(period)
print(short_desc)
print(temp)
Today
Mostly Cloudy then Sunny and Breezy
High: 67 °F 
img = today.find('img') # a Tag's attributes can be accessed like a dictionary
desc = img["title"]
print(desc)
Today: Cloudy through mid morning, then gradual clearing, with a high near 67. Breezy, with a west southwest wind 8 to 13 mph increasing to 17 to 22 mph in the afternoon. Winds could gust as high as 28 mph.

Extracting all the information from the target web page:

  1. Select all items with the class period-name inside an item with the class tombstone-container inside seven_day.
  2. Use a list comprehension to call the get_text method on each BeautifulSoup object.
period_tags = seven_day.select(".tombstone-container .period-name")
periods = [pt.get_text() for pt in period_tags]
periods
['Today',
 'Tonight',
 'Tuesday',
 'TuesdayNight',
 'Wednesday',
 'WednesdayNight',
 'Thursday',
 'ThursdayNight',
 'Friday']
short_descs = seven_day.select(".tombstone-container .short-desc")
for sd in short_descs:
    for br in sd.find_all('br'):
        br.replace_with('\n' + br.text)
short_descs = [sd.get_text().replace('\n', ' ') for sd in short_descs]
short_descs
['Mostly Cloudy then Sunny and Breezy',
 'Increasing Clouds',
 'Gradual Clearing and Breezy',
 'Increasing Clouds and Windy',
 'Partly Sunny then Sunny and Breezy',
 'Mostly Clear and Breezy then Partly Cloudy',
 'Sunny',
 'Mostly Clear',
 'Sunny']
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
temps
['High: 67 °F',
 'Low: 56 °F',
 'High: 66 °F',
 'Low: 55 °F',
 'High: 68 °F',
 'Low: 56 °F',
 'High: 73 °F',
 'Low: 56 °F',
 'High: 71 °F']
descs = [d["title"] for d in seven_day.select(".tombstone-container img")]
descs
['Today: Cloudy through mid morning, then gradual clearing, with a high near 67. Breezy, with a west southwest wind 8 to 13 mph increasing to 17 to 22 mph in the afternoon. Winds could gust as high as 28 mph. ',
 'Tonight: Increasing clouds, with a low around 56. West southwest wind 13 to 18 mph, with gusts as high as 24 mph. ',
 'Tuesday: Cloudy through mid morning, then gradual clearing, with a high near 66. Breezy, with a west wind 13 to 18 mph increasing to 23 to 28 mph in the afternoon. Winds could gust as high as 36 mph. ',
 'Tuesday Night: Increasing clouds, with a low around 55. Windy, with a west wind 23 to 30 mph, with gusts as high as 38 mph. ',
 'Wednesday: Mostly sunny, with a high near 68. Breezy, with a west wind 20 to 24 mph, with gusts as high as 31 mph. ',
 'Wednesday Night: Partly cloudy, with a low around 56. Breezy. ',
 'Thursday: Sunny, with a high near 73.',
 'Thursday Night: Mostly clear, with a low around 56.',
 'Friday: Sunny, with a high near 71.']

Combining the Data into a Pandas DataFrame

import pandas as pd
weather = pd.DataFrame({
    "period": periods,
    "short_desc": short_descs,
    "temp": temps,
    "desc": descs
})

weather

Trying a Bit of Analysis

 Extract the numbers from the 'temp' column

temp_nums = weather["temp"].str.extract(r'(\d+)', expand=False)
temp_nums
0    67
1    56
2    66
3    55
4    68
5    56
6    73
7    56
8    71
Name: temp, dtype: object
weather["temp_num"] = temp_nums.astype('int')
weather
weather["temp_num"].mean()
63.111111111111114
is_night = weather["temp"].str.contains("Low")
weather["is_night"] = is_night
is_night
0    False
1     True
2    False
3     True
4    False
5     True
6    False
7     True
8    False
Name: temp, dtype: bool
weather[is_night]
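As a small follow-up sketch, the is_night flag can also be used to compare the average high against the average low. The frame below is rebuilt by hand from the first few scraped rows so the snippet runs standalone:

```python
import pandas as pd

# Illustrative rows mirroring the 'temp' column scraped above
weather = pd.DataFrame({
    "temp": ["High: 67 °F", "Low: 56 °F", "High: 66 °F", "Low: 55 °F"],
})
weather["temp_num"] = weather["temp"].str.extract(r"(\d+)", expand=False).astype("int")
weather["is_night"] = weather["temp"].str.contains("Low")

# Mean high (is_night=False) vs. mean low (is_night=True)
print(weather.groupby("is_night")["temp_num"].mean())
```
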


That concludes this basic scraping tutorial with Python and BeautifulSoup.

Give it a try!
