Python Tutorial: Easy Web Scraping with Beautiful Soup

In the article "Pentingnya Web Crawling sebagai Cara Pengumpulan Data di Era Big Data" we discussed that data can be collected in a number of ways, including:

  1. Direct input from customers, through surveys or questionnaires.
  2. Third-party APIs such as the Facebook API, the Twitter API, and so on.
  3. Web server logs, such as those from Apache and Nginx.
  4. Web crawling or web scraping.

This tutorial covers how to do web scraping in the Python programming language using the Beautiful Soup module.

As a first step, let's try scraping a very simple web page at the following URL:

https://dataquestio.github.io/web-scraping-pages/simple.html

Besides the Beautiful Soup module, we will also use the Requests module to send HTTP requests to the page we want to scrape.

import requests
from bs4 import BeautifulSoup

page = requests.get("https://dataquestio.github.io/web-scraping-pages/simple.html")
page.status_code
200
Status code 200 means the target page was downloaded successfully.
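In a real scraper it is worth guarding against failed downloads instead of checking the status code by eye. A minimal sketch using Requests' built-in error handling (the timeout value is just an example):

page = requests.get(
    "https://dataquestio.github.io/web-scraping-pages/simple.html",
    timeout=10,  # example timeout so a slow server cannot hang the script
)
page.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses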
soup = BeautifulSoup(page.content, 'html.parser')
list(soup.children)
['html',
 '\n',
 <html>
 <head>
 <title>A simple example page</title>
 </head>
 <body>
 <p>Here is some simple content for this page.</p>
 </body>
 </html>]
[type(item) for item in list(soup.children)]
[bs4.element.Doctype, bs4.element.NavigableString, bs4.element.Tag]
html = list(soup.children)[2]
list(html.children)
['\n',
 <head>
 <title>A simple example page</title>
 </head>,
 '\n',
 <body>
 <p>Here is some simple content for this page.</p>
 </body>,
 '\n'] 
body = list(html.children)[3]
list(body.children) 
['\n', <p>Here is some simple content for this page.</p>, '\n'] 
p = list(body.children)[1]
p.get_text() 
'Here is some simple content for this page.' 
ps = soup.find_all('p') #find all p tags
for p in ps:
    print(p.get_text()) 
Here is some simple content for this page. 
soup.find('p') #find the first p tag 
<p>Here is some simple content for this page.</p> 

Next, let's try finding tags by class and id.

page = requests.get("https://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
soup = BeautifulSoup(page.content, 'html.parser')
soup
<html>
<head>
<title>A simple example page</title>
</head>
<body>
<div>
<p class="inner-text first-item" id="first">
                First paragraph.
            </p>
<p class="inner-text">
                Second paragraph.
            </p>
</div>
<p class="outer-text first-item" id="second">
<b>
                First outer paragraph.
            </b>
</p>
<p class="outer-text">
<b>
                Second outer paragraph.
            </b>
</p>
</body>
</html>
soup.find_all('p', class_='outer-text')
[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>,
 <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]
soup.find_all(class_='outer-text')
[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>,
 <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>] 
soup.find_all(id='first')
[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>] 

Using CSS Selectors

  1. p a — finds all a tags inside of a p tag.
  2. body p a — finds all a tags inside of a p tag inside of a body tag.
  3. html body — finds all body tags inside of an html tag.
  4. p.outer-text — finds all p tags with a class of outer-text.
  5. p#first — finds all p tags with an id of first.
  6. body p.outer-text — finds any p tags with a class of outer-text inside of a body tag.
soup.select("div p")
[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>,
 <p class="inner-text">
                 Second paragraph.
             </p>] 
for p in soup.select("div p"):
    print(p.get_text().strip())
First paragraph.
Second paragraph. 
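The other selector forms listed above work the same way with the select method; for example, against the page we just loaded (outputs omitted here):

soup.select("p.outer-text")       # all p tags with class outer-text
soup.select("p#first")            # the p tag with id first
soup.select("body p.outer-text")  # outer-text paragraphs inside body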

Scraping San Francisco Weather Data from forecast.weather.gov

Before we start scraping, we need to know the structure of the target page. Here we can use the Developer Tools in the Chrome web browser.

View -> Developer -> Developer Tools

Then make sure the Elements panel is selected.

Right-click the "Extended Forecast" text, then click "Inspect"; this opens the tag containing the "Extended Forecast" text in the Elements panel.

Scroll up in the Elements panel to find the outermost element that contains all the text related to the Extended Forecast. In this case, it is the div tag with the id seven-day-forecast.

If you switch to the Console and explore that div, you will find that each forecast item (Today, Tonight, Tuesday, and so on) lives in a div tag with the class tombstone-container.

OK, time to start scraping:

  1. Download the web page containing the forecast.
  2. Create a BeautifulSoup object to parse the page.
  3. Find the div with the id seven-day-forecast and assign it to the variable seven_day.
  4. Inside seven_day, find each forecast item.
  5. Extract and print the first forecast item.
page = requests.get("https://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168")
soup = BeautifulSoup(page.content, 'html.parser')
seven_day = soup.find(id="seven-day-forecast")

forecast_items = seven_day.find_all(class_="tombstone-container")
today = forecast_items[0]
print(today.prettify())
<div class="tombstone-container">
 <p class="period-name">
  Today
  <br/>
  <br/>
 </p>
 <p>
  <img alt="Today: Cloudy through mid morning, then gradual clearing, with a high near 67. Breezy, with a west southwest wind 8 to 13 mph increasing to 17 to 22 mph in the afternoon. Winds could gust as high as 28 mph. " class="forecast-icon" src="DualImage.php?i=bkn&amp;j=wind_few" title="Today: Cloudy through mid morning, then gradual clearing, with a high near 67. Breezy, with a west southwest wind 8 to 13 mph increasing to 17 to 22 mph in the afternoon. Winds could gust as high as 28 mph. "/>
 </p>
 <p class="short-desc">
  Mostly Cloudy
  <br/>
  then Sunny
  <br/>
  and Breezy
 </p>
 <p class="temp temp-high">
  High: 67 °F
 </p>
</div> 

Extract the following information from the target page:

  1. The name of the forecast item, in this case Today.
  2. The description of the weather conditions, stored in the title attribute of the img tag.
  3. A short description of the conditions, in this case Mostly Cloudy.
  4. The high temperature, in this case 67 degrees.
period = today.find(class_='period-name').get_text()
short_desc = today.find(class_='short-desc')
for br in short_desc.find_all('br'):
    br.replace_with('\n' + br.text)  # turn each <br/> into a newline
short_desc = short_desc.get_text().replace('\n', ' ')  # then flatten the newlines into spaces
temp = today.find(class_='temp').get_text()

print(period)
print(short_desc)
print(temp)
Today
Mostly Cloudy then Sunny and Breezy
High: 67 °F 
img = today.find('img')  # a Tag object; its attributes can be read like dictionary keys
desc = img["title"]
print(desc)
Today: Cloudy through mid morning, then gradual clearing, with a high near 67. Breezy, with a west southwest wind 8 to 13 mph increasing to 17 to 22 mph in the afternoon. Winds could gust as high as 28 mph.
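To make this per-item extraction reusable, the four lookups can be bundled into one function. A hypothetical helper (the name parse_item is our own, and get_text's separator argument stands in for the manual <br/> replacement above):

def parse_item(item):
    # item is one div with class tombstone-container
    period = item.find(class_='period-name').get_text()
    short_desc = item.find(class_='short-desc').get_text(separator=' ')
    temp = item.find(class_='temp').get_text()
    desc = item.find('img')['title']
    return period, short_desc, temp, desc

print(parse_item(forecast_items[0]))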

Extract all of the information from the target page:

  1. Select all items with the class period-name inside items with the class tombstone-container inside seven_day.
  2. Use a list comprehension to call the get_text method on each BeautifulSoup object.
period_tags = seven_day.select(".tombstone-container .period-name")
periods = [pt.get_text() for pt in period_tags]
periods
['Today',
 'Tonight',
 'Tuesday',
 'TuesdayNight',
 'Wednesday',
 'WednesdayNight',
 'Thursday',
 'ThursdayNight',
 'Friday']
short_descs = seven_day.select(".tombstone-container .short-desc")
for sd in short_descs:
    for br in sd.find_all('br'):
        br.replace_with('\n' + br.text)
short_descs = [sd.get_text().replace('\n', ' ') for sd in short_descs]
short_descs
['Mostly Cloudy then Sunny and Breezy',
 'Increasing Clouds',
 'Gradual Clearing and Breezy',
 'Increasing Clouds and Windy',
 'Partly Sunny then Sunny and Breezy',
 'Mostly Clear and Breezy then Partly Cloudy',
 'Sunny',
 'Mostly Clear',
 'Sunny']
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
temps
['High: 67 °F',
 'Low: 56 °F',
 'High: 66 °F',
 'Low: 55 °F',
 'High: 68 °F',
 'Low: 56 °F',
 'High: 73 °F',
 'Low: 56 °F',
 'High: 71 °F']
descs = [d["title"] for d in seven_day.select(".tombstone-container img")]
descs
['Today: Cloudy through mid morning, then gradual clearing, with a high near 67. Breezy, with a west southwest wind 8 to 13 mph increasing to 17 to 22 mph in the afternoon. Winds could gust as high as 28 mph. ',
 'Tonight: Increasing clouds, with a low around 56. West southwest wind 13 to 18 mph, with gusts as high as 24 mph. ',
 'Tuesday: Cloudy through mid morning, then gradual clearing, with a high near 66. Breezy, with a west wind 13 to 18 mph increasing to 23 to 28 mph in the afternoon. Winds could gust as high as 36 mph. ',
 'Tuesday Night: Increasing clouds, with a low around 55. Windy, with a west wind 23 to 30 mph, with gusts as high as 38 mph. ',
 'Wednesday: Mostly sunny, with a high near 68. Breezy, with a west wind 20 to 24 mph, with gusts as high as 31 mph. ',
 'Wednesday Night: Partly cloudy, with a low around 56. Breezy. ',
 'Thursday: Sunny, with a high near 73.',
 'Thursday Night: Mostly clear, with a low around 56.',
 'Friday: Sunny, with a high near 71.']

Combining the Data into a Pandas DataFrame

import pandas as pd
weather = pd.DataFrame({
    "period": periods,
    "short_desc": short_descs,
    "temp": temps,
    "desc": descs
})

weather
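The weather DataFrame now holds one row per forecast item. If you want to keep the scraped data around, pandas can write it straight to a file; a minimal sketch (the file name is just an example):

weather.to_csv("sf_weather.csv", index=False)  # hypothetical output file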

Trying a Little Analysis

Extract the numbers from the 'temp' column

temp_nums = weather["temp"].str.extract(r'(\d+)', expand=False)
temp_nums
0    67
1    56
2    66
3    55
4    68
5    56
6    73
7    56
8    71
Name: temp, dtype: object
weather["temp_num"] = temp_nums.astype('int')
weather
weather["temp_num"].mean()
63.111111111111114
is_night = weather["temp"].str.contains("Low")
weather["is_night"] = is_night
is_night
0    False
1     True
2    False
3     True
4    False
5     True
6    False
7     True
8    False
Name: temp, dtype: bool
weather[is_night]
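Since is_night flags the nighttime rows, you can also compare day and night averages directly; a quick sketch:

weather.groupby("is_night")["temp_num"].mean()  # mean of the highs vs. mean of the lows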


That wraps up this basic scraping tutorial using Python and BeautifulSoup.

Happy scraping!
