Python Tutorial: Easy Web Scraping with Beautiful Soup

In the article "Pentingnya Web Crawling sebagai Cara Pengumpulan Data di Era Big Data" we discussed that data can be collected in a number of ways, including:

  1. Direct input from customers, through surveys or questionnaires.
  2. Third-party APIs such as the Facebook API, the Twitter API, and so on.
  3. Web server logs, such as those from Apache and Nginx.
  4. Web crawling or web scraping.

This tutorial covers how to do web scraping in the Python programming language using the Beautiful Soup module.

As a first step, let's try scraping a very simple web page at the following URL:

https://dataquestio.github.io/web-scraping-pages/simple.html

Besides the Beautiful Soup module, we will also use the Requests module to send HTTP requests to the page we want to scrape.

import requests
from bs4 import BeautifulSoup

page = requests.get("https://dataquestio.github.io/web-scraping-pages/simple.html")
page.status_code
200
Status code 200 means the target page was downloaded successfully.
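In a real scraper it is worth guarding against failed downloads instead of checking the status code by eye. A minimal sketch using Requests' built-in error handling (the timeout value is just an example):

page = requests.get(
    "https://dataquestio.github.io/web-scraping-pages/simple.html",
    timeout=10,  # example timeout so a slow server cannot hang the script
)
page.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses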
soup = BeautifulSoup(page.content, 'html.parser')
list(soup.children)
['html',
 '\n',
 <html>
 <head>
 <title>A simple example page</title>
 </head>
 <body>
 <p>Here is some simple content for this page.</p>
 </body>
 </html>]
[type(item) for item in list(soup.children)]
[bs4.element.Doctype, bs4.element.NavigableString, bs4.element.Tag]
html = list(soup.children)[2]
list(html.children)
['\n',
 <head>
 <title>A simple example page</title>
 </head>,
 '\n',
 <body>
 <p>Here is some simple content for this page.</p>
 </body>,
 '\n'] 
body = list(html.children)[3]
list(body.children) 
['\n', <p>Here is some simple content for this page.</p>, '\n'] 
p = list(body.children)[1]
p.get_text() 
'Here is some simple content for this page.' 
ps = soup.find_all('p') #find all p tags
for p in ps:
    print(p.get_text()) 
Here is some simple content for this page. 
soup.find('p') #find the first p tag 
<p>Here is some simple content for this page.</p> 

Next, let's try finding tags by class and id.

page = requests.get("https://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
soup = BeautifulSoup(page.content, 'html.parser')
soup
<html>
<head>
<title>A simple example page</title>
</head>
<body>
<div>
<p class="inner-text first-item" id="first">
                First paragraph.
            </p>
<p class="inner-text">
                Second paragraph.
            </p>
</div>
<p class="outer-text first-item" id="second">
<b>
                First outer paragraph.
            </b>
</p>
<p class="outer-text">
<b>
                Second outer paragraph.
            </b>
</p>
</body>
</html>
soup.find_all('p', class_='outer-text')
[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>,
 <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>]
soup.find_all(class_='outer-text')
[<p class="outer-text first-item" id="second">
 <b>
                 First outer paragraph.
             </b>
 </p>,
 <p class="outer-text">
 <b>
                 Second outer paragraph.
             </b>
 </p>] 
soup.find_all(id='first')
[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>] 

Using CSS Selectors

  1. p a — finds all a tags inside of a p tag.
  2. body p a — finds all a tags inside of a p tag inside of a body tag.
  3. html body — finds all body tags inside of an html tag.
  4. p.outer-text — finds all p tags with a class of outer-text.
  5. p#first — finds all p tags with an id of first.
  6. body p.outer-text — finds any p tags with a class of outer-text inside of a body tag.
soup.select("div p")
[<p class="inner-text first-item" id="first">
                 First paragraph.
             </p>,
 <p class="inner-text">
                 Second paragraph.
             </p>] 
for p in soup.select("div p"):
    print(p.get_text().strip())
First paragraph.
Second paragraph. 
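The other selector forms listed above work the same way with the select method; for example, against the page we just loaded (outputs omitted here):

soup.select("p.outer-text")       # all p tags with class outer-text
soup.select("p#first")            # the p tag with id first
soup.select("body p.outer-text")  # outer-text paragraphs inside body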

Scraping San Francisco Weather Data from forecast.weather.gov

Before we start scraping, we need to know the structure of the target page. Here we can use the Developer Tools in the Chrome web browser.

View -> Developer -> Developer Tools

Then make sure the Elements panel is selected.

Right-click the "Extended Forecast" text, then click "Inspect"; this opens the tag containing the "Extended Forecast" text in the Elements panel.

Scroll up in the Elements panel to find the outermost element that contains all the text related to the Extended Forecast. In this case, it is the div tag with the id seven-day-forecast.

If you switch to the Console and explore that div, you will find that each forecast item (Today, Tonight, Tuesday, and so on) lives in a div tag with the class tombstone-container.

OK, time to start scraping:

  1. Download the web page containing the forecast.
  2. Create a BeautifulSoup object to parse the page.
  3. Find the div with the id seven-day-forecast and assign it to the variable seven_day.
  4. Inside seven_day, find each forecast item.
  5. Extract and print the first forecast item.
page = requests.get("https://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168")
soup = BeautifulSoup(page.content, 'html.parser')
seven_day = soup.find(id="seven-day-forecast")

forecast_items = seven_day.find_all(class_="tombstone-container")
today = forecast_items[0]
print(today.prettify())
<div class="tombstone-container">
 <p class="period-name">
  Today
  <br/>
  <br/>
 </p>
 <p>
  <img alt="Today: Cloudy through mid morning, then gradual clearing, with a high near 67. Breezy, with a west southwest wind 8 to 13 mph increasing to 17 to 22 mph in the afternoon. Winds could gust as high as 28 mph. " class="forecast-icon" src="DualImage.php?i=bkn&amp;j=wind_few" title="Today: Cloudy through mid morning, then gradual clearing, with a high near 67. Breezy, with a west southwest wind 8 to 13 mph increasing to 17 to 22 mph in the afternoon. Winds could gust as high as 28 mph. "/>
 </p>
 <p class="short-desc">
  Mostly Cloudy
  <br/>
  then Sunny
  <br/>
  and Breezy
 </p>
 <p class="temp temp-high">
  High: 67 °F
 </p>
</div> 

Extract the following information from the target page:

  1. The name of the forecast item, in this case Today.
  2. The description of the weather conditions, stored in the title attribute of the img tag.
  3. A short description of the conditions, in this case Mostly Cloudy.
  4. The high temperature, in this case 67 degrees.
period = today.find(class_='period-name').get_text()
short_desc = today.find(class_='short-desc')
for br in short_desc.find_all('br'):
    br.replace_with('\n' + br.text)  # turn each <br/> into a newline
short_desc = short_desc.get_text().replace('\n', ' ')  # then flatten the newlines into spaces
temp = today.find(class_='temp').get_text()

print(period)
print(short_desc)
print(temp)
Today
Mostly Cloudy then Sunny and Breezy
High: 67 °F 
img = today.find('img')  # a Tag object; its attributes can be read like dictionary keys
desc = img["title"]
print(desc)
Today: Cloudy through mid morning, then gradual clearing, with a high near 67. Breezy, with a west southwest wind 8 to 13 mph increasing to 17 to 22 mph in the afternoon. Winds could gust as high as 28 mph.
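To make this per-item extraction reusable, the four lookups can be bundled into one function. A hypothetical helper (the name parse_item is our own, and get_text's separator argument stands in for the manual <br/> replacement above):

def parse_item(item):
    # item is one div with class tombstone-container
    period = item.find(class_='period-name').get_text()
    short_desc = item.find(class_='short-desc').get_text(separator=' ')
    temp = item.find(class_='temp').get_text()
    desc = item.find('img')['title']
    return period, short_desc, temp, desc

print(parse_item(forecast_items[0]))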

Extract all of the information from the target page:

  1. Select all items with the class period-name inside items with the class tombstone-container inside seven_day.
  2. Use a list comprehension to call the get_text method on each BeautifulSoup object.
period_tags = seven_day.select(".tombstone-container .period-name")
periods = [pt.get_text() for pt in period_tags]
periods
['Today',
 'Tonight',
 'Tuesday',
 'TuesdayNight',
 'Wednesday',
 'WednesdayNight',
 'Thursday',
 'ThursdayNight',
 'Friday']
short_descs = seven_day.select(".tombstone-container .short-desc")
for sd in short_descs:
    for br in sd.find_all('br'):
        br.replace_with('\n' + br.text)
short_descs = [sd.get_text().replace('\n', ' ') for sd in short_descs]
short_descs
['Mostly Cloudy then Sunny and Breezy',
 'Increasing Clouds',
 'Gradual Clearing and Breezy',
 'Increasing Clouds and Windy',
 'Partly Sunny then Sunny and Breezy',
 'Mostly Clear and Breezy then Partly Cloudy',
 'Sunny',
 'Mostly Clear',
 'Sunny']
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
temps
['High: 67 °F',
 'Low: 56 °F',
 'High: 66 °F',
 'Low: 55 °F',
 'High: 68 °F',
 'Low: 56 °F',
 'High: 73 °F',
 'Low: 56 °F',
 'High: 71 °F']
descs = [d["title"] for d in seven_day.select(".tombstone-container img")]
descs
['Today: Cloudy through mid morning, then gradual clearing, with a high near 67. Breezy, with a west southwest wind 8 to 13 mph increasing to 17 to 22 mph in the afternoon. Winds could gust as high as 28 mph. ',
 'Tonight: Increasing clouds, with a low around 56. West southwest wind 13 to 18 mph, with gusts as high as 24 mph. ',
 'Tuesday: Cloudy through mid morning, then gradual clearing, with a high near 66. Breezy, with a west wind 13 to 18 mph increasing to 23 to 28 mph in the afternoon. Winds could gust as high as 36 mph. ',
 'Tuesday Night: Increasing clouds, with a low around 55. Windy, with a west wind 23 to 30 mph, with gusts as high as 38 mph. ',
 'Wednesday: Mostly sunny, with a high near 68. Breezy, with a west wind 20 to 24 mph, with gusts as high as 31 mph. ',
 'Wednesday Night: Partly cloudy, with a low around 56. Breezy. ',
 'Thursday: Sunny, with a high near 73.',
 'Thursday Night: Mostly clear, with a low around 56.',
 'Friday: Sunny, with a high near 71.']

Combining the Data into a Pandas DataFrame

import pandas as pd
weather = pd.DataFrame({
    "period": periods,
    "short_desc": short_descs,
    "temp": temps,
    "desc": descs
})

weather
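The weather DataFrame now holds one row per forecast item. If you want to keep the scraped data around, pandas can write it straight to a file; a minimal sketch (the file name is just an example):

weather.to_csv("sf_weather.csv", index=False)  # hypothetical output file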

Trying a Little Analysis

Extract the numbers from the 'temp' column

temp_nums = weather["temp"].str.extract(r'(\d+)', expand=False)
temp_nums
0    67
1    56
2    66
3    55
4    68
5    56
6    73
7    56
8    71
Name: temp, dtype: object
weather["temp_num"] = temp_nums.astype('int')
weather
weather["temp_num"].mean()
63.111111111111114
is_night = weather["temp"].str.contains("Low")
weather["is_night"] = is_night
is_night
0    False
1     True
2    False
3     True
4    False
5     True
6    False
7     True
8    False
Name: temp, dtype: bool
weather[is_night]
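Since is_night flags the nighttime rows, you can also compare day and night averages directly; a quick sketch:

weather.groupby("is_night")["temp_num"].mean()  # mean of the highs vs. mean of the lows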


That wraps up this basic scraping tutorial using Python and BeautifulSoup.

Happy scraping!
