Python Tutorial: Easy Web Scraping with Beautiful Soup
In the article "Pentingnya Web Crawling sebagai Cara Pengumpulan Data di Era Big Data" we discussed that data can be gathered in a number of ways, including:
- Direct input from customers, through surveys and questionnaires.
- Third-party APIs such as the Facebook API, the Twitter API, and so on.
- Web server logs, such as those from Apache and Nginx.
- Web Crawling or Web Scraping.
This tutorial covers how to do Web Scraping in Python using the Beautiful Soup module.
As a first step, let's try scraping a very simple webpage at this URL:
https://dataquestio.github.io/web-scraping-pages/simple.html
Besides Beautiful Soup, we will also use the Requests module to send HTTP requests to the webpage we are targeting.

import requests
from bs4 import BeautifulSoup

page = requests.get("https://dataquestio.github.io/web-scraping-pages/simple.html")
page.status_code
200
A status code of 200 means the target webpage was downloaded successfully.
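Checking page.status_code by hand is fine for a demo; in a longer script you may prefer the request to fail loudly on any non-2xx response. A minimal sketch (the fetch_html helper and its timeout value are my own additions, not part of the original code):

```python
import requests

def fetch_html(url, timeout=10):
    """Download a page, raising requests.HTTPError on any 4xx/5xx status."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # no-op for 2xx, raises requests.HTTPError otherwise
    return response.text

# Example (hits the network):
# html = fetch_html("https://dataquestio.github.io/web-scraping-pages/simple.html")
```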
soup = BeautifulSoup(page.content, 'html.parser')
list(soup.children)
['html', '\n', <html> <head> <title>A simple example page</title> </head> <body> <p>Here is some simple content for this page.</p> </body> </html>]
[type(item) for item in list(soup.children)]
[bs4.element.Doctype, bs4.element.NavigableString, bs4.element.Tag]
html = list(soup.children)[2]
list(html.children)
['\n', <head> <title>A simple example page</title> </head>, '\n', <body> <p>Here is some simple content for this page.</p> </body>, '\n']
body = list(html.children)[3]
list(body.children)
['\n', <p>Here is some simple content for this page.</p>, '\n']
p = list(body.children)[1]
p.get_text()
'Here is some simple content for this page.'
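Walking .children by index, as above, is fragile: any extra whitespace node shifts every position, while find reaches the same tag directly. A small sketch with the simple.html markup inlined as a string, so it runs without a network connection:

```python
from bs4 import BeautifulSoup

# The markup of simple.html, inlined for an offline demo.
html_doc = """<!DOCTYPE html>
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <p>Here is some simple content for this page.</p>
    </body>
</html>"""

demo_soup = BeautifulSoup(html_doc, 'html.parser')

# Index-based navigation, as above: three hops that break if the markup shifts.
p_by_index = list(list(list(demo_soup.children)[2].children)[3].children)[1]

# Direct search: one call, independent of whitespace and node positions.
p_by_find = demo_soup.find('p')

assert p_by_index is p_by_find
print(p_by_find.get_text())  # Here is some simple content for this page.
```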
ps = soup.find_all('p')  # find all p tags
for p in ps:
    print(p.get_text())
Here is some simple content for this page.
soup.find('p') #find the first p tag
<p>Here is some simple content for this page.</p>
Next, let's try finding tags by class and id.

page = requests.get("https://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
soup = BeautifulSoup(page.content, 'html.parser')
soup
<html> <head> <title>A simple example page</title> </head> <body> <div> <p class="inner-text first-item" id="first"> First paragraph. </p> <p class="inner-text"> Second paragraph. </p> </div> <p class="outer-text first-item" id="second"> <b> First outer paragraph. </b> </p> <p class="outer-text"> <b> Second outer paragraph. </b> </p> </body> </html>
soup.find_all('p', class_='outer-text')
[<p class="outer-text first-item" id="second"> <b> First outer paragraph. </b> </p>, <p class="outer-text"> <b> Second outer paragraph. </b> </p>]
soup.find_all(class_='outer-text')
[<p class="outer-text first-item" id="second"> <b> First outer paragraph. </b> </p>, <p class="outer-text"> <b> Second outer paragraph. </b> </p>]
soup.find_all(id='first')
[<p class="inner-text first-item" id="first"> First paragraph. </p>]
Using CSS Selectors
- p a — finds all a tags inside of a p tag.
- body p a — finds all a tags inside of a p tag inside of a body tag.
- html body — finds all body tags inside of an html tag.
- p.outer-text — finds all p tags with a class of outer-text.
- p#first — finds all p tags with an id of first.
- body p.outer-text — finds any p tags with a class of outer-text inside of a body tag.
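The selectors above can be tried out with the select method. A sketch using the ids_and_classes markup inlined as a string, so it runs without refetching the page:

```python
from bs4 import BeautifulSoup

# The markup of ids_and_classes.html, inlined for an offline demo.
html_doc = """<html>
<head><title>A simple example page</title></head>
<body>
<div>
<p class="inner-text first-item" id="first">First paragraph.</p>
<p class="inner-text">Second paragraph.</p>
</div>
<p class="outer-text first-item" id="second"><b>First outer paragraph.</b></p>
<p class="outer-text"><b>Second outer paragraph.</b></p>
</body>
</html>"""

demo_soup = BeautifulSoup(html_doc, 'html.parser')

print(len(demo_soup.select("p.outer-text")))       # 2: class selector
print(len(demo_soup.select("p#first")))            # 1: id selector
print(len(demo_soup.select("body p.outer-text")))  # 2: descendant + class
print(len(demo_soup.select("div p")))              # 2: only the inner paragraphs
```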
soup.select("div p")
[<p class="inner-text first-item" id="first"> First paragraph. </p>, <p class="inner-text"> Second paragraph. </p>]
for p in soup.select("div p"):
    print(p.get_text().strip())
First paragraph.
Second paragraph.
Scraping San Francisco Weather Data from forecast.weather.gov
Before we start scraping, we need to understand the structure of the target webpage. We can use the Developer Tools in the Chrome browser:
View -> Developer -> Developer Tools
Then make sure the Elements panel is selected.
Right-click the text "Extended Forecast", then click "Inspect"; this opens the tag containing the "Extended Forecast" text in the Elements panel.
Scroll up in the Elements panel to find the outermost element containing all of the Extended Forecast text. In this case it is the div tag with id seven-day-forecast.
If you open the Console and explore that div, you will find that each forecast item (Today, Tonight, Tuesday, and so on) lives in a div with class tombstone-container.
OK, time to start scraping:
- Download the web page containing the forecast.
- Create a BeautifulSoup object to parse the page.
- Find the div with id seven-day-forecast and assign it to the variable seven_day.
- Inside seven_day, find each forecast item.
- Extract and print the first forecast item.

page = requests.get("https://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168")
soup = BeautifulSoup(page.content, 'html.parser')
seven_day = soup.find(id="seven-day-forecast")
forecast_items = seven_day.find_all(class_="tombstone-container")
today = forecast_items[0]
print(today.prettify())
<div class="tombstone-container"> <p class="period-name"> Today <br/> <br/> </p> <p> <img alt="Today: Cloudy through mid morning, then gradual clearing, with a high near 67. Breezy, with a west southwest wind 8 to 13 mph increasing to 17 to 22 mph in the afternoon. Winds could gust as high as 28 mph. " class="forecast-icon" src="DualImage.php?i=bkn&j=wind_few" title="Today: Cloudy through mid morning, then gradual clearing, with a high near 67. Breezy, with a west southwest wind 8 to 13 mph increasing to 17 to 22 mph in the afternoon. Winds could gust as high as 28 mph. "/> </p> <p class="short-desc"> Mostly Cloudy <br/> then Sunny <br/> and Breezy </p> <p class="temp temp-high"> High: 67 °F </p> </div>
Extracting information from the target web page:
- The name of the forecast item, in this case Today.
- The full weather description, stored in the title attribute of the img tag.
- The short weather description, in this case Mostly Cloudy.
- The high temperature, in this case 67 degrees.
period = today.find(class_='period-name').get_text()
short_desc = today.find(class_='short-desc')
# Turn each <br/> tag into a newline so the words do not run together.
for br in short_desc.find_all('br'):
    br.replace_with('\n' + br.text)
short_desc = short_desc.get_text().replace('\n', ' ')
temp = today.find(class_='temp').get_text()
print(period)
print(short_desc)
print(temp)
Today
Mostly Cloudy then Sunny and Breezy
High: 67 °F

img = today.find('img')
desc = img["title"]  # a Tag's attributes can be read like dictionary keys
print(desc)
Today: Cloudy through mid morning, then gradual clearing, with a high near 67. Breezy, with a west southwest wind 8 to 13 mph increasing to 17 to 22 mph in the afternoon. Winds could gust as high as 28 mph.
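A Tag stores all of its attributes in a plain dict exposed as .attrs, so you can inspect what is available before picking one out. A sketch with a trimmed-down version of the forecast img tag (the title text is shortened here by me) inlined so it runs offline:

```python
from bs4 import BeautifulSoup

# A trimmed-down forecast-icon img tag, inlined for an offline demo.
snippet = ('<img alt="Today: Cloudy, then clearing." class="forecast-icon" '
           'src="DualImage.php?i=bkn&amp;j=wind_few" '
           'title="Today: Cloudy, then clearing."/>')

demo_img = BeautifulSoup(snippet, 'html.parser').find('img')

print(sorted(demo_img.attrs))   # ['alt', 'class', 'src', 'title']
print(demo_img["title"])        # Today: Cloudy, then clearing.
print(demo_img.get("missing"))  # None: .get avoids a KeyError for absent attributes
```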
Extracting all of the information from the target web page:
- Select all items with class period-name, inside items with class tombstone-container, inside seven_day.
- Use a list comprehension to call the get_text method on each BeautifulSoup object.

period_tags = seven_day.select(".tombstone-container .period-name")
periods = [pt.get_text() for pt in period_tags]
periods
['Today', 'Tonight', 'Tuesday', 'TuesdayNight', 'Wednesday', 'WednesdayNight', 'Thursday', 'ThursdayNight', 'Friday']
short_descs = seven_day.select(".tombstone-container .short-desc")
for sd in short_descs:
    for br in sd.find_all('br'):
        br.replace_with('\n' + br.text)
short_descs = [sd.get_text().replace('\n', ' ') for sd in short_descs]
short_descs
['Mostly Cloudy then Sunny and Breezy', 'Increasing Clouds', 'Gradual Clearing and Breezy', 'Increasing Clouds and Windy', 'Partly Sunny then Sunny and Breezy', 'Mostly Clear and Breezy then Partly Cloudy', 'Sunny', 'Mostly Clear', 'Sunny']
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
temps
['High: 67 °F', 'Low: 56 °F', 'High: 66 °F', 'Low: 55 °F', 'High: 68 °F', 'Low: 56 °F', 'High: 73 °F', 'Low: 56 °F', 'High: 71 °F']
descs = [d["title"] for d in seven_day.select(".tombstone-container img")]
descs
['Today: Cloudy through mid morning, then gradual clearing, with a high near 67. Breezy, with a west southwest wind 8 to 13 mph increasing to 17 to 22 mph in the afternoon. Winds could gust as high as 28 mph. ', 'Tonight: Increasing clouds, with a low around 56. West southwest wind 13 to 18 mph, with gusts as high as 24 mph. ', 'Tuesday: Cloudy through mid morning, then gradual clearing, with a high near 66. Breezy, with a west wind 13 to 18 mph increasing to 23 to 28 mph in the afternoon. Winds could gust as high as 36 mph. ', 'Tuesday Night: Increasing clouds, with a low around 55. Windy, with a west wind 23 to 30 mph, with gusts as high as 38 mph. ', 'Wednesday: Mostly sunny, with a high near 68. Breezy, with a west wind 20 to 24 mph, with gusts as high as 31 mph. ', 'Wednesday Night: Partly cloudy, with a low around 56. Breezy. ', 'Thursday: Sunny, with a high near 73.', 'Thursday Night: Mostly clear, with a low around 56.', 'Friday: Sunny, with a high near 71.']
Combining the Data in a Pandas DataFrame

import pandas as pd

weather = pd.DataFrame({
    "period": periods,
    "short_desc": short_descs,
    "temp": temps,
    "desc": descs
})
weather
Trying a Little Analysis
Extract the numbers from the 'temp' column:

temp_nums = weather["temp"].str.extract(r'(\d+)', expand=False)
temp_nums
0    67
1    56
2    66
3    55
4    68
5    56
6    73
7    56
8    71
Name: temp, dtype: object

weather["temp_num"] = temp_nums.astype('int')
weather
weather["temp_num"].mean()
63.111111111111114
is_night = weather["temp"].str.contains("Low")
weather["is_night"] = is_night
is_night
0    False
1     True
2    False
3     True
4    False
5     True
6    False
7     True
8    False
Name: temp, dtype: bool
weather[is_night]
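The is_night flag also makes it easy to compare daytime highs with nighttime lows. A sketch that rebuilds a minimal DataFrame from the temperatures shown above (hard-coded here so the example runs offline; live values will differ):

```python
import pandas as pd

# Temperatures as scraped above, hard-coded so the example runs offline.
temps = ['High: 67 °F', 'Low: 56 °F', 'High: 66 °F', 'Low: 55 °F',
         'High: 68 °F', 'Low: 56 °F', 'High: 73 °F', 'Low: 56 °F',
         'High: 71 °F']

weather_demo = pd.DataFrame({"temp": temps})
weather_demo["temp_num"] = (weather_demo["temp"]
                            .str.extract(r'(\d+)', expand=False)
                            .astype('int'))
weather_demo["is_night"] = weather_demo["temp"].str.contains("Low")

# Average the two groups separately: False (day) -> 69.0, True (night) -> 55.75
print(weather_demo.groupby("is_night")["temp_num"].mean())
```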
That wraps up this basic scraping tutorial with Python and Beautiful Soup.
Happy scraping!