Use Python (hint: Beautiful Soup, Selenium, or Scrapy) to web scrape (clean and parse) an HTML data site. You can also use other modules or libraries to clean and manipulate the data. Identify and explain any inconsistencies in the dataset.
Website link: https://www.iii.org/table-archive/23284
Get the wildfire tables from only 2019 and 2020 from the website.
Structure of my answer.
1. images of code
2. code
3. explanation
Explanation:
1. The website divides the data into two sections, current and archives. The current section holds the most recent table (2019, which is actually the previous year rather than the current one), and the archives hold the tables for 2010 to 2018.
Steps:
1. First, parse the HTML page using Beautiful Soup.
2. Get the div elements.
3. Split the div elements into current and archives.
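from bs4 import BeautifulSoup
import requests
import pandas as pd
import re

url = "https://www.iii.org/table-archive/23284"
wild_fire = requests.get(url)
# Parse the page; the "html5lib" parser needs the html5lib package installed
site = BeautifulSoup(wild_fire.content, "html5lib")
# The first "view-content" div holds the current table, the second the archives
divs = site.find_all('div', class_="view-content")
current, archives = divs[0], divs[1]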
Extracting the Current table data (2019)
def parse_table(table):
    """Build a DataFrame from a three-column HTML table."""
    cols = ['State', 'Number of Fires', 'Number of acres burned']
    data = []
    count = 1  # position of the current cell within its row (1 to 3)
    for i in table.find_all('td'):
        if count == 1:
            col1 = i.get_text()
        elif count == 2:
            col2 = i.get_text()
        elif count == 3:
            col3 = i.get_text()
            # The third cell closes the row: store it and restart the counter
            data.append([col1, col2, col3])
            count = 0
        count += 1
    df = pd.DataFrame(data, columns=cols)
    df.set_index('State', inplace=True)
    return df
df = parse_table(current.find_all("table")[-1])
df.head()
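The scraped cells are all strings, so values with thousands separators or non-numeric placeholders need cleaning before any arithmetic. Cells that fail to convert are also a quick way to spot inconsistencies in the table. A minimal sketch, assuming the counts look like '1,234'; clean_count is a hypothetical helper, and the sample values are made up for illustration, not taken from the site:

```python
def clean_count(text):
    """Convert a scraped cell such as '1,234' to an int.

    Returns None when the cell is not a plain number, so that
    inconsistent entries stand out instead of raising an error.
    """
    cleaned = text.replace(",", "").strip()
    return int(cleaned) if cleaned.isdecimal() else None


# Made-up sample cells (not real data from the website)
samples = ["1,234", " 56,789 ", "N/A", ""]
cleaned = [clean_count(s) for s in samples]
print(cleaned)
```

With the DataFrame above, something like df['Number of Fires'].map(clean_count) should apply the same cleaning to a whole column.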
Extracting the table data from the archives
Any archived table can be extracted by changing the year_toget variable.
archives = list(archives.children)
year_toget = 2018
for table in archives:
    # Text nodes return -1 from find() when 'span' is absent; tags return a Tag or None
    res = table.find('span')
    if res != -1 and res:
        years = re.findall(r"\d{4}", res.get_text())
        # Keep only headings that name a single year
        if len(years) == 1:
            year = int(years[0])
            if year_toget == year:
                df = parse_table(table.find_all('table')[-1])
df
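The year check inside the loop can also be pulled out into a small function, which makes it easier to test on its own. This is only a sketch: single_year is a hypothetical name, the heading strings are invented examples, and the \b word boundaries are an addition to the pattern used above (they stop a longer digit run from being read as a year):

```python
import re


def single_year(text):
    """Return the year if text contains exactly one 4-digit number, else None."""
    years = re.findall(r"\b\d{4}\b", text)
    return int(years[0]) if len(years) == 1 else None


# Invented heading strings, for illustration only
print(single_year("Wildfires By State, 2018"))
print(single_year("Wildfires, 2010-2018"))
```

A heading that names a range of years yields None, so only tables labelled with a single year are matched against year_toget.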
I have extracted the 2019 and 2018 tables because the 2020 table was not available. Once it is published, the 2020 table will be in the current section and the 2019 table will move to the archives, so the same extraction process applies (with year_toget set to 2019 for the archived table).