Choose a table on a web page (use your discretion) and scrape the contents to a csv file.
The table selected must have at least 3 numerical columns and at least 15 rows.
You must produce the following results for at least 2 of those columns:
● count - number of values in the column
● mean - average of all values in the column
● std - standard deviation of the column
● min - minimum value in the column
● 25% - 25th percentile of the values in the column
● 50% - 50th percentile (median) of the values in the column
● 75% - 75th percentile of the values in the column
● max - maximum value in the column
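All eight of the required statistics map directly onto the output of pandas' describe() method. As a minimal sketch, assuming a small made-up numeric column:

```python
import pandas as pd

# Hypothetical numeric column, just to illustrate what describe() returns
values = pd.Series([10, 20, 30, 40, 50])

stats = values.describe()
# The result is a Series indexed by statistic name:
# count, mean, std, min, 25%, 50%, 75%, max
print(stats)
```

Here count is 5, mean is 30, min is 10, the median (50%) is 30, and max is 50, so one describe() call per column covers the whole list above.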
Deliverables:
- A file containing your python code.
- The csv file
Code:
import requests
import pandas as pd
import csv
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/List_of_United_States_cities_by_area'
response = requests.get(url)
print(response.status_code)

soup = BeautifulSoup(response.text, "html.parser")
# The page has several tables; take the first one with class "wikitable"
table = soup.findAll('table', {"class": "wikitable"})[0]
tr = table.findAll(['tr'])[0:151]

csvFile = open("cities0fUSA.csv", 'wt', newline='', encoding='utf-8')
writer = csv.writer(csvFile)
try:
    for cell in tr:
        th = cell.find_all('th')
        th_data = [col.text.strip('\n') for col in th]
        td = cell.find_all('td')
        row = [i.text.replace('\n', '') for i in td]
        writer.writerow(th_data + row)
finally:
    csvFile.close()

df = pd.read_csv("C:\\Users\\Preeti\\Downloads\\cities0fUSA.csv")
# Strip thousands-separator commas so the columns can be cast to numbers
df["Population (2010)"] = df["Population (2010)"].str.replace(",", "")
df["Land area (km2)"] = df["Land area (km2)"].str.replace(",", "")
df["Population (2010)"] = df["Population (2010)"].astype(int)
# Land area values contain decimals, so cast to float rather than int
df["Land area (km2)"] = df["Land area (km2)"].astype(float)
print(df["Population (2010)"].describe())
print(df["Land area (km2)"].describe())
Explanation:
import requests
import pandas as pd
import csv
from bs4 import BeautifulSoup
We are using the Python libraries BeautifulSoup, Requests, and pandas, so make sure they are installed on your system.
To install Beautiful Soup: pip install beautifulsoup4
To install the others: pip install requests pandas
url = 'https://en.wikipedia.org/wiki/List_of_United_States_cities_by_area'
response = requests.get(url)
print(response.status_code)
Here we fetch the Wikipedia page containing the list of US cities by area; you can visit the page to inspect the web table yourself. Since not every site permits scraping, we should always check the response status code first: 200 means the request succeeded and we can go ahead and parse the page.
soup = BeautifulSoup(response.text,"html.parser")
table = soup.findAll('table',{"class":"wikitable"})[0]
Here we create a BeautifulSoup object and use the findAll() method to extract the relevant element, which in our case is the <table> tag. Since a Wikipedia page can contain many tables, we also specify the class attribute (which you can find by right-clicking the table and selecting Inspect).
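To see how the class filter picks out the right table, here is a minimal sketch using a tiny hypothetical HTML snippet (the markup and city name below are made up, not taken from the actual Wikipedia page):

```python
from bs4 import BeautifulSoup

# Tiny hypothetical page with two tables; only one has class="wikitable"
html = """
<table class="infobox"><tr><td>ignore me</td></tr></table>
<table class="wikitable">
  <tr><th>City</th><th>Area</th></tr>
  <tr><td>Yakutat</td><td>19,923</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# The attribute dict filters out the infobox table
table = soup.findAll('table', {"class": "wikitable"})[0]
print(table.find('th').text)
```

The attribute dictionary makes findAll() skip the infobox table entirely, so index [0] is guaranteed to be the data table we want.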
tr = table.findAll(['tr'])[0:151]
Here we specify the rows we want to extract from that table: the slice [0:151] keeps the header row plus the first 150 data rows.
csvFile = open("cities0fUSA.csv", 'wt', newline='', encoding='utf-8')
writer = csv.writer(csvFile)
try:
    for cell in tr:
        th = cell.find_all('th')
        th_data = [col.text.strip('\n') for col in th]
        td = cell.find_all('td')
        row = [i.text.replace('\n', '') for i in td]
        writer.writerow(th_data + row)
finally:
    csvFile.close()
We need to store the data in .csv format, so we get the headers from the table using the <th> tag and the column data using the <td> tag, then write the fetched data to a file called cities0fUSA.csv. The file is opened with mode 'wt', and newline='' is used to avoid blank lines between rows.
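The writing step can be tried in isolation with some made-up rows standing in for the scraped <th>/<td> text (the city names and file name here are hypothetical):

```python
import csv

# Hypothetical header and rows standing in for the scraped table text
header = ["City", "Population (2010)", "Land area (km2)"]
rows = [["Sitka", "8,881", "7,434"],
        ["Juneau", "31,275", "6,998"]]

with open("demo.csv", 'wt', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(header)
    for row in rows:
        writer.writerow(row)

# Reading the file back shows one line per row; csv.writer quotes any
# field that itself contains a comma, such as "8,881"
with open("demo.csv", encoding='utf-8') as f:
    print(f.read())
```

Note that the embedded commas in the numbers are preserved by quoting, which is exactly why they have to be stripped later before the columns can be treated as numeric.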
The above code creates a csv file named cities0fUSA.csv in the directory the script is run from.
df=pd.read_csv("C:\\Users\\Preeti\\Downloads\\cities0fUSA.csv")
The code above reads the csv file and converts it to a DataFrame using the pandas library. You need to specify the path where your csv file is stored; in my case it was in Downloads, so I have used that path.
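As a small self-contained sketch of what read_csv does with this kind of data, here an in-memory buffer stands in for the file (the rows are hypothetical; with the real file you would pass the path instead of a StringIO object):

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for cities0fUSA.csv
csv_text = """City,Population (2010),Land area (km2)
Sitka,"8,881","7,434.3"
Juneau,"31,275","6,998.0"
"""

demo_df = pd.read_csv(io.StringIO(csv_text))
print(demo_df.head())            # first 5 rows (here, both rows)
print(demo_df.columns.tolist())  # column names taken from the header row
```

The quoted values keep their commas, so at this point the two numeric-looking columns are still strings.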
You can also use df.head() to see the top 5 rows.
df["Population (2010)"] = df["Population (2010)"].str.replace(",", "")
df["Land area (km2)"] = df["Land area (km2)"].str.replace(",", "")
Since we need to find the count, min, max, etc. of the values in these columns, the columns must be numeric and free of special characters, so we use str.replace() to remove all the thousands-separator commas.
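On a tiny made-up column, the comma removal looks like this:

```python
import pandas as pd

# Hypothetical column with thousands separators, as scraped from the table
pop = pd.Series(["8,881", "31,275", "1,000,000"])

cleaned = pop.str.replace(",", "")
print(cleaned.tolist())
```

The result is ['8881', '31275', '1000000'] — still strings, but now safe to cast to a numeric dtype.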
df["Population (2010)"] = df["Population (2010)"].astype(int)
df["Land area (km2)"] = df["Land area (km2)"].astype(float)
We then convert the values of the two columns to numeric types: Population (2010) becomes an integer column via astype(int), while Land area (km2), which contains decimal values, should be cast with astype(float) rather than astype(int), since int() fails on strings like "7434.3".
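A quick sketch of the two casts on hypothetical cleaned values:

```python
import pandas as pd

# After removing commas the values are still strings; astype does the cast
pop = pd.Series(["8881", "31275"])
area = pd.Series(["7434.3", "6998.0"])

pop_int = pop.astype(int)       # whole numbers cast cleanly to int
area_f = area.astype(float)     # values with decimals need float, not int
print(pop_int.sum(), area_f.max())
```

Trying area.astype(int) here would raise a ValueError, which is why the land-area column is cast to float.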
print(df["Population (2010)"].describe())
print(df["Land area (km2)"].describe())
Finally, we use the describe() function on those two columns to get the count, mean, std, min, max, and the 25th, 50th (median), and 75th percentiles.
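If individual numbers are needed for the deliverable rather than the whole printout, describe() returns a Series indexed by statistic name, so each result can be pulled out by label. A minimal sketch with hypothetical values:

```python
import pandas as pd

# Made-up column, just to show indexing into the describe() result
s = pd.Series([1, 2, 3, 4, 100])
stats = s.describe()

# Individual statistics are accessed by their label
print(stats["count"], stats["mean"], stats["max"])
print(stats["25%"], stats["50%"], stats["75%"])
```

With linear interpolation (pandas' default quantile method), the 25%, 50%, and 75% values for this series are 2, 3, and 4.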