Data Scraping and Analysis using Python
Competitive Pricing using Data Scraping.
Web Scraping is the process of collecting and parsing raw data from the Web, and the Python community has come up with some pretty powerful web scraping tools.
This technique is highly useful in competitive pricing. To decide what our product's optimal price should be, we can compare similar products that are already on the market. These prices can vary a lot. So, in this blog, I'm going to show how we can scrape data for a particular product.
The Internet hosts perhaps the greatest source of information — and misinformation — on the planet. Many disciplines, such as data science, business intelligence, and investigative reporting, can benefit enormously from collecting and analysing data from websites.
Scrape and Parse Text From Websites
Collecting data from websites using an automated process is known as web scraping. Some websites explicitly forbid users from scraping their data with automated tools like the ones you’ll create in this tutorial. Websites do this for two possible reasons:
1. The site has a good reason to protect its data. For instance, Google Maps doesn’t let you request too many results too quickly.
2. Making many repeated requests to a website’s server may use up bandwidth, slowing down the website for other users and potentially overloading the server such that the website stops responding entirely.
The most common technique for data scraping in Python is to use the BeautifulSoup library. We fetch the HTML of the page, which arrives as unstructured data, and then convert it into a structured format.
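To illustrate the basic idea before we build the full scraper, here is a minimal sketch of fetching a page and parsing it with BeautifulSoup (the URL is just a placeholder):
import requests
from bs4 import BeautifulSoup
# Fetch the raw, unstructured HTML of a page (placeholder URL)
html = requests.get("https://example.com").text
# Parse it into a soup object we can search by tag, class, or id
soup = BeautifulSoup(html, "html.parser")
# Pull out one structured piece of information, e.g. the page title
print(soup.title.get_text())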
Let's import all the necessary libraries:
import requests
from fake_useragent import UserAgent
from bs4 import BeautifulSoup
import pandas as pd
from urllib.parse import urljoin
import bs4
The extracted data is unstructured, so we first create empty lists to hold each field; these will later be assembled into a structured form.
lstproductname = []  # List to store the name of the product
lstprice = []  # List to store the price of the product
lstrating = []  # List to store the ratings of the product
lstspecification = []  # List to store the specifications of the product
lstprocessor = []
lstram = []
lststorage = []
lstos = []
lstdisplay = []
lstcamera = []
lstbattery = []
lstwarranty = []
lstsimstype = []  # List to store the sim type
#lsthybridsim = []
base_url = "https://www.flipkart.com"
Create a user agent so that requests appear to come from a real browser. Refer to https://pypi.org/project/fake-useragent/ for details.
user_agent = UserAgent()
Provide a product name as input. The extracted data will relate to that product.
product_name = input("Please enter a Product Name- ")
To extract data from multiple pages of the product listing, we are going to use a for loop. The range specifies the number of pages to be extracted.
Please find the source code in the link.
To extract data from the soup object, you need to specify the HTML tags from which you want to retrieve the data. You can find them by using Inspect Element on the webpage.
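Since the full code is in the linked source, here is only a minimal sketch of what that loop can look like. Everything about Flipkart's markup here is an assumption: the class names (_2kHMtA, _4rR01T, _30jeq3, _3LWZlK), the page parameter, and the column names of dfProd are illustrative and should be checked against the live page with Inspect Element.
for page in range(1, 6):  # scrape the first 5 result pages
    search_url = urljoin(base_url, "/search")
    response = requests.get(
        search_url,
        params={"q": product_name, "page": page},
        headers={"User-Agent": user_agent.random},
    )
    soup = BeautifulSoup(response.text, "html.parser")
    # Each product card is assumed to sit in a div with this class
    for card in soup.find_all("div", class_="_2kHMtA"):
        name = card.find("div", class_="_4rR01T")
        price = card.find("div", class_="_30jeq3")
        rating = card.find("div", class_="_3LWZlK")
        lstproductname.append(name.get_text() if name else None)
        lstprice.append(price.get_text() if price else None)
        lstrating.append(rating.get_text() if rating else None)
# The other lists (RAM, storage, sim type, ...) are filled the same way from
# the specification section of each card.
# Finally, combine the lists into a structured DataFrame
dfProd = pd.DataFrame({
    "ProductName": lstproductname,
    "Price": lstprice,
    "Ratings": lstrating,
})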
The above code will store the data in a structured format; printing dfProd shows the resulting table.
Cleaning up Data
Remove the currency symbols and any unnecessary special characters from the Price column.
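One possible way to do this with pandas string methods (assuming the column names from the sketch above):
# Strip the rupee symbol and thousands separators, e.g. "₹14,999" -> "14999"
dfProd["Price"] = (
    dfProd["Price"]
    .str.replace("₹", "", regex=False)
    .str.replace(",", "", regex=False)
    .str.strip()
)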
Split the product name on the comma (,): assign the first part of the split string to the product name and the second part to the color of the product.
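For example, assuming the scraped names have the form "model, color":
# Split on the first comma: part 0 is the model, part 1 is (roughly) the color
parts = dfProd["ProductName"].str.split(",", n=1, expand=True)
dfProd["ProductName"] = parts[0].str.strip()
dfProd["Color"] = parts[1].str.strip()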
Check the data types for price and ratings, and convert them to numeric types so they can be used in calculations and plots.
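A quick check and conversion, since the scraped values come in as strings:
print(dfProd.dtypes)
# Convert Price and Ratings to numbers; malformed values become NaN
dfProd["Price"] = pd.to_numeric(dfProd["Price"], errors="coerce")
dfProd["Ratings"] = pd.to_numeric(dfProd["Ratings"], errors="coerce")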
Create a product company (brand) column by splitting the product name column.
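Assuming the brand is the first word of the product name:
# e.g. "Redmi Note 9" -> "Redmi"
dfProd["Company"] = dfProd["ProductName"].str.split().str[0]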
Fundamental Analysis of the Mobile Phone Data
Plotting a Boxplot
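For instance, a boxplot of price by company can be drawn with seaborn (using the column names assumed above):
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(12, 6))
sns.boxplot(x="Company", y="Price", data=dfProd)
plt.xticks(rotation=45)
plt.title("Price distribution by company")
plt.show()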
Barplot for Sim Type vs Price
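Assuming the sim type was scraped into a column named "SimType", a sketch of the barplot:
plt.figure(figsize=(8, 5))
sns.barplot(x="SimType", y="Price", data=dfProd)
plt.title("Sim Type vs Price")
plt.show()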
We will choose a budget range between Rs 14,000 and Rs 30,000.
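A simple filter for that range:
# Keep only products priced between Rs 14,000 and Rs 30,000
dfBudget = dfProd[(dfProd["Price"] >= 14000) & (dfProd["Price"] <= 30000)]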
We can conclude from this that products with lower prices tend to have somewhat higher ratings.
We can also observe that the color has almost no effect on the ratings of the product.
We can also observe how the prices vary across the different product companies.
Most of the companies' products have ratings above 4.
In conclusion, within a Rs 25,000 budget, the Mi branded phone would be the preferable choice, based on its ratings and the number of customers who have bought it, which suggests the product will be more reliable.