Data Scraping and Analysis using Python

Umashankar Dyavalingaiah
5 min read · Feb 21, 2021

Competitive Pricing using Data Scraping.

Web Scraping is the process of collecting and parsing raw data from the Web, and the Python community has come up with some pretty powerful web scraping tools.

This technique is highly useful in competitive pricing. To decide what our product’s optimal price should be, we can compare it with similar products that are already on the market, whose prices can vary a lot. So, in this blog, I’m going to show how we can scrape data about a particular product.

The Internet hosts perhaps the greatest source of information — and misinformation — on the planet. Many disciplines, such as data science, business intelligence, and investigative reporting, can benefit enormously from collecting and analysing data from websites.

Scrape and Parse Text From Websites

Collecting data from websites using an automated process is known as web scraping. Some websites explicitly forbid users from scraping their data with automated tools like the ones you’ll create in this tutorial. Websites do this for two possible reasons:

1. The site has a good reason to protect its data. For instance, Google Maps doesn’t let you request too many results too quickly.

2. Making many repeated requests to a website’s server may use up bandwidth, slowing down the website for other users and potentially overloading the server such that the website stops responding entirely.
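Many sites publish their scraping rules in a robots.txt file, so it is worth checking that file before sending automated requests. Here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content and the example.com URLs are made up for illustration, and are inlined so the snippet runs without any network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption, not taken from any real site).
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout/
Crawl-delay: 5
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# Ask whether a generic crawler ("*") may fetch each URL.
allowed_search = parser.can_fetch("*", "https://example.com/search?q=phone")
allowed_checkout = parser.can_fetch("*", "https://example.com/checkout/cart")

# Seconds the site asks crawlers to wait between requests (None if not stated).
delay = parser.crawl_delay("*")

print(allowed_search, allowed_checkout, delay)
```

For a live site you would typically call `parser.set_url(...)` and `parser.read()` instead of parsing an inline string, and sleep for the crawl delay between requests.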

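The most common technique for data scraping in Python is to use the BeautifulSoup library together with requests. The requests library downloads the HTML of the page, and BeautifulSoup parses it so that we can search it. The page content is still unstructured data at this point, so we’ll have to convert it into a structured format.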

Let’s import all the necessary libraries:

import requests

from fake_useragent import UserAgent

from bs4 import BeautifulSoup

import pandas as pd

from urllib.parse import urljoin

import bs4

The extracted data is unstructured, so we store it in lists (initially empty), one list per field, to give it a structured form.

lstproductname = []  # List to store the name of the product

lstprice = []  # List to store the price of the product

lstrating = []  # List to store the ratings of the product

lstspecification = []  # List to store the specifications of the product

lstprocessor=[]

lstram=[]

lststorage=[]

lstos=[]

lstdisplay=[]

lstcamera=[]

lstbattery=[]

lstwarranty=[]

lstsimstype=[]

#lsthybridsim=[]

base_url = "https://www.flipkart.com"  # base URL, used to read the SIM type

Create a user agent. For details, refer to https://pypi.org/project/fake-useragent/

user_agent = UserAgent()
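The user agent only matters once it is sent along with a request as an HTTP header. Here is a small sketch of building that header dictionary. The helper name `build_headers` and the sample user-agent string are my own placeholders. In the real script, the string would presumably come from `user_agent.random`.

```python
# Placeholder user-agent string (an assumption for illustration only).
SAMPLE_UA = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"


def build_headers(user_agent_string):
    """Return request headers that identify the client with the given user agent."""
    return {"User-Agent": user_agent_string}


headers = build_headers(SAMPLE_UA)
print(headers)
```

The resulting dictionary can then be passed to a call such as `requests.get(url, headers=headers)`.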

Provide a product name as input. The extracted data will relate to that product.

product_name = input("Please enter a Product Name- ")

To extract data from multiple pages of the product listing we are going to use a for loop. The range will specify the number of pages to be extracted.
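As a rough sketch of that loop, the snippet below builds one URL per results page. The `/search` path and the `q` and `page` query parameters are assumptions about how the listing pages are addressed, so check them against the URL you see in your browser. `build_page_urls` is a hypothetical helper name.

```python
from urllib.parse import urlencode, urljoin

BASE_URL = "https://www.flipkart.com"


def build_page_urls(product_name, num_pages):
    """Build the listing URL for pages 1..num_pages of a product search."""
    urls = []
    for page in range(1, num_pages + 1):
        # urlencode escapes spaces and special characters in the product name.
        query = urlencode({"q": product_name, "page": page})
        urls.append(urljoin(BASE_URL, "/search?" + query))
    return urls


page_urls = build_page_urls("mobile phone", 3)
for url in page_urls:
    print(url)
```

Each URL would then be fetched and parsed inside the loop, ideally with a short pause between requests.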

Please find the source code in the link

To extract data from the soup object, you need to specify the HTML tags that hold the data you want to retrieve. You can find them with Inspect Element on the webpage.
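To make this concrete, here is a minimal sketch that runs on a small inline HTML sample rather than a live page. The class names (`product-card`, `product-name` and so on) and the product values are invented for illustration. The real page will use different ones, which you have to look up with Inspect Element.

```python
from bs4 import BeautifulSoup

# Invented HTML sample; real listing pages use different tags and class names.
SAMPLE_HTML = """
<div class="product-card">
  <div class="product-name">Phone A (Black, 64 GB)</div>
  <div class="product-price">₹14,999</div>
  <div class="product-rating">4.3</div>
</div>
<div class="product-card">
  <div class="product-name">Phone B (Blue, 128 GB)</div>
  <div class="product-price">₹21,499</div>
</div>
"""


def text_or_none(parent, css_class):
    """Return the stripped text of the first matching div, or None if it is missing."""
    tag = parent.find("div", class_=css_class)
    return tag.get_text(strip=True) if tag else None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

names, prices, ratings = [], [], []
for card in soup.find_all("div", class_="product-card"):
    names.append(text_or_none(card, "product-name"))
    prices.append(text_or_none(card, "product-price"))
    ratings.append(text_or_none(card, "product-rating"))

print(names, prices, ratings)
```

Appending `None` for a missing field keeps all the lists the same length, which matters when they are later combined into a table.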

The linked source code stores the data in a structured format, in a DataFrame named dfProd. When you print dfProd, you’ll get:

Cleaning up Data

Remove the currency symbol from the price and clean out any unnecessary special characters as well.
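One way to do this in pandas is a regular expression that keeps only digits and the decimal point. This is a sketch on made-up sample values, and the column name `price` is an assumption.

```python
import pandas as pd

# Made-up sample prices in the format shown on the listing page.
df = pd.DataFrame({"price": ["₹14,999", "₹21,499", "₹9,999"]})

# Drop everything except digits and the decimal point, then convert to integers.
df["price"] = df["price"].str.replace(r"[^\d.]", "", regex=True).astype(int)

print(df)
```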

Split the product name on the comma (,): the first part of the resulting array becomes the product name, and the second part becomes the color of the product.
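A possible sketch of this split is shown below. It assumes the cleaned product name looks like "name, color", which may not match the real data exactly. The sample values and column names are invented.

```python
import pandas as pd

# Invented sample names in an assumed "name, color" format.
df = pd.DataFrame({"product_name": ["Phone A, Black", "Phone B, Sky Blue"]})

# Split on the first comma only; expand=True returns one column per part.
parts = df["product_name"].str.split(",", n=1, expand=True)
df["product_name"] = parts[0].str.strip()
df["color"] = parts[1].str.strip()

print(df)
```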

Check the data types of the price and ratings columns.
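Scraped values arrive as strings, so this check usually goes together with a conversion. Here is a sketch with invented values and assumed column names, where `errors="coerce"` turns anything unparseable into NaN instead of raising an error.

```python
import pandas as pd

# Invented sample data; scraped values start out as strings.
df = pd.DataFrame({
    "price": ["14999", "21499", "n/a"],
    "rating": ["4.3", "4.1", "3.9"],
})

# Convert both columns to numbers; values that cannot be parsed become NaN.
df["price"] = pd.to_numeric(df["price"], errors="coerce")
df["rating"] = pd.to_numeric(df["rating"], errors="coerce")

print(df.dtypes)
```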

Create a company name column by splitting the product name column.
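Assuming the company is the first word of the product name, which is my guess at the convention and will not hold for every brand, the new column could be derived like this. The sample names are invented.

```python
import pandas as pd

# Invented sample names; the first word is assumed to be the company.
df = pd.DataFrame({"product_name": ["BrandA Phone X", "BrandB Note 9 Pro"]})

# Split on whitespace and keep the first token of each name.
df["company"] = df["product_name"].str.split().str[0]

print(df)
```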

Fundamental Analysis of the mobile phone data.

Plotting Boxplot
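A boxplot summarises a column through its quartiles, and by the usual convention flags points lying more than 1.5 times the interquartile range beyond them as outliers. The sketch below computes those numbers directly on invented prices, which can help when reading the plot.

```python
import pandas as pd

# Invented prices with one obviously expensive product.
prices = pd.Series([10000, 20000, 30000, 40000, 100000])

q1 = prices.quantile(0.25)  # lower edge of the box
q3 = prices.quantile(0.75)  # upper edge of the box
iqr = q3 - q1               # height of the box

# Points beyond these bounds are drawn as individual outlier markers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = prices[(prices < lower) | (prices > upper)]

print(q1, q3, lower, upper, outliers.tolist())
```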

Barplot for Sim Type vs Price
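A bar plot of this kind shows one aggregated value per category. Assuming the aggregate is the mean price per SIM type, the numbers behind the bars can be computed as sketched below. The data and the column names are invented.

```python
import pandas as pd

# Invented sample data with assumed column names.
df = pd.DataFrame({
    "sim_type": ["Dual Sim", "Dual Sim", "Single Sim"],
    "price": [10000, 20000, 12000],
})

# One mean price per SIM type; each value corresponds to one bar.
mean_price = df.groupby("sim_type")["price"].mean()

print(mean_price)
```

With matplotlib installed, `mean_price.plot.bar()` would draw these values as bars.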

We will choose a budget range of Rs 14000 to Rs 30000.
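Filtering to a budget range like this is a one-liner in pandas. Here is a sketch on invented prices, assuming a numeric `price` column. Note that `between` includes both end points by default.

```python
import pandas as pd

# Invented sample prices in rupees.
df = pd.DataFrame({"price": [9999, 14000, 21499, 30000, 45000]})

# Keep only the products priced from Rs 14000 to Rs 30000, inclusive.
in_budget = df[df["price"].between(14000, 30000)]

print(in_budget)
```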

We can conclude from this that products with a lower price have higher ratings, to some extent.

We can also observe that the color has almost no effect on the ratings of the product.

We can also observe how the prices vary with the company of the product.

Most of the companies’ products have ratings above 4.

In conclusion, within the Rs 25000 budget, the preferable option is the Mi branded phone, based on its ratings and on the number of customers who have bought the product, which suggests that the product will be more reliable.

Umashankar Dyavalingaiah
Data Scientist, Data Engineer ,Data Analyst, BI Developer