Web Scraping with the Help of BeautifulSoup

Harshavardhan Reddy Peddireddy
4 min read · Nov 19, 2022


Web scraping on Flipkart

People who work with data need to know about web scraping. Not only data analysts, engineers, and scientists, but even people from non-technical backgrounds can scrape data by following the steps below.

First, we need to know what web scraping is. Web scraping simply means extracting the data a site shows on its front end with the help of some tools. There are many tools for scraping, but here we focus on BeautifulSoup, which helps you pull data out of HTML. In Python, BeautifulSoup lives in the bs4 module.
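Before touching a real site, it helps to see BeautifulSoup on its own. The snippet below is a minimal, self-contained sketch with a made-up HTML fragment (the class names and href are invented for illustration): it parses raw HTML into a tree you can search by tag and class.

```python
from bs4 import BeautifulSoup

# A tiny, hypothetical HTML fragment standing in for a real page.
html = """
<div class="product">
  <a class="title" href="/item/1">Phone A</a>
  <span class="price">Rs. 9,999</span>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
link = soup.find('a', attrs={'class': 'title'})      # first <a> with class "title"
price = soup.find('span', attrs={'class': 'price'})  # first <span> with class "price"

print(link['href'])  # /item/1
print(price.text)    # Rs. 9,999
```

This is the same find pattern used against Flipkart's HTML in the rest of the post, just on a fragment small enough to read at a glance.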

I am using the Flipkart website (www.flipkart.com) to scrape mobile phone data.

Steps to follow:

  1. Import the required libraries.
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm

2. This header tells the site that we are a browser, not a script. Some websites will respond without it, but sending a realistic User-Agent helps the scraping go smoothly.

headers = {"User-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"}

3. Taking the Flipkart mobiles URL.

URL=r'https://www.flipkart.com/search?q=mobiles&sid=tyy%2C4io&as=on&as-show=on&otracker=AS_QueryStore_OrganicAutoSuggest_1_6_na_na_na&otracker1=AS_QueryStore_OrganicAutoSuggest_1_6_na_na_na&as-pos=1&as-type=HISTORY&suggestionId=mobiles%7CMobiles&requestId=815e1c0f-d1b0-487a-82ed-fa2047b15b1b&page=1'

4. Here we send the request using the requests module in Python. status_code gives the status of the request; a 200 status tells us the site responded successfully and we can parse the page.

page = requests.get(URL, headers=headers)
print(page.status_code)
htmlCode = page.text
soup = BeautifulSoup(htmlCode, 'html.parser')

A common doubt here is how to see the HTML code. Right-click anywhere on the web page and choose Inspect; the inspect panel opens. See the image below: this is the inspect panel for the Flipkart mobiles page.

Inspect panel

In the inspect panel, hovering over a piece of the HTML code highlights the corresponding element on the front end in blue. As we move down through the code, the highlight moves from product to product.

5. Here we need to get each mobile product's URL. We could scrape the listing page directly, but it does not contain the full information for each phone; for example, the product reviews (useful for sentiment analysis in NLP) are not visible from the main panel. So we collect each product's URL. The code below shows how to scrape each mobile product URL.

for x in soup.find_all('div', attrs={'class': '_2kHMtA'}):
    k = x.find('a', attrs={'class': '_1fQZEK'})['href']
    print('https://www.flipkart.com' + k)
The highlighter shows the class and the URL
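The hrefs scraped above are relative paths, which is why they get glued onto the domain. A slightly safer way to do that join is the standard library's urllib.parse.urljoin, which handles the slashes for you. The href below is a made-up example standing in for a real one from the listing page.

```python
from urllib.parse import urljoin

base = 'https://www.flipkart.com'
href = '/some-mobile/p/itm123?pid=MOB123'  # hypothetical href from the listing page

# urljoin resolves the relative path against the domain correctly,
# even if the base ends with a slash or the href lacks a leading one.
full_url = urljoin(base, href)
print(full_url)  # https://www.flipkart.com/some-mobile/p/itm123?pid=MOB123
```

Plain string concatenation works here too because Flipkart's hrefs start with '/', but urljoin is the more robust habit.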

This code gives each product URL for one page, but we also need the data from the other pages. For that we wrap the code in a for loop over the page numbers. With this approach we can scrape all the URLs.

6. Looping the code to get the data from the other pages. The code below collects the URLs into the list url_list. k is the variable that stores each href; that href will not work on its own, so we concatenate it with the main domain name "www.flipkart.com". The code below stores all the URLs in the list; in further steps we loop over these URLs to scrape each product.

url_list = []
headers = {"User-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"}
no_page = int(input())
for i in range(1, no_page + 1):  # pages 1..no_page
    URL = r'https://www.flipkart.com/search?q=mobiles&sid=tyy%2C4io&as=on&as-show=on&otracker=AS_QueryStore_OrganicAutoSuggest_1_6_na_na_na&otracker1=AS_QueryStore_OrganicAutoSuggest_1_6_na_na_na&as-pos=1&as-type=HISTORY&suggestionId=mobiles%7CMobiles&requestId=815e1c0f-d1b0-487a-82ed-fa2047b15b1b&page={}'.format(i)
    page = requests.get(URL, headers=headers)
    htmlCode = page.text
    soup = BeautifulSoup(htmlCode, 'html.parser')
    for x in soup.find_all('div', attrs={'class': '_2kHMtA'}):
        k = x.find('a', attrs={'class': '_1fQZEK'})['href']
        url_list.append('https://www.flipkart.com' + k)
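The pagination idea above can also be sketched without any network calls, which makes it easy to test. The function below builds the search-result URLs for pages 1 through no_page (the URL is a trimmed, hypothetical version of the full search URL, with the page number left as a slot). In the real loop, a short time.sleep(1) between requests keeps the scraping polite.

```python
# Trimmed, hypothetical version of the Flipkart search URL,
# with the page number left as a format slot.
BASE = 'https://www.flipkart.com/search?q=mobiles&page={}'

def page_urls(no_page):
    """Search-result URLs for pages 1..no_page (note the + 1: range stops early)."""
    return [BASE.format(i) for i in range(1, no_page + 1)]

urls = page_urls(3)
print(urls[0])    # https://www.flipkart.com/search?q=mobiles&page=1
print(len(urls))  # 3

# In the real loop, pause briefly between page requests:
# time.sleep(1)
```

Building the URL list up front also means that if the scrape fails halfway, you know exactly which pages were left to fetch.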

7. Collecting the data from each product URL.

for j in tqdm(url_list):
    pag = requests.get(j, headers=headers)
    htmlCod = pag.text
    s = BeautifulSoup(htmlCod, 'lxml')
    pr = s.find('div', attrs={'class': '_30jeq3 _16Jk6d'})
    if pr:
        price = pr.text
    else:
        price = None
    pd = s.find('span', attrs={'class': 'B_NuCI'})
    if pd:
        product = pd.text
    else:
        product = None
    rt = s.find('div', attrs={'class': '_3LWZlK'})
    if rt:
        rating = rt.text
    else:
        rating = None
    pe = s.find('div', attrs={'class': '_3Ay6Sb _31Dcoz'})
    if pe:
        percentage = pe.text
    else:
        percentage = None

The code above scrapes each product's data. The if conditions handle missing values: find returns None when a selector matches nothing, so we store None instead of crashing on .text. This gives us the columns price, product, rating, and percentage.
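Once those four fields are scraped for each product, collecting them as dicts makes it easy to write a CSV (or load into pandas later). The rows below are made-up sample values standing in for scraped ones; a None becomes an empty cell.

```python
import csv
import io

# Hypothetical scraped rows; in the real loop you would append one
# dict per product URL as price/product/rating/percentage are filled in.
rows = [
    {'product': 'Phone A', 'price': 'Rs. 9,999', 'rating': '4.3', 'percentage': '10% off'},
    {'product': 'Phone B', 'price': 'Rs. 14,499', 'rating': None, 'percentage': None},
]

buffer = io.StringIO()  # use open('mobiles.csv', 'w', newline='') for a real file
writer = csv.DictWriter(buffer, fieldnames=['product', 'price', 'rating', 'percentage'])
writer.writeheader()
writer.writerows(rows)  # None values are written as empty cells

print(buffer.getvalue())
```

Keeping the output as one row per product also lines the data up nicely for the database inserts mentioned below.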

URL

The picture above shows how to scrape the product URL; in the same way we can scrape the rest of the data.

That covers web scraping. In future posts I will try to show how to insert this data into a MySQL database, and how to insert data into DynamoDB with the help of Lambda functions in AWS.

If you like my blogs, please follow me. I will post more blogs in the coming days on data analysis, data engineering, machine learning, AWS, and Big Data.
