Implementing a Crawler Using Python - Search Engine Implementation Part 1

How to implement a simple crawler using Python, how to crawl a website using Python, and how to create a simple search engine


Category: Machine Learning Tags: Python, Python 3

Crawler using python code files

Introduction

    If you have ever used Google's webmaster tools, you might be aware of crawlers. As the name suggests, a crawler crawls all the web pages of a website and indexes them. A crawler takes a URL as input, indexes that page, and fetches all the links on it; it then moves on to the previously fetched links, indexes them, and finds more links on each page, and this process goes on until all URLs are indexed.
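That loop is worth seeing in miniature before we build the real thing. Below is a minimal sketch of it in Python (fetch_links is a hypothetical helper that returns the links found on a page; later in this article we implement that part with urllib and BeautifulSoup, and store the results in SQLite instead of a set):

from collections import deque

#Conceptual sketch of the crawl loop described above
def crawl_sketch(start_url, fetch_links):
    indexed = set()                      #URLs we have already indexed
    pending = deque([start_url])         #URLs fetched but not yet indexed
    while pending:
        url = pending.popleft()
        if url in indexed:
            continue
        indexed.add(url)                 #"index" the page
        for link in fetch_links(url):    #fetch all links on that page
            if link not in indexed:
                pending.append(link)     #move on to these later
    return indexed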

A crawler is what a search engine needs to index the web. In this article we will learn how to create a crawler, store page content in a database, and build a simple search engine on top of it. We will use packages such as urllib, BeautifulSoup and sqlite3.

Implementation

    First, we must create a database schema to store the indexed web pages and their contents. Below is a script that generates the database tables in SQLite (you can use any database you want). Create a database called SearchEngine in SQLite, store it in your D drive, and execute the script given below:

CREATE TABLE Url (Id INTEGER PRIMARY KEY AUTOINCREMENT, Link TEXT);
CREATE TABLE Word (Id INTEGER PRIMARY KEY AUTOINCREMENT, Word TEXT);
CREATE TABLE UrlWordLocation (UrlId INTEGER, WordId INTEGER, Location INTEGER, FOREIGN KEY(UrlId) REFERENCES URL(Id), FOREIGN KEY(WordId) REFERENCES Word(Id));
CREATE TABLE Link (Id INTEGER PRIMARY KEY AUTOINCREMENT, FromId INTEGER, ToId INTEGER, FOREIGN KEY(FromId) REFERENCES URL(Id), FOREIGN KEY(ToId) REFERENCES URL(Id));
CREATE TABLE LinkWords (WordId INTEGER, LinkId INTEGER, FOREIGN KEY(WordId) REFERENCES Word(Id), FOREIGN KEY(LinkId) REFERENCES URL(Id));
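If you prefer to create the database from Python rather than from the SQLite shell, a small one-off script like the one below does the same thing (a sketch assuming the D:\SearchEngine.db path used later in this article; sqlite3.connect creates the file if it does not exist):

import sqlite3

#One-off script to create SearchEngine.db with the schema above
schema = '''
CREATE TABLE Url (Id INTEGER PRIMARY KEY AUTOINCREMENT, Link TEXT);
CREATE TABLE Word (Id INTEGER PRIMARY KEY AUTOINCREMENT, Word TEXT);
CREATE TABLE UrlWordLocation (UrlId INTEGER, WordId INTEGER, Location INTEGER, FOREIGN KEY(UrlId) REFERENCES Url(Id), FOREIGN KEY(WordId) REFERENCES Word(Id));
CREATE TABLE Link (Id INTEGER PRIMARY KEY AUTOINCREMENT, FromId INTEGER, ToId INTEGER, FOREIGN KEY(FromId) REFERENCES Url(Id), FOREIGN KEY(ToId) REFERENCES Url(Id));
CREATE TABLE LinkWords (WordId INTEGER, LinkId INTEGER, FOREIGN KEY(WordId) REFERENCES Word(Id), FOREIGN KEY(LinkId) REFERENCES Url(Id));
'''

con = sqlite3.connect('D:\\SearchEngine.db')
con.executescript(schema)
con.commit()
con.close()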

You can see the database diagram below:

Fig 1: Search Engine Database Diagram

As you can see above, Url and Word are the main tables; the other tables carry foreign keys to them. UrlWordLocation keeps track of where each word occurs on a page, Link records which URL is followed from which URL, and LinkWords keeps track of the words that appear in link text. Now let’s create a DataAccess.py file which performs CRUD operations on the database:

import sqlite3

class dataAccess:

    #parameter database is database file path
    def __init__(self, database):
        self.database = database

    #creates connection to database and returns connection object
    def create_connection(self):
        try:
            conn = sqlite3.connect(self.database)
            return conn
        except sqlite3.Error as e:
            print(e)

        return None

    #select data
    def selectCommand(self, query):
        con = self.create_connection()
        cur = con.cursor()
        try:
            cur.execute(query)
            rows = cur.fetchall()
        except Exception as e:
            print(e)
            return None
        finally:
            cur.close()
            con.close()
        return rows

    #insert/update data
    def executeCommand(self, query, row):
        con = self.create_connection()
        cur = con.cursor()
        try:
            cur.execute(query, row)
            con.commit()
            return cur.lastrowid
        except Exception as e:
            print(e)
            return None
        finally:
            cur.close()
            con.close()

In the file above there are two methods: selectCommand fetches data from the database and executeCommand inserts/updates data. In the constructor we pass the database file path used to open the connection, and we use the sqlite3 package to perform the operations against the SQLite database.
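As a quick sanity check, you can exercise the class on its own (a minimal sketch, assuming SearchEngine.db already exists in the D drive with the schema above; https://example.com is just a placeholder value):

from DataAccess import dataAccess

dataAcc = dataAccess('D:\\SearchEngine.db')
#insert a test row and read it back
newId = dataAcc.executeCommand('INSERT INTO Url(Link) VALUES(?)', ('https://example.com',))
print(newId)
print(dataAcc.selectCommand('SELECT Id, Link FROM Url'))

Now we create the file Crawler.py: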

from urllib import request
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import re
from DataAccess import dataAccess

class crawler:
    ignorewords = set(['a', 'an', 'the', 'in','is', 'am', 'are', 'was', 'were', 'will', 'shall', 'it', 'this', 'that', 'of', 'to', 'and'])

    #parameter database is database file path
    def __init__(self, database):
        self.database = database

    #Reads the page and return soup object of page content
    def getPage(self, url):
        try:
            httpOpen = request.urlopen(url, timeout = 10)
            content =  httpOpen.read()
            soup = BeautifulSoup(content, 'html.parser')
            return soup
        except Exception as e:
            return None

    #Returns all page urls with url-text
    def getPageURLs(self, url, soup):
        links = soup('a')
        urls = []
        for link in links:
            if('href' in dict(link.attrs)):
                #resolve relative links against the page url and drop #fragments
                linkUrl = urljoin(url, link['href']).split('#')[0]
                if linkUrl not in [u[0] for u in urls]:
                    urls.append([linkUrl, link.text])

        return urls

    #Returns array of word and word location
    def getWords(self, text):
        text = text.lower()
        words = re.compile(r'[^A-Za-z]+').split(text)
        filteredWords = []
        for i in range(len(words)):
            word = words[i]
            #Removing ignored words and blank spaces
            if word not in self.ignorewords and word != '':
                filteredWords.append((word, i))    #setting word location
        return filteredWords

    #Returns urlId if available in database or zero
    def getUrlId(self, url):
        dataAcc = dataAccess(self.database)
        data = dataAcc.selectCommand('SELECT Id FROM Url WHERE Link like \''+url+'\'')
        return  data[0][0] if len(data) > 0 else 0 

    #Insert new url
    def insertUrl(self, url):
         dataAcc = dataAccess(self.database)
         lastrowid = dataAcc.executeCommand('INSERT INTO Url(Link) VALUES(?)', (url,))
         return lastrowid

    #Get wordId from database
    def getWordId(self, word):
        dataAcc = dataAccess(self.database)
        wordData = dataAcc.selectCommand('SELECT Id, Word FROM Word WHERE Word like \''+word+'\'')
        wordId = 0
        if(len(wordData) > 0):      #if the word is already in the database, use its id
            wordId = wordData[0][0]
        else:
            wordId = dataAcc.executeCommand('INSERT INTO Word(Word) VALUES(?)', (word,))
        return wordId

    #Insert new word
    def insertWordLocation(self, UrlId, word, word_location):
        wordId = self.getWordId(word)
        dataAcc = dataAccess(self.database)
        #Mapping of URL and Word and location
        dataAcc.executeCommand('INSERT INTO UrlWordLocation(UrlId, WordId, Location) VALUES(?,?,?)', (UrlId, wordId, word_location))

    #insert Link Words
    def insertLinkTextWord(self, urlId, word):
        wordId = self.getWordId(word)
        dataAcc = dataAccess(self.database)
        #Insert link words
        dataAcc.executeCommand('INSERT INTO LinkWords(WordId, LinkId) VALUES(?,?)', (wordId, urlId))

    #insert FromUrl and ToUrl
    def insertFromToUrl(self, fromId, toId):
        dataAcc = dataAccess(self.database)
        lastrowid = dataAcc.executeCommand('INSERT INTO Link(FromId, ToId) VALUES(?,?)', (fromId,toId,))
        return lastrowid

    #index all pages in a website
    def crawl(self, url, domain, urlText = None, lastUrlId = None):
        urlId = self.getUrlId(url)
        #if url isn't indexed and in same domain
        if urlId == 0 and domain in url:
            soup = self.getPage(url)
            if(soup != None):
                print('indexing ', url)
                urls = self.getPageURLs(url, soup)
                words = self.getWords(soup.get_text())
                urlId = self.insertUrl(url)

                if urlText != None:
                    linkWords = self.getWords(urlText)
                    #getWords has already filtered ignored words and blanks
                    for word in linkWords:
                        #mapping the link-text word to the linked url
                        self.insertLinkTextWord(urlId, word[0])

                #inserting words location
                for word in words:
                    self.insertWordLocation(urlId, word[0], word[1])

                #recursive call to index all new urls found on this page
                for url in urls:
                    self.crawl(url[0], domain, url[1], urlId)

        #inserting from url, to url
        if lastUrlId != None and urlId != 0:
            self.insertFromToUrl(lastUrlId, urlId)

We have 10 methods in the above class:

  1. getPage: To download the contents of a webpage
  2. getPageURLs: To fetch all linked URLs available on a webpage
  3. getWords: To get all words on a webpage
  4. getUrlId: To fetch a URL's id if it is indexed, else return 0
  5. insertUrl: To index a URL by inserting it into the Url table
  6. getWordId: To fetch a word's id if it is present in the database, else insert it and generate an id
  7. insertWordLocation: To map a word and its location to a URL
  8. insertLinkTextWord: To insert the words of a link's text and map them to the URL
  9. insertFromToUrl: To record which URL is linked from which URL
  10. crawl: To index all pages of a website

Let’s look at the crawl method: it takes any URL of a website, downloads the HTML content, extracts the words, the links, and the words in link text, inserts all of that data into the database, and then calls itself recursively to index the previously fetched links. In every iteration it checks whether a link is already indexed and, if so, skips it. We also maintain the FromUrl-to-ToUrl mapping so we can track how many links point to a given link; we will use this data in the page ranking algorithm we analyze in the next article.
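For example, crawl can be called directly like this (the same demo site we use below; the second argument restricts the crawler to links containing that domain):

from Crawler import crawler

crawlerObj = crawler('D:\\SearchEngine.db')
#index every page of the site, following only links within the domain
crawlerObj.crawl('https://www.practiceselenium.com', 'practiceselenium.com')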

Now we will create another file called SearchEngine.py with two methods, search and crawlWebsite. crawlWebsite simply calls the crawl method of the crawler class, while search takes the search text as input and generates a query to fetch results from the database:

from Crawler import crawler
from DataAccess import dataAccess

class searchEngine:
    def __init__(self, database):
        self.database = database

    #to search text at webpages
    def search(self, searchText):
        #splitting text to get words
        words = searchText.split(' ')
        n = len(words)
        searchQuery = 'select url.Link'
        #[,u0.Location, u1.Location ..]
        selectQuery = [',u{}.Location'.format(i) for i in range(n)]
        #[(UrlWordLocation u0, u0), (UrlWordLocation u1, u1) ..]
        wordLocationJoinQuery = [('UrlWordLocation u{}'.format(i), 'u{}'.format(i)) for i in range(n)]
        #[(Word w0, w0), (Word w1, w1) ..]
        wordQuery = [('Word w{}'.format(i), 'w{}'.format(i)) for i in range(n)]

        #generating select part
        for i in range(n):
            searchQuery += selectQuery[i]

        searchQuery += ' from '
        #generating inner join with wordLocationJoinQuery
        for i in range(len(words)):
            if i==0:
                searchQuery += wordLocationJoinQuery[i][0]
            else:
                searchQuery += ' inner join '
                searchQuery += wordLocationJoinQuery[i][0]
                searchQuery += ' on ' + wordLocationJoinQuery[i][1]+'.urlid = '+wordLocationJoinQuery[i-1][1]+'.urlid'
        #generating inner join with words
        for i in range(len(words)):
            columnMatch =  wordLocationJoinQuery[i][1]+'.WordId = '+wordQuery[i][1]+'.id'
            searchQuery += ' inner join '+wordQuery[i][0]
            searchQuery += ' on ' + columnMatch
        #generating inner join with url
        searchQuery += ' inner join url on u0.UrlId = url.id where '
        #generating where part
        for i in range(len(words)):
            searchQuery += wordQuery[i][1]+'.word like \'' + words[i] +'\''
            if i != len(words) - 1:
                searchQuery += ' and '

        dataAcc = dataAccess(self.database)
        #executing query
        data = dataAcc.selectCommand(searchQuery)
        print(data)
    
    #crawl a website
    def crawlWebsite(self, domain):
        url = 'https://www.' + domain
        crawlerObj = crawler(self.database)
        crawlerObj.crawl(url, domain)

For the search text “green tea”, the search method above will generate the following query:

select url.Link,u0.Location,u1.Location from UrlWordLocation u0 
inner join UrlWordLocation u1 on u1.urlid = u0.urlid 
inner join Word w0 on u0.WordId = w0.id 
inner join Word w1 on u1.WordId = w1.id 
inner join url on u0.UrlId = url.id 
where w0.word like 'green' and w1.word like 'tea'

Now let’s crawl the website “practiceselenium.com”, which is a dummy website about tea:

searchEng = searchEngine('D:\\SearchEngine.db')
searchEng.crawlWebsite('practiceselenium.com')

It will crawl the site and save the results in the DB. Now let’s search for "green tea":

searchEng = searchEngine('D:\\SearchEngine.db')
searchEng.search('green tea')

Output:

[('https://www.practiceselenium.com/menu.html', 1143, 1121),
 ('https://www.practiceselenium.com/menu.html', 1143, 1132),
 ('https://www.practiceselenium.com/menu.html', 1143, 1142),
 ('https://www.practiceselenium.com/menu.html', 1143, 1144), ...]

You can see it returns the same URL with different combinations of locations; “green tea” is mentioned at multiple places on https://www.practiceselenium.com/menu.html.
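Since each row is one combination of word locations, grouping the rows by URL gives a quick feel for how strongly each page matches. Here is a small post-processing sketch using the rows printed above (proper ranking comes in the next article):

from collections import Counter

#rows as printed by search('green tea'): (Link, location of 'green', location of 'tea')
data = [
    ('https://www.practiceselenium.com/menu.html', 1143, 1121),
    ('https://www.practiceselenium.com/menu.html', 1143, 1132),
    ('https://www.practiceselenium.com/menu.html', 1143, 1142),
    ('https://www.practiceselenium.com/menu.html', 1143, 1144),
]
counts = Counter(row[0] for row in data)
print(counts.most_common())   #pages with the most location combinations first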

Conclusion

    We have seen how to create a simple crawler that crawls a website and indexes all of its pages, and we have implemented a simple search engine that can search for keywords. In the next article we will build advanced search with a page ranking algorithm on top of this data.


Last modified on 11 October 2018
Nikhil Joshi



Comments:

Kiran

How do we execute the above Python programs?

13 September 2018
Nikhil Joshi

Hi Kiran, you can follow the steps below:

  1. Download and extract zip folder of code files
  2. Open command prompt
  3. Locate the extracted folder
  4. Run the command python FileNameYouWantToExecute (Python must be installed on your machine). For this article that means python SearchEngine.py, and you must have SQLite installed and the SearchEngine.db file placed in the D drive. You can change the code if you keep the db file somewhere else and execute the same file to see the results (a minimal entry point for SearchEngine.py is sketched below).
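For reference, a minimal entry point that makes python SearchEngine.py do something could look like this (a sketch to add at the bottom of SearchEngine.py; the path and site are the ones used in the article):

if __name__ == '__main__':
    searchEng = searchEngine('D:\\SearchEngine.db')
    searchEng.crawlWebsite('practiceselenium.com')   #index the site first
    searchEng.search('green tea')                    #then search it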
13 September 2018
