Introduction
If you have ever used Google's webmaster tools, you may already be aware of crawlers. As the name suggests, a crawler visits all the web pages of a website and indexes them. A crawler takes a URL as input, indexes that page, and fetches all the links on it; it then moves on to the fetched links, indexes them, and finds more links on each page, repeating this process until every URL has been indexed.
Search engines rely on crawlers to index the web. In this article, we will learn how to create a crawler, store page content in a database, and build a simple search engine on top of it. We will use the urllib, BeautifulSoup, and sqlite3 packages.
Implementation
First, we must create a database schema to store the indexed web pages and their contents. The script below creates the tables in SQLite, but you can use any database you like. Create a database called SearchEngine in SQLite, store it on your D drive, and execute the script given below:
CREATE TABLE Url (
    Id INTEGER PRIMARY KEY AUTOINCREMENT,
    Link TEXT
);

CREATE TABLE Word (
    Id INTEGER PRIMARY KEY AUTOINCREMENT,
    Word TEXT
);

CREATE TABLE UrlWordLocation (
    UrlId INTEGER,
    WordId INTEGER,
    Location INTEGER,
    FOREIGN KEY(UrlId) REFERENCES Url(Id),
    FOREIGN KEY(WordId) REFERENCES Word(Id)
);

CREATE TABLE Link (
    Id INTEGER PRIMARY KEY AUTOINCREMENT,
    FromId INTEGER,
    ToId INTEGER,
    FOREIGN KEY(FromId) REFERENCES Url(Id),
    FOREIGN KEY(ToId) REFERENCES Url(Id)
);

CREATE TABLE LinkWords (
    WordId INTEGER,
    LinkId INTEGER,
    FOREIGN KEY(WordId) REFERENCES Word(Id),
    FOREIGN KEY(LinkId) REFERENCES Url(Id)
);
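If you don't have the SQLite shell handy, one way to create the database file and apply the schema is through Python's sqlite3 module. This is only a minimal sketch; the file name schema.sql is an assumption for wherever you saved the statements above:

import sqlite3

# creates D:\SearchEngine.db if it does not exist and applies the schema;
# schema.sql is assumed to contain the CREATE TABLE statements above
conn = sqlite3.connect('D:\\SearchEngine.db')
with open('schema.sql') as f:
    conn.executescript(f.read())
conn.close()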
You can see the database diagram below:

As you can see above, Url and Word are the main tables; the other tables carry foreign keys to them. UrlWordLocation keeps track of each word's location within a page. The Link table records which URL is followed by which URLs, and LinkWords keeps track of the words appearing in link text. Now let's create a DataAccess.py file that performs the CRUD operations on the database:
import sqlite3

class dataAccess:
    # parameter database is the database file path
    def __init__(self, database):
        self.database = database

    # creates a connection to the database and returns the connection object
    def create_connection(self):
        try:
            conn = sqlite3.connect(self.database)
            return conn
        except sqlite3.Error as e:
            print(e)
            return None

    # select data
    def selectCommand(self, query):
        con = self.create_connection()
        cur = con.cursor()
        try:
            cur.execute(query)
            rows = cur.fetchall()
        except Exception as e:
            print(e)
            return None
        finally:
            cur.close()
            con.close()
        return rows

    # insert/update data
    def executeCommand(self, query, row):
        con = self.create_connection()
        cur = con.cursor()
        try:
            cur.execute(query, row)
            con.commit()
            return cur.lastrowid
        except Exception as e:
            print(e)
            return None
        finally:
            cur.close()
            con.close()
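Before wiring this class into the crawler, you may want a quick sanity check that the connection and queries work. The snippet below is only an illustrative sketch using the tables created earlier; it is not one of the article's files:

from DataAccess import dataAccess

# insert one row into Url and read it back
dataAcc = dataAccess('D:\\SearchEngine.db')
newId = dataAcc.executeCommand('INSERT INTO Url(Link) VALUES(?)', ('https://example.com',))
rows = dataAcc.selectCommand('SELECT Id, Link FROM Url')
print(newId, rows)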
The file above defines two methods: selectCommand fetches data from the database, and executeCommand inserts or updates data. The constructor receives the database file path used to open the connection, and the sqlite3 package performs the operations on the SQLite database. Now let's create Crawler.py:
from urllib import request
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import re
from DataAccess import dataAccess

class crawler:
    ignorewords = set(['a', 'an', 'the', 'in', 'is', 'am', 'are', 'was', 'were',
                       'will', 'shall', 'it', 'this', 'that', 'of', 'to', 'and'])

    # parameter database is the database file path
    def __init__(self, database):
        self.database = database

    # reads the page and returns a soup object of the page content
    def getPage(self, url):
        try:
            httpOpen = request.urlopen(url, timeout=10)
            content = httpOpen.read()
            soup = BeautifulSoup(content)
            return soup
        except Exception as e:
            return None

    # returns all urls on the page along with their link text
    def getPageURLs(self, url, soup):
        links = soup('a')
        urls = []
        for link in links:
            if('href' in dict(link.attrs)):
                url = urljoin(url, link['href'])
                if url not in urls:
                    urls.append([url.split('#')[0], link.text])
        return urls

    # returns an array of (word, word location) tuples
    def getWords(self, text):
        text = text.lower()
        words = re.compile(r'[^A-Z^a-z]+').split(text)
        filteredWords = []
        for i in range(len(words)):
            word = words[i]
            # removing ignored words and blank entries
            if word not in self.ignorewords and word != '':
                filteredWords.append((word, i))  # setting word location
        return filteredWords

    # returns urlId if available in the database, otherwise zero
    def getUrlId(self, url):
        dataAcc = dataAccess(self.database)
        data = dataAcc.selectCommand('SELECT Id FROM Url WHERE Link like \'' + url + '\'')
        return data[0][0] if len(data) > 0 else 0

    # insert a new url
    def insertUrl(self, url):
        dataAcc = dataAccess(self.database)
        lastrowid = dataAcc.executeCommand('INSERT INTO Url(Link) VALUES(?)', (url,))
        return lastrowid

    # get wordId from the database, inserting the word if it is new
    def getWordId(self, word):
        dataAcc = dataAccess(self.database)
        wordData = dataAcc.selectCommand('SELECT Id, Word FROM Word WHERE Word like \'' + word + '\'')
        wordId = 0
        if(len(wordData) > 0):
            # if the word is already in the database, reuse its id
            wordId = wordData[0][0]
        else:
            wordId = dataAcc.executeCommand('INSERT INTO Word(Word) VALUES(?)', (word,))
        return wordId

    # insert a word location for a url
    def insertWordLocation(self, UrlId, word, word_location):
        wordId = self.getWordId(word)
        dataAcc = dataAccess(self.database)
        # mapping of URL, word and location
        dataAcc.executeCommand('INSERT INTO UrlWordLocation(UrlId, WordId, Location) VALUES(?,?,?)',
                               (UrlId, wordId, word_location))

    # insert link text words
    def insertLinkTextWord(self, urlId, word):
        wordId = self.getWordId(word)
        dataAcc = dataAccess(self.database)
        # insert link words
        dataAcc.executeCommand('INSERT INTO LinkWords(WordId, LinkId) VALUES(?,?)', (wordId, urlId))

    # insert FromUrl and ToUrl
    def insertFromToUrl(self, fromId, toId):
        dataAcc = dataAccess(self.database)
        lastrowid = dataAcc.executeCommand('INSERT INTO Link(FromId, ToId) VALUES(?,?)', (fromId, toId,))
        return lastrowid

    # index all pages in a website
    def crawl(self, url, domain, urlText=None, lastUrlId=None):
        urlId = self.getUrlId(url)
        # if the url isn't indexed yet and belongs to the same domain
        if urlId == 0 and domain in url:
            soup = self.getPage(url)
            if(soup != None):
                print('indexing ', url)
                urls = self.getPageURLs(url, soup)
                words = self.getWords(soup.get_text())
                urlId = self.insertUrl(url)
                if urlText != None:
                    linkWords = self.getWords(urlText)
                    for word in linkWords:
                        if word not in self.ignorewords and word != '':
                            # mapping link text words with this url
                            self.insertLinkTextWord(urlId, word[0])
                # inserting word locations
                for word in words:
                    self.insertWordLocation(urlId, word[0], word[1])
                # recursive call to index all new urls found on this page
                for url in urls:
                    self.crawl(url[0], domain, url[1], urlId)
        # inserting the from url, to url mapping
        if lastUrlId != None and urlId != 0:
            self.insertFromToUrl(lastUrlId, urlId)
We have 10 methods in the above class:
- getPage: To download the contents of a web page and return it as a soup object
- getPageURLs: To fetch all linked URLs available on a web page along with their link text
- getWords: To get all words on a web page together with their locations
- getUrlId: To fetch a URL's id if it is already indexed, otherwise return 0
- insertUrl: To index a URL by inserting it into the Url table
- getWordId: To fetch a word's id if the word is present in the database, otherwise insert it and return the generated id
- insertWordLocation: To map a word and its location to a URL
- insertLinkTextWord: To insert the words of a link's text and map them to the linked URL
- insertFromToUrl: To maintain the mapping of which URL is linked from which URL
- crawl: To index all pages of a website
Let's look at the crawl method. It takes any URL of a website, downloads the HTML content, and extracts the words, the links, and the words in the link text. It then inserts all of this data into the database and calls itself recursively to index the links it just fetched. In every iteration it checks whether a link is already indexed and, if so, skips it. We also maintain the FromUrl-to-ToUrl mapping, which lets us track how many links point to a given link; we will use this data in the page ranking algorithm we will analyze in the next article.
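To give an idea of how that FromUrl/ToUrl data could be used once crawling is done, here is a small sketch (not one of the article's files) that counts incoming links per URL using the tables defined above:

from DataAccess import dataAccess

# count how many pages link to each indexed URL
dataAcc = dataAccess('D:\\SearchEngine.db')
rows = dataAcc.selectCommand(
    'SELECT Url.Link, COUNT(Link.FromId) FROM Link '
    'INNER JOIN Url ON Url.Id = Link.ToId '
    'GROUP BY Url.Link ORDER BY COUNT(Link.FromId) DESC')
print(rows)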
Now we will create another file called SearchEngine.py with two methods, search and crawlWebsite. crawlWebsite simply calls the crawl method of the crawler class, while search takes the search text as input and generates a query to fetch matching results from the database:
from Crawler import crawler
from DataAccess import dataAccess

class searchEngine:
    def __init__(self, database):
        self.database = database

    # to search for text in indexed web pages
    def search(self, searchText):
        # splitting the text to get words
        words = searchText.split(' ')
        n = len(words)
        searchQuery = 'select url.Link'
        # [,u0.Location, u1.Location ..]
        selectQuery = [',u{}.Location'.format(i) for i in range(n)]
        # [(UrlWordLocation u0, u0), (UrlWordLocation u1, u1) ..]
        wordLocationJoinQuery = [('UrlWordLocation u{}'.format(i), 'u{}'.format(i)) for i in range(n)]
        # [(Word w0, w0), (Word w1, w1) ..]
        wordQuery = [('Word w{}'.format(i), 'w{}'.format(i)) for i in range(n)]
        # generating the select part
        for i in range(n):
            searchQuery += selectQuery[i]
        searchQuery += ' from '
        # generating the inner joins between the UrlWordLocation aliases
        for i in range(len(words)):
            if i == 0:
                searchQuery += wordLocationJoinQuery[i][0]
            else:
                searchQuery += ' inner join '
                searchQuery += wordLocationJoinQuery[i][0]
                searchQuery += ' on ' + wordLocationJoinQuery[i][1] + '.urlid = ' + wordLocationJoinQuery[i-1][1] + '.urlid'
        # generating the inner joins with the Word table
        for i in range(len(words)):
            columnMatch = wordLocationJoinQuery[i][1] + '.WordId = ' + wordQuery[i][1] + '.id'
            searchQuery += ' inner join ' + wordQuery[i][0]
            searchQuery += ' on ' + columnMatch
        # generating the inner join with the Url table
        searchQuery += ' inner join url on u0.UrlId = url.id where '
        # generating the where part
        for i in range(len(words)):
            searchQuery += wordQuery[i][1] + '.word like \'' + words[i] + '\''
            if i != len(words) - 1:
                searchQuery += ' and '
        dataAcc = dataAccess(self.database)
        # executing the query
        data = dataAcc.selectCommand(searchQuery)
        print(data)

    # crawl a website
    def crawlWebsite(self, domain):
        url = 'https://www.' + domain
        crawlerObj = crawler(self.database)
        crawlerObj.crawl(url, domain)
The search method above will generate the following query for the search text “green tea”:
select url.Link,u0.Location,u1.Location from UrlWordLocation u0 inner join UrlWordLocation u1 on u1.urlid = u0.urlid inner join Word w0 on u0.WordId = w0.id inner join Word w1 on u1.WordId = w1.id inner join url on u0.UrlId = url.id where w0.word like 'green' and w1.word like 'tea'
Now let's crawl the website "practiceselenium.com", which is a dummy website about tea:
searchEng = searchEngine('D:\\SearchEngine.db')
searchEng.crawlWebsite('practiceselenium.com')
It will crawl the site and save the results in the DB. Now let's search for "green tea":
searchEng = searchEngine('D:\\SearchEngine.db')
searchEng.search('green tea')
Output:
[('https://www.practiceselenium.com/menu.html', 1143, 1121),
('https://www.practiceselenium.com/menu.html', 1143, 1132),
('https://www.practiceselenium.com/menu.html', 1143, 1142),
('https://www.practiceselenium.com/menu.html', 1143, 1144),...]
You can see that the same URL is returned with different combinations of locations; "green tea" is mentioned in multiple places on https://www.practiceselenium.com/menu.html .
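If you want a rough measure of how strongly each URL matches, one option is to count how many location combinations it produces. Here is a small sketch, which assumes search is changed to end with "return data" instead of only printing it:

from collections import Counter
from SearchEngine import searchEngine

# assumes search() returns the result rows instead of printing them
searchEng = searchEngine('D:\\SearchEngine.db')
data = searchEng.search('green tea')
# the first element of each result tuple is the URL
counts = Counter(row[0] for row in data)
for link, hits in counts.most_common():
    print(link, hits)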
Conclusion
We have seen how to create a simple crawler that crawls a website and indexes all of its pages, and we have implemented a simple search engine that can search for keywords. In the next article we will build advanced search on top of this data using a page ranking algorithm.
Comments:
How do we execute the above Python programs?
Hi Kiran, you can follow the steps below:
hello sir!
We are implementing a web search engine with the PageRank algorithm. While executing the code we are getting errors in the search engine code.
We also want to know whether we can run these files in a Jupyter notebook.
Please reply as soon as possible.
hello sir!
Even I am facing the same problem, please help me with this.
Hey, can you post the error that is displayed?
C:\Users\sweety\Desktop\crawlerusingpythoncode>cd SearchEngine
C:\Users\sweety\Desktop\crawlerusingpythoncode\SearchEngine>ls
Crawler.py DataAccess.py SearchEngine.db __pycache__
DA.ipynb SE.ipynb SearchEngine.py c.ipynb
C:\Users\sweety\Desktop\crawlerusingpythoncode\SearchEngine>python DataAccess.py
C:\Users\sweety\Desktop\crawlerusingpythoncode\SearchEngine>python Crawler.py
C:\Users\sweety\Desktop\crawlerusingpythoncode\SearchEngine>python SearchEngine.py
unable to open database file
Traceback (most recent call last):
File "SearchEngine.py", line 52, in <module>
searchEng.search('green tea')
File "SearchEngine.py", line 42, in search
data = dataAcc.selectCommand(searchQuery)
File "C:\Users\sweety\Desktop\crawlerusingpythoncode\SearchEngine\DataAccess.py", line 27, in selectCommand
cur = con.cursor()
AttributeError: 'NoneType' object has no attribute 'cursor'
Hi Mounika,
In SearchEngine.py there is a line at the end: "searchEngine('D:\\SearchEngine.db')". Did you give the correct path to this db file? Also check whether sqlite3 is installed or not.
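A quick way to verify the path on your machine (just an illustrative check, assuming the db file is at D:\SearchEngine.db):

import sqlite3

# connect() raises "unable to open database file" when the folder in the path
# does not exist; a "no such table: Url" error means the file was created
# without running the schema script from the article
conn = sqlite3.connect('D:\\SearchEngine.db')
print(conn.execute('SELECT COUNT(*) FROM Url').fetchall())
conn.close()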
Thanks for your fast response, sir. We resolved the errors and it ran. Your article made our day, it is so informative and helpful. Once again, thank you, sir! :)
hello sir!
I am implementing a web search engine with the PageRank algorithm, using your article as a reference. Sir, can you please explain in detail what data is stored in each of the database tables?
Hi Rajy,
As shown in the DB schema, we have 5 tables: Url stores each indexed URL, Word stores every unique word found, UrlWordLocation maps a word to a URL together with its position on the page, Link records which URL links to which other URL, and LinkWords maps the words in a link's text to the linked URL.
Hope I answered your question.
Thanks