Finding Similar Items/Products using Python/Machine Learning

Python program to find similar items, Finding similar items in Machine Learning, How to find similar items/products from dataset


Category: Machine Learning Tags: Python, Python 3


Introduction

    In our previous articles we learned how to find similar users. In Euclidean Distance Score and Pearson Correlation Score we took users and their ratings of singers and calculated similarity scores from those ratings, where a higher score means higher similarity. Before going further, we recommend reading Euclidean Distance Score.

For companies like Amazon, which have millions of users and products, it is not practical to process that much data and find similar users in real time. A better approach is to find similar items/products instead of similar users. Unlike users, items don't change their properties very often, so the data can be pre-processed: we can compute similar items in advance, store the results for repeated use, and schedule the job to run during low-traffic hours.
So instead of finding similar users, we can simply look up similar items and recommend them while a user is looking at an item.

To do that, we just need to swap users with items. Below we can see how users and items switch places in the existing data set (the same data set we used in our previous articles):

FROM

online_music = {
'Donald':{'Taylor Swift':3.5,'Rihanna':3.0,'Justin Bieber':4.0},
'Chandler':{'Taylor Swift':3,'Rihanna':3.5,'Justin Bieber':4.5},
'Ruby':{'Rihanna':5.0,'Justin Bieber':2.0,'Demi Lovato':3.5, 'MJ':3.0},
'Zoya':{'Taylor Swift': 3.0, 'Rihanna':2.0, 'Justin Bieber':4.0,'Demi Lovato':3.0},
'Sam': {'Rihanna':3.0, 'Justin Bieber':3.5, 'MJ':4.0},
'Robert': {'Rihanna':1.0,'Justin Bieber':2.5,'Demi Lovato':2.5}
}

TO

online_music_reverse = {
'Taylor Swift': {'Donald': 3.5, 'Chandler': 3.0, 'Zoya': 3.0},
'Rihanna': {'Donald': 3.0, 'Chandler': 3.5, 'Ruby': 5.0, 'Zoya': 2.0, 'Sam': 3.0, 'Robert': 1.0},
'Justin Bieber': {'Donald': 4.0, 'Chandler': 4.5, 'Ruby': 2.0, 'Zoya': 4.0, 'Sam': 3.5, 'Robert': 2.5},
'Demi Lovato': {'Ruby': 3.5, 'Zoya': 3.0, 'Robert': 2.5},
'MJ': {'Ruby': 3.0, 'Sam': 4.0}
}

After swapping users with items, we can easily see how many users rated an item (here an item is a singer) and calculate the average rating for an item, so this transformation also gives us useful insights.
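As a small illustration of those insights, the sketch below counts the raters and computes the average rating for each singer from the item-centric data. The helper name item_stats is hypothetical and not part of the original code.

```python
# Item-centric data copied from the article
online_music_reverse = {
    'Taylor Swift': {'Donald': 3.5, 'Chandler': 3.0, 'Zoya': 3.0},
    'Rihanna': {'Donald': 3.0, 'Chandler': 3.5, 'Ruby': 5.0, 'Zoya': 2.0, 'Sam': 3.0, 'Robert': 1.0},
    'Justin Bieber': {'Donald': 4.0, 'Chandler': 4.5, 'Ruby': 2.0, 'Zoya': 4.0, 'Sam': 3.5, 'Robert': 2.5},
    'Demi Lovato': {'Ruby': 3.5, 'Zoya': 3.0, 'Robert': 2.5},
    'MJ': {'Ruby': 3.0, 'Sam': 4.0},
}

def item_stats(item_data):
    """Return {item: (number_of_raters, average_rating)}. Hypothetical helper."""
    stats = {}
    for item, ratings in item_data.items():
        count = len(ratings)
        average = sum(ratings.values()) / count
        stats[item] = (count, average)
    return stats

for singer, (count, average) in item_stats(online_music_reverse).items():
    print(singer, count, round(average, 2))
```

With the user-centric layout the same numbers would require scanning every user's ratings for each singer, which is why the swapped layout is convenient here.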

Implementation

    To implement this, we have to write a function that converts the online_music dataset into online_music_reverse (from user-centric to item-centric). The function below does the job:

def transform_dataset(data):
    result = {}
    for item in data.keys(): #item is a person like 'Donald'
        for subitem in data[item].keys(): #subitem is actually an item like 'Taylor Swift'
            result.setdefault(subitem, {})
            result[subitem][item] = data[item][subitem] #swap between person and item

    return result
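
A quick way to sanity-check the transformation is to apply it twice: swapping users and items two times should give back the original data. The sketch below repeats the function in a slightly more compact form and runs it on a two-user subset of the data; the variable names sample and by_item are assumptions made for illustration.

```python
def transform_dataset(data):
    """Swap the outer and inner keys of a nested ratings dictionary."""
    result = {}
    for person, ratings in data.items():
        for item, rating in ratings.items():
            result.setdefault(item, {})[person] = rating
    return result

# Two-user subset of the article's data set
sample = {
    'Ruby': {'Rihanna': 5.0, 'MJ': 3.0},
    'Sam': {'Rihanna': 3.0, 'MJ': 4.0},
}

by_item = transform_dataset(sample)
print(by_item)
# Transforming twice restores the user-centric layout
print(transform_dataset(by_item) == sample)
```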

Now we have to write a method that finds the most similar items for each item in the dataset and stores them; see the code below:

def calculateSimilarItems(music_data, n=2):   #n is how many top similar items to store
    result = {}
    music_data_reverse = transform_dataset(music_data) #transforms dataset to item centric
    for item in music_data_reverse.keys():
        #finding similarity score of all other items with respect to current item
        #euclidean_distnce is not defined here; presumably it is the scoring function from the Euclidean Distance Score article
        similarities = [(euclidean_distnce(music_data_reverse, item, other), other) for other in music_data_reverse.keys() if item != other]
        similarities.sort(reverse=True) #top scores first
        result[item] = similarities[0:n] #taking only top n items and storing in result
    return result

Now we will call the above method and print the results:

print(calculateSimilarItems(online_music))

Output:

{
    'Demi Lovato': [(1.0, 'Taylor Swift'), (0.6666666666666666, 'MJ')],
    'Taylor Swift': [(1.0, 'Demi Lovato'), (0.4494897427831781, 'Rihanna')],
    'MJ': [(0.6666666666666666, 'Demi Lovato'), (0.4721359549995794, 'Justin Bieber')],
    'Justin Bieber': [(0.4721359549995794, 'MJ'), (0.3567891723253309, 'Demi Lovato')],
    'Rihanna': [(0.4494897427831781, 'Taylor Swift'), (0.3090169943749474, 'MJ')]
}

Above we can see that we got the 2 most similar singers for every singer, which we can show to users as similar singers they might like.
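
The scoring function called in the code above comes from the earlier Euclidean Distance Score article and is not shown here. For readers who want to run the example end to end, here is a minimal self-contained sketch. It assumes the score is 1 / (1 + d), where d is the Euclidean distance over the users who rated both items, and 0 when no user rated both. The names euclidean_similarity and calculate_similar_items, and the formula itself, are assumptions, although the formula does agree with the scores in the output above.

```python
from math import sqrt

online_music = {
    'Donald': {'Taylor Swift': 3.5, 'Rihanna': 3.0, 'Justin Bieber': 4.0},
    'Chandler': {'Taylor Swift': 3, 'Rihanna': 3.5, 'Justin Bieber': 4.5},
    'Ruby': {'Rihanna': 5.0, 'Justin Bieber': 2.0, 'Demi Lovato': 3.5, 'MJ': 3.0},
    'Zoya': {'Taylor Swift': 3.0, 'Rihanna': 2.0, 'Justin Bieber': 4.0, 'Demi Lovato': 3.0},
    'Sam': {'Rihanna': 3.0, 'Justin Bieber': 3.5, 'MJ': 4.0},
    'Robert': {'Rihanna': 1.0, 'Justin Bieber': 2.5, 'Demi Lovato': 2.5},
}

def transform_dataset(data):
    """Swap users and items in a nested ratings dictionary."""
    result = {}
    for person, ratings in data.items():
        for item, rating in ratings.items():
            result.setdefault(item, {})[person] = rating
    return result

def euclidean_similarity(data, first, second):
    """Assumed scoring: 1 / (1 + Euclidean distance) over shared raters."""
    shared = [key for key in data[first] if key in data[second]]
    if not shared:
        return 0.0  # no rater in common, so no evidence of similarity
    squared = sum((data[first][key] - data[second][key]) ** 2 for key in shared)
    return 1 / (1 + sqrt(squared))

def calculate_similar_items(music_data, n=2):
    """Return the n highest-scoring other items for every item."""
    item_data = transform_dataset(music_data)
    result = {}
    for item in item_data:
        scores = [(euclidean_similarity(item_data, item, other), other)
                  for other in item_data if other != item]
        scores.sort(reverse=True)  # highest score first
        result[item] = scores[:n]
    return result

similar = calculate_similar_items(online_music)
print(similar['Taylor Swift'])
```

Note that a pair such as Taylor Swift and Demi Lovato scores 1.0 on the strength of a single shared rater (Zoya), so with sparse data a high score does not necessarily mean strong evidence.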

Conclusion

    Finding similar items is more feasible than finding similar users because similar items can be pre-calculated: the properties of items change rarely or not at all, so we only have to re-run the method once in a while. Similar users, by contrast, have to be calculated at run time because users are dynamic in nature: new users create accounts and existing users delete theirs. That makes finding similar items the better choice for real-time applications.
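
The pre-calculation idea can be sketched as two steps: an offline job that writes the similar-items table to disk, and a cheap lookup at request time. The file name similar_items.json, the function names, and the choice of JSON are all assumptions for illustration; the sample table reuses two entries from the output shown earlier.

```python
import json
import os
import tempfile

# Two entries taken from the output shown earlier
similar_items = {
    'Demi Lovato': [(1.0, 'Taylor Swift'), (0.6666666666666666, 'MJ')],
    'MJ': [(0.6666666666666666, 'Demi Lovato'), (0.4721359549995794, 'Justin Bieber')],
}

def save_similar_items(table, path):
    """Offline step: persist the pre-calculated table (run on a schedule)."""
    with open(path, 'w') as handle:
        json.dump(table, handle)

def load_recommendations(path, item):
    """Request-time step: read the stored table and return singer names only."""
    with open(path) as handle:
        table = json.load(handle)
    # JSON stores each (score, name) tuple as a two-element list
    return [name for score, name in table.get(item, [])]

path = os.path.join(tempfile.gettempdir(), 'similar_items.json')
save_similar_items(similar_items, path)
print(load_recommendations(path, 'MJ'))
```

In a real system the file would probably be replaced by a database or cache, but the split between a scheduled write and a fast read stays the same.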


Last modified on 11 October 2018
Nikhil Joshi

Ceo & Founder at Dotnetlovers