Creating Word Vectors Using Python, Finding a Common Set of Words for Clustering in Python

Finding keywords in a set of items for clustering, and creating word vectors from given data

Category: Machine Learning Tags: Python, Python 3

Create Word Vector and Clustering Code


    Suppose you have a list of company URLs and need to find out what each company does, that is, its area of business. To do this manually, you would go through each company’s website, read its About Us, Services, and Products pages, and work out the area of business from there.

This is a time-consuming job, so let’s automate it. An automated process pulls down all the text from the different pages and then looks for keywords that are common across multiple companies’ data. Suppose we analyze 100 URLs and 20 of those 100 companies are in healthcare: then "healthcare" is a keyword we can use to create a cluster of those 20 healthcare companies.

A word vector, as used here, is a kind of matrix that stores a score for every pairing in a set of items and words. In our case, with company URLs, we can create a matrix like:

[Figure: Keywords and keyword counts for companies]

Above we have a matrix of companies and keywords, where each score is simply the frequency of a keyword on the respective company’s website. In this article we will learn how to build the data for this kind of clustering. If we take these scores and compute a similarity score, we can see that Microsoft and Oracle are more similar to each other than Dotnetlovers and Microsoft are.
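
The comparison above can be sketched with a few made-up keyword counts. The numbers and the `similarity` helper below are illustrative assumptions, not the values from the figure, and the score used (an inverse of Euclidean distance) is only one of several possible similarity measures.

```python
import math

# Hypothetical keyword counts per company (not the real figures).
counts = {
    "Microsoft":    {"software": 5, "cloud": 4, "database": 2, "tutorial": 0},
    "Oracle":       {"software": 4, "cloud": 3, "database": 5, "tutorial": 0},
    "Dotnetlovers": {"software": 1, "cloud": 0, "database": 0, "tutorial": 6},
}

def similarity(a, b):
    """Return 1 / (1 + Euclidean distance) between two keyword-count rows."""
    distance = math.sqrt(sum((a[k] - b[k]) ** 2 for k in a))
    return 1.0 / (1.0 + distance)

ms_oracle = similarity(counts["Microsoft"], counts["Oracle"])
ms_dnl = similarity(counts["Microsoft"], counts["Dotnetlovers"])
# With these made-up counts the first pair scores higher than the second.
print(ms_oracle > ms_dnl)
```

Identical rows score 1.0, and the score falls towards 0 as the rows move apart.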


    I have sample data for around 250 companies in JSON format, which you can find in the attached files. Below is the structure of the JSON data for one company, Accenture:

      "company_name_id": "accenture",
      "company_name": "Accenture",
      "url": "",
      "year_founded": 1989,
      "city": "Chicago",
      "state": "IL",
      "country": "us",
      "zip_code": 60601,
      "full_time_employees": "10,001+",
      "company_type": "Public",
      "company_category": "",
      "revenue_source": "Not reported by company",
      "business_model": "Business to Business",
      "social_impact": "",
      "description": "Accenture delivers its services and solutions through 19 focused industry groups in five operating groups. This industry focus provides Accenture's professionals with a thorough understanding of industry evolution, business issues and applicable technologies, enabling Accenture to deliver solutions tailored to each client's industry.",
      "description_short": "Accenture provides management consulting, technology and outsourcing services.",
      "source_count": "NA",
      "data_types": "Health/Healthcare",
      "example_uses": "",
      "data_impacts": "[]",
      "financial_info": "",
      "last_updated": "2014-09-18 15:44:37.967430"
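
As a rough sketch of how such a record can be read, the snippet below parses a trimmed-down stand-in for the data file. The top-level "data" key is taken from how the code later in this article indexes the loaded JSON; the inline string and the reduced set of fields are assumptions made to keep the example self-contained.

```python
import json

# A trimmed-down stand-in for the data file; only the text fields
# used to build the word vector are included here.
raw = '''
{
  "data": [
    {
      "company_name": "Accenture",
      "description_short": "Accenture provides management consulting, technology and outsourcing services.",
      "revenue_source": "Not reported by company"
    }
  ]
}
'''

companies = json.loads(raw)["data"]
for company in companies:
    # Print the company name next to its short description.
    print(company["company_name"], "->", company["description_short"])
```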

Let’s create a class called USCompaniesWordVector:

import json
import sys
import re
import operator

class USCompaniesWordVector:
    def __init__(self, us_companies_data):
        self.us_companies_data = us_companies_data

Now we add a method to the class above that removes all HTML tags from the text and returns a list of words:

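def getWords(self, text):
    # strip HTML tags
    txt = re.compile(r'<[^>]+>').sub('', text)
    # split on runs of non-letter characters
    words = re.compile(r'[^A-Za-z]+').split(txt)
    # lowercase and drop the empty strings left by the split
    return [word.lower() for word in words if word != '']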
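
To see what the tokenizer returns, here is a standalone version of the same logic applied to a small snippet. The function name `get_words` and the sample string are made up for illustration.

```python
import re

def get_words(text):
    """Strip HTML tags, split on non-letters, and lowercase the result."""
    txt = re.sub(r'<[^>]+>', '', text)
    words = re.split(r'[^A-Za-z]+', txt)
    return [word.lower() for word in words if word != '']

sample = "<p>Accenture provides <b>consulting</b> & IT services.</p>"
print(get_words(sample))
# ['accenture', 'provides', 'consulting', 'it', 'services']
```

Note that replacing tags with an empty string joins words that are separated only by a tag, such as `first<br>second`; substituting a space instead would avoid that.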

Next we create a method that returns each word with its count (how many times the word is repeated in the text):

def getWordsFromDescription(self, description):
    wc = {}  # dictionary of word -> count
    words = self.getWords(description)
    for word in words:
        wc.setdefault(word, 0)
        wc[word] += 1
    return wc
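
The same counting can also be done with `collections.Counter` from the standard library. This is an alternative sketch, not the article's method, and the word list below is illustrative input.

```python
from collections import Counter

# Words as the tokenizer might return them (illustrative input).
words = ["accenture", "provides", "services", "and", "solutions",
         "services", "solutions", "technology"]

wc = Counter(words)
print(wc["services"])   # 2
print(wc["health"])     # 0, because missing words count as zero
```

A `Counter` behaves like a dictionary, so it could be stored per company in the same way as the hand-built `wc`.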

Now we have to create the vector, and for that we add another method:

def clusterCompaniesInCategories(self):
    globalCount = {}
    companyWordCount = {}
    #1. filling globalCount and companyWordCount
    for i in range(0, len(self.us_companies_data["data"])):   #looping through companies
        company_name = self.us_companies_data["data"][i]["company_name"]
        description = self.us_companies_data["data"][i]["description"]
        revenue_source = self.us_companies_data["data"][i]["revenue_source"]
        description_short = self.us_companies_data["data"][i]["description_short"]
        #getting words from description, description_short and revenue_source
        wc = self.getWordsFromDescription(description + " " + description_short + " " + revenue_source)
        companyWordCount[company_name] = wc
        #building the word list across all companies
        for word,count in wc.items():
            globalCount.setdefault(word, 0)
            if count > 1:
                globalCount[word] += 1   #one more company repeats this word

    #2. filtering words, eliminating words like a, the, in etc.
    wordList = []
    for word, count in globalCount.items():
        #fraction of companies in which the word appears more than once
        fraction = float(count) / len(self.us_companies_data["data"])
        #keeping fractions between 6% and 20% and dropping words of up to 3 characters
        if fraction > 0.06 and fraction < 0.20 and len(word) > 3:
            wordList.append(word)

    #3. creating output JSON file (word vector)
    fileName = "output.json"
    jsonObj = {}
    for company, words in companyWordCount.items():
        jsonObj.setdefault(company, {})
        for word in wordList:
            #count of the keyword in this company's text, 0 if absent
            jsonObj[company][word] = words.get(word, 0)
    #the with-block closes the file so the JSON is flushed to disk
    with open(fileName, "w") as file:
        json.dump(jsonObj, file)
    print("Created file " + fileName)

The method above has three parts:
1. The first part builds a word count for each company and an overall count across all companies, recording for each word how many companies use it more than once.
2. The second part filters the overall list, where we choose the lower and upper frequency limits we want to keep. Words like the, a, in, on, is, and are occur in almost every company’s text, so their counts are high and we discard them. Here I’m keeping words whose share of companies falls between 6% and 20%. You can change these numbers and see which output fits best.
3. The third part creates the file that holds the actual word vector: in the JSON output file we write the count of each filtered keyword for every company.
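
The filtering step can be tried in isolation on toy data. The sketch below mirrors the thresholds and the `count > 1` check from the method; the function name `select_keywords`, its parameters, and the ten made-up companies are assumptions for illustration.

```python
def select_keywords(company_word_counts, lower=0.06, upper=0.20, min_len=4):
    """Keep words repeated in a moderate share of companies.

    company_word_counts maps company name -> {word: count}.
    A company "votes" for a word only if the word occurs more than once
    in its text, mirroring the count > 1 check in the method above.
    """
    votes = {}
    for wc in company_word_counts.values():
        for word, count in wc.items():
            votes.setdefault(word, 0)
            if count > 1:
                votes[word] += 1

    total = len(company_word_counts)
    return sorted(
        word for word, n in votes.items()
        if lower < n / total < upper and len(word) >= min_len
    )

# Ten hypothetical companies: "the" is repeated everywhere,
# "health" in one of them, "zebra" is never repeated.
data = {"company%d" % i: {"the": 5} for i in range(10)}
data["company0"]["health"] = 3
data["company1"]["zebra"] = 1

print(select_keywords(data))   # ['health']
```

Here "the" is dropped because every company repeats it, "zebra" because no company does, and "health" survives with a share of 10%.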

Now to execute the code:

loaded_json = json.load(open("Data/USCompaniesData.json"))
usCmpny = USCompaniesWordVector(loaded_json)
usCmpny.clusterCompaniesInCategories()

The output file is also attached with the code; its contents will look like:

  "Accenture": {
    "platform": 0,
    "provides": 2,
    "government": 0,
    "through": 1,
    "clients": 0,
    "business": 1,
    "financial": 0,
    "services": 2,
    "information": 0,
    "research": 0,
    "health": 0,
    "more": 0,
    "global": 0,
    "solutions": 2,
    "technology": 1,
    "products": 0,
    "analytics": 0,
    "software": 0,
    "their": 0,
    "help": 0,
    "from": 0

You can see above that Accenture has a word count of 2 for services, 2 for solutions and 1 for technology, so we can say Accenture falls under the categories/keywords [services, solutions, technology].
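
Picking the strongest keywords for a company can be sketched as a sort over its row. The `top_keywords` helper and its `limit` parameter are hypothetical; the counts are a subset of the Accenture row shown above.

```python
import operator

# A subset of the Accenture row shown above.
accenture = {"provides": 2, "through": 1, "business": 1, "services": 2,
             "solutions": 2, "technology": 1, "health": 0, "software": 0}

def top_keywords(row, limit=3):
    """Return the highest-count keywords, ignoring zero counts."""
    present = [(word, count) for word, count in row.items() if count > 0]
    # Sort by count (descending); sorted() is stable, so ties keep dict order.
    ranked = sorted(present, key=operator.itemgetter(1), reverse=True)
    return [word for word, _ in ranked[:limit]]

print(top_keywords(accenture))
# ['provides', 'services', 'solutions']
```

A generic word like "provides" ranks alongside the useful ones, which suggests the frequency limits may need tuning or a stop word list may help.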


    Word vectors are a powerful technique used in many modern AI systems, such as keyboard suggestions, antonyms, synonyms, and voice recognition. From the output above we can calculate a distance or Pearson score to find similar companies, and we can also cluster companies based on keywords.
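
As a minimal sketch of the Pearson score mentioned above, the code below compares rows shaped like those in output.json and picks the closest company. The company names and counts are hypothetical, and `pearson` and `most_similar` are illustrative helpers rather than part of the attached code.

```python
import math

def pearson(row_a, row_b):
    """Pearson correlation between two keyword-count rows with the same keys."""
    keys = list(row_a)
    n = len(keys)
    xs = [row_a[k] for k in keys]
    ys = [row_b[k] for k in keys]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant row has no defined correlation
    return cov / math.sqrt(var_x * var_y)

# Hypothetical rows in the shape of output.json.
vectors = {
    "CompanyA": {"health": 3, "software": 0, "research": 2, "analytics": 0},
    "CompanyB": {"health": 2, "software": 0, "research": 1, "analytics": 0},
    "CompanyC": {"health": 0, "software": 4, "research": 0, "analytics": 3},
}

def most_similar(name, vectors):
    """Return the other company with the highest Pearson score."""
    others = [c for c in vectors if c != name]
    return max(others, key=lambda c: pearson(vectors[name], vectors[c]))

print(most_similar("CompanyA", vectors))   # CompanyB
```

Scores near 1 mean two companies emphasise the same keywords, while negative scores mean their keyword profiles diverge.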

Last modified on 11 October 2018
Nikhil Joshi

CEO & Founder at Dotnetlovers