
A few weeks ago I introduced my students to the notion of dictionaries in python. The obvious way of teaching them, especially in the context of my module, is to use dictionaries to create word frequency lists. To help the students understand better why dictionaries are useful, we discussed other ways to produce frequency lists which use only lists. This prompted me to think of as many ways as possible to produce frequency lists. I also wanted to see how feasible it is to use these methods.

Below you can see a notebook where I explain four main approaches to creating word frequency lists. It is stored in the same repository as the rest of the materials for my python course.

Can you think of other methods to produce frequency lists? If yes, don’t forget to leave a comment.

Creating word frequencies in python

Rank  Word  Frequency
   1  the       62713
   2  ,         58334
   9  that      10237
  10  is        10011
  11  was        9777
  19  he         6566
  20  his        6466

A word frequency list is “a sorted list of words (word types) together with their frequency, where frequency here usually means the number of occurrences in a given corpus, from which the rank can be derived as the position in the list.” (https://en.wikipedia.org/wiki/Word_lists_by_frequency).

For example, the frequency list above is extracted from the Brown corpus. We use here the version that is distributed with NLTK. No preprocessing was applied, so the list contains both words and punctuation. As expected, the most frequent word is the, with a frequency of 62,713. The 19th most frequent token is the word he, with a frequency of 6,566.

In general, the process of creating a word frequency list requires finding out the frequency of each word in the corpus. After this, the list is usually sorted in descending order of frequency. In this document, I will show how we can determine the frequency of words using four different methods (with several variants for each method).

It should be noted that frequency lists can be created from any kind of sequence of items using the approaches described below.
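
As a tiny illustration of this point (my own example, using the plain-dictionary technique from Method 3 below), the same code that counts words also counts the letters of a string:

freq = {}
for ch in "banana":
    freq[ch] = freq.get(ch, 0) + 1
print(freq)  # {'b': 1, 'a': 3, 'n': 2}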

Before we start, we need to import the necessary modules:

  • from nltk.corpus import brown gives us access to the Brown corpus
  • from operator import itemgetter enables us to use itemgetter when we sort
from nltk.corpus import brown
from operator import itemgetter

Because the first method proposed requires having the words in a list, I will first create a variable of type list which stores the whole corpus. This is very inefficient in terms of memory use and should be avoided when not really necessary. Because I want to compare the methods, I will use this list in all cases.

Here I will not apply any filtering. This means that the frequency lists will contain both words and punctuation marks. The methods can be easily adapted to count other phenomena (e.g. part of speech tags, lemmas, etc.). All you have to do is ensure that the list words contains the elements that you want to count.

words = list(brown.words())

Rank   Word         Frequency
    1  the              62713
 5606  Cobb                18
11212  responds             7
16817  Outdoor              4
22423  offenders            2
28029  howse                2
33634  Jimenez              1
39240  water-proof          1
44845  methode              1
50451  clomped              1
56057  stupefying           1

For each method I will measure the time it takes to produce the frequency list. This is done using the ExecuteTime jupyter extension. An alternative way to measure the time is the %%timeit magic command.
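
If you are not working in a notebook, a minimal sketch using the standard time module (my addition, not what the notebook uses) achieves something similar:

import time

start = time.perf_counter()
# ... code that builds the frequency list goes here ...
elapsed = time.perf_counter() - start
print(f"Building the frequency list took {elapsed:.2f} seconds")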

In addition to measuring how long it takes to produce the frequency list, I will also check how long it takes to access elements in the frequency list (i.e. retrieve the frequency for a given token). To measure this, I will select 11 words that are distributed equally throughout the frequency list. The list of words is displayed in the table above.

The methods below rely on different ways to produce the frequency lists and to represent the result internally. Therefore it will be interesting to see how long it takes to retrieve information about these words. Because it is not uncommon to have unknown words, I also inserted a token that I know does not appear in the corpus.

words_to_search = ["the", "Cobb", "responds", "Outdoor", "offenders", 
                   "howse", "Jimenez", "water-proof", "methode", 
                   "clomped", "stupefying", "dinel.org.uk"]

Method 1: Counting the frequency of tokens using list.count()

The first method explored relies on the list.count() method which returns how many times an item appears in a list. See below the output of help(list.count)

help(list.count)
Help on method_descriptor:

count(...)
    L.count(value) -> integer -- return number of occurrences of value

This means that we can take each word from the corpus and find out how many times it appears in the corpus. For example, if we want to find out how many times the word bank appears in the corpus we can do the following:

print("The word bank appears", words.count("bank"), "times in the corpus")
The word bank appears 54 times in the corpus

There are several versions of the algorithm we need to implement.

Version 1

  1. Take each word from the corpus.
  2. Count how many times it appears in the corpus.
  3. Add the pair (word, frequency) to a list.
  4. Remove any duplicates from the list.
  5. Sort the list.

Version 2

  1. Take each word from the corpus.
  2. Verify whether it appears in the list of words already counted.
  3. If it does not appear, count how many times it appears in the corpus.
  4. Add the word to the list of words that have already been counted.
  5. Add the frequency to a second list, ensuring that the frequency at position i corresponds to the word at position i in the list of words already seen.
  6. Combine the two lists.
  7. Sort the list.

There are probably a few variations of these two versions. Below, I will implement both of them and time them.

Method 1: Version 1 - careless counting

# list that stores words and their frequency. it will contain repeated items
words_and_frequencies = []

for word in words:
    words_and_frequencies.append((word, words.count(word)))

sorted_list = sorted(frozenset(words_and_frequencies), key=itemgetter(1), reverse=True)
print(sorted_list[:20])
[('the', 62713), (',', 58334), ('.', 49346), ('of', 36080), ('and', 27915), ('to', 25732), ('a', 21881), ('in', 19536), ('that', 10237), ('is', 10011), ('was', 9777), ('for', 8841), ('``', 8837), ("''", 8789), ('The', 7258), ('with', 7012), ('it', 6723), ('as', 6706), ('he', 6566), ('his', 6466)]

The removal of duplicates is achieved using the frozenset datatype (see more about frozensets at https://docs.python.org/3.8/library/stdtypes.html#set). An example of how frozenset is used here to remove duplicated items can be seen below.

l = [("the", 10), ("is", 5), ("the", 10), ("time", 4)]
print(sorted(frozenset(l), key=itemgetter(1), reverse=True))
[('the', 10), ('is', 5), ('time', 4)]

As you can see, this approach is very clear, but it is very inefficient. It takes very long to run and builds a list that takes up a lot of memory. This list needs to be processed afterwards to remove duplicates. Surely no one is willing to wait more than three and a half hours to see the frequency list.

Let’s try to understand why it takes so long. If we check the lengths of words and of words_and_frequencies, we see that they both contain 1,161,192 items. This means that words.count(word) needs to be executed that many times, and each call itself traverses the whole list, so roughly 1.16 million × 1.16 million ≈ 1.3 × 10¹² comparisons are performed. Even though a single count operation seems fast (see cell No. 6), repeating it so many times adds up. In addition, there is the cost of continuously expanding the words_and_frequencies list with new items.

In contrast, discarding the duplicates and sorting the list is fairly fast, as the code below shows.

sorted_list = sorted(frozenset(words_and_frequencies), key=itemgetter(1), reverse=True)

There are two options for finding the frequency of individual words. The first one involves using list.count for each word. The second one takes each pair (word, frequency) from the sorted frequency list and attempts to locate the word whose frequency we need. Both approaches can be applied to all the methods investigated here.

As can be seen below, the second approach runs faster. As a bonus, it also features a rare case of using the for ... else construction.

# obtaining the frequency of an item using directly the words list
for word_to_search in words_to_search:
    if word_to_search in words:
        print(f"{word_to_search} appears {words.count(word_to_search)} times")
    else:
        print(f"{word_to_search} does not appear in the corpus")
the appears 62713 times
Cobb appears 18 times
responds appears 7 times
Outdoor appears 4 times
offenders appears 2 times
howse appears 2 times
Jimenez appears 1 times
water-proof appears 1 times
methode appears 1 times
clomped appears 1 times
stupefying appears 1 times
dinel.org.uk does not appear in the corpus
# obtaining the frequency of an item using information from the sorted_list
for word_to_search in words_to_search:
    for word, freq in sorted_list:
        if word == word_to_search:
            print(f"{word_to_search} appears {freq} times")
            break
    else:
        print(f"{word_to_search} does not appear in the corpus")
the appears 62713 times
Cobb appears 18 times
responds appears 7 times
Outdoor appears 4 times
offenders appears 2 times
howse appears 2 times
Jimenez appears 1 times
water-proof appears 1 times
methode appears 1 times
clomped appears 1 times
stupefying appears 1 times
dinel.org.uk does not appear in the corpus

I really do not recommend using this method to build frequency lists at home, as it takes very long even on a good computer!

Method 1: Version 2 - more careful counting

The second version of this method tries to address the problem of counting a word more than once. To achieve this, we are going to keep a list of the words which have already been counted (in the variable list_words). In addition, a second list, which stores the frequencies, will be kept. We cannot have a single list of tuples, like in the previous case, because we need an efficient way to check whether a word has already been counted (i.e. if word not in list_words:).

Given the way in which the list_words and list_freq lists are built, the token at position i (i.e. list_words[i]), will have the frequency list_freq[i].

list_words = []
list_freq = []
for word in words:
    if word not in list_words:
        list_words.append(word)
        list_freq.append(words.count(word))

sorted_list = sorted(zip(list_words, list_freq), key=itemgetter(1), reverse=True)
print(sorted_list[:20])
[('the', 62713), (',', 58334), ('.', 49346), ('of', 36080), ('and', 27915), ('to', 25732), ('a', 21881), ('in', 19536), ('that', 10237), ('is', 10011), ('was', 9777), ('for', 8841), ('``', 8837), ("''", 8789), ('The', 7258), ('with', 7012), ('it', 6723), ('as', 6706), ('he', 6566), ('his', 6466)]

This version is way faster than the previous one, producing a frequency list in a bit over 10 minutes.

In order to produce a list that contains word, frequency pairs, we use the zip function which aggregates information from the list_words and list_freq lists. For more information about the zip function, read the excellent tutorial from https://realpython.com/python-zip-function/.
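
A minimal illustration of zip, with made-up values:

list_w = ["the", "is", "time"]
list_f = [10, 5, 4]
print(list(zip(list_w, list_f)))  # [('the', 10), ('is', 5), ('time', 4)]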

Let’s see how long it takes to retrieve the frequency for specific words. The two solutions implemented above would work here too, but I am going to explore another method that takes advantage of the format in which the data is stored. We can use list_words.index to find out at which position the word appears in the list. This method throws an error if the word we are searching for is not present in the list, so we first need to test whether it appears in list_words. Because of the correspondence between list_words and list_freq, we can use this position to retrieve the frequency of the word from list_freq.

for word in words_to_search:
    if word in list_words:
        pos = list_words.index(word)
        print(f"{word} appears {list_freq[pos]} times")
    else:
        print(f"{word} does not appear in the corpus")
the appears 62713 times
Cobb appears 18 times
responds appears 7 times
Outdoor appears 4 times
offenders appears 2 times
howse appears 2 times
Jimenez appears 1 times
water-proof appears 1 times
methode appears 1 times
clomped appears 1 times
stupefying appears 1 times
dinel.org.uk does not appear in the corpus
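
A small refinement, not in the original notebook: both the in test and index scan the list from the beginning, so the loop above traverses list_words twice for every word that is found. Catching the ValueError raised by index keeps a single scan:

# variant: rely on the ValueError raised by list.index instead of
# testing membership first, so the list is scanned only once per word
for word in words_to_search:
    try:
        pos = list_words.index(word)
        print(f"{word} appears {list_freq[pos]} times")
    except ValueError:
        print(f"{word} does not appear in the corpus")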

This version is quite a bit faster than the first version, and the format in which the data is stored allows fast retrieval of frequencies, but it is still too slow to be used in practical situations. I see no reason to use this approach either.

Method 2: imitate sort | uniq -c | sort -nr from Unix

One of the easiest ways of producing a frequency list on Unix/Linux is to use sort | uniq -c | sort -nr. If the input for this command is a file with one word/token per line, the output is a frequency list. Without going into too many details, the idea of this series of commands is that:

  • the first sort takes the input file and outputs a sorted list of the input words
  • uniq -c takes this sorted list and produces an output that omits repeated lines (i.e. keeps each word only once). In order to work, the input must be a sorted list, hence the first sort. The -c parameter means that each line will be prefixed by the number of occurrences of the word (e.g. 62713 the)
  • sort -nr takes the output of the uniq command and sorts it numerically in descending order

This method works very well if the data is too big to fit in memory. Many years ago, I generated various frequency lists from the British National Corpus (BNC) on a computer that did not have enough memory to build the lists using my own code. The sort commands created various temporary files on disk without me having to do anything. The processing took a long time, but at the end I had the frequency lists I needed.

Below is an implementation which roughly follows this approach:

  1. Sorts the list of words.
  2. Compares words which appear one after another to find out how many times a word appears in the corpus.
  3. Produces the frequency list.

previous_word = ""
counter = 0
list_of_frequencies = []

# get every word from the sorted list and compare consecutive words
for word in sorted(words):
    if(word == previous_word):        
        # the previous word is repeated so increment its frequency        
        counter += 1    
    else:        
        # if the words are different display the frequency of the previous one        
        if previous_word:                        
            list_of_frequencies.append((previous_word, counter))        
            
        # and the current word become previous word with frequency one        
        previous_word = word        
        counter = 1
        
list_of_frequencies.append((previous_word, counter))
sorted_list = sorted(list_of_frequencies, key=itemgetter(1), reverse=True)
print(sorted_list[:20])
[('the', 62713), (',', 58334), ('.', 49346), ('of', 36080), ('and', 27915), ('to', 25732), ('a', 21881), ('in', 19536), ('that', 10237), ('is', 10011), ('was', 9777), ('for', 8841), ('``', 8837), ("''", 8789), ('The', 7258), ('with', 7012), ('it', 6723), ('as', 6706), ('he', 6566), ('his', 6466)]

This method is faster than the two versions of Method 1 presented above, but it requires sorting the list of words before processing it, which takes time. In addition, if the original list of words is needed elsewhere, you will have to make a sorted copy using sorted, which means extra memory is being used. If you do not need the original order later on in your program, you can use list.sort, which sorts the list in place and saves memory.
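
To make the difference concrete (a two-line sketch of my own):

sorted_words = sorted(words)  # returns a new sorted list; words is unchanged
words.sort()                  # sorts words in place; the original order is lost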

Retrieval of frequencies can take advantage of the fact that the list of words is sorted: locate the word and count how many times it appears in consecutive positions. The implementation of this approach is given below. Its performance is rather poor in comparison to the method that searches through sorted_list, which can be used with any of the approaches presented here.

# to improve efficiency we first sort the words and store the result in a different variable
# this sorting operation takes about half a second
sorted_words = sorted(words)

for word_to_search in words_to_search:
    if word_to_search in sorted_words:
        # get the position at which the word appears
        pos = sorted_words.index(word_to_search)
        counter = 0
        
        # because the list is sorted all we have to do is to count how many times the word is repeated
        while pos < len(sorted_words) and word_to_search == sorted_words[pos]:
            counter += 1
            pos += 1
        print(f"{word_to_search} appears {counter} times")
    else:
        print(f"{word_to_search} does not appear in the corpus")
the appears 62713 times
Cobb appears 18 times
responds appears 7 times
Outdoor appears 4 times
offenders appears 2 times
howse appears 2 times
Jimenez appears 1 times
water-proof appears 1 times
methode appears 1 times
clomped appears 1 times
stupefying appears 1 times
dinel.org.uk does not appear in the corpus
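
A faster retrieval variant, not covered in the notebook, exploits the sorted list with binary search via the standard bisect module: bisect_left returns the position of the first occurrence of the word and bisect_right the position just after the last one, so their difference is the frequency.

from bisect import bisect_left, bisect_right

# binary search over the sorted list: the number of occurrences of a word
# is the distance between the left and right insertion points
for word_to_search in words_to_search:
    left = bisect_left(sorted_words, word_to_search)
    right = bisect_right(sorted_words, word_to_search)
    if right > left:
        print(f"{word_to_search} appears {right - left} times")
    else:
        print(f"{word_to_search} does not appear in the corpus")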

Method 3: Using dictionaries

The obvious way to implement a program that determines the frequency list of a corpus is by using the dict datatype (https://docs.python.org/3.8/tutorial/datastructures.html#dictionaries). You probably already know that a dictionary is a datatype that stores (key, value) pairs. In our case, the key will be the word for which we calculate the frequency and the value will store the frequency of that key.

The idea of this approach is to iterate through the text and update the number of occurrences of words as they are encountered. Three versions are tested. The first two use plain dictionaries, whilst the third one uses a defaultdict.

Method 3: Version 1

The first variant tests whether a word was found in the text before (i.e. appears in the dictionary) using the in operator.

freq_dict = {}
for word in words:
    if word in freq_dict:
        freq_dict[word] += 1
    else:
        freq_dict[word] = 1
        
sorted_list = sorted(freq_dict.items(), key=itemgetter(1), reverse=True)
print(sorted_list[:20])
[('the', 62713), (',', 58334), ('.', 49346), ('of', 36080), ('and', 27915), ('to', 25732), ('a', 21881), ('in', 19536), ('that', 10237), ('is', 10011), ('was', 9777), ('for', 8841), ('``', 8837), ("''", 8789), ('The', 7258), ('with', 7012), ('it', 6723), ('as', 6706), ('he', 6566), ('his', 6466)]

Method 3: Version 2

Instead of testing whether a word appears in the dictionary before updating its frequency, here we use the get method, which returns the value associated with a key, or a default value if the key does not appear in the dictionary. In this case, it will return 0 to indicate that the word hasn’t been seen before.

freq_dict = {}
for word in words:
    freq_dict[word] = freq_dict.get(word, 0) + 1
        
sorted_list = sorted(freq_dict.items(), key=itemgetter(1), reverse=True)
print(sorted_list[:20])
[('the', 62713), (',', 58334), ('.', 49346), ('of', 36080), ('and', 27915), ('to', 25732), ('a', 21881), ('in', 19536), ('that', 10237), ('is', 10011), ('was', 9777), ('for', 8841), ('``', 8837), ("''", 8789), ('The', 7258), ('with', 7012), ('it', 6723), ('as', 6706), ('he', 6566), ('his', 6466)]

Method 3: Version 3

collections is an often ignored library which comes with python by default. This library provides two very useful data types which can be used for producing frequency lists. The first one is defaultdict, which we will use next; the other one is Counter, which will be explored later on.

For more information about the collections library, you can check its documentation.

The defaultdict is like a dictionary, but it takes a default factory for values. This means it is possible to specify which value to return for missing keys. In this example, I will use int to indicate that the values are integers, so the default value for a missing key will be 0.

from collections import defaultdict
freq_dict = defaultdict(int)

for word in words:
    freq_dict[word] += 1
    
sorted_list = sorted(freq_dict.items(), key=itemgetter(1), reverse=True)
print(sorted_list[:20])
[('the', 62713), (',', 58334), ('.', 49346), ('of', 36080), ('and', 27915), ('to', 25732), ('a', 21881), ('in', 19536), ('that', 10237), ('is', 10011), ('was', 9777), ('for', 8841), ('``', 8837), ("''", 8789), ('The', 7258), ('with', 7012), ('it', 6723), ('as', 6706), ('he', 6566), ('his', 6466)]

The times to run the three versions are very similar, but the one that uses the defaultdict seems slightly faster. That variant is also more elegant than the other two.

Retrieving information about the frequency of a word is very fast and quite natural.

for word_to_search in words_to_search:
    if word_to_search in freq_dict:
        print(f"{word_to_search} appears {freq_dict[word_to_search]} times")
    else:
        print(f"{word_to_search} does not appear in the corpus")
the appears 62713 times
Cobb appears 18 times
responds appears 7 times
Outdoor appears 4 times
offenders appears 2 times
howse appears 2 times
Jimenez appears 1 times
water-proof appears 1 times
methode appears 1 times
clomped appears 1 times
stupefying appears 1 times
dinel.org.uk does not appear in the corpus

Any of the three variants above is appropriate.

Method 4: using tools developed specifically for counting

One way to produce frequency lists is to use tools designed specifically for counting, such as collections.Counter and nltk.probability.FreqDist.

Method 4: Version 1 using collections.Counter

Counter is a tool “to support convenient and rapid tallies” which is part of the collections module. More information can be found at https://docs.python.org/3.8/library/collections.html#collections.Counter. In order to create the frequency list, all that we need to do is to pass the list of words as a parameter when we create the Counter.

from collections import Counter
freq_dict = Counter(words)
print(freq_dict.most_common(20))
[('the', 62713), (',', 58334), ('.', 49346), ('of', 36080), ('and', 27915), ('to', 25732), ('a', 21881), ('in', 19536), ('that', 10237), ('is', 10011), ('was', 9777), ('for', 8841), ('``', 8837), ("''", 8789), ('The', 7258), ('with', 7012), ('it', 6723), ('as', 6706), ('he', 6566), ('his', 6466)]

As can be seen, this approach is the fastest among all the ones considered.

Accessing information is also the quickest. Because counters are a type of dictionary, the code is exactly the same as for dictionaries.

for word_to_search in words_to_search:
    if word_to_search in freq_dict:
        print(f"{word_to_search} appears {freq_dict[word_to_search]} times")
    else:
        print(f"{word_to_search} does not appear in the corpus")
the appears 62713 times
Cobb appears 18 times
responds appears 7 times
Outdoor appears 4 times
offenders appears 2 times
howse appears 2 times
Jimenez appears 1 times
water-proof appears 1 times
methode appears 1 times
clomped appears 1 times
stupefying appears 1 times
dinel.org.uk does not appear in the corpus
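
In fact, because a Counter returns 0 for keys it has never seen (instead of raising a KeyError), the membership test above could even be dropped if reporting a frequency of 0 for unknown words is acceptable:

# Counter returns 0 for missing keys and does not insert them
print(freq_dict["dinel.org.uk"])  # 0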

Method 4: Version 2: using FreqDist

This is probably the preferred option of people who use NLTK on a regular basis. FreqDist is derived from Counter, so it should be fairly fast. In addition, it provides various functions which can be used to solve probabilistic/statistical problems. A discussion about the differences between Counter and FreqDist can be found at https://stackoverflow.com/a/34606637.

from nltk.probability import FreqDist
freq_dist = FreqDist(words)
print(freq_dist.most_common(20))
[('the', 62713), (',', 58334), ('.', 49346), ('of', 36080), ('and', 27915), ('to', 25732), ('a', 21881), ('in', 19536), ('that', 10237), ('is', 10011), ('was', 9777), ('for', 8841), ('``', 8837), ("''", 8789), ('The', 7258), ('with', 7012), ('it', 6723), ('as', 6706), ('he', 6566), ('his', 6466)]
for word_to_search in words_to_search:
    if word_to_search in freq_dist:
        print(f"{word_to_search} appears {freq_dist[word_to_search]} times")
    else:
        print(f"{word_to_search} does not appear in the corpus")
the appears 62713 times
Cobb appears 18 times
responds appears 7 times
Outdoor appears 4 times
offenders appears 2 times
howse appears 2 times
Jimenez appears 1 times
water-proof appears 1 times
methode appears 1 times
clomped appears 1 times
stupefying appears 1 times
dinel.org.uk does not appear in the corpus
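
As a small illustration of that extra functionality: FreqDist.N() returns the total number of samples and FreqDist.freq() the relative frequency of a sample (the exact numbers depend on the corpus version).

print(freq_dist.N())          # total number of tokens: 1161192
print(freq_dist.freq("the"))  # 62713 / 1161192, approximately 0.054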

Given how many people use NLTK for various tasks, it is a bit disappointing to see how long it takes to build the frequency list using FreqDist. The speed is comparable to that of Method 2. Retrieval of frequencies is very fast. The advantage of using FreqDist is that it is used by other methods from NLTK, so there is no need to worry about converting between data types when calling those functions. Still, I find this …

Conclusion

The winner is without doubt the method which uses collections.Counter. It builds the frequency list very fast, and retrieval of information is also fast. The code is very clear, so I am not sure why I do not see this approach used more often.
