Working Through NLTK: CH1¶

Joshua Gary Mausolf¶

University of Chicago¶

Preliminary Steps¶

Before we begin the problem set, let us import the required modules and data.

#Import NLTK and Texts
from nltk import *
from nltk.book import *
from nltk.corpus import stopwords

Problem 17¶

Instructions: Use text9.index() to find the index of the word sunset. You'll need to insert this word as an argument between the parentheses. By a process of trial and error, find the slice for the complete sentence that contains this word.

To begin this problem, let us first confirm the title of text9 and find the location of the specified keyword, "sunset."

#Title to Text9
text9

<Text: The Man Who Was Thursday by G . K . Chesterton 1908>

#Index Number for the Location of the Word "Sunset"
text9.index("sunset")

629

Now that the location is found, simply set a range (by trial and error) around the keyword and use Python's string methods to clean up the output into a nicely formatted sentence string.

#Isolating the words in Text9 and Turning the Tokens Into a Clean Sentence.
s0 = text9[621:644]
s1 = str(s0).replace("'", "").replace(".,", ".").replace(", ", " ")
s2 = s1.replace(" ,", ",").replace(" .", ".").replace("THE", "The")

#Print the Cleaned Sentence
print (s2)

[The suburb of Saffron Park lay on the sunset side of London, as red and ragged as a cloud of sunset.]
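The chain of replace() calls above works on the string form of the token list. An alternative sketch, not the approach used above, joins the tokens and then removes the space before punctuation with a regular expression. The hard-coded tokens list is a stand-in for text9[621:644], reconstructed from the printed sentence, and detokenize is a hypothetical helper name.

```python
import re

# Stand-in for text9[621:644], reconstructed from the sentence printed above.
tokens = ["THE", "suburb", "of", "Saffron", "Park", "lay", "on", "the",
          "sunset", "side", "of", "London", ",", "as", "red", "and",
          "ragged", "as", "a", "cloud", "of", "sunset", "."]


def detokenize(words):
    """Join tokens with spaces, then drop the space before punctuation."""
    sentence = " ".join(words)
    return re.sub(r"\s+([,.;:!?])", r"\1", sentence)


sentence = detokenize(tokens)
print(sentence)
```

Because this version never converts the list to a string, it does not need to strip quotes or brackets, and apostrophes inside tokens are left alone.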

Problem 19¶

Instructions: What is the difference between the following two lines? Which one will give a larger value? Will this be the case for other texts?

>>> sorted(set(w.lower() for w in text1))
>>> sorted(w.lower() for w in set(text1))

What do we get if we run these two lines? Each line on its own returns a sorted list of lower-cased tokens. To see which list is larger, we can compare their lengths with the len() function.

First, let us assign variable names to these two sets, set1_1 for (set1, text1) and set2_1 for (set2, text1).

set1_1 = sorted(set(w.lower() for w in text1))
set2_1 = sorted(w.lower() for w in set(text1))

Now, let us print the number of items in each set.

print ("Number of items in Set1 =", len(set1_1))

Number of items in Set1 = 17231

print ("Number of items in Set2 =", len(set2_1))

Number of items in Set2 = 19317

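So, set2 is bigger than set1 for text1.¶

The difference is the order of operations. The first line lower-cases every token and then removes duplicates, so "The" and "the" collapse into a single entry. The second line removes duplicates first and lower-cases afterward, so both survive and "the" appears twice in the result.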
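A toy example shows how the order of lower-casing and deduplicating changes the count. The token list is invented for illustration; it is not drawn from any NLTK text.

```python
# Hypothetical toy tokens: three spellings of "the", two of "whale".
tokens = ["The", "the", "THE", "whale", "Whale", "."]

# Lowercase first, then deduplicate: case variants collapse.
set1 = sorted(set(w.lower() for w in tokens))
# Deduplicate first, then lowercase: case variants survive as repeats.
set2 = sorted(w.lower() for w in set(tokens))

print(set1)
print(set2)
print(len(set1), len(set2), len(set(tokens)))
```

The length of the second list always equals len(set(tokens)), the number of distinct tokens before lower-casing.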

What about other cases? We can make the evaluation faster with a function.¶


def bigger_set(selected_text, print_option=0):

    text = selected_text
    set1 = sorted(set(w.lower() for w in text))
    set2 = sorted(w.lower() for w in set(text))

    #Assert Error if Print Option Not Specified
    #Depending on Future Usage, Printing May or May Not Be Desired
    assert print_option == 1 or print_option == 0, \
    "Print_option is not equal to 0 or 1. " \
    "Please input either 0 or 1. 0 = Do not print (the default). 1 = Print results."

    #Get Clean Title
    clean_title0 = str(text).replace("<Text: ", "").replace(">", "")
    #split() Returns the Whole String When " by " is Absent, so No Fallback is Needed
    c2 = clean_title0.split(" by ")[0].upper()
    conclusion = "for the text: "+'\n'+" "+c2+"."

    #Set Initial Values
    x = 0 #Set1 is Bigger
    y = 0 #Set2 is Bigger
    z = 0 #Set1 and Set2 are Equal

    if len(set1) > len(set2):
        statement = " Set1 is greater than set 2,"
        x +=1
    elif len(set1) < len(set2):
        statement = " Set 2 is greater than set 1,"
        y +=1
    elif len(set1) == len(set2):
        statement = " Set 1 is equal in length to set 2,"
        z +=1

    if print_option == 1:
        #Print Results
        print ("__"*30, '\n',
              "SELECTED TEXT:", '\n', str(selected_text),
              '\n'*2, "Results of Text Analysis: ")

        print (" Set1 had a length of ", len(set1), "tokens.", '\n',
               "Set2 had a length of ", len(set2), "tokens.",
              '\n'*2, "Thus:")

        print (statement, conclusion)

        print ("__"*30, '\n')
    else:
        pass

    return x, y, z

We can confirm our prior results for text1 to see if the function works.¶

bigger_set(text1, 1)

____________________________________________________________
 SELECTED TEXT:
 <Text: Moby Dick by Herman Melville 1851>

 Results of Text Analysis:
 Set1 had a length of  17231 tokens.
 Set2 had a length of  19317 tokens.

 Thus:
 Set 2 is greater than set 1, for the text:
 MOBY DICK.
____________________________________________________________

(0, 1, 0)

We can test this for another text, say text3:¶

bigger_set(text3, 1)

____________________________________________________________
 SELECTED TEXT:
 <Text: The Book of Genesis>

 Results of Text Analysis:
 Set1 had a length of  2628 tokens.
 Set2 had a length of  2789 tokens.

 Thus:
 Set 2 is greater than set 1, for the text:
 THE BOOK OF GENESIS.
____________________________________________________________

(0, 1, 0)

We can now get the result for every text. We will run a simple loop.¶

texts = [text1, text2, text3, text4, text5, text6, text7, text8, text9]

set1_bigger = 0
set2_bigger = 0
sets__equal = 0

for text in texts:
    #Call the Function Once Per Text and Unpack the Three Counters
    x, y, z = bigger_set(text, 0)
    set1_bigger += x
    set2_bigger += y
    sets__equal += z


print ("__"*30, '\n', \
       "Total corpora = ", len(texts), '\n',
       "Set1 is bigger in ", set1_bigger, "cases.", '\n', \
       "Set2 is bigger in ", set2_bigger, "cases.", '\n', \
       "The sets are equal in ", sets__equal, "cases.", '\n', "__"*30 \
      )

____________________________________________________________
 Total corpora =  9
 Set1 is bigger in  0 cases.
 Set2 is bigger in  9 cases.
 The sets are equal in  0 cases.
 ____________________________________________________________

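Going back to the original question, set2 gives the larger result not only for text1 but for all nine texts loaded from nltk.book. Set2 can never be smaller than set1, and the two are equal only when no two distinct tokens share the same lower-cased form.¶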

Problem 20¶

Instructions: What is the difference between the following two tests: w.isupper() and not w.islower()?

If we look at these two commands,

>>> w.isupper()

and

>>> not w.islower()

at first, we might think they give the same result. In fact, w.isupper() is true only for tokens that contain at least one cased character and whose cased characters are all upper-case, while not w.islower() is true for every token that is not entirely lower-case. Thus words like "MY" or "TITLE" pass w.isupper(), while "TITLE", "The", and "End", as well as numbers and punctuation such as "]" or ".", all pass not w.islower().

We can see this below:

#Sample 1
sample_isupper = [w for w in text1 if w.isupper()]
print ("Sample for 'w.isupper()':", '\n', sample_isupper[:10], '\n', \
       "Total tokens: ", len(sample_isupper))

Sample for 'w.isupper()':
 ['ETYMOLOGY', 'I', 'H', 'HACKLUYT', 'WHALE', 'HVAL', 'HVALT', 'WEBSTER', 'S', 'DICTIONARY']
 Total tokens:  4083

#Sample 2
sample_not_islower = [w for w in text1 if not w.islower()]
print ("Sample for 'not w.islower()':", '\n', sample_not_islower[:10], '\n', \
      "Total tokens: ", len(sample_not_islower))

Sample for 'not w.islower()':
 ['[', 'Moby', 'Dick', 'Herman', 'Melville', '1851', ']', 'ETYMOLOGY', '.', '(']
 Total tokens:  62579

Thus, we can see that not only do the two lists differ in content, but

not w.islower()

matches far more tokens.
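The two tests can also be compared token by token. This sketch uses a handful of invented sample strings rather than tokens from text1; results is a hypothetical name.

```python
# Hypothetical sample tokens covering each case.
samples = ["TITLE", "I", "The", "end", "1851", "."]

# Map each token to the pair (isupper, not islower).
results = {w: (w.isupper(), not w.islower()) for w in samples}

for w, (upper, not_lower) in results.items():
    print(f"{w!r:10} isupper={upper!s:6} not islower={not_lower}")
```

Tokens with no cased characters, such as "1851" and ".", fail both isupper() and islower(), which is why they show up only under not w.islower().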

Problem 21¶

Instructions: Write the slice expression that extracts the last two words of text2.

def last_two(text):
    '''This function prints the last two words of a text.'''

    #Restrict to only non-empty, alphabetic words
    text_words = [w for w in text if w.isalpha()]
    #A Negative Start Index Counts Back From the End of the List
    last_two_words = text_words[-2:]
    print (last_two_words)

last_two(text2)

['THE', 'END']
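The function above filters out non-alphabetic tokens before slicing. The bare slice expression the problem asks for uses a negative start index; with NLTK loaded, the same expression would be written text2[-2:]. A sketch on an invented token list:

```python
# Hypothetical token list standing in for an NLTK Text.
tokens = ["It", "is", "done", ".", "THE", "END"]

# A negative start index counts from the end of the sequence.
last_two = tokens[-2:]
# Equivalent explicit form using the length.
explicit = tokens[len(tokens) - 2:len(tokens)]

print(last_two)
```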

Problem 23¶

Instructions: Review the discussion of looping with conditions in 4. Use a combination of for and if statements to loop over the words of the movie script for Monty Python and the Holy Grail (text6) and print all the uppercase words, one per line.

For this problem, it is easiest to write a simple function, which provides the requested results.

def print_upper(text):
    """Prints All Uppercase Words, One Per Line"""
    for w in text:
        #Get All Uppercase Words of All Length
        if w.isupper():
            print(w)

#To Print Uncomment the Code
#print_upper(text6)
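Since the call above is commented out, here is a sketch of the same for/if pattern on a short invented token list written in the style of a script. It is not an excerpt from text6, and upper_words is a hypothetical name used to collect the matches.

```python
# Hypothetical tokens in the style of a movie script.
tokens = ["SCENE", "1", ":", "[", "wind", "]",
          "KING", "ARTHUR", ":", "Whoa", "there", "!"]

upper_words = []
for w in tokens:
    # isupper() is False for digits and punctuation, so only words pass.
    if w.isupper():
        upper_words.append(w)
        print(w)
```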

Problem 27¶

Instructions: Define a function called vocab_size(text) that has a single parameter for the text, and which returns the vocabulary size of the text.

We can define vocabulary size in several ways. For example, we might count unique tokens (which include punctuation, words, and numbers); only words and numbers; only words; words excluding common English stopwords such as "a," "the," "I," or "we"; or just complex words over a certain length.
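These measures can be illustrated on a toy token list before turning to the full function. Everything here is invented for the sketch: the tokens, the two-word toy_stopwords set (NLTK's English list is much longer), and the measures dictionary. As an assumption of this sketch, the stopword measure counts alphabetic words only.

```python
# Hypothetical tokens and a tiny stand-in stopword set.
tokens = ["The", "whale", ",", "the", "Whale", "and",
          "1851", "circumnavigation", "."]
toy_stopwords = {"the", "and"}
k = 10  # length threshold for "complex" words

measures = {
    "tokens": len(set(tokens)),
    "words_or_numbers": len({w for w in tokens if w.isalnum()}),
    "words": len({w for w in tokens if w.isalpha()}),
    "words_less_stopwords": len({w for w in tokens
                                 if w.isalpha() and w.lower() not in toy_stopwords}),
    "complex_words": len({w for w in tokens if w.isalnum() and len(w) > k}),
}

for name, value in measures.items():
    print(name, value)
```

The counts are case-sensitive, so "whale" and "Whale" are tallied as two different entries.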

def vocab_size(text, k=15, r=0):
    """This function calculates and prints the results for a text's vocabulary.
    It returns vocabulary in terms of unique:
    1. Tokens
    2. Words or Numbers
    3. Words Only
    4. Words (less English stopwords)
    5. Complex Words > Length k (default = 15)"""

    #Default Option, Print
    if r == 0:
        #Unique or "u" Sets
        u_tokens = len(set(text))
        u_words_num = len(set([w for w in text if w.isalnum()]))
        u_words = len(set([w for w in text if w.isalpha()]))
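        #Build the Stopword Set Once Rather Than Reloading the List for Every Token
        english_stopwords = set(stopwords.words('english'))
        u_words_less_stopwords = len(set([w for w in text if w.lower() not in english_stopwords]))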
        u_complex_words = len(set([w for w in text if w.isalnum() and len(w.lower()) > k ]))

        print ("__"*30, '\n', "SELECTED TEXT:", '\n', str(text), '\n'*2, \
               "The text has the following vocabulary: ", '\n', \
               "Unique Tokens: ", u_tokens, '\n', \
               "Unique Words or Numbers: ", u_words_num, '\n', \
               "Unique Words: ", u_words,'\n', \
               "Unique Words (Less Stopwords): ", u_words_less_stopwords, '\n', \
               "Unique Complex Words (> ", k, " Characters): ", u_complex_words, '\n', \
               "__"*30 \
               )

    #Option to Return Complex Words
    elif r == 1:
        ucw = [w.lower() for w in text if w.isalnum() and len(w.lower()) > k ]
        return ucw
    else:
        pass

Let's try to implement this function on several texts:

#Check Vocabulary Function on Moby Dick
vocab_size(text1)

____________________________________________________________
 SELECTED TEXT:
 <Text: Moby Dick by Herman Melville 1851>

 The text has the following vocabulary:
 Unique Tokens:  19317
 Unique Words or Numbers:  19225
 Unique Words:  19032
 Unique Words (Less Stopwords):  19007
 Unique Complex Words (>  15  Characters):  24
 ____________________________________________________________

#Check Vocabulary Function on Genesis
vocab_size(text3)

____________________________________________________________
 SELECTED TEXT:
 <Text: The Book of Genesis>

 The text has the following vocabulary:
 Unique Tokens:  2789
 Unique Words or Numbers:  2776
 Unique Words:  2776
 Unique Words (Less Stopwords):  2616
 Unique Complex Words (>  15  Characters):  0
 ____________________________________________________________

Notice that Genesis has considerably fewer unique words than Moby Dick and that it contains no words longer than 15 characters.

We can lower this threshold to 10 characters and compute again.

#Check Vocabulary Function on Genesis
#Change Complex Vocab Size to 10
vocab_size(text3, 10)

____________________________________________________________
 SELECTED TEXT:
 <Text: The Book of Genesis>

 The text has the following vocabulary:
 Unique Tokens:  2789
 Unique Words or Numbers:  2776
 Unique Words:  2776
 Unique Words (Less Stopwords):  2616
 Unique Complex Words (>  10  Characters):  64
 ____________________________________________________________

Plots of Top Complex Words for Genesis vs. Moby Dick¶

Perhaps we would like to view graphs of the top complex words in these texts:

Plots for Genesis¶

##### Top Figure: Frequency Per Word
##### Bottom Figure: Cumulative Frequency

#Command All Matplotlib Graphs to Appear Inline in the Notebook
%matplotlib inline

fdist3c = FreqDist(vocab_size(text3, 10, 1))
fdist3c.plot(25, title="Top Complex Words (> 10 Characters) in Genesis", cumulative=False)
fdist3c.plot(25, title="Top Complex Words (> 10 Characters) in Genesis", cumulative=True)

Plots for Moby Dick¶

##### Top Figure: Frequency Per Word
##### Bottom Figure: Cumulative Frequency

fdist1c = FreqDist(vocab_size(text1, 10, 1))
fdist1c.plot(25, title="Top Complex Words (> 10 Characters) in Moby Dick", cumulative=False)
fdist1c.plot(25, title="Top Complex Words (> 10 Characters) in Moby Dick", cumulative=True)
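The plots themselves do not appear in a text export, so a text-only view of the same information can help. This sketch uses collections.Counter, which NLTK 3's FreqDist subclasses, so most_common() is available on both. The complex_words list is invented and stands in for the list returned by vocab_size(text, 10, 1); it is not real output.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical list of long words, standing in for vocab_size(text, 10, 1).
complex_words = ["circumcised", "generations", "circumcised",
                 "commandment", "circumcised", "generations"]

counts = Counter(complex_words)
top = counts.most_common(3)
# Running totals, i.e. the values a cumulative plot would show.
cumulative = list(accumulate(n for _, n in top))

for (word, n), total in zip(top, cumulative):
    print(f"{word:15} {n:3} {total:3}")
```

The second column corresponds to the per-word frequency plot and the third to the cumulative one.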