Before we begin the problem set, let us import the required modules and data.
#Import NLTK and Texts
from nltk import *
from nltk.book import *
from nltk.corpus import stopwords
Instructions: Use text9.index() to find the index of the word sunset. You'll need to insert this word as an argument between the parentheses. By a process of trial and error, find the slice for the complete sentence that contains this word.
To begin this problem, let us first confirm which text "text9" refers to and find the location of the specified keyword, "sunset."
#Title of Text9
text9
#Index Number for the Location of the Word "Sunset"
text9.index("sunset")
Now that the location is found, we simply set a slice range (found by trial and error) around the keyword and use Python's string methods to clean the tokens up into a nicely formatted sentence.
#Isolate the Tokens in Text9 and Join Them Into a Clean Sentence
s0 = text9[621:644]
s1 = " ".join(s0)
s2 = s1.replace(" ,", ",").replace(" .", ".").replace("THE", "The")
#Print the Cleaned Sentence
print(s2)
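As an alternative to trial and error, we could also locate the sentence bounds programmatically. The sketch below is my own addition rather than part of the exercise: it scans outward from the index of "sunset" until it meets a sentence-ending token, which is a rough heuristic (it would be fooled by abbreviations such as "Mr.").
#Locate the Nearest Sentence-Ending Tokens Around the Keyword
idx = text9.index("sunset")
start = idx
while start > 0 and text9[start - 1] not in ".!?":
    start -= 1
end = idx
while end < len(text9) - 1 and text9[end] not in ".!?":
    end += 1
#Join the Tokens of the Recovered Sentence
print(" ".join(text9[start:end + 1]))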
Instructions: What is the difference between the following two lines? Which one will give a larger value? Will this be the case for other texts?
>>> sorted(set(w.lower() for w in text1))
>>> sorted(w.lower() for w in set(text1))
Both lines produce a sorted list of lower-cased tokens, but they differ in where set() is applied. The first expression lowercases every token before building the set, so case variants such as "The" and "the" collapse into a single entry. The second builds the set first and lowercases afterwards, so "The" and "the" enter the set as separate items and survive as duplicate entries in the resulting list. To evaluate which gives the larger value, we can use the len() function.
First, let us assign variable names to the two results for text1: set1_1 for the first expression and set2_1 for the second.
set1_1 = sorted(set(w.lower() for w in text1))
set2_1 = sorted(w.lower() for w in set(text1))
Now, let us print the number of items in each set.
print ("Number of items in Set1 =", len(set1_1))
print ("Number of items in Set2 =", len(set2_1))
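To see why the second expression can never be smaller than the first, consider a tiny hand-made token list (a minimal sketch, not drawn from the NLTK texts):
#Toy Token List Containing the Same Word in Several Case Forms
tokens = ["The", "the", "THE", "dog"]
#Lowercase First, Then Deduplicate: the Case Variants Collapse Into One Entry
print(sorted(set(w.lower() for w in tokens)))     # ['dog', 'the']
#Deduplicate First, Then Lowercase: Each Case Variant Survives as a Duplicate
print(sorted(w.lower() for w in set(tokens)))     # ['dog', 'the', 'the', 'the']
The same logic applies to the full texts: the second expression always has exactly len(set(text1)) items, while the first can only be smaller or equal, and it is strictly smaller whenever a word appears in more than one case form.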
To make the comparison easy to repeat, let us wrap it in a helper function, bigger_set(), which reports which expression is larger for a given text.
def bigger_set(selected_text, print_option=0):
    text = selected_text
    set1 = sorted(set(w.lower() for w in text))
    set2 = sorted(w.lower() for w in set(text))
    #Raise an Error if the Print Option Is Not 0 or 1
    #Depending on Future Usage, Printing May or May Not Be Desired
    assert print_option == 1 or print_option == 0, \
        "print_option is not equal to 0 or 1. " \
        "Please input either 0 or 1. 0 = Do not print (the default). 1 = Print results."
    #Get a Clean Title (the Part Before " by " in the Text's Repr)
    clean_title0 = str(text).replace("<Text: ", "").replace(">", "")
    c2 = clean_title0.split(" by ")[0].upper()
    conclusion = "for the text: " + '\n' + " " + c2 + "."
    #Set Initial Counter Values
    x = 0  #Set1 is Bigger
    y = 0  #Set2 is Bigger
    z = 0  #Set1 and Set2 are Equal
    if len(set1) > len(set2):
        statement = " Set1 is greater than Set2,"
        x += 1
    elif len(set1) < len(set2):
        statement = " Set2 is greater than Set1,"
        y += 1
    else:
        statement = " Set1 is equal in length to Set2,"
        z += 1
    if print_option == 1:
        #Print Results
        print("__" * 30, '\n',
              "SELECTED TEXT:", '\n', str(selected_text),
              '\n' * 2, "Results of Text Analysis: ")
        print(" Set1 had a length of ", len(set1), "tokens.", '\n',
              "Set2 had a length of ", len(set2), "tokens.",
              '\n' * 2, "Thus:")
        print(statement, conclusion)
        print("__" * 30, '\n')
    return x, y, z
bigger_set(text1, 1)
bigger_set(text3, 1)
Finally, to answer whether the result holds for other texts, let us run the comparison over all nine texts in nltk.book and tally the outcomes.
texts = [text1, text2, text3, text4, text5, text6, text7, text8, text9]
set1_bigger = 0
set2_bigger = 0
sets__equal = 0
for text in texts:
    #Call bigger_set Once Per Text and Unpack the Three Counters
    x, y, z = bigger_set(text, 0)
    set1_bigger += x
    set2_bigger += y
    sets__equal += z
print("__" * 30, '\n',
      "Total corpora = ", len(texts), '\n',
      "Set1 is bigger in ", set1_bigger, "cases.", '\n',
      "Set2 is bigger in ", set2_bigger, "cases.", '\n',
      "The sets are equal in ", sets__equal, "cases.", '\n', "__" * 30
      )
Instructions: What is the difference between the following two tests: w.isupper() and not w.islower()?
If we look at these two tests,
>>> w.isupper()
and
>>> not w.islower()
we might at first think they give the same result. In fact, w.isupper() is true only when every cased character in the token is upper-case (and there is at least one), while not w.islower() is true whenever the token is not entirely lower-case, including tokens with no cased characters at all. Thus words like "MY" or "TITLE" pass w.isupper(), while "TITLE", "The", and "End", as well as punctuation such as "]" or ".", all pass not w.islower().
We can see this below:
#Sample 1
sample_isupper = [w for w in text1 if w.isupper()]
print ("Sample for 'w.isupper()':", '\n', sample_isupper[:10], '\n', \
"Total tokens: ", len(sample_isupper))
#Sample 2
sample_not_islower = [w for w in text1 if not w.islower()]
print ("Sample for 'not w.islower()':", '\n', sample_not_islower[:10], '\n', \
"Total tokens: ", len(sample_not_islower))
Thus, not only do the two tests select quite different tokens, but not w.islower() also matches a much larger set of them.
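To make the gap concrete, a small sketch (my own addition) lists the tokens that pass not w.islower() but fail w.isupper(): capitalized and mixed-case words, punctuation, and numbers.
#Tokens That Are Not All Lower-Case but Also Not All Upper-Case
gap = [w for w in text1 if not w.islower() and not w.isupper()]
print("Sample for the difference:", '\n', gap[:10], '\n',
      "Total tokens: ", len(gap))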
Instructions: Write the slice expression that extracts the last two words of text2.
def last_two(text):
    '''This function prints the last two words of a text.'''
    #Restrict to Alphabetic Tokens So Trailing Punctuation Is Ignored
    text_words = [w for w in text if w.isalpha()]
    last_two_words = text_words[-2:]
    print(last_two_words)
last_two(text2)
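For completeness, the literal answer the exercise asks for is a single slice expression; note that it returns the last two tokens exactly as they appear, so punctuation is not filtered out.
#The Slice Expression for the Last Two Tokens of Text2
print(text2[-2:])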
Instructions: Review the discussion of looping with conditions in 4. Use a combination of for and if statements to loop over the words of the movie script for Monty Python and the Holy Grail (text6) and print all the uppercase words, one per line.
For this problem, it is easiest to write a simple function that prints the requested results.
def print_upper(text):
    """Prints All Uppercase Words, One Per Line"""
    for w in text:
        #Print Every All-Uppercase Word, Whatever Its Length
        if w.isupper():
            print(w)
#To Print, Uncomment the Line Below
#print_upper(text6)
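Equivalently, the same loop-and-test logic can be written in a single line (also left commented out, since it prints a long list):
#Same Output as print_upper(text6), One Uppercase Word Per Line
#print("\n".join(w for w in text6 if w.isupper()))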
Instructions: Define a function called vocab_size(text) that has a single parameter for the text, and which returns the vocabulary size of the text.
We can define a function that evaluates vocabulary size in several ways. For example, we might count unique tokens (which include punctuation, words, and numbers); only words and numbers; just words; words excluding common English 'stopwords' such as "a," "the," "I," or "we"; or only complex words over a certain length.
def vocab_size(text, k=15, r=0):
    """This function calculates and prints a text's vocabulary size.
    It reports vocabulary in terms of unique:
    1. Tokens
    2. Words or Numbers
    3. Words Only
    4. Words (less English stopwords)
    5. Complex Words > Length k (default = 15)"""
    #Default Option: Print the Results
    if r == 0:
        #Build the Set of English Stopwords Once, Rather Than Per Token
        english_stopwords = set(stopwords.words('english'))
        #Unique or "u" Sets
        u_tokens = len(set(text))
        u_words_num = len(set(w for w in text if w.isalnum()))
        u_words = len(set(w for w in text if w.isalpha()))
        #Restrict to Alphabetic Words Before Removing Stopwords
        u_words_less_stopwords = len(set(w for w in text
                                         if w.isalpha() and w.lower() not in english_stopwords))
        u_complex_words = len(set(w for w in text if w.isalnum() and len(w) > k))
        print("__" * 30, '\n', "SELECTED TEXT:", '\n', str(text), '\n' * 2,
              "The text has the following vocabulary: ", '\n',
              "Unique Tokens: ", u_tokens, '\n',
              "Unique Words or Numbers: ", u_words_num, '\n',
              "Unique Words: ", u_words, '\n',
              "Unique Words (Less Stopwords): ", u_words_less_stopwords, '\n',
              "Unique Complex Words (> ", k, " Characters): ", u_complex_words, '\n',
              "__" * 30
              )
    #Option to Return the Complex Words Themselves
    elif r == 1:
        return [w.lower() for w in text if w.isalnum() and len(w) > k]
Let us apply this function to several texts:
#Check Vocabulary Function on Moby Dick
vocab_size(text1)
#Check Vocabulary Function on Genesis
vocab_size(text3)
Notice that Genesis has considerably fewer unique words than Moby Dick, and that it has no complex words longer than 15 characters.
We can lower this threshold to 10 characters and compute again.
#Check Vocabulary Function on Genesis
#Change Complex Vocab Size to 10
vocab_size(text3, 10)
Perhaps we would like to view graphs of the top complex words in these texts:
#Display All Matplotlib Graphs Inline in the Notebook
%matplotlib inline
fdist3c = FreqDist(vocab_size(text3, 10, 1))
fdist3c.plot(25, title="Top Complex Words (> 10 Characters) in Genesis", cumulative=False)
fdist3c.plot(25, title="Top Complex Words (> 10 Characters) in Genesis", cumulative=True)
fdist1c = FreqDist(vocab_size(text1, 10, 1))
fdist1c.plot(25, title="Top Complex Words (> 10 Characters) in Moby Dick", cumulative=False)
fdist1c.plot(25, title="Top Complex Words (> 10 Characters) in Moby Dick", cumulative=True)