This is a lab assignment, so you will not be submitting it. However, the concepts and practice will help you on both the homework and the exams, so I encourage you to make a serious effort on it during class and to consider finishing it outside of class.

I recommend making a folder for today’s lab in COURSES as you usually do.

Exercise 0

(This is a repeat of the demonstration at the start of class.)

Download the starter files for today’s lab and copy those files into your folder for today.

The zip that you downloaded contains:

  • the beginnings of an implementation that will let us count n-grams
  • A file with some testing code for
  • corpora/: a directory of text files written by various famous authors.

Your Task

Open and take a look at getOneGrams(), a function that should count all the one-grams (i.e., words, otherwise known as “unigrams”) in a given file and return the counts as a Python dictionary. You’ll see that there is currently pseudocode where the main functionality should be. Turn that pseudocode into Python to complete the function. When you are done, running the code should print out the top 10 most frequent words in Hamlet. Note: do not remove the code with the weird punctuation !@#$%^&*()_+-=;:",./<>?\\. That code is getting rid of punctuation for you.

You should get:

the 1097
and 899
to 742
of 658
you 550
i 536
my 514
a 511
it 415
in 414
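
If you get stuck, here is a minimal sketch of the counting idea on a couple of in-memory lines rather than a file. It is not the starter code: the names count_one_grams and top_words are made up for illustration, string.punctuation stands in for the punctuation string in the starter file, and lowercasing each line is an assumption based on the lowercase words in the expected output above.

```python
import string


def count_one_grams(lines):
    """Count each word in an iterable of text lines (hypothetical helper)."""
    # Translation table that deletes every ASCII punctuation character.
    # Assumption: this stands in for the starter file's punctuation string.
    strip_punct = str.maketrans("", "", string.punctuation)
    counts = {}
    for line in lines:
        # Lowercase, drop punctuation, then split on whitespace.
        for word in line.lower().translate(strip_punct).split():
            # Start unseen words at 0, then add one for this occurrence.
            counts[word] = counts.get(word, 0) + 1
    return counts


def top_words(counts, n=10):
    """Return the n most frequent (word, count) pairs, highest count first."""
    return sorted(counts.items(), key=lambda pair: pair[1], reverse=True)[:n]


sample = [
    "To be, or not to be: that is the question.",
    "Whether 'tis nobler in the mind to suffer",
]
for word, count in top_words(count_one_grams(sample), n=3):
    print(word, count)
```

In your own function the lines would come from the opened file instead of a list, but the dictionary update works the same way.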

Exercise 1

a. Change the code to print out the top 10 words from A Midsummer Night’s Dream. Despite being a very different play, the words won’t have changed much. While we can learn a surprising amount from these small “function” words, they do tend to drown out more interesting topic/content words in frequency analysis. An easy way to address this issue is to simply ignore such words when making our counts. Words that a researcher removes before performing analysis are typically called stopwords.

b. Add another parameter to the getOneGrams() function called ignoreStopwords. This parameter should contain a Boolean value indicating whether the function should skip stopwords (True) or count them like any other word (False). To implement this, you will need to:

  • Check whether or not ignoreStopwords is True, using a conditional statement
  • Read in all of the words from corpora/stopwords.txt and store them in a list (take a glance inside this file to see what it contains)
  • For each word encountered in the given file, check whether that word is in the stopword list, and skip it if so rather than adding to its count in the oneGramCounts dictionary.

Be sure to add code to main() to test your new functionality. What words pop up now? Do they tell you more about what the play is about?
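
The three steps above might look roughly like the sketch below. The names load_stopwords and count_without_stopwords and the parameter ignore_stopwords are hypothetical. The sketch also assumes that the words in the stopword file are separated by whitespace, so check the file before relying on that.

```python
def load_stopwords(path):
    """Read whitespace-separated stopwords from a file (assumed format)."""
    with open(path, encoding="utf-8") as stopword_file:
        return stopword_file.read().lower().split()


def count_without_stopwords(words, stopwords, ignore_stopwords=False):
    """Count words, skipping stopwords only when ignore_stopwords is True."""
    # With the flag off, the set is empty, so nothing is ever skipped.
    stop_set = set(stopwords) if ignore_stopwords else set()
    counts = {}
    for word in words:
        if word in stop_set:
            continue  # Stopword: leave it out of the counts entirely.
        counts[word] = counts.get(word, 0) + 1
    return counts


words = ["the", "fairy", "and", "the", "forest"]
print(count_without_stopwords(words, ["the", "and"], ignore_stopwords=True))
print(count_without_stopwords(words, ["the", "and"], ignore_stopwords=False))
```

The sketch converts the stopwords to a set because checking membership in a set is faster than in a list. A list, as described in the steps above, gives the same counts.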

Exercise 2

For your second task of this lab, you will be implementing getTwoGrams(), a function similar to getOneGrams(). As the name suggests, this function will count all of the two-grams (or “bigrams”) contained in a given file, returning the counts as a dictionary.

Much of this function will do the same thing as getOneGrams(), but think about what extra variables you will need and how you will go about updating them. How will you keep track of not just the current word, but also the previous word? How will you make sure that you count two-grams that straddle two lines of text?

Note: do not bother to exclude stopwords as you count bigrams.

As you implement this function, add code to the main() function to test that your function is working.
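
One possible way to handle the previous-word question is sketched below, again on in-memory lines. The name count_two_grams is hypothetical, string.punctuation stands in for the starter file's punctuation string, and storing each bigram as a tuple key is an assumption. The key point is that the previous word is stored outside the loop over lines, so it carries over from the end of one line to the start of the next.

```python
import string


def count_two_grams(lines):
    """Count adjacent word pairs, including pairs that span a line break."""
    strip_punct = str.maketrans("", "", string.punctuation)
    counts = {}
    previous = None  # Set once, before the loops, so it survives line breaks.
    for line in lines:
        for word in line.lower().translate(strip_punct).split():
            if previous is not None:
                pair = (previous, word)
                counts[pair] = counts.get(pair, 0) + 1
            previous = word  # The current word becomes the next pair's first half.
    return counts


for pair, count in count_two_grams(["to be or", "not to be"]).items():
    print(pair, count)
```

Here ('or', 'not') is counted even though the two words sit on different lines, because previous is never reset between lines.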

Exercise 3

The code you were given removes all punctuation before counting any n-grams. This lets us keep track of all occurrences of each word, regardless of whether or not they might come at the end of a sentence, or after a quotation mark. However, we can incorporate punctuation into our n-grams if we want to. In particular, terminal punctuation marks (periods, question marks, exclamation points) contain useful information. If, say, the word “goodbye” occurs frequently at the end of a sentence, that is something we can capture by treating ('goodbye', '.') as a bigram.

Update your getTwoGrams() function so that it counts periods, exclamation points, and question marks as their own words when counting bigrams. (We can use the word “token” to refer to a single countable unit that could be a word or a punctuation mark.) This should be doable just by adding/changing a few lines of code!
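
As a hint, one way to do this (a sketch, not the only approach) is to stop deleting the three terminal marks and instead pad each one with spaces, so that splitting on whitespace yields it as its own token. The names TERMINALS, tokenize, and count_token_bigrams are hypothetical, and string.punctuation again stands in for the starter file's punctuation string.

```python
import string

TERMINALS = ".?!"


def tokenize(line):
    """Split a line into lowercase words plus terminal punctuation tokens."""
    # Delete every punctuation character except the terminal marks.
    other_punct = "".join(ch for ch in string.punctuation if ch not in TERMINALS)
    line = line.lower().translate(str.maketrans("", "", other_punct))
    # Surround each terminal mark with spaces so split() separates it.
    for mark in TERMINALS:
        line = line.replace(mark, " " + mark + " ")
    return line.split()


def count_token_bigrams(lines):
    """Count adjacent token pairs, treating '.', '?' and '!' as tokens."""
    counts = {}
    previous = None
    for line in lines:
        for token in tokenize(line):
            if previous is not None:
                pair = (previous, token)
                counts[pair] = counts.get(pair, 0) + 1
            previous = token
    return counts


print(tokenize('Goodbye. "Who goes there?"'))
print(count_token_bigrams(["Goodbye.", "Goodbye."]))
```

With this tokenizer, a word that often ends a sentence shows up as a frequent bigram with '.' as its second token.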