Thursday, September 6, 2012

So, Jon, how's the dissertation coming along?

I fired off the first chapter for initial review very recently, thank you.

What a slog! I actually started in the early part of 2012 and had just a terrible time entering into the topic. I'm not exactly sure why the writing was so difficult early on, but I suppose it had something to do with my ambition outstripping my ability: I had so many things to write about and comment upon (and research) that I lost sense of how to build a coherent argument out of everything.

Quick stats:
  • 44 total pages. That will grow by 10-20 percent after comments come back.
  • About 11,500 words.
  • 22 footnotes.
  • 47 works cited entries.
  • Six figures: two images, one reproduction of a diagram, and three original figures.
  • Top words:
    • Text - 86 times.
    • HMMs - 72 times.
    • Markov - 67 times.
    • Textual - 56 times.
    • Probability - 32 times.
  • Favorite words:
    • Allophones.
    • Consilience. 
    • Deformity.
    • Dimensions.
  • Favorite 3-word phrases:
    • Law of large.
    • Literary and textual.
    • Andrei Andreevich Markov.

I finally started to make progress when I focused on the section I absolutely knew had to be in the chapter: the explanation of hidden Markov models. Once I set that section down, I began to see the shape of other sections. Then, I couldn't stop writing.

Lesson learned. For chapter 2, I'll not be fancy. I'll just get to writing.

Wednesday, September 7, 2011

Tuesday, August 16, 2011

The Hidden Meaning of Pronouns

Scientific American has a sensationally titled interview with psychologist James Pennebaker on language use.

The entire interview is worth reading, but the following bit should interest anyone with a bent toward computational analysis of language:
Historians and biographers should jump on this new technology. The recent release of the Google Books Project should be required reading for everyone in the humanities. For the first time in the history of the world, there are methods by which to analyze tremendously large and complex written works by authors from all over the world going back centuries. We can begin to see how thinking, emotional expression, and social relations evolve as a function of world-wide events. The possibilities are breathtaking.

In my own work, we have analyzed the collected works of poets, playwrights, and novelists going back to the 1500s to see how their writing changed as they got older. We’ve compared the pronoun use of suicidal versus non-suicidal poets. Basically, poets who eventually commit suicide use I-words more than non-suicidal poets.

The analysis of language style can also serve as a psychological window into authors and their relationships. We have analyzed the poetry of Elizabeth Barrett and Robert Browning and compared it with the history of their marriage. Same thing with Ted Hughes and Sylvia Plath. Using a method we call Language Style Matching, we can isolate changes in the couples’ relationships.
h/t Kim Salazar

Friday, August 5, 2011

Three Questions to Ask about Hidden Markov Models

Continuing through Chapter 9 of Foundations of Statistical Natural Language Processing (1999) by Christopher D. Manning and Hinrich Schütze.

Manning and Schütze describe three fundamental questions to ask about Hidden Markov Models (HMMs):
  1. Given a certain HMM (knowing its state transition probabilities, symbol emission probabilities, and initial state probabilities), how do we efficiently compute how likely a certain observation is?
  2. Given an observation sequence and a certain HMM, how do we choose a state sequence that best explains the observations?
  3. Given an observation sequence and a space of possible HMMs found by varying the model parameters (e.g., state transition probabilities, symbol emission probabilities, and initial state probabilities), how do we find the HMM that best explains the observed data?
Future posts will address how to approach these questions.
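
In the meantime, here is a minimal Python sketch of the usual textbook answers to the first two questions: the forward algorithm for question 1 and the Viterbi algorithm for question 2. It assumes a simple state-emission HMM, which may differ in detail from the book's formulation. The weather states, the activities, and every probability are invented purely for illustration. Question 3 is usually approached with the Baum-Welch (forward-backward) algorithm, which is not shown here.

```python
# Toy HMM. All names and numbers below are hypothetical, chosen only
# to make the two algorithms easy to trace by hand.
STATES = ("Rainy", "Sunny")
START = {"Rainy": 0.6, "Sunny": 0.4}            # initial state probabilities
TRANS = {                                        # state transition probabilities
    "Rainy": {"Rainy": 0.7, "Sunny": 0.3},
    "Sunny": {"Rainy": 0.4, "Sunny": 0.6},
}
EMIT = {                                         # symbol emission probabilities
    "Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
    "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},
}


def forward(obs):
    """Question 1: probability of the whole observation sequence."""
    # alpha[s] = probability of the observations so far, ending in state s
    alpha = {s: START[s] * EMIT[s][obs[0]] for s in STATES}
    for symbol in obs[1:]:
        alpha = {
            s: sum(alpha[p] * TRANS[p][s] for p in STATES) * EMIT[s][symbol]
            for s in STATES
        }
    return sum(alpha.values())


def viterbi(obs):
    """Question 2: most probable state sequence and its probability."""
    # delta[s] = probability of the best path so far that ends in state s
    delta = {s: START[s] * EMIT[s][obs[0]] for s in STATES}
    backpointers = []
    for symbol in obs[1:]:
        step, new_delta = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: delta[p] * TRANS[p][s])
            step[s] = best_prev
            new_delta[s] = delta[best_prev] * TRANS[best_prev][s] * EMIT[s][symbol]
        backpointers.append(step)
        delta = new_delta
    # Follow the backpointers from the best final state to recover the path.
    last = max(STATES, key=lambda s: delta[s])
    path = [last]
    for step in reversed(backpointers):
        path.append(step[path[-1]])
    path.reverse()
    return path, delta[last]


if __name__ == "__main__":
    observations = ["walk", "shop", "clean"]
    print(forward(observations))
    print(viterbi(observations))
```

The two functions differ in only one operation: forward sums over the previous states, while Viterbi takes the maximum and remembers which previous state produced it.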

* * * * *

In other news, I've intermittently been playing with Python and the Natural Language Toolkit (NLTK).

Monday, July 11, 2011

The Return of Pierre Menard

In the guise of Sean "Puffy" Combs. This is a rather old item from my favorite news source, The Onion:
NEW YORK—Noted rapper/producer Sean "Puffy" Combs released his hotly anticipated new single Tuesday, "Tha Kidd (Is Not My Son)," which samples Michael Jackson's 1983 smash "Billie Jean" in its entirety and adds nothing. "When I was in the studio mixing and recording, I decided 'Tha Kidd' would work best if I kept all the music and vocals from the original version and then didn't rap over it," Combs said. "So what I did is put in a tape with 'Billie Jean' on it, and then I hit record. The thing turned out great." Combs' current number-one hit, "Eye Of The Tiger," is dedicated to slain rapper Notorious B.I.G.

Friday, July 1, 2011


No kidding. Rajesh Rao's work on the Indus script is the direct inspiration for what I seek to do in my dissertation on Old English texts.

Monday, June 6, 2011

Python for Newbies

The Python folks have a page with some excellent links for non-programmers. I've listed the ones that most interest me. Notice the links for children at the end:
  • How to Think Like a Computer Scientist - 2nd edition Allen Downey's open source textbook has a Python version, written with Jeff Elkner. It's also available in book form. The current version is the 2nd edition.

  • The Programming Historian From the "About This Book" page: "This book is a tutorial-style introduction to programming for practicing historians. We assume that you're starting out with no prior programming experience and only a basic understanding of computers. More experience, of course, won't hurt. Once you know how to program, you will find it relatively easy to learn new programming languages and techniques, and to apply what you know in unfamiliar situations."

  • Learning to Program An introduction to programming for those who have never programmed before, by Alan Gauld. It introduces several programming languages but has a strong emphasis on Python.

  • A Byte of Python, by Swaroop C.H., is also an introductory text for people with no previous programming experience.

  • One Day of IDLE Toying A very gentle introduction to the IDLE development environment that comes with Python. This tutorial by Danny Yoo has been translated into nine different languages.

  • Free Python video lectures are also available as a course titled Intro to programming with Python and Tkinter. Unix users can view the video using mplayer once they have downloaded the files. Windows users will need to have a DivX player, available from (One user reports success viewing the videos on OS X 10.4 using the VLC player --

  • A Non-Programmer's Tutorial for Python 3 on Wikibooks.

  • Learning Python (for the complete nOOb) by Derrick Wolters. A beginner's tutorial to learn how to program in Python.

  • Beginning Python for Bioinformatics by Patrick O'Brien. An introduction to Python aimed at biologists that introduces the PyCrust shell and Python's basic data types.

  • Two courses from the Pasteur Institute are aimed at biologists but are useful to anyone wanting to learn Python. Both tutorials are quite extensive, covering data types, object-oriented programming, files, and even design patterns.
  • Python Tutorial This tutorial is part of Python's documentation set and is updated with each new release. It's not written with non-programmers in mind, but skimming through it will give you an idea of the language's flavor and style.

  • Invent Your Own Computer Games with Python, 2nd Ed, by Al Sweigart is a free e-Book that teaches complete beginners how to program by making games.

  • LiveWires A set of Python lessons used during 1999, 2000, 2001 and 2002 children's summer camps in Britain by Richard Crook, Gareth McCaughan, Mark White, and Rhodri James. Aimed at children 12-15 years old.

  • Guido van Robot A teaching tool in which students write simple programs using a Python-like language to control a simulated robot. Field-tested at Yorktown High School, the project includes a lesson plan.

  • PythonTurtle A learning environment for Python suitable for beginners and children, inspired by Logo. Geared mainly towards children, but known to be successful with adults as well.

My favorite resource right now is a video series by Bucky Roberts.