If I were to publish a standard concordance of every word (the software to build one is freely available), each entry would show a few words before and a few words after the entry word. A clever enough computer could piece the text of The Hobbit back together from all those overlapping words, as you can see:
- in a HOLE in the ground
- a hole IN the ground there
- hole in THE ground there lived
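To see why overlapping entries give the text away, here is a minimal Python sketch. The function names and the window size (two words before, three after, matching the entries above) are my own assumptions for illustration, and it assumes no left context repeats, which a full book would not guarantee.

```python
import random

def kwic_entries(words, before=2, after=3):
    """Build one (left context, keyword, right context) entry per word."""
    entries = []
    for i, word in enumerate(words):
        left = tuple(words[max(0, i - before):i])
        right = tuple(words[i + 1:i + 1 + after])
        entries.append((left, word, right))
    return entries

def reconstruct(entries, before=2):
    """Chain entries by matching each left context to the text rebuilt so far."""
    # Assumption: every left context is unique. Repeated contexts would
    # need backtracking, which this sketch does not attempt.
    by_left = {left: keyword for left, keyword, _right in entries}
    text = [by_left[()]]  # the entry with no left context starts the text
    while tuple(text[-before:]) in by_left:
        text.append(by_left[tuple(text[-before:])])
    return text

words = "in a hole in the ground there lived a hobbit".split()
entries = kwic_entries(words)
random.shuffle(entries)  # the order in which entries are listed does not matter
rebuilt = reconstruct(entries)
print(" ".join(rebuilt))
```

Even with the entries shuffled, the overlaps are enough to put the words back in their original order.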
This is my understanding of scholarly fair use: I may chop up the words and write about them, but not in a way that lets your computer put the text back together. My idea was to chop the text into approximate phrases with no overlap between them. You may know that “in a hole” and “in the ground” are both in paragraph [01.001], but you don’t know in what order. I marked up my hand-typed copy with [paragraph number] xx at the start of each paragraph and xx wherever I wanted to chop phrases apart. Chopping apart phrases was a story in itself; I’m sure a post about it will come later.
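As a rough illustration of the idea, and not the actual script, here is a short Python sketch that reads one marked-up paragraph and indexes each word by its phrase and paragraph number. The sample line, the function names, and the exact handling of the xx marker are assumptions based only on the description above; the real markup may differ.

```python
import re
from collections import defaultdict

# Assumed line format: "[paragraph number] xx phrase xx phrase ..."
PARAGRAPH = re.compile(r"^\[([^\]]+)\]\s*(.*)$", re.DOTALL)

def split_phrases(line):
    """Return (paragraph number, list of non-overlapping phrases)."""
    match = PARAGRAPH.match(line.strip())
    if match is None:
        raise ValueError(f"no paragraph number found in: {line!r}")
    para_id, body = match.groups()
    # Split on the standalone marker "xx" and tidy up the whitespace.
    phrases = [" ".join(p.split()) for p in re.split(r"\bxx\b", body)]
    return para_id, [p for p in phrases if p]

def build_concordance(lines):
    """Map each word to the (paragraph number, phrase) pairs containing it."""
    index = defaultdict(set)
    for line in lines:
        para_id, phrases = split_phrases(line)
        for phrase in phrases:
            for word in re.findall(r"[a-z']+", phrase.lower()):
                index[word].add((para_id, phrase))
    # Sort phrases alphabetically within a paragraph, so the listing
    # does not reveal the order in which they appeared in the text.
    return {word: sorted(hits, key=lambda h: (h[0], h[1].lower()))
            for word, hits in sorted(index.items())}

sample = ["[01.001] xx In a hole xx in the ground xx there lived a hobbit."]
concordance = build_concordance(sample)
for para_id, phrase in concordance["in"]:
    print(f"[{para_id}] {phrase}")
```

Because each phrase is stored whole and no phrase shares words with its neighbors, nothing in the index links one phrase to the next.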
Given that text preparation, my son wrote a Python script to make the concordance and index. For your own copy of the script, which he publishes under a Lesser General Public License, click here. You’ll find a Read Me, instructions, the concordance script, and the other scripts he created for this project.