The terminal where it happens - basic text analysis using Unix/Linux utilities
/ 7 min read
This post was originally a Twitter thread under the hashtag #7DaysOfUnix. I kept running into the character limit and wanted to write it down in longer form, so here we are. Note that I’m using MacOS (Darwin), but all of these should work on Linux or any other flavor of Unix.
Unix/Linux utilities (henceforth referred to only as Unix utilities) are one of my favorite ways to do basic data analysis. When asked for some analysis, if the data is not in an easily queryable database, my first inclination is almost always to crack my knuckles and throw an army of Unix tools at the problem.
Introduction to some Unix commands
For this particular demonstration, I’m going to be listing out the top 10 words in the song “My Shot” from the musical Hamilton. Copy them down from genius.com and save it to a text file.
Okay, let’s look at the directory contents:
ls
Show me the contents of the file:
cat my-shot.txt
Here’s something very important about Unix. The Unix philosophy is all about minimalist, modular software development. Unix is full of small composable single-purpose tools which you glue together to build complex workflows. You can send the output of almost any command as the input to any other command using “pipes”. Pipes are a way for one process to communicate with another process. If you want to read more about pipes, this Wikipedia article has a lot of information.
For example, how many lines, words, and chars are in this song? You can directly use the wc
utility for that.
$ wc my-shot.txt
233 1160 6513 my-shot.txt
But you can also cat
out the contents of the file and pipe them over to wc
.
$ cat my-shot.txt | wc
233 1160 6513
What! Okay that’s cool but let’s do something useful. How often in the song do they say that one line?
$ cat my-shot.txt | grep "I am not throwing away my shot"
I am not throwing away my shot!
I am not throwing away my shot!
I am not throwing away my shot
I am not throwing away my shot
I am not throwing away my shot
I am not throwing away my shot
I am not throwing away my shot
I am not throwing away my shot
And I am not throwing away my shot
I am not throwing away my shot
How many lines is that? Ask our friend wc
again.
$ cat my-shot.txt | grep "I am not throwing away my shot" | wc
10 71 316
What about just the word “shot”?
$ cat my-shot.txt | grep "shot" | wc
32 221 1028
This is a good time to say that if you want to know the details of any command or utility, use the man
(manual) command. Example-
man wc
Recap: So far, we have learned about pipes, how to look up documentation with the man
command, and how to search through text output with grep
.
Pull up that grep
again, you may have noticed that we got a line we didn’t want.
$ cat my-shot.txt | grep "I am not throwing away my shot"
I am not throwing away my shot!
I am not throwing away my shot!
I am not throwing away my shot
I am not throwing away my shot
I am not throwing away my shot
I am not throwing away my shot
I am not throwing away my shot
I am not throwing away my shot
And I am not throwing away my shot <---- this one
I am not throwing away my shot
Let’s quickly clean that up.
$ cat my-shot.txt | grep -i "^I am not throwing away my shot"
I am not throwing away my shot!
I am not throwing away my shot!
I am not throwing away my shot
I am not throwing away my shot
I am not throwing away my shot
I am not throwing away my shot
I am not throwing away my shot
I am not throwing away my shot
I am not throwing away my shot
The ^
tells grep
to search from the beginning of a sentence. The -i
ensures that the search is case insensitive. Take a quick peek at the examples section of the manual.
man grep
Onward. I want to know how many times the word “shot” appears in the song. You know how to do this! Use the Unix philosophy Luke. Break the problem up into different pieces and glue them together.
What this means is: 1) Read from the file, 2) search for lines with the word “shot”, and 3) count them:
$ cat my-shot.txt | grep -i "shot" | wc
37 226 1058
It appears 226 times in 37 lines!
Quick tangent- let’s learn about a new utility. head
displays the first lines of a file. By default it shows the top 10 lines, but is configurable. You can see what other options it has by looking up the man page for head
.
man head
How many different words are in this song?
Back to our regular scheduled programming- what about other words in the song? You know how to figure out the number of instances of one word, but what if I want to know how many different words are in this song?
Let’s start by breaking up our sentences into words. The tr
command is SUPER useful if you want to translate one set of characters into another. Here I’m going to translate spaces into newlines. But to keep our console output manageable I’m going to tell head
only to show us 10 lines.
$ tr ' ' '\n' < my-shot.txt | head
[HAMILTON]
I
am
not
throwing
away
my
shot!
I
am
Oh that’s trippy. Now I want to sort all the words alphabetically to group them together. Sorting is easily accomplished using the…wait for it… sort
command.
tr ' ' '\n' < my-shot.txt | sort | head
But what on earth, it’s just blank. (Spoiler- nope, all the blank lines just sorted up to the top!) Let’s do some cleanup and remove the empty lines.
The sed
command (stream editor) takes patterns and we’re going to tell it here to delete anything that has no content between start and end.
$ tr ' ' '\n' < my-shot.txt | sed '/^$/d' | sort | head
&
&
&
&
&
&
&
'Onarchy?
'anarchy?'
'cause
And hey, the empty lines are gone!
How many different words are in this song? The uniq
command to the rescue! It does what you think it does.
$ tr ' ' '\n' < my-shot.txt | sed '/^$/d' | sort | uniq | head
&
'Onarchy?
'anarchy?'
'cause
'em
'n
'onarchy?
(And
(He
(Rise
And you already know how to count how many unique words we have in this song:
$ tr ' ' '\n' < my-shot.txt | sed '/^$/d' | sort | uniq | wc
517 517 356
So far we have figured out how to go from a file of song lyrics to being able to tell how many unique words are in that song. Next, let’s look through the man pages of uniq
.
man uniq
Hey that’s cool. The man page tells us that the uniq
command has a -c
flag which will tell you how many times a word appeared.
$ tr ' ' '\n' < my-shot.txt | sed '/^$/d' | sort | uniq -c
7 &
1 'Onarchy?
1 'anarchy?'
1 'cause
1 'em
2 'n
1 'onarchy?
1 (And
1 (He
4 (Rise
Kind of useful, but I can think of how we can make it way more useful. Since every line is now prefixed with a number, we can tell sort to them all of these again numerically. And then to make it even more useful, tell it to sort
them in reverse.
$ tr ' ' '\n' < my-shot.txt | sed '/^$/d' | sort | uniq -c | sort -nr | head
45 I
35 a
32 my
28 I'm
24 to
24 the
23 away
21 throwing
21 not
18 shot
There you have it. The top 10 most frequently occurring words in the song “My Shot”. If parties were a thing, you could totally drop this in conversation to impress bystanders.
If you’re a data scientist, Unix tools are invaluable for quickly spotting patterns or to just clean up your data. And honestly SO MUCH of data analysis / data science is cleaning up your data sources. I’d love to know if this post has been of some help to you. Jump over to Twitter and let me know!