Word Clouds in Tableau

The weird thing about Tableau 9 (or 10) is that it will let you make word clouds of whole phrases, but not of the individual words inside them (please contact me if you’ve found a better way).

At work we were allowed to use a Python distribution called Enthought Canopy (very crippled by IT; you had to manually download Egg and Wheel files).  I somehow managed to get the Natural Language Toolkit (NLTK) installed along with its corpora (that’s a story for another day).  In Tableau I exported a crosstab CSV with the Record IDs in the first column and the phrases in question in the second column. I then ran this script that I wrote:

import nltk
import csv
from nltk.corpus import stopwords

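#set utf8 as default (Python 2 only workaround; setdefaultencoding is hidden until sys is reloaded)
import sys
reload(sys)
sys.setdefaultencoding('utf8')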

from string import punctuation
def strip_punctuation(s):
    return ''.join(c for c in s if c not in punctuation)

#set of words we don't care about (e.g. is, a, the, of, etc.)
#if you call stopwords.words() without a language, you get the stop words for every language you have installed
fillerWords = set(stopwords.words("english"))

#open files
r=open('input.csv', 'r')
w=open('output.csv', 'wb')

reader = csv.reader(r, dialect='excel')
writer = csv.writer(w, dialect='excel')  #the excel dialect already uses a comma delimiter

#skip the header in input, but create header in output
next(reader, None)
writer.writerow( ("Record ID", "Verbs" , "Nouns") )

#for each line in input file...
for line in reader:
    incident_num = line[0]
    print incident_num
    sentence = strip_punctuation(line[1].decode('ascii', 'replace'))  #decode as ASCII, replacing any non-ASCII bytes, to get past the utf8 errors; then strip punctuation away

    #strip out filler words (compare in lowercase, since the stop word list is all lowercase)
    stripped_sentence = ' '.join([word for word in sentence.split() if word.lower() not in fillerWords])

    #tokenize the sentence and get part-of-speech tags
    tokens = nltk.word_tokenize(stripped_sentence)
    tags = nltk.pos_tag(tokens)

    #chunk grammars (regex-like patterns over POS tags) for nouns and verbs
    #NP: optional determiner, any number of adjectives, then a singular noun; V: any verb tag
    noun_pattern = "NP: {<DT>?<JJ>*<NN>}"
    verb_pattern = "V: {<V.*>}"

    #build nltk tree for nouns
    NPChunker = nltk.RegexpParser(noun_pattern)
    noun_result = NPChunker.parse(tags)

    #build nltk tree for verbs
    VerbChunker = nltk.RegexpParser(verb_pattern)
    verb_result = VerbChunker.parse(tags)

    #noun init
    nouns = []

    #traverse tree looking for noun labels, append to noun array
    for subtree in noun_result.subtrees(filter=lambda t: t.label() == "NP"):
        nouns.append(" ".join([word for (word, tag) in subtree.leaves()]))

    #verb init
    verbs = []

    #traverse tree looking for verb labels, append to verb array
    for subtree in verb_result.subtrees(filter=lambda t: t.label() == "V"):
        verbs.append(" ".join([word for (word, tag) in subtree.leaves()]))

    #actually write the row to output file
    writer.writerow( (incident_num , " | ".join(verbs) , " | ".join(nouns)) )

#close the files
r.close()
w.close()

The script spit out a CSV file that I then re-imported into Tableau, with the verbs and nouns separated into two new columns.  From there it was easy to build the word clouds below.
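
If you would rather hand Tableau one term per row instead of the " | "-joined cells, the output can be reshaped first. What follows is only a rough sketch under some assumptions: it is written for Python 3, explode_words and the sample rows are hypothetical names and data made up for illustration, and the "Type" and "Word" column names are arbitrary.

```python
import csv
import io

def explode_words(rows, separator=" | "):
    """Turn (record_id, verbs, nouns) rows into (record_id, kind, word) rows."""
    for record_id, verbs, nouns in rows:
        for kind, joined in (("Verb", verbs), ("Noun", nouns)):
            for word in joined.split(separator):
                word = word.strip()
                if word:  # skip cells that had no verbs or no nouns
                    yield record_id, kind, word

# Hypothetical rows in the shape the script writes (header already removed).
sample_rows = [
    ("101", "leaking", "pump | valve"),
    ("102", "", "motor"),
]
long_rows = list(explode_words(sample_rows))

# Write the long-format rows as CSV text; a real run would use a file
# opened with newline="" instead of an in-memory buffer.
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(("Record ID", "Type", "Word"))
writer.writerows(long_rows)
print(out.getvalue())
```

With one term per row, a count of records per term can drive the size of each word in the cloud, and the Type column works as a filter between verbs and nouns.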

The zip file I’m sharing has example input and output files, as well as this script.  Enjoy!
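
A note for anyone on Python 3: the script targets Python 2 (the print statement, the 'wb' file mode and the str.decode call all give that away). The read-and-clean step might look roughly like the sketch below on Python 3. STOP_WORDS, clean_phrase and read_phrases are hypothetical names, and the tiny stop word set is just a stand-in for stopwords.words("english") so that the example runs without NLTK installed. It compares words to the stop list in lowercase, so "The" is dropped as well as "the".

```python
import csv
import io
from string import punctuation

# Hypothetical stand-in for nltk's stopwords.words("english").
STOP_WORDS = {"the", "a", "is", "of", "and", "to", "was"}

def clean_phrase(text, stop_words=STOP_WORDS):
    """Strip ASCII punctuation, then drop stop words (case-insensitive)."""
    no_punct = "".join(c for c in text if c not in punctuation)
    return " ".join(w for w in no_punct.split() if w.lower() not in stop_words)

def read_phrases(handle):
    """Yield (record_id, cleaned_phrase) pairs, skipping the header row."""
    reader = csv.reader(handle, dialect="excel")
    next(reader, None)  # skip the header row
    for row in reader:
        if len(row) >= 2:  # ignore blank or short lines
            yield row[0], clean_phrase(row[1])

# In-memory sample; for a real file, something like
# open("input.csv", newline="", encoding="utf-8", errors="replace")
# takes the place of the decode() workaround.
sample = io.StringIO('Record ID,Phrase\r\n101,"The pump was leaking, badly."\r\n')
cleaned = list(read_phrases(sample))
print(cleaned)
```

The cleaned phrases would then go through nltk.word_tokenize and nltk.pos_tag the same way as in the script.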

