Word Clouds in Tableau

The weird thing about Tableau 9 (or 10) is that it will let you make word clouds of phrases, but not of words (please contact me if you’ve found a better way).

At work we were allowed to use a Python distribution called Enthought Canopy (heavily crippled by IT; you had to manually download Egg and Wheel files).  I somehow managed to get the Natural Language Toolkit installed along with its corpora (that’s a story for another day).  In Tableau I exported a crosstab CSV with all of the Record IDs in one column and the phrases in question in the next column.
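The export only needs a header row followed by ID/phrase pairs, so the input looks something like this (the second column name and the sample text are made up purely for illustration):

Record ID,Description
1001,The server crashed after the nightly backup job ran
1002,User reported the printer was jammed and smoking

I then ran this script that I wrote: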

import nltk
import csv
from nltk.corpus import stopwords



#set utf8 as default
import sys  
reload(sys)  
sys.setdefaultencoding('utf8')
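#(reload(sys) is needed because Python 2 deletes setdefaultencoding from sys at startup; this whole block is a common Python 2 workaround for UnicodeDecodeError)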


from string import punctuation
def strip_punctuation(s):
    return ''.join(c for c in s if c not in punctuation)


#list of words we don't care about (e.g. is, a, the, of, etc.)
#if you don't specify a language, you get the stopwords for every language you have installed
fillerWords = stopwords.words("english")
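#(the English list is just lowercase strings, roughly ['i', 'me', 'my', 'the', 'of', ...])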



#open files
r=open('input.csv', 'r')
w=open('output.csv', 'wb')

reader = csv.reader(r, dialect='excel')
writer = csv.writer(w, delimiter=',', dialect='excel')

#skip the header in input, but create header in output
next(reader)
writer.writerow( ("Record ID", "Verbs" , "Nouns") )


#for each line in input file...
for line in reader:
    print line[0]
    incident_num = line[0]
    sentence = strip_punctuation(line[1].decode('ascii', 'replace'))  #decode to ascii (replacing any weird bytes) to get past stupid utf8 errors, then strip punctuation away

    #strip out filler words
    stripped_sentence = ' '.join([word for word in sentence.split() if word not in fillerWords])

    #tokenize the sentence and get tags
    tokens = nltk.word_tokenize(stripped_sentence)
    text = nltk.Text(tokens)
    tags = nltk.pos_tag(text)
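    #(pos_tag returns a list of (word, tag) pairs, e.g. something like [('server', 'NN'), ('crashed', 'VBD')]; the example tags are just illustrative)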


    #chunk grammar patterns (regexes over the POS tags) for noun phrases and verbs
    noun_pattern = "NP: {<DT>?<JJ>*<NN>}"   #optional determiner, any adjectives, then a noun
    verb_pattern = "V: {<V.*>}"             #any verb tag (VB, VBD, VBG, ...)

    #build nltk chunk tree for nouns
    NPChunker = nltk.RegexpParser(noun_pattern)
    noun_result = NPChunker.parse(tags)


    #build nltk chunk tree for verbs
    VerbChunker = nltk.RegexpParser(verb_pattern)
    verb_result = VerbChunker.parse(tags)


    #noun init
    nouns = []

    #traverse tree looking for noun labels, append to noun array
    for subtree in noun_result.subtrees(filter=lambda t: t.label() == "NP"):
        nouns.append(" ".join([a for (a, b) in subtree.leaves()]))
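        #(subtree.leaves() gives the chunk's (word, tag) pairs, so joining the words rebuilds the chunk text, e.g. a phrase like "nightly backup"; purely illustrative)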
 
    #verb init
    verbs = []

    #traverse tree looking for verb labels, append to verb array
    for subtree in verb_result.subtrees(filter=lambda t: t.label() == "V"):
        verbs.append(" ".join([a for (a, b) in subtree.leaves()]))

    #actually write the row to output file
    writer.writerow( (incident_num, " | ".join(verbs), " | ".join(nouns)) )


#close the files
r.close()
w.close()

It spat out a CSV file that I then re-imported into Tableau, with the verbs and the nouns separated into two additional columns.
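The output follows the header the script writes ("Record ID", "Verbs", "Nouns"), with multiple chunks joined by " | ", so a couple of rows might look roughly like this (the exact chunks depend entirely on how NLTK tags your text, so treat it as illustrative only):

Record ID,Verbs,Nouns
1001,crashed | ran,server | backup
1002,reported | jammed,printer

From there it was easy to build the word clouds below.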

The zip file I’m sharing has example input and output files, as well as this script.  Enjoy!

Photo Credit: Pixabay.com
