Tokenization

It’s been a while since our last blog post in this series – we’re terribly sorry that things are happening a bit more slowly than we would’ve liked. It’s mainly down to other commitments we’ve had; hopefully we’ll be able to get back on track!

This post is going to be a quick look at the next step in our sentiment analysis project and how we can improve on our previous iteration. So, where did we get to last time? We managed to do some basic analysis by looking up each word of the statement in a sentiment dictionary. Although this was broadly successful, it struggled with punctuation – because words were split on whitespace, some tokens contained punctuation in addition to the word to be analysed. Those words then weren’t found in the dictionary, which led to a poor analysis.

So, how do we fix this? Well, we need to improve the ‘tokenization’ of our string. Tokenization is the process of splitting a string into its independent tokens – in this case, the words of the sentence.

So, for example,

"The quick brown fox jumps over the lazy dog"

Becomes:

<sentence>
   <word>The</word>
   <word>quick</word>
   <word>brown</word>
   <word>fox</word>
   <word>jumps</word>
   <word>over</word>
   <word>the</word>
   <word>lazy</word>
   <word>dog</word>
</sentence>
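To make the punctuation problem concrete, here’s a minimal Java sketch (class and regex names are our own for illustration) contrasting the naive whitespace split we used last time with a regex tokenizer that emits punctuation marks as separate tokens:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleTokenizer {
    // Naive approach from the previous post: split on runs of whitespace,
    // so punctuation stays glued to the neighbouring word.
    public static List<String> naive(String sentence) {
        return List.of(sentence.trim().split("\\s+"));
    }

    // Slightly better: treat runs of word characters and individual
    // punctuation marks as separate tokens.
    private static final Pattern TOKEN = Pattern.compile("\\w+|\\p{Punct}");

    public static List<String> tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(sentence);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(naive("The lazy dog!"));    // [The, lazy, dog!]
        System.out.println(tokenize("The lazy dog!")); // [The, lazy, dog, !]
    }
}
```

With the naive split, “dog!” never matches the dictionary entry “dog”; with punctuation split out, it does.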

Tokenization can also be used to split sentences in a document – the token is simply the smallest possible unit that you’re aiming to split the input into.

"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."

Becomes:

<paragraph>
   <sentence>"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</sentence>
   <sentence>Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.</sentence>
   <sentence>Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.</sentence>
   <sentence>Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."</sentence>
</paragraph>
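Sentence-level tokenization can be sketched in plain Java too – the JDK ships a locale-aware sentence splitter in `java.text.BreakIterator` (the class name below is our own for illustration):

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceSplitter {
    // Split a paragraph into sentence tokens using the JDK's BreakIterator.
    public static List<String> sentences(String paragraph) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(paragraph);
        List<String> result = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = paragraph.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                result.add(sentence);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Lorem ipsum dolor sit amet. Ut enim ad minim veniam. Duis aute irure dolor.";
        for (String sentence : sentences(text)) {
            System.out.println(sentence);
        }
    }
}
```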

The process we’re currently using is a very naive form of tokenization. Although the `java.util` package includes a tokenizer, we’re going to use the tokenizer that’s part of the Stanford NLP library. We chose the Stanford tokenizer because it’s part of the ecosystem we’re using for the rest of the project, and we feel that it’s rather more advanced than the `java.util` version.
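For comparison, here’s a quick sketch of `java.util.StringTokenizer` showing why it doesn’t solve our problem – it splits on delimiter characters (whitespace by default), so it has exactly the punctuation issue we hit before:

```java
import java.util.StringTokenizer;

public class UtilTokenizerDemo {
    public static void main(String[] args) {
        // StringTokenizer splits on whitespace by default, so punctuation
        // stays attached to the word – the same problem as our naive split.
        StringTokenizer st = new StringTokenizer("The lazy dog!");
        while (st.hasMoreTokens()) {
            System.out.println(st.nextToken()); // last token is "dog!", not "dog"
        }
        // A tokenizer like Stanford's, by contrast, is designed to emit
        // punctuation as separate tokens, which is one reason we're adopting it.
    }
}
```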

That’s it for now!

~23Squared