The Cake is a Lie

So, this is the first proper post with some ‘actual’ code in our series of sentiment analysis blogs.

In the first introductory post we discussed natural language processing as a method to analyse the positive or negative-ness of a block of text. That certainly is our end goal but this post is all about the simple, brute force implementation. We’re going to start here and gradually build upon this to improve the analysis both in terms of accuracy and flexibility.

The Theory

The simplest approach to analysing whether a statement is positive or negative is just to count the number of positive and negative words and then compare them. If there are more positive words than negative ones then we can assume that the statement is positive overall, and vice versa if the negative words are in the majority. If the counts are equal then we can probably say the statement is neutral.

The Implementation

The implementation can be split into two parts. First, there’s the code that reads in the entry (command line only at this stage) and checks every word to see if it’s positive or negative.

private void analyze(String text) {
    // Naive tokenisation: split the input on single spaces
    List<String> words = Arrays.asList(text.split(" "));

    // Count the words found in each of the library's word lists
    long positiveCount = words.stream().filter(library::isPositive).count();
    long negativeCount = words.stream().filter(library::isNegative).count();

    if (positiveCount > negativeCount) {
        logger.info("Text is POSITIVE");
    } else if (negativeCount > positiveCount) {
        logger.info("Text is NEGATIVE");
    } else {
        logger.info("Text is NEUTRAL");
    }
}

The iteration can be tidied up but we’ve left it like this for clarity.

Simple steps:

  1. Split on spaces
  2. Count positive words
  3. Count negative words
  4. Compare positive word count to negative word count
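As one possible tidy-up of the iteration, the two stream passes could be folded into a single pass by giving each word a score. This is only a sketch: the class name SinglePassAnalyzer, the score and classify methods and the tiny hard-coded word sets are all made up for illustration and stand in for the real Library word lists. It also assumes that no word appears in both lists.

```java
import java.util.Arrays;
import java.util.Set;

// Hypothetical stand-in for analyze(); the word sets below are illustrative
// only and are NOT the real word lists loaded by Library.
class SinglePassAnalyzer {

    private static final Set<String> POSITIVE = Set.of("excellent", "pleasant", "great");
    private static final Set<String> NEGATIVE = Set.of("poor", "substandard", "bad");

    // +1 for a positive word, -1 for a negative word, 0 for anything else.
    // Assumes a word is never in both sets.
    static int score(String word) {
        String lower = word.toLowerCase();
        if (POSITIVE.contains(lower)) return 1;
        if (NEGATIVE.contains(lower)) return -1;
        return 0;
    }

    static String classify(String text) {
        // One pass over the words; the sign of the total gives the verdict.
        int total = Arrays.stream(text.split(" "))
                .mapToInt(SinglePassAnalyzer::score)
                .sum();
        if (total > 0) return "POSITIVE";
        if (total < 0) return "NEGATIVE";
        return "NEUTRAL";
    }

    public static void main(String[] args) {
        System.out.println(classify("The cake was excellent"));
        System.out.println(classify("The cake was substandard but still pleasant"));
    }
}
```

The trade-off is that the separate positive and negative counts are no longer available, only their difference, which is all the comparison step needs.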

The second part is the library class. This is a simple class that loads a list of positive and negative words and provides accessor methods to the in-memory word lists. The class is as follows:

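import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.List;

public class Library {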

    private final Configuration configuration;
    private final List<String> positiveWords;
    private final List<String> negativeWords;

    public Library(Configuration configuration) throws IOException {
        this.configuration = configuration;
        positiveWords = loadWords(configuration.getPositiveWordList());
        negativeWords = loadWords(configuration.getNegativeWordList());
    }

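    private List<String> loadWords(String fileName) throws IOException {
        // Resolve the word list from the classpath; this assumes the resource
        // exists and is a plain file on disk (not packed inside a JAR)
        ClassLoader classLoader = getClass().getClassLoader();
        File file = new File(classLoader.getResource(fileName).getFile());

        // Each line of the file becomes one entry in the list
        return Files.readAllLines(file.toPath());
    }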

    public boolean isPositive(String word) {
        String lowerCaseWord = word.toLowerCase();
        return positiveWords.contains(lowerCaseWord);
    }

    public boolean isNegative(String word) {
        String lowerCaseWord = word.toLowerCase();
        return negativeWords.contains(lowerCaseWord);
    }
}

Note that we convert the input word to lowercase because the word lists are all lowercase.

We’re using Cliche to provide a quick ‘n’ dirty menu system. Whilst the documentation isn’t great, it’s a pretty handy little library!

The Output

Input: "The cake was excellent"
Output: BasicTextAnalyzer - Text is POSITIVE

Input: "The cake was poor"
Output: BasicTextAnalyzer - Text is NEGATIVE

Input: "The cake was substandard but still pleasant"
Output: BasicTextAnalyzer - Text is NEUTRAL

The Drawbacks

So, this is a very simple solution and it’s somewhat successful. There are, however, a few drawbacks. Take the following situations:

“The cake, well, it was great…”

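The punctuation around the qualifiers will trip up the current analysis system. Because we only split on spaces, the punctuation stays attached to the words (‘cake,’ and ‘great…’), so they no longer match the entries in the word lists.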
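One way this might be handled is to split on anything that isn’t a letter, digit or apostrophe rather than on spaces alone. The sketch below is an assumption about how that could look, not the fix we’ll necessarily use; the PunctuationTokenizer class and its tokenize method are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the code in this post.
class PunctuationTokenizer {

    // Split on any run of characters that is not a letter, a digit or an
    // apostrophe (straight or curly), then lowercase what is left.
    static List<String> tokenize(String text) {
        return Arrays.stream(text.split("[^\\p{L}\\p{N}'\\u2019]+"))
                .filter(token -> !token.isEmpty()) // drop the empty token left by leading punctuation
                .map(String::toLowerCase)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(tokenize("The cake, well, it was great\u2026"));
    }
}
```

With the commas and the trailing ellipsis treated as separators, ‘great’ comes out as a clean token that can be looked up in the word lists as before.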

“The cake itself was really bad but the icing was OK”

The current implementation would report this statement as neutral. It doesn’t take into account the use of ‘really’ to intensify the negative word ‘bad’. It also ignores that ‘OK’ refers only to the icing, not the cake.

“The cake was so tasty it could be called evil”

Once again the current implementation fails to take into account the context of the word ‘evil’ in this statement; it’s being used in an unexpected manner to describe the cake as good.

“The cake was wicked”

This statement uses slang to describe the cake positively. The system should be able to cope with colloquialisms, especially given that these are commonly used in internet comments.

To sum up, this is by no means a complete piece of work but it does provide a reasonable analysis of some very basic phrases.

For the next post we’re going to try and improve upon this implementation and address the drawbacks above.

Until then, stay frosty…

~23Squared