Natural language processing and sentiment analysis

Sentiment analysis is a common NLP task that data scientists need to perform. This is a straightforward guide to learning the basics of NLP and creating a basic movie review classifier in Python.

Prerequisites :

You need to install the NLTK and scikit-learn packages. If they are not already present in your Python installation, you can install both of them with pip.

NLP Pyramid :

Natural language processing pyramid

1- Morphology : The first stage consists of analysing how words are formed, what their origin is, and how their form changes depending on the context. In most cases we are talking about :

  • Prefixes/suffixes
  • Lemmatization (finding the base form of a word)
  • Singularization/pluralization
  • Gender detection …

2- Syntax : The next stage, syntactical analysis, is about the different relations between words in a sentence. Syntax usually works on sentences, where a sentence is a sequence of words. For example, we can find out which words are subjects and which are objects, and so on. Here are a few syntactical analysis tasks:

  • Part-of-speech tagging (assigning tags to words: Noun/Verb/etc …)
  • Building Syntax Trees

3- Semantics : The next stage, once we know some syntactic structures, is semantics. Semantics is about meaning. For example:

  • Named Entity Extraction
  • Semantic Role Labelling etc …

4- Pragmatics : We go higher and higher in our abstraction, from mere symbols to meanings, and pragmatics is the highest level of this abstraction. Pragmatics analyses the text as a whole. It’s about determining underlying narrative threads, topics, references. Here are some known problems :

  • Topic segmentation
  • Summarization etc…

For morphological and syntactical analysis, you might try the NLTK library, which is a really convenient tool in Python. So please feel free to investigate it.

The Stanford Parser, among others, is a parser for syntactic analysis that provides different options and has lots of different models built in.

Gensim and MALLET are about higher-level abstractions. For example, you can do subclassification problems there, or you can think about semantics.

Text preprocessing :

How do we delimit a word? In most languages we can split a sentence into separate words using spaces and punctuation.

But there are exceptions:

  • In German there are compound words …

German can be more difficult, because compound words are written without any spaces, like “Betäubungsmittelverschreibungsverordnung”. It’s a mesmerizing word that is rather difficult to read. Translated into English, this lengthy word becomes 7 words: “regulation requiring a prescription for an anesthetic.”

So for the analysis of German text, it could be beneficial to split a compound word into separate words, because every one of them carries meaning on its own.

  • In Japanese there are no spaces :

Japanese doesn’t use spaces at all, but people can still read it. That is not a problem for a human being, much as you can probably read an English sentence written without spaces, but it can be challenging for a machine to make sense of it.


Tokenization :

You can think of a token as a useful unit for further semantic processing. It can be a word, a sentence, a paragraph or anything else.

Let’s look at the example of a simple WhitespaceTokenizer. It splits the input sequence on white space, which could be a space or any other character that is not visible.

What is the problem here? The last token, “it?”, has the same meaning as the token “it” without the question mark, yet they are two different tokens. That might not be a desirable effect.

We might want to merge these two tokens because they have essentially the same meaning. The same goes for “spaces,”, which is the same token as “spaces”. So let’s try to split by punctuation as well.

The problem here is that the apostrophe becomes a token of its own, and we get “isn” and “t” as separate tokens as well. These tokens don’t have much meaning, and it doesn’t make sense to analyze them in isolation. They only make sense when combined with the apostrophe or the previous word.

So we can come up with a set of rules or heuristics, which you can find in the TreebankWordTokenizer: it uses the grammar rules of the English language to produce tokens suitable for further analysis.

This time the apostrophe is kept together with “n’t”, which makes much more sense, because “n’t” means “not” and negates the token before it.
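The three tokenizers discussed above can be compared side by side. This is a minimal sketch, assuming NLTK is installed; the sentence is made up for illustration and is not the one from the original figures.

```python
from nltk.tokenize import (
    TreebankWordTokenizer,
    WhitespaceTokenizer,
    WordPunctTokenizer,
)

# Illustrative sentence (an assumption, not taken from the article).
text = "This text isn't split on spaces, is it?"

# Split on white space only: punctuation stays glued to the words.
whitespace_tokens = WhitespaceTokenizer().tokenize(text)

# Split on white space and punctuation: contractions are broken apart.
punct_tokens = WordPunctTokenizer().tokenize(text)

# Rule-based English tokenizer: contractions are split in a meaningful way.
treebank_tokens = TreebankWordTokenizer().tokenize(text)

print(whitespace_tokens)
print(punct_tokens)
print(treebank_tokens)
```

Comparing the three printed lists should show the “spaces,” and “it?” tokens in the first, the stray “isn” and “t” in the second, and a separate “n't” token in the third.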

Token normalisation :

Stemming :

  • Stemming is the process of removing and replacing suffixes to get to the root form of the word, which is called the stem.
  • It usually refers to heuristics that chop off suffixes or replace them.

For example:

Words that share a stem tend to have a similar meaning: “connections”, “connected”, “connecting” and “connection” all have the same stem, “connect”.

For “feet” the stemmer produces “feet”, so it doesn’t know anything about irregular forms. For “wolves”, it produces “wolv”, which is not a valid word, but it can still be useful for analysis.

Drawbacks : The problems are obvious. It fails on irregular forms, and it produces non-words.
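The stemming examples above can be reproduced with a short script. This is a sketch that assumes NLTK's PorterStemmer; the article does not say which stemmer produced its examples, so treat that choice as an assumption.

```python
from nltk.stem import PorterStemmer

# Assumption: the Porter algorithm stands in for "a stemmer" here.
stemmer = PorterStemmer()

words = ["connections", "connected", "connecting", "connection", "feet", "wolves"]

# Map every word to the stem the heuristic rules produce.
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

The “connect” family collapses to one stem, while “feet” and “wolves” illustrate the two drawbacks listed above.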

Lemmatization :

  • Lemmatization usually refers to doing things properly, with the use of vocabularies and morphological analysis.
  • It returns the base or dictionary form of a word, which is known as the lemma.

For example :

This time the word “feet” is successfully reduced to the normalized form “foot”, because the lemmatizer knows about the words of the English language and all their irregular forms.

But when you take “talked”, it stays “talked”; nothing changes.

Drawbacks : The lemmatizer does not reduce all forms. For nouns, the lemma is typically the singular form of the noun, but other forms, such as the verb “talked”, can be left unchanged. That might prevent you from merging tokens that have the same meaning.

Takeaway : We need to try both stemming and lemmatization, and choose what works best for our task.

Feature extraction from the text :

The first way to do that is bag of words:

Bag of words ( BOW ) :

We’re looking for marker words like “excellent” or “disappointed”: we want to detect those words and make decisions based on the absence or presence of each particular word.

Let’s take all the possible words or tokens that we have in our documents. For each token, let’s introduce a new feature or column that will correspond to that particular word.

So, that is a pretty huge matrix of numbers, and each text is translated into a vector, that is, a row in that matrix.

Example :

This process is called text vectorization, because we replace the text with a huge vector of numbers, where each dimension of that vector corresponds to a certain token in our database. This representation has some problems, though.
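Text vectorization as described above can be sketched with scikit-learn's CountVectorizer. The three short reviews are toy data made up for illustration, and the default settings are an assumption rather than the article's own configuration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy reviews (hypothetical, for illustration only).
docs = ["good movie", "not a good movie", "did not like"]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

# Recover the column order: vocabulary_ maps each token to its column index.
columns = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

print(columns)
print(bow.toarray())
```

Note that the default token pattern ignores one-character tokens, so “a” does not get a column of its own.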

Problems :

  • The first one is that we lose word order: we can shuffle the words and the representation will stay the same. That’s why it’s called a bag of words: the words in the bag are not ordered, so they can come up in any order.
  • The counters ( number of occurrences ) are not normalized.

Let’s solve these two problems, starting with preserving some ordering. How can we do that? You can easily come to the idea of looking at tokens in pairs, triplets, or other combinations. This approach is also called extracting n-grams.

Let’s preserve some order ( N-grams ) :

A 1-gram is a single token, a 2-gram is a token pair, and so forth. Let’s look at how that might work.

This way, we preserve some local word order, and we hope that it will help us to analyze the text better. The problems are obvious, though.
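Extracting n-grams takes only a few lines of plain Python. The helper name extract_ngrams and the sample sentence are hypothetical, chosen just for this sketch.

```python
def extract_ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Illustrative token sequence (an assumption, not from the article).
tokens = "did not like this movie".split()

unigrams = extract_ngrams(tokens, 1)
bigrams = extract_ngrams(tokens, 2)
trigrams = extract_ngrams(tokens, 3)

print(unigrams)
print(bigrams)
print(trigrams)
```

The bigram “not like” now keeps the negation next to the word it modifies, which a plain bag of words would lose.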

Problems : This representation can have too many features. Let’s say you have 100,000 words in your database: if you take pairs of those words, you can come up with a huge number of features, and that number can grow exponentially with the number of consecutive words you want to analyze.

To overcome that problem, we can remove some n-grams.

Remove some n-grams :

  • High frequency n-grams :

High frequency n-grams are seen in almost all of the documents. These are typically articles and prepositions: they’re there for grammatical structure and don’t carry much meaning.

These are called stop-words. They won’t help us to discriminate between texts, and we can pretty easily remove them.

  • Low frequency n-grams :

Here we are talking mainly about typos that people make when they write, or rare n-grams that are usually not seen in any other texts.

Both of them are bad for our model, because they lead the model to learn dependencies that are not actually there, which leads to overfitting.

  • Medium frequency n-grams :

Those are the good n-grams, because they are neither stop-words nor typos, and we actually need them. But the problem is that there are a lot of medium frequency n-grams.

However, we can rank medium frequency n-grams by frequency: the n-gram with the smaller frequency can be more discriminating, because it can capture a specific topic in the text.

To use this idea, we are going to talk about TF-IDF :


Term frequency ( TF ) tells us how often a term occurs in a document. There are different options for computing it, and what follows explains some of them:

The first and easiest one is binary: take zero or one depending on whether the token is absent from or present in the text.

A different option is to take the raw count of how many times the term occurs in the document; let’s denote that count by f.

Next, you can take a normalized term frequency: look at the counts of all the terms seen in the document and normalize them so that they sum to one.

One more useful scheme is logarithmic normalization: taking the logarithm of the counts puts them on a logarithmic scale, which might help you solve the task better.
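The four weighting schemes can be compared on a single toy document. This is a sketch in plain Python; the function name term_frequencies is hypothetical, and 1 + log(f) is one common form of logarithmic normalization, assumed here because the text does not give an exact formula.

```python
import math
from collections import Counter


def term_frequencies(tokens):
    """Compute several term-frequency variants for the terms in one document."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {
        term: {
            "binary": 1,                # the term is present in the document
            "raw": f,                   # raw count f
            "normalized": f / total,    # counts rescaled to sum to one
            "log": 1 + math.log(f),     # assumed variant: 1 + log(f)
        }
        for term, f in counts.items()
    }


# Toy document (hypothetical).
tf = term_frequencies("good movie very good".split())

for term, schemes in tf.items():
    print(term, schemes)
```

Terms that do not occur in the document are simply missing from the result, which corresponds to a value of zero under every scheme.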

Inverse document frequency ( IDF ) :

Let’s denote by capital N the total number of documents in our corpus, and by capital D the corpus itself, that is, the set of all our documents.

Now, let’s look at how many documents in that corpus contain a specific term. Dividing the number of documents where the term appears by the total number of documents gives the document frequency of that term.

To get the inverse document frequency, you flip that ratio and take its logarithm: IDF(t, D) = log( N / |{d ∈ D : t ∈ d}| ).
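The definition above translates directly into code. This is a minimal sketch on a made-up corpus; the function name is hypothetical, and it assumes the term occurs in at least one document so that the ratio is defined.

```python
import math


def inverse_document_frequency(term, corpus):
    """log(N / number of documents containing the term)."""
    n_docs = len(corpus)
    n_containing = sum(1 for doc in corpus if term in doc)
    # Assumption: n_containing > 0, otherwise the ratio is undefined.
    return math.log(n_docs / n_containing)


# Toy corpus (hypothetical): each document is a list of tokens.
corpus = [doc.split() for doc in ["good movie", "not a good movie", "did not like"]]

idf_good = inverse_document_frequency("good", corpus)  # appears in 2 of 3 documents
idf_like = inverse_document_frequency("like", corpus)  # appears in 1 of 3 documents

print(idf_good, idf_like)
```

The rarer term gets the larger IDF, which matches the idea that less frequent n-grams are more discriminating.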

In scikit-learn, we instantiate a TfidfVectorizer, which has some useful arguments, like min_df, which stands for minimum document frequency and is essentially a cutoff threshold for the low frequency n-grams we want to throw away.

We can also set a threshold on the maximum number of documents in which a token may appear ( max_df ), and this is done for stripping away stop words.

Once the text is vectorized, each document becomes a row of TF-IDF values, one per remaining n-gram.

So you can replace counts with TF-IDF values, and that usually gives you a performance boost as well.
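The effect of the two frequency cutoffs can be seen on a tiny corpus. This sketch assumes scikit-learn's TfidfVectorizer; the documents and the specific threshold values are illustrative choices, not the article's own settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical).
docs = ["good movie", "not a good movie", "did not like", "i like it", "good one"]

vectorizer = TfidfVectorizer(
    min_df=2,            # drop n-grams seen in fewer than 2 documents
    max_df=0.5,          # drop n-grams seen in more than 50% of documents
    ngram_range=(1, 2),  # use both single tokens and token pairs
)
tfidf = vectorizer.fit_transform(docs)

# Surviving features, in column order.
features = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

print(features)
print(tfidf.toarray().round(2))
```

Here “good” appears in 3 of the 5 documents, so max_df removes it as if it were a stop word, while n-grams that occur only once are removed by min_df.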

Linear model for sentiment classification ( Classifying IMDb Movie Reviews ) :

A negative review has a score ≤ 4 out of 10, and a positive review has a score ≥ 7 out of 10.

The dataset is divided into training and test sets, each containing 25,000 labeled reviews.

The sentiment classification task consists of predicting the polarity (positive or negative) of a given text.

After downloading the files from the link above, we are going to read the train and test files.

The raw text of these reviews is pretty messy, so before we can do any analytics we need to preprocess the texts.
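One possible cleaning step is sketched below. The rules are assumptions, not the article's own: the sketch lowercases the text, replaces HTML line breaks with spaces, and strips punctuation other than apostrophes. The sample review is made up.

```python
import re

# Hypothetical cleaning rules; adjust them to what the raw files actually contain.
HTML_BREAKS = re.compile(r"<br\s*/?>")
PUNCTUATION = re.compile(r"[^\w\s']")


def preprocess(review):
    """Lowercase a review, drop HTML line breaks and punctuation, squeeze spaces."""
    review = HTML_BREAKS.sub(" ", review.lower())
    review = PUNCTUATION.sub("", review)
    return " ".join(review.split())


# Made-up review for illustration.
raw = "This movie was GREAT!<br /><br />Loved it."
clean = preprocess(raw)

print(clean)
```

The same function would be applied to every review in both the training and the test set, for example with a list comprehension.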

After preprocessing the text we need to vectorize it. Here, for learning purposes, we are going to keep it as simple as possible.

Now we are going to build our classification model. Logistic regression is a good baseline model for us to use for several reasons:

  • Linear models are easy to interpret
  • Linear models tend to perform well on sparse datasets like this one

The targets/labels we use will be the same for training and testing because both datasets are structured the same, where the first 12.5k are positive and the last 12.5k are negative.
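Given that layout, the label vector can be built without reading anything from the files. This is a small sketch; encoding positive as 1 and negative as 0 is an assumption.

```python
# 25,000 reviews per set: the first half positive, the second half negative.
n_reviews = 25000

# Assumed encoding: 1 = positive, 0 = negative.
target = [1 if i < n_reviews // 2 else 0 for i in range(n_reviews)]

print(sum(target), len(target) - sum(target))
```

The same list can then be passed as the labels for both the training and the test set.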

And that’s it: a very simple logistic regression classifier with pretty good accuracy.

Let’s look at the 5 most discriminating words for both positive and negative reviews. We’ll do this by looking at the largest and smallest coefficients, respectively.
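The whole pipeline, from vectorization to inspecting coefficients, can be sketched on a toy corpus. The six reviews and their labels below are made up and only stand in for the preprocessed IMDb data; binary features and the default LogisticRegression settings are assumptions made to keep things simple.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the preprocessed reviews (hypothetical data).
reviews = [
    "excellent movie loved it",
    "excellent acting and great story",
    "great fun excellent cast",
    "awful movie hated it",
    "awful plot and boring story",
    "boring and awful acting",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = positive, 0 = negative

# Binary bag of words: 1 if the token occurs in the review, else 0.
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(reviews)

model = LogisticRegression()
model.fit(X, labels)

# Pair every token with its learned coefficient and sort by weight.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
weights = sorted(zip(model.coef_[0], terms))

most_negative = [term for _, term in weights[:3]]
most_positive = [term for _, term in weights[-3:]][::-1]

print("negative:", most_negative)
print("positive:", most_positive)
```

On the real dataset you would fit on the training reviews, score the held-out test reviews, and take the five largest and five smallest coefficients instead of three.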

How to improve our model :

As you can probably guess by now, to get better performance out of our classifier, we can use:

  • Text Processing: Stemming/Lemmatizing.
  • n-grams: Instead of just single-word tokens (1-gram/unigram) we can also include word pairs.
  • Representations: Instead of simple, binary vectors we can use word counts or TF-IDF to transform those counts.
  • Algorithms: In addition to Logistic Regression, we can use Support Vector Machines (SVM).

Conclusion :

I am still new on Medium so do share your thoughts, questions. I welcome feedback and constructive criticism regarding this article. Happy learning!
