Natural language processing and sentiment analysis

Sentiment analysis is a common NLP task that data scientists need to perform. This is a straightforward guide to the basics of NLP and to building a basic movie review classifier in Python.

Prerequisites :

Before we begin, please set up a Python environment on your machine. If you have not done so yet, head over to the official Python page to install it.

NLP Pyramid :

To get a broad overview of text classification and NLP in general, we must talk about the NLP pyramid and the different stages of analysis for a text.

Natural language processing pyramid
  • Singularization/pluralization
  • Gender detection …
  • Building Syntax Trees
  • Semantic Role Labelling etc …
  • Summarization etc…

Text preprocessing :

What is a text? You can think of a text as a sequence of words, where each word is a meaningful sequence of characters.

Tokenization :

Tokenization is the process of splitting an input text into meaningful chunks. Each chunk is called a token.
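As a minimal sketch of the idea, the snippet below splits a sentence into lowercase word tokens with a regular expression. The `tokenize` helper and the sample sentence are assumptions made for illustration; real tokenizers handle punctuation, contractions and Unicode far more carefully.

```python
import re

def tokenize(text):
    """Lowercase the text and return runs of letters, digits or apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

# Hypothetical sample sentence, not taken from any dataset.
tokens = tokenize("This movie wasn't great, but the acting was superb!")
print(tokens)
# ['this', 'movie', "wasn't", 'great', 'but', 'the', 'acting', 'was', 'superb']
```

Punctuation is dropped here and the contraction stays in one piece. Whether that is the right choice depends on the task.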

Token normalisation :

We may want the same token for different forms of a word. For example, “talk”, “talks” and “talked” are all about talking, so we want to merge them into a single token, “talk”. The process of normalizing words is called stemming or lemmatization:

  • Stemming: usually refers to heuristics that chop off suffixes or replace them.
  • Lemmatization: returns the base or dictionary form of a word, which is known as the lemma.
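The difference between the two approaches can be sketched with a toy example. Both helpers below are hypothetical: `toy_stem` uses a deliberately tiny suffix list and `toy_lemmatize` uses a hand-written dictionary, so neither stands in for a real stemmer or lemmatizer from an NLP library.

```python
# Hypothetical, deliberately tiny suffix list for illustration only.
SUFFIXES = ("ing", "ed", "s")

def toy_stem(word):
    """Chop off the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini dictionary mapping word forms to their lemma.
LEMMAS = {"talks": "talk", "talked": "talk", "talking": "talk", "went": "go"}

def toy_lemmatize(word):
    """Look the word up in the dictionary; unknown words are returned as-is."""
    return LEMMAS.get(word, word)

for word in ["talk", "talks", "talked", "went"]:
    print(word, toy_stem(word), toy_lemmatize(word))
```

The suffix heuristic merges “talks” and “talked” into “talk” but leaves the irregular form “went” untouched, while the dictionary lookup maps it to “go”.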

Feature extraction from the text :

In this part we are going to transform tokens into features.

Bag of words ( BOW ) :

  • The counters (number of occurrences) are not normalized.
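A bag-of-words representation can be sketched in a few lines: build a vocabulary from all documents, then count how often each vocabulary token occurs in each document. The `bag_of_words` helper and the three sample sentences are assumptions made for illustration.

```python
from collections import Counter

def bag_of_words(docs):
    """Return a sorted vocabulary and one raw count vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        # A Counter returns 0 for tokens that do not occur in the document.
        vectors.append([counts[tok] for tok in vocab])
    return vocab, vectors

# Hypothetical sample documents.
vocab, vectors = bag_of_words(["good movie", "not a good movie", "did not like"])
print(vocab)
for vector in vectors:
    print(vector)
```

The vectors hold raw counts, so longer documents get larger values, which is the missing normalization mentioned above.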

TF-IDF :

Term frequency ( TF ) : tf(t, d) is the frequency of term t in document d. A term can be a token, an n-gram, or any similar unit.

https://en.wikipedia.org/wiki/Tf%E2%80%93idf
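As a rough sketch, TF-IDF can be computed by hand on a tiny corpus. The version below assumes one common variant: tf(t, d) is the count of t in d divided by the length of d, and idf(t) is log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t. Other weighting schemes exist, and the corpus is made up for illustration.

```python
import math

def tf(term, doc_tokens):
    """Relative frequency of the term in one tokenized document."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    """log(N / df): terms that occur in fewer documents get a higher weight."""
    df = sum(1 for doc in corpus if term in doc)
    if df == 0:
        return 0.0  # assumption: a term unseen in the corpus gets weight 0
    return math.log(len(corpus) / df)

def tfidf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

# Hypothetical tokenized corpus of three short reviews.
corpus = [
    ["good", "movie"],
    ["not", "a", "good", "movie"],
    ["bad", "movie"],
]
print(tfidf("movie", corpus[0], corpus))  # occurs in every document
print(tfidf("good", corpus[0], corpus))   # occurs in two of three documents
print(tfidf("bad", corpus[2], corpus))    # occurs in one document only
```

A term that appears in every document, such as “movie” here, gets an IDF of log(1) = 0, so it carries no weight regardless of how often it occurs.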

Linear model for sentiment classification ( Classifying IMDb Movie Reviews ) :

The IMDb movie reviews dataset is freely available for download. It contains a collection of 50,000 reviews from IMDb: 25,000 positive (label 1) and 25,000 negative (label 0).

  • Linear models tend to perform well on sparse, high-dimensional data like this one.
  • n-grams: Instead of just single-word tokens (1-gram/unigram) we can also include word pairs.
  • Representations: Instead of simple, binary vectors we can use word counts or TF-IDF to transform those counts.
  • Algorithms: In addition to Logistic Regression, we can use Support Vector Machines (SVM).
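To make the points above concrete, here is a minimal sketch of a logistic regression classifier trained on binary unigram and bigram features with plain gradient descent. Everything in it is an assumption made for illustration: the four training sentences are made up and are not IMDb reviews, the helper names are hypothetical, and the hyperparameters are arbitrary. In practice one would typically use a machine learning library and the full dataset.

```python
import math
import re

def features(text, use_bigrams=True):
    """Binary features: the set of unigrams and, optionally, bigrams."""
    tokens = re.findall(r"[a-z']+", text.lower())
    feats = set(tokens)
    if use_bigrams:
        feats |= {" ".join(pair) for pair in zip(tokens, tokens[1:])}
    return feats

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(docs, labels, epochs=200, lr=0.5):
    """Fit one weight per feature plus a bias with stochastic gradient descent."""
    doc_feats = [features(doc) for doc in docs]
    weights = {f: 0.0 for feats in doc_feats for f in feats}
    bias = 0.0
    for _ in range(epochs):
        for feats, y in zip(doc_feats, labels):
            p = sigmoid(bias + sum(weights[f] for f in feats))
            error = p - y  # gradient of the log loss with respect to z
            bias -= lr * error
            for f in feats:
                weights[f] -= lr * error
    return weights, bias

def predict_proba(text, weights, bias):
    """Probability of the positive class; unseen features are ignored."""
    known = [f for f in features(text) if f in weights]
    return sigmoid(bias + sum(weights[f] for f in known))

# Made-up toy reviews (1 = positive, 0 = negative), not from IMDb.
train_docs = [
    "a great and moving film",
    "wonderful acting, great story",
    "boring and terrible plot",
    "terrible acting, awful film",
]
train_labels = [1, 1, 0, 0]

weights, bias = train(train_docs, train_labels)
print(predict_proba("great story", weights, bias))
print(predict_proba("awful and boring", weights, bias))
```

Each learned weight belongs to a single word or word pair, so the model stays easy to inspect: positive weights push a review towards label 1, negative weights towards label 0.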

Conclusion :

I hope this was useful to those who have just started learning about natural language processing and text classification. Keep in mind that this is just an introductory article; more advanced notions are left out for simplicity's sake. To learn more, you can check the references further below.

References and further reading :
