
Logistic Regression

Sentiment analysis is the process of analyzing digital text to determine the emotional tone of the message - positive, negative, or neutral. Essentially, it helps us understand the writer’s attitude toward a particular topic or product, and it is useful in many applications, such as processing customer reviews or social media comments.

This example demonstrates how to implement a simple sentiment classifier using logistic regression. It’s surprising how well such a relatively simple model performs on this class of tasks.
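As a quick refresher (standard background, not specific to this example), logistic regression models the probability that a review is positive as a sigmoid applied to a weighted sum of the input features:

P(y = 1 \mid x) = \sigma(w \cdot x + b) = \frac{1}{1 + e^{-(w \cdot x + b)}}

Here x will be the vector of (scaled) word counts for a review, and the weights w and the bias b are learned during training.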

Data Preparation

Let’s start with Stanford’s IMDB sentiment dataset. It contains 50,000 movie reviews, each tagged as either positive or negative.

from datasets import load_dataset
ds = load_dataset('stanfordnlp/imdb', split='train+test')
train, test = ds.train_test_split(test_size=0.2, seed=0).values()
display(train.to_pandas())

x_train = train['text']
y_train = train['label']
x_test = test['text']
y_test = test['label']
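It’s also worth a quick, optional sanity check (just reusing the variables defined above) that both splits stay roughly balanced between positive and negative reviews:

# Optional: confirm the label distribution is roughly 50/50 in both splits
print(train.to_pandas()['label'].value_counts())
print(test.to_pandas()['label'].value_counts())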

Building and Training the Model

Alright, now that we’re done with the data, it’s time to build the classification pipeline. The first step is vectorization - the process of turning strings into numerical vectors so we can manipulate them mathematically. The easiest way to do this is to use a count vectorizer.

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

The idea is pretty simple.

First, it scans all the text to build a vocabulary of all the unique words it finds. Then, for each sentence, it creates a numerical list (vector) where each number describes how many times a specific word from that vocabulary appears in that sentence.

example_vector = vectorizer.fit_transform(['Hello World! Hello!', 'Bye World']).toarray()
display(vectorizer.get_feature_names_out(), example_vector)
array(['bye', 'hello', 'world'], dtype=object)
array([[0, 2, 1], [1, 0, 1]])

Sounds easy, right? Next, let’s define a scaler.

It’s mostly optional, but it rescales numerical features (like word counts) so they have a similar range. This can help the model learn better by preventing features with naturally larger values from unfairly dominating the learning process, ensuring all features contribute more equally.

from sklearn.preprocessing import StandardScaler
# with_mean=False skips centering, which keeps sparse count matrices sparse
scaler = StandardScaler(with_mean=False)
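To make the effect concrete, here’s a minimal sketch on a made-up count matrix (not our real data): each column is divided by its standard deviation, and with_mean=False means no centering is applied, so sparse matrices would stay sparse.

import numpy as np
# Toy count matrix: rows are documents, columns are word counts
toy_counts = np.array([[0, 2, 1], [1, 0, 1]])
# Each column is divided by its standard deviation; with_mean=False skips centering
print(StandardScaler(with_mean=False).fit_transform(toy_counts))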

Finally, let’s define a classifier (no fancy configuration here yet) and stick everything into an elegant pipeline. That is going to be our final model architecture.

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
pipeline = Pipeline([
    ('vectorizer', vectorizer),
    ('scaler', scaler),
    ('classifier', classifier),
])

We could start training our model right away, but...

It wouldn’t be ideal in terms of its hyperparameters - the values that define how our pipeline works. These are settings we choose before training, like regularization strength or how our text vectorizer processes words. They significantly control how the model learns and how well it ultimately performs.
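Every step in the pipeline exposes its settings through a common interface, addressed as step__parameter. As a quick, purely informational sketch, we can list everything that could be tuned:

# List all tunable parameter names of the pipeline, e.g. 'vectorizer__ngram_range'
print(sorted(pipeline.get_params().keys()))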

Manually trying every possible combination of these hyperparameters would be incredibly tedious. Instead, we can use automated hyperparameter tuning techniques. One such technique is randomized search - it randomly samples different combinations of hyperparameters from a pre-defined grid.

We may tune the following parameters:

  • Classifier C: Regularization strength of the LogisticRegression classifier (technically, C is the inverse of the regularization strength). Smaller values mean stronger regularization (less prone to overfitting), while larger values mean weaker regularization (able to capture more nuances in noisy data).
  • Vectorizer ngram_range: This is crucial for capturing context! Instead of just looking at individual words (unigrams), n-grams allow us to consider sequences of words as single features. Using n-grams beyond unigrams often significantly improves performance in text tasks by providing more contextual information to the model, but it also increases the vocabulary size (see the short sketch after this list).
  • Vectorizer max_df: Maximum document frequency - ignore terms that appear in more than this fraction of the documents. Smaller values exclude more common terms (good for noise reduction), but values that are too small may discard important common signals (underfitting).
  • Vectorizer min_df: Minimum document frequency - ignore terms that appear in fewer than min_df documents. Smaller values may lead to huge, noisy vocabularies, while larger ones may discard rare but specific signals.
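Here is the short sketch mentioned above - a toy sentence (made up for illustration) vectorized with unigrams only versus unigrams plus bigrams. Phrases like “not good” only become features in the second case:

# Compare extracted features with and without bigrams
toy_text = ['the movie was not good']
print(CountVectorizer(ngram_range=(1, 1)).fit(toy_text).get_feature_names_out())
print(CountVectorizer(ngram_range=(1, 2)).fit(toy_text).get_feature_names_out())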
param_grid = {
    'classifier__C': [0.1, 1, 10],
    'vectorizer__ngram_range': [(1, 1), (1, 2)], 
    'vectorizer__max_df': [0.85, 0.90, 0.95, 1.0],
    'vectorizer__min_df': [1, 2, 3, 5],
}
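Under the hood, randomized search simply draws candidate combinations from this grid. As a minimal sketch of that sampling step (shown only for illustration - RandomizedSearchCV does this internally), we could use scikit-learn’s ParameterSampler:

from sklearn.model_selection import ParameterSampler
# Draw a few random hyperparameter combinations from the grid above
for params in ParameterSampler(param_grid, n_iter=3, random_state=0):
    print(params)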

Everything is ready - let’s train our model now.

%%capture --no-stdout
from joblib import parallel_backend
from sklearn.model_selection import RandomizedSearchCV
# By default, RandomizedSearchCV evaluates n_iter=10 random combinations with 5-fold cross-validation
cv = RandomizedSearchCV(pipeline, param_grid, random_state=0, n_jobs=-1, verbose=3)
cv.fit(x_train, y_train)
Output
Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV 2/5] END classifier__C=0.1, vectorizer__max_df=1.0, vectorizer__min_df=2, vectorizer__ngram_range=(1, 1);, score=0.874 total time=   5.7s
[CV 1/5] END classifier__C=0.1, vectorizer__max_df=1.0, vectorizer__min_df=2, vectorizer__ngram_range=(1, 1);, score=0.869 total time=   5.7s
[CV 3/5] END classifier__C=0.1, vectorizer__max_df=1.0, vectorizer__min_df=2, vectorizer__ngram_range=(1, 1);, score=0.873 total time=   5.8s
[CV 4/5] END classifier__C=0.1, vectorizer__max_df=1.0, vectorizer__min_df=2, vectorizer__ngram_range=(1, 1);, score=0.871 total time=   5.7s
[CV 1/5] END classifier__C=10, vectorizer__max_df=0.95, vectorizer__min_df=3, vectorizer__ngram_range=(1, 1);, score=0.872 total time=   5.5s
[CV 5/5] END classifier__C=0.1, vectorizer__max_df=1.0, vectorizer__min_df=2, vectorizer__ngram_range=(1, 1);, score=0.871 total time=   5.7s
[CV 2/5] END classifier__C=10, vectorizer__max_df=0.95, vectorizer__min_df=3, vectorizer__ngram_range=(1, 1);, score=0.877 total time=   5.5s
[CV 3/5] END classifier__C=10, vectorizer__max_df=0.95, vectorizer__min_df=3, vectorizer__ngram_range=(1, 1);, score=0.876 total time=   5.7s
[CV 4/5] END classifier__C=10, vectorizer__max_df=0.95, vectorizer__min_df=3, vectorizer__ngram_range=(1, 1);, score=0.866 total time=   6.4s
[CV 5/5] END classifier__C=10, vectorizer__max_df=0.95, vectorizer__min_df=3, vectorizer__ngram_range=(1, 1);, score=0.867 total time=   6.4s
[CV 1/5] END classifier__C=0.1, vectorizer__max_df=0.85, vectorizer__min_df=2, vectorizer__ngram_range=(1, 1);, score=0.870 total time=   6.5s
[CV 2/5] END classifier__C=0.1, vectorizer__max_df=0.85, vectorizer__min_df=2, vectorizer__ngram_range=(1, 1);, score=0.876 total time=   6.5s
[CV 3/5] END classifier__C=0.1, vectorizer__max_df=0.85, vectorizer__min_df=2, vectorizer__ngram_range=(1, 1);, score=0.874 total time=   6.6s
[CV 4/5] END classifier__C=0.1, vectorizer__max_df=0.85, vectorizer__min_df=2, vectorizer__ngram_range=(1, 1);, score=0.872 total time=   6.7s
[CV 5/5] END classifier__C=0.1, vectorizer__max_df=0.85, vectorizer__min_df=2, vectorizer__ngram_range=(1, 1);, score=0.872 total time=   6.1s
[CV 1/5] END classifier__C=10, vectorizer__max_df=0.85, vectorizer__min_df=3, vectorizer__ngram_range=(1, 1);, score=0.874 total time=   6.0s
[CV 2/5] END classifier__C=10, vectorizer__max_df=0.85, vectorizer__min_df=3, vectorizer__ngram_range=(1, 1);, score=0.875 total time=   6.1s
[CV 3/5] END classifier__C=10, vectorizer__max_df=0.85, vectorizer__min_df=3, vectorizer__ngram_range=(1, 1);, score=0.876 total time=   6.3s
[CV 1/5] END classifier__C=1, vectorizer__max_df=0.95, vectorizer__min_df=5, vectorizer__ngram_range=(1, 2);, score=0.901 total time=  19.0s
[CV 4/5] END classifier__C=10, vectorizer__max_df=0.85, vectorizer__min_df=3, vectorizer__ngram_range=(1, 1);, score=0.874 total time=   8.1s
[CV 5/5] END classifier__C=10, vectorizer__max_df=0.85, vectorizer__min_df=3, vectorizer__ngram_range=(1, 1);, score=0.870 total time=   8.2s
[CV 2/5] END classifier__C=1, vectorizer__max_df=0.95, vectorizer__min_df=5, vectorizer__ngram_range=(1, 2);, score=0.907 total time=  22.7s
[CV 3/5] END classifier__C=1, vectorizer__max_df=0.95, vectorizer__min_df=5, vectorizer__ngram_range=(1, 2);, score=0.906 total time=  22.6s
[CV 5/5] END classifier__C=1, vectorizer__max_df=0.95, vectorizer__min_df=5, vectorizer__ngram_range=(1, 2);, score=0.900 total time=  22.7s
[CV 4/5] END classifier__C=1, vectorizer__max_df=0.95, vectorizer__min_df=5, vectorizer__ngram_range=(1, 2);, score=0.904 total time=  22.9s
[CV 1/5] END classifier__C=0.1, vectorizer__max_df=0.95, vectorizer__min_df=1, vectorizer__ngram_range=(1, 1);, score=0.873 total time=   6.9s
[CV 2/5] END classifier__C=0.1, vectorizer__max_df=0.95, vectorizer__min_df=1, vectorizer__ngram_range=(1, 1);, score=0.878 total time=   7.1s
[CV 3/5] END classifier__C=0.1, vectorizer__max_df=0.95, vectorizer__min_df=1, vectorizer__ngram_range=(1, 1);, score=0.876 total time=   7.0s
[CV 4/5] END classifier__C=0.1, vectorizer__max_df=0.95, vectorizer__min_df=1, vectorizer__ngram_range=(1, 1);, score=0.870 total time=   7.7s
[CV 1/5] END classifier__C=10, vectorizer__max_df=0.9, vectorizer__min_df=5, vectorizer__ngram_range=(1, 1);, score=0.868 total time=   7.5s
[CV 5/5] END classifier__C=0.1, vectorizer__max_df=0.95, vectorizer__min_df=1, vectorizer__ngram_range=(1, 1);, score=0.873 total time=   8.2s
[CV 1/5] END classifier__C=10, vectorizer__max_df=0.95, vectorizer__min_df=1, vectorizer__ngram_range=(1, 2);, score=0.897 total time=  32.0s
[CV 2/5] END classifier__C=10, vectorizer__max_df=0.9, vectorizer__min_df=5, vectorizer__ngram_range=(1, 1);, score=0.869 total time=   7.7s
[CV 2/5] END classifier__C=10, vectorizer__max_df=0.95, vectorizer__min_df=1, vectorizer__ngram_range=(1, 2);, score=0.906 total time=  32.3s
[CV 3/5] END classifier__C=10, vectorizer__max_df=0.9, vectorizer__min_df=5, vectorizer__ngram_range=(1, 1);, score=0.872 total time=   7.9s
[CV 4/5] END classifier__C=10, vectorizer__max_df=0.9, vectorizer__min_df=5, vectorizer__ngram_range=(1, 1);, score=0.870 total time=   7.7s
[CV 3/5] END classifier__C=10, vectorizer__max_df=0.95, vectorizer__min_df=1, vectorizer__ngram_range=(1, 2);, score=0.905 total time=  32.7s
[CV 5/5] END classifier__C=10, vectorizer__max_df=0.9, vectorizer__min_df=5, vectorizer__ngram_range=(1, 1);, score=0.870 total time=   8.3s
[CV 4/5] END classifier__C=10, vectorizer__max_df=0.95, vectorizer__min_df=1, vectorizer__ngram_range=(1, 2);, score=0.897 total time=  33.5s
[CV 1/5] END classifier__C=1, vectorizer__max_df=0.95, vectorizer__min_df=5, vectorizer__ngram_range=(1, 1);, score=0.874 total time=   6.6s
[CV 5/5] END classifier__C=10, vectorizer__max_df=0.95, vectorizer__min_df=1, vectorizer__ngram_range=(1, 2);, score=0.896 total time=  30.8s
[CV 2/5] END classifier__C=1, vectorizer__max_df=0.95, vectorizer__min_df=5, vectorizer__ngram_range=(1, 1);, score=0.876 total time=   6.4s
[CV 3/5] END classifier__C=1, vectorizer__max_df=0.95, vectorizer__min_df=5, vectorizer__ngram_range=(1, 1);, score=0.878 total time=   6.5s
[CV 4/5] END classifier__C=1, vectorizer__max_df=0.95, vectorizer__min_df=5, vectorizer__ngram_range=(1, 1);, score=0.873 total time=   5.8s
[CV 5/5] END classifier__C=1, vectorizer__max_df=0.95, vectorizer__min_df=5, vectorizer__ngram_range=(1, 1);, score=0.872 total time=   5.8s
[CV 1/5] END classifier__C=10, vectorizer__max_df=0.85, vectorizer__min_df=2, vectorizer__ngram_range=(1, 1);, score=0.868 total time=   5.8s
[CV 2/5] END classifier__C=10, vectorizer__max_df=0.85, vectorizer__min_df=2, vectorizer__ngram_range=(1, 1);, score=0.877 total time=   5.6s
[CV 3/5] END classifier__C=10, vectorizer__max_df=0.85, vectorizer__min_df=2, vectorizer__ngram_range=(1, 1);, score=0.873 total time=   4.2s
[CV 4/5] END classifier__C=10, vectorizer__max_df=0.85, vectorizer__min_df=2, vectorizer__ngram_range=(1, 1);, score=0.874 total time=   4.0s
[CV 5/5] END classifier__C=10, vectorizer__max_df=0.85, vectorizer__min_df=2, vectorizer__ngram_range=(1, 1);, score=0.863 total time=   3.9s
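Before moving to the test set, it may be worth checking the best mean cross-validation accuracy the search found (the winning parameters themselves are shown in the next section):

# Best mean cross-validation accuracy across the 5 folds
print(cv.best_score_)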

Result

from sklearn.metrics import classification_report
prediction = cv.best_estimator_.predict(x_test)
print(classification_report(y_test, prediction, target_names=ds.features['label'].names))
              precision    recall  f1-score   support

         neg       0.91      0.90      0.90      5025
         pos       0.90      0.91      0.90      4975

    accuracy                           0.90     10000
   macro avg       0.90      0.90      0.90     10000
weighted avg       0.90      0.90      0.90     10000

from pandas import DataFrame
best_params_table = DataFrame.from_dict(cv.best_params_.items())
display(best_params_table.style.hide(axis=0).hide(axis=1))
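Finally, as a quick usage sketch, the tuned pipeline can score raw text directly. The two sentences below are made up for illustration, and the labels follow the dataset’s encoding (0 = neg, 1 = pos):

# Predict sentiment for new, unseen text (0 = neg, 1 = pos)
samples = [
    'An absolute masterpiece, I loved every minute of it.',
    'A dull, predictable waste of two hours.',
]
print(cv.best_estimator_.predict(samples))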

Conclusion

Our logistic regression model reached an accuracy of 90%.

This demonstrates the effectiveness of classical machine learning techniques for this type of text classification task. One of the key factors contributing to this performance was the use of n-grams to capture local context.

While this approach is already highly effective, future improvements could involve exploring more complex vectorization techniques, experimenting with more advanced text pre-processing (like stemming or lemmatization), or even moving to deep learning models for sequence understanding.