Sentiment analysis of e-commerce product reviews with around 94% accuracy

Birenpatel
5 min read · Feb 13, 2021

Our goal is to mine the opinion (positive or negative) out of text reviews. We will see how to build a deep learning model that accurately mines the sentiment of new, unseen reviews.

sentiment analysis

As we know, in order to train a supervised model we need labeled data. Here we will be using the ‘Women’s E-Commerce Clothing Reviews’ dataset from Kaggle, which has around 20k text reviews along with ratings.

Let’s have a quick look at the data

df.head(2)
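For completeness, the dataframe above can be loaded roughly like this; the file name is the one the Kaggle dataset is usually distributed under, so treat it as an assumption:

import pandas as pd

# Load the Kaggle dataset (file name assumed)
df = pd.read_csv("Womens Clothing E-Commerce Reviews.csv")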

The columns we are interested in are ‘Review Text’, ‘Rating’, and ‘Recommended IND’, so we can drop the other columns that we don’t need.
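A minimal sketch of keeping only those columns:

# Keep only the columns needed for this problem
df = df[['Review Text', 'Rating', 'Recommended IND']]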

Our goal is to build a model that can predict whether the user recommends a particular product/service or not, in other words, we want to predict whether the user’s review is positive or negative.

We can combine the rating and recommended columns to derive labeled data with the sentiment. We can consider reviews with ratings 1 and 2 as negative and rating 5 as positive. Reviews with ratings 3 and 4 are harder to label, but we can use the following logic: if a person gives 3 or 4 stars and does not recommend the product to others, it is most likely a negative review; if he/she does recommend the product, the review is positive, even though he/she sees some scope for improvement. Using this logic we can come up with a target column, sentiment, having the values ‘POSITIVE’ (1) and ‘NEGATIVE’ (0).
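One way to encode this labeling logic; the column names come from the dataset, while the helper function itself is just an illustrative sketch:

def to_sentiment(row):
    # 1-2 stars are negative, 5 stars are positive
    if row['Rating'] <= 2:
        return 0
    if row['Rating'] == 5:
        return 1
    # For 3-4 stars, fall back on whether the product is recommended
    return 1 if row['Recommended IND'] == 1 else 0

df['sentiment'] = df.apply(to_sentiment, axis=1)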

We can see that the dataset is not balanced: we have more positive reviews than negative reviews. We need to balance the dataset, otherwise our deep learning model will be biased towards the positive class.

Considering our dataset size, we can use both oversampling (duplicating negative reviews) and under-sampling (removing some positive reviews) to get the balance.
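A rough sketch of that balancing step using sklearn’s resample; the target size per class is an assumption:

import pandas as pd
from sklearn.utils import resample

positive = df[df['sentiment'] == 1]
negative = df[df['sentiment'] == 0]

target = 10000  # assumed target size per class
positive_down = resample(positive, replace=False, n_samples=target, random_state=42)  # under-sample
negative_up = resample(negative, replace=True, n_samples=target, random_state=42)     # over-sample

# Recombine and shuffle
df = pd.concat([positive_down, negative_up]).sample(frac=1, random_state=42)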

Next, we can handle missing values in the dataset; regardless of the problem and the model, it is usually a good idea to handle missing values.

I have created the above visualization using the ‘missingno’ library. White lines represent missing values for the particular column; in our case, it is the review text. As we can see, only a few records have missing values, hence we can simply drop those records.
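Roughly how the visualization and the drop can be done:

import missingno as msno

msno.matrix(df)                         # white gaps mark missing values per column
df = df.dropna(subset=['Review Text'])  # drop the few rows with no review text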

Removing stopwords can potentially increase performance and accuracy, since only relevant, limited, and important tokens are left to process. However, the list of stopwords is not common to all problems; any word can be a stopword depending on the problem you are solving. In our case, we will remove English stopwords except for negations. We will not remove negative words such as ‘no’, ‘not’, ‘neither’, ‘nor’, etc., because they play an important role in deciding the overall sentiment of a statement, and removing them might degrade model performance.
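A sketch of this using NLTK’s English stopword list minus a handful of negation words; the exact negation set is an assumption:

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

# Keep negations out of the stopword list so sentiment cues are preserved
negations = {'no', 'not', 'nor', 'neither', 'never', "don't", "isn't", "wasn't"}
stop_words = set(stopwords.words('english')) - negations

def remove_stopwords(text):
    return ' '.join(w for w in text.lower().split() if w not in stop_words)

df['Review Text'] = df['Review Text'].apply(remove_stopwords)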

We have done enough preprocessing for this problem, so we can proceed with the other components of the machine learning pipeline: splitting the dataset into training and testing parts, tokenization, vectorization, and finally building the deep learning model.

Here is the sklearn code to split the dataset:

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=1000)

Next, after tokenizing, we transform each text into a sequence of integers using Keras’ texts_to_sequences method.

sequences = tokenizer.texts_to_sequences(x_train)

Next, we transform this list (of length num_samples) of sequences (lists of integers) into a 2D NumPy array of shape (num_samples, num_timesteps) using pad_sequences.

from tensorflow.keras.preprocessing import sequence

seq_Matrix = sequence.pad_sequences(sequences, maxlen=max_len, truncating='post')

Here max_len is the length of the longest sequence in the list. Sequences that are shorter than max_len are padded with zeros until they are max_len long.
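For reference, the tokenizer, vocab_size, and max_len used above (and in the model below) can be obtained roughly like this; details such as a num_words cap may differ from the original setup:

from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(x_train)             # build the vocabulary from the training texts

vocab_size = len(tokenizer.word_index) + 1  # +1 for the padding index 0
sequences = tokenizer.texts_to_sequences(x_train)
max_len = max(len(s) for s in sequences)    # longest training sequence, used as maxlen above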

Finally, it’s time to build our LSTM deep learning model. Long Short-Term Memory (LSTM) was created as a solution to the short-term memory problem of a plain RNN. LSTMs have internal mechanisms called gates that can regulate the flow of information. These gates can learn which data in a sequence is important to keep or throw away. By doing that, they can pass relevant information down a long chain of sequences to make predictions.

import tensorflow as tf
from tensorflow.keras.regularizers import L1L2

def RNN():
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(input_dim=vocab_size, input_length=max_len, output_dim=64, trainable=True),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.LSTM(100),
        tf.keras.layers.Dense(128, activation='relu', kernel_regularizer=L1L2(l1=0, l2=0.001)),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])
    return model

The first layer is an Embedding layer, which we use here to derive word embeddings of the training records. Word embeddings provide a dense representation of words and their meaning. Here output_dim is simply the dimension of the dense embedding.

We have used dropout layers to generalize the model and to avoid overfitting on the training data. Dropout randomly sets input units to 0, which effectively removes edges from the dense network and makes it lighter.

The third layer is an LSTM with 100 units. The fourth layer is a dense layer that builds a deep neural network on top of the features produced by the LSTM layer; in this layer we are using ReLU as the activation function along with L1/L2 regularization.

The final layer is a single-unit dense layer with a sigmoid activation function, which classifies the text into the two classes, i.e. positive and negative.

Once we are done with the model building, we can proceed with compiling and fitting the model using a suitable optimizer; here I have used the Adam optimizer, which you can read more about online.
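A minimal sketch of that compile-and-fit step; the validation split is an assumption, while the epoch count matches the training log below:

model = RNN()
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model.fit(seq_Matrix, y_train,
          epochs=5,               # matches the 5-epoch log below
          validation_split=0.2)   # assumed split for the val_loss / val_accuracy numbers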

Here is the summary of the 5 epochs that I ran while training.

Epoch 1/5
4020/4020 [==============================] - 351s 87ms/step - loss: 0.4415 - accuracy: 0.8046 - val_loss: 0.2579 - val_accuracy: 0.9035
Epoch 2/5
4020/4020 [==============================] - 350s 87ms/step - loss: 0.2133 - accuracy: 0.9265 - val_loss: 0.2368 - val_accuracy: 0.9136
Epoch 3/5
4020/4020 [==============================] - 352s 87ms/step - loss: 0.1448 - accuracy: 0.9504 - val_loss: 0.2240 - val_accuracy: 0.9292
Epoch 4/5
4020/4020 [==============================] - 352s 87ms/step - loss: 0.1003 - accuracy: 0.9682 - val_loss: 0.2154 - val_accuracy: 0.9325
Epoch 5/5
4020/4020 [==============================] - 352s 88ms/step - loss: 0.0754 - accuracy: 0.9748 - val_loss: 0.2109 - val_accuracy: 0.9405

As you can see, we have achieved an impressive accuracy of around 94%.

But accuracy is not the right metric to evaluate our model on, because our classes were originally imbalanced. We should also measure precision and recall, which can be done easily using sklearn’s classification report. Let’s run it on our test data.
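The report below can be produced roughly like this; the 0.5 decision threshold is an assumption:

from sklearn.metrics import classification_report

test_sequences = tokenizer.texts_to_sequences(x_test)
test_matrix = sequence.pad_sequences(test_sequences, maxlen=max_len, truncating='post')

y_pred = (model.predict(test_matrix) > 0.5).astype('int32')  # 0.5 threshold assumed
print(classification_report(y_test, y_pred, digits=3))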

              precision    recall  f1-score   support

           0      0.917     0.969     0.942      4099
           1      0.968     0.916     0.941      4275

    accuracy                          0.941      8374
   macro avg      0.942     0.942     0.941      8374
weighted avg      0.943     0.941     0.941      8374

Hurray, we have achieved around 94% recall, precision, and F1-score, and that’s pretty impressive!

You can now start predicting on your own sample reviews for fun.
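For example, a quick sketch of scoring a hand-written review (the sample text is made up):

sample = ["love this dress, fits perfectly and the fabric feels great"]
sample_seq = sequence.pad_sequences(tokenizer.texts_to_sequences(sample),
                                    maxlen=max_len, truncating='post')
print('POSITIVE' if model.predict(sample_seq)[0][0] > 0.5 else 'NEGATIVE')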

Kudos to us. Happy learning.

Follow me for more such interesting articles and upcoming state-of-the-art models.

Let’s connect on LinkedIn to build such amazing projects together. Here is my LinkedIn profile: https://www.linkedin.com/in/biren162/

Image credit: Google Images
