Natural Language Processing
This notebook runs with the Python 3.9 kernel and can be easily downloaded to run, test, and modify locally. It uses data from a CSV file containing 40,000 Amazon reviews. It currently uses the review text to predict the category of a given review, but this can be changed by replacing the "Cat1" references with any other label column, such as Score or the other category columns Cat2 and Cat3.
pip install scikit-learn numpy pandas
import numpy as np
import pandas as pd
import re
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2, SelectKBest
from sklearn.svm import LinearSVC
3. Load and Preprocess Data
The dataset used is train_40k.csv from this Kaggle page. Loading the dataset into a pandas dataframe (df) allows for preprocessing: keeping only the relevant columns (review text and category) and cleaning the text by removing punctuation and lowercasing it.
df_original = pd.read_csv('/Users/xiaoa1/VSCode/ds/_notebooks/data/train_40k.csv')
columns = ['Text', 'Cat1']
df = shuffle(df_original[columns])
df.Cat1.value_counts()
p = re.compile(r'[^\w\s]+')                                # matches runs of punctuation characters
df['Text'] = [p.sub('', x) for x in df['Text'].tolist()]   # strip punctuation from each review
df = df.apply(lambda x: x.astype(str).str.lower())         # lowercase the text (assign back so the change sticks)
x,y = df.Text, df.Cat1
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.2, random_state=4000)
pipeline = Pipeline([
    ('vect', TfidfVectorizer(ngram_range=(1, 2), stop_words='english', sublinear_tf=True)),
    ('chi', SelectKBest(chi2, k=10000)),
    ('clf', LinearSVC(C=1.0, penalty='l1', max_iter=3000, dual=False))
])
In this case, TfidfVectorizer converts text into numerical feature vectors using the TF-IDF (Term Frequency-Inverse Document Frequency) representation; a small example follows the list below.
- ngram_range=(1,2): Counts both single words and pairs of words as features
- stop_words='english': Excludes common English stop words that don't carry much info
- sublinear_tf=True: Replaces the raw term frequency with 1 + log(tf), which dampens the influence of very frequent words.
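As a quick illustration of what the vectorizer produces, here is a minimal sketch on a made-up three-review corpus, reusing the TfidfVectorizer imported above (the toy_docs and toy_vect names are only for this example, and get_feature_names_out requires scikit-learn 1.0 or newer):
toy_docs = ['the coffee tasted great', 'the coffee tasted stale', 'great value for the price']
toy_vect = TfidfVectorizer(ngram_range=(1, 2), stop_words='english', sublinear_tf=True)
toy_matrix = toy_vect.fit_transform(toy_docs)   # sparse matrix of TF-IDF weights, one row per review
print(toy_vect.get_feature_names_out())         # the unigrams and bigrams kept after stop-word removal
print(toy_matrix.shape)                         # (3 documents, number of n-gram features)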
SelectKBest selects the most informative features, using the chi-squared test as the scoring function (a short example follows the list).
- k=10000: Top 10,000 features with the highest chi-squared statistics will be selected
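To see what the chi-squared selection does in isolation, here is a minimal sketch on invented data (the docs/labels values and the CountVectorizer features are hypothetical; the real pipeline feeds it TF-IDF features instead):
from sklearn.feature_extraction.text import CountVectorizer
docs = ['great coffee', 'stale coffee', 'great chips', 'stale chips']
labels = ['drinks', 'drinks', 'snacks', 'snacks']
vect = CountVectorizer()
counts = vect.fit_transform(docs)                              # raw word counts as features
selector = SelectKBest(chi2, k=2)                              # keep the 2 words most associated with the label
selected = selector.fit_transform(counts, labels)
print(vect.get_feature_names_out()[selector.get_support()])    # expect ['chips' 'coffee']; 'great'/'stale' carry no class signal
print(selected.shape)                                          # (4 documents, 2 features)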
LinearSVC is the Linear Support Vector Classifier, a linear model for classification (a standalone example follows the list).
- C=1.0: Inverse regularization strength; smaller values mean stronger regularization
- penalty='l1': Uses the L1 regularization penalty, which drives many feature weights to exactly zero
- max_iter=3000: Max number of iterations for convergence
- dual=False: Solves the primal rather than the dual optimization problem, preferred when the number of samples exceeds the number of features and required for the L1 penalty
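The classifier can also be tried on its own with a tiny hand-made numeric dataset (X_toy and y_toy are invented purely for illustration) to see that it learns one weight per feature and predicts the class on the better side of the decision boundary:
X_toy = np.array([[0.0, 1.0], [0.1, 0.9], [1.0, 0.0], [0.9, 0.2]])
y_toy = ['drinks', 'drinks', 'snacks', 'snacks']
svc = LinearSVC(C=1.0, penalty='l1', max_iter=3000, dual=False)
svc.fit(X_toy, y_toy)
print(svc.coef_)                    # one learned weight per input feature
print(svc.predict([[0.2, 0.8]]))    # expected: ['drinks']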
Pipeline is the main class that connects the components. Each component is specified as a tuple of (name, estimator), where the name is a string and the estimator is an instance of a transformer or classifier.
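Because every step has a name, individual components can be pulled back out of the pipeline through named_steps, which is useful for inspecting or debugging a single stage:
print(pipeline.named_steps['vect'])   # the TfidfVectorizer instance
print(pipeline.named_steps['chi'])    # the SelectKBest instance
print(pipeline.named_steps['clf'])    # the LinearSVC instance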
model = pipeline.fit(train_x, train_y)
print('accuracy score: ' + str(model.score(test_x, test_y)))
print(model.predict(['price is good but tasted a bit stale'])) # type your review here to see the predicted category! exclude punctuation
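As an optional extension, scikit-learn's classification_report breaks the result down into per-category precision, recall, and F1 rather than a single accuracy number:
from sklearn.metrics import classification_report
print(classification_report(test_y, model.predict(test_x)))   # per-category precision, recall, and F1 on the test split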