Natural Language Processing
This notebook runs with the Python 3.9 kernel and can be easily downloaded to run, test, and modify locally. It uses data from a CSV file containing 40,000 Amazon reviews. It currently uses the review text to predict the category of a given review, but this can be changed by replacing the "Cat1" references with any other label column, such as Score or the other category columns Cat2 and Cat3.
pip install scikit-learn numpy pandas
import numpy as np
import pandas as pd
import re
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2, SelectKBest
from sklearn.svm import LinearSVC
3. Load and Preprocess Data
The dataset used is train_40k.csv from this Kaggle page. Loading the dataset into a pandas dataframe (df) allows for preprocessing: keeping only the relevant columns (review text and category) and cleaning the text by removing punctuation and lowercasing it.
df_original = pd.read_csv('/Users/xiaoa1/VSCode/ds/_notebooks/data/train_40k.csv')
columns = ['Text', 'Cat1']
df = shuffle(df_original[columns])
df.Cat1.value_counts()
p = re.compile(r'[^\w\s]+')                                # matches runs of punctuation characters
df['Text'] = [p.sub('', x) for x in df['Text'].tolist()]   # strip punctuation from each review
df = df.apply(lambda x: x.astype(str).str.lower())         # lowercase the text (assign back so the change sticks)
x,y = df.Text, df.Cat1
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.2, random_state=4000)
pipeline = Pipeline([
    ('vect', TfidfVectorizer(ngram_range=(1, 2), stop_words='english', sublinear_tf=True)),
    ('chi', SelectKBest(chi2, k=10000)),
    ('clf', LinearSVC(C=1.0, penalty='l1', max_iter=3000, dual=False))
])
In this case, TfidfVectorizer converts text into numerical feature vectors using the TF-IDF (Term Frequency-Inverse Document Frequency) representation; a small example follows the list below.
- ngram_range=(1,2): Counts both single words and pairs of words as features
- stop_words='english': Excludes common English stop words that don't carry much info
- sublinear_tf=True: Replaces the raw term frequency with 1 + log(tf), which dampens the influence of very frequent words.
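As a quick illustration of what the vectorizer produces, here is a minimal sketch on a made-up three-review corpus, reusing the TfidfVectorizer imported above (the toy_docs and toy_vect names are only for this example, and get_feature_names_out requires scikit-learn 1.0 or newer):
toy_docs = ['the coffee tasted great', 'the coffee tasted stale', 'great value for the price']
toy_vect = TfidfVectorizer(ngram_range=(1, 2), stop_words='english', sublinear_tf=True)
toy_matrix = toy_vect.fit_transform(toy_docs)   # sparse matrix of TF-IDF weights, one row per review
print(toy_vect.get_feature_names_out())         # the unigrams and bigrams kept after stop-word removal
print(toy_matrix.shape)                         # (3 documents, number of n-gram features)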
SelectKBest selects the most informative features, using the chi-squared test as the scoring function (a short example follows the list).
- k=10000: Top 10,000 features with the highest chi-squared statistics will be selected
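To see what the chi-squared selection does in isolation, here is a minimal sketch on invented data (the docs/labels values and the CountVectorizer features are hypothetical; the real pipeline feeds it TF-IDF features instead):
from sklearn.feature_extraction.text import CountVectorizer
docs = ['great coffee', 'stale coffee', 'great chips', 'stale chips']
labels = ['drinks', 'drinks', 'snacks', 'snacks']
vect = CountVectorizer()
counts = vect.fit_transform(docs)                              # raw word counts as features
selector = SelectKBest(chi2, k=2)                              # keep the 2 words most associated with the label
selected = selector.fit_transform(counts, labels)
print(vect.get_feature_names_out()[selector.get_support()])    # expect ['chips' 'coffee']; 'great'/'stale' carry no class signal
print(selected.shape)                                          # (4 documents, 2 features)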
LinearSVC is the Linear Support Vector Classifier, a linear model for classification (a standalone example follows the list).
- C=1.0: Inverse regularization strength; smaller values mean stronger regularization
- penalty='l1': Uses the L1 regularization penalty, which drives many feature weights to exactly zero
- max_iter=3000: Max number of iterations for convergence
- dual=False: Solves the primal rather than the dual optimization problem, preferred when the number of samples exceeds the number of features and required for the L1 penalty
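The classifier can also be tried on its own with a tiny hand-made numeric dataset (X_toy and y_toy are invented purely for illustration) to see that it learns one weight per feature and predicts the class on the better side of the decision boundary:
X_toy = np.array([[0.0, 1.0], [0.1, 0.9], [1.0, 0.0], [0.9, 0.2]])
y_toy = ['drinks', 'drinks', 'snacks', 'snacks']
svc = LinearSVC(C=1.0, penalty='l1', max_iter=3000, dual=False)
svc.fit(X_toy, y_toy)
print(svc.coef_)                    # one learned weight per input feature
print(svc.predict([[0.2, 0.8]]))    # expected: ['drinks']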
Pipeline is the main class that connects the components. Each component is specified as a tuple of (name, estimator), where the name is a string and the estimator is an instance of a transformer or classifier.
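Because every step has a name, individual components can be pulled back out of the pipeline through named_steps, which is useful for inspecting or debugging a single stage:
print(pipeline.named_steps['vect'])   # the TfidfVectorizer instance
print(pipeline.named_steps['chi'])    # the SelectKBest instance
print(pipeline.named_steps['clf'])    # the LinearSVC instance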
model = pipeline.fit(train_x, train_y)
print('accuracy score: ' + str(model.score(test_x, test_y)))
print(model.predict(['price is good but tasted a bit stale'])) # type your review here to see the predicted category! exclude punctuation
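As an optional extension, scikit-learn's classification_report breaks the result down into per-category precision, recall, and F1 rather than a single accuracy number:
from sklearn.metrics import classification_report
print(classification_report(test_y, model.predict(test_x)))   # per-category precision, recall, and F1 on the test split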