Developing a Semantic Classifier


Some Background

Lately I've developed some interest in semantic analysis of unstructured text data. I wished to test my text analysis pipeline on temporal natural language text data, but I couldn't find any in Finnish. As an alternative, I chose the results of the Nice city centre for visitors survey, which contains citizen feedback about the development of the city centre.

Over time, several sophisticated methods have been developed to process, embed, and predict meaning from natural language, be it speech or text. I've decided to apply simple text preprocessing methods to Finnish text data, using the Voikko module for lemmatization to create a training dataset for predicting simple semantic meaning from natural language text. I'll test several different machine learning models on my dataset and compare their classification results. I'll omit more robust neural network models, since they require more computing power and my dataset is too small for neural network training.

I've read and labelled the citizen feedback survey's results as follows:

Summary    Feedback type
1          Positive
2          Negative
3          Mixed

It would be possible to include the semantically mixed results in the training dataset as well. Nevertheless, I've decided to omit them, because they overlap semantically with both the positive and the negative results, which hurts the model's prediction accuracy.

Import, Inspect And Preprocess Text Data

Here I'll import my curated feedback dataset, use the Natural Language ToolKit (NLTK) to tokenize and filter the text, and use the Voikko package to lemmatize, i.e. turn words from their inflected forms into their base forms.

import pandas as pd

df = pd.read_excel('kiva_keskusta_kavelijoille_syksy2014.xlsx')

What does the dataframe look like?

df.head()
QUESTION SUBJECT COMMENT Summary
0 Hyvä paikka kohtaamiselle ja kaupunkielämälle Mukavat nurmikot loikoilla ja katsella... Mukavat nurmikot loikoilla ja katsella kaupunk... 1
1 Keskustan tinkimätön helmi Hyvä katsomo kaupunkitapahtumille Hyvä katsomo kaupunkitapahtumille 1
2 Keskustan tinkimätön helmi Mainio paikka piknikille Mainio paikka piknikille 1
3 Rauhaton tai meluisa paikka Meluisa ja ahdas Meluisa ja ahdas 2
4 Vaikea kulkea jalan Ahdas paikka. bussien odottajat tukkivat... Ahdas paikka. bussien odottajat tukkivat jalka... 2

The structure is as intended. I'll use the COMMENT column as the feature for machine learning training, and the Summary column holds the semantic labels. Let's see how these semantic labels are distributed in the dataset.

import matplotlib.pyplot as plt
%matplotlib inline

plt.bar(['Negative', 'Positive', 'Mixed'], df.Summary.value_counts())
plt.show()

print("Let's filter the mixed feedbacks")
df = df[df['Summary']!=3]

plt.bar(['Negative', 'Positive'], df.Summary.value_counts())
plt.show()
Let's filter the mixed feedbacks

True to Finnish fashion, there is more negativity than positivity. There is some imbalance in the semantic labels, but the difference isn't so great that it would cause the model to over-prefer one label over the other.
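If the hard-coded label order in the bar plots above feels fragile (value_counts() orders by frequency, not by label), the counts can also be plotted by mapping the numeric labels explicitly. A minimal sketch, where label_names is a helper dict introduced only for this illustration:

# plot the label distribution with an explicit mapping from numeric labels to names,
# so the bar labels stay aligned with the counts regardless of frequency order
# label_names is a hypothetical helper, not part of the original pipeline
label_names = {1: 'Positive', 2: 'Negative', 3: 'Mixed'}
counts = df.Summary.value_counts().sort_index()
plt.bar([label_names[k] for k in counts.index], counts.values)
plt.show()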

Lemmatization

Here I'll use the Voikko module to remove word inflections from the text data. This way I'll have far fewer unique words to work with without losing much semantic meaning. Believe me, given the agglutinative nature of the Finnish language, this step is really useful.

import libvoikko
from libvoikko import Voikko

Voikko.setLibrarySearchPath(r'C:\Voikko') # set the Voikko dictionary search path
lemmatizer = libvoikko.Voikko(u"fi")  # set up the lemmatizer
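As a quick sanity check, Voikko's analyze() returns a list of analysis dictionaries, and the BASEFORM entry holds the lemma; this is what the lemmatization function below relies on. A small sketch (the exact analyses depend on the installed Voikko dictionary):

# what does Voikko return for an inflected word?
# analyze() gives a list of analysis dicts; BASEFORM holds the lemma
analyses = lemmatizer.analyze('nurmikot')
if analyses:
    print(analyses[0]['BASEFORM'])  # expected lemma: 'nurmikko'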

def lemmatize(raw_review):
    wordlist = [text.split() for text in raw_review] # split each citizen feedback into words
    filterlist = []
    lemmatized = []

    for words in wordlist: # remove every character that isn't a letter or a number
        words = [''.join(e for e in word.lower() if e.isalnum()) for word in words]
        filterlist.append(words) # add each filtered word list into filterlist

    for words in filterlist: # lemmatize each word of each citizen feedback
        # if Voikko can't analyze the word, the word is omitted
        words = [lemmatizer.analyze(word)[0]['BASEFORM'] for word in words if lemmatizer.analyze(word)] 
        lemmatized.append(words) # add each lemmatized word list into the list of lemmatized feedbacks
    return lemmatized

lemmatized = lemmatize(df.COMMENT)

Once the words are lemmatized, I'll filter stop words (words without much semantic meaning) and very short words from the dataset. Really short words usually just add background noise, so it is sensible to filter them out. However, some short words carry significant semantic meaning, such as ei (no), so those words shall be excepted. Here I'll use NLTK's openly available Finnish stop word set and a simple word length check to filter out non-meaningful words.

import pandas as pd
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

def review_the_words(raw_review):
    # set Finnish stop words from NLTK's stop word collection
    stopWords = set(stopwords.words('finnish'))
    # however, negation words carry semantic meaning, so they are removed from the stop word set
    stopWords.remove('ei')
    stopWords.remove('eivät')
    stopWords.remove('emme')
    stopWords.remove('en')
    stopWords.remove('et')
    stopWords.remove('ette')
    
    # many short words just add background noise, so most of them will be filtered
    short_words = [w for w in raw_review if len(w) <= 3]
    
    # However, some short words can have relevant semantic meaning (depending on what you want to predict)
    # Therefore these words shall be exceptions
    exceptions = ['ei', 'en', 'et', 'voi', 'tie', 'yli', 'jää', 'ohi']
    
    # filter the exceptions out of the short word list (every occurrence)
    for word in exceptions:
        while word in short_words:
            short_words.remove(word)
            
    # helper to detect words that contain digits
    def hasNumbers(word):
        return any(char.isdigit() for char in word)
    
    meaningful_words = [w for w in raw_review if (w not in stopWords
                                                  and w not in short_words 
                                                  and not hasNumbers(w))] 
        
    return meaningful_words

df['corpus'] = [ review_the_words(text) for text in lemmatized]

Result:

df[['COMMENT', 'corpus', 'Summary']].head(10)
COMMENT corpus Summary
0 Mukavat nurmikot loikoilla ja katsella kaupunk... [mukava, nurmikko, loikoilla, katsella, kaupun... 1
1 Hyvä katsomo kaupunkitapahtumille [hyvä, katsomo, kaupunkitapahtuma] 1
2 Mainio paikka piknikille [mainio, paikka, piknikki] 1
3 Meluisa ja ahdas [meluisa, ahdas] 2
4 Ahdas paikka. bussien odottajat tukkivat jalka... [ahdas, paikka, bussi, odottaja, tukkia, jalka... 2
6 RAAUHALLINEN LIIKUA, AUTOJA EI PALJON [auto, ei, paljo] 1
7 MERI RAUHOITTAA [Meri, rauhoittaa] 1
8 pERINTEINEN HELMI, HIENO KIRKKO, AVARA PAIKKA [perinteinen, Helmi, hieno, kirkko, avara, pai... 1
9 kuilu meluisa, busseja ei ka saa pois [kuilu, meluisa, bussi, ei, saada, pois] 2
10 tunneli ahdas ja siellä epämääräistä sakkia jo... [tunneli, ahdas, siellä, epämääräinen, sakki, ... 2

If you speak Finnish, you can see that the semantic meaning isn't lost in the lemmatization process and it is still quite easy to recognize positive and negative feedback. This will do perfectly as a training dataset for machine learning.

Term Frequency - Inverse Document Frequency Vectorization

Here, term frequency–inverse document frequency (tf–idf) will be used to calculate and normalize word frequencies in the citizen feedbacks and turn them into numeric data, which will serve as the training dataset for the machine learning models. tf–idf is often used as a weighting factor in searches of information retrieval, text mining, and user modeling. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general (shamelessly copied from Wikipedia).
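To make the weighting concrete, here is a minimal sketch on a toy two-document corpus (the toy corpus is purely illustrative, not part of the survey data). In scikit-learn's formulation with the settings used below, the term frequency is replaced by 1 + ln(tf) when sublinear_tf=True, the smoothed idf is ln((1 + n) / (1 + df)) + 1, and each document vector is l2-normalized:

from sklearn.feature_extraction.text import TfidfVectorizer

# a toy corpus: 'paikka' appears in both documents, the other words in only one
toy_corpus = ['meluisa ahdas paikka', 'mukava paikka']

toy_tfidf = TfidfVectorizer(sublinear_tf=True, smooth_idf=True, norm='l2')
toy_features = toy_tfidf.fit_transform(toy_corpus)

# rarer words get a higher idf than the word that appears in every document
print(dict(zip(toy_tfidf.get_feature_names(), toy_tfidf.idf_.round(3))))
print(toy_features.toarray().round(3))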

from sklearn.feature_extraction.text import TfidfVectorizer

# unigrams (single words) and bigrams (word pairs) will be used as features in the training dataset
# 2000 features will be included (2000 unique words and word pairs) 
tfidf = TfidfVectorizer(sublinear_tf=True, min_df=1, norm='l2', encoding='utf-8', ngram_range=(1, 2), stop_words=None, max_features=2000)
features = tfidf.fit_transform(df.corpus.astype('str')).toarray()
labels = df.Summary
print(features.shape)
(1168, 2000)

Cross validation

Now that I have my training dataset, I'll evaluate several different machine learning classification models with five-fold cross validation and plot their prediction accuracies. This reveals how accurate each classification model is in general and lets me compare the accuracies between models.

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import NuSVC

import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.model_selection import cross_val_score
plt.figure(figsize=[10,10])
models = [   
    NuSVC(gamma='scale', kernel='rbf', nu=0.8),
    RandomForestClassifier(n_estimators=500, max_depth=20, random_state=0),
    LinearSVC(),
    MultinomialNB(),
    LogisticRegression(solver='lbfgs', random_state=0),
    SGDClassifier(loss='hinge', penalty='l2', max_iter=2000, tol=1e-5),
    MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(10, 5), random_state=1)
    
]
CV = 5
entries = []
for model in models:
    model_name = model.__class__.__name__
    accuracies = cross_val_score(model, features, labels, scoring='accuracy', cv=CV)
    for fold_idx, accuracy in enumerate(accuracies):
        entries.append((model_name, fold_idx, accuracy))    
cv_df = pd.DataFrame(entries, columns=['model_name', 'fold_idx', 'accuracy'])

import seaborn as sns

box = sns.boxplot(x='model_name', y='accuracy', data=cv_df)
sns.stripplot(x='model_name', y='accuracy', data=cv_df, 
              size=8, jitter=True, edgecolor="gray", linewidth=2)
box.set_xticklabels(box.get_xticklabels(), rotation=45, horizontalalignment='right')
plt.title('Cross validation accuracy results')
plt.show()

Many models here seem interesting, but Naive Bayes' and logistic regression's results stand out. I'll look further into these two classifiers.

Splitting the dataset

Now I can split my dataset into training and test datasets. I'll use a 0.25 test ratio, which leaves roughly 876 samples for training and 292 for testing.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features, df['Summary'], test_size=0.25, random_state = 0)

Train the models

%%time
NB = MultinomialNB().fit(X_train, y_train)
LR = LogisticRegression(solver='lbfgs', random_state=0).fit(X_train, y_train)
Wall time: 152 ms

Welp, that was fast... You'd think that training the models would be harder but it was done just like that.

Results

Confusion Matrix & Classification Report

One simple way to assess a model's classification accuracy is to create a confusion matrix, which shows the ground truth against the classification model's predicted values. This way I can easily see whether the model has a tendency to predict one label over the other. Also, Scikit-Learn's classification report function is really useful for assessing the precision, recall and f1-scores of the model. Let's see how Multinomial Naive Bayes compares to the logistic regression classifier.

from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report


def prediction_results(model, X_test, y_test, labels):
    y_pred = model.predict(X_test)
    conf_mat = confusion_matrix(y_test, y_pred)
    fig, ax = plt.subplots(figsize=(6,6))
    sns.heatmap(conf_mat, annot=True, fmt='d', 
                xticklabels=labels, yticklabels=labels)
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.title(model.__class__.__name__)
    plt.show()
    print(metrics.classification_report(y_test, y_pred, target_names=labels))
    

prediction_results(NB, X_test, y_test, ['1', '2'])
prediction_results(LR, X_test, y_test, ['1', '2'])
              precision    recall  f1-score   support

           1       0.90      0.91      0.91       116
           2       0.94      0.93      0.94       176

    accuracy                           0.92       292
   macro avg       0.92      0.92      0.92       292
weighted avg       0.92      0.92      0.92       292

              precision    recall  f1-score   support

           1       0.81      0.93      0.86       102
           2       0.96      0.88      0.92       190

    accuracy                           0.90       292
   macro avg       0.88      0.91      0.89       292
weighted avg       0.91      0.90      0.90       292

Multinomial Naive Bayes has quite impressive results, whereas logistic regression tends to prefer the negative (2) label more, which makes it less generalizable. Now let's compare the Naive Bayes and logistic regression receiver operating characteristic curves and area under the curve (AUC) values.

ROC curve

A ROC curve illustrates the predictive ability of a binary classifier system. The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. TPR is also known as sensitivity, recall or probability of detection in machine learning. FPR can be calculated as (1 − specificity).

AUC is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one. It has been commonly used in the machine learning community for model comparison.
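To make the definitions above concrete, here is a minimal sketch with purely hypothetical counts showing how TPR and FPR fall out of a binary confusion matrix:

# hypothetical confusion matrix counts, only for illustrating the formulas
tp, fn = 160, 16   # positives classified correctly / incorrectly
fp, tn = 12, 104   # negatives classified incorrectly / correctly

tpr = tp / (tp + fn)   # true positive rate, i.e. sensitivity or recall
fpr = fp / (fp + tn)   # false positive rate, i.e. 1 - specificity
print(round(tpr, 3), round(fpr, 3))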

import numpy as np
from sklearn.metrics import auc
# requires sklearn version 0.22
from sklearn.metrics import plot_roc_curve
from sklearn.model_selection import StratifiedKFold

def plotROC(model, features, summary, splits):
    tprs = []
    aucs = []
    mean_fpr = np.linspace(0, 1, 100)
    cv = StratifiedKFold(n_splits=splits)
    sumArr = summary.array
    fig, ax = plt.subplots(figsize=(10,7))
    for i, (train, test) in enumerate(cv.split(features, sumArr)):
        model = model.fit(features[train], sumArr[train])
        viz = plot_roc_curve(model, features[test], sumArr[test],
                             name='ROC fold {}'.format(i),
                             alpha=0.3, lw=1, ax=ax)
        interp_tpr = np.interp(mean_fpr, viz.fpr, viz.tpr)
        interp_tpr[0] = 0.0
        tprs.append(interp_tpr)
        aucs.append(viz.roc_auc)

    ax.plot([0, 1], [0, 1], linestyle='--', lw=2, color='r',
        label='Chance', alpha=.8)

    mean_tpr = np.mean(tprs, axis=0)
    mean_tpr[-1] = 1.0
    mean_auc = auc(mean_fpr, mean_tpr)
    std_auc = np.std(aucs)
    ax.plot(mean_fpr, mean_tpr, color='b',
            label=r'Mean ROC (AUC = %0.2f $\pm$ %0.2f)' % (mean_auc, std_auc),
            lw=2, alpha=.8)

    std_tpr = np.std(tprs, axis=0)
    tprs_upper = np.minimum(mean_tpr + std_tpr, 1)
    tprs_lower = np.maximum(mean_tpr - std_tpr, 0)
    ax.fill_between(mean_fpr, tprs_lower, tprs_upper, color='grey', alpha=.2,
                    label=r'$\pm$ 1 std. dev.')

    ax.set(xlim=[-0.05, 1.05], ylim=[-0.05, 1.05],
        title='Receiver operating characteristic of ' + model.__class__.__name__)
    ax.legend(loc='lower right')
    plt.show()
    
plotROC(MultinomialNB(), features, df.Summary, 6)
plotROC(LogisticRegression(solver='lbfgs', random_state=0), features, df.Summary, 6)

These results seem a bit too charitable, which may be due to the cross validation subsampling over the entire dataset. Let's see what the AUC looks like when I strictly use the hard predictions on my held-out training-test split.

fpr, tpr, thresholds = metrics.roc_curve(y_test, NB.predict(X_test), pos_label=2)
print('Naive Bayes AUC: ' + str(round(metrics.auc(fpr, tpr),3)))

fpr, tpr, thresholds = metrics.roc_curve(y_test, LR.predict(X_test), pos_label=2)
print('Logistic Regression AUC: ' + str(round(metrics.auc(fpr, tpr),3)))
Naive Bayes AUC: 0.92
Logistic Regression AUC: 0.882

Now these seem to reflect the classification report results much more appropriately, and that settles the case on which model is more robust for feedback classification!
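For comparison, the AUC could also be computed from predicted class probabilities on the same held-out split, instead of from hard label predictions. A sketch, assuming the NB and LR models and the X_test/y_test split from above, with label 2 (negative feedback) as the positive class:

# AUC from predicted probabilities instead of hard label predictions
# pick the probability column that corresponds to class 2 for each model
nb_scores = NB.predict_proba(X_test)[:, list(NB.classes_).index(2)]
lr_scores = LR.predict_proba(X_test)[:, list(LR.classes_).index(2)]

fpr, tpr, thresholds = metrics.roc_curve(y_test, nb_scores, pos_label=2)
print('Naive Bayes probability-based AUC: ' + str(round(metrics.auc(fpr, tpr), 3)))

fpr, tpr, thresholds = metrics.roc_curve(y_test, lr_scores, pos_label=2)
print('Logistic Regression probability-based AUC: ' + str(round(metrics.auc(fpr, tpr), 3)))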

Save and Reuse the Dataset

It can be useful to save the preprocessed dataset for further use. If I see some need to tweak my model and I don't want to preprocess my dataset again, this can come in handy.

df.to_csv('processedFeedbackDataset.tsv', sep='\t') # save the dataset as a TSV file

Pandas can have problems interpreting lists in dataframes and may turn them into strings when the file is written. Therefore, it can be useful to use the literal_eval tool to transform these string values back into lists.

import pandas as pd

# import the file
df = pd.read_csv('processedFeedbackDataset.tsv', sep='\t', index_col='Unnamed: 0', encoding='utf-8')

from ast import literal_eval

df.corpus = df.corpus.apply(literal_eval)

Save and Reuse the Classification Model

Now that I have trained my models and assessed their predictive accuracy, I want to use them in the future. I'll save these models with the pickle module so that I can use them to make predictions with new data.

import pickle

# save the feature vectorisation and classification models
with open('FeedbackTFIDFmodel.sav', 'wb') as f:
    pickle.dump(tfidf, f)

with open('NBFeedbackClassifier.sav', 'wb') as f:
    pickle.dump(NB, f)
# file names of the saved feature vectorisation and classification models
filenameTFIDF = ("FeedbackTFIDFmodel.sav")
filenameClassifier = ("NBFeedbackClassifier.sav")

# load the tf-idf vectorisation model
loaded_transformer = pickle.load(open(filenameTFIDF, 'rb'))

# load the classification model
loaded_model = pickle.load(open(filenameClassifier, 'rb'))

Now I have the feature vectorization and classification models that I previously developed loaded into new variables. Let's test them on two sample texts, where the first one is negative and the latter one is quite positive.

# write the sample feedback in dict format so it can be transformed into a dataframe
# here the first feedback is negative and the latter is positive
d = {'text': ['Aivan hirveä ja vaarallinen puisto, autot melkein ajavat päälle ja ahdas sekä ahdistava',
              'Tämä on aivan ihana paikka, viihtyisä ja valoisa']}

df2 = pd.DataFrame(d)

# lemmatize sample feedback 
lemmatized = lemmatize(df2.text)

# filter stopwords and short words from sample feedback
df2['corpus'] = [ review_the_words(text) for text in lemmatized]

# transform the sample feedback into vectorized feature format so it can be predicted by classification model
new_features = loaded_transformer.transform(df2.corpus.astype('str')).toarray()

Alright, the moment of truth. Let's see if my model can correctly predict the semantic meaning of my sample feedbacks.

loaded_model.predict(new_features)
array([2, 1], dtype=int64)

And it does! Of course, if I fed it more varied text data, some problems could arise, since this model was trained on quite a small dataset. However, it seems quite adequate for a rough estimation of the semantic meaning of citizen feedback.

Discussion

In this little project of mine, I've tested multiple machine learning models and in the end trained a Naive Bayes classification model that can predict semantic meaning from citizen feedback text data with a 0.92 f1-score and a 0.92 ROC-AUC score. However, the real magic here isn't in the classification model, but in the preprocessing steps, where I've implemented several methods to make natural language more machine readable without losing much semantic meaning. These preprocessing steps included tokenization, lemmatization, stop word and short word filtering, and vectorization of the text data. I've taken to heart the idea that most of the work in machine learning model development goes into dataset collection, curation and fine-tuning of the preprocessing steps. In the end, I've gained much more knowledge not only about creating a natural language processing pipeline, but also about machine learning development in general.

This blog post is the last one on natural language processing; in the future I shall discuss visualizing epidemiological data.
