Lately I've developed an interest in the semantic analysis of unstructured text data. I wanted to test my text analysis pipeline on temporal natural language text data, but I couldn't find any in Finnish. As an alternative, I chose the results of the "Nice city centre for visitors" survey, which contains citizen feedback about the development of the city centre.
Over time, several sophisticated methods have been developed to process, embed, and predict meaning from natural language, be it speech or text. I've decided to implement simple text preprocessing methods on Finnish text data, using the Voikko module for lemmatization, to create a training dataset for predicting simple semantic meaning from natural language text. I'll test several different machine learning models on my dataset and compare their classification results. I'll omit the more heavyweight neural network models, since they require more computing power and my dataset is too small for training them.
I've read and labelled the citizen feedback survey's results as follows:
Summary | Feedback type
---|---
1 | Positive
2 | Negative
3 | Mixed
It would be quite possible to include the semantically mixed results in the training dataset as well. Nevertheless, I've decided to omit them, because they overlap semantically with both the positive and the negative results, which hurts the model's prediction accuracy.
Here I'll import my curated feedback dataset, use the Natural Language Toolkit (NLTK) to tokenize and filter the text, and use the Voikko package to lemmatize, i.e. turn inflected words into their base form.
import pandas as pd
df = pd.read_excel('kiva_keskusta_kavelijoille_syksy2014.xlsx')
What does the dataframe look like?
df.head()
The structure is as intended. I'll be using the COMMENT column as the feature for machine learning training, and the Summary column holds the semantic labels. Let's see how these semantic labels are distributed in the dataset.
import matplotlib.pyplot as plt
%matplotlib inline
plt.bar(['Negative', 'Positive', 'Mixed'], df.Summary.value_counts())
plt.show()
print("Let's filter the mixed feedbacks")
df = df[df['Summary']!=3]
plt.bar(['Negative', 'Positive'], df.Summary.value_counts())
plt.show()
True to Finnish fashion, there is more negativity than positivity. There is some imbalance in the semantic labels, but the difference isn't great enough to make the predictor over-prefer one label over the other.
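To put a number on the imbalance instead of eyeballing the bar chart, the label proportions can be printed (a quick optional check, not part of the original analysis):
# relative frequency of each remaining label (1 = positive, 2 = negative)
print(df.Summary.value_counts(normalize=True))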
Here I'll use the Voikko module to remove word inflections from the text data. This way I'll have far fewer unique words to work with, without losing much semantic meaning. Believe me, with the agglutinative nature of the Finnish language, this method becomes really useful.
import libvoikko
from libvoikko import Voikko
Voikko.setLibrarySearchPath(r'C:\Voikko') # set the Voikko dictionary search path (raw string avoids escape issues)
lemmatizer = libvoikko.Voikko(u"fi") # setup the lemmatizer function
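As a quick sanity check (with a made-up example word, not from the survey data), Voikko's analyze() should return a list of analyses whose BASEFORM field holds the lemma:
# the inflected form "kaupungissa" ("in the city") should lemmatize to "kaupunki"
analysis = lemmatizer.analyze("kaupungissa")
if analysis:  # analyze() returns an empty list for words not in the lexicon
    print(analysis[0]['BASEFORM'])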
def lemmatize(raw_review):
    wordlist = [text.split() for text in raw_review]  # split each citizen feedback into words
    filterlist = []
    lemmatized = []
    for words in wordlist:  # remove all characters from a string that aren't a number or a letter
        words = [''.join(e for e in word.lower() if e.isalnum()) for word in words]
        filterlist.append(words)  # add each filtered word list into filterlist
    for words in filterlist:  # lemmatize each word from each citizen feedback
        # if the word isn't in the Voikko word list, then the word will be omitted
        words = [lemmatizer.analyze(word)[0]['BASEFORM'] for word in words if lemmatizer.analyze(word)]
        lemmatized.append(words)  # add each lemmatized word list into the list of lemmatized feedbacks
    return lemmatized
lemmatized = lemmatize(df.COMMENT)
Once the words are lemmatized, I'll filter stop words (words without much semantic weight) and very short words from the dataset. Really short words usually just add background noise, so it makes sense to filter them out. However, some short words carry significant semantic meaning, such as ei (no), so those will be treated as exceptions. Here I'll use NLTK's openly available Finnish stop word set and a simple word length check to filter out non-meaningful words.
import pandas as pd
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
def review_the_words(raw_review):
    stopWords = set(stopwords.words('finnish'))  # NLTK's Finnish stop word set
    # some stop words are semantically meaningful for this task, so remove them from the stop word set
    stopWords.remove('ei')
    stopWords.remove('eivät')
    stopWords.remove('emme')
    stopWords.remove('en')
    stopWords.remove('et')
    stopWords.remove('ette')
    # many short words just add background noise, so most of them will be filtered
    short_words = [w for w in raw_review if len(w) <= 3]
    # however, some short words can have relevant semantic meaning (depending on what you want to predict)
    # therefore these words shall be exceptions
    exceptions = ['ei', 'en', 'et', 'voi', 'tie', 'yli', 'jää', 'ohi']
    # drop all exception words from the short word list so they survive the filter below
    short_words = [w for w in short_words if w not in exceptions]
    # helper to detect words that contain digits
    def hasNumbers(word):
        return any(char.isdigit() for char in word)
    meaningful_words = [w for w in raw_review if (w not in stopWords
                                                  and w not in short_words
                                                  and hasNumbers(w) == False)]
    return(meaningful_words)
df['corpus'] = [ review_the_words(text) for text in lemmatized]
df[['COMMENT', 'corpus', 'Summary']].head(10)
If you speak Finnish, you can see that the semantic meaning isn't lost in the lemmatization process and it is still quite easy to recognize positive and negative feedback. This will do nicely as a training dataset for machine learning.
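As an extra check, the whole preprocessing chain can be run on a single made-up feedback sentence (the exact output depends on the Voikko lexicon and the NLTK stop word list):
# hypothetical one-off feedback, not taken from the survey data
sample = lemmatize(["Ei ole kiva, kun autot ajavat liian kovaa keskustassa"])
print(review_the_words(sample[0]))
# roughly: the negation and content words remain in base form, stop words and very short words are dropped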
Here, term frequency–inverse document frequency (tf–idf) will be implemented to calculate and normalize word frequencies in the citizen feedbacks and turn them into numeric data. This will serve as the training dataset for the machine learning models. tf–idf is often used as a weighting factor in searches of information retrieval, text mining, and user modeling. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general (shamelessly copied from Wikipedia).
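As a toy illustration of the weighting idea, here is the basic tf–idf calculation on a hypothetical two-document corpus (scikit-learn's TfidfVectorizer additionally applies smoothing, sublinear term frequency and L2 normalization, so its numbers will differ):
import math

# two hypothetical 'documents'
docs = [['puisto', 'ihana', 'puisto'], ['puisto', 'ahdas']]

def tf_idf(term, doc, docs):
    tf = doc.count(term) / len(doc)         # term frequency within the document
    df = sum(1 for d in docs if term in d)  # number of documents containing the term
    idf = math.log(len(docs) / df)          # inverse document frequency
    return tf * idf

print(tf_idf('puisto', docs[0], docs))  # 0.0 – appears in every document, so it carries no weight
print(tf_idf('ihana', docs[0], docs))   # ~0.23 – distinctive for the first document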
from sklearn.feature_extraction.text import TfidfVectorizer
# unigrams (single words) and bigrams (word pairs) will be used as features in the training dataset
# 2000 features will be included (the 2000 most frequent unique words and word pairs)
tfidf = TfidfVectorizer(sublinear_tf=True, min_df=1, norm='l2', encoding='utf-8', ngram_range=(1, 2), stop_words=None, max_features=2000)
features = tfidf.fit_transform(df.corpus.astype('str')).toarray()
labels = df.Summary
print(features.shape)
Now that I have my training dataset, I'll evaluate multiple machine learning classification models with five-fold cross-validation and plot their prediction accuracies. This reveals how accurate each classification model is in general and allows comparing the models against each other.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import NuSVC
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import cross_val_score
plt.figure(figsize=[10,10])
models = [
NuSVC(gamma='scale', kernel='rbf', nu=0.8),
RandomForestClassifier(n_estimators=500, max_depth=20, random_state=0),
LinearSVC(),
MultinomialNB(),
LogisticRegression(solver='lbfgs', random_state=0),
SGDClassifier(loss='hinge', penalty='l2', max_iter=2000, tol=1e-5),
MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(10, 5), random_state=1)
]
CV = 5
cv_df = pd.DataFrame(index=range(CV * len(models)))
entries = []
for model in models:
model_name = model.__class__.__name__
accuracies = cross_val_score(model, features, labels, scoring='accuracy', cv=CV)
for fold_idx, accuracy in enumerate(accuracies):
entries.append((model_name, fold_idx, accuracy))
cv_df = pd.DataFrame(entries, columns=['model_name', 'fold_idx', 'accuracy'])
import seaborn as sns
box = sns.boxplot(x='model_name', y='accuracy', data=cv_df)
sns.stripplot(x='model_name', y='accuracy', data=cv_df,
size=8, jitter=True, edgecolor="gray", linewidth=2)
box.set_xticklabels(box.get_xticklabels(), rotation=45, horizontalalignment='right')
plt.title('Cross validation accuracy results')
plt.show()
Many models here seem interesting, but Naive Bayes' and logistic regression's results stand out. I'll look further into these two classifiers.
Now I can split my dataset into training and test sets. I'll use a 0.25 test ratio, which leaves roughly 876 feedbacks for training and 292 for testing.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, df['Summary'], test_size=0.25, random_state = 0)
%%time
NB = MultinomialNB().fit(X_train, y_train)
LR = LogisticRegression(solver='lbfgs', random_state=0).fit(X_train, y_train)
Welp, that was fast... You'd think that training the models would be harder but it was done just like that.
One simple way to assess a model's classification accuracy is to create a confusion matrix, which compares the ground truth with the model's predicted values. This way I can easily see whether the model has a tendency to predict one label over the other. Also, scikit-learn's classification report function is really useful for assessing the precision, recall and F1 scores of the model. Let's see how Multinomial Naive Bayes compares to the logistic regression classifier.
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
def prediction_results(model, X_test, y_test, labels):
    y_pred = model.predict(X_test)
    conf_mat = confusion_matrix(y_test, y_pred)  # rows: actual, columns: predicted
    fig, ax = plt.subplots(figsize=(6,6))
    sns.heatmap(conf_mat, annot=True, fmt='d',
                xticklabels=[1, 2], yticklabels=[1, 2])
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.title(model.__class__.__name__)
    plt.show()
    print(metrics.classification_report(y_test, y_pred, target_names=labels))  # true labels first, then predictions
prediction_results(NB, X_test, y_test, ['1', '2'])
prediction_results(LR, X_test, y_test, ['1', '2'])
Multinomial Naive Bayes has quite impressive results, whereas logistic regression tends to prefer the negative (2) label, which makes it less generalizable. Now let's compare the Naive Bayes and logistic regression receiver operating characteristic (ROC) curves and area under the curve (AUC) values.
A ROC curve illustrates the predictive ability of a binary classifier system. The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. TPR is also known as sensitivity, recall or probability of detection in machine learning. FPR can be calculated as (1 − specificity).
AUC is equal to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one. It is commonly used in the machine learning community for model comparison.
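As a small illustration of where TPR and FPR come from, here they are computed from a confusion matrix on hypothetical labels, treating the negative feedback label 2 as the "positive" class (the same pos_label=2 convention is used further below):
import numpy as np
from sklearn.metrics import confusion_matrix

# hypothetical ground truth and predictions, just to show the bookkeeping
y_true_demo = np.array([2, 2, 1, 1, 2, 1])
y_pred_demo = np.array([2, 1, 1, 2, 2, 1])

tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo, labels=[1, 2]).ravel()
print('TPR (sensitivity):', tp / (tp + fn))      # ~0.67
print('FPR (1 - specificity):', fp / (fp + tn))  # ~0.33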
import numpy as np
from sklearn.metrics import auc
# requires sklearn version 0.22
from sklearn.metrics import plot_roc_curve
from sklearn.model_selection import StratifiedKFold
def plotROC(model, features, summary, splits):
    tprs = []
    aucs = []
    mean_fpr = np.linspace(0, 1, 100)
    cv = StratifiedKFold(n_splits=splits)
    sumArr = summary.array
    fig, ax = plt.subplots(figsize=(10,7))
    for i, (train, test) in enumerate(cv.split(features, sumArr)):
        model = model.fit(features[train], sumArr[train])
        viz = plot_roc_curve(model, features[test], sumArr[test],
                             name='ROC fold {}'.format(i),
                             alpha=0.3, lw=1, ax=ax)
        interp_tpr = np.interp(mean_fpr, viz.fpr, viz.tpr)
        interp_tpr[0] = 0.0
        tprs.append(interp_tpr)
        aucs.append(viz.roc_auc)
    ax.plot([0, 1], [0, 1], linestyle='--', lw=2, color='r',
            label='Chance', alpha=.8)
    mean_tpr = np.mean(tprs, axis=0)
    mean_tpr[-1] = 1.0
    mean_auc = auc(mean_fpr, mean_tpr)
    std_auc = np.std(aucs)
    ax.plot(mean_fpr, mean_tpr, color='b',
            label=r'Mean ROC (AUC = %0.2f $\pm$ %0.2f)' % (mean_auc, std_auc),
            lw=2, alpha=.8)
    std_tpr = np.std(tprs, axis=0)
    tprs_upper = np.minimum(mean_tpr + std_tpr, 1)
    tprs_lower = np.maximum(mean_tpr - std_tpr, 0)
    ax.fill_between(mean_fpr, tprs_lower, tprs_upper, color='grey', alpha=.2,
                    label=r'$\pm$ 1 std. dev.')
    ax.set(xlim=[-0.05, 1.05], ylim=[-0.05, 1.05],
           title='Receiver operating characteristic of ' + model.__class__.__name__)
    ax.legend(loc='lower right')
    plt.show()
plotROC(MultinomialNB(), features, df.Summary, 6)
plotROC(LogisticRegression(solver='lbfgs', random_state=0), features, df.Summary, 6)
These results seem a bit too charitable, which may be due to the cross-validation subsampling using the entire dataset. Let's see what the plain AUC metric shows when I strictly use my training-test dataset split.
# AUC computed from hard label predictions on the held-out test set, with label 2 treated as the positive class
fpr, tpr, thresholds = metrics.roc_curve(y_test, NB.predict(X_test), pos_label=2)
print('Naive Bayes AUC: ' + str(round(metrics.auc(fpr, tpr), 3)))
fpr, tpr, thresholds = metrics.roc_curve(y_test, LR.predict(X_test), pos_label=2)
print('Logistic Regression AUC: ' + str(round(metrics.auc(fpr, tpr), 3)))
Now these reflect the classification report results much better and settle the case of which model is more robust for feedback classification!
It can be useful to save the preprocessed dataset for further use. If I see some need to tweak my model and I don't want to preprocess my dataset again, this can come in handy.
df.to_csv('processedFeedbackDataset.tsv', sep='\t') # save the dataset as a TSV file
Pandas can have problems interpreting lists inside dataframes and may turn them into strings. Therefore, it can be useful to use the literal_eval tool from the ast module to transform these string values back into lists.
import pandas as pd
# import the saved file
df = pd.read_csv('processedFeedbackDataset.tsv', sep='\t', index_col='Unnamed: 0', encoding='utf-8')
from ast import literal_eval
df.corpus = df.corpus.apply(literal_eval)
Now that I have trained my models and assessed their predictive accuracy, I want to use them in the future. I'll save them with the pickle module so that I can make predictions on new data.
import pickle
# save the feature vectorisation and classification models
with open('FeedbackTFIDFmodel.sav', 'wb') as f:
    pickle.dump(tfidf, f)
with open('NBFeedbackClassifier.sav', 'wb') as f:
    pickle.dump(NB, f)
# file names of the saved feature vectorization and classification models
filenameTFIDF = ("FeedbackTFIDFmodel.sav")
filenameClassifier = ("NBFeedbackClassifier.sav")
# load the tf–idf vectorizer
loaded_transformer = pickle.load(open(filenameTFIDF, 'rb'))
# load the classification model
loaded_model = pickle.load(open(filenameClassifier, 'rb'))
Now I have the feature vectorization and classification models I previously developed loaded into new variables. Let's test them on two sample texts, where the first one is negative and the latter quite positive.
# write the sample feedback in dict format so it can be transformed into dataframe
# here the first feedback is negative and latter is positive
d = {'text': ['Aivan hirveä ja vaarallinen puisto, autot melkein ajavat päälle ja ahdas sekä ahdistava',
'Tämä on aivan ihana paikka, viihtyisä ja valoisa']}
df2 = pd.DataFrame(d)
# lemmatize sample feedback
lemmatized = lemmatize(df2.text)
# filter stopwords and short words from sample feedback
df2['corpus'] = [ review_the_words(text) for text in lemmatized]
import numpy as np
# transform the sample feedback into vectorized feature format so it can be predicted by classification model
new_features = loaded_transformer.transform(df2.corpus.astype('str')).toarray()
Alright, the moment of truth. Let's see if my model can correctly predict the semantic meaning of my sample feedbacks.
loaded_model.predict(new_features)
And it does! Of course, if I fed it more varied text data, some problems could arise, since this model was trained on quite a small dataset. However, it seems quite adequate for a rough estimation of the semantic meaning of citizen feedback.
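For readability, the numeric predictions can be mapped back to the feedback types from the table at the start (a small optional helper, not part of the original pipeline):
# map numeric labels back to the feedback type names (1 = Positive, 2 = Negative)
label_names = {1: 'Positive', 2: 'Negative'}
print([label_names[p] for p in loaded_model.predict(new_features)])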
In this little project of mine, I've tested multiple machine learning models and in the end trained a Naive Bayes classification model that can predict semantic meaning from citizen feedback text data with a 0.92 F1 score and a 0.92 ROC AUC score. However, the real magic here isn't in the classification model but in the preprocessing steps, where I've implemented several methods to make natural language more machine readable without losing much semantic meaning. These preprocessing steps included tokenization, lemmatization, stop word and short word filtering, and vectorization of the text data. I've taken to heart the idea that most of the work in machine learning model development goes into dataset collection, curation and fine-tuning the preprocessing steps. In the end, I've gained much more knowledge not only about creating a natural language processing pipeline but also about machine learning development in general.
This blog post is the last one on natural language processing; in the future I shall discuss visualizing epidemiological data.