Counting cells from immunocytochemistry images with DBSCAN



I remember a classmate of mine doing his internship work at a research group. He was doing antibody analyses on neuron cells and had to count the number of cells in an immunocytochemistry image by hand. As I watched him count each individual cell, I felt there had to be a more efficient way. Granted, there are already programs that automatically count the number of objects in a given image, but I wanted to try to create something like that myself with the tools and packages available in Python.

My first attempts produced no results and I put this little project on hold. However, during a recent machine learning course I was attending, a mention of a density-based clustering algorithm, DBSCAN, caught my interest. There were also several articles that utilized DBSCAN for image segmentation [1, 2]. I therefore figured I could do something similar and use DBSCAN to cluster and count the number of cells in an immunocytochemistry image.

About Immunocytochemistry

Immunocytochemistry is a laboratory method used to visualize the localization of a specific protein in cells by using a specific primary antibody. While in Western blot the primary antibody has to be left to incubate with the sample overnight to ensure maximal binding to the target protein, in immunocytochemistry this can take as little as half an hour. Once a secondary antibody with a conjugated fluorophore is bound to the primary antibody, the protein can be visualized under a fluorescence microscope. I used to do this method occasionally when I was an intern at a research group studying the molecular genetics of Alzheimer's disease, and I still have the protocol saved on my hard drive.

Method         Chemicals                                              Time / Temperature
Seeding        100 µl medium, 50,000 cells / well                     overnight / 37°C
Washing        200 µl PBS                                             1 x 5 min
Fixing         80 µl 4% PFA                                           15 min
Permeabilise   200 µl PBS-T 0.1%                                      3 x 5 min
Blocking       50 µl 5% Blocking Solution                             20 min
Primary AB     50 µl primary AB in 5% Blocking Solution               30 min
Washing        200 µl PBS-T                                           3 x 5 min
Secondary AB   50 µl secondary AB (1:333) in 5% Blocking Solution     30 min
Washing        200 µl PBS-T                                           3 x 5 min
DAPI           DAPI (1:400) in 5% Blocking Solution                   10 min
Washing        200 µl PBS                                             3 x 5 min

That protocol mentions 4′,6-diamidino-2-phenylindole, or DAPI, a fluorescent stain that binds strongly to DNA and is used to locate cell nuclei. It makes the nuclei appear bright blue, although the fluorescence microscopy method I used adds the color artificially. Here's an example of what fluorescence microscopy produces:

image.png

The blue dots are cell nuclei, and the red is astrocyte cell membrane staining.

Introduction to DBSCAN

Density-based spatial clustering of applications with noise (DBSCAN) is a clustering algorithm that groups regions of sufficiently high instance density into clusters. For each instance, the number of neighbors within a chosen radius ε is counted; if an instance has at least the specified minimum number of instances within that radius, it is considered a core instance. All instances in the neighborhood of a core instance belong to the same cluster. These may include other core instances, but also border instances, which do not have enough neighbors within their own ε radius to be core instances themselves. Any instance that is neither a core instance nor within the radius of one is considered noise, i.e. an anomaly.


DBSCAN flowchart
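
To make the distinction between core instances, border instances, and noise concrete, here is a minimal sketch using scikit-learn's DBSCAN on a small toy dataset (the point coordinates are made up purely for illustration):

import numpy as np
from sklearn.cluster import DBSCAN

# A tiny toy dataset: two dense groups of points and one isolated point
points = np.array([[1, 1], [1, 2], [2, 1], [2, 2],   # dense group A
                   [8, 8], [8, 9], [9, 8], [9, 9],   # dense group B
                   [5, 15]])                          # isolated point

# eps = neighborhood radius, min_samples = neighbors required for a core instance
db = DBSCAN(eps=2, min_samples=3).fit(points)

print(db.labels_)               # cluster label of each point, -1 marks noise
print(db.core_sample_indices_)  # indices of the core instances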

Now, do you still recall those blue DAPI-stained cell nuclei? Since in general every cell has a single nucleus (ignoring exceptions such as multinucleated skeletal muscle cells or the lobed nuclei of neutrophils), we can count the cells by clustering the nuclei with DBSCAN! All we need to do is color filter the immunocytochemistry image and extract the blue color, which represents the DAPI staining. After that, we do some image pre-processing, vectorize the image data, and feed it to DBSCAN, which counts the number of clusters.

image-2.png

Flowchart of the cell counting process.

Import packages for image import and pre-processing

import cv2 # We'll use OpenCV for image processing
import numpy as np
import os
import matplotlib.pyplot as plt
from matplotlib import colors

# We'll also use a simple Jupyter widget UI to assess optimal threshold values
from ipywidgets import interact_manual
import ipywidgets as widgets

Get image files

path = os.path.join(os.getcwd(), 'ICC') # I have stored my images in a folder named ICC
dirs = os.listdir(path)

flags = [i for i in dirs if (i.startswith('kuva'))]
flags
['kuva1.png',
 'kuva2.png',
 'kuva3.png',
 'kuva4.tiff',
 'kuva5.tiff',
 'kuva6.tiff',
 'kuva7.tiff',
 'kuva8.tiff']

The image type doesn't really matter, since we'll read the images with OpenCV and convert them all to the same color format anyway

Rescale and inspect images

Here we'll standardize all images to a 200x200 resolution. Note that this may change the aspect ratio and cause distortion in the data

ICCs = []

def rescale_images(path, flags, width, height):
    fig, axs = plt.subplots(2,4, figsize=(18, 8))
    axs = axs.ravel()
    images = []
    
    for i in range(len(flags)):
        img = cv2.imread(path + '/' + flags[i]) # fetch images
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB) # OpenCV loads images as BGR, convert to RGB
        dim = (width, height)
        img = cv2.resize(img, dim, interpolation = cv2.INTER_AREA) # resize the image
        images.append(img)
        axs[i].imshow(img) # plot the images
        
    return images
    
ICCs = rescale_images(path, flags, 200, 200)
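
If the aspect-ratio distortion mentioned above turns out to be a problem, one alternative is to resize by a single scale factor and pad the borders instead. Below is a small sketch of that idea; it is not part of the original pipeline, so the helper name and the black padding are my own choices:

def rescale_keep_aspect(img, width, height):
    # Scale by the smaller factor so the whole image fits inside width x height
    scale = min(width / img.shape[1], height / img.shape[0])
    new_w, new_h = int(img.shape[1] * scale), int(img.shape[0] * scale)
    resized = cv2.resize(img, (new_w, new_h), interpolation=cv2.INTER_AREA)

    # Pad the remaining borders with black pixels to reach exactly width x height
    pad_w, pad_h = width - new_w, height - new_h
    return cv2.copyMakeBorder(resized, pad_h // 2, pad_h - pad_h // 2,
                              pad_w // 2, pad_w - pad_w // 2,
                              cv2.BORDER_CONSTANT, value=(0, 0, 0))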

Image characterization

Methods and code repurposed from here and here. You can run the following code cell if you are interested in the image attributes; the information may be useful in the subsequent image manipulation steps.

fig = plt.figure(figsize=(10, 15))

for i in range(len(ICCs)):
    # Normalize the pixel colors so they can be used as scatter point colors
    pixel_colors = ICCs[i].reshape((np.shape(ICCs[i])[0]*np.shape(ICCs[i])[1], 3))
    norm = colors.Normalize(vmin=-1., vmax=1.)
    norm.autoscale(pixel_colors)
    pixel_colors = norm(pixel_colors).tolist()

    # The images are already in RGB at this point, so we plot the red, green and blue channels
    r, g, b = cv2.split(ICCs[i])
    ax = fig.add_subplot(4, 2, i+1, projection='3d')
    ax.set_xlabel("Red")
    ax.set_ylabel("Green")
    ax.set_zlabel("Blue")
    ax.set_title(flags[i])
    ax.scatter(r.flatten(), g.flatten(), b.flatten(), facecolors=pixel_colors, marker=".")

The first three images have only DAPI staining, whereas the latter images have additional stainings that highlight other cellular structures besides the nuclei.

Color filtering, thresholding, and masking

Code for thresholding was repurposed from here. I created a simple Jupyter widget UI to try out which threshold values are optimal for the image in question before actually processing and saving it for clustering.

def inspectImages(imageIndex, RedHiThresh, GrHiThresh, BluHiThresh, 
                  RedLoThresh, GrLoThres, BluLoThres, GlobalTresh, 
                  AdaptiveP1, AdaptiveP2):
    
    image = ICCs[imageIndex]
    # upper and lower RGB thresholds taken from the slider values
    hi_blue = (RedHiThresh, GrHiThresh, BluHiThresh) # upper threshold
    lo_blue = (RedLoThresh, GrLoThres, BluLoThres) # lower threshold

    mask = cv2.inRange(image, lo_blue, hi_blue) # color filter
    result = cv2.bitwise_and(image, image, mask=mask) # color mask added

    plt.figure(figsize=(20,10))

    plt.subplot(1, 2, 1)
    plt.imshow(image, cmap="gray")
    plt.title('original')
    plt.subplot(1, 2, 2)
    plt.imshow(result)
    plt.title('filtered')
    plt.show()

    ret, th1 = cv2.threshold(image, GlobalTresh, 255, cv2.THRESH_BINARY) # global threshold, applied to each RGB channel

    th2 = cv2.adaptiveThreshold(mask, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                                cv2.THRESH_BINARY, AdaptiveP1, AdaptiveP2)

    th3 = cv2.adaptiveThreshold(mask, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                cv2.THRESH_BINARY, AdaptiveP1, AdaptiveP2)

    titles = ['Original Binarized Image', 'Global Thresholding (v = '+str(GlobalTresh)+')',
                'Adaptive Mean Thresholding', 'Adaptive Gaussian Thresholding']
    images = [mask, th1[:, :, 2], th2, th3] # th1[:, :, 2] keeps only the blue channel

    plt.figure(figsize=(10,10))

    for i in range(4):
        plt.subplot(2,2,i+1),plt.imshow(images[i],'gray')
        plt.title(titles[i])
        plt.xticks([]),plt.yticks([])
    plt.show()
    
interact_manual(inspectImages,
    imageIndex = widgets.IntSlider(min=0, max=len(ICCs)-1, value=4), 
    RedHiThresh=widgets.IntSlider(min=0, max=255, value=130), 
    GrHiThresh=widgets.IntSlider(min=0, max=255, value=130), 
    BluHiThresh=widgets.IntSlider(min=0, max=255, value=255), 
    RedLoThresh=widgets.IntSlider(min=0, max=255, value=0), 
    GrLoThres=widgets.IntSlider(min=0, max=255, value=0),
    BluLoThres=widgets.IntSlider(min=0, max=255, value=25), 
    GlobalTresh=widgets.IntSlider(min=0, max=255, value=127),
    AdaptiveP1 = widgets.IntSlider(min=0, max=255, value=11),
    AdaptiveP2 = widgets.IntSlider(min=0, max=255, value=2));

With the information from the interactive cell above, we can apply RGB threshold values to mask all images in the dataset

# These work better for first 3 images
#hi_blue = (183, 223, 255)
#lo_blue = (8, 0, 22)

# these are somewhat more suitable for latter images
hi_blue = (130, 130, 255) # upper RGB threshold
lo_blue = (0, 0, 25) # lower RGB threshold

def mask_images(images, hi_blue, lo_blue):
    masks = []
    results = []

    for image in images:
        mask = cv2.inRange(image, lo_blue, hi_blue) # color filter
        result = cv2.bitwise_and(image, image, mask=mask) # color mask added
        masks.append(mask)
        results.append(result)

    return masks, results

masks, results = mask_images(ICCs, hi_blue, lo_blue)

In the cell below, pick the image you want to threshold and count the cells from.

def processImages(image, mask, result, hi_blue, lo_blue, gTreshold, Ap1, Ap2):
    # Note: result, hi_blue and lo_blue are accepted for reference only; the mask
    # was already created above, so they are not used in the thresholding itself
    ret, th1 = cv2.threshold(image, gTreshold, 255, cv2.THRESH_BINARY) # global threshold, applied to each RGB channel
    th2 = cv2.adaptiveThreshold(mask, 255, cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY, Ap1, Ap2)
    th3 = cv2.adaptiveThreshold(mask, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, Ap1, Ap2)

    titles = ['Original Image', 'Global Thresholding (v = '+str(gTreshold)+')',
                'Adaptive Mean Thresholding', 'Adaptive Gaussian Thresholding']
    images = [image, th1[:,:,2], th2, th3] # keep only the blue channel of the globally thresholded image

    plt.figure(figsize=(10,10))

    for i in range(4):
        plt.subplot(2,2,i+1),plt.imshow(images[i],'gray')
        plt.title(titles[i])
        plt.xticks([]),plt.yticks([])
    plt.show()
    
    return th1, th2, th3

# Parameter legend: images[index], masked images[index], bitwise images[index], 
# upper RGB threshold, lower RGB threshold and global threshold, 
# adaptive thresholding parameter 1, adaptive thresholding parameter 2

imageIndex = 4 # choose your image here!

th1, th2, th3 = processImages(ICCs[imageIndex], masks[imageIndex], results[imageIndex], 
                              (130, 130, 250), (0, 0, 25), 50, 
                              11, 2)

Vectorize the image data for DBSCAN

DBSCAN expects a feature matrix rather than an image, so we vectorize the image data: every pixel whose value is not 0 becomes a row holding that pixel's x and y coordinates. I noticed that it is easier to first save the coordinates into a dictionary, which I convert into an array later. Perhaps not the most efficient solution, but it works.

def vectorize(image):
    features = {'x':[],'y':[]}
    for i in range(image.shape[0]):      # i = row index of the pixel
        for j in range(image.shape[1]):  # j = column index of the pixel
            if image[i, j] != 0:         # keep only the non-zero (thresholded) pixels
                features['x'].append(i)
                features['y'].append(j)
    return features
            
#dots = vectorize(th1[:,:,2]) # global thresholding
#dots = vectorize(th2) # adaptive mean thresholding
dots = vectorize(th3) # adaptive gaussian thresholding
    
plt.scatter(x=dots['x'], y=dots['y'], marker='.')

Because the row index is stored as x and the column index as y, the plotted point cloud appears rotated compared to the original image, but it still works for counting clusters!
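
As a side note, the same coordinates can be extracted in a single step with NumPy, without the nested Python loops. A minimal sketch, assuming the same thresholded image th3:

# np.nonzero returns the row and column indices of every non-zero pixel,
# which is equivalent to the dictionary built by vectorize()
rows, cols = np.nonzero(th3)
dots_np = {'x': rows.tolist(), 'y': cols.tolist()}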

Count number of cells

Below I'll write a function that turns the vectorized image data into an array and feeds it to the DBSCAN algorithm. I added some features to the scatter plot to display noise in red and to label the clusters. Much of the code was repurposed from this source. Many thanks!

from sklearn.cluster import DBSCAN

def cluster_and_plot(dots, eps, min_samples, annotate):
    arr = np.array([dots[key] for key in ('x', 'y')]).T # turn the feature dict into an array

    db = DBSCAN(eps=eps, min_samples=min_samples) # set hyperparameters, these seemed to work the best for me with my image dataset
    y_pred = db.fit_predict(arr)

    labels = db.labels_

    core_samples = np.zeros_like(labels, dtype = bool)

    core_samples[db.core_sample_indices_] = True

    n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
    print("Total number of clusters: " + str(n_clusters_))

    fig, ax = plt.subplots(figsize=(10, 10))
    ax.scatter(arr[:,0], arr[:,1], c=y_pred, marker=".")

    anomalies = arr[np.where(y_pred==-1)]
    ax.scatter(anomalies[:,0], anomalies[:,1], c="red", marker=".")

    if annotate:
        # annotate the first point of each cluster with its label (skip noise, label -1)
        seen_labels = []
        for i, txt in enumerate(y_pred):
            if txt not in seen_labels and txt != -1:
                ax.annotate(txt, (arr[i,0], arr[i,1]), bbox=dict(boxstyle="round", fc="w"))
                seen_labels.append(txt)
            
# parameter legend: feature input, eps, min_input, cluster annotation
cluster_and_plot(dots, 2, 8, True)
Total number of clusters: 64

Halfway decent. With the images compressed to ~200x200 and the hyperparameters eps=2 and min_samples=8, DBSCAN produced 64 clusters for this image. However, the background is also considered a cluster, and it is divided into two parts: if you look at the lower left, you can see that the cells have segmented it into a separate cluster. The total number of counted cells can therefore be considered 62. DBSCAN also left some points unclustered even though a human would likely classify them as cells. Let's have the original image as comparison.

image-3.png
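
If one wanted to exclude the background clusters automatically instead of subtracting them by eye, one option would be to ignore clusters that cover an unusually large share of the points. The sketch below is my own addition rather than part of the analysis above, and the 20% cutoff is an arbitrary assumption; it expects the label array from DBSCAN (e.g. db.labels_, if cluster_and_plot is modified to return it):

from collections import Counter

def count_cells(labels, max_fraction=0.2):
    # Count how many points fall into each cluster, ignoring noise (label -1)
    sizes = Counter(label for label in labels if label != -1)
    # Clusters covering a large share of all points are treated as background, not cells
    return sum(1 for size in sizes.values() if size < max_fraction * len(labels))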

Discussion

In this blog post, I presented methods that can be used to examine, pre-process, and vectorize an immunocytochemistry image, and to count the number of cells in it with DBSCAN. Some information is lost in the compression and thresholding process, which significantly decreases noise in the data but also increases the number of cells missed by DBSCAN. In addition, some areas that are not cells get clustered as well, which is a result of the adaptive Gaussian thresholding. If adaptive Gaussian thresholding produces too many false positives, global thresholding can be used instead, but it may fuse some nuclei together, which DBSCAN will interpret as a single cluster, increasing the false negative rate. Nevertheless, I'm quite satisfied with the results and hope that some unfortunate soul who has to count every single cell by hand may find this useful.



koodikoira.github.io 3.12.2020