Triplet Loss (Keras) Tutorial

Siamese Network with Triplet Loss

Note: I haven't had time to polish everything, so the material is somewhat raw. But if you need information on different aspects of triplet loss, here are some things that are hard to find (in an explicit form) elsewhere. You have been warned :)

In this tutorial we are going to implement a Siamese Neural Network that uses Triplet Loss. The basics of triplet loss are simple and widely covered in online introductions (see the links below). However, most implementations lack explanations, and a reader, especially a beginner, will not know what to look at or how to fine-tune the code for a particular task.

So we will create verbose Keras code, using MNIST as an example. Then we are going to tweak the network and its parameters, trying to get higher accuracy and to see what each particular change does.

Also, we are not going to use simple in-memory solutions, even though the MNIST dataset does fit in memory. Instead, data generators are used; as a result, the code base presented here can be used on larger datasets with minimal changes.

Finally, graphics are extensively used, so the reader can better understand things like "using lambdas to fit it to a hypersphere" and so on.

Useful links

These links are listed in no particular order; some of them are more useful than others. I suggest you read at least some of them to get a grasp of the theory.

https://www.kaggle.com/talmanr/well-converged-triplet-loss
https://www.kaggle.com/ashishpatel26/triplet-loss-network-for-humpback-whale-prediction
https://github.com/keras-team/keras/issues/12861
https://github.com/handongfeng/MNIST-center-loss/blob/master/centerLoss_MNIST.py
https://github.com/zoli333/Center-Loss
http://yann.lecun.com/exdb/publis/pdf/hadsell-chopra-lecun-06.pdf
https://medium.com/mlreview/experiments-with-a-new-loss-term-added-to-the-standard-cross-entropy-85b080c42446
https://blog.emmanuelcaradec.com/humble-yolo-implementation-in-keras/
https://www.oipapio.com/question-82553
https://www.kaggle.com/guichristmann/training-a-triplet-loss-model-on-mnist
https://github.com/shamangary/Keras-MNIST-center-loss-with-visualization
https://github.com/AdrianUng/keras-triplet-loss-mnist
https://github.com/mjDelta/keras-center-loss-MNIST/blob/master/keras_center_loss_with_embedding.ipynb
https://github.com/shamangary/Keras-MNIST-center-loss-with-visualization/blob/master/TYY_callbacks.py
https://github.com/mjDelta/keras-center-loss-MNIST
https://github.com/keras-team/keras/issues/6929
https://github.com/phongdinhv/triplet-loss-keras-mnist
https://github.com/KinWaiCheuk/Triplet-net-keras

Python code

As was mentioned above, the code is verbose. It could easily be implemented in five times fewer lines, at the price of readability. This is, after all, a tutorial. The code base below can be (conditionally) divided into four sections. First, I list all the functions we need to run our experiments, except for so-called mining. Then we have code to run those experiments. Then we add code to run mining experiments. Finally, we run the mining experiments.


Google Colab

The code is implemented on Google Colab, a generous gift from Google to all developers, allowing free access to the power of top video cards. If you do not want to use Colab, simply do not use the code snippet for Google drive below.

First of all, we need to have access to Google Drive:

from google.colab import drive
drive.mount("/content/drive/", force_remount=True)

Importing Libraries

Then we import the libraries our code will use. I tried to remove imports we don't need, but I am pretty sure I have missed some.

import os
import random 
import numpy as np 
import matplotlib.pyplot as plt
import matplotlib.patheffects as PathEffects
from mpl_toolkits.mplot3d import Axes3D
from sklearn.manifold import TSNE
import keras
from keras import regularizers
from keras.preprocessing.image import ImageDataGenerator
from keras.preprocessing.image import array_to_img, img_to_array, load_img
from keras.layers import Input, Conv2D, MaxPooling2D, Dense, Activation, Dropout, Flatten, Lambda, concatenate, BatchNormalization #, GaussianNoise
from keras.layers import LeakyReLU
from keras.layers import GlobalAveragePooling2D
from keras.callbacks import LambdaCallback
from keras.callbacks import ReduceLROnPlateau, LearningRateScheduler
from keras.callbacks import ModelCheckpoint
from keras.optimizers import Adadelta, RMSprop, Adagrad, Adamax, Nadam
from keras.optimizers import SGD, Adam
from keras.models import Model
import keras.backend as K
from keras.models import Sequential
from sklearn.neighbors import NearestNeighbors
from keras.datasets import mnist

import pylab as pl
import seaborn as sns
import warnings
warnings.simplefilter("ignore", category=DeprecationWarning)

warnings.filterwarnings("ignore")
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

Version of Keras and TensorFlow

When I was half-way through this tutorial, Google released a new version of TensorFlow and, correspondingly, of Keras. So I had a choice: either change the code and redo the experiments, which would require a lot of computing time, or keep using TensorFlow 1.x. Here is the Colab magic that pins the 1.x version:

%tensorflow_version 1.x

Loading data and defining useful constants

Here is the path to my folder on Google Drive, which I am going to use across the code as a base for all nested folders. If you use Google Drive, provide your own path; if not, provide the path to a folder on your computer. We will also need two files: one to keep the last network we got during training, another to keep the best network:

working_path = "/content/drive/My Drive/TripletMnist/data/"
best_weights_filepath = working_path + "models/2_1_TripletMNIST.txt"
last_weights_filepath = working_path + "models/2_1_TripletMNIST_Last.txt"

The last network is used in case we have to stop training and resume it later. The best one is used as the final result of our computations.

The following code is Colab-specific; we are checking whether the GPU is active, just in case we forgot to turn it on in the settings.

# Is GPU Working?
import tensorflow as tf
tf.test.gpu_device_name()	

Output:

'/device:GPU:0'

MNIST comes with test and train data. However, we are going to create universal code for any set of images. So we are going to concatenate all MNIST images and split them ourselves. As a result of this (rather silly) operation, we will have code that can handle a dataset that does not come with a built-in test/train split.

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x = np.concatenate((x_train, x_test), axis=0)
y = np.concatenate((y_train, y_test), axis=0)

Some constants we need. Surprisingly, the number of images MNIST has for different classes (i.e. for "0", "1", "2"...) is not exactly the same. We could have used class weights, to give more training to classes that have fewer images. But instead, we will truncate all arrays to the minimal number of images: in the case of MNIST, it will not make much difference.

unique, counts = np.unique(y, return_counts=True)
print(unique, counts)
print(y[:10])

# This is a minimum num. of images among 10 digits
MIN_NUM_OF_IMAGES_PER_CLASS = min(counts) 
print(MIN_NUM_OF_IMAGES_PER_CLASS)

[0 1 2 3 4 5 6 7 8 9] [6903 7877 6990 7141 6824 6313 6876 7293 6825 6958]

[5 0 4 1 9 2 1 3 1 4]

6313

Just to make sure we loaded images without errors, let's display one:

img = np.array(x[4], dtype='float')
print(y[4])
plt.imshow(img, cmap='gray')
plt.show()

Some additional constants. Image size in MNIST is 28x28 pixels.

The idea of triplet loss is to create a representation of the data in such a way that similar points (like images of the same digit) are located close to each other in a multidimensional space. The embedding dimension is the dimensionality of that space. For most of our experiments we are going to use 3d space. The reason is simple: points in 3d can easily be visualized.
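Before we get to the Keras version, here is a minimal NumPy sketch of the idea for a single triplet of 3d embeddings (purely illustrative; the real losses are defined later in this tutorial):

anchor   = np.array([0.1, 0.9, 0.0])
positive = np.array([0.2, 0.8, 0.0])   # same class as the anchor
negative = np.array([0.9, 0.1, 0.0])   # different class

alpha = 0.2
pos_dist = np.sum((anchor - positive) ** 2)   # ~0.02
neg_dist = np.sum((anchor - negative) ** 2)   # ~1.28
loss = max(pos_dist - neg_dist + alpha, 0.0)
print(pos_dist, neg_dist, loss)   # ~0.02 ~1.28 0.0 - this triplet is already well separated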

embedding_model and triplet_model are placeholder variables, we will assign our networks to them.

alpha is a parameter used by triplet loss. We will play with it later.

IMAGE_SIZE = 28
input_shape=(IMAGE_SIZE, IMAGE_SIZE, 1)

EMBEDDING_DIM = 3
BATCH_SIZE = 256
TRAINING_IMAGES_PER_CLASS = int(MIN_NUM_OF_IMAGES_PER_CLASS * 0.6)
VALIDATION_IMAGES_PER_CLASS = int(MIN_NUM_OF_IMAGES_PER_CLASS * 0.2)
TESTING_IMAGES_PER_CLASS = MIN_NUM_OF_IMAGES_PER_CLASS - TRAINING_IMAGES_PER_CLASS - VALIDATION_IMAGES_PER_CLASS

embedding_model = 0
triplet_model = 0
alpha = 0.2
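To make the split concrete: with MIN_NUM_OF_IMAGES_PER_CLASS = 6313 (the value printed above), the 60/20/20 split gives 3787 training, 1262 validation and 1264 testing images per class:

print(TRAINING_IMAGES_PER_CLASS, VALIDATION_IMAGES_PER_CLASS, TESTING_IMAGES_PER_CLASS)
# 3787 1262 1264   (3787 + 1262 + 1264 = 6313)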

Data Preprocessing

We are going to load our data (both images and labels) into a structure called arrLabeledData. Again, this is definitely overkill for MNIST, but this approach will help a lot as your data gets larger. arrLabeledData is an array, each record containing a class 'Id' (0-9), 'ImageNames' ("0" - "9") (note that labels and ids are identical only in MNIST; in another dataset it could be "0" - "dogs", "1" - "cats"...), and 'Images' - binary data for the images. We are not going to use dynamic image loading, as it is generally slower, but you may have to do it for a larger dataset.

arrLabeledData = []
for strClassName in np.unique(y):
  # All images for that class
  fltr = np.where((y == strClassName))
  #print(len(fltr[0]))

  arrImages = []
  arrImageNames = []
  for nIdx in fltr[0]:
    if(len(arrImageNames) > MIN_NUM_OF_IMAGES_PER_CLASS):
      break
    strFileName = y[nIdx]
    arrImageNames.append(strFileName)
    # Note: here we can use 3 channels instead of one
    #arrImages.append(np.stack((x[nIdx],)*3, axis=-1))  
    arrImages.append(x[nIdx])

  arrLabeledData.append(
  {
    'Id':strClassName,
    'ImageNames':arrImageNames[:MIN_NUM_OF_IMAGES_PER_CLASS],
    'Images':arrImages[:MIN_NUM_OF_IMAGES_PER_CLASS]
  })

# Note: this print belongs outside the loop, after all classes are collected
print("Classes selected: ", len(arrLabeledData), 
  " ->", arrLabeledData[0]['Id'], arrLabeledData[1]['Id'], 
  arrLabeledData[2]['Id'], arrLabeledData[3]['Id'], 
  arrLabeledData[4]['Id'], arrLabeledData[5]['Id'],
  arrLabeledData[6]['Id'], arrLabeledData[7]['Id'], 
  arrLabeledData[8]['Id'], arrLabeledData[9]['Id']) 

Classes selected: 10 -> 0 1 2 3 4 5 6 7 8 9

Scatter Plots

The following code (copied from somewhere - see the links above) is not really required. It is only used to create a scatter plot. We will have more charting functions below, and this one is not that useful, except for data exploration. But it can be used in some projects, so I decided to put it here. See the comments in the code for details:

# This is used in scatterplot (below) only
# We create arrays, that are going to be used by scatter plot. As a result, we have both arrLabeledData and those 
# arrays, which is a waste of memory. Still, as I mentioned above, this is just a demo code, in case we need it.
arrTrainImages = []
arrTrainClasses = []

arrTestImages = []
arrTestClasses = []

# Fill arrays from arrLabeledData
for nClassIdx in range(len(arrLabeledData)):
  for nImageIdx in range(TRAINING_IMAGES_PER_CLASS + VALIDATION_IMAGES_PER_CLASS):
    arrTrainClasses.append(arrLabeledData[nClassIdx]['Id'])
    arrTrainImages.append(img_to_array(arrLabeledData[nClassIdx]['Images'][nImageIdx]))
  for nImageIdx in range(TRAINING_IMAGES_PER_CLASS + VALIDATION_IMAGES_PER_CLASS, 
      TRAINING_IMAGES_PER_CLASS + VALIDATION_IMAGES_PER_CLASS + TESTING_IMAGES_PER_CLASS):
    arrTestClasses.append(arrLabeledData[nClassIdx]['Id'])
    arrTestImages.append(img_to_array(arrLabeledData[nClassIdx]['Images'][nImageIdx]))

# Convert to numpy arrays
arrTrainClasses = np.array(arrTrainClasses)
arrTrainImages = np.array(arrTrainImages)
arrTestClasses = np.array(arrTestClasses)
arrTestImages = np.array(arrTestImages)
    
# And change the shape (basically, flatten)
arrTrainImagesFlat = arrTrainImages.reshape(-1,IMAGE_SIZE*IMAGE_SIZE*1)
arrTestImagesFlat = arrTestImages.reshape(-1,IMAGE_SIZE*IMAGE_SIZE*1)

The arrays we just created can be fed to the following function to produce a scatter plot.

# Define our own plot function
def scatter(x, labels, subtitle=None, bShowClasterLabels=0):
    # Choose a color palette with seaborn.
    cmap = np.array(sns.color_palette("hls", len(labels)))

    # Create a scatter plot.
    f = plt.figure(figsize=(8, 8))
    ax = plt.subplot(aspect='equal')
    sc = ax.scatter(x[:,0], x[:,1], lw=0, s=40, c=cmap)
    plt.xlim(-25, 25)
    plt.ylim(-25, 25)
    ax.axis('off')
    ax.axis('tight')

    # Add the labels for each digit. Use if clusters are present;
    # comment out otherwise.
    if(bShowClasterLabels):
      txts = []
      for i in range(len(arrLabeledData)):
          # Position of each label.
          xtext, ytext = np.median(x[labels == arrLabeledData[i]['Id'], :], axis=0)
          txt = ax.text(xtext, ytext, str(i), fontsize=24)
          txt.set_path_effects([
              PathEffects.Stroke(linewidth=5, foreground="w"),
              PathEffects.Normal()])
          txts.append(txt)
        
    if subtitle != None:
        plt.suptitle(subtitle)
        # Only save when a subtitle (used as the file name) is provided
        plt.savefig(subtitle)

Here is how the function can be called. Note that to make the chart easier to handle, both visually and performance-wise, we only use 100 points from the training set and 20 points from the test set (so we wasted memory copying the entire arrLabeledData):

tsne = TSNE()

arr = list(range(len(arrTrainImagesFlat)))
arrSamples = random.sample(arr, 100)
train_tsne_embeds = tsne.fit_transform(arrTrainImagesFlat[arrSamples])
scatter(train_tsne_embeds, arrTrainClasses[arrSamples], 
	"Samples from Training Data")

arr = list(range(len(arrTestImagesFlat)))
arrSamples = random.sample(arr,20)
eval_tsne_embeds = tsne.fit_transform(arrTestImagesFlat[arrSamples])
scatter(eval_tsne_embeds, arrTestClasses[arrSamples], 
	"Samples from Test Data", 0)

ImageDataGenerator

ImageDataGenerator is an important part of your Keras project if your data does not fit in memory. It provides the neural network with batches of images on demand, loading them from whatever source you provide (usually from disk). MNIST does fit in memory, so we use generators just to make our code applicable to larger tasks. Still, instead of loading images from disk, we "load" them from memory.

Note that as MNIST has enough data, we are not going to augment it (rotate, zoom and so on). If your project is more advanced, by all means use augmentation, by passing non-zero numbers to the constructor below.

datagen = ImageDataGenerator(
    rotation_range=0,
    width_shift_range=0.0,
    height_shift_range=0.0,
    zoom_range=0.0)
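For reference, if your dataset does need augmentation, the constructor call might look something like the following; the values are illustrative only and are not used anywhere in the MNIST experiments below:

# Illustrative augmentation settings (not tuned for any particular task)
datagen_augmented = ImageDataGenerator(
    rotation_range=10,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.1)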

As before, let's take a look at the output of ImageDataGenerator. First, we need a function to retrieve an image (note that the code supports augmentation):

def loadImage(cClass, nImageIdx, datagen):
  img = cClass['Images'][nImageIdx]
  arrImg = img_to_array(img)
  arrImg = datagen.random_transform(arrImg) # augmentation
  return np.array(arrImg, dtype="float32")

Then we can display the result:

np_img = loadImage(arrLabeledData[0], 0, datagen)
img = np_img.reshape((28, 28))
plt.imshow(img, cmap='gray')
plt.show()

Helper functions

As we are going to try different settings for our neural network, we need to make sure networks saved in earlier experiments are properly deleted:

def deleteSavedNet(best_weights_filepath):
    if(os.path.isfile(best_weights_filepath)):
        os.remove(best_weights_filepath)
        print("deleteSavedNet():File removed")
    else:
        print("deleteSavedNet():No file to remove")  

When training, our NN produces (returns) a history object, an array containing loss and metrics for each training epoch. We need functions to plot them. We have three functions: two for "normal" training and a third one for the hand-crafted training loop I use when experimenting with "triplet mining". As I mentioned earlier, the code is over-verbose; otherwise one function could be used instead of three:

def plotHistoryAcc(history):
    plt.plot(history.history['my_accuracy'])
    plt.plot(history.history['val_my_accuracy'])
    plt.title('Model accuracy')
    plt.ylabel('Accuracy')
    plt.xlabel('Epoch')
    plt.show()
    
def plotHistoryLoss(history):
    plt.plot(history.history['loss'])
    plt.plot(history.history['val_loss'])
    plt.title('Model loss')
    plt.ylabel('Loss')
    plt.xlabel('Epoch')
    plt.show()    
    
def plotLoss(arrTraining, arrValidation):
    plt.plot(arrTraining)
    plt.plot(arrValidation)
    plt.title('')
    plt.ylabel('')
    plt.xlabel('Epoch')
    plt.show()       

Triplet Loss

Is triplet loss in its standard / most used form (triplet_loss_01 below) the best possible choice? We are going to run a couple of tests with alternative (though based on the same idea) losses to make sure we are not missing something important.

def triplet_loss_01(y_true, y_pred, alpha = 0.2):

	total_lenght = y_pred.shape.as_list()[-1]
    
	# We got data packed in one array (just a design choice), so we need to 
	# unpack it to get the predicted values (from the network, see below)
	# for the anchor, the positive class and the negative class.
	anchor = y_pred[:,0:int(total_lenght*1/3)]
	positive = y_pred[:,int(total_lenght*1/3):int(total_lenght*2/3)]
	negative = y_pred[:,int(total_lenght*2/3):int(total_lenght*3/3)]

	# Distance between anchor and positive (same class) example
	pos_dist = K.sum(K.square(anchor-positive),axis=1)

	# Distance between anchor and negative
	neg_dist = K.sum(K.square(anchor-negative),axis=1)

	basic_loss = pos_dist-neg_dist+alpha
	loss = K.maximum(basic_loss,0.0)

	return loss

	# --- Now a few non-standard triplet losses
  
def triplet_loss_02(y_true, y_pred, alpha = 0.2):

	epsilon=1e-8
	beta = 3
	N = 3
  
	total_lenght = y_pred.shape.as_list()[-1]

	anchor = y_pred[:,0:int(total_lenght*1/3)]
	positive = y_pred[:,int(total_lenght*1/3):int(total_lenght*2/3)]
	negative = y_pred[:,int(total_lenght*2/3):int(total_lenght*3/3)]

	# distance between the anchor and the positive
	pos_dist = K.sum(K.square(anchor - positive),axis=1)

	# distance between the anchor and the negative
	neg_dist = K.sum(K.square(anchor - negative),axis=1)

	# The link to the source is definitely one of the links above
	pos_dist = -K.log(-((pos_dist) / beta)+1+epsilon)
	neg_dist = -K.log(-((N-neg_dist) / beta)+1+epsilon)

	loss = neg_dist + pos_dist

	return loss
  
	# ---
  
def triplet_loss_03(y_true, y_pred, alpha = 0.2):

	total_lenght = y_pred.shape.as_list()[-1]
    
	anchor = y_pred[:,0:int(total_lenght*1/3)]
	positive = y_pred[:,int(total_lenght*1/3):int(total_lenght*2/3)]
	negative = y_pred[:,int(total_lenght*2/3):int(total_lenght*3/3)]

	pos_dist = K.sum(K.square(anchor-positive),axis=1)

	neg_dist = K.sum(K.square(anchor-negative),axis=1)

	# Here we divide, instead of subtracting
	basic_loss = (pos_dist+alpha)/(neg_dist+alpha)
	loss = K.maximum(basic_loss,0.0)
    
	return K.mean(loss)

  # ---
  
def triplet_loss_04(y_true, y_pred, alpha = 0.2):
  
	total_lenght = y_pred.shape.as_list()[-1]
	print("triplet_loss.total_lenght: ", total_lenght)
    
	anchor = y_pred[:,0:int(total_lenght*1/3)]
	positive = y_pred[:,int(total_lenght*1/3):int(total_lenght*2/3)]
	negative = y_pred[:,int(total_lenght*2/3):int(total_lenght*3/3)]

	pos_dist = K.sum(K.square(anchor-positive),axis=1)

	neg_dist = K.sum(K.square(anchor-negative),axis=1)

	basic_loss = (pos_dist + 1e-6) * (pos_dist - neg_dist + alpha)
	loss = K.maximum(basic_loss,0.0)
 
	return loss  
  
  # ---
  
def triplet_loss_05(y_true, y_pred, alpha = 0.2):
  
	total_lenght = y_pred.shape.as_list()[-1]
	print("triplet_loss.total_lenght: ", total_lenght)
    
	anchor = y_pred[:,0:int(total_lenght*1/3)]
	positive = y_pred[:,int(total_lenght*1/3):int(total_lenght*2/3)]
	negative = y_pred[:,int(total_lenght*2/3):int(total_lenght*3/3)]

	pos_dist = K.sum(K.square(anchor-positive),axis=1)

	neg_dist = K.sum(K.square(anchor-negative),axis=1)

	basic_loss = (pos_dist + 1) * (EMBEDDING_DIM - neg_dist)
	loss = K.maximum(basic_loss,0.0)
 
	return loss
  
  # ---

# Division loss
#
#P. Wohlhart and V. Lepetit. Learning Descriptors for Object
#Recognition and 3D Pose Estimation. Proceedings of the
#IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
#
#B. G. Kumar, G. Carneiro, and I. D. Reid. Learning Local
#Image Descriptors with Deep Siamese and Triplet Convolutional 
#Networks by Minimising Global Loss Functions. Proceedings of the 
#IEEE Conference on Computer Vision and
#Pattern Recognition (CVPR), 2016.

def triplet_loss_06(y_true, y_pred, alpha = 0.2):
	epsilon=1e-8

	total_lenght = y_pred.shape.as_list()[-1]
    
	anchor = y_pred[:,0:int(total_lenght*1/3)]
	positive = y_pred[:,int(total_lenght*1/3):int(total_lenght*2/3)]
	negative = y_pred[:,int(total_lenght*2/3):int(total_lenght*3/3)]

	pos_dist = K.sum(K.square(anchor-positive),axis=1)

	neg_dist = K.sum(K.square(anchor-negative),axis=1)

	basic_loss = 1 - neg_dist/(pos_dist + epsilon)
	loss = K.maximum(basic_loss,0.0)

	return loss

	# ---

# Log loss of softmax
def triplet_loss_07(y_true, y_pred, alpha = 0.2):
  
	total_lenght = y_pred.shape.as_list()[-1]
    
	anchor = y_pred[:,0:int(total_lenght*1/3)]
	positive = y_pred[:,int(total_lenght*1/3):int(total_lenght*2/3)]
	negative = y_pred[:,int(total_lenght*2/3):int(total_lenght*3/3)]

	pos_dist = K.sum(K.square(anchor-positive),axis=1)

	neg_dist = K.sum(K.square(anchor-negative),axis=1)

	# Use the backend's exp (the original snippet called a bare exp());
	# this is the softmax of the positive distance over both distances.
	loss = K.exp(pos_dist)/(K.exp(pos_dist) + K.exp(neg_dist))
	#loss = K.maximum(loss,0.0)

	return loss  
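As a quick sanity check of the packing convention used by all the losses above, triplet_loss_01 can be evaluated on a hand-made batch. This is only an illustration (assuming the TF1 Keras backend used in this tutorial) with one triplet of 1-dimensional "embeddings":

y_dummy = K.constant([[0.0, 0.1, 0.5]])   # anchor=0.0, positive=0.1, negative=0.5
print(K.eval(triplet_loss_01(None, y_dummy)))
# pos_dist = 0.01, neg_dist = 0.25, so max(0.01 - 0.25 + 0.2, 0) = 0 -> prints [0.]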

As I mentioned, most of these losses are copied (and changed to fit my data format) from different Internet sources. The idea is to come up with ANY expression that behaves like a loss (the better the performance, the smaller the loss).


A Siamese NN uses "base networks" that share weights (in our triplet setup there are three branches: anchor, positive and negative). They learn by comparing their results, and then we use one of them (any, as they are identical) to do predictions. Here are some models we are going to try (note that we do not use transfer learning, as our task is really simple):

# A "standard" NN we use in most tests, 
# multiple layers and Lambda at the end.
def getBaseModel_01(input_shape, type, nL2):
  
	base_model = Sequential()
  
	base_model.add(Conv2D(32,(2,2),padding='same',input_shape=input_shape,
		activation='relu',name='conv1a', kernel_regularizer=regularizers.l2(nL2)))

	base_model.add(BatchNormalization())
  
	base_model.add(Conv2D(32,(2,2),padding='same',activation='relu',name='conv1b', 
		kernel_regularizer=regularizers.l2(nL2)))
  
	base_model.add(BatchNormalization())

	base_model.add(Conv2D(32,(2,2),padding='same',activation='relu',name='conv1c', 
		kernel_regularizer=regularizers.l2(nL2)))
  
	base_model.add(BatchNormalization())

	base_model.add(MaxPooling2D((2,2),(2,2),padding='same',name='pool1'))
    
	base_model.add(Conv2D(64,(2,2),padding='same',input_shape=input_shape,
		activation='relu',name='conv2a', kernel_regularizer=regularizers.l2(nL2)))
  
	base_model.add(BatchNormalization())
  
	base_model.add(Conv2D(64,(2,2),padding='same',activation='relu',name='conv2b', 
		kernel_regularizer=regularizers.l2(nL2)))

	base_model.add(BatchNormalization())

	base_model.add(Conv2D(64,(2,2),padding='same',activation='relu',name='conv2c', 
		kernel_regularizer=regularizers.l2(nL2)))
  
	base_model.add(BatchNormalization())

	base_model.add(MaxPooling2D((2,2),(2,2),padding='same',name='pool2'))
    
	base_model.add(Conv2D(128,(2,2),padding='same',input_shape=input_shape,
		activation='relu',name='conv3a', kernel_regularizer=regularizers.l2(nL2)))
  
	base_model.add(BatchNormalization())
  
	base_model.add(Conv2D(128,(2,2),padding='same',activation='relu',
		name='conv3b', kernel_regularizer=regularizers.l2(nL2)))
  
	base_model.add(BatchNormalization())

	base_model.add(Conv2D(128,(2,2),padding='same',activation='relu',
		name='conv3c', kernel_regularizer=regularizers.l2(nL2)))
  
	base_model.add(BatchNormalization())

	base_model.add(MaxPooling2D((2,2),(2,2),padding='same',name='pool3'))
    
	base_model.add(Conv2D(256,(2,2),padding='same',activation='relu',
		name='conv4b', kernel_regularizer=regularizers.l2(nL2)))
  
	base_model.add(BatchNormalization())
    
	base_model.add(Conv2D(256,(2,2),padding='same',activation='relu',
		name='conv4c', kernel_regularizer=regularizers.l2(nL2)))
  
	base_model.add(BatchNormalization())
    
	base_model.add(MaxPooling2D((2,2),(2,2),padding='same',name='pool4'))

	# TBD: pass dropout instead of hardcoding
	base_model.add(Dropout(0.5))
	base_model.add(Flatten(name='flatten'))
    
	base_model.add(Dense(EMBEDDING_DIM, name='embeddings', activation=None, 
		kernel_regularizer=regularizers.l2(nL2)))
  
	base_model.add(Lambda(lambda x: K.l2_normalize(x,axis=-1)))
    
	return base_model 

The same NN as above, but without the Lambda at the end. In other words, we are not going to "push" points onto the hypersphere (as we use a 3d embedding in most of our experiments, the "hypersphere" is just a 3d sphere, which allows for nice visualizations). If we do not use it, the clusters will be areas in 3d space, not "spots" on the sphere's surface:

# Same as getBaseModel_01, but no Lambda at the end
def getBaseModel_02(input_shape, type, nL2):
	...
	base_model.add(Dense(EMBEDDING_DIM, name='embeddings', 
		activation=None, kernel_regularizer=regularizers.l2(nL2)))
	base_model.add(Activation('sigmoid'))
  
	# Note this line commented
	#base_model.add(Lambda(lambda x: K.l2_normalize(x,axis=-1)))
    
	return base_model 	

How many layers should we use in classifier?

When looking through examples online, one can notice that most of them use two fully connected layers as the network's classifier. Why two? This NN allows us to compose classifiers with an arbitrary number of FC layers:


# More FC layers + Lambda
def getBaseModel_03(input_shape, type, nL2, dense_layer_sizes=2):  # number of extra FC layers
  
	...
  
	base_model.add(MaxPooling2D((2,2),(2,2),padding='same',name='pool4'))

	base_model.add(Dropout(0.5))
	base_model.add(Flatten(name='flatten'))
  
	for layer_size in range(dense_layer_sizes):
		base_model.add(Dense(256, activation='relu', 
			kernel_regularizer=regularizers.l2(nL2)))
		base_model.add(BatchNormalization())
		base_model.add(Dropout(0.5))
    
	base_model.add(Dense(EMBEDDING_DIM, name='embeddings', 
		activation=None, kernel_regularizer=regularizers.l2(nL2)))
	base_model.add(Lambda(lambda x: K.l2_normalize(x,axis=-1)))
    
	return base_model 

Estimating Accuracy

Accuracy is how we estimate the performance of our network. The way it works (in this particular case) is by taking all elements in a batch, and calculating an average of "is positive distance smaller than negative one":


def my_accuracy(y_true, y_pred):
    return K.mean(y_pred[:,0] < y_pred[:,1])
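Note that, as written, my_accuracy compares the first two columns of the packed prediction (i.e. two components of the anchor embedding), so it is only a rough proxy. If you prefer a metric that literally computes "is the positive distance smaller than the negative one" from the packed embeddings, a sketch could look like this (an alternative, not what the experiments below used):

def my_accuracy_distances(y_true, y_pred):
    # Same packing convention as the losses: anchor | positive | negative
    n = y_pred.shape.as_list()[-1] // 3
    anchor, positive, negative = y_pred[:, :n], y_pred[:, n:2*n], y_pred[:, 2*n:]
    pos_dist = K.sum(K.square(anchor - positive), axis=1)
    neg_dist = K.sum(K.square(anchor - negative), axis=1)
    return K.mean(K.cast(K.less(pos_dist, neg_dist), 'float32'))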


Now let's build a model. This is simply a wrapper function that takes a few parameters (like the type of triplet loss and the type of network to use) and returns two models: the base model and the triplet model:

Creating a model

def createModel(nL2, triplet_loss, optimizer, nBaseModel):
  
	if(nBaseModel == 0):
		base_model = getBaseModel_00(input_shape, "lossless", nL2)
	if(nBaseModel == 1):
		base_model = getBaseModel_01(input_shape, "lossless", nL2)
	elif(nBaseModel == 2):
		base_model = getBaseModel_02(input_shape, "lossless", nL2)
	elif(nBaseModel == 3):
		base_model = getBaseModel_03(input_shape, "lossless", nL2)
	elif(nBaseModel == 4):
		base_model = getBaseModel_04(input_shape, "lossless", nL2)
  
	anchor_input = Input(shape=input_shape, name='anchor_input')
	positive_input = Input(shape=input_shape, name='positive_input')
	negative_input = Input(shape=input_shape, name='negative_input')

	encoded_anchor = base_model(anchor_input)
	encoded_positive = base_model(positive_input)
	encoded_negative = base_model(negative_input)

	merged_vector = concatenate([encoded_anchor, encoded_positive, 
		encoded_negative], axis=-1, name='merged_layer')

	triplet_model = Model(inputs=[anchor_input, positive_input, 
		negative_input], outputs=merged_vector)
	triplet_model.compile(loss=triplet_loss, optimizer=optimizer, 
		metrics=[my_accuracy])
  
	#triplet_model.summary()
  
	return base_model, triplet_model

Dynamically loading triplets of images

We use a generator (the variable's name is "gen") to provide data on request. The reason is simple: we cannot fit all images in memory (well, we can for MNIST, but this code is intended to be universal), so we need on-demand loading. By data we mean a triplet of images and labels for them. So, here is the get_triplet function:

def get_triplet(nSplitIdx, bIsTrain):
	# Select random class
	positiveClass = np.random.choice(arrLabeledData)
  
	# Depending on whether it is train or validation, select the range. 
	# Say we have 10 images per class, and 70% goes to train. 
	# Then 0-6 (train); 7-9 (valid, at least 3)
	if(bIsTrain):
		nMinIdx = 0
		nMaxIdx = nSplitIdx - 1
	else:
		nMinIdx = nSplitIdx
		nMaxIdx = MIN_NUM_OF_IMAGES_PER_CLASS - 1 - TESTING_IMAGES_PER_CLASS

	# Get 3 indices: for base image and for positive example, 
	#	from same class. And one more for negative example.
	arrImageIdx = np.random.choice(range(nMinIdx, nMaxIdx), 3)
      
	while arrImageIdx[0] == arrImageIdx[1]:
		arrImageIdx[1] = np.random.choice(range(nMinIdx, nMaxIdx))
    
	negativeClass = np.random.choice(arrLabeledData)
	while negativeClass['Id'] == positiveClass['Id']:
		negativeClass = np.random.choice(arrLabeledData)

	return (arrImageIdx, positiveClass, negativeClass)

What it does is return three pieces of data (to be used together with the arrLabeledData array): one is the positive class, one the negative class, and an array of three indices. In other words: arrImageIdx[0] is an index (in the positiveClass record) of the "base" element, arrImageIdx[1] is an index (in the positiveClass record) of the "positive" element, and arrImageIdx[2] is an index (in the negativeClass record) of the "negative" element. There are 3 indices and 2 records (positiveClass and negativeClass) because the base element and the positive element should belong to the same class.
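A quick sanity check (illustration only) that get_triplet returns sensible data:

arrImageIdx, positiveClass, negativeClass = get_triplet(TRAINING_IMAGES_PER_CLASS, True)
print("positive class:", positiveClass['Id'],
      "negative class:", negativeClass['Id'],
      "image indices:", arrImageIdx)

imgBase = loadImage(positiveClass, arrImageIdx[0], datagen)
imgNeg = loadImage(negativeClass, arrImageIdx[2], datagen)
print(imgBase.shape, imgNeg.shape)   # (28, 28, 1) for both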

Now the generator that uses this function:

def gen(bIsTrain):
	while True:
		arrBaseExamples = []
		arrPositiveExamples = []
		arrNegativeExamples = []

		for i in range(BATCH_SIZE):
			nImageIdx, positiveClass, negativeClass = get_triplet(TRAINING_IMAGES_PER_CLASS, bIsTrain)
		  
			baseExampleImg = loadImage(positiveClass, nImageIdx[0], datagen)      
			positiveExampleImg = loadImage(positiveClass, nImageIdx[1], datagen)
			negativeExampleImg = loadImage(negativeClass, nImageIdx[2], datagen)
		  
			arrBaseExamples.append(baseExampleImg)
			arrPositiveExamples.append(positiveExampleImg)
			arrNegativeExamples.append(negativeExampleImg)

		base = np.array(arrBaseExamples) / 255.
			
		positive = np.array(arrPositiveExamples) / 255.
			
		negative = np.array(arrNegativeExamples) / 255.
			
		label = None
		  
		yield ([base, positive, negative], base)

Callbacks are functions we can provide to Keras so that they are called at certain moments of network training. We pass them to the "fit" function as a list (a list of functions). Here we are going to use "checkpoint", called to save the best model we have got so far, and "save_model_at_epoch_end_callback", used to save (overwriting the existing file) the network weights at the end of every epoch.

Why both? The "checkpoint" callback overwrites the previously "checkpointed" network only when we get a better one (according to our monitoring criterion). That can happen once in a while, not every epoch.

The network we save every epoch is required in case we want to stop a long training run and then restart it from where we left off.

def getCallbacks(monitor, mode):
	# Use the monitor/mode arguments instead of hardcoding them
	checkpoint = ModelCheckpoint(best_weights_filepath, 
		monitor=monitor, save_best_only=True, 
		save_weights_only=True, mode=mode, verbose=1)

	save_model_at_epoch_end_callback = LambdaCallback(
		on_epoch_end=lambda epoch, logs: 
			triplet_model.save_weights(last_weights_filepath))  

	callbacks_list = [checkpoint, save_model_at_epoch_end_callback]  # , early]

	return callbacks_list

Creating a generator object:

gen_train = gen(True)
gen_valid = gen(False)	

As a good practice, when we create a function to retrieve something (images) in a twisted, elaborate way, it is a good idea to test it. Here is a function to display three images, and the "next" call, getting the next trio of images out of the generator:

def ShowImg(base, pos, neg):
    fig = plt.figure() #figsize=(16,16))

    fig.add_subplot(1, 3, 1)
    plt.imshow(base, cmap='gray')

    fig.add_subplot(1, 3, 2)
    plt.imshow(pos, cmap='gray')

    fig.add_subplot(1, 3, 3)
    plt.imshow(neg, cmap='gray')

    plt.show()
    plt.close()

([base, positive, negative], base) = next(gen_train)

for i, nBatch in enumerate(range(3)): # enumerate(range(BATCH_SIZE)):
  imgBase = base[i][:,:,0]
  imgPositive = positive[i][:,:,0]
  imgNegative = negative[i][:,:,0]
  ShowImg(imgBase, imgPositive, imgNegative)

Loading saved models

As we saved the "best" and "last" models, we need to be able to load them afterwards:

# Load the "best model" we saved during the training
def loadBestModel(triplet_model):
	global best_weights_filepath
    
	if(os.path.isfile(best_weights_filepath)):
		triplet_model.load_weights(best_weights_filepath)
		print("loadBestModel(): File loaded")
	else:
		print("loadBestModel(): No file to load")

	return triplet_model

def loadLastModel(triplet_model):
	global last_weights_filepath
    
	if(os.path.isfile(last_weights_filepath)):
		triplet_model.load_weights(last_weights_filepath)
		print("loadLastModel(): File loaded")
	else:
		print("loadLastModel(): No file to load")

	return triplet_model

Training the network

The following function (trainNetwork) is a wrapper that creates a network according to the parameters we pass (like the type of the base network, etc.) and then trains it:

def trainNetwork(EPOCHS, nL2, nMultiplier, 
	optimizer, nTripletLoss = 1, nBaseModel = 1, bCumulativeLearning = False):
  
	global embedding_model
	global triplet_model
  

	if(bCumulativeLearning == False):
		deleteSavedNet(best_weights_filepath)

	#random.seed(datetime.now())
	random.seed(7)
    
	if(nTripletLoss == 1):
		embedding_model, triplet_model = createModel(nL2, 
			triplet_loss_01, optimizer, nBaseModel = nBaseModel)
		print("Model created")
	elif(nTripletLoss == 2):
		embedding_model, triplet_model = createModel(nL2, 
			triplet_loss_02, optimizer, nBaseModel = nBaseModel)
	elif(nTripletLoss == 3):
		embedding_model, triplet_model = createModel(nL2, 
			triplet_loss_03, optimizer, nBaseModel = nBaseModel)
	elif(nTripletLoss == 4):
		embedding_model, triplet_model = createModel(nL2, 
			triplet_loss_04, optimizer, nBaseModel = nBaseModel)
	elif(nTripletLoss == 5):
		embedding_model, triplet_model = createModel(nL2, 
			triplet_loss_05, optimizer, nBaseModel = nBaseModel) 
	elif(nTripletLoss == 6):
		embedding_model, triplet_model = createModel(nL2, 
			triplet_loss_06, optimizer, nBaseModel = nBaseModel)       

	callbacks_list = getCallbacks("val_my_accuracy", 'max')  
      
	if(bCumulativeLearning == True):
		loadLastModel(triplet_model)

	nNumOfClasses = len(arrLabeledData)
	nNumOfTrainSamples = TRAINING_IMAGES_PER_CLASS * nNumOfClasses
	nNumOfValidSamples = VALIDATION_IMAGES_PER_CLASS * nNumOfClasses
	STEP_SIZE_TRAIN = nMultiplier * nNumOfTrainSamples // BATCH_SIZE
	if(STEP_SIZE_TRAIN == 0):
		STEP_SIZE_TRAIN = 1

	STEP_SIZE_VALID = nMultiplier * nNumOfValidSamples // BATCH_SIZE
	if(STEP_SIZE_VALID == 0):
		STEP_SIZE_VALID = 1

	print("Available metrics: ", triplet_model.metrics_names)

	random.seed(7)
	history = triplet_model.fit_generator(gen_train, 
		validation_data=gen_valid, verbose=0,
		epochs=EPOCHS, steps_per_epoch=STEP_SIZE_TRAIN, 
		validation_steps=STEP_SIZE_VALID, callbacks=callbacks_list)
		#workers=4, use_multiprocessing=True

	print(nL2, EMBEDDING_DIM)
	plotHistoryLoss(history)    
	plotHistoryAcc(history)
    
	return embedding_model, triplet_model

Data generator for individual network

Next we need a data generator... Wait! Haven't we already created one? Well, the one above was for the triplet network, while after training is complete we need to verify the result, right? So we need to be able to get an image, as well as its label and class - not a triplet, just the image data for one of the Siamese branches. Here we go:

def data_generator_simple(arrAllImages, 
	arrAllImageLabels, arrAllImageClasses):
	i = 0
	arrImages = []
	arrImageLabels = []
	arrImageClasses = []
	for nImageIdx in range(len(arrAllImages)):
		if(i == 0):
			arrImages = []
			arrImageLabels = []
			arrImageClasses = []
      
		i += 1
		
		arrImg = img_to_array(arrAllImages[nImageIdx])
		arrImg = datagen.random_transform(arrImg) / 255.
		arrImg = np.array(arrImg, dtype="float32")    
		
		arrImages.append(arrImg)
		arrImageLabels.append(arrAllImageLabels[nImageIdx])
		arrImageClasses.append(arrAllImageClasses[nImageIdx])
		
		if i == BATCH_SIZE:
			i = 0
			arrImages = np.array(arrImages)
			yield arrImages, arrImageLabels, arrImageClasses

	# After the loop, yield whatever is left as a final (smaller) batch.
	# No explicit StopIteration is needed - the generator simply ends here.
	if i != 0:
		arrImages = np.array(arrImages)
		yield arrImages, arrImageLabels, arrImageClasses

As always, we need to test the resulting function to make sure it does what we expect. ShowImgSimple is a simple function that displays an image with its label. Then we create three arrays (images, labels, classes) and feed them to the "simple" generator. As expected, the generator returns a batch of images and associated data, so we display it.

def ShowImgSimple(img, label):
	print(label)
  
	fig = plt.figure()

	fig.add_subplot(1, 1, 1)
	plt.imshow(img, cmap='gray')

	plt.show()
	plt.close()

arrAllImages = []
arrAllImageLabels = []
arrAllImageClasses = []

for cClass in arrLabeledData:
	for nIdx in range(MIN_NUM_OF_IMAGES_PER_CLASS):
		arrAllImages.append(cClass['Images'][nIdx])
		arrAllImageLabels.append(cClass['ImageNames'][nIdx])
		arrAllImageClasses.append(cClass['Id'])
  
gen_simple = data_generator_simple(arrAllImages, 
	arrAllImageLabels, arrAllImageClasses)
images, labels, classes = next(gen_simple)

for i, nBatch in enumerate(range(1)): #BATCH_SIZE)):
	img = images[i][:,:,0]
	label = "{}; {}".format(labels[i], str(classes[i]))
	ShowImgSimple(img, label)

This is what the output looks like: the label printed first, followed by the corresponding digit image.

Getting predictions

The following function has a somewhat misleading name, getAllTrainImages. What it does is gather (hence the name) all training images together, feed them to a generator, and, as the generator returns batches, feed those batches to our trained neural net. The output is predictions (the embeddings the network produces for those images), together with class names and file (image) names, which are the ground truth. In other words, it returns predictions which we can verify.

def getAllTrainImages():
	global embedding_model
  
	arrAllImages = []
	arrAllImageLabels = []
	arrAllImageClasses = []

	for i in range(nMultiplier):
		for cClass in arrLabeledData:
			for nIdx in range(TRAINING_IMAGES_PER_CLASS):
				arrAllImages.append(cClass['Images'][nIdx])
				arrAllImageLabels.append(cClass['ImageNames'][nIdx])
				arrAllImageClasses.append(cClass['Id'])

	train_preds  = []
	train_file_names = []
	train_class_names = []

	for imgs, fnames, classes in data_generator_simple(arrAllImages, 
		arrAllImageLabels, arrAllImageClasses):
		predicts = embedding_model.predict(imgs)
		predicts = predicts.tolist()
		train_preds += predicts
		train_file_names += fnames
		train_class_names += classes
  
	train_preds = np.array(train_preds) 
  
	return train_preds, train_file_names, train_class_names

Same as above, this time for the test image set. After all, we only care about prediction accuracy on this set, not on the training one.

def getAllTestImages():
	global embedding_model
  
	arrAllImages = []
	arrAllImageLabels = []
	arrAllImageClasses = []

	for i in range(nMultiplier):
		for cClass in arrLabeledData:
			for nIdx in range(TRAINING_IMAGES_PER_CLASS + VALIDATION_IMAGES_PER_CLASS, 
				TRAINING_IMAGES_PER_CLASS + VALIDATION_IMAGES_PER_CLASS + 
				TESTING_IMAGES_PER_CLASS):
				arrAllImages.append(cClass['Images'][nIdx])
				arrAllImageLabels.append(cClass['ImageNames'][nIdx])
				arrAllImageClasses.append(cClass['Id'])

	test_preds  = []
	test_file_names = []
	test_class_names = []

	for imgs, fnames, classes in data_generator_simple(arrAllImages, 
		arrAllImageLabels, arrAllImageClasses):
		predicts = embedding_model.predict(imgs)
		predicts = predicts.tolist()
		test_preds += predicts
		test_file_names += fnames
		test_class_names += classes
	test_preds = np.array(test_preds)
  
	return test_preds, test_file_names, test_class_names

KNN

As we know, what a Triplet Loss Network does is group points into clusters geometrically, with similar points closer together. To turn this information into a classifier, we can use the Nearest Neighbours method (among many others).

In the code below, we create a scikit-learn NearestNeighbors object and fit it on the points (which are geometric coordinates) our network produced. As we do not know the optimal number of neighbours (and because it is computationally cheap), we are going to try a range of values, between 1 and 15, and pick the best one. We fit it on the train data.

A fitted NearestNeighbors object can now be used to classify the test data. We sort the result by distance and take the winner (sample_result[0][0], the first in the sorted list).

Then... then we distrust the accuracy we got during training and instead use the exact value, which is "guessed right" vs "guessed wrong". We count correct and incorrect answers to estimate how good our network is on the test set.
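Before the full function, here is a toy illustration (for intuition only; the toy_* names are made up for this example) of how a fitted NearestNeighbors object is queried:

# 4 "train" embeddings with known classes, and 1 "test" embedding
toy_train_preds = np.array([[0.0, 0.0, 1.0], [0.1, 0.0, 1.0],
                            [1.0, 0.0, 0.0], [0.9, 0.1, 0.0]])
toy_train_classes = ["1", "1", "7", "7"]

toy_neigh = NearestNeighbors(n_neighbors=2)
toy_neigh.fit(toy_train_preds)

dist, idx = toy_neigh.kneighbors(np.array([[0.05, 0.0, 0.95]]))
print([toy_train_classes[i] for i in idx[0]])   # ['1', '1'] - both nearest neighbours are class "1"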

def getBestNeighboursMNIST(train_preds, train_file_names, 
	train_class_names, test_preds, test_file_names, test_class_names):

	arrNeighbours = []

	for neighbors in range(1,16):

		neigh = NearestNeighbors(n_neighbors=neighbors)
		neigh.fit(train_preds)

		distances_test, neighbors_test = neigh.kneighbors(test_preds)
		distances_test, neighbors_test = distances_test.tolist(), neighbors_test.tolist()

		preds_str = []
		arrSearchPositions = []

		for filepath, distance, neighbour_ in zip(test_file_names, 
			distances_test, neighbors_test):

			sample_result = []
			sample_classes = []

			for d, n in zip(distance, neighbour_):
				train_file = train_file_names[n]
				class_train = train_file

				sample_classes.append(class_train)
				sample_result.append((class_train, d))

			sample_result.sort(key=lambda x: x[1], reverse=True)

			sample_result = sample_result[0][0]
			preds_str.append(sample_result)

		nTotalSuccess = 0
		for i, strClassNames in enumerate(preds_str):
			if(test_class_names[i] == strClassNames):
				strContains = ": Yes"
				nTotalSuccess += 1
			else:
				strContains = ": No"

		nSuccess = nTotalSuccess / (i+1)

		arrNeighbours.append([neighbors, nSuccess])
		
	arrNeighbours.sort(key=lambda x: x[1], reverse=True)
	arrNeighbours = arrNeighbours[:4]
	  
	return arrNeighbours

Colors for our presentation

To draw 10 different sets of points, we need 10 different colors. First, let's take a look at what the sns package has to offer. We only do it to figure out how a palette is stored; then we will create one of our own. Here is its "hls" palette:

cmap = np.array(sns.color_palette("hls", 10))
print(cmap)
[[0.86 0.3712 0.34 ]
[0.86 0.6832 0.34 ]
[0.7248 0.86 0.34 ]
[0.4128 0.86 0.34 ]
[0.34 0.86 0.5792]
[0.34 0.8288 0.86 ]
[0.34 0.5168 0.86 ]
[0.4752 0.34 0.86 ]
[0.7872 0.34 0.86 ]
[0.86 0.34 0.6208]]
As we can see, the palette values are in the range 0-1, so to convert colors from an RGB range we need to divide by 255. Instead of using one of the predefined palettes, let's create our own and build a scatter plot:
def plotMNIST2d(test_class_names, test_preds, 
	bShowClasterLabels=True):

	h = 0.02

	cmap = np.array([
		[128, 0, 0],
		[0, 128, 0],
		[0, 0, 128],
		[0, 128, 128],
		[128, 0, 128],
		[255, 0, 0],
		[0, 255, 0],
		[0, 0, 255],
		[0, 255, 255],
		[255, 255, 0]
	]) / 255.0
	j = 0
	name = ""
  
	color_space = []
	for i in range(len(test_class_names)):
		if(name != test_class_names[i]):
			name = test_class_names[i]
			if(j >= len(unique)):
				j = 0
			j = j + 1
			#print(j-1, name)
		color_space.append(cmap[j-1])

	tsne_results = test_preds
    
	f = plt.figure(figsize=(10, 10))
	ax = plt.subplot(aspect='equal')
	sc = ax.scatter(tsne_results[:,0], tsne_results[:,1], 
		lw=0, s=40, c=color_space)
	plt.xlim(-25, 25)
	plt.ylim(-25, 25)
	ax.axis('off')
	ax.axis('tight')

	# Add the labels for each digit. Use if clusters are present.
	if(bShowClasterLabels == True):
		txts = []
		for i in range(len(arrLabeledData)):
			# Position of each label.
			xtext, ytext = np.median(tsne_results[
				test_class_names == arrLabeledData[i]['Id'], :], axis=0)
			txt = ax.text(xtext, ytext, str(i), fontsize=24)
			txt.set_path_effects([
				PathEffects.Stroke(linewidth=5, foreground="w"), 
					PathEffects.Normal()])
			txts.append(txt)
        
	plt.show()

Above we used a 2d scatter plot to present our results. Basically, we are going to have three scenarios, as the embedding dimension of our model can be 2, 3 or more. If the embedding dimension equals two, we can use a 2d scatter plot to display our clusters of points. If the embedding dimension is three, we can plot the clusters using a 3d plot. Finally, if the dimension is four or above, we are going to use a TSNE transformation to map the results onto a 2d surface.

Displaying results for the >= 4 dimensions case:

def plotMNIST_Nd(test_class_names, test_preds):

  cmap = np.array([
      [128, 0, 0],
      [0, 128, 0],
      [0, 0, 128],
      [0, 128, 128],
      [128, 0, 128],
      [255, 0, 0],
      [0, 255, 0],
      [0, 0, 255],
      [0, 255, 255],
      [255, 255, 0]
  ]) / 255.0
  j = 0
  name = ""
  
  color_space = []
  for i in range(len(test_class_names)):
    if(name != test_class_names[i]):
      name = test_class_names[i]
      if(j >= len(unique)):
        j = 0
      j = j + 1
      #print(j-1, name)
    color_space.append(cmap[j-1])

  tsne = TSNE(n_components=2, verbose=1, perplexity=40, n_iter=300)
  tsne_results = tsne.fit_transform(test_preds)
    
  f = plt.figure(figsize=(10, 10))
  ax = plt.subplot(aspect='equal')
  sc = ax.scatter(tsne_results[:,0], tsne_results[:,1], lw=0, s=40, c=color_space)
  plt.xlim(-25, 25)
  plt.ylim(-25, 25)
  ax.axis('off')
  ax.axis('tight')
        
  plt.show()

Displaying results for 3 dimensional case:

def plotMNIST3d(test_class_names, test_preds):

  h = 0.02

  cmap = np.array([
      [128, 0, 0],
      [0, 128, 0],
      [0, 0, 128],
      [0, 128, 128],
      [128, 0, 128],
      [255, 0, 0],
      [0, 255, 0],
      [0, 0, 255],
      [0, 255, 255],
      [255, 255, 0]
  ]) / 255.0
  j = 0
  name = ""
  
  #print("len(test_class_names): ", len(test_class_names))
  
  color_space = []
  for i in range(len(test_class_names)):
    if(name != test_class_names[i]):
      name = test_class_names[i]
      if(j >= len(unique)):
        j = 0
      j = j + 1
      #print(j-1, name)
    color_space.append(cmap[j-1])


  if(EMBEDDING_DIM != 3):
    tsne = TSNE(n_components=3, verbose=1, perplexity=40, n_iter=300)
    tsne_results = tsne.fit_transform(test_preds)
    
    test_preds = tsne_results
    
  fig = plt.figure(figsize=(16, 8))
  
  alpha = 30 
  beta = 60
  
  ax = Axes3D(fig)
  ax.view_init(alpha, beta)
  
  ax.scatter(test_preds[:,0], test_preds[:,1], test_preds[:,2], c= color_space)
  plt.show()

Wrapper for training and testing

Here is a wrapper function that creates and trains the network, calculates results for the test data set, and displays the resulting clusters:

def test(nMultiplier = 1, EPOCHS=20, nL2 = 0.8, optimizer=Adam(lr=0.00004), nTripletLoss = 1, nBaseModel = 1, bCumulativeLearning = False):
  global embedding_model
  global triplet_model
  
  embedding_model, triplet_model = trainNetwork(EPOCHS, nL2, nMultiplier, optimizer, nTripletLoss, nBaseModel, bCumulativeLearning)
  print("loading best model")
  triplet_model = loadBestModel(triplet_model)
  
  train_preds, train_file_names, train_class_names = getAllTrainImages()
  test_preds, test_file_names, test_class_names = getAllTestImages()
  
  # print("test_preds[0], test_file_names[0], test_class_names[0]: ", test_preds[0], test_file_names[0], test_class_names[0])
  
  arrNeighbours = getBestNeighboursMNIST(train_preds, train_file_names, train_class_names, test_preds, test_file_names, test_class_names)
  
  print(">>> Accuracy on test set:", arrNeighbours[0][1], "<<<")
  
  if(EMBEDDING_DIM == 3):
    plotMNIST_Nd(test_class_names, test_preds)
    plotMNIST3d(test_class_names, test_preds)
  elif(EMBEDDING_DIM == 2):
    plotMNIST2d(test_class_names, test_preds, bShowClasterLabels=True)
  else:
    plotMNIST_Nd(test_class_names, test_preds)

Experiments

We are going to run a few preliminary tests to figure out which optimizer to use. Most of the cases below are commented out: I do not have enough computing power to cover them all. Besides, as we will see later, the "no free lunch" rule seems to apply here, so no matter how many tests we do, there is a limit we cannot exceed.

The code below uses an array of parameters; if necessary, we can cycle over it, performing a large number of tests.

First, let's see how different optimizers perform for our task:

# Breaking the example (above) to chunks

arrOptimizers = [
  [
    "adam", 
    [
      Adam(0.00002, beta_1=0.9, beta_2=0.999),
#      Adam(0.00004, beta_1=0.9, beta_2=0.999),
#      Adam(0.00008, beta_1=0.9, beta_2=0.999),
#      Adam(0.00016, beta_1=0.9, beta_2=0.999),
#      Adam(0.00032, beta_1=0.9, beta_2=0.999),
#      Adam(0.00064, beta_1=0.9, beta_2=0.999),
#      Adam(0.00128, beta_1=0.9, beta_2=0.999),
#      Adam(0.00256, beta_1=0.9, beta_2=0.999)	  
    ]
  ],
  [
    "sgd",
    [ 
      SGD(lr=0.001, momentum=0.0, decay=0.0, nesterov=False)
    ]
  ],
  [
    "rms_prop",
    [
      RMSprop(lr=0.00004, rho=0.9, epsilon=None, decay=0.0)
    ]
  ],
  [
    "Adagrad",
    [
      Adagrad(lr=0.001, epsilon=None, decay=0.0)
    ]
  ],
  [
    "Adadelta",
    [
      Adadelta(lr=0.1, rho=0.95, epsilon=None, decay=0.0)
    ]
  ],
  [
    "Adamax",
    [
      Adamax(lr=0.0001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0)
    ]
  ],
  [
    "Nadam",
    [
      Nadam(lr=0.0001, beta_1=0.9, beta_2=0.999, epsilon=None, schedule_decay=0.004)
    ]
  ]  
]

nMultiplier = 1
nL2 = 0.6
alpha = 0.2

EMBEDDING_DIM = 3

for EPOCHS in [2, 4, 6, 8, 12, 16, 24, 32, 64, 80, 120]:
  for opt in arrOptimizers:
    print("Optimizer: ", opt[0])
    for optimizer in opt[1]:
      print("EMBEDDING_DIM: ", EMBEDDING_DIM, "; EPOCHS: ", EPOCHS)
      np.random.seed(7)
      if(EPOCHS == 2):
        test(nMultiplier, EPOCHS, nL2, optimizer=optimizer, nTripletLoss = 1, nBaseModel = 1, bCumulativeLearning=False)
      else:
        test(nMultiplier, EPOCHS, nL2, optimizer=optimizer, nTripletLoss = 1, nBaseModel = 1, bCumulativeLearning=True)
          
EMBEDDING_DIM = 3      

Here is the summary table for the optimizers. Columns are the successive EPOCHS values from the loop above (training is cumulative, as in the RMSprop breakdown below); the SGD and RMSprop rows have one value fewer in the source data:

           2     4     6     8     12    16    24    32    64    80    120   160
Adam       0.13  0.21  0.62  0.71  0.88  0.97  0.96  0.97  0.97  0.97  0.97  0.97
SGD        0.13  0.13  0.15  0.18  0.20  0.21  0.3   0.39  0.72  0.9   0.95
RMSprop    0.14  0.25  0.65  0.67  0.9   0.93  0.96  0.96  0.96  0.97  0.96
Adagrad    0.17  0.23  0.35  0.38  0.39  0.41  0.53  0.63  0.72  0.26  0.79  0.55
Adadelta   0.15  0.2   0.23  0.31  0.30  0.5   0.47  0.85  0.9   0.87  0.93  0.83
Adamax     0.15  0.32  0.46  0.91  0.97  0.97  0.98  0.97  0.97  0.95  0.97  0.98
Nadam      0.17  0.34  0.58  0.82  0.55  0.86  0.95  0.9   0.77  0.93  0.92  0.92

Let's take a closer look at the progress RMSprop makes:

Epoch 2. Accuracy on test set: 0.14098101265822785

Epoch 2+4. Accuracy on test set: 0.2511867088607595

Epoch 2+4+6. Accuracy on test set: 0.6492879746835443

Epoch 2+4+6+8. Accuracy on test set: 0.6677215189873418

Epoch 2+4+6+8+12. Accuracy on test set: 0.9036392405063292

Epoch 2+4+6+8+12+16. Accuracy on test set: 0.9321202531645569

Epoch 2+4+6+8+12+16+24. Accuracy on test set: 0.9580696202531646

Epoch 2+4+6+8+12+16+24+32. Accuracy on test set: 0.959256329113924

Epoch 2+4+6+8+12+16+24+32+64. Accuracy on test set: 0.9643196202531645

Epoch 2+4+6+8+12+16+24+32+64+80. Accuracy on test set: 0.9522151898734177

Epoch 2+4+6+8+12+16+24+32+64+80+120. Accuracy on test set: 0.967246835443038

Epoch 2+4+6+8+12+16+24+32+64+80+120+160. Accuracy on test set: 0.9640822784810127

Now, let's say we don't want the Lambda layer. Lambda, by the way, is used to "push" the points of the resulting clusters onto the surface of a hypersphere. Without it, we will have clusters spread in a volume, not on a sphere:
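As a side note, here is, in plain NumPy, what that Lambda(K.l2_normalize) layer does to a single embedding vector (a small illustration only):

v = np.array([1.0, 2.0, 2.0])
v_on_sphere = v / np.linalg.norm(v)    # per-vector equivalent of K.l2_normalize(x, axis=-1)
print(v_on_sphere)                     # [0.333... 0.667... 0.667...]
print(np.linalg.norm(v_on_sphere))     # 1.0 - the point now lies on the unit sphere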

# No Lambda, sigmoid
# getBaseModel_01: many layers + Lambda
# getBaseModel_02: many layers, no Lambda
# triplet_loss_01: standard, triplet_loss_03: pos/neg

arrOptimizers = [
  [
    "Adamax",
    [
      Adamax(lr=0.0001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0)
    ]
  ]
]

nMultiplier = 1
nL2 = 0.6
alpha = 0.2

EMBEDDING_DIM = 3

for EPOCHS in [2, 4, 6, 8, 12, 16, 24, 32, 64, 80, 120, 160]:
  for opt in arrOptimizers:
    print("Optimizer: ", opt[0])
    for optimizer in opt[1]:
      print("EMBEDDING_DIM: ", EMBEDDING_DIM, "; EPOCHS: ", EPOCHS)
      np.random.seed(7)
      if(EPOCHS == 2):
        test(nMultiplier, EPOCHS, nL2, optimizer=optimizer, nTripletLoss = 1, nBaseModel = 2, bCumulativeLearning=False)
      else:
        test(nMultiplier, EPOCHS, nL2, optimizer=optimizer, nTripletLoss = 1, nBaseModel = 2, bCumulativeLearning=True)
          
EMBEDDING_DIM = 3      

Accuracy on test set: 0.9589398734177215

Or, using TSNE:

As you can see, without lambda we have slightly worse performance, but just slightly.

Our next experiment is with normalization. Is sample-wise normalization of the input images going to improve the result?

	
# Playing with normalization, sample wise

datagen = ImageDataGenerator(
    samplewise_center=True,
    rotation_range=0,
    width_shift_range=0.0,
    height_shift_range=0.0,
    zoom_range=0.0
)

datagen = ImageDataGenerator()

gen_train = gen(True)
gen_valid = gen(False)

arrOptimizers = [
  [
    "Adamax",
    [
      Adamax(lr=0.0001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0)
    ]
  ]
]

nMultiplier = 1
nL2 = 0.6
alpha = 0.2

EMBEDDING_DIM = 3

for EPOCHS in [2, 4, 6, 8, 12, 16, 24, 32, 64, 80, 120, 160]:
  for opt in arrOptimizers:
    print("Optimizer: ", opt[0])
    for optimizer in opt[1]:
      print("EMBEDDING_DIM: ", EMBEDDING_DIM, "; EPOCHS: ", EPOCHS)
      np.random.seed(7)
      if(EPOCHS == 2):
        test(nMultiplier, EPOCHS, nL2, optimizer=optimizer, nTripletLoss = 1, nBaseModel = 1, bCumulativeLearning=False)
      else:
        test(nMultiplier, EPOCHS, nL2, optimizer=optimizer, nTripletLoss = 1, nBaseModel = 1, bCumulativeLearning=True)
          
EMBEDDING_DIM = 3

Accuracy on test set: 0.9830696202531646 (above)

As we see, accuracy is a bit better.

Playing with Alpha

The next set of experiments targets alpha. In the triplet loss, it is the margin we want between clusters. In theory, if a point is not separated well with one alpha, it is not going to be separated better with another. Nevertheless, let's take a look. Also, it is a good idea to know (visually) what alpha does to the result of our calculations. Note that the code snippets below are identical, except for the value of alpha:
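To see numerically what alpha does, here is a tiny illustration with a fixed pair of squared distances (the numbers are made up for the example):

pos_dist, neg_dist = 0.30, 0.45
for a in [0.0, 0.05, 0.1, 0.2, 0.4]:
    print(a, max(pos_dist - neg_dist + a, 0.0))
# 0.0, 0.05, 0.1 -> 0.0    (the triplet already satisfies the margin, no contribution to the loss)
# 0.2 -> 0.05, 0.4 -> 0.25 (a larger margin keeps "pushing" such triplets apart)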

	

datagen = ImageDataGenerator(
    samplewise_center=True,
    rotation_range=0,
    width_shift_range=0.0,
    height_shift_range=0.0,
    zoom_range=0.0
)

datagen = ImageDataGenerator()

gen_train = gen(True)
gen_valid = gen(False)

arrOptimizers = [
  [
    "Adamax",
    [
      Adamax(lr=0.0001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0)
    ]
  ]
]

nMultiplier = 1
nL2 = 0.6
alpha = 0.0

EMBEDDING_DIM = 3

for EPOCHS in [2, 4, 6, 8, 12, 16, 24, 32, 64, 80, 120, 160]:
  for opt in arrOptimizers:
    print("Optimizer: ", opt[0])
    for optimizer in opt[1]:
      print("EMBEDDING_DIM: ", EMBEDDING_DIM, "; EPOCHS: ", EPOCHS)
      np.random.seed(7)
      if(EPOCHS == 2):
        test(nMultiplier, EPOCHS, nL2, optimizer=optimizer, nTripletLoss = 1, nBaseModel = 1, bCumulativeLearning=False)
      else:
        test(nMultiplier, EPOCHS, nL2, optimizer=optimizer, nTripletLoss = 1, nBaseModel = 1, bCumulativeLearning=True)
          
EMBEDDING_DIM = 3

Accuracy on test set: 0.9800632911392405

	
datagen = ImageDataGenerator(
    samplewise_center=True,
    rotation_range=0,
    width_shift_range=0.0,
    height_shift_range=0.0,
    zoom_range=0.0
)

datagen = ImageDataGenerator()

gen_train = gen(True)
gen_valid = gen(False)

arrOptimizers = [
  [
    "Adamax",
    [
      Adamax(lr=0.0001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0)
    ]
  ]
]

nMultiplier = 1
nL2 = 0.6
alpha = 0.05

EMBEDDING_DIM = 3

for EPOCHS in [2, 4, 6, 8, 12, 16, 24, 32, 64, 80, 120, 160]:
  for opt in arrOptimizers:
    print("Optimizer: ", opt[0])
    for optimizer in opt[1]:
      print("EMBEDDING_DIM: ", EMBEDDING_DIM, "; EPOCHS: ", EPOCHS)
      np.random.seed(7)
      if(EPOCHS == 2):
        test(nMultiplier, EPOCHS, nL2, optimizer=optimizer, nTripletLoss = 1, nBaseModel = 1, bCumulativeLearning=False)
      else:
        test(nMultiplier, EPOCHS, nL2, optimizer=optimizer, nTripletLoss = 1, nBaseModel = 1, bCumulativeLearning=True)
          
EMBEDDING_DIM = 3

Accuracy on test set: 0.9669303797468355

Note that accuracy is better with alpha = 0 than with alpha = 0.05. The difference is large enough that it is probably not accidental; however, there are considerations to keep in mind. First, this is the result of a "best" network, and since initialization is random, we might get better results if we run the same alpha a few times. Second, this result does not mean that the larger the alpha, the worse the result. Let's continue:
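A minimal sketch of how one could average out the effect of random initialization, reusing the hyperparameters and the test() helper from the snippets above (the specific seeds are arbitrary, and test() is assumed to print the test-set accuracy as in the runs above):

# Hypothetical sketch: repeat the same alpha with several seeds and compare the spread.
alpha = 0.05
for seed in [7, 17, 42]:   # arbitrary seeds
  np.random.seed(seed)
  test(nMultiplier, EPOCHS, nL2, optimizer=optimizer,
       nTripletLoss = 1, nBaseModel = 1, bCumulativeLearning=False)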

	

datagen = ImageDataGenerator(
    samplewise_center=True,
    rotation_range=0,
    width_shift_range=0.0,
    height_shift_range=0.0,
    zoom_range=0.0
)

datagen = ImageDataGenerator()

gen_train = gen(True)
gen_valid = gen(False)

arrOptimizers = [
  [
    "Adamax",
    [
      Adamax(lr=0.0001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0)
    ]
  ]
]

nMultiplier = 1
nL2 = 0.6
alpha = 0.1

EMBEDDING_DIM = 3

for EPOCHS in [2, 4, 6, 8, 12, 16, 24, 32, 64, 80, 120, 160]:
  for opt in arrOptimizers:
    print("Optimizer: ", opt[0])
    for optimizer in opt[1]:
      print("EMBEDDING_DIM: ", EMBEDDING_DIM, "; EPOCHS: ", EPOCHS)
      np.random.seed(7)
      if(EPOCHS == 2):
        test(nMultiplier, EPOCHS, nL2, optimizer=optimizer, nTripletLoss = 1, nBaseModel = 1, bCumulativeLearning=False)
      else:
        test(nMultiplier, EPOCHS, nL2, optimizer=optimizer, nTripletLoss = 1, nBaseModel = 1, bCumulativeLearning=True)
          
EMBEDDING_DIM = 3

No data

	
datagen = ImageDataGenerator(
    samplewise_center=True,
    rotation_range=0,
    width_shift_range=0.0,
    height_shift_range=0.0,
    zoom_range=0.0
)

datagen = ImageDataGenerator()

gen_train = gen(True)
gen_valid = gen(False)

arrOptimizers = [
  [
    "Adamax",
    [
      Adamax(lr=0.0001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0)
    ]
  ]
]

nMultiplier = 1
nL2 = 0.6
alpha = 0.4

EMBEDDING_DIM = 3

for EPOCHS in [2, 4, 6, 8, 12, 16, 24, 32, 64, 80, 120, 160]:
  for opt in arrOptimizers:
    print("Optimizer: ", opt[0])
    for optimizer in opt[1]:
      print("EMBEDDING_DIM: ", EMBEDDING_DIM, "; EPOCHS: ", EPOCHS)
      np.random.seed(7)
      if(EPOCHS == 2):
        test(nMultiplier, EPOCHS, nL2, optimizer=optimizer, nTripletLoss = 1, nBaseModel = 1, bCumulativeLearning=False)
      else:
        test(nMultiplier, EPOCHS, nL2, optimizer=optimizer, nTripletLoss = 1, nBaseModel = 1, bCumulativeLearning=True)
          
EMBEDDING_DIM = 3

Accuracy on test set: 0.9806170886075949

	
	
datagen = ImageDataGenerator(
    samplewise_center=True,
    rotation_range=0,
    width_shift_range=0.0,
    height_shift_range=0.0,
    zoom_range=0.0
)

datagen = ImageDataGenerator()

gen_train = gen(True)
gen_valid = gen(False)

arrOptimizers = [
  [
    "Adamax",
    [
      Adamax(lr=0.0001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0)
    ]
  ]
]

nMultiplier = 1
nL2 = 0.6
alpha = 0.6

EMBEDDING_DIM = 3

for EPOCHS in [2, 4, 6, 8, 12, 16, 24, 32, 64, 80, 120, 160]:
  for opt in arrOptimizers:
    print("Optimizer: ", opt[0])
    for optimizer in opt[1]:
      print("EMBEDDING_DIM: ", EMBEDDING_DIM, "; EPOCHS: ", EPOCHS)
      np.random.seed(7)
      if(EPOCHS == 2):
        test(nMultiplier, EPOCHS, nL2, optimizer=optimizer, nTripletLoss = 1, nBaseModel = 1, bCumulativeLearning=False)
      else:
        test(nMultiplier, EPOCHS, nL2, optimizer=optimizer, nTripletLoss = 1, nBaseModel = 1, bCumulativeLearning=True)
          
EMBEDDING_DIM = 3

No data

	
datagen = ImageDataGenerator(
    samplewise_center=True,
    rotation_range=0,
    width_shift_range=0.0,
    height_shift_range=0.0,
    zoom_range=0.0
)

datagen = ImageDataGenerator()

gen_train = gen(True)
gen_valid = gen(False)

arrOptimizers = [
  [
    "Adamax",
    [
      Adamax(lr=0.0001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0)
    ]
  ]
]

nMultiplier = 1
nL2 = 0.6
alpha = 0.8

EMBEDDING_DIM = 3

for EPOCHS in [2, 4, 6, 8, 12, 16, 24, 32, 64, 80, 120, 160]:
  for opt in arrOptimizers:
    print("Optimizer: ", opt[0])
    for optimizer in opt[1]:
      print("EMBEDDING_DIM: ", EMBEDDING_DIM, "; EPOCHS: ", EPOCHS)
      np.random.seed(7)
      if(EPOCHS == 2):
        test(nMultiplier, EPOCHS, nL2, optimizer=optimizer, nTripletLoss = 1, nBaseModel = 1, bCumulativeLearning=False)
      else:
        test(nMultiplier, EPOCHS, nL2, optimizer=optimizer, nTripletLoss = 1, nBaseModel = 1, bCumulativeLearning=True)
          
EMBEDDING_DIM = 3

No data

	
# No mining, trying different alphas
# Quoted from: https://www.research-collection.ethz.ch/bitstream/handle/20.500.11850/254341/3673.pdf?sequence=1&isAllowed=y
# "Most works in the literature normalize their descriptors to unit length,
#  leading to a maximum L2-distance between two descriptors of dmax = 2.
#  Besides, simply using a sigmoid or hyperbolic tangent activation at the
#  output layer is quite common too, leading to much larger distances between
#  descriptors (for example dmax = 32 with the tanh descriptors of size 256
#  in PN-Net [2]). If we use the same loss for different dmax, we can expect
#  very different behavior during training. If our loss is Lsub with alpha = 5,
#  the triplet condition dp + alpha < dn is never going to be satisfied in the
#  case of dmax = 2, simply because the descriptors cannot be five units of
#  distance apart. Conversely, alpha = 5 seems reasonable when dmax = 32."

def triplet_loss_01(y_true, y_pred):
  global alpha

  # y_pred is the concatenation of the three embeddings: [anchor | positive | negative]
  total_length = y_pred.shape.as_list()[-1]

  anchor = y_pred[:, 0:int(total_length * 1/3)]
  positive = y_pred[:, int(total_length * 1/3):int(total_length * 2/3)]
  negative = y_pred[:, int(total_length * 2/3):int(total_length * 3/3)]

  # Squared Euclidean distances anchor-positive and anchor-negative
  pos_dist = K.sum(K.square(anchor - positive), axis=1)
  neg_dist = K.sum(K.square(anchor - negative), axis=1)

  # Hinge: penalize only when the positive is not at least alpha closer than the negative
  basic_loss = pos_dist - neg_dist + alpha
  loss = K.maximum(basic_loss, 0.0)

  return loss

arrOptimizers = [
  [
    "Adamax",
    [
      Adamax(lr=0.0001, beta_1=0.9, beta_2=0.999)
    ]
  ]
]

nL2 = 0.6

EMBEDDING_DIM = 3
# alpha == 0 produces 0.1 at best - collapse
# alpha == 0.1: 0.97
for alpha in [0.4, 0.8, 1.6, 3.2, 6.4, 12.8]:
  print("============== ALPHA: ", alpha, " ==============")
  for EPOCHS in [2, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10]:
    for opt in arrOptimizers:
      print("Optimizer: ", opt[0])
      for optimizer in opt[1]:
        for nMultiplier in [1]: #, 2, 4, 8]:
          print("nMultiplier: ", nMultiplier)
          np.random.seed(7)

          if(EPOCHS == 2):
            test(nMultiplier, int(EPOCHS / nMultiplier), nL2, optimizer=optimizer, nTripletLoss = 1, nBaseModel = 1, bCumulativeLearning=False)
          else:
            test(nMultiplier, int(EPOCHS / nMultiplier), nL2, optimizer=optimizer, nTripletLoss = 1, nBaseModel = 1, bCumulativeLearning=True)
============== ALPHA: 0.4 ============== >>> Accuracy on test set: 0.982753164556962 <<<
============== ALPHA: 0.8 ============== >>> Accuracy on test set: 0.9882911392405064 <<<
============== ALPHA: 1.6 ============== >>> Accuracy on test set: 0.8254746835443038 <<<
============== ALPHA: 3.2 ============== >>> Accuracy on test set: 0.6303797468354431 <<<

In most articles, two fully connected layers are used at the "output" side of the network, while we used just one. Will adding an extra layer improve the accuracy?
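For context, the difference between the base models here boils down to the embedding head. A minimal, hypothetical sketch of a one-layer vs. two-layer head (the real createModel()/nBaseModel code was defined earlier and may differ in details; the hidden width of 128 is an arbitrary choice for illustration):

from keras.layers import Dense

# One fully connected layer straight to the embedding (roughly what we used so far)
def head_one_layer(x):
  return Dense(EMBEDDING_DIM, activation=None, name="embedding")(x)

# Two fully connected layers, as commonly seen in the literature
def head_two_layers(x):
  x = Dense(128, activation="relu", name="fc_pre_embedding")(x)  # hypothetical width
  return Dense(EMBEDDING_DIM, activation=None, name="embedding")(x)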

	
datagen = ImageDataGenerator(
    samplewise_center=True,
    rotation_range=0,
    width_shift_range=0.0,
    height_shift_range=0.0,
    zoom_range=0.0
)

datagen = ImageDataGenerator()

gen_train = gen(True)
gen_valid = gen(False)

arrOptimizers = [
  [
    "Adamax",
    [
      Adamax(lr=0.0001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0)
    ]
  ]
]

nMultiplier = 1
nL2 = 0.6
alpha = 0.05
EMBEDDING_DIM = 3

for EPOCHS in [2, 4, 6, 8, 12, 16, 24, 32, 64, 80, 120, 160]:
  for opt in arrOptimizers:
    print("Optimizer: ", opt[0])
    for optimizer in opt[1]:
      print("EMBEDDING_DIM: ", EMBEDDING_DIM, "; EPOCHS: ", EPOCHS)
      np.random.seed(7)
      if(EPOCHS == 2):
        test(nMultiplier, EPOCHS, nL2, optimizer=optimizer, nTripletLoss = 1, nBaseModel = 3, bCumulativeLearning=False)
      else:
        test(nMultiplier, EPOCHS, nL2, optimizer=optimizer, nTripletLoss = 1, nBaseModel = 3, bCumulativeLearning=True)
          
EMBEDDING_DIM = 3

Accuracy on test set: 0.9757911392405063

It seems that an extra fully connected layer actually decreases the performance.

Now let's take a closer look at the triplet loss function we use.
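Before the results, here are two common variants of the margin loss for reference. These are illustrative sketches only and are not necessarily the exact definitions of triplet_loss_02 / triplet_loss_03 used by createModel():

from keras import backend as K

# Variant A: softplus instead of a hard hinge ("soft margin"), no explicit alpha
def triplet_loss_softplus_sketch(y_true, y_pred):
  total_length = y_pred.shape.as_list()[-1]
  anchor = y_pred[:, 0:total_length // 3]
  positive = y_pred[:, total_length // 3:2 * total_length // 3]
  negative = y_pred[:, 2 * total_length // 3:]
  pos_dist = K.sum(K.square(anchor - positive), axis=1)
  neg_dist = K.sum(K.square(anchor - negative), axis=1)
  return K.log(1.0 + K.exp(pos_dist - neg_dist))

# Variant B: the same hinge as triplet_loss_01, but averaged over the batch
def triplet_loss_mean_hinge_sketch(y_true, y_pred):
  total_length = y_pred.shape.as_list()[-1]
  anchor = y_pred[:, 0:total_length // 3]
  positive = y_pred[:, total_length // 3:2 * total_length // 3]
  negative = y_pred[:, 2 * total_length // 3:]
  pos_dist = K.sum(K.square(anchor - positive), axis=1)
  neg_dist = K.sum(K.square(anchor - negative), axis=1)
  return K.mean(K.maximum(pos_dist - neg_dist + alpha, 0.0))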

	
# triplet_loss = 2
datagen = ImageDataGenerator(
    samplewise_center=True,
    rotation_range=0,
    width_shift_range=0.0,
    height_shift_range=0.0,
    zoom_range=0.0
)

datagen = ImageDataGenerator()

gen_train = gen(True)
gen_valid = gen(False)

arrOptimizers = [
  [
    "Adamax",
    [
      Adamax(lr=0.0001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0)
    ]
  ]
]

nMultiplier = 1
nL2 = 0.6
alpha = 0.05
EMBEDDING_DIM = 3

for EPOCHS in [2, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20]:
  for opt in arrOptimizers:
    print("Optimizer: ", opt[0])
    for optimizer in opt[1]:
      print("EMBEDDING_DIM: ", EMBEDDING_DIM, "; EPOCHS: ", EPOCHS)
      np.random.seed(7)
      if(EPOCHS == 2):
        test(nMultiplier, EPOCHS, nL2, optimizer=optimizer, nTripletLoss = 2, nBaseModel = 1, bCumulativeLearning=False)
      else:
        test(nMultiplier, EPOCHS, nL2, optimizer=optimizer, nTripletLoss = 2, nBaseModel = 1, bCumulativeLearning=True)
          
EMBEDDING_DIM = 3

Accuracy on test set: 0.9859177215189874

So the triplet_loss_02 function performs better.

	
# triplet_loss_03

datagen = ImageDataGenerator(
    samplewise_center=True,
    rotation_range=0,
    width_shift_range=0.0,
    height_shift_range=0.0,
    zoom_range=0.0
)

datagen = ImageDataGenerator()

gen_train = gen(True)
gen_valid = gen(False)

arrOptimizers = [
  [
    "Adamax",
    [
      Adamax(lr=0.0001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0)
    ]
  ]
]

nMultiplier = 1
nL2 = 0.6
alpha = 0.05
EMBEDDING_DIM = 3

for EPOCHS in [2, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20]:
  for opt in arrOptimizers:
    print("Optimizer: ", opt[0])
    for optimizer in opt[1]:
      print("EMBEDDING_DIM: ", EMBEDDING_DIM, "; EPOCHS: ", EPOCHS)
      np.random.seed(7)
      if(EPOCHS == 2):
        test(nMultiplier, EPOCHS, nL2, optimizer=optimizer, nTripletLoss = 3, nBaseModel = 1, bCumulativeLearning=False)
      else:
        test(nMultiplier, EPOCHS, nL2, optimizer=optimizer, nTripletLoss = 3, nBaseModel = 1, bCumulativeLearning=True)
          
EMBEDDING_DIM = 3

Accuracy on test set: 0.9901898734177215

The triplet_loss_03 function is the winner!

	
# No mining, trying different losses
arrOptimizers = [
  [
    "Adamax",
    [
      Adamax(lr=0.0001, beta_1=0.9, beta_2=0.999)
    ]
  ]
]

nL2 = 0.6

EMBEDDING_DIM = 3

for EPOCHS in [2, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10]:
  for opt in arrOptimizers:
    print("Optimizer: ", opt[0])
    for optimizer in opt[1]:
      for nMultiplier in [1]: #, 2, 4, 8]:
        print("nMultiplier: ", nMultiplier)
        np.random.seed(7)

        if(EPOCHS == 2):
          test(nMultiplier, int(EPOCHS / nMultiplier), nL2, optimizer=optimizer, nTripletLoss = 6, nBaseModel = 1, bCumulativeLearning=False)
        else:
          test(nMultiplier, int(EPOCHS / nMultiplier), nL2, optimizer=optimizer, nTripletLoss = 6, nBaseModel = 1, bCumulativeLearning=True)
>>> Accuracy on test set: 0.9640822784810127 <<<
	
# No mining, trying different losses
arrOptimizers = [
  [
    "Adamax",
    [
      Adamax(lr=0.0001, beta_1=0.9, beta_2=0.999)
    ]
  ]
]

nL2 = 0.6

EMBEDDING_DIM = 3

for EPOCHS in [2, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10]:
  for opt in arrOptimizers:
    print("Optimizer: ", opt[0])
    for optimizer in opt[1]:
      for nMultiplier in [1]: #, 2, 4, 8]:
        print("nMultiplier: ", nMultiplier)
        np.random.seed(7)

        if(EPOCHS == 2):
          test(nMultiplier, int(EPOCHS / nMultiplier), nL2, optimizer=optimizer, nTripletLoss = 7, nBaseModel = 1, bCumulativeLearning=False)
        else:
          test(nMultiplier, int(EPOCHS / nMultiplier), nL2, optimizer=optimizer, nTripletLoss = 7, nBaseModel = 1, bCumulativeLearning=True)
>>> Accuracy on test set: 0.9761075949367088 <<<


Mining

In most triplet-loss-related articles we either see "mining" used, or at least the authors apologize for not using it. Mining is a technique of training the network using only triplets that have a good chance of improving it. There are two reasons for mining. First, training the network on triplets that are already well clustered is a waste of time, diluting the impact of the (rare) underperforming triplets. Second, when we build triplets, we essentially perform an exercise in combinatorics: the number of ALL combinations of ALL (hard and easy) triplets is huge for any task that uses a large dataset (MNIST is not one of them). Just think about training the net on 10 million faces... in combinations with other faces...
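A rough back-of-the-envelope count of how many (anchor, positive, negative) combinations exist, just to illustrate the scale (numbers assume a balanced dataset of roughly MNIST's size):

# Approximate number of possible triplets for a balanced dataset:
# anchor: any image; positive: any other image of the same class;
# negative: any image of a different class.
n_classes = 10
n_per_class = 6000                      # roughly the MNIST training set
n_total = n_classes * n_per_class

n_triplets = n_total * (n_per_class - 1) * (n_total - n_per_class)
print(n_triplets)                       # ~1.9e13 triplets -- far too many to enumerate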

There are different ways of picking triplets; I am going to use the most straightforward one: apply the network during training and check whether the positive distance is larger than the negative one. This is not the fastest approach, but it is definitely the easiest to code.
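Concretely, the mining code below classifies a triplet by the sign of d(A,P) − d(A,N). The small helper below just restates the conditions used in get_batch_hard(); it is not a separate part of the pipeline:

def classify_triplet(d_ap, d_an, alpha):
  """Classify a triplet by its distances, mirroring the mining conditions below."""
  if d_ap - d_an > 0:
    return "hard"        # positive is farther than negative: loss is clearly non-zero
  elif d_ap - d_an + alpha > 0:
    return "semi-hard"   # correct ordering, but within the alpha margin
  else:
    return "easy"        # already separated by more than alpha: contributes zero loss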

Points to keep in mind:

	
# New mining
# We select new hard / semihard triplets so that a balance is kept: 1/3 hard, 1/3 semi-hard and 1/3 easy triplets.
# We only need to do mining for training. Validation and testing are done on a random set of triplets.
# We go over the entire training dataset, randomly selecting from it, until the batch is full.

def get_batch_hard():
  global embedding_model
  global alpha
  
  # We are going to fill 3 arrays separately, then we merge them
  #arrTripletsHard = []
  nLenTripletsHard = 0
  
  #arrTripletsSemiHard = []
  nLenTripletsSemiHard = 0
  
  #arrTripletsEasy = []
  nLenTripletsEasy = 0
  
  arrResult = []
    
  # nTotalTried is a counter. We move over all the data, and if after a significant
  # number of trials we couldn't find enough hard triplets, we just use easy ones
  nTotalTried = 0  
  while len(arrResult) < BATCH_SIZE:
    arrImagesIdx, positiveClass, negativeClass = get_triplet(TRAINING_IMAGES_PER_CLASS, bIsTrain = True)
    
    # arrImagesIdx contains the indices of 3 images: anchor, positive, negative. 
    imgAnchor = np.reshape(positiveClass['Images'][arrImagesIdx[0]], (1, 28, 28, 1))
    imgPositive = np.reshape(positiveClass['Images'][arrImagesIdx[1]], (1, 28, 28, 1))
    imgNegative = np.reshape(negativeClass['Images'][arrImagesIdx[2]], (1, 28, 28, 1))
    
    # Note that instead of calling the full model to get the embeddings, we could read them
    # from the corresponding layer of the network being trained, so our way is not the fastest.
    # arrPredictions contains embeddings (coordinates in multidim. space)
    arrPredictionsAnchor = embedding_model.predict([imgAnchor])
    arrPredictionsPositive = embedding_model.predict([imgPositive])
    arrPredictionsNegative = embedding_model.predict([imgNegative])
    
    #Compute d(A,P)-d(A,N)
    dLoss = np.sum(np.square(arrPredictionsAnchor - arrPredictionsPositive), axis=1) - np.sum(np.square(arrPredictionsAnchor - arrPredictionsNegative), axis=1)
    
    nTotalTried += 1
    
    # Hard or too many trials
    if((dLoss > 0 or nTotalTried >= BATCH_SIZE) and nLenTripletsHard < int(BATCH_SIZE/3)):
        arrResult.append((arrImagesIdx, positiveClass, negativeClass))
        nLenTripletsHard += 1

    # Semi-Hard or too many trials
    elif(((dLoss < 0 and dLoss + alpha > 0) or nTotalTried >= BATCH_SIZE) and nLenTripletsSemiHard < int(BATCH_SIZE/3)):
        arrResult.append((arrImagesIdx, positiveClass, negativeClass))
        nLenTripletsSemiHard += 1

    # Easy or too many trials
    else:
      if(nLenTripletsHard + nLenTripletsSemiHard + nLenTripletsEasy < BATCH_SIZE):
        arrResult.append((arrImagesIdx, positiveClass, negativeClass))
        nLenTripletsEasy += 1
      else:
        break
  
  #arrResult = np.concatenate((arrTripletsHard, arrTripletsSemiHard, arrTripletsEasy))
  #print(len(arrResult))
  return arrResult

# ---

def gen(bIsTrain):
  while True:
    arrBaseExamples = []
    arrPositiveExamples = []
    arrNegativeExamples = []

    if(bIsTrain):
      arrTriplets = get_batch_hard()

    for i in range(BATCH_SIZE):
      if(bIsTrain):
        arrImageIdx, positiveClass, negativeClass = arrTriplets[i]
      else:  
        arrImageIdx, positiveClass, negativeClass = get_triplet(TRAINING_IMAGES_PER_CLASS, bIsTrain)

      baseExampleImg = loadImage(positiveClass, arrImageIdx[0], datagen)      
      positiveExampleImg = loadImage(positiveClass, arrImageIdx[1], datagen)
      negativeExampleImg = loadImage(negativeClass, arrImageIdx[2], datagen)

      arrBaseExamples.append(baseExampleImg)
      arrPositiveExamples.append(positiveExampleImg)
      arrNegativeExamples.append(negativeExampleImg)
      
    base = np.array(arrBaseExamples) / 255.
    positive = np.array(arrPositiveExamples) / 255.
    negative = np.array(arrNegativeExamples) / 255.
        
    label = None
      
    yield ([base, positive, negative], base)
	
# ---
	
gen_train = gen(True)
gen_valid = gen(False)

# ---

def getCallbacks(monitor, mode):
  #checkpoint = ModelCheckpoint(best_weights_filepath, monitor="val_my_accuracy", save_best_only=True, save_weights_only=True, mode='max', verbose=1)

  save_model_at_epoch_end_callback = LambdaCallback(
      on_epoch_end=lambda epoch, logs: triplet_model.save_weights(last_weights_filepath))  

  callbacks_list = [#checkpoint, 
      save_model_at_epoch_end_callback]  # , early]

  return callbacks_list	

# ---

# For triplets mining
def trainNetwork(EPOCHS, nL2, nMultiplier, optimizer, nTripletLoss = 1, nBaseModel = 1, bCumulativeLearning = False):
  global embedding_model
  global triplet_model
  
  random.seed(7)
  
  print("Starting training process!")
  print("-------------------------------------")

  nBestValidationAccuracy = 0
  
  gen_train = gen(True)
  gen_valid = gen(False)
  
    
  if(bCumulativeLearning == False):
    deleteSavedNet(best_weights_filepath)

  if(nTripletLoss == 1):
    embedding_model, triplet_model = createModel(nL2, triplet_loss_01, optimizer, nBaseModel = nBaseModel)
    print("Model created")
  elif(nTripletLoss == 2):
    embedding_model, triplet_model = createModel(nL2, triplet_loss_02, optimizer, nBaseModel = nBaseModel)
  elif(nTripletLoss == 3):
    embedding_model, triplet_model = createModel(nL2, triplet_loss_03, optimizer, nBaseModel = nBaseModel)
  elif(nTripletLoss == 4):
    embedding_model, triplet_model = createModel(nL2, triplet_loss_04, optimizer, nBaseModel = nBaseModel)
  elif(nTripletLoss == 5):
    embedding_model, triplet_model = createModel(nL2, triplet_loss_05, optimizer, nBaseModel = nBaseModel)
  elif(nTripletLoss == 6):
    embedding_model, triplet_model = createModel(nL2, triplet_loss_06, optimizer, nBaseModel = nBaseModel)

  callbacks_list = getCallbacks("val_my_accuracy", 'max')

  if(bCumulativeLearning == True):
    loadLastModel(triplet_model)

  arrTrainingLosses = []
  arrValidationLosses = []

  arrTrainingAccuracy = []
  arrValidationAccuracy = []

  for nEpoch in range(EPOCHS):
    print("Epoch {} of {}".format(nEpoch + 1, EPOCHS))

    dTrainingLossOnBatch = 0
    dTrainingAccuracyOnBatch = 0

    batch_train = next(gen_train)
    batch_valid = next(gen_valid)

    arrTrainX = batch_train[0]
    arrTrainY = batch_train[1]
    arrValidX = batch_valid[0]
    arrValidY = batch_valid[1]

    nStepsTrain = int(TRAINING_IMAGES_PER_CLASS / BATCH_SIZE)
    nStepsValid = int(VALIDATION_IMAGES_PER_CLASS / BATCH_SIZE)

    # Accumulate several mined batches into one big array before calling fit()
    for nImageNo in range(nStepsTrain):
      print("*", end="")
      batch_train = next(gen_train)
      tx1 = np.concatenate((arrTrainX[0], batch_train[0][0]))
      tx2 = np.concatenate((arrTrainX[1], batch_train[0][1]))
      tx3 = np.concatenate((arrTrainX[2], batch_train[0][2]))

      arrTrainX = [tx1, tx2, tx3]

      arrTrainY = np.concatenate((arrTrainY, batch_train[1]))

    for nImageNo in range(nStepsValid):
      print("#", end="")
      batch_valid = next(gen_valid)
      tv1 = np.concatenate((arrValidX[0], batch_valid[0][0]))
      tv2 = np.concatenate((arrValidX[1], batch_valid[0][1]))
      tv3 = np.concatenate((arrValidX[2], batch_valid[0][2]))

      arrValidX = [tv1, tv2, tv3]

      arrValidY = np.concatenate((arrValidY, batch_valid[1]))

    history = triplet_model.fit(x=arrTrainX, y=arrTrainY, batch_size=BATCH_SIZE, epochs=1, verbose=0, callbacks=callbacks_list, validation_data=(arrValidX, arrValidY))

    triplet_model.save_weights(last_weights_filepath)

    dTrainingLoss = history.history['loss'][0]
    arrTrainingLosses.append(dTrainingLoss)

    dTrainingAccuracy = history.history['my_accuracy'][0]
    arrTrainingAccuracy.append(dTrainingAccuracy)

    dValidLoss = history.history['val_loss'][0]
    arrValidationLosses.append(dValidLoss)

    dValidAccuracy = history.history['val_my_accuracy'][0]
    arrValidationAccuracy.append(dValidAccuracy)

    if(nEpoch == 0 or dValidAccuracy > nBestValidationAccuracy):
        nBestValidationAccuracy = dValidAccuracy
        triplet_model.save_weights(best_weights_filepath)

    #print("Training Loss: {:.4f}; Validation Loss: {:.4f}; Training Accuracy: {:.4f}; Validation Accuracy: {:.4f}; >>> nBestValidationAccuracy: {:.4f} <<<"
    #  .format(dTrainingLoss, dValidLoss, dTrainingAccuracy, dValidAccuracy, nBestValidationAccuracy))
    print("Training Loss: {}; Validation Loss: {}; Training Accuracy: {}; Validation Accuracy: {}; >>> nBestValidationAccuracy: {} <<<"
      .format(dTrainingLoss, dValidLoss, dTrainingAccuracy, dValidAccuracy, nBestValidationAccuracy))

  plotLoss(arrTrainingLosses, arrValidationLosses)
  plotLoss(arrTrainingAccuracy, arrValidationAccuracy)
    
  return embedding_model, triplet_model
  
	
# Using mining: done!
# Accuracy on test set: 0.9509493670886076
arrOptimizers = [
  [
    "rms_prop",
    [
      RMSprop(lr=0.00004, rho=0.9, epsilon=None, decay=0.0)
    ]
  ] 
]

nL2 = 0.6

for EPOCHS in [2, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10]:
  for opt in arrOptimizers:
    print("Optimizer: ", opt[0])
    for optimizer in opt[1]:
      for nMultiplier in [1]: #, 2, 4, 8]:
        print("nMultiplier: ", nMultiplier)
        np.random.seed(7)

        if(EPOCHS == 2):
          test(nMultiplier, int(EPOCHS / nMultiplier), nL2, optimizer=optimizer, nTripletLoss = 1, nBaseModel = 1, bCumulativeLearning=False)
        else:
          test(nMultiplier, int(EPOCHS / nMultiplier), nL2, optimizer=optimizer, nTripletLoss = 1, nBaseModel = 1, bCumulativeLearning=True)

	
# This time we make 10 times more attempts to find hard triplets

# We select new hard / semihard triplets so that a balance is kept: 1/3 hard, 1/3 semi-hard and 1/3 easy triplets.
# We only need to do mining for training. Validation and testing are done on a random set of triplets.
# We go over the entire training dataset, randomly selecting from it, until the batch is full.

def get_batch_hard():
  global embedding_model
  global alpha
  
  # We are going to fill 3 arrays separately, then we merge them
  #arrTripletsHard = []
  nLenTripletsHard = 0
  
  #arrTripletsSemiHard = []
  nLenTripletsSemiHard = 0
  
  #arrTripletsEasy = []
  nLenTripletsEasy = 0
  
  arrResult = []
    
  # nTotalTried is a counter. We move over all the data, and if after a significant
  # number of trials we couldn't find enough hard triplets, we just use easy ones
  nTotalTried = 0  
  while len(arrResult) < BATCH_SIZE:
    arrImagesIdx, positiveClass, negativeClass = get_triplet(TRAINING_IMAGES_PER_CLASS, bIsTrain = True)

    # arrImagesIdx contains the indices of 3 images: anchor, positive, negative. 
    imgAnchor = np.reshape(positiveClass['Images'][arrImagesIdx[0]], (1, 28, 28, 1))
    imgPositive = np.reshape(positiveClass['Images'][arrImagesIdx[1]], (1, 28, 28, 1))
    imgNegative = np.reshape(negativeClass['Images'][arrImagesIdx[2]], (1, 28, 28, 1))
    
    # Note that instead of calling the full model to get the embeddings, we could read them
    # from the corresponding layer of the network being trained, so our way is not the fastest.
    # arrPredictions contains embeddings (coordinates in multidim. space)
    arrPredictionsAnchor = embedding_model.predict([imgAnchor])
    arrPredictionsPositive = embedding_model.predict([imgPositive])
    arrPredictionsNegative = embedding_model.predict([imgNegative])
    
    #Compute d(A,P)-d(A,N)
    dLoss = np.sum(np.square(arrPredictionsAnchor - arrPredictionsPositive), axis=1) - np.sum(np.square(arrPredictionsAnchor - arrPredictionsNegative), axis=1)

    nTotalTried += 1

    # Hard or too many trials
    if((dLoss > 0 or nTotalTried >= 10 * BATCH_SIZE) and nLenTripletsHard < int(BATCH_SIZE/3)):
        arrResult.append((arrImagesIdx, positiveClass, negativeClass))
        nLenTripletsHard += 1

    # Semi-Hard or too many trials
    elif(((dLoss < 0 and dLoss + alpha > 0) or nTotalTried >= 10 * BATCH_SIZE) and nLenTripletsSemiHard < int(BATCH_SIZE/3)):
        arrResult.append((arrImagesIdx, positiveClass, negativeClass))
        nLenTripletsSemiHard += 1

    # Easy or too many trials
    else:
      if(nLenTripletsHard + nLenTripletsSemiHard + nLenTripletsEasy < BATCH_SIZE):
        arrResult.append((arrImagesIdx, positiveClass, negativeClass))
        nLenTripletsEasy += 1
      else:
        break
  
  #print("nLenTripletsHard:", nLenTripletsHard, "nLenTripletsSemiHard:", nLenTripletsSemiHard, "nLenTripletsEasy:", nLenTripletsEasy)

  #arrResult = np.concatenate((arrTripletsHard, arrTripletsSemiHard, arrTripletsEasy))
  #print(len(arrResult))
  return arrResult
	
# Using mining
# Adamax changed with Keras!!!

arrOptimizers = [
  [
    "Adamax",
    [
      Adamax(lr=0.0001, beta_1=0.9, beta_2=0.999)
    ]
  ]
]

nMultiplier = 1
nL2 = 0.6
alpha = 0.2

EMBEDDING_DIM = 3

for EPOCHS in [2, 4, 6, 8, 12, 16, 24, 32, 64, 80, 120]:
  for opt in arrOptimizers:
    print("Optimizer: ", opt[0])
    for optimizer in opt[1]:
      print("EMBEDDING_DIM: ", EMBEDDING_DIM, "; EPOCHS: ", EPOCHS)
      np.random.seed(7)
      if(EPOCHS == 2):
        test(nMultiplier, EPOCHS, nL2, optimizer=optimizer, nTripletLoss = 1, nBaseModel = 1, bCumulativeLearning=False)
      else:
        test(nMultiplier, EPOCHS, nL2, optimizer=optimizer, nTripletLoss = 1, nBaseModel = 1, bCumulativeLearning=True)
          
EMBEDDING_DIM = 3      
>>> Accuracy on test set: 0.946756329113924 <<<
	
# Using mining
arrOptimizers = [
  [
    "Adamax",
    [
      Adamax(lr=0.0001, beta_1=0.9, beta_2=0.999)
    ]
  ]
]

nL2 = 0.6

for EPOCHS in [2, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10]:
  for opt in arrOptimizers:
    print("Optimizer: ", opt[0])
    for optimizer in opt[1]:
      for nMultiplier in [1]: #, 2, 4, 8]:
        print("nMultiplier: ", nMultiplier)
        np.random.seed(7)

        if(EPOCHS == 2):
          test(nMultiplier, int(EPOCHS / nMultiplier), nL2, optimizer=optimizer, nTripletLoss = 1, nBaseModel = 1, bCumulativeLearning=False)
        else:
          test(nMultiplier, int(EPOCHS / nMultiplier), nL2, optimizer=optimizer, nTripletLoss = 1, nBaseModel = 1, bCumulativeLearning=True)
>>> Accuracy on test set: 0.9329113924050633 <<<

Until now we used EMBEDDING_DIM = 3, simply because a 3D chart is easier to display than a multidimensional one. However, this is definitely not a scientific approach. Besides, with more complex data, higher dimensions might provide a significant model improvement. After all, deep face recognition networks use 128 dimensions... Should we do the same?

	
EMBEDDING_DIM = 2

arrOptimizers = [
  [
    "rms_prop",
    [
      RMSprop(lr=0.00004, rho=0.9, epsilon=None, decay=0.0)
    ]
  ] 
]

alpha = 0.2

nL2 = 0.6

for EPOCHS in [2, 4, 6, 8, 12, 16, 20, 25, 30, 40]:
  for opt in arrOptimizers:
    print("Optimizer: ", opt[0])
    for optimizer in opt[1]:
      for nMultiplier in [1]: #, 2, 4, 8]:
        print("nMultiplier: ", nMultiplier)
        np.random.seed(7)
        if(EPOCHS == 2):
          test(nMultiplier, int(EPOCHS / nMultiplier), nL2, optimizer=optimizer, nTripletLoss = 1, nBaseModel = 1, bCumulativeLearning=False)
        else:
          test(nMultiplier, int(EPOCHS / nMultiplier), nL2, optimizer=optimizer, nTripletLoss = 1, nBaseModel = 1, bCumulativeLearning=True)
          
EMBEDDING_DIM = 3

As we can see, the 2D embedding yields a cluttered result. For reference, here is the already familiar 3D embedding:

	
# No mining, study of initial epochs

arrOptimizers = [
  [
    "rms_prop",
    [
      RMSprop(lr=0.00004, rho=0.9, epsilon=None, decay=0.0)
    ]
  ] 
]

nMultiplier = 1
nL2 = 0.6
alpha = 0.2
EMBEDDING_DIM = 3

for EPOCHS in [2, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20]:
  for opt in arrOptimizers:
    print("Optimizer: ", opt[0])
    for optimizer in opt[1]:
      print("EMBEDDING_DIM: ", EMBEDDING_DIM, "; EPOCHS: ", EPOCHS)
      np.random.seed(7)
      if(EPOCHS == 2):
        test(nMultiplier, EPOCHS, nL2, optimizer=optimizer, nTripletLoss = 1, nBaseModel = 1, bCumulativeLearning=False)
      else:
        test(nMultiplier, EPOCHS, nL2, optimizer=optimizer, nTripletLoss = 1, nBaseModel = 1, bCumulativeLearning=True)
          
EMBEDDING_DIM = 3

>>> Accuracy on test set: 0.7470727848101266 <<<

>>> Accuracy on test set: 0.9249208860759494 <<<

Now let's move to higher embedding dimensions. As you remember, we use TSNE for visualization; for accuracy we use the same approach as before, so the results can be compared directly to the 3D case:
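A minimal sketch of the kind of TSNE projection used for these plots, assuming `embedding_model` from above and hypothetical arrays `arrTestImages` of shape (N, 28, 28, 1) and `arrTestLabels` of length N:

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Get high-dimensional embeddings, then project them to 2D for plotting
arrEmbeddings = embedding_model.predict(arrTestImages / 255.)
arrProjected = TSNE(n_components=2, random_state=7).fit_transform(arrEmbeddings)

plt.scatter(arrProjected[:, 0], arrProjected[:, 1], c=arrTestLabels, cmap="tab10", s=3)
plt.colorbar()
plt.show()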

	
# No mining, trying larger embeddings: 4
arrOptimizers = [
  [
    "Adamax",
    [
      Adamax(lr=0.0001, beta_1=0.9, beta_2=0.999)
    ]
  ]
]

nL2 = 0.6

EMBEDDING_DIM = 4

for EPOCHS in [2, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10]:
  for opt in arrOptimizers:
    print("Optimizer: ", opt[0])
    for optimizer in opt[1]:
      for nMultiplier in [1]: #, 2, 4, 8]:
        print("nMultiplier: ", nMultiplier)
        np.random.seed(7)

        if(EPOCHS == 2):
          test(nMultiplier, int(EPOCHS / nMultiplier), nL2, optimizer=optimizer, nTripletLoss = 1, nBaseModel = 1, bCumulativeLearning=False)
        else:
          test(nMultiplier, int(EPOCHS / nMultiplier), nL2, optimizer=optimizer, nTripletLoss = 1, nBaseModel = 1, bCumulativeLearning=True)

>>> Accuracy on test set: 0.980379746835443 <<<

Accuracy is about the same.

	
EMBEDDING_DIM = 8

arrOptimizers = [
  [
    "Adamax",
    [
      Adamax(lr=0.0001, beta_1=0.9, beta_2=0.999)
    ]
  ]
]

nL2 = 0.6

for EPOCHS in [2, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10]:
  for opt in arrOptimizers:
    print("Optimizer: ", opt[0])
    for optimizer in opt[1]:
      for nMultiplier in [1]: #, 2, 4, 8]:
        print("nMultiplier: ", nMultiplier)
        np.random.seed(7)

        if(EPOCHS == 2):
          test(nMultiplier, int(EPOCHS / nMultiplier), nL2, optimizer=optimizer, nTripletLoss = 1, nBaseModel = 1, bCumulativeLearning=False)
        else:
          test(nMultiplier, int(EPOCHS / nMultiplier), nL2, optimizer=optimizer, nTripletLoss = 1, nBaseModel = 1, bCumulativeLearning=True)

EMBEDDING_DIM = 3
>>> Accuracy on test set: 0.9848101265822785 <<<

Still, no noticeable improvement.

	
# No mining, trying larger embeddings. Question: can it be that we get stuck in a local minimum
# near the global one? Let's try SGD.
arrOptimizers = [
  [
    "sgd",
    [ 
      SGD(lr=0.001, momentum=0.0, decay=0.0, nesterov=False)
    ]
  ]
]

nL2 = 0.6

EMBEDDING_DIM = 12

for EPOCHS in [2, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10]:
  for opt in arrOptimizers:
    print("Optimizer: ", opt[0])
    for optimizer in opt[1]:
      for nMultiplier in [1]: #, 2, 4, 8]:
        print("nMultiplier: ", nMultiplier)
        np.random.seed(7)

        if(EPOCHS == 2):
          test(nMultiplier, int(EPOCHS / nMultiplier), nL2, optimizer=optimizer, nTripletLoss = 1, nBaseModel = 1, bCumulativeLearning=False)
        else:
          test(nMultiplier, int(EPOCHS / nMultiplier), nL2, optimizer=optimizer, nTripletLoss = 1, nBaseModel = 1, bCumulativeLearning=True)

EMBEDDING_DIM = 3

For MNIST, it seems that 3D is an adequate embedding size. That is not surprising, since this is a very simple dataset. Still, we should keep embedding dimensionality in mind when working with more complex tasks.