Pre-Training with Surrogate Labels
Using the test sample to improve CNN performance
Last update: 22.10.2021. All opinions are my own.
1. Overview
In many real-world settings, the labeled training sample is small compared to the unlabeled test data. This blogpost demonstrates a technique that can improve the performance of neural networks in such settings by learning from both training and test data: pre-training a model on the complete data set using a surrogate label. The approach exposes the model to the test data and lets it benefit from a larger sample size while learning, which can also help to reduce the impact of sampling bias.
We will focus on a computer vision application, but the idea can be used with deep learning models in other domains. We will use data from the SIIM-ISIC Melanoma Classification Kaggle competition to distinguish malignant and benign lesions on medical images. The modeling is performed in tensorflow. A shorter, interactive version of this blogpost is also available as a Kaggle notebook.
2. Intuition
How can we make use of the test sample at the pre-training stage? The labels are only observed for the training data. Luckily, in many settings, meta-data is available for both labeled and unlabeled images. Consider the task of lung cancer detection. The CT scans of cancer patients may come with information on the patient's age and gender. In contrast to the label, which requires medical tests or an expert's diagnosis, meta-data is available at no additional cost. Another example is bird image classification, where image meta-data such as the time and location of the photo can serve the same purpose. In this blogpost, we will focus on malignant lesion classification, where patient meta-data is available for all images.
We can leverage meta-data in the following way:
- Pre-train a supplementary model on the complete train + test data using one of the meta-features as a surrogate label.
- Initialize from the pre-trained weights when training the main model.
The intuition behind this approach is that by learning to classify images according to one of the meta variables, the model can pick up visual features that are also useful for the main task, which in our case is malignant lesion classification. For instance, lesion size and skin color can be helpful for determining both the lesion location (surrogate label) and the lesion type (actual label). Exposing the model to the test data also allows it to take a sneak peek at the test images, which may help it to learn patterns prevalent in the test distribution. The sketch below illustrates the two-stage idea.
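Schematically, the two stages look as follows. This is only a minimal sketch: the helper make_backbone() and the data placeholders (all_images, surrogate_labels, train_images, train_labels) are hypothetical, and it uses the Keras built-in EfficientNetB0 for brevity, whereas the actual pipeline below relies on the efficientnet package and tf.data input pipelines.
#collapse-show
import tensorflow as tf

def make_backbone():
    # Hypothetical convolutional backbone shared by both stages.
    return tf.keras.applications.EfficientNetB0(include_top = False, pooling = 'avg',
                                                input_shape = (128, 128, 3))

# Stage 1: pre-train on train + test images with the surrogate label (lesion location).
backbone = make_backbone()
surrogate_head = tf.keras.layers.Dense(6, activation = 'softmax')(backbone.output)
pretrain_model = tf.keras.Model(backbone.input, surrogate_head)
pretrain_model.compile(optimizer = 'adam', loss = 'categorical_crossentropy')
# pretrain_model.fit(all_images, surrogate_labels, ...)   # train + test images
backbone.save_weights('base_weights.h5')

# Stage 2: train the main classifier on the labeled training data only,
# initializing the backbone from the pre-trained weights.
backbone = make_backbone()
backbone.load_weights('base_weights.h5')
main_head = tf.keras.layers.Dense(1, activation = 'sigmoid')(backbone.output)
main_model = tf.keras.Model(backbone.input, main_head)
main_model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['AUC'])
# main_model.fit(train_images, train_labels, ...)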
P.S. The notebook heavily relies on the great modeling pipeline developed by Chris Deotte for the SIIM-ISIC competition and reuses much of his original code. Kindly refer to his notebook for general questions on the pipeline, where he provides comments and documentation.
#collapse-hide
### PACKAGES
!pip install -q efficientnet >> /dev/null
import pandas as pd, numpy as np
from kaggle_datasets import KaggleDatasets
import tensorflow as tf, re, math
import tensorflow.keras.backend as K
import efficientnet.tfkeras as efn
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score
import matplotlib.pyplot as plt
from scipy.stats import rankdata
import PIL, cv2
Let's set up training parameters such as the image size, the number of folds and the batch size. In addition to these parameters, we introduce the USE_PRETRAIN_WEIGHTS variable to indicate whether we want to pre-train a supplementary model on the full data before training the main melanoma classification model.
For demonstration purposes, we use EfficientNet B0, a 128x128 image size and no TTA. Feel free to experiment with larger architectures and image sizes by editing this notebook.
#collapse-show
# DEVICE
DEVICE = "TPU"
# USE A DIFFERENT SEED FOR A DIFFERENT KFOLD SPLIT
SEED = 42
# NUMBER OF FOLDS. USE 3, 5, OR 15
FOLDS = 5
# WHICH IMAGE SIZES TO LOAD EACH FOLD
IMG_SIZES = [128]*FOLDS
# BATCH SIZE AND EPOCHS
BATCH_SIZES = [32]*FOLDS
EPOCHS = [10]*FOLDS
# WHICH EFFICIENTNET TO USE
EFF_NETS = [0]*FOLDS
# WEIGHTS FOR FOLD MODELS WHEN PREDICTING TEST
WGTS = [1/FOLDS]*FOLDS
# PRETRAINED WEIGHTS
USE_PRETRAIN_WEIGHTS = True
Below, we connect to TPU or GPU for faster training.
#collapse-hide
# CONNECT TO DEVICE
if DEVICE == "TPU":
print("connecting to TPU...")
try:
tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
print('Running on TPU ', tpu.master())
except ValueError:
print("Could not connect to TPU")
tpu = None
if tpu:
try:
print("initializing TPU ...")
tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)
strategy = tf.distribute.experimental.TPUStrategy(tpu)
print("TPU initialized")
except _:
print("failed to initialize TPU")
else:
DEVICE = "GPU"
if DEVICE != "TPU":
print("Using default strategy for CPU and single GPU")
strategy = tf.distribute.get_strategy()
if DEVICE == "GPU":
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
AUTO = tf.data.experimental.AUTOTUNE
REPLICAS = strategy.num_replicas_in_sync
print(f'REPLICAS: {REPLICAS}')
4. Image processing
First, we specify data paths. The data is stored as tfrecords
to enable fast processing. You can read more on the data here.
#collapse-show
# IMAGE PATHS
GCS_PATH = [None]*FOLDS
for i,k in enumerate(IMG_SIZES):
    GCS_PATH[i] = KaggleDatasets().get_gcs_path('melanoma-%ix%i'%(k,k))
files_train = np.sort(np.array(tf.io.gfile.glob(GCS_PATH[0] + '/train*.tfrec')))
files_test = np.sort(np.array(tf.io.gfile.glob(GCS_PATH[0] + '/test*.tfrec')))
The read_labeled_tfrecord() function provides two outputs:
- Image tensor.
- Either anatom_site_general_challenge or target as a label. The former is a one-hot-encoded categorical feature with six possible values indicating the lesion location. The latter is a binary target indicating whether the lesion is malignant. The selection of the label is controlled by the pretraining argument read from the get_dataset() function below. Setting pretraining = True implies using anatom_site_general_challenge as a surrogate label.
We also set up read_unlabeled_tfrecord() that returns the image and the image name.
#collapse-show
# Parse a labeled example: returns (image, one-hot lesion location) during pre-training
# and (image, binary melanoma target) otherwise.
def read_labeled_tfrecord(example, pretraining = False):
    if pretraining:
        tfrec_format = {
            'image'                        : tf.io.FixedLenFeature([], tf.string),
            'image_name'                   : tf.io.FixedLenFeature([], tf.string),
            'anatom_site_general_challenge': tf.io.FixedLenFeature([], tf.int64),
        }
    else:
        tfrec_format = {
            'image'      : tf.io.FixedLenFeature([], tf.string),
            'image_name' : tf.io.FixedLenFeature([], tf.string),
            'target'     : tf.io.FixedLenFeature([], tf.int64)
        }
    example = tf.io.parse_single_example(example, tfrec_format)
    return example['image'], tf.one_hot(example['anatom_site_general_challenge'], 6) if pretraining else example['target']

# Parse an unlabeled example: returns (image, image name).
def read_unlabeled_tfrecord(example, return_image_name=True):
    tfrec_format = {
        'image'      : tf.io.FixedLenFeature([], tf.string),
        'image_name' : tf.io.FixedLenFeature([], tf.string),
    }
    example = tf.io.parse_single_example(example, tfrec_format)
    return example['image'], example['image_name'] if return_image_name else 0

# Decode the JPEG, scale to [0, 1] and apply the circular mask defined below.
def prepare_image(img, dim = 256):
    img = tf.image.decode_jpeg(img, channels = 3)
    img = tf.cast(img, tf.float32) / 255.0
    img = img * circle_mask
    img = tf.reshape(img, [dim, dim, 3])
    return img

# Count examples using the counts encoded in the tfrecord file names.
def count_data_items(filenames):
    n = [int(re.compile(r"-([0-9]*)\.").search(filename).group(1))
         for filename in filenames]
    return np.sum(n)
The get_dataset() function is a wrapper that loads and processes the images according to the arguments controlling the import options.
#collapse-show
def get_dataset(files,
                shuffle = False,
                repeat = False,
                labeled = True,
                pretraining = False,
                return_image_names = True,
                batch_size = 16,
                dim = 256):

    ds = tf.data.TFRecordDataset(files, num_parallel_reads = AUTO)
    ds = ds.cache()

    if repeat:
        ds = ds.repeat()

    if shuffle:
        ds = ds.shuffle(1024*2)  # too large a buffer causes OOM on GPU/CPU
        opt = tf.data.Options()
        opt.experimental_deterministic = False
        ds = ds.with_options(opt)

    if labeled:
        ds = ds.map(lambda example: read_labeled_tfrecord(example, pretraining),
                    num_parallel_calls = AUTO)
    else:
        ds = ds.map(lambda example: read_unlabeled_tfrecord(example, return_image_names),
                    num_parallel_calls = AUTO)

    ds = ds.map(lambda img, imgname_or_label: (prepare_image(img, dim = dim),
                                               imgname_or_label),
                num_parallel_calls = AUTO)

    ds = ds.batch(batch_size * REPLICAS)
    ds = ds.prefetch(AUTO)
    return ds
We also use a circular crop (a.k.a. microscope augmentation) to improve image consistency. The snippet below creates a circular mask, which is applied in the prepare_image()
function.
#collapse-show
# CIRCLE CROP PREPARATIONS
circle_img = np.zeros((IMG_SIZES[0], IMG_SIZES[0]), np.uint8)
circle_img = cv2.circle(circle_img, (int(IMG_SIZES[0]/2), int(IMG_SIZES[0]/2)), int(IMG_SIZES[0]/2), 1, thickness = -1)
circle_img = np.repeat(circle_img[:, :, np.newaxis], 3, axis = 2)
circle_mask = tf.cast(circle_img, tf.float32)
Let's have a quick look at a batch of our images:
#collapse-hide
# LOAD DATA AND APPLY AUGMENTATIONS
def show_dataset(thumb_size, cols, rows, ds):
    mosaic = PIL.Image.new(mode='RGB', size=(thumb_size*cols + (cols-1),
                                             thumb_size*rows + (rows-1)))
    for idx, data in enumerate(iter(ds)):
        img, target_or_imgid = data
        ix = idx % cols
        iy = idx // cols
        img = np.clip(img.numpy() * 255, 0, 255).astype(np.uint8)
        img = PIL.Image.fromarray(img)
        img = img.resize((thumb_size, thumb_size), resample = PIL.Image.BILINEAR)
        mosaic.paste(img, (ix*thumb_size + ix,
                           iy*thumb_size + iy))
        nn = target_or_imgid.numpy().decode("utf-8")
    display(mosaic)
    return nn

files_train = tf.io.gfile.glob(GCS_PATH[0] + '/train*.tfrec')
ds = tf.data.TFRecordDataset(files_train, num_parallel_reads = AUTO).shuffle(1024)
ds = ds.take(10).cache()
ds = ds.map(read_unlabeled_tfrecord, num_parallel_calls = AUTO)
ds = ds.map(lambda img, target: (prepare_image(img, dim = IMG_SIZES[0]), target),
            num_parallel_calls = AUTO)
ds = ds.take(12*5)
ds = ds.prefetch(AUTO)

# DISPLAY IMAGES
name = show_dataset(128, 5, 2, ds)
5. Modeling
Pre-trained model with surrogate label
The build_model() function incorporates three important features that depend on the training regime:
- When building a model for pre-training, we use CategoricalCrossentropy as a loss because anatom_site_general_challenge is a categorical variable. When building a model that classifies lesions as benign/malignant, we use BinaryCrossentropy as a loss.
- When training a final binary classification model, we load the pre-trained weights using base.load_weights('base_weights.h5') if use_pretrain_weights == True.
- We use a dense layer with six output nodes and softmax activation when doing pre-training, and a dense layer with a single output node and sigmoid activation when training a final model.
#collapse-show
EFNS = [efn.EfficientNetB0, efn.EfficientNetB1, efn.EfficientNetB2, efn.EfficientNetB3,
        efn.EfficientNetB4, efn.EfficientNetB5, efn.EfficientNetB6, efn.EfficientNetB7]

def build_model(dim = 256, ef = 0, pretraining = False, use_pretrain_weights = False):

    # base
    inp = tf.keras.layers.Input(shape = (dim, dim, 3))
    base = EFNS[ef](input_shape = (dim, dim, 3), weights = 'imagenet', include_top = False)

    # base weights
    if use_pretrain_weights:
        base.load_weights('base_weights.h5')

    x = base(inp)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)

    if pretraining:
        # surrogate-label head: six lesion locations, categorical cross-entropy
        x = tf.keras.layers.Dense(6, activation = 'softmax')(x)
        model = tf.keras.Model(inputs = inp, outputs = x)
        opt = tf.keras.optimizers.Adam(learning_rate = 0.001)
        loss = tf.keras.losses.CategoricalCrossentropy()
        model.compile(optimizer = opt, loss = loss)
    else:
        # main head: binary benign/malignant classification with label smoothing
        x = tf.keras.layers.Dense(1, activation = 'sigmoid')(x)
        model = tf.keras.Model(inputs = inp, outputs = x)
        opt = tf.keras.optimizers.Adam(learning_rate = 0.001)
        loss = tf.keras.losses.BinaryCrossentropy(label_smoothing = 0.01)
        model.compile(optimizer = opt, loss = loss, metrics = ['AUC'])

    return model
#collapse-hide
### LEARNING RATE SCHEDULE: LINEAR RAMP-UP, OPTIONAL PLATEAU, THEN EXPONENTIAL DECAY
def get_lr_callback(batch_size=8):
    lr_start   = 0.000005
    lr_max     = 0.00000125 * REPLICAS * batch_size
    lr_min     = 0.000001
    lr_ramp_ep = 5
    lr_sus_ep  = 0
    lr_decay   = 0.8

    def lrfn(epoch):
        if epoch < lr_ramp_ep:
            lr = (lr_max - lr_start) / lr_ramp_ep * epoch + lr_start
        elif epoch < lr_ramp_ep + lr_sus_ep:
            lr = lr_max
        else:
            lr = (lr_max - lr_min) * lr_decay**(epoch - lr_ramp_ep - lr_sus_ep) + lr_min
        return lr

    lr_callback = tf.keras.callbacks.LearningRateScheduler(lrfn, verbose=False)
    return lr_callback
The pre-trained model is trained on both training and test data. Here, we use the original training data merged with the complete test set as a training sample. We fix the number of training epochs to EPOCHS and do not perform early stopping. You can also experiment with setting up a small validation sample drawn from both training and test data to perform early stopping, as sketched below.
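For example, one could hold out a small share of the combined train + test tfrecord files as a validation set and stop pre-training once the surrogate-label loss stops improving. The snippet below is only a sketch of this idea; the 10% file split and the EarlyStopping settings are illustrative and not part of the pipeline used in this post.
#collapse-show
# Hold out a few tfrecord files (train + test mix) to validate the pre-training task.
all_files = (tf.io.gfile.glob(GCS_PATH[0] + '/train*.tfrec')
             + tf.io.gfile.glob(GCS_PATH[0] + '/test*.tfrec'))
np.random.shuffle(all_files)
n_valid = max(1, len(all_files) // 10)   # roughly 10% of the files for validation
pretrain_valid_files = all_files[:n_valid]
pretrain_train_files = all_files[n_valid:]

# Stop pre-training when the validation loss on the surrogate label stops improving.
early_stop = tf.keras.callbacks.EarlyStopping(monitor = 'val_loss', patience = 2,
                                              restore_best_weights = True)

# model.fit(get_dataset(pretrain_train_files, shuffle = True, repeat = True, pretraining = True,
#                       dim = IMG_SIZES[0], batch_size = BATCH_SIZES[0]),
#           validation_data = get_dataset(pretrain_valid_files, pretraining = True,
#                                         dim = IMG_SIZES[0], batch_size = BATCH_SIZES[0]),
#           steps_per_epoch = count_data_items(pretrain_train_files)/BATCH_SIZES[0]//REPLICAS,
#           epochs = EPOCHS[0], callbacks = [early_stop, get_lr_callback(BATCH_SIZES[0])])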
#collapse-show
### PRE-TRAINED MODEL
if USE_PRETRAIN_WEIGHTS:

    # USE VERBOSE=0 for silent, VERBOSE=1 for interactive, VERBOSE=2 for commit
    VERBOSE = 2

    # RE-INITIALIZE TPU
    if DEVICE == 'TPU':
        if tpu: tf.tpu.experimental.initialize_tpu_system(tpu)

    # CREATE PRE-TRAINING SAMPLE FROM TRAIN + TEST DATA
    files_train = tf.io.gfile.glob(GCS_PATH[0] + '/train*.tfrec')
    print('#### Using 2020 train data')
    files_train += tf.io.gfile.glob(GCS_PATH[0] + '/test*.tfrec')
    print('#### Using 2020 test data')
    np.random.shuffle(files_train)

    # BUILD MODEL
    K.clear_session()
    tf.random.set_seed(SEED)
    with strategy.scope():
        model = build_model(dim = IMG_SIZES[0],
                            ef = EFF_NETS[0],
                            pretraining = True)

    # SAVE BEST WEIGHTS (LOWEST TRAINING LOSS)
    sv = tf.keras.callbacks.ModelCheckpoint(
        'weights.h5', monitor='loss', verbose=0, save_best_only=True,
        save_weights_only=True, mode='min', save_freq='epoch')

    # TRAIN
    print('Training...')
    history = model.fit(
        get_dataset(files_train,
                    dim = IMG_SIZES[0],
                    batch_size = BATCH_SIZES[0],
                    shuffle = True,
                    repeat = True,
                    pretraining = True),
        epochs = EPOCHS[0],
        callbacks = [sv, get_lr_callback(BATCH_SIZES[0])],
        steps_per_epoch = count_data_items(files_train)/BATCH_SIZES[0]//REPLICAS,
        verbose = VERBOSE)

else:
    print('#### NOT using a pre-trained model')
The pre-training is complete! Now, we resave the weights of the pre-trained model to make them easier to load later. We are not really interested in the classification head, so we only export the weights of the convolutional part of the network, which we can access via model.layers[1].
#collapse-show
# LOAD WEIGHTS AND CHECK MODEL
if USE_PRETRAIN_WEIGHTS:
    model.load_weights('weights.h5')
    model.summary()

# EXPORT BASE WEIGHTS
if USE_PRETRAIN_WEIGHTS:
    model.layers[1].save_weights('base_weights.h5')
Main classification model
Now we can train a final classification model using a cross-validation framework on the training data!
We need to take care of a couple of changes:
- Make sure that we don't use test data in the training folds.
- Set use_pretrain_weights = True and pretraining = False in the build_model() function to initialize from the pre-trained weights in the beginning of each fold.
#collapse-show
# USE VERBOSE=0 for silent, VERBOSE=1 for interactive, VERBOSE=2 for commit
VERBOSE = 0

skf = KFold(n_splits = FOLDS, shuffle = True, random_state = SEED)
oof_pred = []; oof_tar = []; oof_val = []; oof_names = []; oof_folds = []
preds = np.zeros((count_data_items(files_test), 1))

for fold, (idxT, idxV) in enumerate(skf.split(np.arange(15))):

    # DISPLAY FOLD INFO
    if DEVICE == 'TPU':
        if tpu: tf.tpu.experimental.initialize_tpu_system(tpu)
    print('#'*25); print('#### FOLD', fold+1)

    # CREATE TRAIN AND VALIDATION SUBSETS
    files_train = tf.io.gfile.glob([GCS_PATH[fold] + '/train%.2i*.tfrec'%x for x in idxT])
    np.random.shuffle(files_train); print('#'*25)
    files_valid = tf.io.gfile.glob([GCS_PATH[fold] + '/train%.2i*.tfrec'%x for x in idxV])
    files_test = np.sort(np.array(tf.io.gfile.glob(GCS_PATH[fold] + '/test*.tfrec')))

    # BUILD MODEL
    K.clear_session()
    tf.random.set_seed(SEED)
    with strategy.scope():
        model = build_model(dim = IMG_SIZES[fold],
                            ef = EFF_NETS[fold],
                            use_pretrain_weights = USE_PRETRAIN_WEIGHTS,
                            pretraining = False)

    # SAVE BEST MODEL EACH FOLD
    sv = tf.keras.callbacks.ModelCheckpoint(
        'fold-%i.h5'%fold, monitor='val_auc', verbose=0, save_best_only=True,
        save_weights_only=True, mode='max', save_freq='epoch')

    # TRAIN
    print('Training...')
    history = model.fit(
        get_dataset(files_train,
                    shuffle = True,
                    repeat = True,
                    dim = IMG_SIZES[fold],
                    batch_size = BATCH_SIZES[fold]),
        epochs = EPOCHS[fold],
        callbacks = [sv, get_lr_callback(BATCH_SIZES[fold])],
        steps_per_epoch = count_data_items(files_train)/BATCH_SIZES[fold]//REPLICAS,
        validation_data = get_dataset(files_valid,
                                      shuffle = False,
                                      repeat = False,
                                      dim = IMG_SIZES[fold]),
        verbose = VERBOSE
    )
    model.load_weights('fold-%i.h5'%fold)

    # PREDICT OOF
    print('Predicting OOF...')
    ds_valid = get_dataset(files_valid, labeled=False, return_image_names=False,
                           shuffle=False, dim=IMG_SIZES[fold], batch_size=BATCH_SIZES[fold]*4)
    ct_valid = count_data_items(files_valid); STEPS = ct_valid/BATCH_SIZES[fold]/4/REPLICAS
    pred = model.predict(ds_valid, steps=STEPS, verbose=VERBOSE)[:ct_valid,]
    oof_pred.append(pred)

    # GET OOF TARGETS AND NAMES
    ds_valid = get_dataset(files_valid, dim=IMG_SIZES[fold], labeled=True, return_image_names=True)
    oof_tar.append(np.array([target.numpy() for img, target in iter(ds_valid.unbatch())]))
    oof_folds.append(np.ones_like(oof_tar[-1], dtype='int8')*fold)
    ds = get_dataset(files_valid, dim=IMG_SIZES[fold], labeled=False, return_image_names=True)
    oof_names.append(np.array([img_name.numpy().decode("utf-8") for img, img_name in iter(ds.unbatch())]))

    # PREDICT TEST
    print('Predicting Test...')
    ds_test = get_dataset(files_test, labeled=False, return_image_names=False,
                          shuffle=False, dim=IMG_SIZES[fold], batch_size=BATCH_SIZES[fold]*4)
    ct_test = count_data_items(files_test); STEPS = ct_test/BATCH_SIZES[fold]/4/REPLICAS
    pred = model.predict(ds_test, steps=STEPS, verbose=VERBOSE)[:ct_test,]
    preds[:,0] += (pred * WGTS[fold]).reshape(-1)
#collapse-show
# COMPUTE OOF AUC
oof = np.concatenate(oof_pred); true = np.concatenate(oof_tar);
names = np.concatenate(oof_names); folds = np.concatenate(oof_folds)
auc = roc_auc_score(true,oof)
print('Overall OOF AUC = %.4f'%auc)
How does the OOF AUC compare to a model without the pre-training stage? To check this, we can simply set USE_PRETRAIN_WEIGHTS = False in the beginning of the notebook. This is done in this version of the Kaggle notebook, yielding a model with a lower OOF AUC (0.8329 compared to 0.8414 with pre-training).
Compared to a model initialized from the ImageNet weights, pre-training on a surrogate label therefore brings a CV improvement. The AUC gain also translates into a performance gain on the competition leaderboard (an increase from 0.8582 to 0.8809). Great news!
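For completeness, here is a sketch of how the accumulated test predictions could be turned into a submission file. It assumes that get_dataset() returns the test images in a stable order when shuffle = False, so that the collected names line up with preds; this step is not shown in the post itself.
#collapse-show
# Collect test image names in the same (unshuffled) order used for prediction.
ds_names = get_dataset(files_test, labeled = False, return_image_names = True,
                       dim = IMG_SIZES[0])
image_names = np.array([img_name.numpy().decode("utf-8")
                        for img, img_name in iter(ds_names.unbatch())])

# Write the weighted fold-average predictions to a submission file.
submission = pd.DataFrame({'image_name': image_names, 'target': preds[:, 0]})
submission = submission.sort_values('image_name')
submission.to_csv('submission.csv', index = False)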
6. Closing words
This is the end of this blogpost. Using a computer vision application, we demonstrated how to use meta-data to construct a surrogate label and pre-train a CNN on both training and test data to improve performance.
The pre-trained model can be further optimized to increase the performance gains. Using a validation subset at the pre-training stage can help to tune the number of epochs and other learning parameters. Another idea could be to construct a surrogate label with more unique values (e.g., a combination of anatom_site_general_challenge and sex) to make the pre-training task more challenging and motivate the model to learn better representations; a sketch of this is given below. On the other hand, further optimizing the main classification model may reduce the benefit of pre-training.
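As an illustration, such a combined label could be built by crossing the two meta-features and one-hot encoding the result. The snippet below is a hypothetical sketch: it assumes that sex is parsed from the tfrecords as an int64 feature, which would require extending read_labeled_tfrecord() and the pre-training head accordingly.
#collapse-show
# Hypothetical: cross lesion location (6 values) with sex (2 values) into 12 classes.
N_SITES, N_SEX = 6, 2

def combined_surrogate_label(example):
    # assumes 'anatom_site_general_challenge' and 'sex' are parsed int64 features
    combined = example['anatom_site_general_challenge'] * N_SEX + example['sex']
    return tf.one_hot(combined, N_SITES * N_SEX)

# The pre-training head would then need N_SITES * N_SEX output nodes:
# x = tf.keras.layers.Dense(N_SITES * N_SEX, activation = 'softmax')(x)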
Liked the post? Share it on social media!
You can also buy me a cup of coffee to support my work. Thanks!