Contour-refining of rectangular tags with convolutional neural networks
Markus Liedl, 18th October 2017
TL;DR. Convolutional networks can digest contradictory training data quite well. Let's exploit this "behaviour" to refine rectangular tags into a closer-fitting contour.
I'm starting with 1400 tagged fashion images. Each tag is a rectangular area around the fashion model.
Obviously the rectangular form doesn't fit perfectly and many pixels inside the tag are background pixels.
I'm trying to refine this rectangular area by training a convolutional neural network that distinguishes between background and foreground.
The inputs to the convolutional neural network are 32x32 patches extracted from the images. At the start, all background examples come from outside the tagged area and all foreground examples from within it.
Most examples from within the tagged area contain part of the fashion model, but some show only background that happens to lie close to the model.
The background examples from outside the tag look like this:
(The label depends on the patch's center pixel: if the center pixel lies outside the tagged area, the patch counts as a background example.)
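For concreteness, here is a minimal sketch of this sampling scheme. The image is assumed to be an HxWx3 array and the tag an (x0, y0, x1, y1) rectangle; these names and the uniform random sampling are my assumptions, not the original code.

import numpy as np

def sample_patch(image, tag, patch_size=32, rng=np.random):
    h, w = image.shape[:2]
    half = patch_size // 2
    # pick a random center that keeps the full patch inside the image
    cy = rng.randint(half, h - half)
    cx = rng.randint(half, w - half)
    patch = image[cy - half:cy + half, cx - half:cx + half]
    # label by the center pixel: 1 (foreground) if it lies inside the
    # tagged rectangle, 0 (background) otherwise
    x0, y0, x1, y1 = tag
    label = 1.0 if (x0 <= cx < x1 and y0 <= cy < y1) else 0.0
    return patch, label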
I defined a simple model in PyTorch (more details below), and after a few minutes of training it starts approaching a solution. The middle image shows the detected foreground, the right one the background.
The flexibility of convolutional neural networks solves the problem: almost all examples from outside the tagged area show background, while only a few of the examples from within do. In the end the larger number of examples wins, and background patches are recognized as background regardless of where they came from.
Contradictory Training Data
I'd say the phenomenon I'm observing is:
convolutional networks can handle contradictory or noisy training data quite well
PyTorch code
I'm using PyTorch 2d convolutions. The last non-linearity is a sigmoid, which outputs a score between 0.0 and 1.0, where 0.0 means background and 1.0 means foreground. For the test images I simply checked whether that score is above or below 0.5.
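As a small sketch of that thresholding step, assuming `net` is an instance of the Net defined below and `patch` is a 32x32 crop as a 1x3x32x32 float tensor:

import torch

with torch.no_grad():
    score = net(patch).item()    # sigmoid output in [0.0, 1.0]
foreground = score > 0.5         # above 0.5 -> foreground, else background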
All convolutions except the first have stride 2 to downscale the image step by step. Pooling layers would work as well.
import torch.nn as nn
import torch.nn.functional as F

fs = [32, 64, 128, 128, 128, 1]

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.useBN = True
        # the first convolution keeps the 32x32 resolution (stride 1, padding 2)
        self.conv1 = nn.Conv2d(3, fs[0], 5, 1, 2)
        # each following convolution halves the resolution:
        # 32 -> 16 -> 8 -> 4 -> 2 -> 1
        self.conv2 = nn.Conv2d(fs[0], fs[1], 3, 2, 1)
        self.conv3 = nn.Conv2d(fs[1], fs[2], 3, 2, 1)
        self.conv4 = nn.Conv2d(fs[2], fs[3], 3, 2, 1)
        self.conv5 = nn.Conv2d(fs[3], fs[4], 3, 2, 1)
        self.conv6 = nn.Conv2d(fs[4], fs[5], 3, 2, 1, bias=False)
        if self.useBN:
            self.bn1 = nn.BatchNorm2d(fs[0])
            self.bn2 = nn.BatchNorm2d(fs[1])
            self.bn3 = nn.BatchNorm2d(fs[2])
            self.bn4 = nn.BatchNorm2d(fs[3])
            self.bn5 = nn.BatchNorm2d(fs[4])

    def forward(self, x):
        x = self.conv1(x)
        if self.useBN: x = self.bn1(x)
        x = F.leaky_relu(x, 0.2)
        x = self.conv2(x)
        if self.useBN: x = self.bn2(x)
        x = F.leaky_relu(x, 0.2)
        x = self.conv3(x)
        if self.useBN: x = self.bn3(x)
        x = F.leaky_relu(x, 0.2)
        x = self.conv4(x)
        if self.useBN: x = self.bn4(x)
        x = F.leaky_relu(x, 0.2)
        x = self.conv5(x)
        if self.useBN: x = self.bn5(x)
        x = F.leaky_relu(x, 0.2)
        x = self.conv6(x)
        x = F.sigmoid(x)  # score in [0.0, 1.0]: background vs. foreground
        return x
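The post doesn't show the training loop, but a minimal sketch would pair the sigmoid output with binary cross-entropy. The `batches()` generator yielding (patches, labels) tensors of shape (N, 3, 32, 32) and (N,) is a hypothetical stand-in for the actual data loading:

import torch.nn as nn
import torch.optim as optim

net = Net()
optimizer = optim.Adam(net.parameters(), lr=1e-3)
criterion = nn.BCELoss()  # matches the sigmoid output

for patches, labels in batches():
    optimizer.zero_grad()
    scores = net(patches).view(-1)   # (N, 1, 1, 1) -> (N,)
    loss = criterion(scores, labels)
    loss.backward()
    optimizer.step()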
Further
As you might have guessed, this is just a quick hack. Maybe unfinished work makes a better blog post than finished work. Here the list of remaining ideas is particularly long:
- apply the discriminator at many more positions to get a smoother, less blocky result.
- derive a new contour from the trained discriminator, for example by iteratively removing pixels from the borders of the tagged area.
- patches that are outside the newly derived contour could be added to the set of background patches. This presents less contradictory data to the convolutional network.
- add a prior: the foreground pixels are connected. There's no such thing as a single foreground pixel lost somewhere in the image, so the biggest cluster of connected foreground pixels is the real foreground (see the sketch after this list).
- adapt dropout in such a way that a patch's center pixel gets more weight. The network should learn that a patch is a foreground patch if its center pixel is foreground. If it works, it should help at inference time as well.
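Here is a sketch of the connectedness prior from the list above, assuming `mask` is a 2d boolean array of per-pixel foreground predictions; it uses scipy.ndimage, which is not part of the original post:

import numpy as np
from scipy import ndimage

def largest_cluster(mask):
    labeled, num = ndimage.label(mask)   # label connected components
    if num == 0:
        return mask
    sizes = ndimage.sum(mask, labeled, range(1, num + 1))
    biggest = np.argmax(sizes) + 1       # component ids start at 1
    return labeled == biggest            # keep only the biggest cluster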
If you want to apply this technique to other datasets, it might even work without any rectangular tags at all! Foregrounds are often somewhere near the center of an image, so you could use patches close to the corners as background examples and patches from the rest of the image as foreground examples, as sketched below.
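A sketch of that tag-free labeling rule, given a patch center (cx, cy) in a WxH image; the corner margin is an assumption of mine, not from the original post:

def corner_label(cx, cy, w, h, margin=0.15):
    near_x = cx < margin * w or cx > (1 - margin) * w
    near_y = cy < margin * h or cy > (1 - margin) * h
    # background (0.0) only where the center is close to a corner
    return 0.0 if (near_x and near_y) else 1.0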
Hope you had an inspiring read!
Markus
Follow me on twitter.com/markusliedl
I'm offering deep learning trainings and workshops in the Munich area.