Virtual Shop Assistant Chatbot with Amazing Image Recognition

There has been some significant progress in “deep learning”, AI, and image recognition over the past couple of years; Google, Microsoft, and Amazon each have their own service offering. But what are these services like? How useful are they?

Everyone’s having a go at making a chatbot this year (and if you’re not, perhaps you should contact me for consultancy or training!) – and although there are some great examples out there, I’ve not seen much in the e-commerce sector worth talking about.

In this article I’m going to show you a cool use case for an image recognition e-commerce chatbot via a couple of clever APIs wired together by botframework.

The Concept

Could we build a chatbot that acts as a Virtual Shop Assistant, allowing the user to upload an image of an item of clothing they’d like to buy, and have the chatbot reply with similar items that they could buy?

Is image recognition up to the task? There are a few solutions out there; a lot of very reasonably priced services from a few big players (Microsoft, Amazon, and Google).

There are plenty of affiliate networks to allow us to find matching products, and there must be an API we can leverage.

Here’s how I want to structure this concept project:

  1. User uploads image
  2. Bot processes the image to find a product
  3. Bot responds with the description of the product it found in the image
  4. Bot finds similar products for sale
  5. Bot returns products in carousel (with affiliate links!)
  6. User taps the item they would like
  7. Bot redirects user to that item’s website for purchase

Let’s get started!

Stage 1: Image Upload

In a previous article I explained how to receive files from the user in a botframework chatbot; we’ll implement that again here to receive the image the user has posted. Something like this should work:

public async Task MessageReceivedAsync(IDialogContext context, IAwaitable<IMessageActivity> argument)
{
    var message = await argument;
    var connector = new ConnectorClient(new Uri(message.ServiceUrl));

    var image = message.
        Attachments?.
        FirstOrDefault(x => x.ContentType.ToLowerInvariant().Contains("image"));

    // if the image was passed as byte data in Content, use that
    var data = image.Content as byte[] ??
        // if not, download it from the ContentUrl
        // (awaited rather than blocking on .Result inside an async method)
        await connector
            .HttpClient
            .GetByteArrayAsync(image.ContentUrl);
}

Obviously I’m ignoring all sanity checking in this code for brevity – e.g. “is there even an attachment”, that sort of thing..
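As a rough idea of what that sanity checking could look like, here is a minimal sketch of a null-safe "is this an image?" check. The class and method names (AttachmentChecks, IsImage, FirstImageType) are hypothetical helpers, not part of botframework, and it works on plain content-type strings so that it stands alone.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the Bot Framework SDK.
public static class AttachmentChecks
{
    // Null-safe test for an image MIME type such as "image/jpeg".
    public static bool IsImage(string contentType)
    {
        return !string.IsNullOrWhiteSpace(contentType) &&
               contentType.Trim().StartsWith("image/", StringComparison.OrdinalIgnoreCase);
    }

    // Returns the first image content type in the list, or null when there
    // is no list at all or nothing in it looks like an image.
    public static string FirstImageType(IEnumerable<string> contentTypes)
    {
        return contentTypes?.FirstOrDefault(t => IsImage(t));
    }
}
```

In the bot itself, the same check could be applied to each attachment's ContentType, with the method returning early (and perhaps asking the user for a photo) when nothing matches.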

Stage 2: Product Image Recognition

Let’s investigate the Image Recognition APIs available from the big players (and a small one) and their results using a test image.

The image I’m using is for a product from Asos, since I didn’t want to just use a basic “black shirt” or something simple; this is a “Longline Roll Neck Sleeveless Nude Jumper”:

Longline Roll Neck Sleeveless Nude Jumper

This should be interesting – let’s see how they do!

Microsoft’s Computer Vision API

The first contender comes from Microsoft’s Cognitive Research arm; their image recognition solution is called Computer Vision API.

The Computer Vision API returns information about visual content found in an image. You can use tagging, descriptions, and domain-specific models to identify content and label it – however, the only “domain specific model” that exists so far is “celebrities”, so not really useful for clothing.

Let’s see how it does analysing the test image:

Microsoft Computer Vision response example

Result: “a woman posing for a picture” with the tags “person” and “standing”.

Ooookay. Not really what we need for a product recognition concept, right? Maybe it can get better in future; there is the option to use a domain specific model, however the only model that exists at the moment is “celebrities”. I’d love to be able to train it with my own model, perhaps provide a huge upload of images and associated product descriptions.

Summary
It certainly described the image, but did it give me the contextual accuracy I require? Nope. Even the colour breakdown isn’t particularly good. Shame.

Colour breakdown

It does have a category taxonomy which would suggest the ability to provide domain-specific information to allow for more detailed categorisation, but it doesn’t look like you can train it on a known data set; for example, your website’s product catalogue with all the associated images and descriptions.

Pricing

  • Free: 5000 calls per month capped limit
  • Standard: $1.50 per 1000 calls, 10 transactions per second limit

Time to move on..

Google Vision API

Google Cloud Vision API enables developers to understand the content of an image by encapsulating powerful machine learning models in an easy to use REST API. It quickly classifies images into thousands of categories (e.g., “sailboat”, “lion”, “Eiffel Tower”), detects individual objects and faces within images, and finds and reads printed words contained within images.

So how does it fare with our clothing image?

Google Vision API response example

Result: “Clothing, Sleeve, Dress, Photo Shoot, Outerwear, Neck, Textile, Pattern, Collar, Beige..”

Hmm. Slightly better.

Summary
Slightly better than Microsoft’s offering, perhaps? Still not something we could pass into an ecommerce site’s search box, hoping to get a similar item though.

However, the colour breakdown is nice, and the demo site displays it well.
Google vision API colour breakdown

The full JSON response has a ridiculous amount of detail about the face; from the bounding rectangle of the face elements to the location of eyebrow edges.

Google vision API landmarks

It just keeps on going..!

Pricing
$1.50 per 1000 images per month, with the first 1000 per month free.

Amazon’s AWS Rekognition

With Amazon’s Rekognition, you can detect objects, scenes, and faces in images. You can also search and compare faces. Rekognition’s API enables you to quickly add sophisticated deep learning-based visual search and image classification to your applications.

Let’s see what it can do with the dress image:

AWS Rekognition response example

Result: “Human, People, Person, Cardigan, Clothing, Sweater, Blonde, Female..”

Summary
Yeah, pretty similar to Google’s results really. Still not something we can search on, so it looks like maybe my visual product search chatbot will never get off the ground..

Pricing
$1 per 1000 images processed.

Let’s give it one last try.

CloudSight API

CloudSight’s mission is to become the global leader in image captioning and understanding. You can make things more discoverable for your e-commerce site or marketplace through augmented product and image details such as brand, style, type, and more.

How does it deal with our test image?

CloudSight API response example

Result: “women’s brown cowl neckline sleeveless midi dress”

Woah. That’s even more accurate than the original Asos description..!

Summary
We have a winner! But how?! If Microsoft, Google, and Amazon collectively fail to achieve such incredible product specific accuracy, how have CloudSight managed?

Of course, CloudSight don’t go around telling everyone their secret sauce, but if you follow various discussions on Reddit you’ll find several people of the opinion that it’s a mechanical turk – or at the very least a semi-mechanical one; that is, humans in the background, pretending to be a machine, possibly helping to train the underlying AI.

However, there’s also a great article from their CTO about using Amazon EC2 instances with nvidia docker images to expose the GPUs for deep learning, as well as an interesting article on Visual Cognition itself.

Given how limited in terms of scaling it would be to rely on humans for this, I’m pretty certain they’re doing the GPU-based solution.

Pricing
Either solution could certainly explain the high cost per image when compared to Microsoft, Amazon, or Google:

cloud sight pricing

Wiring CloudSight into Botframework

Let’s update the MessageReceivedAsync method to pass the image data over to CloudSight for analysis:

public async Task MessageReceivedAsync(IDialogContext context, IAwaitable<IMessageActivity> argument)
{
    var message = await argument;
    var connector = new ConnectorClient(new Uri(message.ServiceUrl));

    var image = message.
        Attachments?.
        FirstOrDefault(x => x.ContentType.ToLowerInvariant().Contains("image"));

    // if the image was passed as byte data in Content, use that
    var data = image.Content as byte[] ??
        // if not, download it from the ContentUrl
        // (awaited rather than blocking on .Result inside an async method)
        await connector
            .HttpClient
            .GetByteArrayAsync(image.ContentUrl);

    // process the image
    var product = await ProcessImage(context, message, data);
}

To query the CloudSight API you need to build a specific structure for your request; the documentation and SDKs are in python, go, ruby, and objective-c, so hooking it up in C# can be a bit tricky.

Reading through their github repos allowed me to reference the various other implementations and come up with a C# version.

We have to write quite a lot of code here since CloudSight can sometimes take a while to respond, or even time out. I’ve decided to go with just Thread.Sleeping for this example, but you could create an out of band proactive reply if you prefer, or perhaps wire in a webhook.

The key elements are:

Building the CloudSight request

var content =
    new MultipartFormDataContent("Upload----" + DateTime.Now)
    {
        {
        // the image byte data
        new StreamContent(new MemoryStream(data)), 
        "image_request[image]", "image.jpg"
        },
        {
        new StringContent("en-GB"), 
        "image_request[locale]"
        }
    };

var imgClient =
    new HttpClient
    {
        BaseAddress = new Uri("https://api.cloudsightapi.com/")
    };

imgClient.DefaultRequestHeaders.Authorization =
    new AuthenticationHeaderValue("CloudSight", "<your api key goes here>");

Submit the request and check the processing status

// Send the image for processing to /image_requests
var responseMessage =
    await imgClient.PostAsync("image_requests", content);

// Get the token for this request from the response
var jsonimageresponse =
    await responseMessage.Content.ReadAsStringAsync();

// get a dynamic object using Newtonsoft.Json
dynamic imageresponse =
    JsonConvert.DeserializeObject(jsonimageresponse);

// check the image processing status using the token 
// (this is a different endpoint - /image_responses)
var jsonimagestatus = await
        (await imgClient.GetAsync($"image_responses/{imageresponse.token}"))
        .Content
        .ReadAsStringAsync();

dynamic imagestatus = JsonConvert.DeserializeObject(jsonimagestatus);
// it will be in "imagestatus.status"

Check the processing status until we get a result or a timeout

// if it's not Completed or Timed Out yet, wait and poll
while (imagestatus.status != "completed" && imagestatus.status != "timeout")
{
    // wait a couple of seconds
    Thread.Sleep(2000);

    // check the status again
    jsonimagestatus = await
     (await imgClient.GetAsync($"image_responses/{imageresponse.token}"))
     .Content
     .ReadAsStringAsync();

    imagestatus = JsonConvert.DeserializeObject(jsonimagestatus);
}
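The loop above blocks a thread while it sleeps and keeps going for as long as the status is anything other than "completed" or "timeout". A non-blocking, bounded variant might look something like the sketch below. PollAsync is a hypothetical helper written for illustration, and the delay and attempt limit are arbitrary assumptions rather than values recommended by CloudSight.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper for illustration only.
public static class Polling
{
    // Calls fetch() until isDone(result) is true or maxAttempts calls have
    // been made, waiting `delay` between calls without blocking a thread.
    // Returns the last result seen, whether or not it is finished.
    public static async Task<T> PollAsync<T>(
        Func<Task<T>> fetch,
        Func<T, bool> isDone,
        TimeSpan delay,
        int maxAttempts)
    {
        if (maxAttempts < 1)
            throw new ArgumentOutOfRangeException(nameof(maxAttempts));

        var result = await fetch();
        for (var attempt = 1; attempt < maxAttempts && !isDone(result); attempt++)
        {
            await Task.Delay(delay);
            result = await fetch();
        }
        return result;
    }
}
```

Here fetch would wrap the GET to image_responses/{token} and isDone would test for "completed" or "timeout"; the caller still has to decide what to do if the attempt limit is reached first.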

Now let’s pull all of the above chunks of code together into a single method with a few “typing” responses where appropriate:

private static async Task<string> ProcessImage(IDialogContext context, 
                                    IMessageActivity message, 
                                    byte[] data)
{
    // build the request's content - a very specific request structure
    var content =
        new MultipartFormDataContent("Upload----" + DateTime.Now)
        {
            {
            // the image byte data
            new StreamContent(new MemoryStream(data)), 
            "image_request[image]", "image.jpg"
            },
            {
            new StringContent("en-GB"), 
            "image_request[locale]"
            }
        };

    var imgClient =
        new HttpClient
        {
            BaseAddress = new Uri("https://api.cloudsightapi.com/")
        };

    imgClient.DefaultRequestHeaders.Authorization =
        new AuthenticationHeaderValue("CloudSight", "<your api key goes here>");

    // Send the image for processing
    var responseMessage =
        await imgClient.PostAsync("image_requests", content);

    // Get the token for this request from the response
    var jsonimageresponse =
        await responseMessage.Content.ReadAsStringAsync();

    // get a dynamic object using Newtonsoft.Json
    dynamic imageresponse =
        JsonConvert.DeserializeObject(jsonimageresponse);

    // check the image processing status using the token 
    var jsonimagestatus = await
            (await imgClient.GetAsync($"image_responses/{imageresponse.token}"))
            .Content
            .ReadAsStringAsync();

    dynamic imagestatus = JsonConvert.DeserializeObject(jsonimagestatus);

    // prepare the "typing" response..
    var typing = context.MakeMessage();
    typing.Type = ActivityTypes.Typing;

    // if it's not Completed or Timed Out yet, wait and poll
    while (imagestatus.status != "completed" && imagestatus.status != "timeout")
    {
        // not done yet, show the chatbot loading spinner..
        await context.PostAsync(typing);
        Thread.Sleep(2000);

        jsonimagestatus = await
            (await imgClient.GetAsync($"image_responses/{imageresponse.token}"))
            .Content
            .ReadAsStringAsync();

        imagestatus = JsonConvert.DeserializeObject(jsonimagestatus);
    }

    // Did it Complete or Time Out?
    // (start as null so the timeout path still has a value to return)
    string productdescription = null;
    if (imagestatus.status != "timeout")
    {
        // Got a result!
        productdescription = imagestatus.name;
        await 
            context.PostAsync($"Aha - looks like it's a {productdescription}!");
    }
    else
    {
        // Timed Out 
        await 
            context.PostAsync("Ah, couldn't find anything this time, sorry.");
    }
    return productdescription;
}

That should do for receiving an image, submitting it to CloudSight, and getting a product description back (or bombing out).

Now let’s use that to find similar products for the user to spend their money on!

Stage 3: Product Listings

In order to take the product description from CloudSight and turn it into a purchase, I’ve gone with ShopStyle; a shopping platform with an affiliate program. By using ShopStyle I can search on many online fashion shops all at once, and even receive an affiliate sale if anyone clicks through and buys – KACHING!

If you decide to build this for your own shop, you’d just submit the text result from CloudSight to your own site’s search endpoint instead.

The ShopStyle API allows client applications to retrieve the underlying data for all the basic elements of the ShopStyle website, including products, brands, retailers, and categories. For ease of development, the API is a REST-style web service, composed of simple HTTP GET requests. Data is returned to the client in either XML or JSON formats.

ShopStyle Product API

We can hit their api endpoint for product listing with the search term and we’ll get a result that looks like this:

ShopStyle product listing API response

Should be pretty easy to extract product data from that and display a carousel, right? Let’s have a go:

Query ShopStyle

// build the shopstyle query 
var shopClient = new HttpClient { 
 BaseAddress = new Uri("http://api.shopstyle.com/api/v2/") 
};

var jsonproductresponse = await
    (await
        shopClient.GetAsync(
            "products?" +
            $"pid=<your api key goes here>&" +
            $"fts={HttpUtility.UrlEncode(productdescription)}&"+
            "offset=0&limit=10"))
        .Content
        .ReadAsStringAsync();

// create a dynamic object from the json response
dynamic productresponse = JsonConvert.DeserializeObject(jsonproductresponse);
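The query string above is assembled inline; pulling it into a small function makes the encoding easier to check. This is only a sketch: ShopStyleQuery.Build is a hypothetical helper, it uses Uri.EscapeDataString in place of HttpUtility.UrlEncode so that it needs no System.Web reference, and the parameter names (pid, fts, offset, limit) are simply the ones used in the request above.

```csharp
using System;

// Hypothetical helper for illustration only.
public static class ShopStyleQuery
{
    // Builds the relative "products" query, escaping the key and search text.
    public static string Build(string apiKey, string searchText,
                               int offset = 0, int limit = 10)
    {
        if (string.IsNullOrWhiteSpace(searchText))
            throw new ArgumentException("Search text is required.", nameof(searchText));

        return "products" +
               $"?pid={Uri.EscapeDataString(apiKey)}" +
               $"&fts={Uri.EscapeDataString(searchText.Trim())}" +
               $"&offset={offset}&limit={limit}";
    }
}
```

The result would be passed to shopClient.GetAsync in place of the concatenated string. Note that EscapeDataString writes spaces as %20 where HttpUtility.UrlEncode writes +.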

Create a list of Hero cards for the carousel from the dynamic response object:

var productlist = new List<Attachment>();

// show a max of 5 items
int productMax = 
 productresponse.metadata.total < 5 ? 
 productresponse.metadata.total : 5;

for (var i = 0; i < productMax; i++)
{
    // create a link to the product as a Card Action
    var buttons = new List<CardAction>
    {
        new CardAction
        {
            Title = "View details",
            Type = "openUrl",
            Value = productresponse.products[i].clickUrl
        }
    };

    // try to get an image if there is one
    var imgs = new List<CardImage>();
    string img = 
        productresponse
        .products[i]?
        .image?
        .sizes?
        .XLarge?
        .url;

    if (!string.IsNullOrEmpty(img))
    {
        imgs.Add(new CardImage(img));
    }

    // add the Card Action and the image to a Hero Card attachment
    var attachment = new HeroCard
    {
        Text = productresponse.products[i].name,
        Images = imgs,
        Subtitle = productresponse.products[i].priceLabel,
        Buttons = buttons
    };
    productlist.Add(attachment.ToAttachment());
}
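The loop above trusts metadata.total and reads every field through a dynamic object. A slightly more defensive way to pull out the same fields is sketched below. To keep it self-contained it swaps Newtonsoft.Json for System.Text.Json (an assumption about your runtime), the ProductSummary and ShopStyleResponse names are hypothetical, and the JSON shape is inferred only from the fields used in the code above (products, name, priceLabel, clickUrl, image.sizes.XLarge.url).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;

// Hypothetical types for illustration only.
public sealed class ProductSummary
{
    public string Name { get; set; }
    public string PriceLabel { get; set; }
    public string ClickUrl { get; set; }
    public string ImageUrl { get; set; } // null when there is no XLarge image
}

public static class ShopStyleResponse
{
    // Returns at most `max` products, based on what is actually in the
    // "products" array rather than on the reported total.
    public static List<ProductSummary> TopProducts(string json, int max = 5)
    {
        using (var doc = JsonDocument.Parse(json))
        {
            JsonElement products;
            if (!doc.RootElement.TryGetProperty("products", out products) ||
                products.ValueKind != JsonValueKind.Array)
            {
                return new List<ProductSummary>();
            }

            return products.EnumerateArray()
                .Take(max)
                .Select(p => new ProductSummary
                {
                    Name = GetString(p, "name"),
                    PriceLabel = GetString(p, "priceLabel"),
                    ClickUrl = GetString(p, "clickUrl"),
                    ImageUrl = GetString(p, "image", "sizes", "XLarge", "url")
                })
                .ToList();
        }
    }

    // Walks a property path and returns the string at the end of it,
    // or null if any step is missing or the value is not a string.
    private static string GetString(JsonElement element, params string[] path)
    {
        foreach (var name in path)
        {
            JsonElement next;
            if (element.ValueKind != JsonValueKind.Object ||
                !element.TryGetProperty(name, out next))
            {
                return null;
            }
            element = next;
        }
        return element.ValueKind == JsonValueKind.String ? element.GetString() : null;
    }
}
```

Each ProductSummary could then be mapped onto a HeroCard in the same way as in the loop above, skipping the image when ImageUrl is null.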

Respond with a carousel:

// create the carousel from the product list  
var carousel = context.MakeMessage();
carousel.Attachments = productlist;
carousel.AttachmentLayout = AttachmentLayoutTypes.Carousel; // or List
carousel.Text = "Similar products - tap to buy";

await context.PostAsync(carousel);

Let’s pull that all together into a single method with some extra messages to the user:

private static async Task ShowProductListing(IDialogContext context, string productdescription)
{
    await context.PostAsync($"Now looking for similar products - brb!");

    // build the shopstyle query 
    var shopClient = new HttpClient { 
        BaseAddress = new Uri("http://api.shopstyle.com/api/v2/") 
    };

    var jsonproductresponse = await
        (await
            shopClient.GetAsync(
                "products?" +
                $"pid=<your api key goes here>&" +
                $"fts={HttpUtility.UrlEncode(productdescription)}&"+
                "offset=0&limit=10"))
            .Content
            .ReadAsStringAsync();

    // create a dynamic object from the json response
    dynamic productresponse 
        = JsonConvert.DeserializeObject(jsonproductresponse);

    // did we find any results?
    if (productresponse.metadata.total > 0)
    {
        await
            context.PostAsync($"I found {productresponse.metadata.total} items!");

        var productlist = new List<Attachment>();

        // show a max of 5 items
        int productMax = 
            productresponse.metadata.total < 5 ? 
            productresponse.metadata.total : 5;

        for (var i = 0; i < productMax; i++)
        {
            // create a link to the product as a Card Action
            var buttons = new List<CardAction>
            {
                new CardAction
                {
                    Title = "View details",
                    Type = "openUrl",
                    Value = productresponse.products[i].clickUrl
                }
            };

            // try to get an image if there is one
            var imgs = new List<CardImage>();
            string img = productresponse
                        .products[i]?
                        .image?
                        .sizes?
                        .XLarge?
                        .url;

            if (!string.IsNullOrEmpty(img))
            {
                imgs.Add(new CardImage(img));
            }

            // add the Card Action and the image to a Hero Card attachment
            var attachment = new HeroCard
            {
                Text = productresponse.products[i].name,
                Images = imgs,
                Subtitle = productresponse.products[i].priceLabel,
                Buttons = buttons
            };
            productlist.Add(attachment.ToAttachment());
        }

        // create the carousel from the product list  
        var carousel = context.MakeMessage();
        carousel.Attachments = productlist;
        carousel.AttachmentLayout = AttachmentLayoutTypes.Carousel; //or List
        carousel.Text = "Similar products - tap to buy";

        await context.PostAsync(carousel);
    }
    else
    {
        await 
            context.PostAsync($"Sorry, didn't find anything. Try again?");
    }
}

Stage 4: Wiring it up

Now that I’ve managed to find an image recognition solution that’s surprisingly accurate, and a fashion e-commerce api that takes an arbitrary search term, connecting them together should give us a basic bot that can take an input image and return products in a nice carousel to browse and buy.

This entire proof of concept consists of just 3 methods: MessageReceivedAsync, ProcessImage, and ShowProductListing.

In the main MessageReceivedAsync method I’ve added in a little sanity checking and the ability to bypass image recognition if there’s no image and the message contains some text.

public async Task MessageReceivedAsync(IDialogContext context, IAwaitable<IMessageActivity> argument)
{
    var message = await argument;

    // default to message text, so we can bypass the image recognition
    var product = message.Text;

    // is there an image in the message?
    var image = message.
        Attachments?.
        FirstOrDefault(x => x.ContentType.ToLowerInvariant().Contains("image"));

    // if so, fire off image recognition
    if (image != null)
    {
        var connector = 
            new ConnectorClient(new Uri(message.ServiceUrl));

        // if the image was passed as byte data in Content, use that
        var data = image.Content as byte[] ??

                   // if not, download it from the ContentUrl
                   // (awaited rather than blocking on .Result)
                   await connector
                       .HttpClient
                       .GetByteArrayAsync(image.ContentUrl);

        product = await ProcessImage(context, message, data);
    }

    // find matching products and display them
    await ShowProductListing(context, product);

    context.Wait(MessageReceivedAsync);
}

The End Result

Summary

As you can see, product specific image recognition can be exceptional in some cases; pretty much spot on for this proof of concept. The image description is so specific that in some cases it means the product search doesn’t find anything; this is more a failure of the product search functionality though.

Unfortunately, it’s quite cost-prohibitive – just to break even with CloudSight I’d need to have 7500 affiliate clicks with a minimum of 4 cents earned per click each month. That doesn’t take into account hosting costs (receiving and sending the image data around could be charged depending on your hosting solution).
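Taken at face value, those figures imply a CloudSight bill of roughly $300 a month (7500 clicks at 4 cents each); that number is inferred from the sentence above rather than quoted from CloudSight's price list. A tiny sketch for trying other assumptions, with BreakEven.ClicksNeeded as a hypothetical helper:

```csharp
using System;

// Hypothetical helper for illustration only.
public static class BreakEven
{
    // Clicks needed per month to cover a fixed monthly cost, rounded up.
    public static int ClicksNeeded(decimal monthlyCost, decimal earningsPerClick)
    {
        if (earningsPerClick <= 0)
            throw new ArgumentOutOfRangeException(nameof(earningsPerClick));

        return (int)Math.Ceiling(monthlyCost / earningsPerClick);
    }
}
```

For example, BreakEven.ClicksNeeded(300m, 0.04m) gives 7500, matching the figure above, while a payout of 5 cents per click would bring it down to 6000.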

Maybe the ability to “train” an image recognition solution using a seed set of image data, such as your own product catalogue, would help. It certainly would be interesting to try. (*Ahem* Any forward-thinking e-commerce companies out there willing to pay me to do this, please get in contact!)

But why do this as a chatbot? Why not a web page? Anecdotally, I’ve found people happier to upload an image in a chat than submit it to a webpage. Also, I personally feel that the asynchronous nature of a conversational style allows for a more natural interface, in this particular scenario. It would be interesting to try both approaches out and user test them.

So where to go next with this?

In this particular concept, I’d look to improve the conversation flow; e.g. not rely on just an image but also ask for more information from the user if possible. Maybe I’ll put it in the wild and see how many people click through and possibly purchase in a month.

Some CloudSight coolness to end with

I’ll leave you with these CloudSight responses that just blew me away; how on earth did it come up with this for another few of my test images of friends’ clothing? It makes me think that CloudSight is actually using humans after all!

Well, of course that's what it is..

Wat?!

Mammut beanie

That is a Mammut beanie..

Harley Quinn tee

Stef is wearing a Harley Quinn tee under a black zip-up hoodie. This is crazy.

What do YOU think CloudSight are using?.. I’d love to hear theories..

5 thoughts on “Virtual Shop Assistant Chatbot with Amazing Image Recognition”

  1. Hi. Thank you for sharing. The content is very useful. Can I access the project source? I need this for an academic study

  2. User uploads image
    Bot processes the image to find a product
    Bot responds with the description of the product it found in the image
    Bot finds similar products for sale
    Bot returns products in carousel (with affiliate links!)
    User taps the item they would like
    Bot redirects user to that item’s website for purchase

    Seems like the “Bot” in here is a website that calls an API. How is this a bot? I mean how can a bot redirect a user?

    • > Bot redirects user to that item’s website for purchase

      By “redirect” I mean “opens web page”… *ahem*

      Everything up until that point is the bot.
