
DIGITAL ASSISTANTS/CHATBOTS – A BRAND’S BEST FRIEND?

INTRODUCTION

Background

Microsoft CEO Satya Nadella recently stated at their Worldwide Partner conference that “chatbots will

fundamentally revolutionize how computing is experienced by everybody”.

Digital Assistants (DAs) and chatbots define a shift to a more conversational style of interacting with

internet based services.

At its most primitive, a DA might be a simple search interface built upon an existing FAQ database. At

its more advanced, a DA would use contextual information and artificial intelligence (AI) to calculate a

response tailored specifically for an individual.

It is envisaged that these conversational interfaces will not be staffed with human call centre

operators but by cloud based servers running sophisticated artificial intelligence routines. In the short

to medium term, however, it is likely that the AI routines will hand off to a human operator once some

initial conversation filtering has taken place.

Until very recently it was difficult for a brand or a market researcher to participate directly in the

creation and distribution of DAs, as companies such as Apple (SIRI) and Microsoft (Cortana) were

closed platforms. A lot changed when Facebook announced at its 2016 developer conference that

anyone could build their own DA within Facebook Messenger. Facebook is not the only company

opening up to developers. Line (Japan), WeChat (China), KIK (US teens) also operate chatbot stores.

The use of DAs and chatbots offers the potential to reach billions of customers with a

consistent conversational experience.

So what is the difference between a chatbot and a Digital Assistant? A DA is akin to an electronic

helper; it is the butler working for you. Chatbots are like dealing with a company representative

helping with a specific task such as booking travel or buying insurance. Facebook’s stance is that they

make little differentiation between conversing with a human and a chatbot. Both are resident and

searchable in Facebook’s contacts book. The fact that Facebook provides an open platform means

that a brand could develop a general purpose DA as a chatbot if they wanted to. For instance, Nike

could create a general fitness DA, broader in scope than solely e-commerce sportswear.

Why should brands be interested?

There are two key opportunities for brands.

Create a standalone chatbot within a third party platform (such as Facebook Messenger or

WeChat).

Work with the large internet companies who run general DAs such as Apple (SIRI) or

Microsoft (Cortana) to ensure that they have access to and can interact with the brand’s best

content.

Chatbots offer a number of specific advantages:

Chatbots will move brand-to-consumer service conversations, which often occur

publicly on Facebook and Twitter pages, into a private and more controlled chat

environment. Chatbots can learn, and profile the user over time, as Messenger platforms

provide a persistent ID.

Chatbots offer significant scalability. For instance, a Facebook chatbot can effectively

connect a company’s customer service system with 900 million people within a consistent

User Interface (UI). Chatbots are therefore a potential challenger to telecom customer

care lines (see Figure 1)

Chatbots allow the seamless sending of documents and forms. For instance, a digital

telecoms sales assistant could forward contract documentation directly to the user in real

time and within the chat (without needing to ask for a postal or email address).

In short, digital agents play to automation and scale opportunities.

Why should market researchers be interested?

Messenger-based chatbots offer similar advantages to the market research industry.

Messenger platforms have a significant and increasing population reach. The largest of these

platforms now have more active users than social networks (see Figure 2)

A significant youth population engages with these platforms

Research can now be presented in a UI framework already understood by millions of people

(this will also present a challenge to existing well known question types and formats)

FIGURE 1. GLOBAL MESSAGE VOLUME TELCO VERSUS FACEBOOK MESSENGER AND WHATSAPP

FIGURE 2. MONTHLY ACTIVE USERS FOR TOP 4 SOCIAL NETWORKS ** AND MESSAGING APPS **


Testing the hypothesis

When considering the notion of “the Digital Assistant as a brand’s best friend?”, a number of high level

questions come to mind:

Do people want to converse with DAs in the first place?

How do people converse with brands via a digital channel such as a Facebook brand page?

How do people converse? Is it the same as with a friend?

What sort of subjects would someone want to converse with a DA about?

Even though AI techniques have advanced significantly since the launch of SIRI in 2011, is

the goal of fulfilling a truly open conversation using AI still a twinkle in the developer’s eye?

I had several hypotheses I wanted to test:

Conversations with a DA are likely to be succinct. This would mean that a DA needs to

accurately interpret intent and have the content on hand to satisfy the request. If this was not

the case it could be to the detriment of the brand experience.

Conversations between a brand and a consumer via a digital channel are already brief

Current interactions with a DA use simple question conventions which do not require

contextual information

Given the challenge of creating a conversational UI, it is likely that interactions need to be

highly templated

As I did not have access to one definitive single source of data to answer all these questions, I carried

out a series of experiments to test the hypotheses and attempt to answer the core question.

Usability test: 10 London-based panellists were invited to interact with a voice based DA

(Apple’s SIRI and Google’s Voice Search). Tasks were administered by an on-site moderator.

SIRI + Messenger analysis: examination of SIRI and Messenger usage on Kantar’s US

mobile behavioural panel. In the case of SIRI, the aim was to deduce the current types of

interactions occurring. For Facebook Messenger, the work involved looking at the session

lengths to deduce the amount of time spent in this mobile environment.

Facebook Brand Page analysis: an examination of interactions taken from 13,658 posts

made to the Facebook pages of 15 UK retailers during the period 1 November to 31

December 2015. The objective was to analyse the types of conversations occurring and their

implications for DAs.

Messenger chatbot analysis: I created a working chatbot called Kat and had 70 of my

colleagues interact with it. The chatbot had a mixture of closed and open questions. All data

from the interactions were recorded and stored in a database for analysis.


METHOD I: USABILITY TEST “HOW DO PEOPLE INTERACT WITH MOBILE BASED

DIGITAL ASSISTANTS?”

The usability test involved 10 London-based respondents undertaking a series of voice-triggered

brand related tasks using SIRI or Google Voice.

The questionnaire was managed by a moderator, and a video of the exercise was recorded. The

mobile phone, either an iPhone (for SIRI), or a Samsung (Google Voice) was provided to the

respondent.

The script was broken into three sections. At the end of each section the respondent was asked for

their thoughts on the interaction.

Location based questions:

Can you find the nearest Pizza Express?

Can you find the nearest Costa Coffee?

Can you recommend a good Indian restaurant?

Where is the nearest petrol station?

Where is the nearest Aldi supermarket?

Where is the nearest free Wi-Fi hotspot?

Weather and time questions:

What is today’s weather forecast?

What is the temperature right now?

What will the weather be like tomorrow?

What is the weather forecast for the weekend?

What is the time in New York?

What is the time in Moscow?

Information searches:

Can you find me the best Windows tablets?

Find me the best baked chicken recipes.

Find me the best smartphone deals.

Find me news about Coca Cola.

Show me upcoming movies.

Find me new car deals.

Find me new VW car deals.

Find me this week’s ASDA deals.

Location Tasks: Summarized Observations

Nearest…

The first two tasks were simple ‘find the nearest’ requests. All respondents had success with this task.

Respondent X insisted on just saying the brand i.e. “Pizza Express”, “Costa Coffee”. His reasoning

was that because he was on a mobile device with GPS, it would automatically know his location and

deduce that he would want the closest one. He was correct as he got the same responses as the

other respondents.

“Good Indian restaurant”…


There was some discussion re whether ‘good’ was the right question, as no one would search for a

‘bad’ Indian restaurant. There were concerns at the sort order of the results. It seemed to sort them in

order of location rather than quality or rating.

SIRI provided TripAdvisor reviews, although some respondents noted that there were so few reviews

for some restaurants that they were not useful.

Respondent V asked for 5 star rated restaurants, instead of a “good restaurant”. He thought stars

were the accepted rating currency of restaurant reviews.

Respondent X would not use the term “good”. He thought that by using this term, Google would return

paid for search results. Instead, he just said “Indian Restaurant” and received the same results as the

others.

Respondent Y didn’t like the fact that SIRI would not allow for further tuning of the response. He

wanted it to be more ‘conversational’ and its response to be able to be tuned.

Overall, in terms of the ‘nearest’ requests, respondents thought that the DAs struggled with more

complex tasks, particularly where weighting is required (i.e. location versus quality of goods).

As regards the weather and time questions, respondents found that the DAs performed very well on

these tasks. Some of the respondents commented that when they voiced a question, they expected a

voice response rather than text on the screen.

In terms of information searches, the information is best summarised as follows:

Best Windows tablets

Respondents found the question too generic. Respondent X said he would only trust results from reputable sources such as PC Magazine.

Baked chicken recipes

Respondents thought this worked well, though Respondent Y would have preferred the result to be a YouTube video. Respondent X pointed out that he would be more likely to use a PC or Tablet for this type of task.

Best smartphone deals and best new car deals

It was thought that these questions were too vague. Both searches linked to comparison sites or local retailers with the term “best” ignored. One person struggled with the smartphone question as SIRI repeatedly insisted the only phone worth having is an iPhone.

New VW car deals

A mixed bag, SIRI linked to VW dealer sites. Google performed better by linking to the deal section of the VW site. However, every respondent would have preferred to see the deal first hand rather than having to click on a link.

This week’s ASDA deals

The results of this task were interesting for several reasons. Firstly, for half the respondents, the DA did not understand what “ASDA” was. Perhaps “ASDA” is phonetically confusing for an AI. Secondly, when the DA did understand, it linked to a deal site as opposed to the ASDA website.

Further respondent comments from the usability testing

“If I do a location search I want a map in the result (Siri didn't always do this)"

“I like it when Siri replies to me (via voice), not when it gives me a list of tiny links"


“I want it more conversational, like if SIRI asks me additional questions, to help it get me the best

information"

“I only use Siri at home - I'd be too embarrassed when I go shopping"

“If I ask for the best restaurant near me, it gives all restaurants, no matter how bad the review (which

it also shows in the results). Just give me the ones with good reviews you have this information"

“If I ask for a baked chicken recipe, or how to remove a stain, why not link straight to a 'how to' video

on YouTube from a reputable brand?”

“I need to adjust the way I talk, so Google understands me".

“I like it when there is a photo in the result”


Findings

Half of the respondents already use a DA voice service; however, they only do so at home. For one of

the respondents, English was their second language and the DA would often fail to understand them.

When carrying out such UI tests, it is worthwhile to ensure that respondents are participating in their

first language.

One struggled due to having no prior experience with a DA.

Three key points emerged from the usability work. Firstly, it was notable that more experienced users

would continually reword the question to a format they thought would work best. Often this involved

simplifying the question and trusting the DA (thanks to location sensors and contextual information) to

fill in the gaps.

Secondly, respondents did not expect a DA to handle difficult questions. However, they did expect

high quality results: maps if required, deals surfaced, results ordered correctly and

recipes in video form.

Thirdly, they wanted the results to be conversational. In short, if they talked to a DA, they wanted the

DA to talk back.

Two respondents expressed their surprise at the improvements in Siri. One commented “Wow, it

understands me way better than it used to”. They had tried SIRI when it was first launched and never

came back to it. All respondents found it frustrating when the DA did not understand them.

Can voice activated Digital Assistants be considered a brand’s best friend?

In the US, Google says that 20% of its queries on a mobile device are voice searches(1). The fact that

a voice DA struggled to understand “ASDA” should be of concern for that brand.

Respondents also clearly wanted information to be surfaced without clicking through to web links.

They expected the DA to filter out content and only provide the best response. DAs did this well for

simple time and weather queries but not for more complex questions.

However, the real issue is that at the time of testing Apple and Google had complete control over the

DA, with no direct way for brands to participate. For instance, ASDA would have to

make a formal request to Google for its speech recognition system to recognize the ASDA

brand.

Under the current circumstances, the best a brand can do is to format their content in a way that

mirrors the approach used by Google or Apple to find and index content. One possible technique

could be to use Google’s keyword suggestion tool to find the types of searches and frequency that

appear to be conversational.

Finally, the testing shows that while brands provide content, they are not involved in the conversation.

As a result we cannot consider DAs such as SIRI to be a brand’s best friend.

(1) http://searchengineland.com/google-reveals-20-percent-queries-voice-queries-249917


METHOD II: “WHAT CLUES CAN MOBILE BEHAVIOURAL DATA GIVE US ABOUT

DIGITAL ASSISTANT USAGE?”

To answer this question I reviewed app and web data logged from Kantar’s mobile behavioural panel

to isolate and quantify:

SIRI usage patterns

Mobile Search terms indicative of an interaction that a digital assistant might be involved in

Mobile Messenger usage patterns

SIRI Usage Patterns

To get an idea of how people are using DAs in practice, I identified a group of more than 3000

panellists who have used SIRI over the past year. Whilst I could not measure when a panellist had

made an internal device call, like setting an alarm, I could evaluate searches that require SIRI to

connect to one of its content partners (i.e. Bing for web searches, Wikipedia for factual information,

Wolfram Alpha if calculations are required).

I was able to capture almost 70K of these SIRI connections and classify their purpose. This revealed

that 63% of the searches from SIRI resulted in a Bing search for information. The next largest

category was Maps/Location at 29% of searches (Figure 3).

FIGURE 3. BREAKDOWN OF SIRI USAGE FROM 70K INTERACTIONS RECORDED FROM 3K US MOBILE PANELISTS

For 10% of SIRI searches processed by Bing (~4.5 K photos), a photo was also displayed, and I was

able to capture and analyse these. Often photos are displayed if a factual question is posed of SIRI,

particularly if the response is sourced from Wikipedia (Figure 4).

FIGURE 4. BREAKDOWN OF PHOTOS SENT TO A RESPONDENT AS A RESULT OF A SIRI INTERACTION

Figure 3 values: Bing Searches 63%; Maps/Location 29%; Wolfram Calculations 8%; iTunes (find music) 0%.

Figure 4 values: Sports Faces 48%; Sports Logos 18%; Movies 18%; Nature 6%; Famous/Celeb 4%; History or Maths 3%; Corporate Logos 2%; Music 1%; Flags 0%.


For the photos that were captured, sports data figured prominently. The largest category was that of

portrait photos of sports stars. It is very likely this would have been for searches relating to player

stats and information and that this sports related information service has been heavily integrated into

SIRI. Sports logos and movie posters were the next largest photo categories.

The nature category was also interesting with sharks, spiders and snakes featuring prominently!

However, as interesting as the photos were, they only represented 10% of the Bing calls, and give

only an indication of the things people search for, and the visuals they are used to experiencing.

Mobile Searches

Whilst I did not have access to SIRI/Bing search queries made via SIRI, I did have access to Google

mobile searches. Whilst I could not separate whether these searches were initiated via voice or text, I

could find searches that could be brokered by a DA. To do this I took inspiration from Google’s work

on ‘Micro Moments’, defined as the instant when someone reaches for their mobile device to find

something out. Two of the specific moments that Google said a brand should look for are “How…?”

moments and “Near….?” moments.

After isolating searches that contained “How…?” (3% of searches on our panel) and “Near..?” (1% of

searches), I was able to analyse the results for word frequencies.
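The filtering and counting step can be sketched roughly as follows. This is a minimal illustration rather than the panel pipeline itself: the sample queries, the stop word list and the function names are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical stop list; the analysis above does not say which words were excluded.
STOP_WORDS = {"how", "near", "me", "to", "do", "i", "a", "the", "my", "on", "in", "is", "of", "from"}


def contains_word(query, word):
    """Return True if `word` appears as a whole word in the query (case-insensitive)."""
    return re.search(rf"\b{re.escape(word)}\b", query, flags=re.IGNORECASE) is not None


def word_frequencies(queries, trigger):
    """Return (share of queries containing the trigger, counts of the other words in them)."""
    counts = Counter()
    matched = 0
    for query in queries:
        if not contains_word(query, trigger):
            continue
        matched += 1
        for token in re.findall(r"[a-z0-9']+", query.lower()):
            if token not in STOP_WORDS:
                counts[token] += 1
    share = matched / len(queries) if queries else 0.0
    return share, counts


# Made-up queries, for illustration only.
sample_queries = [
    "hotels near me",
    "cheap pizza near airport",
    "how do I connect YouTube from phone to TV",
    "weather tomorrow",
]

near_share, near_counts = word_frequencies(sample_queries, "near")
how_share, how_counts = word_frequencies(sample_queries, "how")
print(near_share, near_counts.most_common(3))
```

Matching on whole words keeps a query such as "nearby shops" out of the "Near" group; whether the original analysis drew the line there is not stated.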

Near..?

Google searches containing “Near..?” related strongly to hotels and restaurants/shopping (figure 5).

FIGURE 5. WORD FREQUENCY FOR MOBILE SEARCHES THAT CONTAIN “NEAR”

Figure 5 words, in chart order: HOTELS, RESTAURANTS, STORES, STORE, FOOD, APARTMENTS, CAR, PIZZA, CHEAP, SHOPS, RESTAURANT, SHOP, AIRPORT, MALL, NORTHLAKE, SERVICE, CHINESE, PARKING, BREAKFAST, OPEN, MOVIE, SOUTH, SALE, ICE, BANK, NEW, ROSA, KOHLS, MOTEL, REPAIR, PARK, TARGET, GAS, RENT, DRIVE, STATION, BUFFET, CREAM, WATER, BARS, DELIVERY, POST, DEPOT, NY, BEACH, SCHOOLS, JAPANESE, SALON, UNIVERSITY, BEST, BUY, CENTER, GOLF, WALMART.


How..?

Interestingly, TV was the most searched item (figure 6).

Particularly apparent were technical questions such as:

“How do I connect YouTube from Phone to TV?” Phone to TV connectivity was a significant

trend in the data.

“How do I edit my contacts on my Samsung Galaxy?”

Are device makers now ceding the customer relationship to search engines?

What was striking was the number of technical questions relating to mobile devices that are being fed

through search engines. This in turn begs the question of whether device manufacturers should be

on-boarding this information into the device or creating a technical support chatbot.

FIGURE 6. WORD FREQUENCY FOR MOBILE SEARCHES THAT CONTAIN “HOW”

Findings

It was surprising how many SIRI requests are processed by the Bing search engine. I had expected

that requests for directions via Apple would be the largest usage category.

For sports brands it is worth noting that SIRI brokers a significant amount of sporting related

questions and that images were often returned as part of these.

Once again, as brands are not directly involved in these DA consumer interactions, SIRI cannot be

considered a brand’s best friend.

The brief analysis of ‘How’ and ‘Near’ searches did show how companies are inadvertently ceding

their consumer interactions to search. It’s a risky strategy for two reasons. Firstly, the search engine

could easily pass the enquiry to a competitor as a result of sponsored advertising. Secondly, brands

miss the opportunity to be involved in and learn from these interactions.

Figure 6 words, in chart order: TV, FIX, MONEY, WORK, REMOVE, CAR, ANDROID, PLAY, PHONE, FREE, HAIR, IPHONE, COOK, RESET, APP, XBOX, CARD, WATER, CLEAN, UNLOCK, GALAXY, CALORIES, BABY, DOG, HOME, INSTALL, ONLINE, BECOME, GOOGLE, FACEBOOK, OPEN, WORTH, BOX, GROW, MUSIC, OIL, CONNECT, MOVIES, PAY, WINDOWS.


Messenger Data

Given the significant scale of messenger apps and the fact that the platforms have begun opening up

to third party developers via “Bot Stores”, it is highly likely that messengers will be the key distribution

channel for branded DAs.

I thought it worthwhile to quantify the messenger app usage patterns of 3.5K mobile panellists for the

month of March 2016. For this exercise I analysed Facebook Messenger patterns (Figures 7 and 8).

The data shows that the average individual interaction/session out of home is 85 seconds versus in

home 113 seconds. Notable is how brief the average session is, especially when out of home.
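As a rough sketch of how session figures like these can be derived, the snippet below averages session lengths by location and bands them into 30-second buckets. The session records are made-up values for illustration only, and the field layout and function names are assumptions rather than the panel's actual schema.

```python
from collections import Counter, defaultdict
from statistics import mean


def average_by_location(sessions):
    """sessions: iterable of (location, seconds) pairs -> {location: mean seconds}."""
    grouped = defaultdict(list)
    for location, seconds in sessions:
        grouped[location].append(seconds)
    return {location: mean(values) for location, values in grouped.items()}


def bucket_shares(sessions, width=30):
    """Share of sessions falling in each duration band, e.g. '0-30', '30-60'."""
    bands = Counter()
    for _, seconds in sessions:
        lower = int(seconds // width) * width
        bands[f"{lower}-{lower + width}"] += 1
    total = sum(bands.values())
    return {band: count / total for band, count in bands.items()}


# Illustrative records only; these are not panel data.
example_sessions = [
    ("out_of_home", 40),
    ("out_of_home", 100),
    ("in_home", 90),
    ("in_home", 150),
]

averages = average_by_location(example_sessions)
shares = bucket_shares(example_sessions)
print(averages, shares)
```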

FIGURE 7. FB MESSENGER AVERAGE SESSION LENGTH BY MESSENGER IN HOME VERSUS OUT OF HOME

FIGURE 8. FB MESSENGER FREQUENCY OF INDIVIDUAL MESSENGER ACTIVITY LENGTHS

Findings

The fact that the data shows messenger sessions to be very brief means that brand consumer

interactions will need to be succinct. It is conceivable that the interactions will be in chains and while

each link in the chain might be brief, a single conversation could conceivably last some time.

Unfortunately our mobile behavioural data can only isolate app usage and not individual

conversational chains.

To address the issue of chain measurement, Ted Livingston, CEO of KIK (a youth orientated

messenger with 300 million active users) has proposed that messenger conversations require a new

set of metrics.

Active: A chat on one topic, where interaction, responses happen in rapid fire (i.e. a <= second

interval between messages). This could be an intense chat between girlfriend, boyfriend.

Passive: An on-again off-again conversation. I.e. continued tweaking of a travel arrangement.

Sporadic: Occasional messages, sent during the day, or week. This might be the style of

conversation you would expect with an entertainment service.

Figure 7 axes: session seconds (0 to 140) by app (Facebook Messenger, Snapchat, Google Talk, Kik, WhatsApp, GroupMe, TextNow, Android Messenger, Pinger); series shown: out of home.

Figure 8 values: 0-30 seconds 52%; 30-60 seconds 18%; 60-90 seconds 5%.

It is likely that messenger based brand interactions such as customer service would tend to be in the

“Active” camp; however, we do not have data to prove this.

The DA should also be contextually aware and reduce the requirement for user input, if it detects the

individual is out and about and likely to be distracted.

I believe that, due to their sheer reach, chatbots will be a brand’s best friend when used in a

messenger environment. However, the data indicates that interactions will need to be well designed

and accurate. If messenger session times swell due to the chatbot being non-intuitive, it is easy to

imagine it becoming a source of frustration.


METHOD III: WHAT DOES A DIRECT DIGITAL BRAND TO CONSUMER

CONVERSATION LOOK LIKE?

One of the limitations with method II was that the behavioural data did not allow us to analyse

conversation chains. Facebook brand pages, however, provide an avenue to collecting this type of

data.

For this exercise we extracted 13,658 posts made to the Facebook pages of 15 UK retailers during

the period 1 November to 31 December 2015.

We define a post as when a user sends a new comment to a brand page. For each of these posts we

can extract the ensuing conversation chain. Taking both the number of posts and chains together

gave us a total of over 105,000 comments to analyse.

Each comment in a chain is tagged as one of these three types:

u = user (the person who first posted to the page and started a conversation chain)

p = page owner (i.e. the brand or business who operates the Facebook page)

o = other (other users who have decided to comment within the conversation chain)

FIGURE 9. EXAMPLE POSTS, COMMENTS, CHAIN LENGTH

Once each post had been tagged, we were left with a conversation signature. We then aggregated

the occurrence of these signatures per brand.
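The tagging and aggregation steps can be illustrated with a short sketch. It assumes each post is available as a record holding the brand, the posting user and the ordered list of comment authors, and that the page owner comments under the brand name; those field names and the sample posts are hypothetical.

```python
from collections import Counter, defaultdict


def signature(post_author, page_owner, comment_authors):
    """Build a signature such as 'upu' for one post and its comment chain."""
    tags = ["u"]  # the opening post is always written by the originating user
    for author in comment_authors:
        if author == post_author:
            tags.append("u")
        elif author == page_owner:
            tags.append("p")
        else:
            tags.append("o")
    return "".join(tags)


def signature_shares(posts):
    """Return {brand: {signature: percent of that brand's posts}}."""
    counts = defaultdict(Counter)
    for post in posts:
        sig = signature(post["author"], post["brand"], post["comments"])
        counts[post["brand"]][sig] += 1
    return {
        brand: {sig: 100 * n / sum(sigs.values()) for sig, n in sigs.items()}
        for brand, sigs in counts.items()
    }


# Made-up posts, for illustration only.
example_posts = [
    {"brand": "Boots", "author": "alice", "comments": []},
    {"brand": "Boots", "author": "bob", "comments": ["Boots", "bob"]},
    {"brand": "Boots", "author": "carol", "comments": ["dave", "Boots"]},
    {"brand": "Boots", "author": "erin", "comments": []},
]

brand_shares = signature_shares(example_posts)
print(brand_shares)
```

A signature of "u" on its own therefore marks a post that received no reply at all, which is the case the findings below single out.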


Findings

The table below (Figure 10) shows a brief extract of that data. For instance you can see on the Boots

Facebook page the most frequent signature was where the user posted and received no response

(32% of the time).

FIGURE 10. SAMPLE EXTRACT FROM UK RETAILER FB PAGE DATA

Signatures: % of signature occurrence per brand

Signature  Boots  LidlUK  AldiUK  Tesco  Marks and Spencer  All Retail Brands Combined
u          32     49      12      7      35                 26
up         24     27      35      29     18                 22
upu        5      11      13      5      6                  7
upup       9      0       0       9      4                  5
uo         2      1       1       1      4                  3
uop        3      0       2       3      2                  2
upo        1      1       3       2      1                  1
upupu      1      0       0       1      1                  1
uou        0      0       1       0      1                  1
upupup     1      0       0       2      1                  1
uoo        1      0       0       0      2                  1
uoup       2      0       0       1      1                  1
upuo       1      1       1       0      0                  1
upuu       0      1       1       0      1                  1
uu         1      0       0       0      1                  0
upou       0      0       1       1      0                  0
uup        0      0       0       1      0                  0
uopu       0      0       1       0      1                  0
uooo       1      0       0       0      1                  0
upuup      0      0       0       1      1                  0

Looking across all retail brands the most common chain was a user post and no brand response

(26%) followed by a user post and a single brand response (22%).


What was notable in the data was the significant long tail of chain types recorded: 942 distinct

signatures across the 13K posts. There are several factors that need to be considered when making message chain

measurements. For instance, brands that promote their Facebook page as a communication channel

will receive more messages and therefore more variety in conversation chains. Brands which do not

respond to initial user posts would also not be expected to receive a variety of chain types.

Tesco (Figure 11) received the most posts and also had the lowest number of unanswered posts. For

this dedication they also had to manage by far the largest variety of conversation chains. This would

certainly have cost implications.

FIGURE 11. COUNT OF CHAIN TYPES VERSUS % UNANSWERED POSTS

Brands which do not respond to user requests face the issue that the user’s friends will often hold

conversations on the brand’s page without any brand involvement. For example, on the Lidl UK page

there were 10 public conversations between a user and their Facebook friends without any brand

intervention.

A brand’s best friend?

The obvious advantage of chatbots in this instance is that communication is private.

We did not have access to determine the number of private versus public brand page interactions.

KLM airways shared this graph (Figure 12) that indicated they receive 7 times more private Facebook

messages than public ones.

FIGURE 12. KLM FACEBOOK SOCIAL CUSTOMER CARE MESSAGE TYPES


However, the data indicate that if a chatbot is well promoted and successful it will garner a significant

number of messages. In addition, if these interactions are not templated there will be a significant

variety of conversations to manage.

To be a best friend a brand will therefore need to develop systems to manage this communication

channel cost effectively.


METHOD IV: MEASURING CHATBOT INTERACTION VIA A PROTOTYPE HEALTH BOT

NAMED KAT

The last method necessitated the development of a Facebook chatbot to gather first hand data of user

interactions.

How do you build said chatbot?

As of April 2016 Facebook allowed third parties to create a chatbot on their Messenger platform. A

chatbot follows a simple messaging format (the same as SMS) where it is able to receive and send

messages. The messages can contain text, multi-media or both.

Facebook also allows messages to contain simple structured elements such as buttons.

The messages can also contain structured template elements (figure 13).
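To make the structured message format more concrete, here is a sketch of a message body carrying a question and three buttons. The layout follows the Messenger Platform button template as I understand it; the field names, the three-button limit and the payload strings (such as SLEEP_NOT_GREAT) should be treated as assumptions and checked against the current platform documentation. The function only builds the JSON body and does not send anything.

```python
import json

MAX_BUTTONS = 3  # assumed limit for a button template; verify against the platform docs


def button_message(recipient_id, text, options):
    """Build a message body with postback buttons.

    options: list of (button title, payload string) pairs. The payload string is
    what the bot would receive back when the user taps that button.
    """
    if not 1 <= len(options) <= MAX_BUTTONS:
        raise ValueError(f"expected between 1 and {MAX_BUTTONS} buttons")
    buttons = [
        {"type": "postback", "title": title, "payload": payload}
        for title, payload in options
    ]
    return {
        "recipient": {"id": recipient_id},
        "message": {
            "attachment": {
                "type": "template",
                "payload": {
                    "template_type": "button",
                    "text": text,
                    "buttons": buttons,
                },
            }
        },
    }


# Hypothetical recipient id and payload names, for illustration only.
sleep_question = button_message(
    "USER_ID",
    "Did you get a good sleep last night?",
    [
        ("Not great", "SLEEP_NOT_GREAT"),
        ("Average", "SLEEP_AVERAGE"),
        ("Great Sleep", "SLEEP_GREAT"),
    ],
)
print(json.dumps(sleep_question, indent=2))
```

Closed questions map naturally onto buttons like these, which keeps each exchange to a single tap.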

FIGURE 13. FACEBOOK UI ELEMENTS (BUTTONS) TAKEN FROM HTTPS://DEVELOPERS.FACEBOOK.COM/DOCS/MESSENGER-PLATFORM

We took the decision to name our chatbot Kat as we wanted a short name with a connection to the

Kantar brand.

The questions were taken from a Kantar Health diary study (Figure 14). The survey was modified to

make it more akin to a general health survey and finished with an open conversation (figure 15

describes the flow).

FIGURE 14. CHATBOT FIXED QUESTIONS

Original survey question: You are .. (ask once M F)
Chatbot survey: Delete. Facebook provides this information.

Original survey question: How old are you? (ask once)
Chatbot survey: -

Original survey question: How would you evaluate your overall health? Would you say you are: (ask once)
Chatbot survey: This was added as an “Additional Question”.

Original survey question: Which of the following best describes your capacities to perform everyday activities: (ask once)
Chatbot survey: -

Original survey question: How would you evaluate your overall level of activity. Would you say you are: (ask once)
Chatbot survey: -

Original survey question: How do you feel today? (ask daily in the morning and allow people to answer this question throughout the day if their mood changes)
Chatbot survey: How are you feeling right now? Unwell, Cruising, Awesome

Original survey question: Did you get a good sleep last night?
Chatbot survey: Did you get a good sleep last night? Not great, Average, Great Sleep

Original survey question: Did you exercise today?
Chatbot survey: Have you had any exercise within the last two hours? I worked out!, A nice walk, No mostly stationary.

Original survey question: -
Chatbot survey: What have you consumed in the last two hours – Snacks, A Meal, Nothing

FIGURE 15. CHATBOT USER FLOW

Distribution of the chatbot


A link to Kat was sent out to a company mailing list, the “Mobile Insight Group”. Clicking the link

in the email would launch Kat directly.

70 people clicked on the link and launched Messenger and Kat. 10 of the 70 recipients did not interact

with the Chatbot. Four of them experienced technical issues as they were using an older version of

Messenger.

Findings

For four of the respondents the conversation spanned 16 hours! The reason is one of the challenges

to talking with an assistant in a messenger environment. A Facebook chat is not deleted; it sits there

ready to start again where it left off. This is a usability challenge. Should we insist on starting the

conversation afresh, or is it rude to do so?

FIGURE 15. TOTAL DURATION OF A SINGLE CHATBOT CONVERSATION

Messenger does not have a mechanism to remove previous conversations. Deletion is at the

discretion of the user. This obviously has PR implications if the chatbot or a human operator

managing the bot sends an inappropriate message.

After removing outliers such as saying Hi, the average time to answer the four set questions (one

being dynamic if answered again during the day) was 91 seconds.

After the four fixed questions, Kat would attempt to create a more ‘open’ conversation. To do this, it

would look through the respondent’s answers, and based on a decision tree select a response. For

example, if someone had slept badly, Kat would ask why? If a respondent had said they did not have

any exercise, it would ask what their favourite form of exercise was.
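A decision tree of this kind can be reduced, in its simplest form, to an ordered list of rules checked against the answers already given. The sketch below is an illustration under that assumption and is not Kat's actual logic: the answer keys, the fallback prompt and the exact wording of the follow-ups are hypothetical, while the answer options are taken from the fixed questions.

```python
# Each rule: (question key, answer that triggers it, follow-up to ask).
# Rules are checked in order and the first match wins.
FOLLOW_UP_RULES = [
    ("sleep", "Not great", "Why did you sleep badly?"),
    ("exercise", "No mostly stationary", "What is your favourite form of exercise?"),
]

# Hypothetical fallback used when no rule matches.
DEFAULT_FOLLOW_UP = "Is there anything else you would like to tell me about your day?"


def choose_follow_up(answers):
    """Pick the follow-up for a dict of {question key: chosen answer}."""
    for question, trigger, follow_up in FOLLOW_UP_RULES:
        if answers.get(question) == trigger:
            return follow_up
    return DEFAULT_FOLLOW_UP


print(choose_follow_up({"sleep": "Not great", "exercise": "I worked out!"}))
```

Because the first matching rule wins, the order of the rules decides which topic the bot pursues when several answers could trigger a follow-up.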

[Chart: Total Duration of Chat (seconds) by Conversation ID (1 to 67); vertical axis in seconds, 0 to 80,000. Annotations: conversations that lasted 16 hours; conversations that span 1-2 hours; the majority of conversations last 1-15 minutes.]

FIGURE 16. TIME SPENT ON STRUCTURED QUESTIONS VERSUS DECISION TREE CONVERSATION PER RECORDED CONVERSATION

The data indicated that respondents who attempted the fixed questions did so efficiently. Most

engaged in conversation with Kat but the limitations of a primitive AI engine would often make for

unsatisfying and frankly confusing conversations.

Did people share photos with the bot? And what was the subject matter?

Kat was designed to encourage some photo uploads.

24 of the 60 respondents uploaded at least one photo, with 108 photos collected in total.

We used an automated image classification system (clarifai.com) to auto-tag the pictures. The system

would tag the photos in milliseconds, and Kat would then ask a simple question based on the tag.
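
A rough sketch of the "question based on the tag" step is given below. No call to the image classification service is shown; the sketch assumes the tagger returns (tag, confidence) pairs, and the tag names, question wording and confidence threshold are all hypothetical.

```python
# Hypothetical mapping from an image tag to a scripted follow-up question.
TAG_QUESTIONS = {
    "food": "That looks tasty. Did you make it yourself?",
    "selfie": "Nice photo. How are you feeling today?",
}
DEFAULT_QUESTION = "Thanks for the photo. What is it showing?"


def question_for_tags(tags: list[tuple[str, float]], min_confidence: float = 0.8) -> str:
    """Pick the question for the most confident tag that has a scripted question."""
    for tag, confidence in sorted(tags, key=lambda t: t[1], reverse=True):
        if confidence >= min_confidence and tag in TAG_QUESTIONS:
            return TAG_QUESTIONS[tag]
    # Unknown or low-confidence tags (stickers, for example) fall back to a
    # generic prompt rather than a guess.
    return DEFAULT_QUESTION
```

Falling back to a generic prompt keeps the bot from asking a confident but wrong question when the tagger struggles.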

20 of the 108 were stickers (which a general image system struggles with). If stickers are an important

part of communication in a messenger environment, they will need classification. Of the 20 sticker

images (of which there are hundreds of options), the most common was the “thumbs up”,

followed by smiley face variations.

[Chart: duration of ad-hoc conversation versus duration of set questions, in seconds; axis 0 to 600.]

The rest of the photos shared were an even mix of food and selfies (Figure 17).

FIGURE 17. SMALL SAMPLE OF FOOD PHOTOS UPLOADED TO KAT

Did respondents build a connection with Kat?

As can be seen from Figure 18, 52% of responses from the internal pilot were one word only. In the

word count table underneath (Figure 19) a rapid drop in frequency is evident.

FIGURE 18. FREQUENCY OF WORD COUNT CONTAINED IN USER VERBATIM RESPONSES

Of the 684 unique user responses, 40 contained ten or more words. These

respondents were testing Kat out.
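
For readers who want to reproduce this kind of word count breakdown on their own chat logs, a minimal sketch follows. It assumes that a "word" is any whitespace-separated token, so symbols such as ":)" count as one word; the function name word_count_shares is illustrative, and the sample inputs are taken from the example responses in this section.

```python
from collections import Counter


def word_count_shares(responses: list[str]) -> dict[int, float]:
    """Map each word count to the percentage of responses with that count."""
    # Blank responses are ignored; tokens are split on whitespace only.
    counts = Counter(len(r.split()) for r in responses if r.strip())
    total = sum(counts.values())
    return {n: round(100 * c / total, 1) for n, c in sorted(counts.items())}


sample = ["Gym", "Beer", "feeling cold", "am i done?"]
shares = word_count_shares(sample)  # two 1-word, one 2-word, one 3-word response
```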

FIGURE 19. EXAMPLES OF USER RESPONSES IN DIFFERENT WORD COUNT BANDS

No. Words/Symbols in Response | Five interesting examples per count

1 | ? (Translation – “what the heck are you on about robot!”) | Wha (Translation – “They meant to say hi”) | Gym | Baby | Beer

2 | soup :) (note emoticon symbol) | dancing zumba | feeling cold | did already | noise, heat

3 | blocked nose, cold | am i done? | am i healthy? | holiday in Scotland | needing the bathroom.

4 | 3 cups of coffee. | you getting freaky bot | usually to celebrate sunset | noises and bad dreams | eat sleep rave repeat

5 | enormous amounts of road noise | the pretzels or the watermelon? | i have a broken ankle | no my bike is stolen | quinoa bites, and lemon water

6 | baby got unwell during the night | pretty hot with the windows closed | what would you like to know | so what should we do ? | ok, i have to go. bye

7 | i don’t like being i in pictures. | i mean my job and my house | feed it, put it to bed, repeat | do you not feel sorry for me | after a few fingers of whiskey, yes.

8 | event location one day prior to the event | i will after the event in 2 days

9 | go to bed at the same time every night | ? slices of bread if thats what you mean?

10 | i like to cycle when i can to stay fit | this is the 3rd time you Ave asked me this question | probably insomnia, but it could be any number of things

11 | i drank a coke with my lunch which was some curry | hmm favourite exercise, walking probably as does not feel like exercise

12 | my son was sick, so he woke up 3 times at time | my favourite fruit is mango, but the bar was a strawberry one

Chart data for Figure 18 (number of words in response: share of responses): 1: 52% | 2: 15% | 3: 9% | 4: 6% | 5: 5% | 6: 3% | 7: 2% | 9: 2%

Summary of Findings

As Kat was a first version prototype there were numerous lessons learnt.

Verbatim conversation is difficult to manage.

Keep it simple and use buttons and other templated elements where possible.

A chatbot needs to respond quickly, otherwise the respondent will leave the conversation or

repeatedly send the same message.

It is easy for a chatbot with a limited knowledge base to repeat the same response. Users find

this frustrating.

Kat did not have a sentiment scoring engine, so on several occasions it would appear

insensitive. Empathy is important when holding a conversation.

Chat sample 1

Respondent: “I have a broken ankle”.

Kat: “What do you do to treat it?”

Respondent: “I am wearing a boot.”

Kat: “Has it been a tough few days?”

Respondent: “Yes.”

Kat: “Ok thanks. Please remember to say Hi again!”

Respondent: “Do you not feel sorry for me?”

Kat: “I am a Healthbot called Kat”.

Respondent: “XXXXX!”

Chat sample 2

Kat: "OK do you know what caused the lack of sleep?"

Varti: "I just told you...cold, blocked nose"

Kat: “ ”

The chatbot owner needs to think very carefully before letting artificial intelligence source and

compile responses on the fly.
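
The empathy lesson illustrated by chat sample 1 can be approximated even without a full sentiment scoring engine. The sketch below is a deliberately simple keyword check, not a substitute for a proper sentiment model; the cue words, the acknowledgement text and both function names are assumptions made for illustration.

```python
import re

# Hypothetical cue words signalling that the respondent is having a hard time.
NEGATIVE_CUES = {"broken", "sick", "unwell", "stolen", "insomnia", "pain"}


def needs_empathy(message: str) -> bool:
    """True if the message contains any of the negative cue words."""
    words = set(re.findall(r"[a-z']+", message.lower()))
    return bool(words & NEGATIVE_CUES)


def reply_with_empathy(message: str, scripted_reply: str) -> str:
    """Prefix the scripted reply with an acknowledgement when it seems warranted."""
    if needs_empathy(message):
        return "I'm sorry to hear that. " + scripted_reply
    return scripted_reply
```

With this check, the response to “I have a broken ankle” would open with an acknowledgement before asking about treatment.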

CONCLUSION – ARE DIGITAL ASSISTANTS OR CHATBOTS A BRAND’S BEST FRIEND?

Chatbots combined with the vast distribution of Messenger networks offer an unprecedented

opportunity for brands to connect with their customers at scale (see Appendix I for reach and brand

access).

Facebook’s chatbot store only launched in April and already more than 11,000 brand chatbots have

been launched. China’s WeChat opened its messenger to developers well before Facebook and

is reportedly launching thousands of new channels (its term for chatbots) each week.

While chatbots represent a significant opportunity for brands, their creation is not a straightforward

task.

Users will expect brief and accurate interactions. Users are very demanding of new technology as

evidenced by our usability tests.

Everything a chatbot says must be consistent with brand values. Once AI integration with brand DAs

becomes more mainstream, complex tests will be required to ensure the AI’s text generation is

consistent with a brand’s values.

Early adopters and influencers will try to see if they can stump a chatbot. In Microsoft’s case, users

gleefully posted the conversations online when its chatbot Tay became confused. This means initial

offerings will be highly templated and will often remove the requirement for open text.

Respondents indicated throughout all three studies their willingness to engage with a DA. However, these

conversations can be complicated and nuanced, as indicated by the significant variety of conversation

chains identified in the brand page experiment.

In short, I believe that DAs/chatbots will become a brand’s best friend in the longer term. It is clear,

however, that as brands move into the world of digital conversation, they will need to start with simple

and templated experiences.

Appendix I

The following table details a selection of the leading DAs and DA distribution channels.

Messenger Type | DA Type | User Interface | Reach | Developer Access

Facebook Messenger | Bot (built inside of Messenger) | Primarily text, but multimedia can also be shared with a Bot. | 1 billion monthly users (MAUs) | Developer API (Messenger Developer)

SIRI | General Digital Assistant | Voice. The voice can trigger website links that can then be interacted with. Voice interface able to integrate apps with SIRI commands, e.g. “SIRI, please ask Easyjet, what is the status of my flight?” | ~500 million | Developer API. Slightly different interpretation,

Whatsapp | None yet | Primarily text | 1 billion MAUs | TBD

WeChat | Bot | Primarily text | 700 million users | Developer API

Line | Bot | Primarily text | 215 million users | “Bot Store”, create fully featured conversational Bot

Snapchat | None | - | 150 million daily users | None, brands make filters etc.

KIK | Bot | Primarily text | - | Developer API

Telegram | Bot | Primarily text | - | “Bot Store”, create fully featured conversational Bot

Microsoft Cortana | General Digital Assistant | Text and voice | - | Microsoft Bot Framework

Duer | - | - | - | -

Google “Allo” | TBD (launches shortly) | TBD | Brand new! | TBD

Amazon Echo | Bot (called Skills) | Voice | 3 million units sold in the US | Developer API (Skills)

Appendix II: Measurements

This table summarises the various DA measurement ideas surfaced throughout this paper.

User Interface Testing

If you needed to find out X, how would you ask a DA to help you?

If you needed to find out X, how would you ask a friend to help you?

What would be the most useful response (to the question) from a friend?

What would be the most useful response (to the question) from a DA?

Did it present the information in a useful format?

Is this the sort of question you would ask a DA?

Could the information have been presented better and how?

How would you expect this DA to respond (prior)?

How did the DA respond?

How did the DA response make you feel?

Did the DA meet your needs?

Did the DA respond to you in a timely manner?

Script/Persona Testing

How did interacting with the DA make you feel about the brand? (A pre-question would ask about brand favourability.)

Conversational personification – did the DA require a persona?

Does the persona fit the brand?

Conversation Measurement (assuming access to behavioural/log data)

Average interaction length

Time of day of interaction

Location of interaction (at home, out of home – or more granular i.e. at the mall)

DA analytics for DA owner

Number of unique users (assume filtered by time)

Number of unique conversations

Length of conversation

Notification response time (how quickly someone enters a conversation from a notification or other prompt)

Repeat usage (and repeat usage frequency)

Successful Conversation completes (i.e. did the conversation successfully conclude?)

Unsuccessful Conversation completes (i.e. the conversation stopped prior to a transaction or prior to information being shared, or the

conversation became cyclic and the user gave up, etc.)

Confusion versus clarity – (i.e. did the user have to ask for clarification during the interaction)

Conversational Variety – (i.e. the different conversation chains - as per TNS Facebook page analysis user-brand-user)

The sentiment of the user interactions at each stage of the conversation

The sensitivity of the user interactions (for instance if a user mentioned medical conditions)

Intensity of conversation (inspired by KIK’s suggestions) -

o Active: A chat on one topic, where interaction, responses happen in rapid fire (i.e. a <= second interval between messages).

This could be an intense chat between girlfriend, boyfriend.

o Passive: An on-again off-again conversation. I.e. continued tweaking of a travel arrangement.

o Sporadic: Occasional messages, sent during the day, or week. This might be the style of conversation you would expect with

an entertainment service.
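
The three intensity bands above could be derived from message timestamps in a conversation log. The sketch below is one possible approach under stated assumptions: the thresholds are illustrative and not taken from this paper or from KIK, and the typical gap between messages is taken to be the median interval.

```python
from statistics import median

# Thresholds are illustrative assumptions; the paper does not fix them.
ACTIVE_MAX_GAP_S = 60          # rapid-fire exchange
PASSIVE_MAX_GAP_S = 60 * 60    # on-again, off-again within the hour


def classify_intensity(timestamps: list[float]) -> str:
    """Label a conversation from its message timestamps (in seconds)."""
    if len(timestamps) < 2:
        return "sporadic"  # a single message gives no interval to measure
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    typical_gap = median(gaps)
    if typical_gap <= ACTIVE_MAX_GAP_S:
        return "active"
    if typical_gap <= PASSIVE_MAX_GAP_S:
        return "passive"
    return "sporadic"
```

Using the median rather than the mean stops one long pause, such as an overnight gap in a Messenger thread, from reclassifying an otherwise rapid exchange.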

Artificial Intelligence Script Testing

A little futuristic, but I would imagine that purely AI-generated conversations would actually need to be tested by an AI routine with human oversight.

Conversational personification – this is a measure of how much people interact with your Bot as if it is human (i.e. Turing test), made by

checking the language of user responses for interactions that are human-like or contain empathy.

Conversational economy score – i.e. when adding AI to a conversation flow, did the AI element assist in achieving a conversational

“task complete”.

What is the sentiment spread of the AI-generated conversation responses?

Did the AI responses include stop-words (i.e. words or phrases that a Brand would not want to be associated with)?

Confidence level % that the AI will meet brand guidelines, for a given subject.
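
The stop-word measurement above lends itself to a simple automated check. The following is a minimal sketch, assuming the brand supplies its own list of unwanted words or phrases; the function name find_stop_phrases and the example list are hypothetical.

```python
import re


def find_stop_phrases(response: str, stop_phrases: list[str]) -> list[str]:
    """Return the brand stop-words or phrases that appear in an AI response."""
    text = response.lower()
    hits = []
    for phrase in stop_phrases:
        # Word boundaries prevent matches inside longer words; this assumes
        # each phrase starts and ends with a letter or digit.
        pattern = r"\b" + re.escape(phrase.lower()) + r"\b"
        if re.search(pattern, text):
            hits.append(phrase)
    return hits
```

A response that returns any hits could be held back for human review rather than sent to the user.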
