Interaction with a social robot like Furhat is inherently different from interaction with a voice assistant. The difference is similar to that between talking to someone face-to-face in the same room and talking to them on the phone. Most people prefer the former. In fact, we are willing to travel long distances and pay a lot of money to be able to have physical meetings. One way of describing this difference is that physical meetings (either with a human or a robot) are situated. This means that the physical situation in which the interaction takes place is of importance, and can be referred to during the interaction.
If I ask a concierge at an airport where the bathroom is, the person can point or look in the direction towards the bathroom, since we share the same space.
This interaction would be impossible with a voice assistant, or even with a virtual character on a screen. If I am in a shop, the clerk can put objects on the counter and we can look at them together and discuss the different options. Another benefit of situated interaction is that multi-party interaction becomes much easier. Since the different persons involved in the interaction are physically situated together, we can easily look at each other to signal who we are addressing, and who is supposed to be the next speaker. As you probably have experienced, this is something that often causes a lot of confusion in online meetings.
Just as the clerk in our earlier example can put things on the counter to be discussed, we can put a touch screen in front of the robot, where virtual “objects”, or other types of information, can be shown. (Actually manipulating physical objects is still a very challenging task for robots.) The robot can then look at the different objects during the interaction in the same way we do with real objects. Having a screen is convenient, since certain information is much easier to present visually than through speech. This includes graphs, maps and tables. Given that speech recognition can be problematic in noisy situations (and with certain users), the touch screen can also be used to select options if the robot has problems understanding what the user is saying. So, it seems like a robot and a touch screen make up a perfect combination, where the two interfaces complement each other.
The face provides a social interface and allows for spoken interaction when that is most convenient. The touch screen can be used as a fallback when speech recognition fails, and can display visual information. In reality, things are more complicated.
We have deployed Furhat in many different situations and have learned many lessons about when the screen contributes to the interaction, and when it instead “stands in the way” between the user and the robot. One example of the latter is shown in the picture below. This was a general-purpose application that we developed to be placed at, for example, trade fairs, museums and receptions. As can be seen in the picture, the robot presents a set of “cards” on the screen that the user can select from, and the robot will then tell the user about the chosen topic, or present further information, such as a video or a map. While it is also possible to select cards through speech, the touch screen adds robustness in noisy situations, as discussed above.
However, the question is why the user should talk to the robot at all, when it is much easier to just tap on the cards? And then comes the next question: Why even look at the robot? Indeed, as we have observed, when users interact with this application, they might look at the robot in the very beginning, but then almost exclusively stare at the screen. Initially, users find the robot interesting, but then lose interest after some time. If this is the case, you could of course ask yourself why the robot should be there in the first place, and not just the touch screen.
When deciding how (or whether) to use a touch screen together with the robot, you should ask yourself: Does the screen really complement the social interaction with the robot, or are they redundant? If they are only redundant, chances are that users will neglect the robot and start to engage with the screen instead, since it is something they are more used to and is typically less error-prone.
Another way of asking the question is: What kind of user experience can the social robot provide that the touch screen cannot? The answer is a sense of social connection that only face-to-face interaction can provide. For this to work, it is important that the user makes eye contact with the robot every now and then, and exchanges facial expressions, such as a smile. Of course, if we look at each other for too long without breaking eye contact, it can have the opposite, awkward effect (as humans, we typically do not keep eye contact for too long unless we are in love or threatening each other). But anyone who has interacted with Furhat knows that mutual gaze creates a very special sense of social connection. It is therefore paramount that the user’s attention is not distracted too much by another interface.
Thus, when designing an interaction with Furhat, the face-to-face interaction should be central and provide a value that the screen cannot. Used in the right way, the screen might still serve a complementary function. An example of where we think the screen serves this kind of function is ‘Card Game’. This is an application we often use when demonstrating the capabilities of Furhat. In this game, two players have the task of sorting a set of cards together with Furhat, for example arranging a set of inventions in chronological order. This is a collaborative game, which means that they have to discuss the solution with each other and with Furhat.
If you watch the video below, you can observe how the players switch between attending to the game and attending to Furhat, exchanging smiles and checking if Furhat seems to agree with their moves. Of course, there is a lot of individual variation here. Some users look mostly at the cards, and some look more at Furhat. But our research has shown that Furhat’s facial gestures and gaze direction during the game generally have a large impact on the users’ behaviour, so they must be paying attention.
As you can see, in this setup, the screen is integrated into the table, so the cards serve a similar function as objects put on the table, and the robot has the same relationship to these items as you have. They are not elements through which you interact with the robot.
Touching a card on the screen is not equivalent to saying something to the robot.
However, from a robustness perspective, the situated nature of the interaction can still really help, as it will be much easier to predict and constrain what the user is likely to say. When playing this game, the cards on the table will most likely be mentioned a lot, so we can prime the speech recognizer with their names. Again, our research shows that users mostly talk about things that are related to the game in this setting. Even if speech recognition fails, we can assume that a card that has been recently moved is the topic of the discussion. These tricks are all used to make this game work in a robust way.
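The two tricks described above can be sketched in a few lines. This is a minimal illustration in Python, not Furhat SDK code: the `Card` class and the functions `priming_phrases` and `guess_topic` are hypothetical names, and the speech recognizer is assumed to return a plain transcript string, or nothing when recognition fails.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Card:
    name: str
    last_moved: float = 0.0  # time of the latest move; 0.0 means never moved


def priming_phrases(cards: List[Card]) -> List[str]:
    """Card names to hand to the speech recognizer as likely phrases."""
    return [card.name for card in cards]


def guess_topic(transcript: Optional[str], cards: List[Card]) -> Optional[Card]:
    """Return the card the user is most likely talking about, if any."""
    if transcript:
        text = transcript.lower()
        for card in cards:
            if card.name.lower() in text:
                return card
    # Recognition failed or no card was named (assumed policy):
    # fall back to the most recently moved card.
    moved = [card for card in cards if card.last_moved > 0.0]
    return max(moved, key=lambda card: card.last_moved) if moved else None


cards = [
    Card("telephone", last_moved=12.0),
    Card("light bulb", last_moved=30.5),
    Card("printing press"),
]
topic = guess_topic(None, cards)  # no transcript, so the latest move decides
```

How the phrases are actually passed to a recognizer depends on the speech service in use, so that step is left out here.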
To sum up, combining a touch screen with the robot can really enhance the interaction, if used properly. The Furhat SDK also makes it very convenient to integrate a GUI with the interaction flow. If you consider using a touch screen in your Furhat application, we recommend that you think about the following design guidelines:
- Start thinking like this: If you were standing at a counter to serve the same role as Furhat, how would you make use of a screen in front of you?
- Think about the placement of the screen. It is important that the robot can continue to track the faces of the users while they are looking at the screen (otherwise the robot might think that the users have left). If the screen is placed beside the robot, there is a bigger risk that the users’ faces will be lost by the camera (especially if users have to lean forward to read the screen). Placing it in front of the robot is often better, but make sure it is not too small, or placed too low, so that users do not have to lean forward too much. Experiment with different setups and watch the dashboard to verify that users continue to be tracked.
- It really helps if the robot can also look at the objects on the screen, and if the users can easily see which object the robot is looking at (to achieve joint attention). Consider integrating the screen into a table placed between the users and the robot, to allow for joint attention in a natural way.
- Some information (like maps) is indeed much better presented on a screen. If you want to do that, consider leaving the screen blank during most of the interaction, until the visual information is displayed, so as not to disrupt the mutual gaze between the user and the robot.
- If you want to use the screen as a fallback for speech recognition in noisy environments, do not present the options until after you have detected that there are speech recognition problems.
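The last guideline can be sketched as a small policy object. This is only an illustration in Python under stated assumptions, not part of the Furhat SDK: the class name `ScreenFallback`, its method, and the threshold of two consecutive failures are all hypothetical choices that would need tuning for a real deployment.

```python
class ScreenFallback:
    """Decides when to reveal touch options, based on recent recognition failures."""

    def __init__(self, max_failures: int = 2):
        self.max_failures = max_failures  # assumed threshold, tune per deployment
        self.failures = 0                 # consecutive failed recognition attempts

    def on_recognition_result(self, understood: bool) -> bool:
        """Record one recognition attempt; return True if options should be shown."""
        if understood:
            self.failures = 0  # speech works again, so the screen can go blank
        else:
            self.failures += 1
        return self.failures >= self.max_failures


fallback = ScreenFallback()
show_options = fallback.on_recognition_result(understood=False)  # first failure
```

Counting only consecutive failures keeps the screen out of the way as long as speech works, and hides the options again once the user is understood.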
Finally, there is so much more that could be said about situated interaction, beyond touch screens, that we haven’t touched upon here. As in the initial example with the concierge at the airport, the robot can look at and talk about things in the wider environment. A robot museum guide can look at a piece of art while presenting it. A projector could be used to display things on the walls. A kitchen assistant robot could look at the food while instructing you how to prepare a meal. Let your imagination loose, but always put the social connection first!
Gabriel Skantze, Co-founder & Chief Scientist
Gabriel Skantze is Chief Scientist and co-founder of Furhat Robotics. Gabriel is also a Professor in Speech Technology with a specialization in Conversational Systems at KTH. He leads several research projects and has published 100+ papers on conversational systems and human-robot interaction.
Originally published at https://furhatrobotics.com on May 26, 2020.