I can see clearly now — the visual perception of a social robot
For a robot to interact socially with human beings, it must be able not only to detect human faces, but also to understand what they represent. Here is a glimpse into how we make this possible.
Humans have evolved over millions of years as social beings. This evolution has equipped us with, among other things, an astounding ability to process and understand the faces of other humans. In fact, our brains have dedicated neural pathways for detecting faces. We are so programmed to do this that we quite often see “faces” even when they are not there: in everyday inanimate objects, in the clouds or in nature.
Humans are also very good at inferring where another person is looking, in particular when we are the target of their attention. In fact, we have a finely calibrated gaze detection system, with dedicated neurons that fire when we make eye contact with another person. We are also highly skilled at identifying people by their facial appearance. It’s easy to tell from a face that someone is familiar to us (a person in our “herd”), even if we don’t remember their name or the exact context in which we have seen them before.
Finally, we can often gauge a person’s attitude by looking at their face. Here too, the evolutionary benefit is clear: angry faces, which indicate potential threats, are detected faster than other expressions, whereas a friendly smile often triggers a reflex to smile back. Human beings are simply programmed to be social.
So, what does all of this have to do with robots? Quite a bit as it turns out.
At Furhat Robotics, we have set out to build the world’s most advanced social robot. And for a robot to interact socially with human beings, it must be able not only to detect human faces, but also to understand what they represent.
The first and most important condition that the vision system of a social robot must fulfil is robustness in real-world conditions. In contrast to, say, a voice-activated speaker that uses a hot-word such as “Hey Google” to initiate the dialogue, we want the robot to engage in interaction with humans based on their visual presence, and to disengage when the person leaves.
For this approach to work well, it is absolutely essential that the face detection has a very low risk of false positives (which could inadvertently initiate a conversation with an imaginary person) and an even lower risk of false negatives (which could abruptly terminate a conversation mid-way because the robot believed the person had left). And this should hold regardless of the visual conditions: in an office or public space, in a dimly lit environment, or in a back-lit setting in front of a window.

Another crucial constraint is time. Face detection in the robot must be more or less instantaneous for the social interaction not to be impeded, so any solution that requires more than a fraction of a second to process a frame is unusable. This also disqualifies any cloud-based solution, because network round-trip times alone would incur a delay that exceeds this budget.
This has been quite a tough challenge for us. In 2011, when we showcased the first version of the Furhat Robot in a public exhibition, we had to rely on crude solutions such as ultrasonic proximity sensors to know if there was anyone in front of the robot to talk to. Later we also turned to depth sensors such as the Microsoft Kinect, but neither reliability nor usability — with bulky external sensors — was satisfactory.
The real challenge has been to build sufficiently advanced and robust real-time face processing capabilities directly into a device with comparatively modest computational resources. And finally, we believe we have succeeded.
The Furhat vision pipeline
The first step is to get hold of the images. CamCore receives a stream of color images from the robot’s built-in camera at a steady pace of 10 frames per second. Each camera frame is scanned for faces using an SSD (single shot detector) deep neural network. This is the most critical and computationally heavy step of the process, since it always needs to take the full image into account, and it requires about 80 ms. Once the regions containing faces have been identified in the camera frame, they can be extracted and passed on to subsequent processing steps in order to retrieve additional information about each of the faces.
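The crop step after detection can be sketched as follows. This is an illustrative simplification that assumes the detector outputs rows of `[confidence, x1, y1, x2, y2]` with coordinates normalized to the unit square (a common SSD output layout); CamCore’s actual network, thresholds and post-processing are not shown here:

```python
import numpy as np

def crop_faces(frame, detections, conf_threshold=0.7):
    """Crop face regions from a frame, given SSD-style detections.

    detections: rows of [confidence, x1, y1, x2, y2], with box
    coordinates normalized to [0, 1] (an assumed output layout).
    """
    h, w = frame.shape[:2]
    crops = []
    for conf, x1, y1, x2, y2 in detections:
        if conf < conf_threshold:
            continue  # discard weak detections to keep false positives low
        # Scale normalized coordinates to pixels and clamp to the frame.
        px1, px2 = int(max(0, x1 * w)), int(min(w, x2 * w))
        py1, py2 = int(max(0, y1 * h)), int(min(h, y2 * h))
        crops.append(frame[py1:py2, px1:px2])
    return crops

# Example: one strong and one weak detection in a dummy 640x480 frame.
frame = np.zeros((480, 640, 3), dtype=np.uint8)
dets = np.array([[0.95, 0.25, 0.25, 0.50, 0.75],
                 [0.30, 0.60, 0.10, 0.70, 0.30]])
faces = crop_faces(frame, dets)
print(len(faces), faces[0].shape)  # 1 (240, 160, 3)
```

The confidence threshold is the knob that trades false positives against false negatives discussed above; the weak second detection is dropped, and only the confident face region is passed downstream.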
The next step is to estimate the position and orientation of each detected head. The location and size of the bounding box are used in a simple calculation, based on the known camera parameters, to estimate the location of the user in three-dimensional space. Then a neural network is used to extract the orientation of each head. This network takes a cropped face image as input, and outputs three rotational angles (yaw, pitch, roll). The estimated position makes it possible for the robot to follow users and make eye contact, and the head orientation provides an estimate of where the user is directing their attention.
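The distance part of that calculation follows the standard pinhole camera model: a face of known real-world size appears smaller in the image the farther away it is. A minimal sketch, with a hypothetical focal length and an assumed average face width (not the robot’s actual calibration values):

```python
# Hypothetical camera intrinsics; the robot's actual calibration is not public.
FOCAL_PX = 600.0         # focal length expressed in pixels
AVG_FACE_WIDTH_M = 0.16  # assumed average human face width in meters

def face_position(center_x, center_y, bbox_width_px, image_w=640, image_h=480):
    """Estimate (x, y, z) in meters from a face bounding box.

    Pinhole model: z = f * real_width / pixel_width, then the box
    center is back-projected to get the lateral offsets.
    """
    z = FOCAL_PX * AVG_FACE_WIDTH_M / bbox_width_px
    x = (center_x - image_w / 2) * z / FOCAL_PX
    y = (center_y - image_h / 2) * z / FOCAL_PX
    return x, y, z

# A 120 px wide face centered in the image sits about 0.8 m straight ahead.
x, y, z = face_position(320, 240, 120)
print(round(z, 2))  # 0.8
```

This single-camera estimate is only as accurate as the assumed face width, but it is enough for the robot to orient its gaze toward the user.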
After this, CamCore extracts a face descriptor for re-identification, or a faceprint. The faceprint is a numerical vector, output by a neural network, that makes it possible to compare facial images from subsequent frames to tell whether they represent the same or different persons, and is used in the tracking step (see below). The faceprint can also be used for face recognition, if it is coupled with a database of known faces.
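Comparing two faceprints typically comes down to a distance or similarity measure on the vectors. A sketch using cosine similarity with an illustrative threshold (the actual descriptor network, metric and threshold used by CamCore are not public):

```python
import numpy as np

def same_person(fp_a, fp_b, threshold=0.6):
    """Decide whether two faceprint vectors belong to the same person,
    using cosine similarity. The threshold is illustrative; a real
    system tunes it on labeled face pairs."""
    a = fp_a / np.linalg.norm(fp_a)
    b = fp_b / np.linalg.norm(fp_b)
    return float(a @ b) >= threshold

rng = np.random.default_rng(0)
anchor = rng.normal(size=128)                    # faceprint from frame t
same = anchor + rng.normal(scale=0.1, size=128)  # same face, slight noise
other = rng.normal(size=128)                     # a different person

print(same_person(anchor, same), same_person(anchor, other))  # True False
```

Because the same face in consecutive frames yields nearly parallel vectors while different faces do not, a simple threshold separates the two cases reliably.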
The last analysis step is to estimate the facial expression of the user, which is done by yet another neural network. While the network actually analyzes the face in terms of five separate expressions (happy, sad, angry, surprised and neutral), the one we find most reliable, and also most useful from an interaction perspective, is Happy, which will be available as an event to skill developers in an upcoming version of the Furhat SDK.
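Assuming the network emits one raw score per expression class, picking the reported expression amounts to a softmax followed by an argmax. A sketch (the class layout and scores here are invented for illustration):

```python
import numpy as np

EXPRESSIONS = ["happy", "sad", "angry", "surprised", "neutral"]

def classify_expression(logits):
    """Turn raw five-class network outputs into an expression label plus
    a softmax confidence. The class order is an assumption."""
    z = np.exp(logits - np.max(logits))  # numerically stable softmax
    probs = z / z.sum()
    i = int(np.argmax(probs))
    return EXPRESSIONS[i], float(probs[i])

label, conf = classify_expression(np.array([2.5, 0.1, -0.3, 0.0, 1.0]))
print(label)  # happy
```

The softmax confidence is what lets a skill react only to a clear smile rather than to every borderline classification.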
In an interaction where more than one person is present — either a multi-party interaction or a single-party interaction with visible on-lookers or bystanders — it is important to know who is who. To this end, CamCore assigns a user ID to each face it encounters, and ensures that the same person keeps their ID over time.
This user tracking is based on two principles. Firstly, CamCore continuously predicts the next position of each face. It does so using a Kalman filter, which continuously estimates the velocity as well as the measurement noise. For every detected face in a camera frame, CamCore compares the observed positions with the predicted ones and carries out an optimal assignment between the two.
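The predict-and-assign cycle can be sketched as follows. For brevity, this replaces the full Kalman filter with just its constant-velocity prediction step, assumes equally many tracks and detections, and finds the optimal assignment by brute force, which is fine for the handful of faces a social robot sees at once (larger problems would use the Hungarian method):

```python
from itertools import permutations
import numpy as np

def predict(positions, velocities, dt=0.1):
    """Constant-velocity prediction: one predict step at 10 fps."""
    return positions + velocities * dt

def assign(predicted, observed):
    """Globally optimal track-to-detection assignment by total distance.

    Returns a list where entry i is the index of the observation
    matched to track i.
    """
    n = len(predicted)
    cost = np.linalg.norm(predicted[:, None, :] - observed[None, :, :], axis=2)
    best = min(permutations(range(n)),
               key=lambda p: cost[np.arange(n), list(p)].sum())
    return list(best)

# Two tracked faces crossing paths; the new observations arrive in
# swapped order, but prediction keeps the identities straight.
pos = np.array([[0.0, 0.0], [1.0, 0.0]])
vel = np.array([[1.0, 0.0], [-1.0, 0.0]])
obs = np.array([[0.9, 0.0], [0.1, 0.0]])
print(assign(predict(pos, vel), obs))  # [1, 0]
```

Without the prediction step, nearest-neighbor matching on the old positions would swap the two identities here; predicting each face forward before matching is what makes the assignment robust to motion.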
This works well as long as every face is constantly visible. If a face disappears, however, because the user is blocked, turns away or temporarily leaves the scene, another strategy has to be used. This is where the faceprints come into play. Whenever a face appears that can’t be matched by frame-to-frame predictive tracking, it is compared by faceprint to the previously tracked faces, so that each user keeps their ID even if they are temporarily lost to the system.
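The re-identification fallback can be sketched as a lookup against a gallery of stored faceprints: reuse an existing user ID when someone is similar enough, otherwise mint a new one. The threshold and similarity measure are illustrative, not CamCore’s actual values:

```python
import numpy as np

def reidentify(new_fp, gallery, threshold=0.6):
    """Match an unmatched faceprint against stored ones. Returns an
    existing user ID if one is close enough (cosine similarity),
    otherwise registers and returns a fresh ID."""
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    best_id, best_sim = None, threshold
    for user_id, fp in gallery.items():
        sim = cos(new_fp, fp)
        if sim >= best_sim:
            best_id, best_sim = user_id, sim
    if best_id is None:
        best_id = max(gallery, default=0) + 1  # mint a fresh user ID
        gallery[best_id] = new_fp
    return best_id

rng = np.random.default_rng(1)
gallery = {1: rng.normal(size=128), 2: rng.normal(size=128)}
returning = gallery[2] + rng.normal(scale=0.05, size=128)
print(reidentify(returning, gallery))  # 2: user 2 is back in view
stranger = rng.normal(size=128)
print(reidentify(stranger, gallery))   # 3: a new user is registered
```

This is what lets a user walk behind a pillar and return without the robot treating them as a stranger.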
To summarize: to function well in a social setting, a robot needs the same fundamental visual perception capabilities as a human. The visual perception of the Furhat robot focuses on human faces, because that is where the most critical information lies. It can detect the presence of people, estimate their location and follow them with its gaze. It can tell where their visual focus of attention is based on their head orientation. It can recognize people it has seen before, so that it can remember what they have already talked about. It can tell if you are happy, and smile back at you.
As any well-behaved robot should.
Originally published at https://furhatrobotics.com on April 19, 2020.