Multimedia processing
Multimodal Interface

Multimodal interfaces are a very popular technology at the moment. Everyone is talking about multimodal interfaces: how natural they are and how well users like them. Multimodal interfaces offer solutions to many of our user interface problems, as well as enabling new classes of applications.

A multimodal interface represents a combination of multiple modalities, that is, several ways of communicating or interacting with a computer system. Here the multimodal interface is addressed through speaker identification and face recognition: it is responsible for seamless user recognition and authentication using modalities such as voice and face detection. Beyond this, the multimodal interface serves for gesture and voice commands used to control the set-top box (STB).

A real project in which a multimodal interface is integrated is HBB-Next. The project seeks to facilitate the convergence of the broadcast and Internet worlds by researching user-centric technologies for enriching the TV-viewing experience with social networking, multiple-device access, and group-tailored content recommendations, as well as the seamless mixing of broadcast content, complementary Internet content, and user-generated content.

HBB-Next is based on a modular architecture, and the modules of the HBB (Hybrid Broadcast Broadband) project are designed to cooperate with one another. Here is an example: a user enters the room, the system recognizes the user, and the system is automatically configured according to the user's preferences. The user opens the AppStore application, and the system allows him to choose, open, buy, and install a desired application. For each user activity or operation, the system may require multilevel authentication based on secure identification with satisfactory validation and security.

Multi-speaker identification aims to identify possibly several speakers from a recorded signal that may contain utterances of more than one individual. This general task can be divided into several categories based on additional refinements. If the speakers that may appear in a given conversation are known a priori, i.e. they were present in some sort of training phase, the task resembles a single-speaker identification problem, even though additional algorithms must be applied, tuned, and enhanced. However, when the set of possible speakers is unknown, the techniques of speaker segmentation and clustering (diarization) must be used. Most applications aim to run continuously and "listen" to an incoming stream of PCM samples (sound waves), detect voice activity (VAD), silence, and background noise, possibly recognize speech overlap, and, if a sufficiently long voice period is caught, identify the speaker with a certain confidence measure.

The aim of a speaker identification system is to decide the identity of a speaker from an utterance, regardless of what he or she said. A speaker identification system consists of two main parts: the first is feature extraction from the recorded signal, and the second is a classification method, which determines the speaker based on the extracted parameters. These systems are usually designed for a specific task, so the designer has to select proper methods and their modifications for the given application, which may differ depending on the type and setting of the particular task.
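As a minimal sketch of this two-part structure (the library choices, parameter values, and function names below are illustrative assumptions, not a description of any particular system), feature extraction can compute MFCC vectors and classification can score them against one Gaussian mixture model per known speaker:

    import numpy as np
    import librosa
    from sklearn.mixture import GaussianMixture

    def extract_features(wav_path):
        """Feature extraction: MFCC vectors computed from the recorded signal."""
        signal, sr = librosa.load(wav_path, sr=16000)
        mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)
        return mfcc.T  # one 13-dimensional feature vector per frame

    def train_speaker_models(training_data):
        """Train one GMM per known speaker (closed-set identification).
        training_data maps a speaker name to a list of recording paths.
        The component count is an assumed, illustrative value."""
        models = {}
        for speaker, wav_paths in training_data.items():
            feats = np.vstack([extract_features(p) for p in wav_paths])
            gmm = GaussianMixture(n_components=16, covariance_type='diag')
            gmm.fit(feats)
            models[speaker] = gmm
        return models

    def identify_speaker(models, wav_path):
        """Classification: pick the speaker whose model gives the highest
        average log-likelihood; the score acts as a confidence measure."""
        feats = extract_features(wav_path)
        scores = {spk: gmm.score(feats) for spk, gmm in models.items()}
        return max(scores, key=scores.get), scores

This sketch covers only the closed-set case discussed above; an open-set or diarization system would additionally need VAD, segmentation, and a rejection threshold on the score.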

For the recognition of voice commands, systems belonging to the group of isolated word recognizers would be an option. The most successful and most widely used ones are those based on HMM statistical speech modelling, especially those using tied context-dependent phonemes as the basic modelling unit. Given a fixed set of commands and an abundance of training samples, whole-word models can be used to achieve potentially higher accuracy, as they better capture coarticulation effects.
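To make the whole-word option concrete, here is a hedged sketch (the hmmlearn-based setup, state count, and training details are assumptions for illustration): one Gaussian HMM is trained per command word on MFCC frame sequences, and recognition picks the word whose model scores the utterance highest.

    import numpy as np
    from hmmlearn import hmm

    def train_word_models(command_samples, n_states=5):
        """Train one Gaussian HMM per command word (hmmlearn's default
        ergodic topology; a true left-to-right topology would need custom
        transition constraints). command_samples maps a word to a list of
        MFCC sequences, each of shape [n_frames, n_features]."""
        models = {}
        for word, sequences in command_samples.items():
            X = np.vstack(sequences)
            lengths = [len(seq) for seq in sequences]
            model = hmm.GaussianHMM(n_components=n_states,
                                    covariance_type='diag', n_iter=20)
            model.fit(X, lengths)
            models[word] = model
        return models

    def recognize_command(models, mfcc_sequence):
        """Score the utterance against every whole-word model and return
        the command with the highest log-likelihood."""
        scores = {w: m.score(mfcc_sequence) for w, m in models.items()}
        return max(scores, key=scores.get)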

Generally, there are two categories of approaches to tracking a person's gestures: the appearance-based and the 3D model-based approach. The 3D model-based approach compares the input parameters of a limb with 2D projections of a 3D limb model. The appearance-based approach uses image features to model the visual appearance of a limb and compares them with the image features extracted from the video input. Focusing on the latter approach, the results depend on the capabilities of the capturing device. If an RGB camera is used, the methods focus on tracking the skin colour or the shape of the gesturing body part. This approach, however, depends highly on the lighting conditions, as well as on the stability of the foreground and background of the tracked subject. Also, no other skin-coloured or limb-shaped objects may appear in the examined area, as they would mislead the algorithm. An infra-red depth camera uses its own IR light emitter and is thus much more resilient to the lighting conditions of the scene. Moreover, such a camera can provide a depth map, a pseudo-3D image of the scene, which is very useful when tracking gesturing body parts such as hands.
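A minimal sketch of the RGB skin-colour variant described above (the HSV threshold values are rough assumptions, and exactly the kind of parameter that changing lighting conditions upsets; OpenCV 4 is assumed):

    import cv2
    import numpy as np

    # Rough, assumed HSV bounds for skin tone; highly lighting-dependent.
    SKIN_LOW = np.array([0, 40, 60], dtype=np.uint8)
    SKIN_HIGH = np.array([25, 180, 255], dtype=np.uint8)

    def find_hand(frame_bgr):
        """Segment skin-coloured pixels and return the largest contour,
        assumed to be the gesturing hand (this is exactly where other
        skin-coloured objects in the scene would mislead the algorithm)."""
        hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
        mask = cv2.inRange(hsv, SKIN_LOW, SKIN_HIGH)
        mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        if not contours:
            return None
        return max(contours, key=cv2.contourArea)

    cap = cv2.VideoCapture(0)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hand = find_hand(frame)
        if hand is not None:
            x, y, w, h = cv2.boundingRect(hand)
            cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.imshow('hand tracking', frame)
        if cv2.waitKey(1) == 27:  # Esc quits
            break
    cap.release()
    cv2.destroyAllWindows()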

Currently, there are a few methods by which eye control can be implemented. Here the simplest and most natural way was chosen. The system has only one part: a static Kinect RGB camera. A person sitting in front of the Kinect (monitor) is asked to look, with the head still, at highlighted points on the screen. Simultaneously, the application measures the head's distance from the Kinect (monitor) using the Kinect depth camera. For the purpose of the triangle calculation (Pythagorean theorem), the dimensions of the screen must be determined. The application calculates the variance of the pupil movement, exactly as in the calibration section, and also calculates the angles by which the pupil must be deflected from the straight-ahead position to see the edges of the monitor. Knowing the head's distance from the monitor, the variance and angles can be recalculated whenever the distance changes, to keep the control accurate.
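The triangle calculation can be written down directly (a sketch under the assumptions above; the function names and the small-angle rescaling are ours): with the head at distance d from the monitor and a screen of width W, the pupil must deviate by atan((W/2)/d) from the straight-ahead position to see the horizontal edge, and the calibrated variance can be rescaled when d changes.

    import math

    def edge_angles(screen_w_mm, screen_h_mm, head_dist_mm):
        """Angles (degrees) by which the pupil must deviate from straight
        ahead to see the horizontal and vertical edges of the monitor.
        Right triangle: opposite = half screen dimension, adjacent = head
        distance."""
        theta_x = math.degrees(math.atan((screen_w_mm / 2) / head_dist_mm))
        theta_y = math.degrees(math.atan((screen_h_mm / 2) / head_dist_mm))
        return theta_x, theta_y

    def rescale_variance(calib_variance_px, calib_dist_mm, current_dist_mm):
        """Rescale the calibrated pupil-movement variance when the head
        distance changes: for the same on-screen point the pupil deflection
        shrinks roughly in proportion to the distance ratio (small-angle
        assumption)."""
        return calib_variance_px * (calib_dist_mm / current_dist_mm)

    # Example: a 24" monitor (approx. 531 x 299 mm) viewed from 600 mm.
    print(edge_angles(531, 299, 600))   # approx. (23.9, 14.0) degrees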

Face recognition methods were mentioned in the chapter Image recognition. In real systems, a list of requirements is usually defined for single local user identification based on a human face, which the system must/should/may implement: