1 MMI architecture

Currently, the most widely used interfaces for human–computer communication are the keyboard, mouse, and touch tablet. These devices reflect the human's adaptation to the computer's limitations rather than natural communication with the computer. In recent years, a demand has emerged for humans to communicate with machines the same way they do with each other: by speech, facial expressions, or gestures, since these channels convey far more information than peripheral devices do. This leads us to the term multimodal interface (MMI).

A multimodal interface consists of several modules that together provide natural, user-friendly communication with the system. Taken together, these modules represent the functionality of the MMI. The following modules can be part of a multimodal interface:

  • Multi-voice identification
  • Speech and voice command recognition
  • Multi-face recognition
  • Gesture recognition and navigation
  • Eye navigation
  • Speech synthesis
  • Recommendation engine
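The modules listed above typically share a common shape: each consumes raw input data and emits a recognition result. A minimal Python sketch of such an interface follows; all class, field, and method names here are illustrative assumptions, not part of the source:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional

@dataclass
class ModuleResult:
    """Recognition result produced by one MMI module (illustrative)."""
    module: str                 # e.g. "speech", "face", "gesture"
    user_id: Optional[str]      # recognized user, if the module identifies users
    action: Optional[str]       # requested action, if the module detects commands
    confidence: float           # how certain the module is about this result

class MMIModule(ABC):
    """Hypothetical common interface implemented by every recognition module."""
    @abstractmethod
    def process(self, frame: bytes) -> ModuleResult:
        """Process one chunk of raw input data and return a result."""

class SpeechCommandModule(MMIModule):
    """Toy stand-in for speech and voice-command recognition."""
    def process(self, frame: bytes) -> ModuleResult:
        # A real module would run speech recognition here;
        # this stub only illustrates the shape of the interface.
        return ModuleResult("speech", user_id=None,
                            action="open_menu", confidence=0.9)
```

A uniform interface like this is what lets the layers described below treat every modality the same way, regardless of the sensor behind it.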

The general architecture of multimodal interface consists of several layers. Physical layer represents hardware input and output devices which enables interaction with real-world. Multimodal data provided by input devices (camera, sensor, microphone, etc.) are processed in parallel by each module separately. The MMI controller collects output data from all modules, evaluates and combines it into one output data stream. The stream contains information about recognized users and their requested actions.