3 Different ways of system control
3.2 System control via voice commands

Voice recognition is a rising trend in interaction with consumer devices [2]. Voice is the most natural form of human-to-human communication and carries most of the communicated information.

Voice commands are a valuable tool for controlling devices and systems when gestures or touch interfaces are not suitable. Their uses range from home entertainment systems and car infotainment control to assistive control for people with physical impairments.

Voice recognition covers several sub-fields, notably speaker identification and voice command recognition. The latter is the focus of current research thanks to significant advances in neural network technology.

Generally, a voice recognition system works in two modes: learning and recognition.

During learning, the system learns all the possible inputs and their meaning. This usually happens in a parametric domain, whether the parameters describe individual voice commands or speaker-specific information. During recognition, an unknown input pattern is matched to the closest of the learned parametric patterns. Both steps perform better with higher quality and quantity of input data.
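
The two modes can be illustrated with a minimal sketch. It assumes that each utterance has already been reduced to a fixed-length feature vector and that "closest match" means the smallest Euclidean distance; the class name `TemplateRecognizer`, the command labels, and the feature values are hypothetical and serve only as an illustration, not as the method of any particular system.

```python
import math


class TemplateRecognizer:
    """Toy recognizer with a learning mode and a recognition mode."""

    def __init__(self):
        # Maps a label (command or speaker) to its learned feature vectors.
        self.templates = {}

    def learn(self, label, features):
        """Learning mode: store a labelled pattern in the parametric domain."""
        self.templates.setdefault(label, []).append(list(features))

    def recognize(self, features):
        """Recognition mode: return the label of the closest learned pattern
        and its distance, or (None, inf) if nothing has been learned."""
        best_label, best_dist = None, math.inf
        for label, vectors in self.templates.items():
            for vector in vectors:
                # Assumed similarity measure: Euclidean distance.
                dist = math.dist(vector, features)
                if dist < best_dist:
                    best_label, best_dist = label, dist
        return best_label, best_dist


recognizer = TemplateRecognizer()
recognizer.learn("lights_on", [0.9, 0.1, 0.3])    # hypothetical features
recognizer.learn("lights_off", [0.1, 0.8, 0.4])   # hypothetical features
label, distance = recognizer.recognize([0.85, 0.15, 0.35])
print(label, distance)
```

Storing more patterns per label tends to make the match more reliable, but every stored pattern is compared during recognition, so the matching time grows with the amount of learned data.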

Speech recognition is prone to errors caused by noise or by other speakers talking simultaneously.

However, the more data a system has to process, the longer the processing takes, and response time is crucial for pleasant, seamless speech recognition.

Until a few years ago, most speech recognition systems could recognize only a limited set of isolated commands, or a speaker from a limited database. This led to highly specialized command sets.

With cloud-based services becoming widely and affordably available, speech recognition systems can make use of fast server solutions. Combined with widely available high-speed Internet connections, this allows current user interfaces to process more complex voice inputs (in general, this applies to any input signal pattern). Complex decisions can be made by neural networks on the server side, which eliminates the need for powerful user hardware and for software preparation on the user's device. Moreover, neural networks make the recognition of isolated commands so efficient that they can now be used to recognize complex commands comprising multiple commands or command types.

Progress in running neural networks on increasingly powerful hardware allows improvements in several areas. Firstly, the system becomes more independent of its environment: the deep speech parameters remain distinguishable under changing audio conditions [15]. Next, the system is able to recognize not only words or specific phrases, but also whole sentence utterances, with nuances and variations in the words used. Moreover, by taking previously recognized speech into account, systems can deduce the meaning of the current sentence or command even if it is vague or unspecific. Systems are beginning to understand the context in which the speech was recognized, which allows them to react more appropriately. This means systems are starting to comprehend not merely the actual speech, but the idea behind the words.