Speech recognition is like the fusion reactor of computer science. Fusion research is always “just 20 years away from practical applications”. This was the case in 1960, 1980, 1990 and 2000. All along we learned a lot of useful and interesting stuff, but we still don’t have access to clean, virtually limitless electricity. At one point in the early days of nuclear energy people actually talked about electricity being so plentiful that it would be “too cheap to meter”. And they were serious, too.
Speech recognition has a shorter history, but one that is in many respects just as discouraging. They were already talking to computers on Star Trek in the late 60’s. Progress was steady through the 1970’s and 80’s, and by the early 90’s they had speech-driven IVRs, that is, interactive voice-response programs. IVRs were found primarily in call centers, where you could “type 1 or say yes” for yes, and “type 2 or say no” for no. It wasn’t fancy, but people were pretty excited about it. When I got into the game they were thinking about using speech recognition modules for more than just call centers and personal phone assistants. The goal was to employ speech recognition in any and all circumstances where a person’s hands or eyes were busy. Even back then it was understood that it would take a long time for speech recognition to take over in everyday software like spreadsheets or business applications. But any application that involved a sequence of questions and answers, or where the number of possible utterances in a given context was sufficiently constrained, was a candidate for a speech recognition interface.
It didn’t turn out that way. Speech recognition is employed in many situations, but where the number of possible utterances is more or less unconstrained you have to train the recognizer to recognize your voice. After you train the system it will work for you but not for anyone else. Where the number of possible utterances is expected to be more manageable you can train a recognizer to understand most people fairly well. Unfortunately there are a few major problems with such constrained systems.
- Users tend to stray from their scripts, and you realize that you must expend a lot of effort in training the users. This really flies in the face of modern user interface methodologies, where the user expects to start using an application quickly with a minimum amount of training.
- The system will simply never understand a certain fraction of the users who have thick accents, or who talk slowly, or who get confused by the whole thing. This is not necessarily a large fraction, but it’s large enough that you must deploy at least two mechanisms for user input – speech recognition, plus whatever normal methods you have anyway. It wouldn’t be so bad if 2-d mouse interaction were equivalent to speech recognition, but the reality is that speech is a linear medium, so all those interactions you take for granted in a 2-d medium must be funnelled into a single dimension, and that’s hard.
- The system performs poorly if you have a bad microphone, and most people have bad microphones.
Given these considerations, it is no surprise that we don’t use speech recognition for business applications. First, the vocabulary would ideally be unconstrained, so training would be required: either the speech system would need to be trained, or the user would need to be trained, or both. Second, the application will simply not work well for some users. Third, and perhaps most fundamentally, when your application is rich in data, how do you communicate the context appropriately to the user? The system must understand the nuances of meaning in the data before it can decide which particular pieces are the ones that will most effectively help the end user make decisions and proceed. We are so far away from this level of sophistication in our understanding of even rudimentary data systems that I can’t imagine how long it will take to get there. At least, what – 20 years, right?
In the absence of a system that truly understands the world and is able to convey that understanding to you – that is, until the semantic web is finished, easily 20 years from now – we must convey information to the user in a format that is compelling, as complete as possible yet not too cluttered, and able to show small differences while still covering a huge range of possible values. User interface design as a science is still in its infancy. There are a lot of good ideas out there, but I am constantly amazed at the kinds of frustrations I encounter each day.
Take Windows 7, for example. Great system, clearly better than Vista, but then again so is Windows 3.1. On a Windows 7 system, if you hold down Alt+Tab, you can move your mouse over the list of icons that appears in the center of your window. When you hover over an icon the application it represents is brought to the front and all the others are minimized. This seems like a good idea but I find it vaguely unsatisfying in ways I can’t easily articulate. I suppose the best way to describe it is that it is a jarring experience. I feel like it should be speeding me up, but I am uncomfortable enough that it feels like it is slowing me down. But without a doubt it is lightning fast compared to a speech recognition interface.
