Giving Technology A Voice (part 2) – Speech recognition
This week, I’ll be discussing how speech recognition can be used to control computer technology. It’s nothing new, but it is often misunderstood, and as it’s being rolled out to different computer platforms with differing success and considerations, we thought we’d better take a fresh look.
A quick history lesson
Speech recognition software has been around for longer than many may realise. The idea of speaking to a computer and having the speech translated into text for letters, or used to control the computer, is not a new one. Speech recognition programs have been commercially available since the mid-1990s.
One of the biggest advances over the years has been in accuracy, and in how much training a person needs to go through to get results.
I remember a friend of mine trying to train a speech recognition program on his Apple Mac in the mid-1990s. The process was supposed to take around 4 hours and involved repeating a long list of words, one at a time, parrot fashion, 3 times for each word.
The reason he went down this route is that he’d lost the use of one of his arms and therefore could only type with one hand. After 2 hours of ‘house, house, house’ click next, ‘car, car, car’ click next, ‘hello, hello, hello’ my friend said “*@!£/ &*%*, I’ll stick with one hand!”.
Fast forward 5 years or so and through my job, I’d be helping others to train computers with newer versions of speech recognition software. The 4 hours had turned into about 30 minutes, enabling most to get through to the point of actually using it to type and control their computers. At this point, they would say “*@#*/ &*%*!” as the software started to type text that had no relation to what was said.
It took a few years before speech recognition started to work with any decent degree of accuracy. Whether it’s worth using tends to be measured in time saved: the software needs a relatively high accuracy level before it becomes a worthwhile venture over typing. People who can’t type by hand due to a disability will tend to accept a much lower level, as typing by hand is not an option for them.
Is it as good as it gets?
There have been various constraints on a computer’s ability to achieve a high level of accuracy.
The first is the prowess of the programmer: the logic that pieces together the clues about what is being said and weighs them against what the computer has learnt.
The next constraint has been the computing power and quality of the systems available. For speech to be recognised, the sound needs to be of a high quality. For the speech to be processed in ‘real-time’, the computer’s CPU (brain, if you like) needs to be fast and the computer needs lots of memory. Early systems would spend so much time working out a single sentence that the time-lag meant you were on page 2 of dictating before you saw the results of that sentence. It just wasn’t a practical experience at all.
Now these problems have been ironed out. Even your entry-level computers will happily process speech. Not only slapping text onto the screen with relatively good results, but you can control your computer as well. Open programs, copy and paste text, switch between programs, print documents and much more, just by using your voice.
The above has been written with a computer running Windows in mind, as that (along with Apple’s Mac OS) has been my experience for the past years. However, things aren’t that simple any more, are they?
Since the iPad came out 4 years ago in 2010, we have had an influx of tablets and smartphones swamp the market, and all of them now boast speech recognition capabilities. Let’s forget about the makes and models. The main thing to look at is the operating system they run on. Here’s a list (but by no means an exhaustive one):
- Windows tablets – (Microsoft Slate among others)
- iOS – (iPad, iPod Touch, iPhone)
- Android – Samsung Galaxy, Google Nexus
- Windows Phone – Nokia phones
The term ‘ON-BOARD’ is a vital one here, especially when comparing speech recognition across the platforms (operating systems). Windows tablets, laptops and desktops will process speech using their own processors on their own circuit boards. All the other platforms (iOS, Android and Windows Phone (not to be confused with Windows)) will send the speech they have recorded off over the Internet, to be processed by another, big and powerful machine.
So, if we had the 4 different devices using the 4 different platforms side-by-side, without a Wi-Fi or mobile data signal, then the Windows tablet would be happy, but the others would throw a tantrum, as they can’t process the speech themselves.
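The side-by-side test above can be boiled down to a few lines of deliberately simplified logic. A minimal Python sketch, in which the platform names and the `pick_engine()` function are my own illustration rather than any real speech API:

```python
def pick_engine(platform, has_connection):
    """Return which engine can handle the speech, or None if the device is stuck.

    A Windows tablet/laptop/desktop processes speech on its own
    processor, so it never needs a connection. iOS, Android and
    Windows Phone record the speech and send it to a powerful
    server over the Internet, so without a signal they're stuck.
    """
    if platform == "windows":
        return "on-board"
    if platform in ("ios", "android", "windows-phone"):
        return "cloud" if has_connection else None
    raise ValueError("unknown platform: " + platform)

# The four devices side by side, with no Wi-Fi or mobile data signal:
for p in ("windows", "ios", "android", "windows-phone"):
    print(p, "->", pick_engine(p, has_connection=False))
```

Only the Windows device comes back with an answer when the connection drops; the other three return nothing at all, which is the ‘tantrum’ described above.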
Windows tablets are the easiest to talk about. Over the past couple of years, tablet computers have been released with full versions of Windows on them. They are, in essence, fully fledged Windows computers. They have USB ports, so you can plug keyboards, mice and so on into them, and most importantly, they run the same software as a desktop or laptop PC; they are just lighter, thinner and have a touchscreen. Therefore, the points made in the previous section hold true.
Windows tablets are workhorses with a racehorse look and feel to them. They have to be powerful to run Windows, which is a powerful and greedy operating system. But that same power will process speech ‘on-board’.
iOS – Apple decided they needed to give their speech recognition engine a name. Say hello to Siri. Siri, like the non-Windows alternatives, needs a data connection to work. It’s pretty good for some and lame for others. Unlike Windows speech recognition, you aren’t asked to (or given the option to) train Siri to your voice. This has the positive point of being usable straight away, and by anyone, not just the person who trained it. However, it does make it less accurate. In the United Kingdom, regional accents hugely affect how some words are pronounced.
Android – Android doesn’t seem to have given its speech recognition a name. How well speech is supported may depend on the device. Our Galaxy Tab is stuck on Android 4.1 and can’t be upgraded due to the amount of Samsung add-ons that come with it. Our Nexus, on the other hand, works much better as it can be upgraded, and we are just waiting to lick the Lollipop that is Android 5.
As long as your contacts are integrated with your Google account, then you can email away. However, the star in the pack is the Google search integration. Many search results are read back to you, as long as you don’t wish to go into too much detail; otherwise you will just be presented with a page of search results.
Here’s an example: saying “what’s the population of Sweden?”, I will be told verbally that “the population in Sweden in 2013 was 9.593 million”. However, if I ask “what was the population in Sweden in 2000?” I will be shown a page of results “here is information from Wikipedia” and it displays on the screen that the population was 8.872 million. Not bad, just missing the verbal feedback.
Windows Phone – The Windows Phone operating system must not be confused with the Windows that is installed on tablets, desktops and laptops. Even though Windows Phone is now on version 8.1 and Windows (Pro and RT) is also on version 8.1, they are not the same. Putting the word ‘Phone’ after the word Windows means a lot – they may look a bit similar, but they’re not compatible.
With that cleared up, Windows Phone’s speech recognition is called Cortana, apparently named after the AI character in the game Halo (as my colleague informs me). Cortana can set reminders, take notes, email and text people, as well as carrying out web searches.
A twist in the tale is that Cortana will only work in the UK if you change the language and location settings to English (US). This doesn’t really have any major drawbacks, apart from the keyboard leaning towards the American .com, with .co.uk not available. When I ask what the weather will be like, I’m told in Fahrenheit and not Celsius (unless I ask specifically).
A weird twist in the twist in the tale is that Cortana seems to struggle with my English (Cheshire) accent. I became really frustrated with her and put on a really fake American accent – it worked perfectly. When I crucify the American accent, Cortana and I get on like a house on fire.
I really can’t get over the accuracy of the Google speech interface. Taking into account that this is without the need for training, the results are truly awesome!
If you need a speech solution for controlling your device from start to finish, and one that won’t go on strike if the Internet’s down, then a Windows Pro tablet is really the way to go. Yes, you have to train it, though it’ll do almost anything you wish to achieve with a computer. Plus, it’s a computer. Not just a mobile surfing, mailing, tweeting machine – it’s everything in one. It’ll do your work for you as well (you need to do the thinking though, technology’s not that good yet).
Who knows? I reckon that if you could mix Google’s recognition with Microsoft Windows’ global use, then you’ve got a winner. Let’s face it, at the moment a Star Trek episode would not have the same impact if Commander Riker had to change user accounts before the ship’s computer would recognise him, and the ship’s computer would look a bit weak if it refused to put the shields up because it couldn’t process the command due to them being too far away from Starbase 9.
If a tablet or phone had the guts to process speech on its own and could effectively recognise anyone’s speech without each person having to go through a training session, then speech recognition would’ve reached warp 9 with phasers on STUNNING!
Still to come…
Next week I’ll be detailing the considerations that the various speech platforms pose against various disability types.