We're on the Brink of a Revolution in Crazy-Smart Digital Assistants


Here's a quick story you've probably heard before, followed by one you probably haven't. In 1979 a young Steve Jobs paid a visit to Xerox PARC, the legendary R&D lab in Palo Alto, California, and witnessed a demonstration of something now called the graphical user interface. An engineer from PARC used a prototype mouse to navigate a computer screen studded with icons, drop-down menus, and windows that overlapped each other like sheets of paper on a desktop. It was unlike anything Jobs had seen before, and he was beside himself. "Within 10 minutes," he would later say, "it was so obvious that every computer would work this way someday."

As legend has it, Jobs raced back to Apple and commanded a team to set about replicating and improving on what he had just seen at PARC. And with that, personal computing sprinted off in the direction it has been traveling for the past 40 years, from the first Macintosh all the way up to the iPhone. This visual mode of computing ended the tyranny of the command line (the demanding, text-heavy interface that was dominant at the time) and brought us into a world where vastly more people could use computers. They could just point, click, and drag.

In the not-so-distant future, though, we may look back at this as the wrong PARC-related creation myth to get excited about. At the time of Jobs' visit, a separate team at PARC was working on a completely different model of human-computer interaction, today called the conversational user interface. These scientists envisioned a world, probably decades away, in which computers would be so powerful that requiring users to memorize a special set of commands or workflows for each action and device would be impractical. They imagined that we would instead work collaboratively with our computers, engaging in a running back-and-forth dialog to get things done. The interface would be ordinary human language.

Pipe Down, Jarvis

One of the scientists in that group was a guy named Ron Kaplan, who today is a stout, soft-spoken man with a gray goatee and thinning hair. Kaplan is equal parts linguist, psychologist, and computer scientist, a guy as likely to invoke Chomsky's theories about the construction of language as he is Moore's law. He says that his team got pretty far in sketching out one crucial component of a working conversational user interface back in the '70s; they rigged up a system that allowed you to book flights by exchanging typed messages with a computer in normal, unencumbered English. But the technology just wasn't there to make the system work on a large scale. "It would've cost, I don't know, a million dollars a user," he says. They needed faster, more distributed processing and smarter, more efficient computers. Kaplan thought it would take about 15 years.

"Forty years later," Kaplan says, "we're ready." And so is the rest of the world, it turns out.

Today, Kaplan is a vice president and distinguished scientist at Nuance Communications, which has become probably the biggest player in the voice interface business: It powers Ford's in-car Sync system, was critical in Siri's development, and has partnerships across nearly every industry. But Nuance finds itself in a crowded marketplace these days. Nearly every major tech company, from Amazon to Intel to Microsoft to Google, is chasing the sort of conversational user interface that Kaplan and his colleagues at PARC imagined decades ago. Dozens of startups are in the game too. All are scrambling to come out on top in the midst of a powerful shift under way in our relationship with technology. One day soon, these companies believe, you will talk to your gadgets the way you talk to your friends. And your gadgets will talk back. They will be able to hear what you say and figure out what you mean.

If you're already steeped in today's technology, these new tools will extend the reach of your digital life into places and situations where the graphical user interface cannot safely, pleasantly, or politely go. And the increasingly conversational nature of your back-and-forth with your devices will make your relationship to technology even more intimate, more loyal, more personal.

But the biggest effect of this shift will be felt well outside Silicon Valley's core audience. What Steve Jobs saw in the graphical user interface back in 1979 was a way to expand the popular market for computers. But even the GUI still left huge numbers of people outside the light of the electronic campfire. As elegant and efficient as it is, the GUI still requires humans to learn a computer's language. Now computers are finally learning how to speak ours. In the bargain, hundreds of millions more people could gain newfound access to tech.

Voice interfaces have been around for years, but let's face it: Thus far, they've been pretty dumb. We need not dwell on the indignities of automated phone trees ("If you're calling to make a payment, say 'payment'"). Even our more sophisticated voice interfaces have relied on speech but somehow missed the power of language. Ask Google Now for the population of New York City and it obliges. Ask for the location of the Empire State Building: good to go. But go one logical step further and ask for the population of the city that contains the Empire State Building and it falters. Push Siri too hard and the assistant just refers you to a Google search. Anyone reared on scenes of Captain Kirk talking to the Enterprise's computer or of Tony Stark bantering with Jarvis can't help but be perpetually disappointed.

Ask around Silicon Valley these days, though, and you hear the same refrain over and over: "It's different now."

One hot day in early June, Keyvan Mohajer, CEO of SoundHound, shows me a prototype of a new app that his company has been working on in secret for almost 10 years. You may recognize SoundHound as the name of a popular music-recognition app, the one that can identify a tune for you if you hum it into your phone. It turns out that app was largely just a way of fueling Mohajer's real dream: to create the best voice-based artificial-intelligence assistant in the world.

The prototype is called Hound, and it's pretty incredible. Holding a black Nexus 5 smartphone, Mohajer taps a blue and white microphone icon and begins asking questions. He starts simply, asking for the time in Berlin and the population of Japan. Basic search-result stuff, followed by a twist: "What is the distance between them?" The app understands the context and fires back, "About 5,536 miles."

Then Mohajer gets rolling, smiling as he rattles off a barrage of questions that keep escalating in complexity. He asks Hound to calculate the monthly mortgage payments on a million-dollar home, and the app immediately asks him for the interest rate and the term of the loan before dishing out its answer: $4,270.84.
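The mortgage answer depends on exactly the two values the app asked for. A minimal Python sketch of the standard fixed-rate amortization formula shows why. The function name and the sample rate and term are assumptions for illustration; the values Mohajer supplied are not given here, and nothing below reflects how Hound itself computes the figure.

```python
def monthly_payment(principal: float, annual_rate: float, years: int) -> float:
    """Fixed-rate monthly payment from the standard amortization formula.

    annual_rate is a fraction (0.04 means 4 percent), compounded monthly.
    """
    n = years * 12          # total number of monthly payments
    r = annual_rate / 12    # periodic (monthly) interest rate
    if r == 0:
        return principal / n  # no interest: split the principal evenly
    growth = (1 + r) ** n
    return principal * r * growth / (growth - 1)


# Hypothetical inputs: a 30-year loan at 4 percent on a million-dollar home.
print(round(monthly_payment(1_000_000, 0.04, 30), 2))
```

Without a rate and a term the formula has no single answer, which is why a follow-up question is the sensible response rather than a guess.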

"What is the population of the capital of the country in which the Space Needle is located?" he asks. Hound figures out that Mohajer is fishing for the population of Washington, DC, faster than I do and spits out the correct answer in its rapid-fire robotic voice. "What is the population and capital for Japan and China, and their areas in square miles and square kilometers? And also tell me how many people live in India, and what is the area code for Germany, France, and Italy?" Mohajer would keep on adding questions, but he runs out of breath. I'll spare you the minute-long response, but Hound answers every question. Correctly.
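The Space Needle question is hard because it is nested: each answer becomes the subject of the next lookup. A toy Python sketch of that chaining follows. The `FACTS` table and the `resolve` function are hypothetical, built only for illustration, and say nothing about how Hound is actually implemented.

```python
from typing import Dict, List, Tuple

# Tiny hand-built fact table (hypothetical); a real assistant would draw on
# a far larger knowledge base.
FACTS: Dict[Tuple[str, str], str] = {
    ("Space Needle", "country"): "United States",
    ("United States", "capital"): "Washington, DC",
    ("Empire State Building", "city"): "New York City",
}


def resolve(entity: str, relations: List[str]) -> str:
    """Follow the relations in order, feeding each answer into the next lookup."""
    current = entity
    for relation in relations:
        key = (current, relation)
        if key not in FACTS:
            raise KeyError(f"no fact for {relation!r} of {current!r}")
        current = FACTS[key]
    return current


# "The capital of the country in which the Space Needle is located"
print(resolve("Space Needle", ["country", "capital"]))
```

Once the chain resolves to a single city, the outer question about its population is an ordinary one-step lookup, the kind the older assistants already handle.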

Hound, which is now in beta, is probably the fastest and most versatile voice recognition system unveiled thus far. It has an edge for now because it can do speech recognition and natural language processing simultaneously. But really, it's only a matter of time before other systems catch up.

After all, the underlying ingredients (what Kaplan calls the "gating technologies" necessary for a strong conversational interface) are all pretty much available now to whoever's buying. It's a classic story of technological convergence: Advances in processing power, speech recognition, mobile connectivity, cloud computing, and neural networks have all surged to a critical mass at roughly the same time. These tools are finally good enough, cheap enough, and accessible enough to make the conversational interface real, and ubiquitous.

But it's not just that conversational technology is finally possible to build. There's also a growing need for it. As more devices come online, particularly those without screens (your light fixtures, your smoke alarm), we need a way to interact with them that doesn't require buttons, menus, and icons.

When I started using Alexa late last year, I discovered it could tell me the weather, answer basic factual questions, create shopping lists that later appear in text on my smartphone, play music on command: nothing too transcendent. But Alexa quickly grew smarter and better. It got familiar with my voice, learned funnier jokes, and started being able to run multiple timers simultaneously (which is pretty handy when your cooking gets a little ambitious). In just the seven months between its initial beta launch and its public release in 2015, Alexa went from cute but infuriating to genuinely, consistently useful. I got to know it, and it got to know me.

This gets at a deeper truth about conversational tech: You only discover its capabilities in the course of a personal relationship with it. The big players in the industry all realize this and are trying to give their assistants the right balance of personality, charm, and respectful distance, to make them, in short, likable. In developing Cortana, for instance, Microsoft brought in the videogame studio behind Halo (which inspired the name Cortana in the first place) to turn a disembodied voice into a kind of character. "That wittiness and that toughness come through," says Mike Calcagno, director of Cortana's engineering team. And they seem to have had the desired effect: Even in its early days, when Cortana was unreliable, unhelpful, and dumb, people got attached to it.

There's a strategic reason for this charm offensive. In their research, Microsoft, Nuance, and others have all come to the same conclusion: A great conversational agent is only fully useful when it's everywhere, when it can get to know you in multiple contexts, learning your habits, your likes and dislikes, your routine and schedule. The way to get there is to have your AI colonize as many apps and devices as possible.

To that end, Amazon, Google, Microsoft, Nuance, and SoundHound are all offering their conversational platform technology to developers everywhere. The companies know that you are liable to stick with the conversational agent that knows you best. So get ready to meet some new disembodied voices. Once you pick one, you might never break up.

David Pierce (@piercedavid) is a senior writer at WIRED.

Read more: https://www.wired.com/2015/09/voice-interface-ios/