Featured Article

Featured Article from The ATA Chronicle (May 2012)

Automated Speech Recognition: Translator Friend or Foe?

By Hassan Sawaf and Jonathan Litchman

While automated speech recognition (ASR) was first developed decades ago, the commercial success of Apple’s virtual personal assistant Siri and the trivia success of IBM’s Watson on Jeopardy! in 2011 have demonstrated significant breakthroughs in the field.1

Integrated with machine translation (MT), ASR helps produce speech-to-speech and speech-to-text translation in near real time. The most advanced translation systems already integrate the technologies on one platform.
It does not take much imagination to see that this technology will affect the translation industry. However, just how the industry will be affected has yet to be fully defined. By taking a closer look at the technology, the challenges, and the likely applications of ASR, a clearer picture of its impact begins to emerge. Ultimately, this new form of translation automation will provide more opportunity for translators by generating additional content requiring human translation.

Evolution of ASR
ASR’s roots go back to the 1970s and the Defense Advanced Research Projects Agency project “SUR.”2 The project produced a system that could recognize about 1,000 words. Many of us are familiar with the modern versions of this technology. For example, we have all been frustrated by the simple systems used by airlines or banks: “If you would like to speak to a customer representative, say ‘representative.’” These systems understand simple commands and have a limited vocabulary.

Today, a new wave of ASR and natural language understanding technology is able to gather more context around data to determine the most likely output. This breakthrough came about in large part from the growth in electronically translated data and a vast increase in computational power.

ASR systems can now “think” and interpret the user’s meaning, rather than simply recognize and transcribe what is said. These systems can also be programmed to act on this meaning—whether it is to provide map directions to the nearest grocery store, score a win on Jeopardy!, or translate from one language to another.

ASR Systems: How Do They Stack Up?

At first glance, it would appear that judging ASR systems is fairly straightforward. ASR technology is typically measured along two metrics: speed and accuracy. Speed is measured by the system’s “real-time factor,” or the time it takes the computer to process what was spoken relative to the length of the recording. Accuracy is usually measured by the word error rate (WER), defined as the number of insertions, deletions, and substitutions divided by the total number of words in the reference transcript.
Unfortunately, getting a true measurement on ASR systems is not that simple. The main reason is that many variables influence the quality of ASR. These include vocabulary size, the speaker, fluency/spontaneity of speech, background noise, and whether or not multiple languages are being spoken. Systems will be faster and more accurate depending on how they are employed. The more tailored a system is for a specific task, the better.
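
For readers who want to see how the two metrics above are computed, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the scoring method of any particular ASR product: the function names are hypothetical, words are split on whitespace with no normalization of case or punctuation, WER is taken relative to the number of words in the human reference transcript, and the real-time factor is taken as processing time divided by audio duration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (insertions + deletions + substitutions) / words in reference.

    Assumption: words are whitespace-separated and compared exactly.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the minimum number of edits needed to turn the
    # reference words seen so far into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word was missed
                curr[j - 1] + 1,     # insertion: extra word was recognized
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (assumed definition)."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


# One substituted word out of four reference words.
print(word_error_rate("go to the bank", "go to a bank"))  # 0.25
# Five seconds of processing for ten seconds of speech.
print(real_time_factor(5.0, 10.0))  # 0.5
```

Under this definition, a real-time factor below 1.0 means the system keeps up with the speaker. Note that the WER can exceed 1.0 when the recognizer inserts many extra words, which is one more reason a single accuracy figure can mislead.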

An ASR system that performs well with a large vocabulary and continuous, spontaneous speech is ideal for translation and interpreting. Language services providers may want systems that work independently of a speaker so that a wide variety of people can use them effectively, whereas freelance translators may want something tailored to their clients’ specific speech patterns or their own. Keep in mind that accuracy statistics and other forms of ASR measurement can be incredibly misleading. Be sure to adopt technology specifically tailored to the type of task you are doing.

Where ASR and MT Integration Falls Short
As recently as the 1990s, ASR and MT were two very divergent fields of study. Integrating the two technologies was the logical next step as scientists and academics realized speech was central to language and human communication.

However, when translation enters the picture, the obstacles facing ASR multiply exponentially. The largest challenge by far is a lack of context. For example, if the user would like to translate an English sentence into Arabic, the system does not know if it is addressing a single male or a group of females. This knowledge is critical for the machine to translate the sentence correctly, as some languages use gender and number to generate the correct form of a term. An unwitting mistake can offend the addressee, as the linguistic differentiators have significant cultural meaning.

A lack of context is also to blame for the classic automatic translation mistake, when a word has multiple meanings and the machine does not know which to select. For example, in translating “Go to the bank,” the user could be referring to the financial institution or the river, depending on his location or whether or not he has a fishing pole or a checkbook in hand. If the system is customized specifically for financial institutions, then the potential for error is significantly reduced.

Hybrid MT systems, which combine a rule-based and statistical approach into a single engine, are frequently the best systems for integrating with ASR. This is because these systems have the highest rates of accuracy, especially on noisy data (speech and inaccurate text input), and also because they are able to do more with less. Hybrid systems can translate phrases or sentence fragments, and they can be developed for new languages faster and with less training data. Systems that fully integrate ASR and MT also prevent the error that occurs when one technology uses a word that is not contained in the other’s lexicon.
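
The lexicon mismatch described above can be illustrated with a short Python sketch. It is a toy example, not how any integrated system is actually built: the function name and the word lists are hypothetical, and each lexicon is assumed to be a plain list of words compared without regard to case.

```python
def out_of_lexicon_words(asr_vocabulary, mt_lexicon):
    """Return words the recognizer can output but the translator cannot look up.

    Assumption: both lexicons are simple word lists, compared case-insensitively.
    """
    asr_words = {word.lower() for word in asr_vocabulary}
    mt_words = {word.lower() for word in mt_lexicon}
    return sorted(asr_words - mt_words)


# Hypothetical word lists for two separately built components.
asr_vocabulary = ["bank", "river", "checkbook"]
mt_lexicon = ["Bank", "river"]

print(out_of_lexicon_words(asr_vocabulary, mt_lexicon))  # ['checkbook']
```

When the two components are built separately, every word in that list is a potential translation failure. A fully integrated system avoids the problem by sharing one lexicon.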

The Bottom Line: Big Data Means Big Opportunity
This overview of ASR technology has attempted to show that automatic translation systems with speech-to-speech and speech-to-text capability must be tailored for high accuracy levels and are sensitive to incorrect use.
What this means for future applications of the technology is that, while it will be an incredibly helpful tool for multilingual communication on an informal level, an “automated interpreter” will not likely be used in situations where human interpreters are currently employed. Also, since ASR is measured in its proximity to “real time” and finds continuous, spontaneous speech more difficult to process, multilingual document processing is not likely to replace translators in the near future. However, while ASR will not replace human translators or interpreters, it will have a significant effect on the industry.

The Internet’s large amount of video and audio data drove the integration of ASR and MT technologies, and is also largely responsible for the growth of data and multilingual communications requiring translation. According to IBM, 90% of data in the world today was created in the past two years.3 Social media use has grown as well. It took three years, two months, and one day to reach one billion Tweets.4 Now, there are a billion Tweets sent every week. In other words, the rapid growth in data will create a larger demand for human translators to post-edit and analyze the extra data output by ASR translation systems that would not have existed previously.
This is one of the reasons the language services industry is projected to grow so robustly even in the current global economic environment. In fact, the global market for outsourced language services and technology was predicted to reach $31.4 billion in 2011, and the demand for language services continues to grow at a significant rate—7.4% annually.5

Not every translator wants to do post-editing work, but it does provide tremendous opportunity and freedom for translators to further define and specialize their role in the industry and within their career. Big data means big opportunity for translators and the language services industry.

1. “Watson, One Year Later,”

2. “Speech Recognition: The Evolution,”

3. “Bringing Big Data to the Enterprise,”   

4. “Twitter Blog,”

5. “Facts and Figures: Language Services Market,” Common Sense Advisory,


Jonathan Litchman is a senior vice-president at SAIC and leads the company’s Linguistics and Cultural Intelligence Operation. He has degrees from Emory and Johns Hopkins Universities. Contact:

Hassan Sawaf is the chief scientist for SAIC’s Linguistics and Cultural Intelligence Operation. He completed his doctorate studies in computer science, with a specialization in translation and information extraction from speech and text, at the University of Aachen, Germany. Contact: