Microsoft researchers from the Speech & Dialog research group include, from back left, Wayne Xiong, Geoffrey Zweig, Xuedong Huang, Dong Yu, Frank Seide, Mike Seltzer, Jasha Droppo and Andreas Stolcke. (Photo by Dan DeLong)

Microsoft announced that its speech recognition technology has achieved a word error rate (WER) of only 5.9%, which the company said is on par with what human transcribers achieve.
Historic Achievement In Word Error Rates
The company also said that this milestone has been sought for decades, since the early 1970s, when DARPA funded speech recognition research in the interest of national security.
Microsoft has improved its speech recognition technology steadily; just last month it reached a WER of 6.3%, not far from the 5.9% it achieved this month. The 5.9% milestone carries far more significance, however, because it matches the error rate of human transcribers, and it's the first time any company has reached it.
Human-Level WER, But Achieved Differently
Microsoft is right that reaching this word error rate is a significant milestone. However, just as CPU benchmarks that return a single overall score don't tell you the whole story about a chip's performance, neither does the "Switchboard" (SWB) benchmark Microsoft used to compare its software against human transcribers.
As you can see in the table below, taken from Microsoft’s paper, the overall WER may be exactly the same for humans and the company’s automatic speech recognition (ASR) system, but it’s quite different when you look deeper. The deletion rate is significantly smaller for the ASR system compared to humans; for substitution, the situation reverses.
Overall substitution, deletion and insertion rates (from Microsoft's "Achieving Human Parity in Conversational Speech Recognition")

"Substitution" in this case refers to a spoken word being replaced with a different word in the transcript. "Deletion" refers to a spoken word being left out of the transcript entirely, and "insertion" to a word appearing in the transcript that was never spoken.
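To make these categories concrete, here is a minimal sketch (not Microsoft's code; the function name and example sentences are our own) of how WER is typically computed: the reference transcript is aligned against the system's hypothesis using edit distance, the three error types are counted from that alignment, and WER = (S + D + I) / N, where N is the number of reference words.

```python
def wer(reference: str, hypothesis: str):
    """Count substitutions, deletions, insertions, and compute WER."""
    ref = reference.split()
    hyp = hypothesis.split()
    n, m = len(ref), len(hyp)

    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i          # deleting every remaining reference word
    for j in range(m + 1):
        d[0][j] = j          # inserting every remaining hypothesis word
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if ref[i - 1] == hyp[j - 1]:
                d[i][j] = d[i - 1][j - 1]        # match, no cost
            else:
                d[i][j] = 1 + min(
                    d[i - 1][j - 1],  # substitution
                    d[i - 1][j],      # deletion
                    d[i][j - 1],      # insertion
                )

    # Walk back through the table to attribute each error to a type.
    subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and ref[i - 1] == hyp[j - 1]:
            i, j = i - 1, j - 1                  # correct word
        elif i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + 1:
            subs += 1
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1

    return subs, dels, ins, (subs + dels + ins) / n

# Example: "uh-huh" transcribed as "uh" counts as one substitution.
print(wer("uh-huh i see what you mean", "uh i see what you mean"))
# -> (1, 0, 0, 0.1666...) i.e., roughly 16.7% WER on this tiny sample
```

This also illustrates why two systems can share the same overall WER while differing in the table above: the single number lumps all three error types together, so a transcript heavy on deletions and one heavy on substitutions can score identically.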
In another conversational telephone speech benchmark, CallHome (CH), the ASR system makes significantly more substitutions and insertions than humans, but fewer deletions. The overall word error rate is again similar (11.1% for the ASR system versus 11.3% for human transcribers), although it's higher than in the Switchboard test for both the ASR system and the human transcribers.
WER Parity, Not True Human Parity
Even if the word error rates were identical in every way, that still wouldn't mean machine speech recognition is as good as human transcription. Even when the number of word errors a machine makes is on par with humans, the machine can still make significantly different ones. Sentences transcribed by a machine could therefore be much more confusing to human readers than human-transcribed ones, even at the same error rate.
For instance, Microsoft's paper also noted that the ASR system confused "backchannel" words such as "uh-huh," which signal acknowledgment of what the other speaker is saying, with hesitations such as "uh," which mark a pause before continuing to speak. Humans don't make these mistakes because they intuitively know what these spoken sounds represent.
Speech Recognition Keeps Getting Better
Human speech recognition isn't perfect either, as the Switchboard and CallHome benchmarks show. Machine learning-based speech recognition may not yet be quite as good as humans in real-world usage, but the fact that word error rates are now similar means that speech recognition software is getting close to achieving true human parity, or even to surpassing humans at recognizing speech.