For now, the system can only show three classes of sounds: applause, music and laughter. “These were among the most frequent manually captioned sounds, and they can add meaningful context for viewers who are deaf and hard of hearing,” the company wrote.
As with the automatic captions, Google uses machine learning to pick out sounds and display them as text. It developed a “deep neural network (DNN)” model for ambient sound, and trained it with “thousands of hours of videos” to get the best results. The toughest part, it wrote in a technical blog, was separating and displaying events that tend to occur at the same time, like laughter and applause.
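That overlap problem is what makes this a multi-label task rather than a simple pick-one classification: a frame of audio can contain laughter and applause at once. A minimal sketch of that idea, assuming hypothetical per-frame logits (this is not YouTube's actual model or code, just an illustration of independent per-class thresholding):

```python
import math

# Hypothetical class list and per-frame logits from an ambient-sound
# classifier -- the names and values here are illustrative assumptions.
CLASSES = ["applause", "music", "laughter"]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def detect_events(logits, threshold=0.5):
    """Multi-label detection: each class is thresholded independently,
    so simultaneous events (e.g. laughter AND applause) can both fire --
    something a single softmax "pick one class" output could not express."""
    probs = [sigmoid(z) for z in logits]
    return [c for c, p in zip(CLASSES, probs) if p >= threshold]

# A frame where the crowd laughs and claps at the same time:
print(detect_events([2.1, -1.5, 1.8]))  # ['applause', 'laughter']
```

The design point is the sigmoid-per-class output: with a softmax, the model would be forced to rank laughter against applause instead of reporting both.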
You can see what that looks like in the clip from America’s Got Talent below. The sound effects are merged with the automatic speech recognition and “shown as part of the standard automatic captions,” much as you’d see in a closed-captioned TV show.
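Conceptually, that merge just interleaves two timed streams, speech text and bracketed sound labels, into one caption track. A rough sketch under assumed data shapes (these tuples are not YouTube's internal format):

```python
# Hypothetical merge of ASR captions with detected sound events into a
# single caption stream, in the bracketed style of closed-captioned TV.

def merge_captions(speech, sounds):
    """speech and sounds are lists of (start_seconds, text) tuples.
    Sound labels are rendered in brackets, then both streams are
    interleaved in time order."""
    track = list(speech)
    track += [(t, f"[{label.upper()}]") for t, label in sounds]
    return [text for _, text in sorted(track)]

captions = merge_captions(
    speech=[(0.0, "Welcome back to the show."), (4.0, "Our next act...")],
    sounds=[(2.5, "applause"), (5.0, "laughter")],
)
print(captions)
# ['Welcome back to the show.', '[APPLAUSE]', 'Our next act...', '[LAUGHTER]']
```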
YouTube’s team said it’s aware that the captions are “simplistic,” but adding features will be easier now that it has built a solid back-end foundation. In the future, it’ll introduce common sounds like barking, knocking or ringing. That will pose new challenges, as the AI will need to figure out whether a ringing sound is coming from an alarm, phone or doorbell, for example.
It’ll be worth the effort, though, as Google says that two-thirds of participants in a study found that sound effect captions enhance the video experience. And while the system is bound to make mistakes no matter how good it gets (even humans are only about 95 percent accurate), participants felt that the odd error wouldn’t detract from the benefits.