
Captioning online videos without the headache

Ian Firth, VP of Products, Speechmatics

Video consumption has risen significantly in recent years, and that growth has only accelerated during the Covid-19 pandemic. With the rising popularity of video content, brands are investing more in their video content strategies, with online video ad spend forecast to reach $61 billion by 2021.

In 2019, video content accounted for 80% of all consumer Internet traffic. Video has, understandably, become a powerful medium for brands to engage with audiences online. This has been reinforced during the pandemic, with records broken for the number of people consuming video content online, according to the Internet Advertising Bureau UK's (IAB UK) Video Steering Group.

For content creators to make the most of this investment, they need to be able to leverage their video data, not only to optimise audience engagement but also to comply with Federal Communications Commission (FCC) regulations. The FCC requires online video clips taken from live and near-live TV programming to be captioned, but doing so quickly and accurately can be challenging.

However, innovation in voice technology has opened up opportunities for content creators to convert video data into text, both to optimise their videos and to provide accurate captioning.


The problem with search 

Transcribing video data into text not only makes content more accessible, it also helps it rank higher on search engines. This is because search engines cannot process the visual or auditory information in videos; they still rely on text to rank content. The same is true for social media platforms, so captions and metadata tags that describe the content of videos offer rich datasets that algorithms can recognise, maximising the reach of videos on social media.
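To make the point concrete, here is a minimal sketch of how a transcript can be exposed to search engines as structured data. It assumes the schema.org VideoObject vocabulary (including its transcript property) and uses illustrative titles, dates and transcript text; it is not a prescription from the article itself.

```python
import json

def video_jsonld(title, description, upload_date, transcript_text):
    """Build schema.org VideoObject markup (JSON-LD) so the spoken content
    of a video is available to crawlers as indexable text."""
    return json.dumps({
        "@context": "https://schema.org",
        "@type": "VideoObject",
        "name": title,
        "description": description,
        "uploadDate": upload_date,
        # The transcript surfaces the audio as plain text, which crawlers
        # can read even though they cannot "watch" or "listen to" the video.
        "transcript": transcript_text,
    }, indent=2)

# Illustrative values only; embed the result in a
# <script type="application/ld+json"> tag on the video page.
print(video_jsonld(
    "Captioning online videos without the headache",
    "A short talk on automatic captioning.",
    "2020-09-01",
    "Video consumption has risen significantly in recent years...",
))
```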

Another challenge with video content is that it neglects people who are at a disadvantage, such as those with a hearing disability. According to the World Health Organisation, more than 466 million people across the world are deaf or hard of hearing, and online video content is not accessible to them without closed captions. For video content to reach this wider audience, the information needs to be converted into text. The challenge is in actually doing that quickly and accurately.


The need for better captioning tech

Despite speech data from video being valuable, it is often neglected because of the time and resources it takes to convert into text. Manual transcription can take hours and is very costly, especially given that more than 500 million hours of video are watched on YouTube each day. This isn't scalable, and manual tasks like this are prone to human error, which can result in customer complaints about inaccurate captions.

This doesn’t need to be a challenge. Rather than wasting time replaying videos and manually tagging captions and subtitles, content creators can use any-context speech recognition technology to caption videos automatically. They can then leverage captioned soundbites from those videos when posting on social media to capture the attention of key audiences, potentially driving more traffic to websites and promoting brands to targeted stakeholders.
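As a rough illustration of the automated step, the sketch below turns word-level ASR output into SRT caption blocks. The input format (a list of dicts with "text", "start" and "end" keys) is a hypothetical, generic structure, not any specific vendor's API response.

```python
from datetime import timedelta

def to_srt_time(seconds):
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(timedelta(seconds=seconds).total_seconds() * 1000)
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def words_to_srt(words, max_words=8):
    """Group word-level ASR output (hypothetical {'text', 'start', 'end'} dicts)
    into numbered SRT caption blocks of at most max_words words each."""
    blocks = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        start = to_srt_time(chunk[0]["start"])
        end = to_srt_time(chunk[-1]["end"])
        text = " ".join(w["text"] for w in chunk)
        blocks.append(f"{len(blocks) + 1}\n{start} --> {end}\n{text}")
    return "\n\n".join(blocks) + "\n"

# Illustrative ASR output with per-word timings in seconds
words = [
    {"text": "Video", "start": 0.0, "end": 0.4},
    {"text": "consumption", "start": 0.4, "end": 1.1},
    {"text": "has", "start": 1.1, "end": 1.3},
    {"text": "risen", "start": 1.3, "end": 1.7},
]
print(words_to_srt(words))
```

In practice the grouping would also respect sentence boundaries and reading-speed limits, but the same transcript-to-caption conversion underlies any automated workflow of this kind.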

It’s important to recognise that the quality of the audio data captured in transcriptions underpins the value of using that text for tasks such as video captioning. Automatic speech recognition (ASR) technology has become sophisticated enough to recognise meaning, accents and dialects without the need for human correction.

Content creators are competing on a world stage through social channels and need to be able to stand out from the crowd. They also have to understand the limitations of search engine algorithms when it comes to video, and the fact that a large proportion of their potential audience is hard of hearing. Adding to that, data from the Office of Communications (Ofcom) found that 7.5 million people in the UK (18% of the population) used closed captions. Of that 7.5 million, only 1.5 million were deaf or hard of hearing, suggesting that 80% of caption users had no hearing impairment at all.

Therefore, as video content continues to rise, content creators need to look for scalable and accurate solutions that capture the value of the data in their videos and offer audiences accurate captions.

