{"id":43,"date":"2023-01-24T06:19:35","date_gmt":"2023-01-24T06:19:35","guid":{"rendered":"https:\/\/todaysainews.com\/index.php\/2023\/01\/24\/introducing-whisper\/"},"modified":"2025-04-27T07:36:22","modified_gmt":"2025-04-27T07:36:22","slug":"introducing-whisper","status":"publish","type":"post","link":"https:\/\/todaysainews.com\/index.php\/2023\/01\/24\/introducing-whisper\/","title":{"rendered":"Introducing Whisper"},"content":{"rendered":"<div>\n        <!--kg-card-begin: markdown-->\n<div class=\"js-excerpt\">\n<p>We\u2019ve trained and are open-sourcing a neural net called Whisper that approaches human-level robustness and accuracy on English speech recognition.<\/p>\n<\/div>\n<section class=\"btns\"><a href=\"https:\/\/arxiv.org\/abs\/2212.04356\" class=\"btn btn-ypadded pl-0.125 d-block icon-paper\">Read Paper<\/a>\n<hr class=\"my-0\"\/><a href=\"https:\/\/github.com\/openai\/whisper\" class=\"btn btn-ypadded pl-0.125 d-block icon-code\">View Code<\/a>\n<hr class=\"my-0\"\/><a href=\"https:\/\/github.com\/openai\/whisper\/blob\/main\/model-card.md\" class=\"btn btn-ypadded pl-0.125 d-block icon-slides\">View Model Card<\/a><br \/>\n<\/section>\n<p>Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. We show that the use of such a large and diverse dataset leads to improved robustness to accents, background noise, and technical language. Moreover, it enables transcription in multiple languages, as well as translation from those languages into English. 
We are open-sourcing models and inference code to serve as a foundation for building useful applications and for further research on robust speech processing.<\/p>\n<div class=\"d-md-none\">\n<img decoding=\"async\" src=\"https:\/\/cdn.openai.com\/whisper\/asr-summary-of-model-architecture-mobile.svg\" class=\"w-100\"\/><\/div>\n<div class=\"d-none d-md-block wide my-1.5\">\n<div class=\"mx-xl-auto\" style=\"max-width:780px\">\n<img decoding=\"async\" src=\"https:\/\/cdn.openai.com\/whisper\/asr-summary-of-model-architecture-desktop.svg\" class=\"w-100\"\/><\/div>\n<\/div>\n<p>The Whisper architecture is a simple end-to-end approach, implemented as an encoder-decoder Transformer. Input audio is split into 30-second chunks, converted into a log-Mel spectrogram, and then passed into an encoder. A decoder is trained to predict the corresponding text caption, intermixed with special tokens that direct the single model to perform tasks such as language identification, phrase-level timestamps, multilingual speech transcription, and to-English speech translation.<\/p>\n<div class=\"d-lg-none\">\n<div style=\"max-width:500px\">\n<img decoding=\"async\" src=\"https:\/\/cdn.openai.com\/whisper\/draft-20220919a\/asr-details-mobile.svg\" class=\"w-100\"\/><\/div>\n<\/div>\n<div class=\"d-none d-lg-block wide my-2\">\n<div class=\"mx-xl-auto\">\n<img decoding=\"async\" src=\"https:\/\/cdn.openai.com\/whisper\/draft-20220919a\/asr-details-desktop.svg\" class=\"w-100\"\/><\/div>\n<\/div>\n<p>Other existing approaches frequently use smaller, more closely paired audio-text training datasets,<span class=\"js-rfref\" data-id=\"simply-mix\"\/><span class=\"js-rfref\" data-id=\"the-peoples-speech\"\/><span class=\"js-rfref\" data-id=\"gigaspeech\"\/> or use broad but unsupervised audio pretraining.<span class=\"js-rfref\" data-id=\"self-supervised-learning\"\/><span class=\"js-rfref\" data-id=\"unsupervised-speech-recognition\"\/><span class=\"js-rfref\" data-id=\"the-frontier\"\/> 
Because Whisper was trained on a large and diverse dataset and was not fine-tuned to any specific one, it does not beat models that specialize in LibriSpeech performance, a famously competitive benchmark in speech recognition. However, when we measure Whisper\u2019s zero-shot performance across many diverse datasets, we find it is much more robust and makes 50% fewer errors than those models.<\/p>\n<p>About a third of Whisper\u2019s audio dataset is non-English, and it is alternately given the task of transcribing in the original language or translating to English. We find this approach is particularly effective at learning speech-to-text translation: evaluated zero-shot, it outperforms the supervised SOTA on CoVoST2-to-English translation.<\/p>\n<div class=\"d-lg-none\">\n<img decoding=\"async\" src=\"https:\/\/cdn.openai.com\/whisper\/draft-20220920a\/asr-training-data-mobile.svg\" class=\"w-100\"\/><\/div>\n<div class=\"d-none d-lg-block wide my-2\">\n<div class=\"mx-xl-auto\">\n<img decoding=\"async\" src=\"https:\/\/cdn.openai.com\/whisper\/draft-20220920a\/asr-training-data-desktop.svg\" class=\"w-100\"\/><\/div>\n<\/div>\n<p>We hope Whisper\u2019s high accuracy and ease of use will allow developers to add voice interfaces to a much wider set of applications. 
Check out the <a href=\"https:\/\/arxiv.org\/abs\/2212.04356\">paper<\/a>, <a href=\"https:\/\/github.com\/openai\/whisper\/blob\/main\/model-card.md\">model card<\/a>, and <a href=\"https:\/\/github.com\/openai\/whisper\">code<\/a> to learn more details and to try out Whisper.<\/p>\n<!--kg-card-end: markdown--><\/div>\n<p><a href=\"https:\/\/openai.com\/blog\/whisper\/\">Source link<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>We\u2019ve trained and are open-sourcing a neural net called Whisper that approaches human-level robustness and accuracy<\/p>\n","protected":false},"author":2,"featured_media":44,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-43","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/posts\/43","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/comments?post=43"}],"version-history":[{"count":1,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/posts\/43\/revisions"}],"predecessor-version":[{"id":62,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/posts\/43\/revisions\/62"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/media\/44"}],"wp:attachment":[{"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/media?parent=43"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/todaysain
ews.com\/index.php\/wp-json\/wp\/v2\/categories?post=43"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/todaysainews.com\/index.php\/wp-json\/wp\/v2\/tags?post=43"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}
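The article describes input audio being split into 30-second chunks before it reaches the encoder. As a minimal sketch of that fixed-window preprocessing, assuming the 16 kHz sample rate the released Whisper code resamples audio to, the chunking step can be written in pure Python (the real pipeline operates on arrays and additionally converts each window to a log-Mel spectrogram, which is omitted here):

```python
# Sketch of Whisper-style fixed-window chunking: input audio is cut into
# 30-second windows and the last window is zero-padded to full length, so
# the encoder always sees an input of the same fixed shape.
SAMPLE_RATE = 16_000                          # Hz; rate the released code resamples to
CHUNK_SECONDS = 30
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS   # 480,000 samples per window


def chunk_audio(samples):
    """Split a 1-D list of PCM samples into fixed 30 s windows,
    zero-padding the final (or empty) window to CHUNK_SAMPLES."""
    chunks = []
    for start in range(0, max(len(samples), 1), CHUNK_SAMPLES):
        window = samples[start:start + CHUNK_SAMPLES]
        window = window + [0.0] * (CHUNK_SAMPLES - len(window))
        chunks.append(window)
    return chunks
```

For example, 45 seconds of audio yields two windows: one full 30 s window and one holding the remaining 15 s followed by 15 s of padding.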
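The article also notes that the decoder is steered by special tokens that select the task (language identification, transcription, to-English translation, timestamps). A hedged sketch of how such a control prefix is assembled, using the token names from the released repository but plain strings rather than real tokenizer ids:

```python
# Sketch of the special-token prefix that directs Whisper's single decoder.
# Token spellings follow the released repo; this builds the control sequence
# as a string for illustration, not as actual tokenizer ids.
def decoder_prompt(language, task, timestamps=True):
    """Build the control-token prefix for one decoding pass.

    task is 'transcribe' (same-language text) or 'translate' (to English);
    omitting timestamps appends the <|notimestamps|> token.
    """
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        tokens.append("<|notimestamps|>")
    return "".join(tokens)
```

Because the task is expressed in the prompt rather than in the architecture, the same weights transcribe French, translate French to English, or emit timestamped output depending only on this prefix.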