We release the BERSt Dataset for a range of speech recognition tasks, including Automatic Speech Recognition (ASR) and Speech Emotion Recognition (SER).
The dataset contains almost 4 hours of English speech from 98 actors with varying regional and non-native accents. The data was collected on smartphones in the actors' homes and therefore spans at least 98 different acoustic environments. It also includes 7 different emotion prompts and both shouted and spoken utterances. The smartphones were placed in 19 different positions, including obstructed placements and positions in a different room from the actor. The data is publicly available and can be used to evaluate a variety of speech recognition tasks, including ASR, shout detection, and SER.
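As a rough illustration of how ASR performance might be scored on this kind of data, the sketch below computes word error rate with the jiwer library; the strings are placeholders standing in for the dataset's reference transcripts and the output of whichever ASR system is under evaluation, not actual dataset content.

```python
# Minimal WER scoring sketch using jiwer; the strings below are placeholders
# for the dataset's reference transcripts and an ASR system's hypotheses.
import jiwer

references = ["the quick brown fox jumped over the lazy dog"]  # ground truth
hypotheses = ["the quick brown fox jumps over a lazy dog"]     # ASR output

print(f"WER: {jiwer.wer(references, hypotheses):.3f}")
```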
Overview
4526 single-phrase recordings (~3.75 h)
98 professional actors
19 phone positions
7 emotion classes
3 vocal intensity levels
varied regional and non-native English accents
nonsense phrases covering all English phonemes
Further information and dataset access can be found on Hugging Face
The corresponding paper can be found here
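For programmatic access, a minimal loading sketch with the Hugging Face `datasets` library is shown below; the repository identifier is a placeholder, and the actual split and field names may differ from what is printed, so consult the dataset page on Hugging Face for the exact values.

```python
# Minimal loading sketch using the Hugging Face `datasets` library.
# "<namespace>/BERSt" is a placeholder -- substitute the repository id
# listed on the dataset's Hugging Face page.
from datasets import load_dataset

berst = load_dataset("<namespace>/BERSt")

print(berst)                      # available splits and their features
split = next(iter(berst.values()))
print(split[0])                   # first example: audio plus its labels
```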