Textbook in PDF format
Speech is the most natural means of communication between humans. With Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) systems, machines also become able to interact with humans through speech, which is of particular importance for building interactive robots or speech-enabled chatbots. This book starts by exploring state-of-the-art ASR and TTS approaches based on artificial neural networks, which are also relevant to low-resource scenarios. It then explores the application of speech technology to specific domains, such as the medical domain, human-robot interaction, and even the interlinking of speech and text resources using linguistic linked open data (LLOD) principles. The book also presents punctuation restoration techniques, enabling the production of high-quality text transcripts. The algorithms included have low latency and can be parallelized, which makes them suitable for interactive systems. The chapter authors are professors and scientific researchers with experience in building and using Natural Language Processing (NLP) algorithms and speech applications.
Many spoken human-computer interactions start with an automatic speech recognition (ASR) system meant to transcribe the user's voice and pass it to a natural language processor or to a command module. There are several known solutions built on various technologies, ranging from Hidden Markov Models to complex Deep Neural Networks (DNNs) or hybrid architectures that mix two or more known methods. A common element across all models is the large number of transcribed and aligned text fragments required for training. We consider the best-known open-source ASR projects, namely CMUSphinx, DeepSpeech, and Kaldi, each being representative of its underlying techniques, as well as audio augmentation before and after feature extraction.
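To make the distinction concrete, the following is a minimal sketch, not code from the book, of what augmentation before feature extraction (noise injection on the raw waveform) and after feature extraction (SpecAugment-style masking of a spectrogram) can look like; all function names, parameters, and values here are illustrative assumptions.

# Hypothetical sketch of audio augmentation before (waveform level) and
# after (feature level) feature extraction; not the book's code.
import numpy as np

def augment_waveform(wave: np.ndarray, noise_snr_db: float = 20.0) -> np.ndarray:
    """Augmentation BEFORE feature extraction: add white noise at a target SNR."""
    signal_power = np.mean(wave ** 2)
    noise_power = signal_power / (10 ** (noise_snr_db / 10))
    noise = np.random.normal(0.0, np.sqrt(noise_power), size=wave.shape)
    return wave + noise

def augment_features(spec: np.ndarray, max_freq_mask: int = 8,
                     max_time_mask: int = 20) -> np.ndarray:
    """Augmentation AFTER feature extraction: SpecAugment-style frequency
    and time masking on a (freq_bins, frames) spectrogram."""
    spec = spec.copy()
    n_freq, n_time = spec.shape
    f = np.random.randint(0, max_freq_mask + 1)
    f0 = np.random.randint(0, max(1, n_freq - f))
    spec[f0:f0 + f, :] = 0.0          # mask a band of frequency bins
    t = np.random.randint(0, max_time_mask + 1)
    t0 = np.random.randint(0, max(1, n_time - t))
    spec[:, t0:t0 + t] = 0.0          # mask a span of time frames
    return spec

# Example: a one-second 16 kHz sine tone and a toy stand-in spectrogram.
wave = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
noisy = augment_waveform(wave)
spec = np.abs(np.random.randn(80, 100))  # stand-in for a log-mel spectrogram
masked = augment_features(spec)

Waveform-level augmentation effectively enlarges the pool of raw recordings, while feature-level masking regularizes the model during training; toolkits such as Kaldi and DeepSpeech support variants of both stages.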
Supervised learning is a bottleneck for developing more powerful Machine Learning (ML) systems due to the massive amounts of labeled data required to train high-performance models. Self-supervised learning is one of the most common approaches used to mitigate this problem: models are first trained on large amounts of unlabeled data with artificially created objectives, and the acquired knowledge is then transferred to a downstream task. This methodology has obtained exceptional results in natural language processing with architectures such as BERT, but it has struggled to achieve the same performance in domains like computer vision and speech processing because, compared with text, these domains operate in much higher-dimensional spaces. This issue has recently been mitigated by using self-supervised learning with a contrastive objective, allowing such models to be pre-trained on high-dimensional data.
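As a rough illustration of what a contrastive objective computes, here is a minimal InfoNCE-style sketch of the kind used in self-supervised speech pre-training (e.g., wav2vec 2.0); the shapes, names, and temperature value are illustrative assumptions, not the book's code.

# Hypothetical sketch of a contrastive (InfoNCE-style) objective:
# the model must pick the true latent for a masked time step out of
# a set of distractors. Not the book's code.
import numpy as np

def info_nce_loss(query: np.ndarray, positive: np.ndarray,
                  negatives: np.ndarray, temperature: float = 0.1) -> float:
    """query:     (d,)   context representation at a masked time step
       positive:  (d,)   true latent for that time step
       negatives: (K, d) distractor latents sampled from other time steps"""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

    logits = np.array([cos(query, positive)] +
                      [cos(query, n) for n in negatives]) / temperature
    # Cross-entropy with the positive at index 0.
    logits -= logits.max()                      # numerical stability
    return float(-logits[0] + np.log(np.exp(logits).sum()))

# Example with random 256-dim representations and 10 negatives.
rng = np.random.default_rng(0)
q, pos = rng.normal(size=256), rng.normal(size=256)
negs = rng.normal(size=(10, 256))
print(info_nce_loss(q, pos, negs))

Because the objective only asks the model to discriminate the correct latent from distractors, it sidesteps reconstructing the full high-dimensional input, which is what makes contrastive pre-training tractable for audio and images.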