
Techniques for Noise Robustness in Automatic Speech Recognition

Software / Digital Media
520 pages
2012
John Wiley & Sons Inc (publisher)
978-1-118-39268-3 (ISBN)
116.03 incl. VAT
With the growing use of automatic speech recognition (ASR) in everyday life, the ability to cope with noise and other degradations in recorded speech is critical for engineers and researchers developing ASR technologies. The only resource of its kind, this book presents a comprehensive survey of state-of-the-art techniques used to improve the robustness of ASR systems.
Automatic speech recognition (ASR) systems are finding increasing use in everyday life. Many of the environments in which they are used are noisy, for example when users call up a voice search system from a busy cafeteria or a street. This degrades the recorded speech and adversely affects the performance of speech recognition systems. As the use of ASR systems increases, knowledge of the state of the art in techniques for dealing with such problems becomes critical for system and application engineers and researchers who work with or on ASR technologies. This book presents a comprehensive survey of state-of-the-art techniques used to improve the robustness of speech recognition systems to these degrading external influences.
Key features:
  • Reviews all the main noise-robust ASR approaches, including signal separation, voice activity detection, robust feature extraction, model compensation and adaptation, missing-data techniques and recognition of reverberant speech.
  • Acts as a timely exposition of the topic ahead of the expected wider use of ASR technology in challenging environments.
  • Addresses robustness issues and signal degradation, both key concerns for practitioners of ASR.
  • Includes contributions from top ASR researchers at leading research units in the field.

Tuomas Virtanen, Tampere University of Technology, Finland. Dr. Virtanen is a senior researcher at Tampere University of Technology. Previously, he worked at Cambridge University, UK, as a research associate. His main research contributions are in sound source separation and its application to robust speech recognition, audio content analysis, and music information retrieval. He is well known for his work on non-negative matrix factorization based source separation, which is now widely used in the field. He has published numerous journal and conference articles on these topics.

Rita Singh, Carnegie Mellon University, USA. Dr. Singh is the CEO of a speech-technology startup and remains adjunct faculty at the Language Technologies Institute at Carnegie Mellon University. She has been a major contributor to the open-source CMU Sphinx project and is one of the main architects of the popular Sphinx-4 Java-based open-source speech recognition system. In addition to her work on core speech recognition technology, she has developed several algorithms for noise compensation, and was the prime architect of CMU's award-winning submission to the Naval Research Lab's 2001 challenge on automatic recognition of speech in noisy environments (SPINE).

Bhiksha Raj, Carnegie Mellon University, USA. Dr. Raj is an associate professor in the Language Technologies Institute and in Electrical and Computer Engineering at Carnegie Mellon University. He has worked extensively on robustness algorithms for speech recognition and is well known for his contributions to the highly popular VTS approach to noise compensation, as well as to missing-feature-based techniques for noise compensation. He has published extensively on, and holds patents for, algorithms for microphone array processing and signal separation.

List of Contributors xv
Acknowledgments xvii
1 Introduction 1 (Tuomas Virtanen, Rita Singh, Bhiksha Raj)
  1.1 Scope of the Book 1
  1.2 Outline 2
  1.3 Notation 4
Part One FOUNDATIONS
2 The Basics of Automatic Speech Recognition 9 (Rita Singh, Bhiksha Raj, Tuomas Virtanen)
  2.1 Introduction 9
  2.2 Speech Recognition Viewed as Bayes Classification 10
  2.3 Hidden Markov Models 11
    2.3.1 Computing Probabilities with HMMs 12
    2.3.2 Determining the State Sequence 17
    2.3.3 Learning HMM Parameters 19
    2.3.4 Additional Issues Relating to Speech Recognition Systems 20
  2.4 HMM-Based Speech Recognition 24
    2.4.1 Representing the Signal 24
    2.4.2 The HMM for a Word Sequence 25
    2.4.3 Searching through all Word Sequences 26
  References 29
3 The Problem of Robustness in Automatic Speech Recognition 31 (Bhiksha Raj, Tuomas Virtanen, Rita Singh)
  3.1 Errors in Bayes Classification 31
    3.1.1 Type 1 Condition: Mismatch Error 33
    3.1.2 Type 2 Condition: Increased Bayes Error 34
  3.2 Bayes Classification and ASR 35
    3.2.1 All We Have is a Model: A Type 1 Condition 35
    3.2.2 Intrinsic Interferences – Signal Components that are Unrelated to the Message: A Type 2 Condition 36
    3.2.3 External Interferences – The Data are Noisy: Type 1 and Type 2 Conditions 36
  3.3 External Influences on Speech Recordings 36
    3.3.1 Signal Capture 37
    3.3.2 Additive Corruptions 41
    3.3.3 Reverberation 42
    3.3.4 A Simplified Model of Signal Capture 43
  3.4 The Effect of External Influences on Recognition 44
  3.5 Improving Recognition under Adverse Conditions 46
    3.5.1 Handling the Model Mismatch Error 46
    3.5.2 Dealing with Intrinsic Variations in the Data 47
    3.5.3 Dealing with Extrinsic Variations 47
  References 50
Part Two SIGNAL ENHANCEMENT
4 Voice Activity Detection, Noise Estimation, and Adaptive Filters for Acoustic Signal Enhancement 53 (Rainer Martin, Dorothea Kolossa)
  4.1 Introduction 53
  4.2 Signal Analysis and Synthesis 55
    4.2.1 DFT-Based Analysis Synthesis with Perfect Reconstruction 55
    4.2.2 Probability Distributions for Speech and Noise DFT Coefficients 57
  4.3 Voice Activity Detection 58
    4.3.1 VAD Design Principles 58
    4.3.2 Evaluation of VAD Performance 62
    4.3.3 Evaluation in the Context of ASR 62
  4.4 Noise Power Spectrum Estimation 65
    4.4.1 Smoothing Techniques 65
    4.4.2 Histogram and GMM Noise Estimation Methods 67
    4.4.3 Minimum Statistics Noise Power Estimation 67
    4.4.4 MMSE Noise Power Estimation 68
    4.4.5 Estimation of the A Priori Signal-to-Noise Ratio 69
  4.5 Adaptive Filters for Signal Enhancement 71
    4.5.1 Spectral Subtraction 71
    4.5.2 Nonlinear Spectral Subtraction 73
    4.5.3 Wiener Filtering 74
    4.5.4 The ETSI Advanced Front End 75
    4.5.5 Nonlinear MMSE Estimators 75
  4.6 ASR Performance 80
  4.7 Conclusions 81
  References 82
5 Extraction of Speech from Mixture Signals 87 (Paris Smaragdis)
  5.1 The Problem with Mixtures 87
  5.2 Multichannel Mixtures 88
    5.2.1 Basic Problem Formulation 88
    5.2.2 Convolutive Mixtures 92
  5.3 Single-Channel Mixtures 98
    5.3.1 Problem Formulation 98
    5.3.2 Learning Sound Models 100
    5.3.3 Separation by Spectrogram Factorization 101
    5.3.4 Dealing with Unknown Sounds 105
  5.4 Variations and Extensions 107
  5.5 Conclusions 107
  References 107
6 Microphone Arrays 109 (John McDonough, Kenichi Kumatani)
  6.1 Speaker Tracking 110
  6.2 Conventional Microphone Arrays 113
  6.3 Conventional Adaptive Beamforming Algorithms 120
    6.3.1 Minimum Variance Distortionless Response Beamformer 120
    6.3.2 Noise Field Models 122
    6.3.3 Subband Analysis and Synthesis 123
    6.3.4 Beamforming Performance Criteria 126
    6.3.5 Generalized Sidelobe Canceller Implementation 129
    6.3.6 Recursive Implementation of the GSC 130
    6.3.7 Other Conventional GSC Beamformers 131
    6.3.8 Beamforming based on Higher Order Statistics 132
    6.3.9 Online Implementation 136
    6.3.10 Speech-Recognition Experiments 140
  6.4 Spherical Microphone Arrays 142
  6.5 Spherical Adaptive Algorithms 148
  6.6 Comparative Studies 149
  6.7 Comparison of Linear and Spherical Arrays for DSR 152
  6.8 Conclusions and Further Reading 154
  References 155
Part Three FEATURE ENHANCEMENT
7 From Signals to Speech Features by Digital Signal Processing 161 (Matthias Wölfel)
  7.1 Introduction 161
    7.1.1 About this Chapter 162
  7.2 The Speech Signal 162
  7.3 Spectral Processing 163
    7.3.1 Windowing 163
    7.3.2 Power Spectrum 165
    7.3.3 Spectral Envelopes 166
    7.3.4 LP Envelope 166
    7.3.5 MVDR Envelope 169
    7.3.6 Warping the Frequency Axis 171
    7.3.7 Warped LP Envelope 175
    7.3.8 Warped MVDR Envelope 176
    7.3.9 Comparison of Spectral Estimates 177
    7.3.10 The Spectrogram 179
  7.4 Cepstral Processing 179
    7.4.1 Definition and Calculation of Cepstral Coefficients 180
    7.4.2 Characteristics of Cepstral Sequences 181
  7.5 Influence of Distortions on Different Speech Features 182
    7.5.1 Objective Functions 182
    7.5.2 Robustness against Noise 185
    7.5.3 Robustness against Echo and Reverberation 187
    7.5.4 Robustness against Changes in Fundamental Frequency 189
  7.6 Summary and Further Reading 191
  References 191
8 Features Based on Auditory Physiology and Perception 193 (Richard M. Stern, Nelson Morgan)
  8.1 Introduction 193
  8.2 Some Attributes of Auditory Physiology and Perception 194
    8.2.1 Peripheral Processing 194
    8.2.2 Processing at more Central Levels 200
    8.2.3 Psychoacoustical Correlates of Physiological Observations 202
    8.2.4 The Impact of Auditory Processing on Conventional Feature Extraction 206
    8.2.5 Summary 208
  8.3 "Classic" Auditory Representations 208
  8.4 Current Trends in Auditory Feature Analysis 213
  8.5 Summary 221
  Acknowledgments 222
  References 222
9 Feature Compensation 229 (Jasha Droppo)
  9.1 Life in an Ideal World 229
    9.1.1 Noise Robustness Tasks 229
    9.1.2 Probabilistic Feature Enhancement 230
    9.1.3 Gaussian Mixture Models 231
  9.2 MMSE-SPLICE 232
    9.2.1 Parameter Estimation 233
    9.2.2 Results 236
  9.3 Discriminative SPLICE 237
    9.3.1 The MMI Objective Function 238
    9.3.2 Training the Front-End Parameters 239
    9.3.3 The Rprop Algorithm 240
    9.3.4 Results 241
  9.4 Model-Based Feature Enhancement 242
    9.4.1 The Additive Noise-Mixing Equation 243
    9.4.2 The Joint Probability Model 244
    9.4.3 Vector Taylor Series Approximation 246
    9.4.4 Estimating Clean Speech 247
    9.4.5 Results 247
  9.5 Switching Linear Dynamic System 248
  9.6 Conclusion 249
  References 249
10 Reverberant Speech Recognition 251 (Reinhold Haeb-Umbach, Alexander Krueger)
  10.1 Introduction 251
  10.2 The Effect of Reverberation 252
    10.2.1 What is Reverberation? 252
    10.2.2 The Relationship between Clean and Reverberant Speech Features 254
    10.2.3 The Effect of Reverberation on ASR Performance 258
  10.3 Approaches to Reverberant Speech Recognition 258
    10.3.1 Signal-Based Techniques 259
    10.3.2 Front-End Techniques 260
    10.3.3 Back-End Techniques 262
    10.3.4 Concluding Remarks 265
  10.4 Feature Domain Model of the Acoustic Impulse Response 265
  10.5 Bayesian Feature Enhancement 267
    10.5.1 Basic Approach 268
    10.5.2 Measurement Update 269
    10.5.3 Time Update 270
    10.5.4 Inference 271
  10.6 Experimental Results 272
    10.6.1 Databases 272
    10.6.2 Overview of the Tested Methods 273
    10.6.3 Recognition Results on Reverberant Speech 274
    10.6.4 Recognition Results on Noisy Reverberant Speech 276
  10.7 Conclusions 277
  Acknowledgment 278
  References 278
Part Four MODEL ENHANCEMENT
11 Adaptation and Discriminative Training of Acoustic Models 285 (Yannick Estève, Paul Deléglise)
  11.1 Introduction 285
    11.1.1 Acoustic Models 286
    11.1.2 Maximum Likelihood Estimation 287
  11.2 Acoustic Model Adaptation and Noise Robustness 288
    11.2.1 Static (or Offline) Adaptation 289
    11.2.2 Dynamic (or Online) Adaptation 289
  11.3 Maximum A Posteriori Reestimation 290
  11.4 Maximum Likelihood Linear Regression 293
    11.4.1 Class Regression Tree 294
    11.4.2 Constrained Maximum Likelihood Linear Regression 297
    11.4.3 CMLLR Implementation 297
    11.4.4 Speaker Adaptive Training 298
  11.5 Discriminative Training 299
    11.5.1 MMI Discriminative Training Criterion 301
    11.5.2 MPE Discriminative Training Criterion 302
    11.5.3 I-smoothing 303
    11.5.4 MPE Implementation 304
  11.6 Conclusion 307
  References 308
12 Factorial Models for Noise Robust Speech Recognition 311 (John R. Hershey, Steven J. Rennie, Jonathan Le Roux)
  12.1 Introduction 311
  12.2 The Model-Based Approach 313
  12.3 Signal Feature Domains 314
  12.4 Interaction Models 317
    12.4.1 Exact Interaction Model 318
    12.4.2 Max Model 320
    12.4.3 Log-Sum Model 321
    12.4.4 Mel Interaction Model 321
  12.5 Inference Methods 322
    12.5.1 Max Model Inference 322
    12.5.2 Parallel Model Combination 324
    12.5.3 Vector Taylor Series Approaches 326
    12.5.4 SNR-Dependent Approaches 331
  12.6 Efficient Likelihood Evaluation in Factorial Models 332
    12.6.1 Efficient Inference using the Max Model 332
    12.6.2 Efficient Vector-Taylor Series Approaches 334
    12.6.3 Band Quantization 335
  12.7 Current Directions 337
    12.7.1 Dynamic Noise Models for Robust ASR 338
    12.7.2 Multi-Talker Speech Recognition using Graphical Models 339
    12.7.3 Noise Robust ASR using Non-Negative Basis Representations 340
  References 341
13 Acoustic Model Training for Robust Speech Recognition 347 (Michael L. Seltzer)
  13.1 Introduction 347
  13.2 Traditional Training Methods for Robust Speech Recognition 348
  13.3 A Brief Overview of Speaker Adaptive Training 349
  13.4 Feature-Space Noise Adaptive Training 351
    13.4.1 Experiments using fNAT 352
  13.5 Model-Space Noise Adaptive Training 353
  13.6 Noise Adaptive Training using VTS Adaptation 355
    13.6.1 Vector Taylor Series HMM Adaptation 355
    13.6.2 Updating the Acoustic Model Parameters 357
    13.6.3 Updating the Environmental Parameters 360
    13.6.4 Implementation Details 360
    13.6.5 Experiments using NAT 361
  13.7 Discussion 364
    13.7.1 Comparison of Training Algorithms 364
    13.7.2 Comparison to Speaker Adaptive Training 364
    13.7.3 Related Adaptive Training Methods 365
  13.8 Conclusion 366
  References 366
Part Five COMPENSATION FOR INFORMATION LOSS
14 Missing-Data Techniques: Recognition with Incomplete Spectrograms 371 (Jon Barker)
  14.1 Introduction 371
  14.2 Classification with Incomplete Data 373
    14.2.1 A Simple Missing Data Scenario 374
    14.2.2 Missing Data Theory 376
    14.2.3 Validity of the MAR Assumption 378
    14.2.4 Marginalising Acoustic Models 379
  14.3 Energetic Masking 381
    14.3.1 The Max Approximation 381
    14.3.2 Bounded Marginalisation 382
    14.3.3 Missing Data ASR in the Cepstral Domain 384
    14.3.4 Missing Data ASR with Dynamic Features 386
  14.4 Meta-Missing Data: Dealing with Mask Uncertainty 388
    14.4.1 Missing Data with Soft Masks 388
    14.4.2 Sub-band Combination Approaches 391
    14.4.3 Speech Fragment Decoding 393
  14.5 Some Perspectives on Performance 395
  References 396
15 Missing-Data Techniques: Feature Reconstruction 399 (Jort Florent Gemmeke, Ulpu Remes)
  15.1 Introduction 399
  15.2 Missing-Data Techniques 401
  15.3 Correlation-Based Imputation 402
    15.3.1 Fundamentals 402
    15.3.2 Implementation 404
  15.4 Cluster-Based Imputation 406
    15.4.1 Fundamentals 406
    15.4.2 Implementation 408
    15.4.3 Advances 409
  15.5 Class-Conditioned Imputation 411
    15.5.1 Fundamentals 411
    15.5.2 Implementation 412
    15.5.3 Advances 413
  15.6 Sparse Imputation 414
    15.6.1 Fundamentals 414
    15.6.2 Implementation 416
    15.6.3 Advances 418
  15.7 Other Feature-Reconstruction Methods 420
    15.7.1 Parametric Approaches 420
    15.7.2 Nonparametric Approaches 421
  15.8 Experimental Results 421
    15.8.1 Feature-Reconstruction Methods 422
    15.8.2 Comparison with Other Methods 424
    15.8.3 Advances 426
    15.8.4 Combination with Other Methods 427
  15.9 Discussion and Conclusion 428
  Acknowledgments 429
  References 430
16 Computational Auditory Scene Analysis and Automatic Speech Recognition 433 (Arun Narayanan, DeLiang Wang)
  16.1 Introduction 433
  16.2 Auditory Scene Analysis 434
  16.3 Computational Auditory Scene Analysis 435
    16.3.1 Ideal Binary Mask 435
    16.3.2 Typical CASA Architecture 438
  16.4 CASA Strategies 440
    16.4.1 IBM Estimation Based on Local SNR Estimates 440
    16.4.2 IBM Estimation using ASA Cues 442
    16.4.3 IBM Estimation as Binary Classification 448
    16.4.4 Binaural Mask Estimation Strategies 451
  16.5 Integrating CASA with ASR 452
    16.5.1 Uncertainty Transform Model 454
  16.6 Concluding Remarks 458
  Acknowledgment 458
  References 458
17 Uncertainty Decoding 463 (Hank Liao)
  17.1 Introduction 463
  17.2 Observation Uncertainty 465
  17.3 Uncertainty Decoding 466
  17.4 Feature-Based Uncertainty Decoding 468
    17.4.1 SPLICE with Uncertainty 470
    17.4.2 Front-End Joint Uncertainty Decoding 471
    17.4.3 Issues with Feature-Based Uncertainty Decoding 472
  17.5 Model-Based Joint Uncertainty Decoding 473
    17.5.1 Parameter Estimation 475
    17.5.2 Comparisons with Other Methods 476
  17.6 Noisy CMLLR 477
  17.7 Uncertainty and Adaptive Training 480
    17.7.1 Gradient-Based Methods 481
    17.7.2 Factor Analysis Approaches 482
  17.8 In Combination with Other Techniques 483
  17.9 Conclusions 484
  References 485
Index 487

Publication date (per publisher) 5 October 2012
Place of publication New York
Language English
Dimensions 150 x 250 mm
Weight 666 g
Subject area Computer Science › Theory / Studies › Artificial Intelligence / Robotics
Technology › Electrical Engineering / Power Engineering
ISBN-10 1-118-39268-X / 111839268X
ISBN-13 978-1-118-39268-3 / 9781118392683
Condition New