# Improving Speech Separation by Acoustic Echo Cancellation

Bachelor's thesis, 2012, 48 pages

## Excerpt

## Contents

Acknowledgements

Abstract

1 Introduction

1.1 Motivation

1.2 Overview

2 Digital Filters

2.1 FIR-Filter

2.2 IIR-Filter

2.3 Wiener Filter

2.3.1 Solution in the Time Domain

2.3.2 Solution in the Frequency Domain

2.4 Adaptive Filters

2.4.1 LMS Algorithm

2.4.2 NLMS Algorithm

3 Acoustic Echo Cancellation

3.1 Problem Definition

3.2 Adaptive Filter

3.3 Voice Activity Detection VAD

3.4 Pre-Emphasis/De-Emphasis

3.5 Residual Echo Suppression

3.6 Matlab Results

4 Speech Separation

4.1 Beamforming

4.1.1 Diffuse Noise Field and Directivity

4.1.2 Delay-and-Sum Beamformer

4.1.3 MVDR Beamformer

4.1.4 Superdirective Beamformer

4.1.5 Zelinski Postfilter

4.2 Echo Suppression Postfilter ES

5 Experiments

5.1 Corpus

5.2 Automatic Speech Recognition System ASR

5.3 Experiments and Results

5.3.1 Superdirective Beamformer

5.3.2 Zelinski Postfilter

5.3.3 Echo Suppression Filter

5.3.4 Row of Echo Suppression Systems

5.3.5 Zelinski Postfilter and Echo Suppression System

6 Summary, Conclusions and Future Work

## List of Tables

5.1 WER for SDB

5.2 WER for Zelinski Postfilter

5.3 WER for Echo Suppression Filter

5.4 WER for Zelinski Postfilter and Echo Suppression Filter

## List of Figures

2.1 FIR Filter Structure [Tho11]

2.2 IIR Filter Structure [Tho11]

2.3 Structure of a Wiener Filter

2.4 Structure of an Adaptive Filter

3.1 Audio Conference System

3.2 Acoustic Echo Cancellation System

3.3 Speech Signal

3.4 Signal Power of a Signal

3.5 Voice Activity of a Signal

3.6 Example for Pre-Emphasis

3.7 Far-End Speaker Signal

3.8 Desired Signal

3.9 Estimated Impulse Response

3.10 Result without Near-End Speaker

3.11 Desired Signal with two Speakers

3.12 Result with Near-End Speaker

3.13 Echo Suppressed Signal with Postfilter

3.14 Echo Cancellation System

4.1 Task Speech Separation

4.2 Beamforming

4.3 Plane Wave on Linear Microphone Array

4.4 Delay-And-Sum Beamformer [VT02]

4.5 MVDR Beamformer

4.6 Zelinski Postfilter

5.1 SDB

5.2 Bar Plot of the WER for Zelinski Postfilter

5.3 SDB with Zelinski Postfilter

5.4 SDB with Echo Suppression Filter

5.5 SDB with up to 4 ES Filters

5.6 WER for different *β*

5.7 SDB with Zelinski and ES

6.1 Speech Separation System.

## Acknowledgements

First of all, I would like to thank Prof. Dr. Dietrich Klakow, who gave me the opportunity to write my thesis at the Lehrstuhl für Sprachsignalverarbeitung (chair of speech signal processing) of the Universität des Saarlandes.

Furthermore, my thanks go to all members of the chair, who supported me generously and most kindly with technical, organizational, and subject-related problems. Special mention goes to my supervisor Friedrich Faubel, who supervised me excellently throughout the entire course of my work.

Very special thanks also go to my family and my friends. In particular, my mother's energetic support throughout my school and academic career forms the basis for the success of this thesis.

## Abstract

This bachelor thesis deals with acoustic echo cancellation and speech separation. An acoustic echo cancellation system is implemented, and parts of this system are then used for the speech separation process. The speech separation process is evaluated with an automatic speech recognition system, and the filter that is used leads to a significant improvement of the speech separation. The separation system achieved a word error rate of 44.20 %. This is an improvement of 24 % in comparison to a superdirective beamformer.

## Summary

This bachelor thesis covers the topics of acoustic echo cancellation and speech separation. First, an acoustic echo cancellation system is implemented in Matlab, and then parts of this system are used for speech separation. The speech separation experiments are evaluated with an automatic speech recognition system, and the filter used yields a clear improvement of the speech separation. The system achieves a word error rate of 44.20 %. This corresponds to an improvement of 24 % compared to the superdirective beamformer.

## 1 Introduction

### 1.1 Motivation

In recent years, communication between people has changed. Hands-free communication is becoming more and more important. It is used, for example, in teleconferencing systems, mobile phones, home entertainment, and car information systems. These systems are very comfortable to use, but the coupling between microphones and loudspeakers introduces echoes that can disturb the conversation. The solution to this problem is an acoustic echo cancellation system that can suppress the disturbing echo.

The presence of more than one person in a room, as in teleconferencing systems, leads to new problems. It then becomes necessary to separate the speech of the speakers, which is a very challenging task in speech recognition.

In this thesis an acoustic echo cancellation system is explained and implemented, and parts of this echo cancellation system are then used to solve the speech separation problem.

### 1.2 Overview

This thesis explores the speech separation problem using parts of an acoustic echo cancellation system. It is organized as follows:

**Section 2**

This section gives an introduction to digital filters. Knowledge of these filters is needed in order to understand and implement an acoustic echo cancellation system.

**Section 3**

This section first describes the basic concept of an acoustic echo cancellation system. Then, a voice activity detection system and a pre-emphasis filter are introduced. In order to improve the results, a postfilter for the residual echo is also presented in this part. Furthermore, the system is implemented in Matlab, and Section 3.6 shows the results of the implementation.

**Section 4**

This section deals with the actual speech separation problem. Here, we first point a beamformer (spatial filter) at each of the acoustic sources in order to make use of the spatial diversity of the signal. We explain the beamforming process using the delay-and-sum beamformer in Section 4.1.2 and the superdirective beamformer in Section 4.1.4. Different postfilters that are applied after the beamforming process are also part of this section.

**Section 5**

This section presents the experiments and, in particular, their results.

**Section 6**

The last section summarizes the main results of this thesis and gives an outlook on future work.

## 2 Digital Filters

Digital filters are an important topic in signal processing and they are used in many applications in communication technology. In this section, we will introduce the filters that are used in this thesis.

### 2.1 FIR-Filter

As the name suggests, finite impulse response (FIR) filters have a finite impulse response. The output of an FIR filter is bounded, because it is a finite sum of weighted, bounded inputs. Since there is no feedback, an FIR filter can never oscillate and is always stable. Figure 2.1 shows the structure of an FIR filter.

illustration not visible in this excerpt

Figure 2.1: FIR Filter Structure [Tho11]

The weighted sum of the inputs leads to the transfer function of an FIR filter in the z-domain:

illustration not visible in this excerpt

The *z*^{-1} delay elements in equation 2.1 are also called taps; they are delays of the input signal. H(z) is the z-transform of the impulse response h(n) and is useful for checking the stability of the system. In this case all poles lie at the origin *z* = 0, inside the unit circle, which implies stability of the system. We use an FIR filter later in the pre-emphasis part (Section 3.4) of the acoustic echo cancellation system.
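As an illustration of the weighted sum of delayed inputs, the following minimal sketch applies an FIR filter in plain Python. The function name `fir_filter` and the coefficient value 0.95 (a typical choice for a first-order pre-emphasis filter) are assumptions for this example and are not taken from the thesis.

```python
def fir_filter(b, x):
    """Apply an FIR filter: y[n] = sum_k b[k] * x[n - k]."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k, bk in enumerate(b):
            if n - k >= 0:          # samples before the start count as zero
                acc += bk * x[n - k]
        y.append(acc)
    return y

# Feeding a unit impulse returns the coefficients followed by zeros,
# which shows that the impulse response is finite.
impulse = [1.0, 0.0, 0.0, 0.0, 0.0]
h = fir_filter([1.0, -0.95], impulse)   # assumed pre-emphasis-like coefficients
print(h)
```

Because the output depends only on a finite number of input samples, the impulse response ends after as many samples as there are coefficients.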

### 2.2 IIR-Filter

An infinite impulse response (IIR) filter has an impulse response h(n) of infinite length, because the output samples are fed back in order to compute the output y(n). Figure 2.2 shows the scheme of an IIR filter.

illustration not visible in this excerpt

Figure 2.2: IIR Filter Structure [Tho11]

The z-transform of the impulse response h(n) is

illustration not visible in this excerpt

Equation 2.2 has poles at the roots of the denominator. Because of these poles an IIR filter can be unstable. For the design of such a filter we have to determine the coefficients *b*_{i}, and we will use this filter type in the de-emphasis part in Section 3.4.
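The effect of the feedback path and of the pole location can be sketched as follows. The direct-form recursion, the sign convention of the feedback coefficients `a`, and the function name `iir_filter` are assumptions for this example, since equation 2.2 is not visible in this excerpt.

```python
def iir_filter(b, a, x):
    """Direct form I: a[0]*y[n] = sum_k b[k]*x[n-k] - sum_{k>=1} a[k]*y[n-k]."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k, bk in enumerate(b):
            if n - k >= 0:
                acc += bk * x[n - k]
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * y[n - k]   # feedback of past outputs
        y.append(acc / a[0])
    return y

impulse = [1.0] + [0.0] * 5

# Pole at z = 0.5 (inside the unit circle): y[n] = x[n] + 0.5*y[n-1]
h_stable = iir_filter([1.0], [1.0, -0.5], impulse)

# Pole at z = 1.5 (outside the unit circle): the impulse response grows
h_unstable = iir_filter([1.0], [1.0, -1.5], impulse)

print(h_stable)
print(h_unstable)
```

With the pole inside the unit circle the impulse response decays but never becomes exactly zero; with the pole outside it grows without bound.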

### 2.3 Wiener Filter

The Wiener filter theory was developed by Norbert Wiener (1949) and Andrei Kolmogorov (1941) [Vas96]. Wiener's solution was based on time-domain analysis, while Kolmogorov's solution was based on frequency-domain analysis.

The goal of a Wiener filter is to remove noise from a noisy signal. If noise n[k] is added to a clean signal s[k], a Wiener filter with the filter coefficients h[k] tries to reconstruct the clean signal from the noisy signal. Figure 2.3 shows the scheme of a Wiener filter. For the acoustic echo cancellation task, a Wiener filter is used in the echo suppression system that is explained in Section 3.5.

illustration not visible in this excerpt

Figure 2.3: Structure of a Wiener Filter

As we can see in Figure 2.3, we can express the noisy signal x[k] as

illustration not visible in this excerpt

The filtered signal *ŝ*[*k*] is a reconstruction of the clean signal s[k].

In the following part the solutions in the time and frequency domain are explained.

#### 2.3.1 Solution in the Time Domain

The signal *ŝ*[*k*] can be written with a convolution as

illustration not visible in this excerpt

Then, the error between the clean and filtered signal is

illustration not visible in this excerpt

The idea of a Wiener filter is to minimize the expectation value of the squared error

illustration not visible in this excerpt

If we use the definition of the discrete convolution, we get

illustration not visible in this excerpt

In order to minimize equation 2.7 we have to calculate the derivative with respect to h[i] and set it to zero.

illustration not visible in this excerpt

Simplifying and substituting equation 2.3 into equation 2.9 leads to

illustration not visible in this excerpt

Now, we assume that noise and signal are uncorrelated

illustration not visible in this excerpt

and we define

illustration not visible in this excerpt

Then, we get the following equation

illustration not visible in this excerpt

If we use the convolution, we get the Wiener-Hopf-equation:

illustration not visible in this excerpt

Solving this system of equations gives the optimum impulse response at a complexity of *O*(*n*^{2}).
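Since the equations of this derivation are not visible in this excerpt, the steps described above can be sketched in standard notation as follows. The symbols $r_{sx}$, $r_{xx}$, $r_{ss}$ and $r_{nn}$ for the correlation sequences are assumptions and may differ from the notation and numbering used in the thesis; the sketch also assumes stationary signals and zero-mean noise.

```latex
\begin{align}
\hat{s}[k] &= \sum_{j} h[j]\, x[k-j], \qquad e[k] = s[k] - \hat{s}[k] \\
\frac{\partial}{\partial h[i]}\, E\{e^{2}[k]\} &= -2\, E\{e[k]\, x[k-i]\} = 0 \\
\Rightarrow\quad r_{sx}[i] &= \sum_{j} h[j]\, r_{xx}[i-j] \\
\text{uncorrelated } s, n:\quad r_{sx}[i] &= r_{ss}[i], \qquad r_{xx}[i] = r_{ss}[i] + r_{nn}[i] \\
\Rightarrow\quad r_{ss}[i] &= \sum_{j} h[j]\, \bigl(r_{ss}[i-j] + r_{nn}[i-j]\bigr)
\end{align}
```

The last line is a linear system in the coefficients $h[j]$ with a Toeplitz matrix built from $r_{ss} + r_{nn}$.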

#### 2.3.2 Solution in the Frequency Domain

In the remainder of the thesis, uppercase letters denote signals in the frequency domain. In the frequency domain, convolution becomes multiplication, and therefore the error *J*(*ω*) is

illustration not visible in this excerpt

The next step is again to minimize the expectation value of the squared error *E*{*J*(*ω*)^{2}} by taking the derivative with respect to *H*(*ω*):

illustration not visible in this excerpt

With

illustration not visible in this excerpt

and the assumption of uncorrelated noise and signal

illustration not visible in this excerpt

we can simplify equation 2.18 to

illustration not visible in this excerpt

With the power spectral densities

*E*{*S*(*ω*)^{2} + *N*(*ω*)^{2}}.

illustration not visible in this excerpt

we can formulate the solution of the impulse response in the frequency domain as

illustration not visible in this excerpt

This method in the frequency domain has a complexity of [illustration not visible in this excerpt]
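As a hedged illustration, the textbook form of the frequency-domain Wiener solution for uncorrelated signal and noise is the ratio of the signal power spectral density to the sum of signal and noise power spectral densities. The sketch below applies this form per frequency bin; the names `wiener_gain`, `p_ss` and `p_nn` and the example values are assumptions, and the thesis's own equation is not visible in this excerpt.

```python
def wiener_gain(p_ss, p_nn):
    """Per-bin Wiener gain H = P_ss / (P_ss + P_nn) for uncorrelated signal and noise."""
    gains = []
    for s, n in zip(p_ss, p_nn):
        total = s + n
        # Bins with no power at all get gain 0 to avoid a division by zero.
        gains.append(s / total if total > 0.0 else 0.0)
    return gains

# Hypothetical power spectral densities for four frequency bins
p_ss = [4.0, 1.0, 0.0, 0.0]   # clean-signal power per bin
p_nn = [1.0, 1.0, 2.0, 0.0]   # noise power per bin

gains = wiener_gain(p_ss, p_nn)
print(gains)
```

Bins dominated by the signal get a gain close to 1, bins dominated by noise get a gain close to 0, and each bin is handled independently of the others.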

### 2.4 Adaptive Filters

An adaptive filter is a filter whose filter coefficients can be changed so that the filter can be adapted to different environments. In order to realize this we need, in addition to the filter coefficients, an algorithm to update these coefficients. Figure 2.4 shows the scheme of such an adaptive filter.

illustration not visible in this excerpt

Figure 2.4: Structure of an Adaptive Filter

The input of the filter consists of N samples of the signal x(*n*)

illustration not visible in this excerpt

and with the filter coefficients w(*n*)

illustration not visible in this excerpt

we can calculate the output signal

illustration not visible in this excerpt

This signal will be compared with the desired signal *d* (*n*) and the error

illustration not visible in this excerpt

is used to calculate the new filter coefficients for the next N samples of the signal. The goal is to minimize the error, because the echo is then estimated as well as possible and can thus be removed from the signal.

#### 2.4.1 LMS Algorithm

The term LMS algorithm stands for Least-Mean-Squares algorithm. It denotes a certain way of updating the coefficients of the adaptive filter. It uses the method of steepest descent and was invented by Bernard Widrow and Marcian Edward Hoff in 1960. The mean square error

illustration not visible in this excerpt

will be calculated and minimized. Since the expectation value of the error is unknown, we have to assume that the expectation value of the error corresponds to the instantaneous value of the error

illustration not visible in this excerpt

Substituting equation 2.30 into equation 2.29 leads to

illustration not visible in this excerpt

To minimize this function we use the method of steepest descent. Therefore, we have to take the gradient with respect to the filter coefficients w(n).

illustration not visible in this excerpt

Substituting the definition of the error *e*(*n*) in equation 2.32 leads to

illustration not visible in this excerpt

Applying the chain rule and simplifying results in

illustration not visible in this excerpt

The equation for updating the filter coefficients w(*n* + 1) at any given time n is given by the method of steepest descent by subtracting the gradient from the previous filter coefficients w(*n*):

illustration not visible in this excerpt

Here, *μ* is the stepsize, a small constant that controls the speed of adaptation. If we now substitute equation 2.35 into equation 2.36, we obtain the update formula of the filter coefficients for the LMS algorithm:

illustration not visible in this excerpt

In summary, we get the following scheme for the procedure of the LMS algorithm.

Algorithm 1 LMS Algorithm [Kha07]

illustration not visible in this excerpt

Comment: The filter coefficients are initialized with w(0) = 0. The stepsize *μ* must be positive and is bounded from above by a multiple of 1/*λ*_{max}, where *λ*_{max} is the largest eigenvalue of the autocorrelation matrix *R* = *E*{x(*n*)x(*n*)^{T}} [Hay02]. Each step of the algorithm requires 2N+1 multiplications and 2N additions, so the complexity is of order O(N). Because of its simplicity and low complexity, the LMS algorithm is often used in adaptive filters.
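Since Algorithm 1 is not visible in this excerpt, the following is a minimal sketch of the LMS procedure in plain Python, used here to identify a short unknown FIR system. The constant factor from the gradient is absorbed into `mu`, which is an assumption about the exact form of the update equation. The system `true_h`, the stepsize 0.05 and the white test input are hypothetical example values.

```python
import random

def lms(x, d, num_taps, mu):
    """LMS adaptive filter: returns final coefficients and the error signal."""
    w = [0.0] * num_taps                       # initialization w(0) = 0
    errors = []
    for n in range(len(x)):
        # Input vector [x(n), x(n-1), ..., x(n-N+1)], zero before the start
        x_vec = [x[n - k] if n - k >= 0 else 0.0 for k in range(num_taps)]
        y = sum(wk * xk for wk, xk in zip(w, x_vec))        # filter output
        e = d[n] - y                                        # error
        w = [wk + mu * e * xk for wk, xk in zip(w, x_vec)]  # coefficient update
        errors.append(e)
    return w, errors

random.seed(0)
true_h = [0.5, -0.3, 0.1]                      # hypothetical unknown system
x = [random.uniform(-1.0, 1.0) for _ in range(5000)]
d = [sum(true_h[k] * x[n - k] for k in range(len(true_h)) if n - k >= 0)
     for n in range(len(x))]

w, errors = lms(x, d, num_taps=3, mu=0.05)
print(w)
```

Each iteration needs one pass over the N coefficients for the output and one for the update, which matches the O(N) complexity stated above.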

#### 2.4.2 NLMS Algorithm

One problem with the LMS algorithm is the slow convergence of the filter coefficients, especially for fast changes of the input. To improve this, the normalized least-mean-squares (NLMS) algorithm is used. It is an extension of the LMS algorithm that adapts the stepsize in each iteration by normalizing with the signal power of x(n). The stepsize *μ* is given by

illustration not visible in this excerpt

Substituting equation 2.38 in the update formula of the LMS algorithm leads to

illustration not visible in this excerpt

This equation is often modified to

illustration not visible in this excerpt

where *μ* ^{1} and *ψ* are small positive constants that influence the speed of adaptation and prevent a division by zero.

Algorithm 2 NLMS Algorithm [FB98]

illustration not visible in this excerpt
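Algorithm 2 is not visible in this excerpt; the following is a minimal sketch of an NLMS update in plain Python, assuming the commonly used form in which the stepsize is divided by the power of the current input vector plus a small constant. The parameter names `mu` and `psi` follow the text above, while their values, the system `true_h`, and the loud test input are hypothetical.

```python
import random

def nlms(x, d, num_taps, mu=0.5, psi=1e-6):
    """NLMS adaptive filter: the stepsize is normalized by the input power."""
    w = [0.0] * num_taps
    errors = []
    for n in range(len(x)):
        x_vec = [x[n - k] if n - k >= 0 else 0.0 for k in range(num_taps)]
        y = sum(wk * xk for wk, xk in zip(w, x_vec))
        e = d[n] - y
        power = sum(xk * xk for xk in x_vec)     # x(n)^T x(n)
        step = mu / (psi + power)                # psi prevents division by zero
        w = [wk + step * e * xk for wk, xk in zip(w, x_vec)]
        errors.append(e)
    return w, errors

random.seed(1)
true_h = [0.5, -0.3, 0.1]                        # hypothetical unknown system
# A loud input: the normalization makes the update independent of this scale.
x = [10.0 * random.uniform(-1.0, 1.0) for _ in range(2000)]
d = [sum(true_h[k] * x[n - k] for k in range(len(true_h)) if n - k >= 0)
     for n in range(len(x))]

w, errors = nlms(x, d, num_taps=3)
print(w)
```

Because of the normalization, the same `mu` can be used for quiet and loud inputs, whereas a fixed LMS stepsize would have to be retuned when the input level changes.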

## 3 Acoustic Echo Cancellation

Acoustic echo cancellation systems are used in a wide range of communication systems. For example, they can improve the speech quality of speakerphones or Voice over IP telephones. These techniques are more and more common and therefore acoustic echo cancellation is a very important and useful feature in signal processing. In this section we will explain the details of an AEC system and at the end we will have a look at some results of the Matlab implementation.

### 3.1 Problem Definition

Audio feedback is often a problem of speakerphones or audio conference systems. The microphone receives the speaker’s voice in addition to a loudspeaker signal that is reflected by the walls or the ceiling. The result is a superposition of speech and disturbing echo, which makes the utterance hard to understand for the receiver. Figure 3.1 shows the scheme of such an audio conference system.

illustration not visible in this excerpt

Figure 3.1: Audio Conference System

In order to avoid the superposition, an acoustic echo cancellation system tries to estimate the echo with certain filters and then subtracts the estimated echo from the microphone signal. So, at the end the echo is reduced and the receiver hears a clear speech signal.

illustration not visible in this excerpt

Figure 3.2 illustrates the scheme of an acoustic echo cancellation system. Two speakers with microphones and loudspeakers are placed in two different rooms. The microphone signal *x* (*t*) of the far-end room is sent to the loudspeaker in the near-end room. There the signal, which is reflected by the wall and the ceiling, is received at the microphone in addition to the voice of the second speaker *v* (*t*). This signal is sent to the far-end room and the speaker there would hear a disturbing echo. To solve this problem, we have to estimate the echo *y* (*t*) and then subtract it from the microphone signal. For this estimation an adaptive filter is used to estimate the impulse response of the room.

illustration not visible in this excerpt

Figure 3.2: Acoustic Echo Cancellation System

The error is

illustration not visible in this excerpt

with

illustration not visible in this excerpt

where *w* (*t*) describes the filter coefficients of the adaptive filter and *h* (*t*) describes the impulse response of the room.

### 3.2 Adaptive Filter

To estimate the room impulse response *h*(*t*), we use an adaptive filter as described in Section 2.4. The input signal *x*(*t*) is the signal of the far-end speaker, and the desired signal *d*(*t*) is the near-end speaker signal plus the disturbing echo. With a good estimate of the impulse response, we can reduce the echo by subtracting the filter output *y*(*t*) from the desired signal.

### 3.3 Voice Activity Detection VAD

Voice activity detection is a technique with which the presence or absence of human speech can be detected. It is an important feature of an acoustic echo cancellation system because we have to prevent the updating process of the adaptive filter when there is no far-end speech activity so that the impulse response is estimated correctly. Therefore, we implement a VAD algorithm in this work.

illustration not visible in this excerpt

Figure 3.3: Speech Signal

Figure 3.3 shows the amplitude of a speaker file. The first step now is to calculate the power of the signal. Therefore, we divide the signal into frames and calculate the power of these frames (Figure 3.4(a)).

illustration not visible in this excerpt

(a) Signal Power

illustration not visible in this excerpt

(b) Signal Power with Threshold

Figure 3.4: Signal Power of a Signal

The last step is to compare the power to a threshold. Here we choose one third of the maximum power: the speaker is considered active if the frame power exceeds one third of the maximum power (Figure 3.4(b)), and the frame is assumed to contain only noise otherwise. Figure 3.5 shows the result. If there is no voice activity, the value of the frame is 0; otherwise it is 1.
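The two steps described above (frame power, then comparison with one third of the maximum power) can be sketched as follows. The use of non-overlapping frames, the frame length, the function names and the toy signal are assumptions for this example; the thesis's Matlab implementation is not shown in this excerpt.

```python
def frame_power(signal, frame_len):
    """Mean power of consecutive, non-overlapping frames."""
    powers = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        powers.append(sum(s * s for s in frame) / frame_len)
    return powers

def voice_activity(signal, frame_len, rel_threshold=1.0 / 3.0):
    """Return 1 for frames whose power exceeds rel_threshold * max power, else 0."""
    powers = frame_power(signal, frame_len)
    if not powers:
        return []
    threshold = rel_threshold * max(powers)
    return [1 if p > threshold else 0 for p in powers]

# Toy signal: quiet frame, speech-like frame, quiet frame, loud frame
signal = ([0.01] * 4 + [0.8, -0.8, 0.8, -0.8]
          + [0.01] * 4 + [1.0, -1.0, 1.0, -1.0])

activity = voice_activity(signal, frame_len=4)
print(activity)
```

In an echo canceller, such a per-frame flag for the far-end signal could be used to decide whether the adaptive filter coefficients are updated in that frame.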

illustration not visible in this excerpt

Figure 3.5: Voice Activity of a Signal

**[...]**

^{1} *μ* is usually chosen as 0 < *μ* < 2 [Göt]

## Details

- Pages: 48
- Year: 2012
- ISBN (eBook): 9783656347286
- ISBN (book): 9783656350682
- File size: 7.7 MB
- Language: German
- Catalog number: v207359
- Institution / university: Universität des Saarlandes – Sprachsignalverarbeitung
- Grade: 1.0
- Keywords: improving speech separation acoustic echo cancellation