IMPROVING DF-CONFORMER USING HYDRA FOR HIGH-FIDELITY GENERATIVE SPEECH ENHANCEMENT ON DISCRETE CODEC TOKEN

Shogo Seki, Shaoxiang Dang^†, Li Li (CyberAgent)
^†Work contributed by Shaoxiang during internship

ABSTRACT

The Dilated FAVOR Conformer (DF-Conformer) is an efficient variant of the Conformer architecture designed for speech enhancement (SE). It employs fast attention via positive orthogonal random features (FAVOR+) to mitigate the quadratic complexity of self-attention, and uses dilated convolution to expand the receptive field. This combination has yielded strong performance across various SE models. In this paper, we propose replacing FAVOR+ with bidirectional selective structured state-space sequence models to achieve two main objectives: (1) enhancing global sequential modeling by eliminating the approximations inherent in FAVOR+, and (2) maintaining linear complexity relative to the sequence length. Specifically, we utilize Hydra, a bidirectional extension of Mamba formulated within the structured matrix mixer framework. Experiments with Genhancer, a generative SE model operating on discrete codec tokens, demonstrate that the proposed method surpasses the performance of the DF-Conformer.
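To make the trade-off behind objectives (1) and (2) concrete, the following is a minimal NumPy sketch, not the implementation used in DF-Conformer or Genhancer. It compares exact softmax attention, which builds an L x L weight matrix, with a positive random feature approximation that is linear in the sequence length L. As a simplifying assumption, the random projections are drawn i.i.d. Gaussian and the orthogonalization step of FAVOR+ is omitted. All function names, tensor sizes, and the input scaling are illustrative choices, not values from the paper.

```python
import numpy as np

def softmax_attention(q, k, v):
    """Exact softmax attention. Builds an (L, L) weight matrix: O(L^2) in length L."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def positive_random_features(x, w):
    """phi(x) = exp(w x - |x|^2 / 2) / sqrt(m), so that E[phi(q) . phi(k)] = exp(q . k)."""
    m = w.shape[0]
    half_sq_norm = 0.5 * np.sum(x ** 2, axis=-1, keepdims=True)
    return np.exp(x @ w.T - half_sq_norm) / np.sqrt(m)

def random_feature_attention(q, k, v, w):
    """Approximate softmax attention without forming the (L, L) matrix: O(L * m * d)."""
    d = q.shape[-1]
    scale = d ** -0.25  # splits the 1/sqrt(d) softmax temperature between q and k
    q_feat = positive_random_features(q * scale, w)  # (L, m)
    k_feat = positive_random_features(k * scale, w)  # (L, m)
    kv = k_feat.T @ v                                # (m, d_v), one pass over the keys
    normalizer = q_feat @ k_feat.sum(axis=0)         # (L,), approximate softmax denominator
    return (q_feat @ kv) / normalizer[:, None]

# Illustrative sizes only (assumptions, not settings from the paper).
rng = np.random.default_rng(0)
seq_len, dim, num_features = 64, 16, 1024
q = 0.3 * rng.standard_normal((seq_len, dim))
k = 0.3 * rng.standard_normal((seq_len, dim))
v = rng.standard_normal((seq_len, dim))
w = rng.standard_normal((num_features, dim))  # i.i.d. Gaussian, not orthogonalized

exact = softmax_attention(q, k, v)
approx = random_feature_attention(q, k, v, w)
mean_err = np.abs(exact - approx).mean()
print(f"mean absolute error vs. exact attention: {mean_err:.4f}")
```

In this sketch, increasing `num_features` tightens the approximation at a higher cost, but the error does not vanish for any finite number of features. This is the kind of approximation that the proposed replacement of FAVOR+ is meant to remove while keeping linear complexity.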

EXPERIMENTS

Main results

Sample #1

Input

Target

Miipher (22.05 kHz) [1,2]

Genhancer (Softmax)

Genhancer (FAVOR+) [3]

Genhancer (Bi-Mamba)

Genhancer (Hydra)

Sample #2

Input

Target

Miipher (22.05 kHz) [1,2]

Genhancer (Softmax)

Genhancer (FAVOR+) [3]

Genhancer (Bi-Mamba)

Genhancer (Hydra)

Ablation study: comparison of different input lengths

Target

8s (training size)

Genhancer (Softmax)

Genhancer (FAVOR+) [3]

Genhancer (Bi-Mamba)

Genhancer (Hydra)

24s

Genhancer (Softmax)

Genhancer (FAVOR+) [3]

Genhancer (Bi-Mamba)

Genhancer (Hydra)

96s

Genhancer (Softmax)

Genhancer (FAVOR+) [3]

Genhancer (Bi-Mamba)

Genhancer (Hydra)

Input

REFERENCES

[1] Y. Koizumi et al., "Miipher: A Robust Speech Restoration Model Integrating Self-Supervised Speech and Text Representations," in Proc. WASPAA, pp. 1-5, 2023.

[2] https://github.com/Wataru-Nakata/miipher.git

[3] H. Yang et al., "Genhancer: High-Fidelity Speech Enhancement via Generative Modeling on Discrete Codec Tokens," in Proc. Interspeech, pp. 1170-1174, 2024.