IMPROVING DF-CONFORMER USING HYDRA FOR HIGH-FIDELITY GENERATIVE SPEECH ENHANCEMENT ON DISCRETE CODEC TOKEN
Shogo Seki, Shaoxiang Dang†, Li Li (CyberAgent)
† Work contributed by Shaoxiang during internship
ABSTRACT
The Dilated FAVOR Conformer (DF-Conformer) is an efficient variant of the Conformer architecture designed for speech enhancement (SE). It employs fast attention via positive orthogonal random features (FAVOR+) to avoid the quadratic complexity of self-attention, and uses dilated convolution to expand the receptive field. This combination has yielded strong performance across various SE models. In this paper, we propose replacing FAVOR+ with bidirectional selective structured state-space sequence models to achieve two main objectives: (1) enhancing global sequential modeling by eliminating the approximations inherent in FAVOR+, and (2) maintaining linear complexity relative to the sequence length. Specifically, we utilize Hydra, a bidirectional extension of Mamba, framed within the structured matrix mixer framework. Experiments conducted using a generative SE model on discrete codec tokens, known as Genhancer, demonstrate that the proposed method surpasses the performance of the DF-Conformer.
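To make the complexity argument above concrete, the following is a minimal pure-Python sketch of non-causal attention with positive random features, the idea behind FAVOR+. It is an illustration under stated assumptions, not the DF-Conformer or Genhancer implementation: the function names `positive_features` and `linear_attention`, the feature count, and the seed are hypothetical, and the projections are plain i.i.d. Gaussian draws rather than the orthogonalized ones FAVOR+ uses. The point is that the key/value summaries are computed once, so cost grows linearly with sequence length, while the output remains only an approximation of softmax attention.

```python
import math
import random


def positive_features(x, projections):
    """Map one vector to positive random features exp(w.x - |x|^2 / 2) / sqrt(m)."""
    half_sq_norm = 0.5 * sum(v * v for v in x)
    m = len(projections)
    return [
        math.exp(sum(w_i * x_i for w_i, x_i in zip(w, x)) - half_sq_norm) / math.sqrt(m)
        for w in projections
    ]


def linear_attention(queries, keys, values, num_features=64, seed=0):
    """Approximate softmax attention in time linear in the sequence length.

    Assumption: i.i.d. Gaussian projections (FAVOR+ additionally orthogonalizes them).
    """
    d = len(queries[0])
    d_v = len(values[0])
    rng = random.Random(seed)
    projections = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(num_features)]

    # Scaling both q and k by d^(-1/4) reproduces the usual 1/sqrt(d) in q.k.
    scale = d ** -0.25
    q_feats = [positive_features([scale * v for v in q], projections) for q in queries]
    k_feats = [positive_features([scale * v for v in k], projections) for k in keys]

    # One pass over the keys builds two summaries shared by every query:
    # kv[j][c] = sum_i phi(k_i)[j] * v_i[c], and k_sum[j] = sum_i phi(k_i)[j].
    kv = [[0.0] * d_v for _ in range(num_features)]
    k_sum = [0.0] * num_features
    for phi_k, v in zip(k_feats, values):
        for j, f in enumerate(phi_k):
            k_sum[j] += f
            for c in range(d_v):
                kv[j][c] += f * v[c]

    # Each query only touches the summaries, never the full L x L score matrix.
    outputs = []
    for phi_q in q_feats:
        denom = sum(f * s for f, s in zip(phi_q, k_sum))
        numer = [sum(phi_q[j] * kv[j][c] for j in range(num_features)) for c in range(d_v)]
        outputs.append([n / denom for n in numer])
    return outputs
```

Because every feature is strictly positive, each output row is a convex combination of the value rows, as in exact softmax attention; the weights themselves are random-feature estimates, which is the approximation the proposed method aims to remove.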
EXPERIMENTS
Main results
Sample #1
Input
Target
Miipher (22.05 kHz) [1,2]
Genhancer (Softmax)
Genhancer (FAVOR+) [3]
Genhancer (Bi-Mamba)
Genhancer (Hydra)
Sample #2
Input
Target
Miipher (22.05 kHz) [1,2]
Genhancer (Softmax)
Genhancer (FAVOR+) [3]
Genhancer (Bi-Mamba)
Genhancer (Hydra)
Ablation study: comparison of different input lengths
Target
8s (training size)
Genhancer (Softmax)
Genhancer (FAVOR+) [3]
Genhancer (Bi-Mamba)
Genhancer (Hydra)
24s
Genhancer (Softmax)
Genhancer (FAVOR+) [3]
Genhancer (Bi-Mamba)
Genhancer (Hydra)
96s
Genhancer (Softmax)
Genhancer (FAVOR+) [3]
Genhancer (Bi-Mamba)
Genhancer (Hydra)
Input
REFERENCES
[1] Y. Koizumi et al., "Miipher: A Robust Speech Restoration Model Integrating Self-Supervised Speech and Text Representations," in Proc. WASPAA, pp. 1-5, 2023.
[2] https://github.com/Wataru-Nakata/miipher.git
[3] H. Yang et al., "Genhancer: High-Fidelity Speech Enhancement via Generative Modeling on Discrete Codec Tokens," in Proc. Interspeech, pp. 1170-1174, 2024.