IsoNet: Spatially-aware audio-visual target speech extraction in complex acoustic environments
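1️⃣ One-sentence summary
This paper proposes IsoNet, a system for a compact microphone array that fuses multi-channel audio features, spatial localization cues, and facial visual information, aided by direction-supervised training. In the short-aperture regime where conventional beamforming fails, it markedly improves the extraction of a specific speaker's speech from complex, noisy environments.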
Target speech extraction remains difficult for compact devices because monaural neural models lack spatial evidence and classical beamformers lose resolving power when the microphone aperture is only a few centimetres. We present IsoNet, a user-selectable audio-visual target speech extraction system for a compact 4-microphone array. IsoNet combines complex multi-channel STFT features, GCC-PHAT spatial cues, face-conditioned visual embeddings, and auxiliary direction-of-arrival supervision inside a U-Net mask estimation network. Three curriculum variants were trained on 25,000 simulated VoxCeleb mixtures with progressively difficult SNR regimes. On a hard test set spanning -1 to 10 dB SNR, IsoNet-CL1 achieves 9.31 dB SI-SDR, a 4.85 dB improvement over the mixture, with PESQ 2.13 and STOI 0.84. Oracle delay-and-sum and MVDR beamformers degrade the same mixtures, with SI-SDRi of -4.82 dB and -6.08 dB, respectively, showing that the proposed learned multimodal conditioning remains effective in a regime where conventional spatial filtering is not. Ablation studies show consistent gains from visual conditioning, GCC-PHAT features, and extended delay-bin encoding. The results establish a compact-array, face-selectable speech extraction baseline under controlled simulation and identify the remaining barriers to real deployment, especially phase reconstruction, multi-interferer mixtures, and simulation-to-real transfer.
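The abstract names two quantities that are easy to make concrete: the GCC-PHAT spatial cue computed between microphone pairs and the SI-SDR metric behind the reported improvement. The following is a minimal NumPy sketch of the textbook definitions, not the authors' implementation. The function names `gcc_phat` and `si_sdr`, the zero-padding choice, and the `max_tau` lag window are assumptions made for illustration; how IsoNet actually frames, bins, or encodes the delays is not stated in the abstract.

```python
import numpy as np


def gcc_phat(x, y, fs, max_tau=None, eps=1e-12):
    """Return (delay of y relative to x in seconds, GCC-PHAT curve).

    Hypothetical helper: textbook GCC-PHAT for one microphone pair.
    A positive delay means y lags x.
    """
    n = len(x) + len(y)  # zero-pad so circular correlation does not wrap
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    cross = Y * np.conj(X)
    cross = cross / (np.abs(cross) + eps)  # PHAT: keep phase, drop magnitude
    cc = np.fft.irfft(cross, n=n)

    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)

    # Reorder so the curve runs over lags -max_shift .. +max_shift.
    cc = np.concatenate((cc[n - max_shift:], cc[:max_shift + 1]))
    lag = int(np.argmax(cc)) - max_shift
    return lag / fs, cc


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant signal-to-distortion ratio in dB (textbook form)."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to find the target component.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    residual = estimate - target
    return 10.0 * np.log10(
        (np.dot(target, target) + eps) / (np.dot(residual, residual) + eps)
    )


def si_sdr_improvement(estimate, mixture, reference):
    """SI-SDRi: SI-SDR of the estimate minus SI-SDR of the unprocessed mixture."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)
```

Under this definition a negative SI-SDRi means the processed signal is worse than the raw mixture, which is how the beamformer figures above should be read. If the 9.31 dB SI-SDR and the 4.85 dB improvement are both averages over the same test set, the implied average SI-SDR of the input mixtures is about 9.31 - 4.85 = 4.46 dB.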
Source: arXiv: 2605.14736