IsoNet: Spatially-aware audio-visual target speech extraction in complex acoustic environments

📄 Abstract - IsoNet: Spatially-aware audio-visual target speech extraction in complex acoustic environments

Target speech extraction remains difficult on compact devices: monaural neural models lack spatial evidence, and classical beamformers lose resolving power when the microphone aperture is only a few centimetres. We present IsoNet, a user-selectable audio-visual target speech extraction system for a compact 4-microphone array. IsoNet combines complex multi-channel STFT features, GCC-PHAT spatial cues, face-conditioned visual embeddings, and auxiliary direction-of-arrival supervision inside a U-Net mask estimation network. Three curriculum variants were trained on 25,000 simulated VoxCeleb mixtures with progressively more difficult SNR regimes. On a hard test set spanning -1 to 10 dB SNR, IsoNet-CL1 achieves 9.31 dB SI-SDR, a 4.85 dB improvement over the mixture, with PESQ 2.13 and STOI 0.84. Oracle delay-and-sum and MVDR beamformers instead reduce SI-SDR on the same mixtures by 4.82 dB and 6.08 dB, respectively, showing that the proposed learned multimodal conditioning remains effective in a regime where conventional spatial filtering fails. Ablation studies show consistent gains from visual conditioning, GCC-PHAT features, and extended delay-bin encoding. The results establish a compact-array, face-selectable speech extraction baseline under controlled simulation and identify the remaining barriers to real deployment, especially phase reconstruction, multi-interferer mixtures, and simulation-to-real transfer.
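The headline numbers above are reported as SI-SDR and SI-SDR improvement (SI-SDRi). As a reading aid, the following is a minimal pure-Python sketch of how these two quantities are commonly computed; it is not the authors' evaluation code. The function names `si_sdr` and `si_sdr_improvement`, the mean removal step, the `eps` stabiliser, and the toy sine signals are all assumptions made for illustration.

```python
import math


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant SDR in dB (illustrative sketch, not the paper's code)."""
    if len(estimate) != len(reference):
        raise ValueError("estimate and reference must have the same length")
    n = len(reference)
    # Assumption: remove the mean of each signal before comparing them.
    mean_e = sum(estimate) / n
    mean_r = sum(reference) / n
    e = [x - mean_e for x in estimate]
    r = [x - mean_r for x in reference]
    # Project the estimate onto the reference to get the scaled target part.
    alpha = sum(a * b for a, b in zip(e, r)) / (sum(x * x for x in r) + eps)
    target = [alpha * x for x in r]
    # Everything not explained by the scaled reference counts as distortion.
    residual = [a - t for a, t in zip(e, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(x * x for x in residual)
    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))


def si_sdr_improvement(estimate, mixture, reference):
    """SI-SDRi: SI-SDR of the output minus SI-SDR of the unprocessed mixture."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)


# Toy signals (hypothetical): a target tone plus an orthogonal interfering tone.
N = 800
target_tone = [math.sin(2 * math.pi * 5 * t / N) for t in range(N)]
interferer = [math.sin(2 * math.pi * 13 * t / N) for t in range(N)]
mixture = [s + v for s, v in zip(target_tone, interferer)]
# A pretend "extraction" that halves the interferer amplitude.
estimate = [s + 0.5 * v for s, v in zip(target_tone, interferer)]

improvement = si_sdr_improvement(estimate, mixture, target_tone)
print(f"SI-SDRi: {improvement:.2f} dB")  # halving the interferer gives about 6.02 dB
```

Under this definition a negative SI-SDRi means the processed signal is further from the reference than the unprocessed mixture was, which is how the beamformer results in the abstract should be read. If both reported figures are averages over the same test set, the 9.31 dB output SI-SDR and 4.85 dB improvement imply an average mixture SI-SDR of about 4.46 dB.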

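1️⃣ One-sentence summary

This paper presents IsoNet, a compact microphone-array system that fuses multi-channel audio features, spatial localization cues, and facial visual information, trained with auxiliary direction-of-arrival supervision; under short-aperture conditions where conventional beamforming fails, it substantially improves extraction of a specific speaker's speech from complex, noisy environments.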
