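Silence detection (voice activity detection, VAD) separates speech frames from silence or noise frames. It is widely used for call noise suppression, recording segmentation, live-stream noise reduction, and as a front end for speech recognition.

I. Basic logic of silence detection

• Framing: cut the continuous audio into short frames, typically 10 ms or 20 ms, which matches the short-time stationarity that human hearing relies on.
• Feature extraction: compute a feature of each frame, such as its energy.
• Threshold comparison: feature < threshold → silence; feature > threshold → speech.
• Smoothing and debouncing: isolated single-frame misjudgments are filtered out, and the silence / speech state only switches after several consecutive frames agree.

II. Common silence detection methods

2.1 Basic energy method

This method uses the time-domain amplitude or short-time energy. During silence the microphone picks up only ambient noise and the waveform amplitude is very small; during speech the amplitude grows significantly.

Formulas: for a frame of N samples, where $x(n)$ is the sample value, the short-time energy of the frame is

$$E = \sum_{n=0}^{N-1} x(n)^2$$

and the short-time average magnitude is

$$M = \frac{1}{N} \sum_{n=0}^{N-1} |x(n)|$$

Step 1: use the formula above to compute a baseline noise value $E_{noise}$ (for example, from frames that contain only background noise).
Step 2: set a threshold with adjustable sensitivity, $T = K \cdot E_{noise}$, where K is a tunable factor chosen for the actual environment.
Step 3: decision rule: if $E < T$ the frame is silence, otherwise it contains sound.

As the formula and the rule show, this approach is crude: it cannot detect very quiet speech, and it cannot tell a car from a human voice.

2.2 Zero-crossing-rate assisted method

The zero-crossing rate (ZCR) is the number of times the sampled signal crosses the zero level within one frame. Its main purpose is to separate noise from speech based on their different characteristics:

• White noise / fan noise: erratic waveform, very high ZCR.
• Voiced speech: dominated by low frequencies, smooth waveform, low ZCR.

Formula: for a frame of N samples, a zero crossing occurs whenever two adjacent samples have opposite signs:

$$ZCR = \frac{1}{2N} \sum_{n=1}^{N-1} \left| \operatorname{sgn}(x(n)) - \operatorname{sgn}(x(n-1)) \right|$$

where the sign function is

$$\operatorname{sgn}(x) = \begin{cases} 1, & x \ge 0 \\ -1, & x < 0 \end{cases}$$

Put more simply: each zero crossing contributes 2 to the sum, so dividing by 2N yields the number of crossings per sample, normalized to the range [0, 1].

This method can reject some abrupt noises, but stationary, low-level noise may still be indistinguishable from speech. It can also be stacked on top of the energy method: make a first decision with the energy method, then use the ZCR to discard part of the remaining noise.

2.3 Frequency-domain energy method

The time domain only sees overall loudness, whereas the frequency domain can separate the speech band from noise bands. Using an FFT spectrum can improve results considerably. The principle:

• Apply an FFT to the audio frame to obtain its spectrum.
• Effective speech band: 300 Hz to 3400 Hz (the telephony speech baseband).
• Sum only the energy inside that band; low-frequency wind noise and high-frequency electronic noise are ignored.
• If the speech-band energy exceeds an adaptive threshold, the frame is classified as speech.

The method is somewhat similar to a psychoacoustic model: it only looks at the frequencies that matter to the listener and isolates a large amount of interference outside the speech band. In noisy environments its accuracy is typically much higher than that of the time-domain energy method.

Key formulas:

1. FFT spectrum. Let the FFT output be the complex bins $X[k]$, where k is the bin index. The power of a single bin is

$$P[k] = \operatorname{Re}(X[k])^2 + \operatorname{Im}(X[k])^2$$

where Re is the real part and Im is the imaginary part of the complex value.

2. Total speech-band energy. Let $k_{low}$ be the FFT index corresponding to 300 Hz and $k_{high}$ the index corresponding to 3400 Hz. Summing over every bin in the band gives

$$E_{band} = \sum_{k=k_{low}}^{k_{high}} P[k]$$

3. Adaptive noise baseline. While consecutive frames are classified as silence, the background noise energy is updated slowly:

$$E_{noise}(t) = \alpha \, E_{noise}(t-1) + (1 - \alpha) \, E_{band}(t)$$

where $\alpha$ is the smoothing coefficient; the more slowly the noise changes, the larger the value.

4. Decision condition. With a threshold coefficient $\beta$ (typically 2 to 8, tunable for sensitivity), the frame is speech if

$$E_{band} > \beta \cdot E_{noise}$$

III. WebRTC VAD

The VAD (Voice Activity Detection) in WebRTC is mainly based on a GMM (Gaussian mixture model) and spectral feature analysis. Its core idea is to compare the feature vector of an audio frame against a pre-trained "speech model" and "noise model", compute the likelihoods, and make a decision from them.

3.1 Overall architecture

1. Preprocessing: downsampling, framing, windowing.
2. Feature extraction: extract time-domain and frequency-domain features that separate speech from noise.
3. Model matching: use the GMMs to compute the probability that the features belong to speech or to noise.
4. Decision logic: combine the probabilities, energy thresholds and history state (hangover) into the final decision.

3.2 Preprocessing

• Downsampling: whether the input sample rate is 8 kHz, 16 kHz, 32 kHz or 48 kHz, the VAD internally downsamples the signal to 8 kHz.
• Reason: most of the energy and information of speech is concentrated in the low band (0-4 kHz). Lowering the sample rate greatly reduces computation with little effect on VAD accuracy.
• Framing: split the continuous signal into short frames. WebRTC VAD supports frame lengths of 10 ms, 20 ms and 30 ms.
• Windowing: FFT-based front ends typically apply a Hamming window to reduce spectral leakage.

3.3 Feature extraction

1. Total energy:
• The logarithm of the sum of squares of all samples in the frame: $E_{log} = \log \left( \sum_{n=0}^{N-1} x(n)^2 \right)$.
• Purpose: silent frames usually have very low energy.
2. Zero crossing rate (ZCR):
• The number of times the signal crosses the zero axis.
• Purpose: unvoiced sounds (such as /s/, /f/) and noise usually have a high ZCR, whereas voiced sounds (such as /a/, /o/) have a low ZCR.
3. Spectral slope:
• The slope of a linear regression fitted to the spectral envelope.
• Purpose: the speech spectrum usually falls with increasing frequency (negative slope), whereas the white-noise spectrum is flat.
4. Spectral flatness:
• The ratio of the geometric mean to the arithmetic mean of the spectrum.
• Purpose: measures whether the spectrum is tone-like (pronounced peaks) or noise-like (flat).
5. Sub-band energy ratio:
• Split the spectrum of the 8 kHz signal into several sub-bands, for example low, mid and high.
• Compute the share of each sub-band in the total energy.
• Purpose: speech usually concentrates its energy in the low sub-bands (such as 0-500 Hz and 500-1000 Hz), whereas high-frequency noise has more energy in the high sub-bands.

Note that the GMM code shown in 3.6 consumes one feature per frequency sub-band (features[channel], for kNumChannels sub-bands) plus the total power; the list above describes VAD features in general.

3.4 Model matching: Gaussian mixture model (GMM) classification

This is the core of WebRTC VAD. It maintains two independent GMMs:

• Speech model ($\lambda_s$): trained on a large amount of clean speech data.
• Noise model ($\lambda_n$): trained on various kinds of background noise.

Each model is a mixture of several Gaussian distributions:

$$p(x \mid \lambda) = \sum_{i=1}^{M} w_i \, \mathcal{N}(x; \mu_i, \Sigma_i)$$

where $x$ is the feature vector, $w_i$ is the weight, $\mu_i$ is the mean and $\Sigma_i$ is the covariance of the i-th component.

WebRTC has also introduced a VAD based on a recurrent neural network (RNN) (modules/audio_processing/rnn_vad/). It is not the focus here; interested readers can study it on their own.

Computation:

1. For the feature vector $x$ of the current frame, compute its log-likelihood under the speech model, $\log p(x \mid \lambda_s)$, and under the noise model, $\log p(x \mid \lambda_n)$.
2. Compute the log-likelihood ratio: $LLR = \log p(x \mid \lambda_s) - \log p(x \mid \lambda_n)$.
3. If $LLR$ exceeds the decision threshold, the frame leans toward speech; otherwise it is noise.

3.5 Decision logic and smoothing

Raw frame-by-frame decisions are easily disturbed by transient noise and tend to flicker, so a state machine and a smoothing mechanism are added:

1. Adaptive threshold:
• The threshold is not fixed; it is adjusted dynamically according to the background noise level.
• In a quiet environment the threshold is lower, so weak speech is detected easily.
• In a noisy environment the threshold is raised to prevent noise from triggering false detections.
2. Hangover mechanism:
• Speech-to-silence transition: after several consecutive frames are classified as noise, the state does not switch to silence immediately. Instead it enters the hangover state and keeps reporting speech for a few more frames (for example 3-5 frames).
• Purpose: avoid cutting off the tail of an utterance (such as a final consonant).
• Silence-to-speech transition: several consecutive frames must be classified as speech before the state formally switches to speech.
• Purpose: prevent a transient burst (such as a door slam) from being taken for the start of speech.
3. Modes: WebRTC provides four modes, which essentially adjust the thresholds and hangover lengths above. The higher the mode, the more aggressively non-speech is filtered out:
• Quality (mode 0, normal): the least aggressive setting, with the lowest thresholds, so the most audio is kept as speech.
• Low Bitrate (mode 1): classifies frames as silence more readily in order to save bandwidth.
• Aggressive (mode 2): stricter thresholds and a shorter hangover; more non-speech is removed, at the risk of clipping weak speech.
• Very Aggressive (mode 3): the strictest setting; only frames with strong evidence are kept as speech.

3.6 Implementation source

1) Feature extraction. The function below belongs to the FeaturesExtractor of the RNN VAD; it estimates the pitch period on the LP residual and computes the spectral features of the frame.

bool FeaturesExtractor::CheckSilenceComputeFeatures(
    rtc::ArrayView<const float, kFrameSize10ms24kHz> samples,
    rtc::ArrayView<float, kFeatureVectorSize> feature_vector) {
  // Pre-processing.
  if (use_high_pass_filter_) {
    std::array<float, kFrameSize10ms24kHz> samples_filtered;
    hpf_.Process(samples, samples_filtered);
    // Feed buffer with the pre-processed version of |samples|.
    pitch_buf_24kHz_.Push(samples_filtered);
  } else {
    // Feed buffer with |samples|.
    pitch_buf_24kHz_.Push(samples);
  }
  // Extract the LP residual.
  float lpc_coeffs[kNumLpcCoefficients];
  ComputeAndPostProcessLpcCoefficients(pitch_buf_24kHz_view_, lpc_coeffs);
  ComputeLpResidual(lpc_coeffs, pitch_buf_24kHz_view_, lp_residual_view_);
  // Estimate pitch on the LP-residual and write the normalized pitch period
  // into the output vector (normalization based on training data stats).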
  pitch_info_48kHz_ = pitch_estimator_.Estimate(lp_residual_view_);
  feature_vector[kFeatureVectorSize - 2] =
      0.01f * (static_cast<int>(pitch_info_48kHz_.period) - 300);
  // Extract lagged frames (according to the estimated pitch period).
  RTC_DCHECK_LE(pitch_info_48kHz_.period / 2, kMaxPitch24kHz);
  auto lagged_frame = pitch_buf_24kHz_view_.subview(
      kMaxPitch24kHz - pitch_info_48kHz_.period / 2, kFrameSize20ms24kHz);
  // Analyze reference and lagged frames checking if silence has been detected
  // and write the feature vector.
  return spectral_features_extractor_.CheckSilenceComputeFeatures(
      reference_frame_view_, {lagged_frame.data(), kFrameSize20ms24kHz},
      {feature_vector.data() + kNumLowerBands, kNumBands - kNumLowerBands},
      {feature_vector.data(), kNumLowerBands},
      {feature_vector.data() + kNumBands, kNumLowerBands},
      {feature_vector.data() + kNumBands + kNumLowerBands, kNumLowerBands},
      {feature_vector.data() + kNumBands + 2 * kNumLowerBands, kNumLowerBands},
      &feature_vector[kFeatureVectorSize - 1]);
}

2) Core function: computing the GMM probability. It uses predefined coefficient arrays (tuned for the different frame lengths and modes) to evaluate the Gaussian probability densities.

static int16_t GmmProbability(VadInstT* self, int16_t* features,
                              int16_t total_power, size_t frame_length) {
  int channel, k;
  int16_t feature_minimum;
  int16_t h0, h1;
  int16_t log_likelihood_ratio;
  int16_t vadflag = 0;
  int16_t shifts_h0, shifts_h1;
  int16_t tmp_s16, tmp1_s16, tmp2_s16;
  int16_t diff;
  int gaussian;
  int16_t nmk, nmk2, nmk3, smk, smk2, nsk, ssk;
  int16_t delt, ndelt;
  int16_t maxspe, maxmu;
  int16_t deltaN[kTableSize], deltaS[kTableSize];
  int16_t ngprvec[kTableSize] = { 0 };  // Conditional probability = 0.
  int16_t sgprvec[kTableSize] = { 0 };  // Conditional probability = 0.
  int32_t h0_test, h1_test;
  int32_t tmp1_s32, tmp2_s32;
  int32_t sum_log_likelihood_ratios = 0;
  int32_t noise_global_mean, speech_global_mean;
  int32_t noise_probability[kNumGaussians], speech_probability[kNumGaussians];
  int16_t overhead1, overhead2, individualTest, totalTest;

  // Set various thresholds based on frame lengths (80, 160 or 240 samples).
  if (frame_length == 80) {
    overhead1 = self->over_hang_max_1[0];
    overhead2 = self->over_hang_max_2[0];
    individualTest = self->individual[0];
    totalTest = self->total[0];
  } else if (frame_length == 160) {
    overhead1 = self->over_hang_max_1[1];
    overhead2 = self->over_hang_max_2[1];
    individualTest = self->individual[1];
    totalTest = self->total[1];
  } else {
    overhead1 = self->over_hang_max_1[2];
    overhead2 = self->over_hang_max_2[2];
    individualTest = self->individual[2];
    totalTest = self->total[2];
  }

  if (total_power > kMinEnergy) {
    // The signal power of current frame is large enough for processing. The
    // processing consists of two parts:
    // 1) Calculating the likelihood of speech and thereby a VAD decision.
    // 2) Updating the underlying model, w.r.t., the decision made.
    // The detection scheme is an LRT with hypothesis
    // H0: Noise
    // H1: Speech
    //
    // We combine a global LRT with local tests, for each frequency sub-band,
    // here defined as |channel|.
    for (channel = 0; channel < kNumChannels; channel++) {
      // For each channel we model the probability with a GMM consisting of
      // |kNumGaussians|, with different means and standard deviations depending
      // on H0 or H1.
      h0_test = 0;
      h1_test = 0;
      for (k = 0; k < kNumGaussians; k++) {
        gaussian = channel + k * kNumChannels;
        // Probability under H0, that is, probability of frame being noise.
        // Value given in Q27 = Q7 * Q20.
        tmp1_s32 = WebRtcVad_GaussianProbability(features[channel],
                                                 self->noise_means[gaussian],
                                                 self->noise_stds[gaussian],
                                                 &deltaN[gaussian]);
        noise_probability[k] = kNoiseDataWeights[gaussian] * tmp1_s32;
        h0_test += noise_probability[k];  // Q27

        // Probability under H1, that is, probability of frame being speech.
        // Value given in Q27 = Q7 * Q20.
        tmp1_s32 = WebRtcVad_GaussianProbability(features[channel],
                                                 self->speech_means[gaussian],
                                                 self->speech_stds[gaussian],
                                                 &deltaS[gaussian]);
        speech_probability[k] = kSpeechDataWeights[gaussian] * tmp1_s32;
        h1_test += speech_probability[k];  // Q27
      }

      // Calculate the log likelihood ratio: log2(Pr{X|H1} / Pr{X|H0}).
      // Approximation:
      // log2(Pr{X|H1} / Pr{X|H0}) = log2(Pr{X|H1}*2^Q) - log2(Pr{X|H0}*2^Q)
      //                           = log2(h1_test) - log2(h0_test)
      //                           = log2(2^(31-shifts_h1)*(1+b1))
      //                             - log2(2^(31-shifts_h0)*(1+b0))
      //                           = shifts_h0 - shifts_h1
      //                             + log2(1+b1) - log2(1+b0)
      //                          ~= shifts_h0 - shifts_h1
      //
      // Note that b0 and b1 are values less than 1, hence, 0 <= log2(1+b0) < 1.
      // Further, b0 and b1 are independent and on the average the two terms
      // cancel.
      shifts_h0 = WebRtcSpl_NormW32(h0_test);
      shifts_h1 = WebRtcSpl_NormW32(h1_test);
      if (h0_test == 0) {
        shifts_h0 = 31;
      }
      if (h1_test == 0) {
        shifts_h1 = 31;
      }
      log_likelihood_ratio = shifts_h0 - shifts_h1;

      // Update |sum_log_likelihood_ratios| with spectrum weighting. This is
      // used for the global VAD decision.
      sum_log_likelihood_ratios +=
          (int32_t) (log_likelihood_ratio * kSpectrumWeight[channel]);

      // Local VAD decision.
      if ((log_likelihood_ratio * 4) > individualTest) {
        vadflag = 1;
      }

      // TODO(bjornv): The conditional probabilities below are applied on the
      // hard coded number of Gaussians set to two. Find a way to generalize.
      // Calculate local noise probabilities used later when updating the GMM.
      h0 = (int16_t) (h0_test >> 12);  // Q15
      if (h0 > 0) {
        // High probability of noise. Assign conditional probabilities for each
        // Gaussian in the GMM.
        tmp1_s32 = (noise_probability[0] & 0xFFFFF000) << 2;  // Q29
        ngprvec[channel] = (int16_t) WebRtcSpl_DivW32W16(tmp1_s32, h0);  // Q14
        ngprvec[channel + kNumChannels] = 16384 - ngprvec[channel];
      } else {
        // Low noise probability. Assign conditional probability 1 to the first
        // Gaussian and 0 to the rest (which is already set at initialization).
        ngprvec[channel] = 16384;
      }

      // Calculate local speech probabilities used later when updating the GMM.
      h1 = (int16_t) (h1_test >> 12);  // Q15
      if (h1 > 0) {
        // High probability of speech. Assign conditional probabilities for each
        // Gaussian in the GMM. Otherwise use the initialized values, i.e., 0.
        tmp1_s32 = (speech_probability[0] & 0xFFFFF000) << 2;  // Q29
        sgprvec[channel] = (int16_t) WebRtcSpl_DivW32W16(tmp1_s32, h1);  // Q14
        sgprvec[channel + kNumChannels] = 16384 - sgprvec[channel];
      }
    }

    // Make a global VAD decision.
    vadflag |= (sum_log_likelihood_ratios >= totalTest);

    // Update the model parameters.
    maxspe = 12800;
    for (channel = 0; channel < kNumChannels; channel++) {
      // Get minimum value in past which is used for long term correction in Q4.
      feature_minimum = WebRtcVad_FindMinimum(self, features[channel], channel);

      // Compute the "global" mean, that is the sum of the two means weighted.
      noise_global_mean = WeightedAverage(&self->noise_means[channel], 0,
                                          &kNoiseDataWeights[channel]);
      tmp1_s16 = (int16_t) (noise_global_mean >> 6);  // Q8

      for (k = 0; k < kNumGaussians; k++) {
        gaussian = channel + k * kNumChannels;

        nmk = self->noise_means[gaussian];
        smk = self->speech_means[gaussian];
        nsk = self->noise_stds[gaussian];
        ssk = self->speech_stds[gaussian];

        // Update noise mean vector if the frame consists of noise only.
        nmk2 = nmk;
        if (!vadflag) {
          // deltaN = (x-mu)/sigma^2
          // ngprvec[k] = |noise_probability[k]| /
          //   (|noise_probability[0]| + |noise_probability[1]|)
          // (Q14 * Q11 >> 11) = Q14.
          delt = (int16_t)((ngprvec[gaussian] * deltaN[gaussian]) >> 11);
          // Q7 + (Q14 * Q15 >> 22) = Q7.
          nmk2 = nmk + (int16_t)((delt * kNoiseUpdateConst) >> 22);
        }

        // Long term correction of the noise mean.
        // Q8 - Q8 = Q8.
        ndelt = (feature_minimum << 4) - tmp1_s16;
        // Q7 + (Q8 * Q8) >> 9 = Q7.
        nmk3 = nmk2 + (int16_t)((ndelt * kBackEta) >> 9);

        // Control that the noise mean does not drift too much.
        tmp_s16 = (int16_t) ((k + 5) << 7);
        if (nmk3 < tmp_s16) {
          nmk3 = tmp_s16;
        }
        tmp_s16 = (int16_t) ((72 + k - channel) << 7);
        if (nmk3 > tmp_s16) {
          nmk3 = tmp_s16;
        }
        self->noise_means[gaussian] = nmk3;

        if (vadflag) {
          // Update speech mean vector:
          // |deltaS| = (x-mu)/sigma^2
          // sgprvec[k] = |speech_probability[k]| /
          //   (|speech_probability[0]| + |speech_probability[1]|)
          // (Q14 * Q11) >> 11 = Q14.
          delt = (int16_t)((sgprvec[gaussian] * deltaS[gaussian]) >> 11);
          // Q14 * Q15 >> 21 = Q8.
          tmp_s16 = (int16_t)((delt * kSpeechUpdateConst) >> 21);
          // Q7 + (Q8 >> 1) = Q7. With rounding.
          smk2 = smk + ((tmp_s16 + 1) >> 1);

          // Control that the speech mean does not drift too much.
          maxmu = maxspe + 640;
          if (smk2 < kMinimumMean[k]) {
            smk2 = kMinimumMean[k];
          }
          if (smk2 > maxmu) {
            smk2 = maxmu;
          }
          self->speech_means[gaussian] = smk2;  // Q7.

          // (Q7 >> 3) = Q4. With rounding.
          tmp_s16 = ((smk + 4) >> 3);
          tmp_s16 = features[channel] - tmp_s16;  // Q4
          // (Q11 * Q4 >> 3) = Q12.
          tmp1_s32 = (deltaS[gaussian] * tmp_s16) >> 3;
          tmp2_s32 = tmp1_s32 - 4096;
          tmp_s16 = sgprvec[gaussian] >> 2;
          // (Q14 >> 2) * Q12 = Q24.
          tmp1_s32 = tmp_s16 * tmp2_s32;
          tmp2_s32 = tmp1_s32 >> 4;  // Q20

          // 0.1 * Q20 / Q7 = Q13.
          if (tmp2_s32 > 0) {
            tmp_s16 = (int16_t) WebRtcSpl_DivW32W16(tmp2_s32, ssk * 10);
          } else {
            tmp_s16 = (int16_t) WebRtcSpl_DivW32W16(-tmp2_s32, ssk * 10);
            tmp_s16 = -tmp_s16;
          }
          // Divide by 4 giving an update factor of 0.025 (= 0.1 / 4).
          // Note that division by 4 equals shift by 2, hence,
          // (Q13 >> 8) = (Q13 >> 6) / 4 = Q7.
          tmp_s16 += 128;  // Rounding.
          ssk += (tmp_s16 >> 8);
          if (ssk < kMinStd) {
            ssk = kMinStd;
          }
          self->speech_stds[gaussian] = ssk;
        } else {
          // Update GMM variance vectors.
          // deltaN * (features[channel] - nmk) - 1
          // Q4 - (Q7 >> 3) = Q4.
          tmp_s16 = features[channel] - (nmk >> 3);
          // (Q11 * Q4 >> 3) = Q12.
          tmp1_s32 = (deltaN[gaussian] * tmp_s16) >> 3;
          tmp1_s32 -= 4096;

          // (Q14 >> 2) * Q12 = Q24.
          tmp_s16 = (ngprvec[gaussian] + 2) >> 2;
          tmp2_s32 = OverflowingMulS16ByS32ToS32(tmp_s16, tmp1_s32);
          // Q20 * approx 0.001 (2^-10 = 0.0009766), hence,
          // (Q24 >> 14) = (Q24 >> 4) / 2^10 = Q20.
          tmp1_s32 = tmp2_s32 >> 14;

          // Q20 / Q7 = Q13.
          if (tmp1_s32 > 0) {
            tmp_s16 = (int16_t) WebRtcSpl_DivW32W16(tmp1_s32, nsk);
          } else {
            tmp_s16 = (int16_t) WebRtcSpl_DivW32W16(-tmp1_s32, nsk);
            tmp_s16 = -tmp_s16;
          }
          tmp_s16 += 32;  // Rounding
          nsk += tmp_s16 >> 6;  // Q13 >> 6 = Q7.
          if (nsk < kMinStd) {
            nsk = kMinStd;
          }
          self->noise_stds[gaussian] = nsk;
        }
      }

      // Separate models if they are too close.
      // |noise_global_mean| in Q14 (= Q7 * Q7).
      noise_global_mean = WeightedAverage(&self->noise_means[channel], 0,
                                          &kNoiseDataWeights[channel]);

      // |speech_global_mean| in Q14 (= Q7 * Q7).
      speech_global_mean = WeightedAverage(&self->speech_means[channel], 0,
                                           &kSpeechDataWeights[channel]);

      // |diff| = "global" speech mean - "global" noise mean.
      // (Q14 >> 9) - (Q14 >> 9) = Q5.
      diff = (int16_t) (speech_global_mean >> 9) -
          (int16_t) (noise_global_mean >> 9);
      if (diff < kMinimumDifference[channel]) {
        tmp_s16 = kMinimumDifference[channel] - diff;

        // |tmp1_s16| = ~0.8 * (kMinimumDifference - diff) in Q7.
        // |tmp2_s16| = ~0.2 * (kMinimumDifference - diff) in Q7.
        tmp1_s16 = (int16_t)((13 * tmp_s16) >> 2);
        tmp2_s16 = (int16_t)((3 * tmp_s16) >> 2);

        // Move Gaussian means for speech model by |tmp1_s16| and update
        // |speech_global_mean|. Note that |self->speech_means[channel]| is
        // changed after the call.
        speech_global_mean = WeightedAverage(&self->speech_means[channel],
                                             tmp1_s16,
                                             &kSpeechDataWeights[channel]);

        // Move Gaussian means for noise model by -|tmp2_s16| and update
        // |noise_global_mean|. Note that |self->noise_means[channel]| is
        // changed after the call.
        noise_global_mean = WeightedAverage(&self->noise_means[channel],
                                            -tmp2_s16,
                                            &kNoiseDataWeights[channel]);
      }

      // Control that the speech & noise means do not drift too much.
      maxspe = kMaximumSpeech[channel];
      tmp2_s16 = (int16_t) (speech_global_mean >> 7);
      if (tmp2_s16 > maxspe) {
        // Upper limit of speech model.
        tmp2_s16 -= maxspe;

        for (k = 0; k < kNumGaussians; k++) {
          self->speech_means[channel + k * kNumChannels] -= tmp2_s16;
        }
      }

      tmp2_s16 = (int16_t) (noise_global_mean >> 7);
      if (tmp2_s16 > kMaximumNoise[channel]) {
        tmp2_s16 -= kMaximumNoise[channel];

        for (k = 0; k < kNumGaussians; k++) {
          self->noise_means[channel + k * kNumChannels] -= tmp2_s16;
        }
      }
    }
    self->frame_counter++;
  }

  // Smooth with respect to transition hysteresis.
  if (!vadflag) {
    if (self->over_hang > 0) {
      vadflag = 2 + self->over_hang;
      self->over_hang--;
    }
    self->num_of_speech = 0;
  } else {
    self->num_of_speech++;
    if (self->num_of_speech > kMaxSpeechFrames) {
      self->num_of_speech = kMaxSpeechFrames;
      self->over_hang = overhead2;
    } else {
      self->over_hang = overhead1;
    }
  }
  return vadflag;
}
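
For comparison with the WebRTC code above, the much simpler scheme from sections 2.1 and 2.2, combined with the onset and hangover smoothing described in 3.5, can be sketched in a few lines of C. This is only a minimal illustration under stated assumptions, not WebRTC code: the names SimpleVad, frame_energy, frame_zcr and simple_vad_process are hypothetical, the noise baseline is assumed to be supplied by the caller, and the ZCR here is counted directly as sign changes per sample.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Short-time energy of one frame: E = sum over n of x(n)^2 (section 2.1). */
static double frame_energy(const int16_t *x, size_t n) {
    double e = 0.0;
    for (size_t i = 0; i < n; i++) {
        e += (double)x[i] * (double)x[i];
    }
    return e;
}

/* Zero-crossing rate normalized to [0, 1]: sign changes per sample (2.2). */
static double frame_zcr(const int16_t *x, size_t n) {
    assert(n > 0);
    size_t crossings = 0;
    for (size_t i = 1; i < n; i++) {
        int prev_non_negative = x[i - 1] >= 0;
        int cur_non_negative = x[i] >= 0;
        if (prev_non_negative != cur_non_negative) {
            crossings++;
        }
    }
    return (double)crossings / (double)n;
}

/* Hypothetical detector state; every field name is an assumption. */
typedef struct {
    double noise_energy;   /* baseline noise energy E_noise              */
    double k;              /* sensitivity factor K, threshold T = K * E  */
    double max_zcr;        /* frames with a higher ZCR count as noise    */
    int onset_frames;      /* consecutive speech frames needed to start  */
    int hangover_frames;   /* extra frames kept as speech after the end  */
    int speech_run;        /* current run of raw speech decisions        */
    int hangover_left;     /* remaining hangover frames                  */
    int in_speech;         /* smoothed state: 1 = speech, 0 = silence    */
} SimpleVad;

/* Processes one frame and returns the smoothed decision (1 = speech). */
static int simple_vad_process(SimpleVad *v, const int16_t *x, size_t n) {
    /* Raw decision: energy above T = K * E_noise, and ZCR low enough. */
    int raw = frame_energy(x, n) > v->k * v->noise_energy &&
              frame_zcr(x, n) <= v->max_zcr;
    if (raw) {
        v->speech_run++;
        if (v->speech_run >= v->onset_frames) {
            v->in_speech = 1;
            v->hangover_left = v->hangover_frames;
        }
    } else {
        v->speech_run = 0;
        if (v->in_speech) {
            if (v->hangover_left > 0) {
                v->hangover_left--;   /* still inside the hangover */
            } else {
                v->in_speech = 0;     /* hangover used up */
            }
        }
    }
    return v->in_speech;
}
```

A caller would initialize SimpleVad once (for example with K = 4, an onset of 2 frames and a hangover of 2 frames) and then pass each 10 ms or 20 ms frame to simple_vad_process. Compared with GmmProbability, the threshold here does not adapt to the noise level and there is no per-band model, which is exactly the weakness of the time-domain methods that section II points out.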