Fake Voices, Real Podcast Problems

Brian Stever

Deepfake Audio Detection · 2026 · PyTorch, torchaudio, ASVspoof, spectrogram CNNs

Abstract. Deepfake Audio Detection started as coursework and immediately wandered into familiar territory: podcast production, where a voice is not just audio but evidence. The project builds a synthetic speech detector around log-Mel spectrograms, a compact CNN with Squeeze-and-Excitation attention, and ASVspoof evaluation data. The final model is fast and genuinely useful as a research prototype. It is also not ready to guard the studio door by itself, which is a much better finding than pretending otherwise.

1. Why Podcast Voices

Podcasting does a strange thing to voices. A host's voice becomes a brand, a relationship, a production cue, and occasionally a shaky little authentication system wearing headphones. If a clip sounds like the right person, everyone is tempted to relax.

Deepfake audio makes that instinct risky. The imagined attacker does not need a movie-villain setup. Public episodes, guest spots, livestreams, or social clips can become source material, and the fake output can arrive looking like one more file in the publishing flow. The detector here is not the whole security plan. It is the extra person in the room asking, politely, whether the waveform seems too confident.

2. Pipeline

The system converts every clip to mono 16 kHz audio, pads or trims it to four seconds, and transforms it into a 128-bin log-Mel spectrogram. The main model treats that spectrogram like a single-channel image and uses a four-block CNN with Squeeze-and-Excitation channel attention.

Training used ASVspoof 2019 Logical Access train/dev data, then evaluation moved to the full ASVspoof 2021 LA set. That jump matters. The friendly split can make a model look like it has been lifting weights. The 2021 benchmark asks it to carry furniture up stairs.

Figure 1. A sample spectrogram view from the preprocessing pipeline. The detector is really looking for suspicious time-frequency artifacts, not listening like a person.

3. Results

The main CNN + SE model reached 18.28% EER and 0.8979 AUC-ROC on the full ASVspoof 2021 LA evaluation set. At a fixed threshold its F1 was strong (0.9479), but EER, the error rate at the operating point where false acceptances and false rejections are equally frequent, is the more security-relevant metric, and it keeps the story honest: working detector, yes; deployable countermeasure, not yet.
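For readers who want the two headline metrics spelled out, here is a small standard-library sketch of how EER and AUC-ROC can be computed from per-clip scores. The label convention (1 for spoof, 0 for bona fide, higher score meaning more spoof-like) and the function names are assumptions for illustration; the project's own evaluation code is not shown in this write-up.

```python
from typing import List, Sequence, Tuple


def roc_points(labels: Sequence[int], scores: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Sweep the threshold from strict to loose and return (fpr, tpr) lists."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one clip of each class")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    i = 0
    while i < len(order):
        current = scores[order[i]]
        # Clips with tied scores cross the threshold together.
        while i < len(order) and scores[order[i]] == current:
            if labels[order[i]] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        fpr.append(fp / neg)
        tpr.append(tp / pos)
    return fpr, tpr


def auc_roc(fpr: Sequence[float], tpr: Sequence[float]) -> float:
    """Area under the ROC curve by the trapezoid rule."""
    return sum(
        (fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2 for k in range(1, len(fpr))
    )


def equal_error_rate(fpr: Sequence[float], tpr: Sequence[float]) -> float:
    """Error rate at the swept threshold where FPR and FNR are closest."""
    best = min(range(len(fpr)), key=lambda k: abs(fpr[k] - (1 - tpr[k])))
    return (fpr[best] + (1 - tpr[best])) / 2


# Toy example: four spoofed clips and four bona fide clips.
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.2, 0.1]
toy_fpr, toy_tpr = roc_points(toy_labels, toy_scores)
toy_eer = equal_error_rate(toy_fpr, toy_tpr)   # 0.25
toy_auc = auc_roc(toy_fpr, toy_tpr)            # 0.875
```

The toy set shows why the two numbers tell different stories: one confidently wrong spoof score is enough to pin EER at 25% even though the ranking is mostly right.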

A podcast-style stress test added noise, room response, and MP3-like compression. The model got worse, because audio production is where clean signals go to develop personality, but it did not fall apart. EER rose by 0.20 percentage points, from 18.28% to 18.48%, while AUC dropped from 0.8979 to 0.8852.
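Two of the three stress-test degradations are easy to sketch in plain PyTorch. The version below is an assumption-heavy illustration, not the project's augmentation code: it uses white Gaussian noise at a chosen SNR and a synthetic, exponentially decaying noise burst as a stand-in room response. The SNR, the decay time, and every function name are hypothetical, and the MP3-like compression step is deliberately left out because it needs a real codec.

```python
import math
from typing import Optional

import torch


def add_noise(wave: torch.Tensor, snr_db: float,
              generator: Optional[torch.Generator] = None) -> torch.Tensor:
    """Add white Gaussian noise scaled to the requested signal-to-noise ratio."""
    signal_power = wave.pow(2).mean().clamp_min(1e-12)
    noise = torch.randn(wave.shape, generator=generator)
    noise_power = noise.pow(2).mean()
    scale = torch.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    return wave + scale * noise


def synthetic_room(wave: torch.Tensor, sample_rate: int = 16_000, rt60: float = 0.3,
                   generator: Optional[torch.Generator] = None) -> torch.Tensor:
    """Convolve with a decaying noise burst (assumed stand-in for a measured room)."""
    n = int(sample_rate * rt60)
    t = torch.arange(n) / sample_rate
    # Amplitude envelope falls by 60 dB over rt60 seconds.
    envelope = torch.exp(-math.log(1000.0) * t / rt60)
    impulse = torch.randn(n, generator=generator) * envelope
    impulse[0] = 1.0                      # keep the direct path dominant
    impulse = impulse / impulse.norm()    # unit-energy impulse response
    # FFT-based linear convolution, trimmed back to the input length.
    size = wave.shape[-1] + n - 1
    spectrum = torch.fft.rfft(wave, size) * torch.fft.rfft(impulse, size)
    return torch.fft.irfft(spectrum, size)[..., : wave.shape[-1]]


def podcast_stress(wave: torch.Tensor, snr_db: float = 20.0, rt60: float = 0.3,
                   generator: Optional[torch.Generator] = None) -> torch.Tensor:
    """Room response first, then noise, roughly the order a microphone sees them."""
    return add_noise(synthetic_room(wave, rt60=rt60, generator=generator), snr_db, generator)
```

Applying the same degradation to every evaluation clip and rescoring gives a paired comparison like the clean versus stress rows reported here, although the exact numbers depend entirely on the settings used.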

Table 1. Selected evaluation results.

Experiment                  EER       AUC      F1
CNN + SE, clean 2021 LA     18.28%    0.8979   0.9479
CNN + SE, podcast stress    18.48%    0.8852   0.9430
ResNet-18 baseline          17.21%    0.8957   0.9145
LFCC-GMM baseline           30.89%    0.7400   0.8128

Figure 2. Equal Error Rate comparison across the main CNN + SE model, podcast-stress evaluation, ResNet-18, and LFCC-GMM baseline.

4. Baselines and Tradeoffs

ResNet-18 had the best EER at 17.21%, so the smaller model does not get a parade. The more interesting result is practical: the CNN + SE model stayed within about one percentage point of that EER (18.28%) while using far fewer parameters and running faster in timing runs on a local MacBook.
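To give the parameter argument some shape, here is a hypothetical four-block CNN with Squeeze-and-Excitation attention and a helper that counts its trainable parameters. The channel widths, reduction ratio, pooling, and single-logit head are all assumptions; the write-up does not list the real model's layer sizes, so the count this sketch produces is not the project's figure.

```python
import torch
from torch import nn


class SEBlock(nn.Module):
    """Squeeze-and-Excitation: learn one gate per channel from global statistics."""

    def __init__(self, channels: int, reduction: int = 8):  # reduction is assumed
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Squeeze: average over frequency and time -> (batch, channels).
        weights = self.gate(x.mean(dim=(2, 3)))
        # Excite: rescale every channel by its learned gate.
        return x * weights[:, :, None, None]


def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        SEBlock(out_ch),
        nn.MaxPool2d(2),
    )


class CompactSECNN(nn.Module):
    """Hypothetical four-block model over a single-channel log-Mel input."""

    def __init__(self, widths=(32, 64, 128, 256)):  # widths are assumed
        super().__init__()
        chans = (1, *widths)
        self.features = nn.Sequential(
            *[conv_block(a, b) for a, b in zip(chans, chans[1:])]
        )
        self.head = nn.Linear(widths[-1], 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        # Global average pooling, then one spoof-vs-bona-fide logit per clip.
        return self.head(x.mean(dim=(2, 3))).squeeze(-1)


def count_parameters(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```

With these assumed widths the sketch lands at roughly 0.41 million trainable parameters. For scale, the stock torchvision ResNet-18 with its ImageNet head has about 11.7 million, so a gap of this kind is plausible even though the project's exact counts are not reported here.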

The LFCC-GMM baseline was much weaker at 30.89% EER. That gave the project a useful floor: a traditional cepstral-feature approach was not enough for this version of the task, while the learned spectrogram models handled the benchmark much better.

Figure 3. ROC curve for the CNN + SE model on ASVspoof 2021 LA. The curve is good enough to be interesting and not good enough to pretend the problem is solved.

5. What It Proved

The project worked in the engineering sense: data loading, preprocessing, augmentation, training, checkpointing, baseline comparison, evaluation, plotting, and metric export all ran end to end. It also failed in the useful research sense: the 2021 benchmark exposed a generalization gap that the 2019 development split politely kept hidden.

That is probably the right lesson. A detector can look heroic in the lab and still get weird when codecs, channels, and synthesis systems shift. The next version would need threshold calibration, stronger anti-spoofing architectures, and tests using real podcast export chains from the tools production teams actually touch.
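As one possible starting point for the threshold calibration mentioned above, the sketch below picks a decision threshold from held-out bona fide scores so that no more than a target fraction of genuine clips get flagged. The scoring convention (higher means more spoof-like, flag when the score is strictly above the threshold), the function name, and the idea of calibrating on real podcast exports are assumptions, not something the project has done.

```python
import math
from typing import Sequence


def threshold_for_false_alarm_rate(bonafide_scores: Sequence[float],
                                   target_rate: float) -> float:
    """Return a threshold that flags at most target_rate of the bona fide clips."""
    if not bonafide_scores:
        raise ValueError("need at least one bona fide score")
    if not 0.0 <= target_rate < 1.0:
        raise ValueError("target_rate must be in [0, 1)")
    ranked = sorted(bonafide_scores, reverse=True)
    # Number of genuine clips we are willing to flag by mistake.
    allowed = math.floor(target_rate * len(ranked))
    # Everything strictly above ranked[allowed] is at most `allowed` clips.
    return ranked[allowed]


def false_alarm_rate(bonafide_scores: Sequence[float], threshold: float) -> float:
    """Fraction of bona fide clips that would be flagged at this threshold."""
    flagged = sum(1 for s in bonafide_scores if s > threshold)
    return flagged / len(bonafide_scores)
```

A threshold chosen this way only holds up if the calibration clips resemble what the detector will see later, which is the same domain-shift problem the 2021 benchmark exposed.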