Speaker Diarization Online - Free Speaker Identification
Upload audio. Get a transcript that labels every speaker with timestamps. No signup required for the first three files, no credit card, no watermark on the output.
The tool runs speaker diarization in the browser. Drop an MP3, WAV, M4A, or FLAC file (up to 500MB), and the system returns a timestamped transcript with Speaker 1, Speaker 2, Speaker 3, and so on - up to ten distinct voices per file. A one-hour podcast finishes in about four minutes.
ChatGPT and Claude cannot diarize audio. They can summarize a transcript once it exists, but the step of separating voices in a raw recording needs a dedicated speech model. That is what this page does.
What you get:
- 96-98% diarization accuracy with 2-5 speakers in clear audio
- Up to 10 speakers per file, with accuracy declining to roughly 90-93% at the top end
- Timestamps to the second on every speaker turn
- MP3, WAV, M4A, FLAC input, up to 500MB
- TXT, DOC, PDF, and SRT export
- Free tier of 3 files per month, up to 45 minutes each
The model identifies speakers by voice characteristics - pitch, timbre, speaking rate, and prosody - not by matching faces or names. Each speaker gets a generic label that you can rename after processing.
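As a rough illustration of how voice fingerprints are compared, here is cosine similarity over toy embedding vectors. The numbers are made up and the vectors are tiny - real speaker embeddings have hundreds of dimensions - but the comparison works the same way:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two speaker-embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "voice fingerprints" (real embeddings have 192+ dims).
speaker_a_turn1 = [0.9, 0.1, 0.4, 0.2]
speaker_a_turn2 = [0.8, 0.2, 0.5, 0.1]   # same voice, a different turn
speaker_b_turn1 = [0.1, 0.9, 0.2, 0.7]   # different pitch/timbre profile

same = cosine_similarity(speaker_a_turn1, speaker_a_turn2)
diff = cosine_similarity(speaker_a_turn1, speaker_b_turn1)
print(same > diff)  # True: the same voice scores higher
```

Segments whose embeddings score high against each other end up under the same generic label.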
How the Diarization Works
Three steps:
- Upload or paste a URL. Drag a file in, or paste a link from Dropbox, Google Drive, or a podcast host. The tool reads the audio directly.
- The model separates voices. It segments the audio, clusters segments with similar voice fingerprints, and assigns a speaker ID to each cluster. Overlapping speech is detected and tagged with both speaker IDs.
- Download the transcript. Pick TXT for notes, SRT for subtitles, DOC for editing, or PDF for sharing. Every speaker turn carries a timestamp.
Under the hood, the pipeline combines a speaker embedding model (similar to the pyannote.audio approach used by most diarization research) with a transcription layer comparable to Deepgram Nova-3 and AssemblyAI’s speaker intelligence stack. For mono recordings it relies entirely on voice embeddings. For stereo recordings with speakers panned to separate channels, it uses channel cues to boost accuracy further.
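The segment-and-cluster step can be sketched in a few lines. This is a toy greedy version, not the production pipeline - real systems use agglomerative or spectral clustering over high-dimensional pyannote-style embeddings - but the shape of the logic is the same:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def assign_speakers(embeddings, threshold=0.8):
    """Greedy clustering sketch: each segment joins the closest existing
    speaker cluster, or starts a new one if nothing clears the threshold."""
    centroids = []   # one running-average embedding per speaker
    counts = []
    labels = []
    for emb in embeddings:
        best, best_sim = None, threshold
        for i, c in enumerate(centroids):
            sim = cosine_similarity(emb, c)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is None:
            centroids.append(list(emb))
            counts.append(1)
            labels.append(f"Speaker {len(centroids)}")
        else:
            counts[best] += 1
            centroids[best] = [
                (c * (counts[best] - 1) + e) / counts[best]
                for c, e in zip(centroids[best], emb)
            ]
            labels.append(f"Speaker {best + 1}")
    return labels
```

Segments from the same voice cluster under one ID; a new voice that matches nothing becomes the next Speaker N.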
Processing time scales roughly linearly with file length. A 30-minute file takes about 2 minutes, a 60-minute file about 4 minutes, and a 90-minute file about 6-7 minutes.
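Those figures work out to roughly 15 minutes of audio per minute of processing, which a back-of-envelope helper can capture (the function name is illustrative, not part of the tool):

```python
def estimate_processing_minutes(audio_minutes: float) -> float:
    """Rough estimate from the figures above: the pipeline chews through
    about 15 minutes of audio per minute of wall-clock time."""
    return audio_minutes / 15

print(estimate_processing_minutes(60))  # → 4.0
```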
Speaker Diarization Compared
| Feature | ScreenApp | AudioPod | Happy Scribe | Descript | Sonix |
|---|---|---|---|---|---|
| Free tier | 3 files (45 min each) | None | 10 min trial | 1 hour free | 30 min trial |
| Max speakers | 10 | 8 | 10 | Unlimited | 10 |
| Diarization accuracy | 96-98% | 94-96% | 95-97% | 96-99% | 95-98% |
| Overlapping speech | Yes | Limited | Yes | Yes | Yes |
| File upload | Yes | Yes | Yes | Yes | Yes |
| Live diarization | No | Yes | No | No | No |
| Export formats | TXT, DOC, PDF, SRT | TXT only | TXT, PDF, SRT | Multiple | Multiple |
| Languages | 100+ | 40+ | 120+ | 50+ | 100+ |
| Paid pricing | $19/mo | $29/mo | $17/mo | $12/mo | $22/mo |
Quick notes on the alternatives:
- AudioPod handles real-time speaker separation but starts at $29/month with no free tier. This tool gives 3 free files monthly and supports up to 10 speakers instead of 8.
- Happy Scribe’s free trial caps at 10 minutes. This tool gives 45 minutes per file, three times per month.
- Descript is strong for editing workflows and handles unlimited speakers, but the free tier ends after one hour.
- Sonix costs $22/month and limits the free trial to 30 minutes total.
For a broader comparison across 10 transcription services, see the guide to the best audio transcription tools.
Who Uses Speaker Diarization
Podcasters
Multi-host shows need speaker-separated transcripts for show notes, chapter markers, and SEO. Upload the raw episode, get a transcript split by host and guest, and paste it into Substack, Buzzsprout, or the episode description.
Meeting and interview notes
Remote teams use diarization to attribute action items and decisions. When video is off, the transcript still shows who spoke. Interviewers use it to separate questions from answers automatically.
Researchers
Focus group moderators and qualitative researchers need speaker attribution for coding. Consistent speaker IDs across a recording make it possible to tally contributions per participant without manual labeling.
Legal and healthcare
Depositions, client calls, and consultations need speaker-labeled transcripts with timestamps. The export includes timestamps to the second, which is enough for citation in most case files.
FAQ
What is speaker diarization?
Speaker diarization is the process of determining “who spoke when” in an audio recording. The system analyzes voice characteristics - pitch, timbre, speaking rate - and clusters the audio into speaker turns. Output is a transcript with Speaker 1, Speaker 2, and so on, each segment timestamped.
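A diarized transcript line is just a speaker label plus a time range. A minimal formatter, assuming seconds-based turn boundaries (this sketch is illustrative, not the tool's actual output code):

```python
def format_turn(start_s, end_s, speaker, text):
    """Render one diarized turn as a timestamped transcript line."""
    def hms(t):
        h, rem = divmod(int(t), 3600)
        m, s = divmod(rem, 60)
        return f"{h:02d}:{m:02d}:{s:02d}"
    return f"[{hms(start_s)} - {hms(end_s)}] {speaker}: {text}"

print(format_turn(65, 72, "Speaker 2", "I think we should ship on Friday."))
# [00:01:05 - 00:01:12] Speaker 2: I think we should ship on Friday.
```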
How accurate is it?
On clear audio with 2-5 speakers, accuracy is 96-98%. With 6-10 speakers or moderate background noise it drops to 90-94%. Phone recordings and outdoor audio typically land in the 85-90% range. Accuracy also depends on how distinct the voices are - two speakers with similar voices are harder to separate than two with different pitches.
Does it work for podcasts?
Yes. MP3 and M4A podcast files upload directly. Paste a URL from your podcast host and the tool fetches the audio. Each host and guest gets a separate speaker ID, and you rename them in the transcript.
How many speakers can it identify?
Up to 10 per file. Best results are with 2-5 speakers (96-98% accuracy). With 6-7 speakers, accuracy is 92-95%. With 8-10 speakers, expect 90-93% as voice overlap grows.
Does it do real-time diarization?
No. This is a file-upload tool. Most one-hour recordings process in about four minutes. For live meetings use the meeting recorder, which captures and transcribes in real time.
What audio formats work?
MP3, WAV, M4A, and FLAC, up to 500MB. Mono and stereo both work. Multi-track recordings with one speaker per track should be mixed down to stereo before upload - the model expects all speakers in the same audio stream.
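If your DAW exports one mono WAV per speaker, one way to combine two tracks into a single stereo file is Python's standard-library wave module. This is a sketch assuming 16-bit PCM mono inputs at the same sample rate; `mono_tracks_to_stereo` is a hypothetical helper, not part of the tool:

```python
import struct
import wave

def mono_tracks_to_stereo(left_path, right_path, out_path):
    """Interleave two 16-bit PCM mono WAV tracks into one stereo file,
    one speaker on the left channel and one on the right."""
    with wave.open(left_path, "rb") as l, wave.open(right_path, "rb") as r:
        assert l.getnchannels() == 1 and r.getnchannels() == 1
        assert l.getsampwidth() == 2 and r.getsampwidth() == 2
        rate = l.getframerate()
        n = min(l.getnframes(), r.getnframes())  # truncate to the shorter track
        left = struct.unpack(f"<{n}h", l.readframes(n))
        right = struct.unpack(f"<{n}h", r.readframes(n))
    with wave.open(out_path, "wb") as out:
        out.setnchannels(2)
        out.setsampwidth(2)
        out.setframerate(rate)
        out.writeframes(b"".join(struct.pack("<hh", a, b) for a, b in zip(left, right)))
```

For more than two tracks, the usual approach is summing samples per channel with clipping guards; ffmpeg's amix filter does the same job from the command line.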
How does overlapping speech get handled?
The model detects overlapping segments and tags them with every active speaker ID. In the transcript, cross-talk sections show both IDs at the same timestamp. This is useful for spotting interruptions and moments where multiple people agreed at once.
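Once speaker turns carry timestamps, spotting cross-talk reduces to interval intersection. A minimal sketch (the turn tuples and function name are illustrative):

```python
def overlapping_turns(turns):
    """Find pairs of speaker turns that overlap in time.
    Each turn is (start_seconds, end_seconds, speaker_id)."""
    overlaps = []
    for i, (s1, e1, a) in enumerate(turns):
        for s2, e2, b in turns[i + 1:]:
            start, end = max(s1, s2), min(e1, e2)
            if a != b and start < end:  # nonzero shared interval, distinct speakers
                overlaps.append((start, end, a, b))
    return overlaps

turns = [(0.0, 5.0, "Speaker 1"), (4.2, 9.0, "Speaker 2"), (9.5, 12.0, "Speaker 1")]
print(overlapping_turns(turns))  # [(4.2, 5.0, 'Speaker 1', 'Speaker 2')]
```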
Can it identify specific people by name?
No. The system assigns generic IDs (Speaker 1, Speaker 2) from voice characteristics alone. It does not match voices to known identities. After processing, rename the labels in the transcript - change “Speaker 1” to “Alex” and so on.
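Renaming is a plain text substitution you can also do in any editor or script. A sketch, replacing the longest labels first so 'Speaker 1' does not clobber 'Speaker 10':

```python
def rename_speakers(transcript: str, names: dict) -> str:
    """Swap generic diarization labels for real names after processing.
    Longest labels go first so 'Speaker 1' never mangles 'Speaker 10'."""
    for generic in sorted(names, key=len, reverse=True):
        transcript = transcript.replace(generic, names[generic])
    return transcript

text = "[00:00:02] Speaker 1: Welcome back.\n[00:00:05] Speaker 2: Thanks for having me."
print(rename_speakers(text, {"Speaker 1": "Alex", "Speaker 2": "Jordan"}))
```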
What languages are supported?
Over 100 languages, including English, Spanish, French, German, Portuguese, Chinese, Japanese, Korean, Hindi, Russian, and Arabic. Language is detected automatically. Accent handling works across major dialects for each language.
Is there a free tier?
Yes. Three files per month, up to 45 minutes each, no credit card. Free users get the full diarization feature set - timestamps, export, up to 10 speakers. The Growth plan at $19/month (billed annually) removes the file cap.
How does this compare to pyannote, NeMo, and Whisper diarization?
pyannote.audio and NVIDIA NeMo are open-source diarization toolkits that researchers run locally. They require Python, GPU setup, and tuning. OpenAI’s Whisper transcribes audio but does not diarize on its own - it needs a separate diarization stage. This tool packages a production-grade diarization pipeline behind a browser upload, so you skip the setup entirely.
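The merge step such a two-stage Whisper setup needs can be sketched: assign each transcribed segment to the speaker whose diarization turn overlaps it most. The data shapes below are assumptions for illustration, not Whisper's actual output format:

```python
def attach_speakers(segments, turns):
    """Merge transcription with diarization: give each transcribed segment
    the speaker whose turn overlaps it most.
    segments: (start, end, text); turns: (start, end, speaker_id)."""
    out = []
    for s_start, s_end, text in segments:
        best_speaker, best_overlap = "unknown", 0.0
        for t_start, t_end, speaker in turns:
            overlap = min(s_end, t_end) - max(s_start, t_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        out.append((best_speaker, text))
    return out

segments = [(0.0, 3.0, "How was the launch?"), (3.2, 7.0, "Better than expected.")]
turns = [(0.0, 3.1, "Speaker 1"), (3.1, 8.0, "Speaker 2")]
print(attach_speakers(segments, turns))
# [('Speaker 1', 'How was the launch?'), ('Speaker 2', 'Better than expected.')]
```

Doing this well - handling overlaps, gaps, and segment boundaries that straddle two turns - is exactly the glue work the browser tool handles for you.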