Speaker Diarization Online

Identify who is speaking, and when, in multi-speaker audio. Up to 10 speakers per file, 96-98% accuracy on clear audio with 2-5 speakers, free online tool.

Loved by over 3 million people

Speaker Diarization Online - Free Speaker Identification

Upload audio. Get a transcript that labels every speaker with timestamps. No signup required for the first three files, no credit card, no watermark on the output.

The tool runs speaker diarization in the browser. Drop an MP3, WAV, M4A, or FLAC file (up to 500MB), and the system returns a timestamped transcript with Speaker 1, Speaker 2, Speaker 3, and so on - up to ten distinct voices per file. A one-hour podcast finishes in about four minutes.

ChatGPT and Claude cannot diarize audio. They can summarize a transcript once it exists, but the step of separating voices in a raw recording needs a dedicated speech model. That is what this page does.

What you get:

  • 96-98% diarization accuracy with 2-5 speakers in clear audio
  • Up to 10 speakers per file, with accuracy declining to roughly 90-93% at the top end
  • Timestamps to the second on every speaker turn
  • MP3, WAV, M4A, FLAC input, up to 500MB
  • TXT, DOC, PDF, and SRT export
  • Free tier of 3 files per month, up to 45 minutes each

The model identifies speakers by voice characteristics - pitch, timbre, speaking rate, and prosody - not by matching faces or names. Each speaker gets a generic label that you can rename after processing.

How the Diarization Works

Three steps:

  1. Upload or paste a URL. Drag a file in, or paste a link from Dropbox, Google Drive, or a podcast host. The tool reads the audio directly.
  2. The model separates voices. It segments the audio, clusters segments with similar voice fingerprints, and assigns a speaker ID to each cluster. Overlapping speech is detected and tagged with both speaker IDs.
  3. Download the transcript. Pick TXT for notes, SRT for subtitles, DOC for editing, or PDF for sharing. Every speaker turn carries a timestamp.
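To make the SRT option in step 3 concrete, here is a minimal sketch of how timestamped speaker turns could be written out as SRT cues. The `Turn` structure and the "Speaker N: text" line layout are assumptions for illustration, not the tool's actual export code.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One speaker turn (hypothetical structure for this example)."""
    speaker: str  # e.g. "Speaker 1"
    start: float  # seconds from the start of the file
    end: float    # seconds from the start of the file
    text: str


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def turns_to_srt(turns: list[Turn]) -> str:
    """Render each turn as a numbered SRT cue with the speaker label inline."""
    blocks = []
    for index, turn in enumerate(turns, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(turn.start)} --> {srt_timestamp(turn.end)}\n"
            f"{turn.speaker}: {turn.text}\n"
        )
    # A blank line separates consecutive cues.
    return "\n".join(blocks)


if __name__ == "__main__":
    demo = [
        Turn("Speaker 1", 0.0, 4.0, "Welcome back."),
        Turn("Speaker 2", 4.0, 7.5, "Glad to be here."),
    ]
    print(turns_to_srt(demo))
```

Keeping the speaker label inside the cue text means any subtitle player shows who is talking without needing extra metadata.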

Under the hood, the pipeline combines a speaker embedding model (similar to the pyannote.audio approach used by most diarization research) with a transcription layer comparable to Deepgram Nova-3 and AssemblyAI’s speaker intelligence stack. For mono recordings it relies entirely on voice embeddings. For stereo recordings with speakers panned to separate channels, it uses channel cues to boost accuracy further.
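The clustering idea can be sketched in a few lines. This is a toy illustration, not the pipeline described above: the two-dimensional "embeddings", the greedy strategy, and the 0.8 similarity threshold are all assumptions chosen to keep the example small. Production toolkits use learned neural embeddings with many more dimensions and more robust clustering.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_segments(embeddings, threshold=0.8):
    """Greedily assign each segment embedding to a speaker cluster.

    Returns one speaker index per segment. A segment joins the most
    similar existing cluster if the similarity reaches the threshold;
    otherwise it starts a new cluster (a new speaker).
    """
    centroids = []  # running mean embedding per speaker
    counts = []     # number of segments per speaker
    labels = []
    for emb in embeddings:
        best, best_sim = None, threshold
        for i, centroid in enumerate(centroids):
            sim = cosine(emb, centroid)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best is None:
            centroids.append(list(emb))
            counts.append(1)
            best = len(centroids) - 1
        else:
            n = counts[best]
            # Update the running mean with the new segment.
            centroids[best] = [
                (c * n + x) / (n + 1) for c, x in zip(centroids[best], emb)
            ]
            counts[best] = n + 1
        labels.append(best)
    return labels


if __name__ == "__main__":
    toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.95], [1.0, 0.05]]
    print([f"Speaker {i + 1}" for i in cluster_segments(toy)])
```

Two similar-sounding voices produce embeddings that sit close together, which is why they are harder to separate: they may fall on the same side of the threshold.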

Processing time scales roughly linearly with file length. A 30-minute file takes about 2 minutes, a 60-minute file about 4 minutes, and a 90-minute file about 6-7 minutes.
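As a rough planning aid, the figures above work out to about 4 minutes of processing per 60 minutes of audio. The sketch below assumes that ratio holds across file lengths; the function name and the default ratio are illustrative, and real times will vary (the 90-minute figure above, for example, is quoted as 6-7 minutes rather than exactly 6).

```python
def estimate_processing_minutes(audio_minutes: float, ratio: float = 4 / 60) -> float:
    """Estimate processing time under a simple linear model.

    The default ratio (4 processing minutes per 60 audio minutes) is an
    assumption derived from the one-hour figure quoted on this page.
    """
    return audio_minutes * ratio


if __name__ == "__main__":
    for length in (30, 60, 90):
        print(f"{length} min of audio: about {estimate_processing_minutes(length):.1f} min")
```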

Speaker Diarization Compared

Feature              | ScreenApp             | AudioPod | Happy Scribe  | Descript    | Sonix
Free tier            | 3 files (45 min each) | None     | 10 min trial  | 1 hour free | 30 min trial
Max speakers         | 10                    | 8        | 10            | Unlimited   | 10
Diarization accuracy | 96-98%                | 94-96%   | 95-97%        | 96-99%      | 95-98%
Overlapping speech   | Yes                   | Limited  | Yes           | Yes         | Yes
File upload          | Yes                   | Yes      | Yes           | Yes         | Yes
Live diarization     | No                    | Yes      | No            | No          | No
Export formats       | TXT, DOC, PDF, SRT    | TXT only | TXT, PDF, SRT | Multiple    | Multiple
Languages            | 100+                  | 40+      | 120+          | 50+         | 100+
Paid pricing         | $19/mo                | $29/mo   | $17/mo        | $12/mo      | $22/mo

Quick notes on the alternatives:

  • AudioPod handles real-time speaker separation but starts at $29/month with no free tier. This tool gives 3 free files monthly and supports up to 10 speakers instead of 8.
  • Happy Scribe’s free trial caps at 10 minutes. This tool gives 45 minutes per file, three times per month.
  • Descript is strong for editing workflows and handles unlimited speakers, but the free tier ends after one hour.
  • Sonix costs $22/month and limits the free trial to 30 minutes total.

For a broader comparison across 10 transcription services, see the guide to the best audio transcription tools.

Who Uses Speaker Diarization

Podcasters

Multi-host shows need speaker-separated transcripts for show notes, chapter markers, and SEO. Upload the raw episode, get a transcript split by host and guest, paste it into Substack, Buzzsprout, or the episode description.

Meeting and interview notes

Remote teams use diarization to attribute action items and decisions. When video is off, the transcript still shows who spoke. Interviewers use it to separate questions from answers automatically.

Researchers

Focus group moderators and qualitative researchers need speaker attribution for coding. Consistent speaker IDs across a recording make it possible to tally contributions per participant without manual labeling.
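Tallying contributions per participant is straightforward once every turn carries a consistent speaker ID. A minimal sketch, assuming the turns have been parsed into hypothetical `(speaker, start, end)` tuples with times in seconds:

```python
from collections import defaultdict


def speaking_time(turns):
    """Sum speaking time in seconds per speaker.

    `turns` is an iterable of (speaker, start, end) tuples; this shape
    is an assumption for the example, not a documented export format.
    """
    totals = defaultdict(float)
    for speaker, start, end in turns:
        totals[speaker] += end - start
    return dict(totals)


if __name__ == "__main__":
    demo = [
        ("Speaker 1", 0, 10),
        ("Speaker 2", 10, 25),
        ("Speaker 1", 25, 30),
    ]
    totals = speaking_time(demo)
    whole = sum(totals.values())
    for speaker, seconds in totals.items():
        print(f"{speaker}: {seconds:.0f}s ({100 * seconds / whole:.0f}% of talk time)")
```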

Legal and professional services

Depositions, client calls, and consultations need speaker-labeled transcripts with timestamps. The export includes timestamps to the second, which is enough for citation in most case files.

FAQ

What is speaker diarization?

Speaker diarization is the process of determining “who spoke when” in an audio recording. The system analyzes voice characteristics - pitch, timbre, speaking rate - and clusters the audio into speaker turns. Output is a transcript with Speaker 1, Speaker 2, and so on, each segment timestamped.

How accurate is it?

On clear audio with 2-5 speakers, accuracy is 96-98%. With 6-10 speakers or moderate background noise it drops to roughly 90-95%. Phone recordings and outdoor audio typically land in the 85-90% range. Accuracy also depends on how distinct the voices are - two speakers with similar voices are harder to separate than two with different pitches.

Does it work for podcasts?

Yes. MP3 and M4A podcast files upload directly, or you can paste a URL from your podcast host and the tool fetches the audio. Each host and guest gets a separate speaker ID, which you can rename in the transcript.

How many speakers can it identify?

Up to 10 per file. Best results are with 2-5 speakers (96-98% accuracy). With 6-7 speakers, accuracy is 92-95%. With 8-10 speakers, expect 90-93% as voice overlap grows.

Does it do real-time diarization?

No. This is a file-upload tool. Most one-hour recordings process in about four minutes. For live meetings use the meeting recorder, which captures and transcribes in real time.

What audio formats work?

MP3, WAV, M4A, and FLAC, up to 500MB. Mono and stereo both work. Multi-track recordings with one speaker per track should be mixed down to stereo before upload - the model expects all speakers in the same audio stream.

How does overlapping speech get handled?

The model detects overlapping segments and tags them with every active speaker ID. In the transcript, cross-talk sections show all active IDs at the same timestamp. This is useful for spotting interruptions and moments where multiple people agreed at once.
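If you want to locate cross-talk yourself from a list of speaker turns, the check is a pairwise interval intersection. The sketch below assumes hypothetical `(speaker, start, end)` tuples in seconds and only illustrates the idea; the tool's own overlap detection works on the audio, not on finished turns.

```python
from itertools import combinations


def find_overlaps(turns):
    """Return (start, end, [speakers]) for each pair of overlapping turns.

    Only turns from different speakers count. A moment where three
    people talk at once shows up as three pairwise entries.
    """
    overlaps = []
    for (spk_a, start_a, end_a), (spk_b, start_b, end_b) in combinations(turns, 2):
        if spk_a == spk_b:
            continue
        start, end = max(start_a, start_b), min(end_a, end_b)
        if start < end:  # a non-empty intersection means cross-talk
            overlaps.append((start, end, sorted([spk_a, spk_b])))
    return sorted(overlaps)


if __name__ == "__main__":
    demo = [
        ("Speaker 1", 0.0, 5.0),
        ("Speaker 2", 4.0, 9.0),
        ("Speaker 1", 9.0, 12.0),
    ]
    for start, end, speakers in find_overlaps(demo):
        print(f"{start:.1f}s-{end:.1f}s: {' + '.join(speakers)}")
```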

Can it identify specific people by name?

No. The system assigns generic IDs (Speaker 1, Speaker 2) from voice characteristics alone. It does not match voices to known identities. After processing, rename the labels in the transcript - change “Speaker 1” to “Alex” and so on.
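For a downloaded TXT transcript, renaming can also be done in bulk with a short script. This is a sketch under assumptions: the "[HH:MM:SS] Speaker N: text" line layout is made up for the example, and the function is not part of the tool.

```python
import re


def rename_speakers(transcript: str, names: dict[str, str]) -> str:
    """Replace generic labels such as "Speaker 1" with real names.

    The negative lookahead keeps "Speaker 1" from matching the start
    of "Speaker 10".
    """
    if not names:
        return transcript
    labels = sorted(names, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(label) for label in labels) + r")(?!\d)"
    )
    return pattern.sub(lambda match: names[match.group(0)], transcript)


if __name__ == "__main__":
    text = (
        "[00:00:03] Speaker 1: Hi.\n"
        "[00:00:05] Speaker 10: Hello.\n"
        "[00:00:07] Speaker 2: Hey."
    )
    print(rename_speakers(text, {"Speaker 1": "Alex", "Speaker 2": "Sam"}))
```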

What languages are supported?

Over 100 languages, including English, Spanish, French, German, Portuguese, Chinese, Japanese, Korean, Hindi, Russian, and Arabic. Language is detected automatically. Accent handling works across major dialects for each language.

Is there a free tier?

Yes. Three files per month, up to 45 minutes each, no credit card. Free users get the full diarization feature set - timestamps, export, up to 10 speakers. The Growth plan at $19/month (billed annually) removes the file cap.

How does this compare to pyannote, NeMo, and Whisper diarization?

pyannote.audio and Nvidia NeMo are open-source diarization toolkits that researchers run locally. They require Python, GPU setup, and tuning. OpenAI’s Whisper transcribes audio but does not diarize on its own - it needs a separate diarization stage. This tool packages a production-grade diarization pipeline behind a browser upload, so you skip the setup entirely.
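To show what the "separate diarization stage" involves when you build this yourself, here is a minimal sketch of the merge step: each transcribed segment is given the speaker whose turn overlaps it the most. The tuple shapes are assumptions for illustration; they are not the actual output formats of Whisper, pyannote.audio, or NeMo, and no library calls are made.

```python
def assign_speakers(segments, turns):
    """Label transcript segments with the best-overlapping speaker.

    segments: list of (start, end, text) from a transcription model.
    turns:    list of (speaker, start, end) from a diarization model.
    Returns a list of (speaker, start, end, text); speaker is None when
    no turn overlaps the segment at all.
    """
    labeled = []
    for seg_start, seg_end, text in segments:
        best_speaker, best_overlap = None, 0.0
        for speaker, turn_start, turn_end in turns:
            # Length of the time interval shared by segment and turn.
            overlap = min(seg_end, turn_end) - max(seg_start, turn_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labeled.append((best_speaker, seg_start, seg_end, text))
    return labeled


if __name__ == "__main__":
    segments = [
        (0.0, 3.0, "Welcome to the show."),
        (3.2, 6.0, "Thanks for having me."),
    ]
    turns = [("Speaker 1", 0.0, 3.1), ("Speaker 2", 3.1, 6.5)]
    for speaker, start, end, text in assign_speakers(segments, turns):
        print(f"[{start:.1f}-{end:.1f}] {speaker}: {text}")
```

In a local setup this merge sits on top of model downloads, GPU configuration, and parameter tuning, which is the work a hosted tool takes off your hands.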

Real Results from Real Users

Aaron

Project Manager

★★★★★

Our overall experience with ScreenApp has been nothing but pleasant! Their support is terrific, and ScreenApp is a great recording system.

JP

Operations Manager

★★★★★

Finally, a screen recorder that doesn't slap watermarks on everything. The free plan gives me 45 minutes of AI processing monthly - that's enough for most of my training videos.

Trina

Founder

★★★★★

I was skeptical about another AI notetaker, but ScreenApp's generous free tier completely won me over. The quality is professional-grade, and the AI features actually work as advertised. Now I use it for all my client presentations and team demos.

Kelvin

Software Engineer

★★★★★

The desktop and mobile apps are fantastic. Recording meetings while I'm mobile has never been easier, and the dictation feature is a huge time-saver.

Millie

Director

★★★★★

Our team was drowning in client feedback until we found ScreenApp. Now we record every presentation and client call, and the AI summaries are spot-on.

Tanmay

Marketing Guru

★★★★★

Makes recording and sharing guides effortless. I love how I can capture my screen and instantly turn it into step-by-step guides in any format I need. Smart, simple, and a brilliant use of AI.

Sav

Project Manager

★★★★★

Users consistently praise our web-based platform that requires no installation. Start recording in seconds, not minutes.

Nate

Video Creator

★★★★★

The ability to automatically transcribe and summarize recordings is a major time-saver, turning video content into searchable, useful data.

Join 2,147,483+ users

Ready to boost your productivity?

Try Speaker Diarization and 300+ other AI-powered features for free.

Start using in 60 seconds • No credit card required