Transcription

Attendee offers two methods for real-time meeting transcription: Third-party-based Transcription and Closed Caption-based Transcription. Both methods deliver real-time updates via webhooks, and both provide perfect speaker identification, also known as diarization.

For an example of a simple web application that uses Attendee to transcribe meeting audio in real time, see the real time transcription example repository.

Third-party-based Transcription

This method relies on access to a per-speaker audio stream for each participant. Attendee identifies when a participant starts and stops speaking. When a participant pauses for a few seconds, the audio segment is sent to a third-party transcription provider for processing.

Latency

The latency of third-party-based transcription is dependent on two main factors: the time it takes for the third-party provider to transcribe the audio, and the size of the audio segment itself. If a participant speaks for a long time without pausing, the audio segment sent for transcription will be large, increasing processing time. These two factors mean that third-party-based transcription generally has higher latency compared to closed caption-based transcription.

Quality

Third-party-based transcription is generally of higher quality than closed caption-based transcription.

Supported providers

  • Deepgram
  • OpenAI
  • Gladia
  • Assembly AI

See the API reference for supported parameters for configuring the transcription providers.

Cost

Third-party-based transcription incurs costs from the transcription provider. Attendee calls the provider using an API key that you supply in the Credentials section of the dashboard.

Closed Caption-based Transcription

This method takes advantage of the built-in closed captioning feature of the meeting platform. Attendee captures these captions as they are generated by the platform.

Latency

This method offers lower latency. Captions are captured as soon as they are generated by the platform.

Cost

Closed caption-based transcription is free.

Choosing the Right Method

| Feature | Third-party-based Transcription | Closed Caption-based Transcription |
| --- | --- | --- |
| Source | Per-participant audio segments | Built-in captions from the meeting platform (Zoom, Google Meet) |
| Transcription Quality | High (depends on the provider, e.g., OpenAI, Deepgram) | Generally lower than third-party-based transcription |
| Word-level timestamps | Supported by all providers except OpenAI | No |
| Speaker Diarization | Yes, perfect speaker identification | Yes, perfect speaker identification |
| Latency | Higher, due to provider processing and segment size | Lower, near-instantaneous |
| Cost | Incurs costs from third-party transcription providers | No additional costs |
| Setup | Requires configuring a third-party transcription provider | No setup required |

Adding transcription providers in the dashboard

For third-party-based transcription, you need to add your API key for a provider such as Deepgram, OpenAI, Gladia, or Assembly AI on the Settings > Credentials page.

Transcription errors

If you are using third-party-based transcription, you may encounter errors from the transcription provider. These errors are visible in the transcription section of the bot detail page in the dashboard.

Additionally, the post-processing complete bot event will contain a list of transcription errors in the event metadata.
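For example, a webhook consumer could surface these errors when that event arrives. The sketch below is illustrative only: the trigger name and metadata field names are assumptions, so check the webhooks page for the actual event schema.

def report_transcription_errors(event: dict) -> None:
    # Hypothetical trigger and field names; the actual schema is on the webhooks page.
    if event.get("trigger") != "post_processing_completed":
        return
    for error in event.get("metadata", {}).get("transcription_errors", []):
        print("Transcription error:", error)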

Configuring transcription in the API call

You can configure transcription settings when creating a bot, including the transcription provider and provider-specific options such as language and model. See the API reference for details. Set these parameters in the transcription_settings object of the create bot request body, which has the following form:

{
    "<transcription provider>": {
        <provider-specific parameters>
    }
}

For example, if you want to use Deepgram with English and the nova-2 model, set transcription_settings to:

{
    "deepgram": {
        "language": "en-US",
        "model": "nova-2"
    }
}
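Putting this together, the sketch below shows a create bot request made from Python with these settings. The base URL, the /api/v1/bots path, the Token authorization scheme, and the meeting URL are assumptions and placeholders, not guaranteed values; check the API reference for the exact request format.

import requests

BASE_URL = "https://app.attendee.dev/api/v1"  # assumed base URL; adjust for your deployment
API_KEY = "your-attendee-api-key"  # placeholder; create a key in the dashboard

response = requests.post(
    f"{BASE_URL}/bots",
    headers={"Authorization": f"Token {API_KEY}"},  # assumed auth scheme
    json={
        "meeting_url": "https://meet.google.com/abc-defg-hij",  # placeholder meeting
        "bot_name": "Transcription Bot",
        # The transcription_settings object from the example above.
        "transcription_settings": {
            "deepgram": {
                "language": "en-US",
                "model": "nova-2",
            }
        },
    },
)
response.raise_for_status()
print(response.json())  # the created bot, including its id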

Setting up webhooks for real time transcription

You can set up webhooks for real time transcription in the dashboard. Go to the Settings > Webhooks page and click the 'Create Webhook' button.

Make sure the transcript.update trigger is enabled for your webhook. This will fire a webhook event every time a new utterance is added to the transcript. See the webhooks page for more details on the webhook payload.
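To experiment locally, you can sketch a minimal receiver with Python's standard library, as below. The payload shape assumed here (a JSON body with a trigger field and a data field) is an assumption; the webhooks page documents the actual schema and how to verify webhook signatures.

import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read and parse the JSON body of the webhook delivery.
        length = int(self.headers.get("Content-Length", 0))
        event = json.loads(self.rfile.read(length))
        # "trigger" and "data" are assumed field names; see the webhooks page.
        if event.get("trigger") == "transcript.update":
            print("New utterance:", event.get("data"))
        # Acknowledge receipt so the delivery is not retried.
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8000), WebhookHandler).serve_forever()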

Fetching transcripts during and after the meeting

You can fetch transcripts during and after the meeting by calling the /transcript endpoint. See the API reference for details.
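For example, a Python sketch that fetches the transcript might look like the following. The /api/v1/bots/<id>/transcript path and the response shape (a list of utterances) are assumptions; confirm both in the API reference.

import requests

BASE_URL = "https://app.attendee.dev/api/v1"  # assumed base URL
API_KEY = "your-attendee-api-key"  # placeholder
BOT_ID = "your-bot-id"  # id returned when the bot was created

response = requests.get(
    f"{BASE_URL}/bots/{BOT_ID}/transcript",  # assumed endpoint path
    headers={"Authorization": f"Token {API_KEY}"},  # assumed auth scheme
)
response.raise_for_status()
for utterance in response.json():  # assumed: a list of utterance objects
    print(utterance)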

Multilingual transcription

All transcription methods can transcribe audio in multiple languages, but the set of supported languages varies by method. See the API reference for details on how to specify the language.

All third-party transcription providers support automatic language detection, but closed caption-based transcription does not. Some third-party providers can also transcribe audio in which the speaker switches languages mid-sentence; see the list below for details.

Choosing the right transcription provider

Deepgram

Cheap, fast, and good quality; the only downside is that it doesn't support as many languages as some of the other providers.

Can transcribe audio where the speaker is switching languages in the middle of a sentence.

$200 in free credits for new users.

Gladia

Similar to Deepgram, but more expensive and supports more languages.

Can transcribe audio where the speaker is switching languages in the middle of a sentence.

10 hours of free transcription each month.

Assembly AI

Similar to Deepgram in price and quality but lacks the ability to transcribe audio where the speaker is switching languages in the middle of a sentence. Very accurate word-level timestamps.

$50 in free credits for new users.

OpenAI

Cheaper than the other providers, but less accurate, and it often chooses the wrong language when the language is not specified in advance. Can transcribe audio where the speaker is switching languages in the middle of a sentence. Lacks word-level timestamps.