Transcribe multilingual audio files

Multilingual Speech to Text

Our advanced ASR technology is built on the robust foundation of OpenAI’s Whisper, known for its exceptional performance in multilingual speech recognition. However, we’ve significantly enhanced its capabilities with in-house innovations, including the implementation of phonetic time-stamps for 49 languages.

Benchmarks for top 10 supported languages

Rank	Code	Language	WER (%) on FLEURS
1	es	Spanish	3.0
2	it	Italian	4.0
3	en	English	4.2
4	pt	Portuguese	4.3
5	de	German	4.5
6	ja	Japanese	5.0
7	pl	Polish	5.6
8	ru	Russian	5.6
9	nl	Dutch	6.1
10	id	Indonesian	6.4

Transcirption is by default applied for the task protect. To have the transcription without any reduction use the task parameter and set it to transcribe.

Multilingual audio transcription without PHI reduction

BASE_URL = "https://voiceharbor.ai"
usage_token = "USAGE_TOKEN"
# Create a new job on the server via the class method.
job_id = VoiceHarborClient.create_job(BASE_URL, usage_token)

client = VoiceHarborClient(
    base_url=BASE_URL,
    job_id=job_id,
    token=usage_token,
    inputs_dir="./inputs/tests"
)

# Submit input files and the job file.and 
job_params = {"files": [], "task": "transcribe"}
job_params = client.submit_files(job_params)
job_file = client.submit_job(job_params)
logger.info(f"Job file created: {job_file}")

Set target transcription language


supported_codes = [
    "af", "ar", "az", "be", "bg", "bn", "bs", "br", "ca", "cs", "cy", "da", "de",
    "el", "en", "es", "et", "eu", "fa", "fi", "fr", "gl", "he", "hi", "hr", "hu",
    "hy", "id", "is", "it", "ja", "ka", "km", "kn", "ko", "kk", "la", "lt", "lv",
    "mi", "ml", "mn", "mr", "ms", "ne", "nl", "no", "oc", "pa", "pl", "pt", "ro",
    "ru", "si", "sk", "sl", "sq", "sr", "sn", "so", "sw", "ta", "te", "th", "tg",
    "tr", "uk", "ur", "vi", "yo", "zh"
]

BASE_URL = "https://voiceharbor.ai"
usage_token = "USAGE_TOKEN"
# Create a new job on the server via the class method.
job_id = VoiceHarborClient.create_job(BASE_URL, usage_token)

client = VoiceHarborClient(
    base_url=BASE_URL,
    job_id=job_id,
    token=usage_token,
    inputs_dir="./inputs/tests"
)

# Submit input files and the job file.and 
job_params = {"files": [],
              "task": "transcribe",
              "language":"en"}  
job_params = client.submit_files(job_params)
job_file = client.submit_job(job_params)
logger.info(f"Job file created: {job_file}")

Show Output example

{
  "speaker 1": {
    "transcription": [
      {
        "start": 0.05,
        "end": 5.27,
        "text": "The sandwich comes with ham, cheese, tomatoes, mayonnaise, pickles.",
        "words": [
          {
            "word": " The",
            "start": 0.05,
            "end": 0.55
          },
          {
            "word": " sandwich",
            "start": 0.55,
            "end": 0.93
          },
          {
            "word": " comes",
            "start": 0.93,
            "end": 1.33
          },
          {
            "word": " with",
            "start": 1.33,
            "end": 1.57
          },
          {
            "word": " ham",
            "start": 1.57,
            "end": 1.95
          },
          {
            "word": ",",
            "start": 1.95,
            "end": 2.13
          },
          {
            "word": " cheese",
            "start": 2.13,
            "end": 2.57
          },
          {
            "word": ",",
            "start": 2.57,
            "end": 2.73
          },
          {
            "word": " tomatoes",
            "start": 2.73,
            "end": 3.25
          },
          {
            "word": ",",
            "start": 3.25,
            "end": 3.55
          },
          {
            "word": " mayonnaise",
            "start": 3.55,
            "end": 3.93
          },
          {
            "word": ",",
            "start": 3.93,
            "end": 4.21
          },
          {
            "word": " pickles",
            "start": 4.21,
            "end": 4.51
          },
          {
            "word": ".",
            "start": 4.51,
            "end": 5.27
          }
        ]
      }
    ],
    "language": "en"
  }
}

Code-switching support

Our ASR component support’s for up to 4 languages and robust code-switching capabilities, it effortlessly transcribes audio that blends multiple languages. Its built-in automatic language detection ensures that users do not have to manually specify the language, streamlining the workflow, while precise time-stamps allow for easy navigation and review of audio content.

Benchmarks for top 4 supported languages for VoxPopuli

Language	Whisper-Large-V3 (WER)	Our’s (WER)	Δ (WER↓)
English	8.93	7.60	1.33
French	11.15	11.07	0.08
Spanish	11.06	8.52	2.54
German	17.75	12.41	5.34

Evaluation benchmark for the transcription of code-switched speech data.

Language	Whisper-Large-V2 (WER)	Our’s (WER)	Δ (WER↓)
English-Spanish	37.13	13.45	23.68

Transcirption is by default applied for the task protect. To have the transcription without any reduction use the task parameter and set it to transcribe.

Transcribe multilingual audio data in 4 supported languages.

BASE_URL = "https://voiceharbor.ai"
usage_token = "USAGE_TOKEN"
# Create a new job on the server via the class method.
job_id = VoiceHarborClient.create_job(BASE_URL, usage_token)

client = VoiceHarborClient(
    base_url=BASE_URL,
    job_id=job_id,
    token=usage_token,
    inputs_dir="./inputs/tests"
)

# Submit input files and the job file.and 
job_params = {"files": [],
              "task": "transcribe",
              "code-switch":True}
job_params = client.submit_files(job_params)
job_file = client.submit_job(job_params)
logger.info(f"Job file created: {job_file}")

Show Output example

{
  "speaker 1": {
    "transcription": [
      {
        "start": 0.02,
        "end": 4.54,
        "text": "Nunca hay, pues, violencia gratuita en sus tilmes.",
        "words": [
          {
            "word": " Nunca",
            "start": 0.02,
            "end": 1.04
          },
          {
            "word": " hay",
            "start": 1.04,
            "end": 1.36
          },
          {
            "word": ",",
            "start": 1.36,
            "end": 1.38
          },
          {
            "word": " pues",
            "start": 1.38,
            "end": 1.72
          },
          {
            "word": ",",
            "start": 1.72,
            "end": 2.44
          },
          {
            "word": " violencia",
            "start": 2.44,
            "end": 2.86
          },
          {
            "word": " gratuita",
            "start": 2.86,
            "end": 3.54
          },
          {
            "word": " en",
            "start": 3.54,
            "end": 3.64
          },
          {
            "word": " sus",
            "start": 3.64,
            "end": 3.94
          },
          {
            "word": " tilmes",
            "start": 3.94,
            "end": 4.5
          },
          {
            "word": ".",
            "start": 4.5,
            "end": 4.54
          }
        ]
      }
    ],
    "language": "es"
  },
  "speaker 2": {
    "transcription": [
      {
        "start": 4.14,
        "end": 9.16,
        "text": "He is considered a strategist in the process of transition towards Spanish democracy.",
        "words": [
          {
            "word": " He",
            "start": 4.14,
            "end": 4.64
          },
          {
            "word": " is",
            "start": 4.64,
            "end": 4.8
          },
          {
            "word": " considered",
            "start": 4.8,
            "end": 5.16
          },
          {
            "word": " a",
            "start": 5.16,
            "end": 5.4
          },
          {
            "word": " strategist",
            "start": 5.4,
            "end": 6.0
          },
          {
            "word": " in",
            "start": 6.0,
            "end": 6.14
          },
          {
            "word": " the",
            "start": 6.14,
            "end": 6.24
          },
          {
            "word": " process",
            "start": 6.24,
            "end": 6.62
          },
          {
            "word": " of",
            "start": 6.62,
            "end": 6.78
          },
          {
            "word": " transition",
            "start": 6.78,
            "end": 7.24
          },
          {
            "word": " towards",
            "start": 7.24,
            "end": 7.66
          },
          {
            "word": " Spanish",
            "start": 7.66,
            "end": 8.06
          },
          {
            "word": " democracy",
            "start": 8.06,
            "end": 8.7
          },
          {
            "word": ".",
            "start": 8.7,
            "end": 9.16
          }
        ]
      }
    ],
    "language": "en"
  }
}

Python SDK

Start using our Python SDK to connect with Voice Harbor and implement voice solutions

2025-07-01 Upcoming Release

NextGen Medical ASR

Changelog

Looking ahead, we’re also pushing the envelope by refining our model with an extensive trove of medical data to eliminate hallucinations and boost reliability in even the most demanding environments. The Q2 realease will not only improve the recognition of complex medical terminology but also significantly mitigate transcirption errors, ensuring that results are both accurate and reliable. Whether you need rapid transcription for global communications or precise documentation in critical healthcare settings, our ASR component is designed to deliver excellence.

PHI and Biometric reduction

Build with Voice Harbor

Speech to Text

Speaker Diarization

Playground

Multilingual Speech to Text

Benchmarks for top 10 supported languages

Multilingual audio transcription without PHI reduction

Set target transcription language

Code-switching support

Benchmarks for top 4 supported languages for VoxPopuli

Evaluation benchmark for the transcription of code-switched speech data.

Transcribe multilingual audio data in 4 supported languages.

Python SDK

Changelog

PHI and Biometric reduction

Build with Voice Harbor

Speech to Text

Speaker Diarization

Playground

​Multilingual Speech to Text

​Benchmarks for top 10 supported languages

​Multilingual audio transcription without PHI reduction

​Set target transcription language

​Code-switching support

​Benchmarks for top 4 supported languages for VoxPopuli

​Evaluation benchmark for the transcription of code-switched speech data.

​Transcribe multilingual audio data in 4 supported languages.

Python SDK

​Good news to share with you!

​Changelog

Multilingual Speech to Text

Benchmarks for top 10 supported languages

Multilingual audio transcription without PHI reduction

Set target transcription language

Code-switching support

Benchmarks for top 4 supported languages for VoxPopuli

Evaluation benchmark for the transcription of code-switched speech data.

Transcribe multilingual audio data in 4 supported languages.

Good news to share with you!

Changelog