How to Convert Audio to Text with Python for Free

Leandro Hirt

Updated on: May 29, 2026

Reading time: 9 minutes

Turning audio files into text is one of the most useful and in-demand tasks in the age of artificial intelligence and automation. Whether you need to transcribe meetings, generate subtitles for videos, or build voice-driven assistants, knowing how to convert audio to text with Python for free saves hours of manual work. Python stands out in this area because its rich ecosystem of libraries gives you access to powerful speech recognition engines at no cost, as long as you know which tools to choose.

In this complete guide, you will explore the SpeechRecognition library, which acts as a unified wrapper for multiple voice recognition engines. While paid high-performance services exist, it is perfectly possible to achieve professional-grade results using open-source solutions and free APIs. This project is also a great stepping stone for anyone building larger Python automation systems that need to process multimedia data at scale.

Why Automate Audio Transcription?

Manual transcription is an extremely slow process. On average, a human transcriber takes about four hours to convert just one hour of audio. By automating this with Python, you can process large volumes of recordings in a fraction of that time. Beyond speed, automation allows you to embed this functionality inside larger systems such as databases, chatbots, or content pipelines.

None of this requires a graphical interface. A plain Python script can monitor an entire folder and automatically convert every new audio file that appears. This is the fundamental pattern behind many modern administrative and creative task automation workflows.

Prerequisites and Library Installation

To get started, you need a working Python installation on your computer. If you have not set that up yet, the guide on how to install Python walks through the full process. You will also need two primary libraries: SpeechRecognition for the transcription logic and pydub for converting audio formats before processing.

It is strongly recommended to install these inside a dedicated Python virtual environment to avoid dependency conflicts with other projects. Then run:

Bash

pip install SpeechRecognition
pip install pydub

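The SpeechRecognition library supports multiple engines. The most accessible free option is the Google Web Speech API, which recognizes speech in dozens of languages, including English. The library ships with a generic default API key, so you do not need to register your own for small volumes; that key is intended for testing and personal use and may be revoked by Google at any time.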
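
Before running the examples, a quick check like the following can confirm that the environment is ready. This is only a sketch: the helper name check_environment is made up for this guide, and the FFmpeg lookup assumes the executable is called ffmpeg and is on your PATH (pydub relies on it for compressed formats).

```python
import importlib.util
import shutil


def check_environment(modules=("speech_recognition", "pydub"), tools=("ffmpeg",)):
    """Return a dict mapping each requirement to True (found) or False (missing)."""
    status = {}
    for name in modules:
        # find_spec returns None when a top-level module cannot be found
        status[name] = importlib.util.find_spec(name) is not None
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH
        status[tool] = shutil.which(tool) is not None
    return status


if __name__ == "__main__":
    for requirement, found in check_environment().items():
        print(f"{requirement}: {'OK' if found else 'missing'}")
```

Anything reported as missing can be installed before moving on to the transcription steps.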

Step-by-Step: Converting Audio to Text with Python

The conversion process follows three logical steps: load the audio file, process it through a recognizer, and display or save the resulting text. Note that SpeechRecognition's AudioFile class reads only uncompressed or lossless formats, namely WAV (PCM), AIFF, and FLAC. If your audio is in MP3, it must be converted first, which is covered below.

Step 1: Importing the Library and Configuring the Recognizer

Python

import speech_recognition as sr

# Create the recognizer instance
recognizer = sr.Recognizer()

Step 2: Loading and Reading the Audio File

Use Python’s context manager to open the file safely. The adjust_for_ambient_noise call samples the start of the recording to set the recognizer's energy threshold for background noise. Keep in mind that it consumes the portion it samples (one second by default), so the audio returned by record() starts after that window; pass a shorter duration, or skip the call, if speech begins right away:

Python

with sr.AudioFile("my_audio.wav") as source:
    # Calibrate on the first 0.5 s; this part is not included in audio_data
    recognizer.adjust_for_ambient_noise(source, duration=0.5)
    audio_data = recognizer.record(source)

Step 3: Sending to the API and Getting the Text

With the audio data captured, you send it to Google’s engine. Always use try and except in Python to handle the two most common failure modes: audio that was not understood (UnknownValueError) and network errors (RequestError):

Python

try:
    text = recognizer.recognize_google(audio_data, language='en-US')
    print("Transcribed text: " + text)
except sr.UnknownValueError:
    print("Google Speech Recognition could not understand the audio.")
except sr.RequestError as e:
    print(f"Could not request results from the service: {e}")
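
Network errors are often temporary, so one option is to retry the request a few times before giving up. The helper below is a hypothetical sketch and not part of SpeechRecognition: the name call_with_retry, its parameters, and the doubling delay are all assumptions chosen for illustration.

```python
import time


def call_with_retry(func, retry_on, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); if it raises retry_on, wait and try again.

    The delay doubles after each failure (base_delay, 2 * base_delay, ...).
    The final failure is re-raised so the caller can still handle it.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Usage with the recognizer from the steps above (needs network access):
#
#   text = call_with_retry(
#       lambda: recognizer.recognize_google(audio_data, language="en-US"),
#       retry_on=sr.RequestError,
#   )
```

Only RequestError is worth retrying here; an UnknownValueError means the audio itself was not understood, and sending it again is unlikely to help.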

Handling Different Audio Formats with pydub

Your recordings will often arrive in compressed formats like MP3 or OGG. SpeechRecognition cannot read these directly, so you use pydub to convert them first. If you encounter codec errors during this step, make sure FFmpeg is installed on your operating system, as pydub depends on it for format conversion:

Python

from pydub import AudioSegment

# Convert MP3 to WAV
audio_mp3 = AudioSegment.from_mp3("interview.mp3")
audio_mp3.export("interview_converted.wav", format="wav")

Once the WAV file is ready, pass its path to the SpeechRecognition workflow shown above. This two-step approach handles virtually any common audio source.
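
For formats other than MP3, pydub's generic AudioSegment.from_file loader can be used in the same way. The following sketch wraps the conversion in a function; the names wav_path_for and convert_to_wav are hypothetical, and the 16 kHz mono target is an assumption that tends to suit speech rather than a requirement of either library.

```python
from pathlib import Path


def wav_path_for(source_path):
    """Return the output path: a/b.mp3 -> a/b.wav, a/b.wav -> a/b_converted.wav."""
    path = Path(source_path)
    if path.suffix.lower() == ".wav":
        # Avoid overwriting an input that is already a WAV file
        return path.with_name(path.stem + "_converted.wav")
    return path.with_suffix(".wav")


def convert_to_wav(source_path, sample_rate=16000):
    """Convert an audio file to mono PCM WAV and return the new path.

    Requires pydub and FFmpeg.
    """
    from pydub import AudioSegment  # imported here so wav_path_for works without pydub

    target = wav_path_for(source_path)
    audio = AudioSegment.from_file(source_path)  # FFmpeg detects the input format
    audio = audio.set_channels(1).set_frame_rate(sample_rate)
    # export() returns the open output file, so close it explicitly
    audio.export(str(target), format="wav").close()
    return target
```

Calling convert_to_wav("interview.ogg") would write interview.wav next to the original and return its path.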

Improving Transcription Accuracy

Accuracy depends directly on audio quality. Background noise, echo, and distant voices all reduce recognition precision. Beyond adjust_for_ambient_noise, a practical strategy for long recordings (like a one-hour podcast) is to split the audio into 30 to 60 second chunks, transcribe each one separately, and concatenate the results. This prevents network timeouts and avoids overwhelming your machine’s RAM. Working with very large audio files in a loop follows the same pattern as reading large files without freezing Python.
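
The chunking strategy can be sketched as follows. This is a minimal illustration under stated assumptions: chunk_ranges and transcribe_in_chunks are hypothetical names, the cuts fall at fixed intervals (so a word may occasionally be split at a boundary), and each chunk is written to a temporary WAV file before being sent to Google.

```python
import os
import tempfile


def chunk_ranges(total_ms, chunk_ms=60_000):
    """Return (start, end) millisecond pairs that cover 0..total_ms."""
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    return [(start, min(start + chunk_ms, total_ms))
            for start in range(0, total_ms, chunk_ms)]


def transcribe_in_chunks(wav_path, chunk_ms=60_000, language="en-US"):
    """Transcribe a long WAV file piece by piece and join the results."""
    import speech_recognition as sr
    from pydub import AudioSegment

    recognizer = sr.Recognizer()
    audio = AudioSegment.from_wav(wav_path)
    parts = []
    # len() of a pydub segment is its duration in milliseconds
    for start, end in chunk_ranges(len(audio), chunk_ms):
        fd, tmp_path = tempfile.mkstemp(suffix=".wav")
        os.close(fd)
        try:
            # Slicing a pydub segment also uses milliseconds
            audio[start:end].export(tmp_path, format="wav").close()
            with sr.AudioFile(tmp_path) as source:
                chunk_audio = recognizer.record(source)
            try:
                parts.append(recognizer.recognize_google(chunk_audio, language=language))
            except sr.UnknownValueError:
                # Silence or unintelligible speech in this chunk: skip it
                continue
        finally:
            os.remove(tmp_path)
    return " ".join(parts)
```

If fixed cuts split too many words, pydub's split_on_silence function in pydub.silence is one alternative worth trying, since it cuts at pauses instead of at fixed times.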

Transcribing an Entire Folder Automatically

The real power comes when you process multiple files at once. Using the Python os module, you can iterate over all WAV files in a directory and save each transcript to a corresponding text file:

Python

import os
import speech_recognition as sr

recognizer = sr.Recognizer()
folder = "./audio_files"

for filename in os.listdir(folder):
    if filename.endswith(".wav"):
        filepath = os.path.join(folder, filename)
        with sr.AudioFile(filepath) as source:
            audio = recognizer.record(source)
        try:
            text = recognizer.recognize_google(audio, language='en-US')
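            # Save the transcript next to the audio file, swapping only the extension
            output = os.path.join(folder, os.path.splitext(filename)[0] + ".txt")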
            with open(output, "w", encoding="utf-8") as f:
                f.write(text)
            print(f"Saved: {output}")
        except Exception as e:
            print(f"Error on {filename}: {e}")
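
When the batch script runs repeatedly, for example on a schedule to pick up new recordings, it helps to skip files that were already transcribed. The sketch below assumes each transcript is stored next to its audio file with a .txt extension; pending_wav_files is a hypothetical helper name.

```python
from pathlib import Path


def pending_wav_files(folder):
    """Return the WAV files in folder that do not yet have a .txt transcript."""
    folder = Path(folder)
    return sorted(
        wav for wav in folder.iterdir()
        if wav.is_file()
        and wav.suffix.lower() == ".wav"
        and not wav.with_suffix(".txt").exists()
    )


# Usage: feed only the new files into the transcription loop
#
#   for wav in pending_wav_files("./audio_files"):
#       print(f"Needs transcription: {wav.name}")
```

Matching the extension case-insensitively also picks up files named like RECORDING.WAV, which a plain endswith(".wav") check would miss.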

Complete Project Code

Here is the full unified script with proper error handling, file existence checks, and result saving. Replace the filename at the bottom with your own audio file path to run it immediately:

Python

import speech_recognition as sr
import os

def transcribe_audio(audio_path):
    recognizer = sr.Recognizer()

    if not os.path.exists(audio_path):
        print("Audio file not found. Check the path and try again.")
        return

    with sr.AudioFile(audio_path) as source:
        print("Reading audio file...")
        audio = recognizer.record(source)

    try:
        print("Transcribing with Google Speech Recognition...")
        text = recognizer.recognize_google(audio, language='en-US')

        output_file = "transcription_result.txt"
        with open(output_file, "w", encoding="utf-8") as f:
            f.write(text)

        print("Transcription complete!")
        print("-" * 40)
        print(text)
        print("-" * 40)
        print(f"Result saved to: {output_file}")

    except sr.UnknownValueError:
        print("Error: The AI could not understand the audio.")
    except sr.RequestError as e:
        print(f"Error: Could not connect to the transcription service: {e}")

if __name__ == "__main__":
    audio_file = "my_audio.wav"  # Replace with your file path
    transcribe_audio(audio_file)

Offline Alternative: OpenAI Whisper

Although the Google API works well for most use cases, it requires an active internet connection and sends your audio to an external server. If you need full privacy or want to process audio without depending on external services, OpenAI Whisper is a strong free, open-source alternative. It runs entirely on your local machine using your GPU or CPU, and its accuracy is often on par with paid commercial services, even for technical vocabulary.

Whisper requires installing PyTorch and the openai-whisper package, plus FFmpeg for decoding audio, making it heavier than SpeechRecognition. But for large-scale projects that demand high accuracy, it is one of the most capable free transcription options available.
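
A minimal Whisper sketch might look like the following. It assumes the openai-whisper package, PyTorch, and FFmpeg are installed, and that the "base" model is accurate enough for your audio. The helper names (format_timestamp, format_segments, transcribe_with_whisper) are hypothetical; load_model and transcribe come from the package itself.

```python
def format_timestamp(seconds):
    """Format a number of seconds as MM:SS."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def format_segments(segments):
    """Turn Whisper-style segments into '[start - end] text' lines."""
    return "\n".join(
        f"[{format_timestamp(seg['start'])} - {format_timestamp(seg['end'])}] "
        f"{seg['text'].strip()}"
        for seg in segments
    )


def transcribe_with_whisper(audio_path, model_name="base"):
    """Transcribe a file locally and return (full_text, timestamped_lines)."""
    import whisper  # imported lazily so the helpers above work without it

    # The model weights are downloaded the first time a model is loaded
    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path)
    return result["text"], format_segments(result["segments"])
```

Larger models generally trade speed for accuracy, so it is reasonable to start small and move up only if the results are not good enough.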

Building a Desktop Tool with a File Picker

If you want to turn this script into a polished desktop application where anyone can browse for a file and click a button to transcribe it, the guide on creating graphical interfaces with Tkinter in Python covers exactly how to wrap backend scripts like this into a proper window with buttons and file dialogs. This is the difference between a developer tool and something you can share with non-technical colleagues.

Frequently Asked Questions

Can I convert MP3 files directly with SpeechRecognition?

Not natively. You need to use pydub to convert the MP3 to WAV format first, then pass the WAV file to the recognizer as shown in this guide.

Do I need to pay to use the Google Speech API through Python?

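The SpeechRecognition library calls Google's Web Speech endpoint with a generic default API key, so you do not have to supply your own. It is free, but it is meant for testing and personal use: audio duration and request volume are limited, and heavy usage calls for a paid service or an offline engine.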

How do I improve accuracy for technical terms or specialized vocabulary?

The free Google API offers limited control over vocabulary. For domain-specific terms, Whisper or a custom fine-tuned model will provide significantly better results.

Is there a file size limit for transcription?

Very large files can cause connection timeouts or memory errors. The best practice is to split long recordings into chunks of no more than 60 seconds before sending them to the recognizer.

Can I transcribe audio in real time from a microphone?

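Yes. SpeechRecognition provides the sr.Microphone class, which captures live audio from your input device so that each spoken phrase can be transcribed as soon as it ends. It depends on the PyAudio package, which must be installed separately, and takes just a few extra lines of code.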
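
As a rough sketch, capturing a single phrase could look like this. It assumes PyAudio is installed alongside SpeechRecognition and that a default input device is available. The function name listen_once and the timeout values are illustrative choices.

```python
def listen_once(language="en-US", timeout=5, phrase_time_limit=15):
    """Capture one phrase from the default microphone and return its text.

    Returns None when nothing was heard or the speech was not understood.
    Needs network access for the Google engine.
    """
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        # Sample the room for a second so listen() can tell speech from noise
        recognizer.adjust_for_ambient_noise(source, duration=1)
        try:
            audio = recognizer.listen(
                source, timeout=timeout, phrase_time_limit=phrase_time_limit
            )
        except sr.WaitTimeoutError:
            return None  # nobody started speaking within `timeout` seconds
    try:
        return recognizer.recognize_google(audio, language=language)
    except sr.UnknownValueError:
        return None


# Usage (speak after starting the script):
#
#   print(listen_once() or "Nothing recognized.")
```

Calling the function in a loop gives phrase-by-phrase transcription, which is usually close enough to real time for dictation-style tools.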

What is the difference between SpeechRecognition and OpenAI Whisper?

SpeechRecognition is an interface layer that connects to various engines, most of them online services such as Google’s. Whisper is a deep learning model that runs fully offline on your machine, often with higher accuracy for complex audio.

Why does my script return an UnknownValueError?

This typically occurs when the audio is too quiet, too noisy, or contains only silence during the analyzed segment. Try increasing the recording volume or improving the original audio quality before transcribing.

Do I need a powerful GPU to run Python transcription?

For the web API approach (Google), you only need an internet connection. For running Whisper locally, a GPU significantly speeds up processing, but the library works on CPU as well, just more slowly for longer recordings.

How do I save the transcribed text to a Word or PDF document?

Use python-docx to write the resulting text string into a Word document, or fpdf to generate a PDF. Both libraries accept a plain Python string as input and can be installed via pip in a few seconds.
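
For the Word case, a short sketch with python-docx could look like this. The helper names split_paragraphs and save_as_docx are hypothetical, and splitting on blank lines is an assumption about how the transcript is laid out.

```python
def split_paragraphs(text):
    """Split a transcript on blank lines and drop empty pieces."""
    return [block.strip() for block in text.split("\n\n") if block.strip()]


def save_as_docx(text, output_path, title="Transcript"):
    """Write a transcript to a Word file. Requires the python-docx package."""
    from docx import Document  # python-docx is imported under the name `docx`

    document = Document()
    document.add_heading(title, level=1)
    for paragraph in split_paragraphs(text):
        document.add_paragraph(paragraph)
    document.save(output_path)


# Usage:
#
#   save_as_docx(text, "transcription_result.docx")
```

A transcript without blank lines simply ends up as a single paragraph under the heading.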