Read Giant Files in Python Without Freezing

Leandro Hirt

Updated on: May 19, 2026

Published on: May 19, 2026

Reading time: 8 minutes

Dealing with large volumes of data is one of the most common challenges in programming and data science. When you try to open a text or CSV file that is several gigabytes in size, the computer slows to a crawl, the fan spins up, and Python terminates with a fatal error. This happens because most reading functions try to load the entire file content into RAM at once. If the file is larger than available memory, the operating system kills the process to prevent a complete system collapse. Knowing how to read giant files in Python without freezing is an essential skill for building robust and scalable systems.

The good news is that Python was designed to handle data streams intelligently. Instead of swallowing the entire file at once, you can digest it in small portions. This technique is called chunked processing and allows you to work with files of 10GB, 50GB, or even 100GB on a computer with limited memory. This guide covers everything from native language functions to powerful libraries like Pandas, ensuring your Python automation scripts work reliably at any scale.

The RAM Problem and the Dreaded MemoryError

When you use a simple read command, Python asks the operating system for a block of memory to store the data. If you have 8GB of RAM and try to open a 10GB file, the math does not work. The result is a MemoryError in Python, a signal that your hardware resources are exhausted. RAM is like a physical desk: if there are too many papers on it, you cannot perform any task.

Beyond the physical limitation, slowness occurs because the system starts using the “page file” or “swap”, which is a portion of the hard drive used as if it were RAM. Since a hard drive or SSD is much slower than dedicated memory, the script appears to freeze. To avoid this, you need to switch from “greedy” reading to a “lazy” reading approach, also known as lazy evaluation.
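
The gap between greedy and lazy reading can be measured directly. The sketch below is only an illustration: the file name sample.txt and the helpers peak_memory, greedy_read, and lazy_read are hypothetical, and the demo file is tiny compared with a real multi-gigabyte input. It uses the standard tracemalloc module to compare peak memory for read() against line-by-line iteration.

```python
import os
import tempfile
import tracemalloc


def peak_memory(func, path):
    """Run func(path) and return the peak traced allocation in bytes."""
    tracemalloc.start()
    try:
        func(path)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def greedy_read(path):
    # Greedy: the whole file becomes one big string
    with open(path, "r", encoding="utf-8") as f:
        data = f.read()
    return len(data)


def lazy_read(path):
    # Lazy: only the current line and a small internal buffer are held
    total = 0
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            total += len(line)
    return total


# Build a small demo file (roughly 2 MB) in a temporary directory
sample_path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(sample_path, "w", encoding="utf-8") as f:
    for i in range(50_000):
        f.write(f"line {i}: some example payload text\n")

greedy_peak = peak_memory(greedy_read, sample_path)
lazy_peak = peak_memory(lazy_read, sample_path)
print(f"read():       {greedy_peak:,} bytes at peak")
print(f"line by line: {lazy_peak:,} bytes at peak")
```

The greedy figure grows with the file, while the lazy figure should stay roughly flat no matter how large the input becomes.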

Reading Large Files Using Iterators

The simplest and most native solution is to treat the file object as an iterator. When you open a file in Python, you do not need to load everything. You can process it line by line, consuming a minimal amount of memory regardless of the total document size. This is the foundational pattern for applying Python programming logic to Big Data problems.

# Efficient line-by-line reading
filename = "large_data.txt"

with open(filename, "r", encoding="utf-8") as file:
    for line in file:
        # Process each line individually
        process_data(line)

In this code, only the current line (plus a small internal read buffer) lives in memory at any moment. If your file has one million lines, Python reads the first one, executes the function, and moves to the next, discarding the previous one. This is extremely useful for log files or database exports.

Using Generators for Efficient Processing

If you need to apply transformations to the data before using it, generators are your best friends. They allow you to create a sequence of data without building the entire sequence in memory at once. Instead of returning a complete list, a generator uses the yield keyword to deliver one item at a time. This is closely related to the concept discussed in the guide on list comprehension vs generator expression.

Imagine filtering only lines that contain the word “ERROR” in a 15GB log. Instead of creating a new list with those errors in memory, you create a generator that streams the filtered lines on demand, keeping memory consumption constant and low:

def filter_errors(file_path):
    with open(file_path, "r", encoding="utf-8") as f:
        for line in f:
            if "ERROR" in line:
                yield line.strip()

# Using the generator does not load extra memory
for error in filter_errors("system.log"):
    print(f"Alert detected: {error}")

Reading Giant CSV Files with Pandas and Chunksize

Pandas is the go-to library for data analysis, but it is known for being memory-hungry. By default, pd.read_csv() tries to load the entire file. However, Pandas offers the chunksize parameter, which transforms reading into an iterative process. If you set chunksize=10000, the command returns an object that reads 10,000 rows at a time, giving you a smaller DataFrame to work with on each pass:

import pandas as pd

filepath = "global_sales.csv"
chunk_size = 50_000

# Creating the iterator
reader = pd.read_csv(filepath, chunksize=chunk_size)

for chunk in reader:
    # 'chunk' is a smaller DataFrame available for operations
    total_sales = chunk['amount'].sum()
    print(f"Total processed in this chunk: {total_sales}")

This approach allows complex aggregations. You read a chunk, extract the needed information such as a sum or average, discard the chunk, and move to the next. At the end, you combine the partial results to get the global total.
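
As a rough sketch of that combine step, the example below accumulates a running sum and row count across chunks and derives the global mean at the end. The in-memory CSV, the column names, and the tiny chunksize are made up for the demo; a real job would point read_csv at the file on disk and use a much larger chunk.

```python
import io

import pandas as pd

# Tiny in-memory CSV standing in for a multi-gigabyte file (hypothetical data)
csv_text = "order_id,amount\n1,10.0\n2,20.5\n3,4.5\n4,15.0\n5,50.0\n"

grand_total = 0.0
row_count = 0

# chunksize=2 is only for the demo; real jobs would use tens of thousands of rows
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    grand_total += chunk["amount"].sum()   # partial sum for this chunk
    row_count += len(chunk)                # partial row count for this chunk

# The global mean comes from the combined sum and count,
# not from averaging the per-chunk means (chunks can differ in size)
global_mean = grand_total / row_count
print(f"Total: {grand_total:.2f}, rows: {row_count}, mean: {global_mean:.2f}")
```

Keeping the sum and the count separately matters because the last chunk is usually smaller than the others, so an average of per-chunk averages would be skewed.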

Alternative Libraries: Dask and Polars

When a file is so large that even chunked processing becomes slow, it is time to look at tools built specifically for parallel computation. Dask is a library that extends Pandas to work across multiple CPU cores or even clusters of machines. It does not load data immediately but creates a task graph, only executing the processing when you request the final result.

Another modern and extremely fast option is Polars. Written in Rust and maintained at pola.rs, Polars manages memory far more efficiently than traditional Pandas and supports native lazy reading, optimizing queries to read only the necessary columns from disk. For anyone building data analysis pipelines with Python, both tools are worth exploring as file sizes grow.

Handling Encoding Errors

When reading large files, encountering special characters that break the code is very common. “UnicodeDecodeError” errors occur when Python tries to read a file encoded in one format (like Latin-1) using another (like UTF-8). To prevent your script from stopping in the middle of a two-hour processing job, always handle these errors with try and except in Python. Also verify that the file path is correct to avoid a FileNotFoundError:

try:
    with open("data.csv", "r", encoding="utf-8") as f:
        for line in f:
            process(line)
except UnicodeDecodeError:
    print("Encoding error detected. Trying another format...")
    # Try latin-1 or ISO-8859-1 if UTF-8 fails
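
One possible way to implement that fallback is to check the encoding before the real processing starts, so the job does not fail halfway and reprocess lines. The sketch below is an assumption about how this could be done, not the only approach: the helper pick_encoding and the demo file are hypothetical, and the check costs one extra pass over the file.

```python
import os
import tempfile


def pick_encoding(path, candidates=("utf-8", "latin-1")):
    """Return the first candidate that decodes the whole file without errors.

    The file is streamed line by line, so memory use stays small.
    latin-1 accepts any byte sequence, so it should come last as a fallback.
    """
    for encoding in candidates:
        try:
            with open(path, "r", encoding=encoding) as f:
                for _ in f:
                    pass
            return encoding
        except UnicodeDecodeError:
            continue
    raise ValueError(f"None of {candidates} could decode {path}")


# Demo: a file saved as Latin-1, which is not valid UTF-8
demo_path = os.path.join(tempfile.mkdtemp(), "data.csv")
with open(demo_path, "w", encoding="latin-1") as f:
    f.write("city,country\nSão Paulo,Brasil\n")

encoding = pick_encoding(demo_path)
with open(demo_path, "r", encoding=encoding) as f:
    # A list is fine for this two-line demo; a real job would process each line in the loop
    rows = [line.rstrip("\n") for line in f]

print(encoding, rows)
```

Because latin-1 never raises a decode error, a successful result only means the bytes were readable, not that the text is correct, so it is worth spot-checking a few lines.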

Working with Compressed Files

Large files often arrive compressed (such as .zip or .gz) to save disk space. You do not need to extract the entire archive before reading. Python includes built-in modules like gzip and zipfile that let you read the contents directly from the compressed file, saving both time and disk space. When reading a gzip-compressed CSV with Pandas, the compression='gzip' parameter handles everything automatically.
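
A minimal sketch of reading a gzip file as a stream with the standard gzip module is shown below. The file name and log contents are invented for the demo; the part that carries over to real files is opening the archive in "rt" (text) mode and iterating over it like a normal file.

```python
import gzip
import os
import tempfile

# Create a small gzip file for the demo (hypothetical name and contents)
gz_path = os.path.join(tempfile.mkdtemp(), "events.log.gz")
with gzip.open(gz_path, "wt", encoding="utf-8") as f:
    for i in range(1000):
        level = "ERROR" if i % 100 == 0 else "INFO"
        f.write(f"{level} event {i}\n")

# "rt" opens the archive in text mode; lines are decompressed on the fly
error_count = 0
with gzip.open(gz_path, "rt", encoding="utf-8") as f:
    for line in f:
        if line.startswith("ERROR"):
            error_count += 1

print(f"ERROR lines: {error_count}")
```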

Streaming Large JSON Files

JSON files are harder to read line by line than TXT or CSV because they have a hierarchical structure with nested keys and arrays. Loading a 5GB file with json.load() parses the whole document into Python objects at once, which will likely exhaust memory and freeze your system. For these cases, use libraries like ijson (Iterative JSON), which allows you to navigate the JSON structure as a stream without loading the entire document into RAM. With ijson, you point to the path of the object you want to extract and the library reads through the file while ignoring the rest of the structure.

Optimizing Performance with Buffering

Python allows you to adjust the read buffer size. The buffer is a small area in memory that stores data temporarily before passing it to the processing logic. For extremely large files, increasing the buffer size can reduce how many times the system needs to access the physical disk, which speeds up reading. For performance-intensive scripts, combining this with Python multiprocessing can deliver substantial throughput improvements:

# Setting a manual 1MB buffer
with open("giant.txt", "r", buffering=1_048_576) as f:
    for line in f:
        # Processing here
        pass

Complete Project Code: Giant Log Analyzer

Here is the full unified script that applies iterators, error handling, and efficient counting. It is capable of reading a file of any size, counting occurrences of a specific word, and reporting progress without ever loading more than one line into memory at a time:

import time
import os

def process_giant_file(file_path, target_word):
    """Reads a file efficiently and counts occurrences of a word."""
    count = 0
    lines_processed = 0
    start_time = time.time()

    if not os.path.exists(file_path):
        print("Error: File not found. Check the path and try again.")
        return

    try:
        with open(file_path, "r", encoding="utf-8", errors="ignore") as f:
            for line in f:
                if target_word.lower() in line.lower():
                    count += 1
                lines_processed += 1

                # Progress feedback every 1 million lines
                if lines_processed % 1_000_000 == 0:
                    print(f"Processed {lines_processed:,} lines so far...")

        end_time = time.time()

        print("\n--- Analysis Results ---")
        print(f"Total lines read: {lines_processed:,}")
        print(f"Occurrences of '{target_word}': {count:,}")
        print(f"Total time: {end_time - start_time:.2f} seconds")

    except FileNotFoundError:
        print("Error: The file was not found.")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")

if __name__ == "__main__":
    FILE = "my_giant_log.txt"   # Replace with your actual file path
    WORD = "ERROR"
    process_giant_file(FILE, WORD)

Frequently Asked Questions

Why does Python freeze when opening a 4GB file if I have 8GB of RAM?

Python does not freeze just because of raw file size. When converting disk data into Python objects like strings or dictionaries, memory consumption can triple or quadruple, quickly exceeding the available RAM even when the raw file would theoretically fit.
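
The overhead can be estimated with sys.getsizeof. The sketch below uses made-up CSV-like data and compares the raw byte count with the approximate footprint of the same content parsed into lists of strings. The exact ratio depends on the data and the Python version, and short fields like these exaggerate the effect, so treat the number as an illustration rather than a rule.

```python
import sys

# One CSV-like record, repeated to stand in for raw file content (made-up data)
raw_line = "2026-05-19,widget,19.99,pending\n"
n_rows = 1000
raw_bytes = len(raw_line.encode("utf-8")) * n_rows

# The same content parsed into Python objects: a list of lists of strings
rows = [raw_line.strip().split(",") for _ in range(n_rows)]

# sys.getsizeof is shallow, so inner lists and strings are added up explicitly
object_bytes = sys.getsizeof(rows) + sum(
    sys.getsizeof(row) + sum(sys.getsizeof(field) for field in row)
    for row in rows
)

print(f"Raw text:       {raw_bytes:,} bytes")
print(f"Python objects: {object_bytes:,} bytes")
print(f"Ratio:          {object_bytes / raw_bytes:.1f}x")
```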

Is readlines() safe for large files?

No. The readlines() method reads every line and stores them all in a list in memory. For large files this can exhaust RAM and stall or crash the script. Use the for line in file: iterator pattern instead.

What is the difference between reading line by line and reading in chunks?

Reading line by line is ideal for simple text processing where you handle one record at a time. Reading in chunks comes in two forms: fixed-size byte blocks, which suit binary files, and Pandas row chunks (via chunksize), which suit statistical aggregation because each chunk is a small, fully functional DataFrame.
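
For the byte-block form of chunked reading, a common pattern is to call read() with a fixed size until it returns an empty result. The sketch below applies it to hashing a file; the helper name sha256_of_file, the 1 MB chunk size, and the demo file are illustrative choices, not requirements.

```python
import hashlib
import os
import tempfile
from functools import partial


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file of any size while reading at most chunk_size bytes per step."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read(chunk_size) until it returns b""
        for block in iter(partial(f.read, chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


# Demo on a small throwaway binary file
demo_path = os.path.join(tempfile.mkdtemp(), "blob.bin")
payload = os.urandom(3 * 1024 * 1024 + 123)  # deliberately not a multiple of the chunk size
with open(demo_path, "wb") as f:
    f.write(payload)

print(sha256_of_file(demo_path))
```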

How do I read only the last lines of a giant file?

You can use the seek() method to move the file pointer to the end and read only the last bytes, or use libraries like file_read_backwards which handle this efficiently without loading the full file.
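
A minimal sketch of the seek() approach is shown below, assuming a UTF-8 text file. The function name tail, the block size, and the demo file are hypothetical. It reads fixed-size blocks backwards from the end until it has collected enough newlines, so only the tail of the file is ever held in memory.

```python
import os
import tempfile


def tail(path, n_lines=10, block_size=4096):
    """Return the last n_lines of a file by reading blocks backwards from the end."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        position = f.tell()
        data = b""
        # Keep prepending blocks until more than n_lines newlines are collected,
        # which guarantees the first line we keep is complete
        while position > 0 and data.count(b"\n") <= n_lines:
            read_size = min(block_size, position)
            position -= read_size
            f.seek(position)
            data = f.read(read_size) + data
    lines = data.splitlines()
    return [line.decode("utf-8", errors="replace") for line in lines[-n_lines:]]


# Demo on a throwaway file (hypothetical name)
demo_path = os.path.join(tempfile.mkdtemp(), "big.log")
with open(demo_path, "w", encoding="utf-8", newline="\n") as f:
    for i in range(100_000):
        f.write(f"line {i}\n")

print(tail(demo_path, n_lines=3))
```

The file is opened in binary mode because seeking to arbitrary byte offsets is not reliable in text mode; the lines are decoded only after the tail has been isolated.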

Is line-by-line processing much slower than loading everything at once?

Generally no. By avoiding virtual memory (swap), the execution is often much faster and more stable than trying to load everything at once, because you completely eliminate the bottleneck of disk-backed memory paging.

Can I use multiprocessing to read a single file faster?

Yes, but it is complex because you must split the file into physical sections without breaking lines in the middle. For a single spinning hard drive, sequential optimized reading is usually sufficient. SSDs benefit more from parallel reads.
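
The tricky part, splitting on line boundaries, could look roughly like the sketch below. The helpers line_aligned_ranges and count_lines_in_range are hypothetical names. For simplicity the ranges are processed sequentially here; in a real script each (start, end) pair could be handed to a separate worker, for example through a multiprocessing pool.

```python
import os
import tempfile


def line_aligned_ranges(path, n_parts):
    """Split a file into up to n_parts (start, end) byte ranges that end on line boundaries."""
    size = os.path.getsize(path)
    boundaries = [0]
    with open(path, "rb") as f:
        for i in range(1, n_parts):
            target = size * i // n_parts
            if target <= boundaries[-1]:
                continue
            f.seek(target)
            f.readline()              # move forward to the end of the current line
            position = f.tell()
            if boundaries[-1] < position < size:
                boundaries.append(position)
    boundaries.append(size)
    return list(zip(boundaries[:-1], boundaries[1:]))


def count_lines_in_range(path, start, end):
    """Worker-style function: process only the bytes in [start, end)."""
    count = 0
    with open(path, "rb") as f:
        f.seek(start)
        while f.tell() < end:
            if not f.readline():
                break
            count += 1
    return count


# Demo on a throwaway file (hypothetical name)
demo_path = os.path.join(tempfile.mkdtemp(), "records.txt")
with open(demo_path, "w", encoding="utf-8", newline="\n") as f:
    for i in range(10_000):
        f.write(f"record {i}\n")

ranges = line_aligned_ranges(demo_path, n_parts=4)
counts = [count_lines_in_range(demo_path, start, end) for start, end in ranges]
print(ranges, counts, sum(counts))
```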

How do I find out the size of a file before opening it?

Use the Python os module: os.path.getsize('path') returns the size in bytes, letting your script decide upfront whether to use standard reading or the chunked strategy.
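
A small sketch of that decision is shown below. The 100 MB threshold and the function name count_lines are arbitrary assumptions for illustration; a sensible cut-off depends on how much RAM the machine has and how much the data expands once parsed.

```python
import os
import tempfile

# Hypothetical cut-off: files above this size are streamed instead of fully loaded
SIZE_THRESHOLD_BYTES = 100 * 1024 * 1024  # 100 MB


def count_lines(path, threshold=SIZE_THRESHOLD_BYTES):
    """Pick a reading strategy from the file size and return (strategy, line_count)."""
    if os.path.getsize(path) <= threshold:
        # Small file: loading everything at once is simple and fast
        with open(path, "r", encoding="utf-8") as f:
            return "full", len(f.readlines())
    # Large file: iterate lazily, one line at a time
    with open(path, "r", encoding="utf-8") as f:
        return "streamed", sum(1 for _ in f)


# Demo on a throwaway file (hypothetical name)
demo_path = os.path.join(tempfile.mkdtemp(), "notes.txt")
with open(demo_path, "w", encoding="utf-8") as f:
    for i in range(500):
        f.write(f"note {i}\n")

print(count_lines(demo_path))                 # small file, default threshold
print(count_lines(demo_path, threshold=10))   # force the streaming branch
```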

When should I switch from Pandas to Dask or Polars?

Consider switching when Pandas with chunksize is still too slow, when you need to aggregate across multiple large files, or when you want to use multiple CPU cores automatically. Polars is also a strong choice when you need to process a single very large file as fast as possible on a single machine.