Extract Web Tables with Python & Pandas

Published on: May 11, 2026
Reading time: 10 minutes

Have you ever landed on a website full of useful data organized in tables, only to realize that copying it manually would take forever? Learning how to extract table data from websites with Python and Pandas is one of the most valuable skills for anyone who wants to work with Python for data analysis. Through a technique called web scraping, you can turn the visual content of a web page into a structured, usable format, like an Excel spreadsheet or a CSV file, in a matter of seconds.

Python has become the go-to language for this kind of task because of its simplicity and its ecosystem of powerful libraries. Pandas, specifically, includes a function called read_html() that automates almost the entire process of finding and capturing HTML tables. This guide covers everything from setting up your environment to cleaning the extracted data, so you can automate this kind of data collection in a reliable and professional way.

Why Use Pandas to Extract Tables from Websites?

Several libraries exist for web scraping, including BeautifulSoup and Selenium. However, Pandas stands out when the goal is specifically to capture tabular data. The main advantage is implementation speed. With other tools, you would need to manually map each row (<tr>) and each cell (<td>) in the HTML structure. Pandas reads the raw page source and returns a ready-to-use list of DataFrames.

According to the official Pandas documentation, the HTML reading function uses auxiliary libraries like LXML to parse the website structure. This means that beyond the extraction itself, you already have access to the tools needed to filter, sort, and save the data in multiple formats. It is the ideal choice for anyone who wants to automate tasks with Python without writing hundreds of lines of boilerplate code.

Setting Up Your Development Environment

To get started, make sure the right tools are installed. Beyond Python itself, you need Pandas and a few supporting libraries that help Python parse HTML. If you are working in an isolated environment, remember to activate your Python virtual environment before installing to keep your project dependencies organized.

Installing the Required Libraries

Open your terminal or command prompt and run the following command to install Pandas along with its HTML parsers:

Bash
pip install pandas lxml html5lib beautifulsoup4

These libraries work together behind the scenes. lxml, html5lib, and BeautifulSoup are responsible for reading the raw page structure, while Pandas organizes everything into readable DataFrames. Having all three parsers installed gives Pandas the flexibility to fall back on an alternative if one fails on a particular site.
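If you ever need to control which parser is used, read_html exposes a flavor parameter that accepts "lxml", "html5lib", or "bs4". Here is a minimal sketch, using a placeholder URL:

Python
import pandas as pd

# Explicitly request the BeautifulSoup backend instead of letting
# Pandas choose a parser on its own; the URL is a placeholder
tables = pd.read_html("https://example.com/data.html", flavor="bs4")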

Understanding the Pandas read_html Function

The pd.read_html() function is the heart of this tutorial. It accepts a URL, a local file, or a string containing raw HTML. When called, it searches for all <table> tags in the content and converts them into a list of DataFrame objects. A DataFrame is essentially Python’s version of a database table or a spreadsheet.

One important detail: a single web page often contains multiple tables. Navigation menus, footers, and sidebar comparisons are all frequently built with HTML table tags, not just the main data table you are looking for. For that reason, read_html() always returns a list, even if it finds only one table. You access individual tables using index notation, like tables[0] for the first one found.
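To see the list behavior in isolation, you can feed read_html a small hand-written HTML string instead of a URL. This is a minimal sketch; note that recent Pandas versions expect literal HTML to be wrapped in a StringIO object:

Python
from io import StringIO
import pandas as pd

html = """
<table>
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>Tokyo</td><td>37400068</td></tr>
  <tr><td>Delhi</td><td>28514000</td></tr>
</table>
"""

# read_html always returns a list, even when only one table is found
tables = pd.read_html(StringIO(html))
print(len(tables))  # 1
print(tables[0])    # the single table as a DataFrame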

Extracting Table Data in Practice

Now that you understand the theory, it is time to put it into practice. Imagine you want to extract a currency exchange rate table from a financial news site, or demographic data from Wikipedia. The workflow follows a logical sequence: import the library, define the URL, and capture the data.

Loading the Page and Getting the Table List

In this first step, you load all structured data that Pandas can find at the given URL. Notice how short the code is:

Python
import pandas as pd

# Define the target URL
url = "https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)"

# Pandas reads all tables on the page
tables = pd.read_html(url)

# Check how many tables were found
print(f"Total tables found: {len(tables)}")

At this point, Python has already done the heavy lifting. All HTML tags have been stripped and only the organized text remains. This is significantly simpler than using the Requests library alone and manually parsing the raw HTML character by character.

Identifying the Right Table

Because the page may contain many tables, you need to preview each one to find the data you actually want. A quick way to do this is to display the first few rows of each item in the list:

Python
# Preview the first table found
df_population = tables[0]
print(df_population.head())

If the first table is not the right one, try tables[1], tables[2], and so on. For pages with many tables, you can use a for loop in Python to print the column headers of each one and visually identify which contains the data you need.
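For example, a short loop like this one prints the position and column headers of every table in the list so you can spot the right one at a glance:

Python
# Print the index and column headers of every table found
for i, table in enumerate(tables):
    print(f"Table {i}: {table.columns.tolist()}")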

Cleaning and Processing the Extracted Data

Data extracted from the web rarely comes in perfect shape. It may contain special characters, null values, or confusing column names with footnote references. After learning how to extract the data, the next crucial step is cleaning it so it is ready for reports, charts, or further analysis.

Selecting and Renaming Columns

Wikipedia tables often include columns for rank, references, or footnotes that are not useful for analysis. You can remove them easily using Pandas column selection or the drop method. It is also common for numbers to arrive as text strings with commas or spaces that Python does not recognize as numeric values; a conversion example appears after the warning below:

Python
# Select only the columns you need
df_clean = df_population[['Country / Area', 'Population']].copy()

# Rename columns for easier access
df_clean.columns = ['Country', 'Population']

# Display the clean result
print(df_clean.head())

Watch out for column name mismatches. If you try to select a column name that does not exactly match what Pandas found, you will trigger a KeyError. Always print the column names first with df.columns before trying to filter.
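Putting both tips together, the sketch below inspects the real column names, drops an unwanted column, and converts the Population values from text to numbers. The column name 'Notes' is purely illustrative; use whatever names df.columns actually reports for your page:

Python
# Inspect the real column names before filtering
print(df_population.columns.tolist())

# Drop a column you do not need ('Notes' is an illustrative name;
# errors='ignore' keeps this safe if the column does not exist)
df_clean = df_clean.drop(columns=['Notes'], errors='ignore')

# Strip thousand separators and convert to a numeric type;
# errors='coerce' turns unparseable cells into NaN instead of crashing
df_clean['Population'] = pd.to_numeric(
    df_clean['Population'].astype(str).str.replace(',', '', regex=False),
    errors='coerce'
)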

Handling Blocked Sites and Dynamic Pages

Not all websites allow scripts to access their data freely. Some servers return a “403 Forbidden” error because they detect that the request did not come from a real browser. The standard workaround is to set a User-Agent header that simulates a browser like Chrome or Firefox:

Python
from io import StringIO

import requests

# Set a header to simulate a real browser
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
response = requests.get(url, headers=headers)

# Pass the response content to Pandas; recent Pandas versions expect
# literal HTML to be wrapped in StringIO
tables = pd.read_html(StringIO(response.text))

For more complex cases where a table only appears after clicking a button, logging in, or waiting for JavaScript to load, Pandas alone is not enough. In those situations, you would need a tool that controls the browser directly, like Selenium, to render the page first and then pass the resulting HTML to Pandas for parsing.
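As a rough sketch of that workflow, assuming you have Selenium installed (pip install selenium) and a Chrome browser available, you can let the browser render the page and then hand the final HTML to Pandas:

Python
from io import StringIO

import pandas as pd
from selenium import webdriver

# Launch a real browser so JavaScript-generated content can render
driver = webdriver.Chrome()
driver.get("https://example.com/dynamic-table")  # placeholder URL

# For slow pages you may need an explicit wait before reading the source
html = driver.page_source
driver.quit()

# Hand the fully rendered markup to Pandas as usual
tables = pd.read_html(StringIO(html))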

Exporting the Data to CSV and Excel

After extracting and cleaning the data, you will almost certainly want to save it. Pandas makes this extremely simple. You can convert your DataFrame to a spreadsheet file in a single line of code, ready to be opened by colleagues who do not write code:

Python
# Save as CSV
df_clean.to_csv("population_data.csv", index=False, encoding='utf-8')

# Save as Excel (requires the openpyxl package: pip install openpyxl)
df_clean.to_excel("population_data.xlsx", index=False)

File handling is a core part of the Python ecosystem. If you want to go deeper into working with spreadsheets programmatically, the guide on Python and Excel covers editing .xlsx files with full control over formatting, formulas, and sheet structure.

Complete Project Code

Here is the full unified script that performs the extraction, data cleaning, and export in one run. This version uses Wikipedia’s list of countries by population as an example. Replace the URL with any page containing an HTML table to adapt it to your own use case:

Python
from io import StringIO

import pandas as pd
import requests

def extract_population_data():
    # 1. Initial Settings
    url = "https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)"
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

    try:
        # 2. Fetch the page and read HTML
        response = requests.get(url, headers=headers)
        tables = pd.read_html(StringIO(response.text))

        # 3. Select the main data table (usually the first one)
        df = tables[0]

        # 4. Data Cleaning
        # Print columns first to identify the correct names
        print("Available columns:", df.columns.tolist())

        # Select columns by position if names are complex
        # (adjust the indices to match the columns printed above)
        df_clean = df.iloc[:, [0, 1]].copy()
        df_clean.columns = ['Country', 'Population']

        # Remove non-numeric characters from the Population column
        df_clean['Population'] = df_clean['Population'].astype(str).str.replace(r'\D', '', regex=True)

        # 5. Export to CSV
        df_clean.to_csv("countries_population.csv", index=False)

        print("Extraction complete! File 'countries_population.csv' created.")
        print(df_clean.head(10))

    except Exception as e:
        print(f"An error occurred during extraction: {e}")

if __name__ == "__main__":
    extract_population_data()

Mastering data extraction is the first step toward becoming a data analyst or an automation-focused developer. With Pandas, what once took hours of manual copying can be done with reliable, reusable scripts that run in seconds. Keep practicing with different websites and explore how combining this skill with data analysis with Pandas and NumPy unlocks an entirely new level of productivity.

Frequently Asked Questions

Can Pandas read tables from any website?

Pandas can read tables built with HTML <table> tags. It does not work for tables constructed entirely with <div> elements or canvas-based layouts without combining it with other libraries like Selenium or BeautifulSoup.

Why do I get an “ImportError: lxml not found” error?

This error means Pandas cannot find the LXML parser it needs to process HTML. Run pip install lxml in your terminal to fix it.

How do I extract only a specific table when the page has many?

Use the match parameter: pd.read_html(url, match="Table Header Text"). Pandas will only return tables that contain that specific text, making it much easier to target exactly what you need.

Does Pandas work with websites that require login?

Not directly. For login-protected pages, use the Requests library to handle authentication first, or use Selenium to navigate to the logged-in page and pass the rendered HTML to Pandas.
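As a minimal sketch of the Requests approach, a Session object can post credentials to the login endpoint so the authentication cookie carries over to subsequent requests. The URL and form field names below are entirely hypothetical; inspect your target site's login form to find the real ones:

Python
from io import StringIO

import pandas as pd
import requests

session = requests.Session()

# Hypothetical login endpoint and form fields
session.post(
    "https://example.com/login",
    data={"username": "your_user", "password": "your_password"},
)

# The session now carries the auth cookie, so protected pages load
response = session.get("https://example.com/members/data")
tables = pd.read_html(StringIO(response.text))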

How do I convert numbers in the table to proper numeric format?

Numbers often arrive as strings with commas or spaces. Use df['column'].str.replace(',', '').astype(int) to remove thousand separators and convert to integer, or use pd.to_numeric(df['column'], errors='coerce') for a more robust conversion.

Can I extract tables from HTML files saved on my computer?

Yes. Instead of a URL, pass the local file path: pd.read_html("path/to/your/file.html"). This is useful for processing downloaded reports or archived data.

Is read_html better than BeautifulSoup?

For simple tables, read_html is significantly faster and requires far less code. BeautifulSoup is the better choice when you need to extract other elements beyond tables, such as links, images, or scattered text nodes.

Is it legal to extract data from websites with scraping?

It depends on the website’s terms of service and your intended use. Always check the /robots.txt file of the domain and avoid sending high-frequency requests that could overload the server. Public data for personal or research use is generally lower risk, but commercial use of scraped data requires careful review of the site’s legal terms.
