How to Convert HTML Tables into CSV Files in Python?

HTML is the standard language for creating web pages, and it is used to structure the text, images, and other elements on a page. HTML can present text data in various forms, such as links, plain text, tables, and lists.

Let's say that you have an HTML file, or you want to grab an HTML web page from the internet, and you wish to extract its table data for analysis.

In this Python tutorial, I will walk you through a Python program that extracts table data from HTML web pages and saves it locally as CSV files. But before we get to the main topic, let's discuss and install the libraries that we will be using.

Required Libraries

Python requests Library

We will be using the requests library to send an HTTP GET request to the web page and get the HTML text in response. To install the requests library, run the following pip command on your terminal or command prompt:

pip install requests

Python beautifulsoup4 Library

The beautifulsoup4 library is an open-source Python library for extracting data from HTML and XML. We will use it to extract table data from an HTML page using HTML tag names like <table>, <tr>, <th>, and <td>. You can install this library using the following pip command:

pip install beautifulsoup4

Python CSV Module

csv is one of the modules in the Python Standard Library, so you do not need to install it separately. As its name suggests (CSV stands for Comma-Separated Values), we can use this module to read and write CSV files.
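As a quick illustration of the module, here is a minimal sketch that writes a few rows with csv.writer and reads them back with csv.reader. The file name sample.csv and the sample rows are made up for this example, and the file is written to a temporary directory so nothing is left behind.

```python
import csv
import os
import tempfile

# Hypothetical sample rows; the first row acts as the header.
rows = [
    ["Control", "Introduced in"],
    ["GridView", "Asp.Net 2.0"],
    ["Repeater", "Asp.Net 1.0"],
]

# Write into a temporary directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.csv")

    # newline="" lets the csv module control the line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as file:
        writer = csv.writer(file)
        writer.writerows(rows)

    # Read the same file back into a list of rows.
    with open(path, newline="", encoding="utf-8") as file:
        rows_read = list(csv.reader(file))

print(rows_read)
```

The rows read back should match the rows that were written, which is a handy sanity check when you start writing your own CSV files.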

Note: In this tutorial, we will be extracting data from HTML tables. Here, we assume that you have some knowledge about the usage of the HTML <table> tag along with other tags like <th>, <tr>, and <td>.

Convert HTML Tables into CSV Files in Python

Let's begin with importing the modules for our Python program.

import requests
from bs4 import BeautifulSoup
import csv

Now define a Python variable url for the web page URL, send the request, and store the HTML of the response.

url= r"https://www.techgeekbuzz.com/difference-between-repeater-datalist-and-gridview/"

response = requests.get(url) #send get request
html_page = response.text #fetch HTML page

The get() function sends a GET request to the URL, and the text property of the response holds the HTML of the page as a string. Next, we parse html_page with the BeautifulSoup class so that we can search the parsed page with its find_all() method.
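The request above assumes that the server answers successfully. As a hedged sketch, the fetch step could be wrapped in a small helper that fails early on HTTP errors. The helper name fetch_html and the 10-second timeout are assumptions made for this example and are not part of the original program.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML text of `url`, raising an exception on HTTP errors.

    `fetch_html` is a hypothetical helper; the default timeout of
    10 seconds is an arbitrary choice for this sketch.
    """
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses.
    response.raise_for_status()
    return response.text
```

With this helper, the fetch step becomes html_page = fetch_html(url), and a missing page or a server error stops the script before it tries to parse an error page.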

page = BeautifulSoup(html_page, 'html.parser')    #parse html_page

Since this tutorial only deals with table data, let's extract all the tables present in page.

tables = page.find_all("table")  #find tables

#print the total tables found
print(f"Total {len(tables)} Table(s) Found on page {url}")

find_all("table") returns a list of all the <table> tags present in page. Now we will loop through every table in the tables list, create a new CSV file for each one, and write the table data to it.
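If you already have the HTML on hand, for example read from a local file, the same parsing step works without any request. The sketch below parses a small hand-written HTML string (made up for this example) and counts its tables.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page with two tables (made up for this example).
html_page = """
<html><body>
<table><tr><th>Name</th></tr><tr><td>GridView</td></tr></table>
<p>Some text between the tables.</p>
<table><tr><th>Name</th></tr><tr><td>Repeater</td></tr></table>
</body></html>
"""

page = BeautifulSoup(html_page, "html.parser")
tables = page.find_all("table")

# Number of <table> tags found on the page.
table_count = len(tables)
# Text of the first <td> cell in each table.
first_cells = [table.find("td").text.strip() for table in tables]

print(table_count)
print(first_cells)
```

For a file on disk, you could read its contents into html_page with open() and pass the string to BeautifulSoup in the same way.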

for index, table in enumerate(tables):
    print(f"\n-----------------------Table{index+1}-----------------------------------------\n")

    table_rows = table.find_all("tr")

    #open csv file in write mode
    with open(f"Table{index+1}.csv", "w", newline="") as file:
        
        #initialize csv writer object
        writer = csv.writer(file)

        for row in table_rows:
            row_data= []

            #<th> data
            if row.find_all("th"):
                table_headings = row.find_all("th")
                for th in table_headings:
                    row_data.append(th.text.strip())
            #<td> data
            else:
                table_data = row.find_all("td")
                for td in table_data:
                    row_data.append(td.text.strip())
            #write data in csv file
            writer.writerow(row_data)
            
            print(",".join(row_data))
    print("--------------------------------------------------------\n")
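The loop above takes either the <th> cells or the <td> cells of a row, so a row that mixes both kinds would lose its <td> cells. As an optional variation, and not the method used in this tutorial, the sketch below collects both tag types in document order. The helper name table_to_rows and the sample table are assumptions made for this example.

```python
from bs4 import BeautifulSoup


def table_to_rows(table):
    """Return the table as a list of rows, each a list of cell strings."""
    rows = []
    for tr in table.find_all("tr"):
        # Passing a list matches either tag, in document order.
        cells = tr.find_all(["th", "td"])
        rows.append([cell.text.strip() for cell in cells])
    return rows


# Made-up table whose second row mixes a <th> and a <td> cell.
sample_html = """
<table>
  <tr><th>Feature</th><th>GridView</th></tr>
  <tr><th>Paging</th><td>Built-in</td></tr>
</table>
"""

table = BeautifulSoup(sample_html, "html.parser").find("table")
rows = table_to_rows(table)
print(rows)
```

The returned list of rows can be passed directly to writer.writerows() of a csv.writer object.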

Now put all the code together and execute.

Python Program to Convert Web Page Tables to CSV files

import requests
from bs4 import BeautifulSoup
import csv

url= r"https://www.techgeekbuzz.com/difference-between-repeater-datalist-and-gridview/"

response = requests.get(url)
html_page = response.text

soup = BeautifulSoup(html_page, 'html.parser')
#find <table>
tables = soup.find_all("table")
print(f"Total {len(tables)} Table(s) Found on page {url}")

for index, table in enumerate(tables):
    print(f"\n-----------------------Table{index+1}-----------------------------------------\n")
    
    #find <tr>
    table_rows = table.find_all("tr")

    #open csv file in write mode
    with open(f"Table{index+1}.csv", "w", newline="") as file:

        #initialize csv writer object
        writer = csv.writer(file)

        for row in table_rows:
            row_data= []

            #<th> data
            if row.find_all("th"):
                table_headings = row.find_all("th")
                for th in table_headings:
                    row_data.append(th.text.strip())
            #<td> data
            else:
                table_data = row.find_all("td")
                for td in table_data:
                    row_data.append(td.text.strip())
            #write data in csv file
            writer.writerow(row_data)

            print(",".join(row_data))
    print("--------------------------------------------------------\n")

Output

Total 3 Table(s) Found on page https://www.techgeekbuzz.com/difference-between-repeater-datalist-and-gridview/


-----------------------Table2-----------------------------------------

GridView,Repeater
Debut
GridView was introduced in Asp.Net 2.0,The Repeater was introduced in Asp.Net 1.0.
Columns generation
It automatically generates columns using the data source.,It cannot generate columns.
Row selection
It can select a row from the data source.,It cannot select rows.
Content Editing
Using GridView control, we can edit object content.,It does not support content editing.
In-built methods
It comes with built-in paging and sorting methods.,No built-in support for Built-in paging and sorting developer has to code.
Auto formatting and styling
In GridView we get inbuilt auto format and styling feature.,It does not support these features.
Performance
It is slower than Repeater.,Because of its lightweight, it is faster as compared to GridView.
--------------------------------------------------------

-----------------------Table3-----------------------------------------

GridView,DataList
Debut
GridView was introduced in Asp.Net 2.0 version.,DataList was introduced in Asp.Net 1.0 version.
In-built methods
It comes with built-in paging and sorting methods.,No built-in support for Built-in paging and sorting, the developer has to code for these features.
Build-in CRUD operation
It comes with built-in Update and Deletes Operations, so the developer does not need to write code for simple operations.,If developer use DataList then he/she has to write code for the Update and Delete operations.
Auto formatting and styling
In GridView we get inbuilt auto format and styling feature.,It does not support these features.
Customizable Row
We do not get Customizable row separator feature in GridView.,DataList has SeparatorTemplate for customizable row separator.
Performance:
Its performance is the lowest as compared to Repeater and DataList.,It is faster than the GridView.
--------------------------------------------------------
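Note that the lines printed to the console are joined with a plain comma, so a cell that itself contains a comma (as one cell in Table3 does) cannot be told apart from two cells. The CSV files do not have this problem, because csv.writer quotes such cells. The sketch below shows the difference on a shortened, illustrative row.

```python
import csv
import io

# One row with two cells; the second cell contains a comma.
# The texts are shortened for illustration.
row_data = [
    "Built-in paging and sorting.",
    "No built-in support, the developer has to code.",
]

# What the tutorial prints to the console: a plain join, no quoting.
printed_line = ",".join(row_data)

# What csv.writer produces: the cell with a comma is wrapped in quotes.
buffer = io.StringIO()
csv.writer(buffer).writerow(row_data)
csv_line = buffer.getvalue()

# Reading the CSV text back recovers the original two cells.
cells_from_csv = next(csv.reader(io.StringIO(csv_line)))
# Splitting the printed line on commas yields three pieces instead.
pieces_from_print = printed_line.split(",")

print(len(cells_from_csv), len(pieces_from_print))
```

So treat the console output as a preview only, and use the saved .csv files for any further processing.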

When you execute the above program, it saves one .csv file per table (Table1.csv, Table2.csv, and so on) in the current working directory, which is usually the directory from which you run the Python script.
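If you would rather collect the files in a folder of your choice, the writing step can be sketched with pathlib as below. The helper name save_table, the folder name tables, and the sample row are assumptions made for this example. The demonstration writes into a temporary directory.

```python
import csv
import tempfile
from pathlib import Path


def save_table(rows, index, out_dir):
    """Write `rows` to <out_dir>/Table<index>.csv and return the path."""
    out_dir = Path(out_dir)
    # Create the output folder (and any parents) if it does not exist.
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"Table{index}.csv"
    # An explicit encoding avoids platform-dependent defaults.
    with path.open("w", newline="", encoding="utf-8") as file:
        csv.writer(file).writerows(rows)
    return path


# Demonstration in a temporary directory with a made-up row.
with tempfile.TemporaryDirectory() as tmp_dir:
    saved = save_table([["GridView", "Repeater"]], 1, Path(tmp_dir) / "tables")
    saved_name = saved.name
    saved_text = saved.read_text(encoding="utf-8")

print(saved_name)
```

In the main program, you could call save_table(rows, index + 1, "tables") inside the loop, where rows is the list of row_data lists collected for one table.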

Conclusion

Here, we learned how to convert HTML tables to CSV files in Python. This tutorial is also a small demonstration of web scraping with Python. If you want to learn more about extracting data from web pages, you can read the official documentation of BeautifulSoup4.
