Let's say there is a webpage on the Internet with many email addresses, and you want to write a Python script that can extract all of them. This email extractor is a small application of Python web scraping, where we access data from the Internet.
Whenever we talk about web scraping with Python, the first library that comes to mind is requests, but in this tutorial we will not be using the Python requests library directly. Instead, we will use the requests-html library, which supports all the features of the requests library and more.
You might be wondering why to use the requests-html library if web scraping can be performed using requests. The main reason is that requests-html supports JavaScript.
On some websites, the data is rendered in the browser by JavaScript code, but when we request a webpage with the requests library, that JavaScript code does not execute. With requests-html, however, we can execute the JavaScript code of the response object.
Required Libraries and Dependencies
Alright, now let's discuss and install the libraries that we will be using to develop an email extractor in Python.
1) Python requests-html Library
The requests-html library is an open-source, HTML parsing Python library, and in this tutorial, we will be using it as an alternative to the Python requests library. To install the requests-html library for your Python environment, run the following pip install command on your terminal or command prompt:
pip install requests-html
2) Python beautifulsoup4 Library
Beautiful Soup is an open-source Python library that is used to extract or pull data from HTML and XML files. The script in this tutorial imports BeautifulSoup, so the package must be installed, although the email extraction itself is done with the parser built into requests-html and the re module. To install the beautifulsoup4 library for your Python environment, run the following pip install command:
pip install beautifulsoup4
3) Python re Module
The name of the Python re module is short for regular expression. It is part of the Python standard library, so it needs no installation, and it is used to match string patterns in text using regular expressions.
In this tutorial, we will extract emails from a webpage. An email is a specific sequence of characters, and by using the regular expression, we can grab only that text or string data that matches the specific sequence or pattern.
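As a minimal illustration of this idea, the sketch below runs re.findall() on a short string. The sample text and its addresses are invented for demonstration; the pattern is the same one used in the rest of this tutorial.

```python
import re

# Hypothetical sample text; the addresses are invented for illustration.
sample_text = "Contact alice@example.com or bob.smith@mail.example.org, not @nobody or plain text."

# Pattern from this tutorial: local part, "@", domain, ".", top-level domain.
pattern = r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+"

# findall() returns every non-overlapping match as a list of strings.
found = re.findall(pattern, sample_text)
print(found)
```

Only the two well-formed addresses are returned; "@nobody" is skipped because nothing precedes the @ sign and no dot-separated ending follows it.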
Random Email Generator
For this tutorial, we will be extracting emails from the https://www.randomlists.com/email-addresses URL, which generates random emails with every request. If you want, you can use any other webpage URL to extract emails.
How to Make an Email Extractor in Python?
Let's start with importing all the modules.
from requests_html import HTMLSession
import re
from bs4 import BeautifulSoup
Now set the url and pattern identifiers that represent the webpage URL and the regular expression pattern for the emails.
#page url
url = r"https://www.randomlists.com/email-addresses"
#regex pattern
pattern = r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+"
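To make the pattern easier to read, here is a sketch that spells out the same expression with re.VERBOSE so each part can carry a comment. The name verbose_pattern and the re.IGNORECASE flag are additions made for this illustration and are not part of the original script; without that flag, the pattern matches lowercase letters only.

```python
import re

# The tutorial's pattern, written with re.VERBOSE so each part can be commented.
# verbose_pattern is a hypothetical name introduced for this sketch.
verbose_pattern = re.compile(
    r"""
    [a-z0-9.\-+_]+   # local part: letters, digits, dot, hyphen, plus, underscore
    @                # the separator
    [a-z0-9.\-+_]+   # domain name, possibly with subdomains
    \.               # the dot before the top-level domain
    [a-z]+           # top-level domain such as com or net
    """,
    re.VERBOSE | re.IGNORECASE,  # IGNORECASE is an addition so uppercase letters match too
)

# Invented sample sentence containing a mixed-case address.
print(verbose_pattern.findall("Write to Jane.Doe@Example.COM today"))
```

This is a deliberately simple pattern. It is good enough for pulling typical addresses out of page text, but it is not a full validator of email syntax.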
Next, initialize the HTMLSession() object, which persists cookies and reuses the connection across requests.
#initialize the session
session = HTMLSession()
After initializing the session, let's send a GET request to the page URL.
#send the get request
response = session.get(url)
After sending the GET request, we get the response, or HTML data, from the server. Now, let's run all the JavaScript code of the response object using the html.render() method.
#simulate JS running code
response.html.render()
The first time you call render(), requests-html downloads a copy of the Chromium browser, which it uses to run the page's JavaScript. So do not worry if you see a download in progress during code execution.
The data you see on a webpage is generally placed inside the HTML <body> tag. So, let's grab the body tag from the response object.
#get body element
body = response.html.find("body")[0]
The find("body") method returns a list of <body> elements. Since an HTML page can have only one body, we use the [0] index to grab the first result. Next, let's extract the list of emails from the body text and print all the emails.
#extract emails
emails = re.findall(pattern, body.text)
for index, email in enumerate(emails, start=1):
    print(index, "---->", email)
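Note that re.findall() returns every match, so an address that appears more than once on a page is listed more than once. If that matters for your use case, one possible way to drop repeats while keeping the original order is sketched below. The emails list here is an invented stand-in for the real result, and unique_emails is a name introduced for this example.

```python
# Hypothetical list standing in for the result of re.findall(); addresses are invented.
emails = ["amy@example.com", "raj@example.org", "amy@example.com", "lee@example.net"]

# dict.fromkeys() keeps the first occurrence of each key and preserves insertion order.
unique_emails = list(dict.fromkeys(emails))

for index, email in enumerate(unique_emails, start=1):
    print(index, "---->", email)
```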
Now let us put all the code together and execute it.
Python Program to Extract Emails from a Webpage
from requests_html import HTMLSession
import re
from bs4 import BeautifulSoup  #imported as in the steps above; not used by the code below
#page url
url = r"https://www.randomlists.com/email-addresses"
#regex pattern
pattern = r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+"
#initialize the session
session = HTMLSession()
#send the get request
response = session.get(url)
#simulate JS running code
response.html.render()
#get body element
body = response.html.find("body")[0]
#extract emails
emails = re.findall(pattern, body.text)
for index, email in enumerate(emails, start=1):
    print(index, "---->", email)
Output
1 ----> horrocks@yahoo.com
2 ----> leocharre@live.com
3 ----> howler@gmail.com
4 ----> naoya@me.com
5 ----> gfxguy@gmail.com
6 ----> kalpol@outlook.com
7 ----> scato@hotmail.com
8 ----> tkrotchko@live.com
9 ----> citizenl@aol.com
10 ----> sagal@mac.com
11 ----> afeldspar@sbcglobal.net
12 ----> maneesh@gmail.com
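If you want to reuse this logic, the same steps can be wrapped in functions with some basic error handling. The following is only a sketch under stated assumptions: the names EMAIL_PATTERN, extract_emails, fetch_emails, and render_timeout are hypothetical and introduced here, and the 20-second render timeout is an arbitrary choice.

```python
import re

# Same pattern as in the program above, compiled once for reuse.
EMAIL_PATTERN = re.compile(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+")


def extract_emails(text):
    """Return all substrings of text that match EMAIL_PATTERN."""
    return EMAIL_PATTERN.findall(text)


def fetch_emails(url, render_timeout=20):
    """Fetch url, run its JavaScript, and return the emails found in <body>."""
    # Imported here so extract_emails() can be used without requests-html installed.
    from requests_html import HTMLSession

    session = HTMLSession()
    try:
        response = session.get(url)
        response.raise_for_status()  # stop early on 4xx/5xx responses
        response.html.render(timeout=render_timeout)
        # first=True returns a single element (or None) instead of a list.
        body = response.html.find("body", first=True)
        return extract_emails(body.text) if body is not None else []
    finally:
        session.close()  # also closes the headless browser if one was started
```

Calling fetch_emails("https://www.randomlists.com/email-addresses") should then return a list like the output above, although the addresses differ on every request.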
Conclusion
In this Python tutorial, we learned how to make an email extractor in Python that can extract emails from a webpage using the requests-html and re libraries. You can also extract emails from a text file using Python file handling methods and the same regular expression we used above.
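As a rough sketch of the text file variant, the example below reads a file and applies the same pattern to its contents. The function name extract_emails_from_file, the file name contacts.txt, and the file contents are all invented for this demonstration, which writes a temporary file so that it can run on its own.

```python
import re
import tempfile
from pathlib import Path

# Same pattern as in the tutorial.
pattern = r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+"


def extract_emails_from_file(path):
    """Read a text file and return every email-like string in it."""
    text = Path(path).read_text(encoding="utf-8")
    return re.findall(pattern, text)


# Demonstration with a throwaway file; the contents are invented.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_file = Path(tmp_dir) / "contacts.txt"
    sample_file.write_text(
        "sales: sales@example.com\nhelp desk: help@example.org\n",
        encoding="utf-8",
    )
    file_emails = extract_emails_from_file(sample_file)

print(file_emails)
```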
We hope you like this article, and if you have any queries or suggestions related to the above article or program, please let us know by commenting below.
Thanks for reading!