- Required Libraries
- 1) Python requests Library
- 2) Python beautifulsoup4 Library
- How to Extract CSS Files from Web Pages in Python?
- A Python Program to Extract Internal and External CSS from a Webpage
- How to Extract JavaScript Files from Web Pages in Python?
- A Python Program to Extract Internal and External JavaScript from a Webpage
- Conclusion
You will often need to extract all the CSS and JavaScript files from a webpage so that you can list all the external and internal styling and scripting applied to it.
In this tutorial, we will walk you through code that will extract JavaScript and CSS files from web pages in Python.
A webpage is a collection of HTML, CSS, and JavaScript code. When a webpage is loaded in the browser, the browser parses the complete HTML file along with CSS and JavaScript files and executes them.
The webpage can have multiple CSS and JavaScript files, and the more files an HTML page has, the more time the browser will take to load the complete webpage. Before we can extract JavaScript and CSS files from web pages in Python, we need to install the required libraries.
The following section details how to do so.
Required Libraries
1) Python requests Library
requests is the de facto Python library for making HTTP requests. We will use it in this tutorial to send a GET request to the webpage URL and fetch its HTML code.
To install requests in your Python environment, run the following pip install command in your terminal or command prompt:
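```
pip install requests
```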
2) Python beautifulsoup4 Library
beautifulsoup4 (Beautiful Soup 4) is an open-source Python library. It is generally used to pull data out of HTML and XML files. We will use it in our Python program to extract data from the webpage's HTML.
You can install the beautifulsoup4 library for your Python environment using the following pip install command:
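```
pip install beautifulsoup4
```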
After installing both libraries, open your preferred Python IDE or text editor and code along.
How to Extract CSS Files from Web Pages in Python?
In an HTML file, CSS can be embedded in two ways: internal CSS and external CSS. Let's write a Python program that extracts both the internal and the external CSS from an HTML page. We will start by importing the modules:
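```python
import requests
from bs4 import BeautifulSoup
```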
Now, we will define a Python user-defined function, `page_Css(html_page)`, that accepts `html_page` as an argument and extracts all the internal CSS `<style>` code and the external CSS `<link rel="stylesheet">` href links.
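A minimal sketch of such a function, based on the statements explained in the list below, might look like this:

```python
def page_Css(html_page):
    # External CSS: all <link rel="stylesheet"> tags
    external_css = html_page.find_all('link', rel="stylesheet")

    # Internal CSS: all <style> tags
    internal_css = html_page.find_all('style')

    # Write the internal CSS code to internal_css.css
    with open("internal_css.css", "w") as file:
        for css in internal_css:
            file.write(css.get_text())

    # Write the external CSS href links to external_css.txt and print them
    with open("external_css.txt", "w") as file:
        for css in external_css:
            href = css.get("href")
            if href:
                file.write(f"{href}\n")
                print(href)
```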
- The `find_all('link', rel="stylesheet")` statement will return a list of all the external CSS `<link>` tags.
- The `find_all('style')` method will return a list of all the internal `<style>` tags from the `html_page`.
- In the `with open("internal_css.css", "w") as file:` block, we write all the internal CSS code to the `internal_css.css` file.
- In the `with open("external_css.txt", "w") as file:` block, we write the external CSS href links to the `external_css.txt` file.
After defining the function, let's send a GET request to the webpage URL and call the page_Css() function.
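Here is a sketch of that step; the URL is just a placeholder, so swap in the webpage you want to scrape:

```python
# Placeholder URL -- replace with the webpage you want to scrape
url = "https://www.example.com"

# Send a GET HTTP request to the URL
response = requests.get(url)

# Parse the HTML of the response
page_html = BeautifulSoup(response.text, 'html.parser')

# Extract and save the internal and external CSS
page_Css(page_html)
```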
The `requests.get(url)` function will send a GET HTTP request to the URL and return a response. The `BeautifulSoup()` constructor will parse the HTML of the response. Now, put all the code together and execute.
A Python Program to Extract Internal and External CSS from a Webpage
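Below is a sketch of the complete program, combining the snippets above; the URL is a placeholder:

```python
import requests
from bs4 import BeautifulSoup


def page_Css(html_page):
    # External CSS: all <link rel="stylesheet"> tags
    external_css = html_page.find_all('link', rel="stylesheet")

    # Internal CSS: all <style> tags
    internal_css = html_page.find_all('style')

    # Write the internal CSS code to internal_css.css
    with open("internal_css.css", "w") as file:
        for css in internal_css:
            file.write(css.get_text())

    # Write the external CSS href links to external_css.txt and print them
    with open("external_css.txt", "w") as file:
        for css in external_css:
            href = css.get("href")
            if href:
                file.write(f"{href}\n")
                print(href)


# Placeholder URL -- replace with the webpage you want to scrape
url = "https://www.example.com"
response = requests.get(url)
page_html = BeautifulSoup(response.text, 'html.parser')
page_Css(page_html)
```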
Output
In the program, we have only printed the links for the external CSS. After executing the above program, you can check the directory where your Python script is located. There, you will find two new files, `internal_css.css` and `external_css.txt`, which contain the internal CSS code and external CSS links, respectively.
Next, let's write a similar Python program that will extract JavaScript from the webpage.
How to Extract JavaScript Files from Web Pages in Python?
Again, we will start by importing the required modules.
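```python
import requests
from bs4 import BeautifulSoup
```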
Now, let's add a user-defined function, `page_javaScript(page_html)`, that will extract the internal and external JavaScript from the HTML webpage.
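A minimal sketch of this function, following the statements explained in the list below, might look like this:

```python
def page_javaScript(page_html):
    # All <script> tags present in the page
    all_script_tags = page_html.find_all("script")

    # External scripts have a src attribute; internal scripts do not
    external_scripts = list(filter(lambda script: script.has_attr("src"), all_script_tags))
    internal_scripts = list(filter(lambda script: not script.has_attr("src"), all_script_tags))

    # Write the internal JavaScript code to internal_script.js
    with open("internal_script.js", "w") as file:
        for script in internal_scripts:
            file.write(script.get_text())

    # Write the external JavaScript source links to external_script.txt and print them
    with open("external_script.txt", "w") as file:
        for script in external_scripts:
            file.write(f"{script['src']}\n")
            print(script["src"])
```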
- The `page_html.find_all("script")` statement will return a list of all the JavaScript `<script>` tags present in the `page_html`.
- `list(filter(lambda script: script.has_attr("src"), all_script_tags))` and `list(filter(lambda script: not script.has_attr("src"), all_script_tags))` will filter the external and internal JavaScript tags, respectively, using the Python `lambda` and `filter()` functions.
- The `with open("internal_script.js", "w") as file:` block will write the internal JavaScript code to a new file, `internal_script.js`.
- The `with open("external_script.txt", "w") as file:` block will write all the external JavaScript source links to the `external_script.txt` file.
Now, we need to send the GET request to the page URL.
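As with the CSS program, this is a sketch with a placeholder URL:

```python
# Placeholder URL -- replace with the webpage you want to scrape
url = "https://www.example.com"

response = requests.get(url)
page_html = BeautifulSoup(response.text, 'html.parser')
page_javaScript(page_html)
```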
Finally, put all the code together and execute.
A Python Program to Extract Internal and External JavaScript from a Webpage
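Below is a sketch of the complete program, combining the snippets above; the URL is again a placeholder:

```python
import requests
from bs4 import BeautifulSoup


def page_javaScript(page_html):
    # All <script> tags present in the page
    all_script_tags = page_html.find_all("script")

    # External scripts have a src attribute; internal scripts do not
    external_scripts = list(filter(lambda script: script.has_attr("src"), all_script_tags))
    internal_scripts = list(filter(lambda script: not script.has_attr("src"), all_script_tags))

    # Write the internal JavaScript code to internal_script.js
    with open("internal_script.js", "w") as file:
        for script in internal_scripts:
            file.write(script.get_text())

    # Write the external JavaScript source links to external_script.txt and print them
    with open("external_script.txt", "w") as file:
        for script in external_scripts:
            file.write(f"{script['src']}\n")
            print(script["src"])


# Placeholder URL -- replace with the webpage you want to scrape
url = "https://www.example.com"
response = requests.get(url)
page_html = BeautifulSoup(response.text, 'html.parser')
page_javaScript(page_html)
```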
Output
In the program, we have only printed the webpage's external JavaScript source links. After executing the program, you can also check your Python script directory for the newly created `internal_script.js` and `external_script.txt` files, which contain the webpage's internal JavaScript code and external JavaScript links, respectively.
Conclusion
In this tutorial, you learned how to extract JavaScript and CSS files from web pages in Python. To extract the CSS and JavaScript files, we used web scraping with the Python requests and beautifulsoup4 libraries.
Before running the above Python programs, make sure that you have installed both libraries in your Python environment.