How to Extract All Website Links in Python?


Vinay Khatri
Last updated on November 21, 2024

    A webpage is a collection of data, and that data can be anything: text, images, videos, files, links, and so on. With the help of web scraping, we can extract that data from the webpage. Here, we will discuss how to extract all website links in Python.

    Let's say you have a webpage and want to extract only the URLs or links from it, say, to count its internal and external links. There are web applications on the internet that charge hundreds of dollars for such a feature, extracting valuable data from other webpages to gain insights into their strategies.

    You do not need to buy or rely on other applications for such a trivial task when you can write a Python script that extracts all the URL links from a webpage, and that is exactly what we are going to do in this tutorial.


    Here, in this Python tutorial, we will walk you through a Python program that can extract links or URLs from a webpage. But before we dive into the code, let's install the required libraries.

    Install Required Libraries

    Here is the list of all the required libraries that we will use in this tutorial, along with how to install them:

    1) Python requests Library

    requests is the de facto Python library for making HTTP requests. We will use it to send GET requests to the webpage's URL. You can install the requests library for your Python environment using the following pip install command:

    pip install requests
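
    As a quick check that the installation works, here is a minimal sketch of a GET request (example.com is just a placeholder URL, not part of this tutorial):

    import requests
    
    #send a GET request and inspect the response
    response = requests.get("https://example.com/")
    print(response.status_code)    #200 on success
    print(response.text[:100])     #first 100 characters of the HTML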

    2) Python beautifulsoup4 Library

    beautifulsoup4 is an open-source library used to extract or pull data from HTML and XML pages. In this tutorial, we will use it to extract the href links of the <a> tags from the HTML of the webpage. To install beautifulsoup4 for your Python environment, run the following pip install command:

    pip install beautifulsoup4
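
    To get a feel for the library, here is a minimal sketch that pulls a link out of an inline HTML snippet (the HTML string is made up for illustration):

    from bs4 import BeautifulSoup
    
    #parse a tiny HTML snippet and read the href of its <a> tag
    html = '<p>Visit <a href="https://example.com/">Example</a></p>'
    soup = BeautifulSoup(html, "html.parser")
    print(soup.find("a").get("href"))    #https://example.com/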

    3) Python colorama Library

    The colorama library is used to print colored text output on the terminal or command prompt. It is optional for this tutorial; we will use it only to print the output in color. To install colorama for your Python environment, run the following pip install command:

    pip install colorama
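
    A minimal sketch of colorama in action (the printed text is arbitrary):

    from colorama import Back, Style, init
    
    init()    #on Windows, converts ANSI escape sequences to Win32 calls
    print(Back.GREEN + "printed on a green background" + Style.RESET_ALL)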

    Alright then, we are all set. Open your best Python IDE or text editor and start coding.

    How to Extract URLs from Webpages in Python?

    Let's begin with importing the required modules.

    #modules
    from colorama import Back
    import requests
    from bs4 import BeautifulSoup
    
    #for windows 
    from colorama import init
    init()

    We are using a Windows system, and that's why we need two additional statements: from colorama import init and init(). On Windows, colorama has to filter the ANSI escape sequences out of any text sent to stdout or stderr and replace them with equivalent Win32 calls.

    If you are on Mac or Linux, you do not need these two statements. Even if you write them, they will have no effect. After calling colorama's init() method, let's define the webpage URL with the url identifier and send a GET request to that URL.

    #page url
    url = r"https://www.techgeekbuzz.com/"
    
    #send get request
    response = requests.get(url)

    Now, we can parse the response HTML text with the BeautifulSoup() constructor and find all the <a> tags present in the response HTML page.

    #parse html page
    html_page = BeautifulSoup(response.text, "html.parser")
    
    #get all <a> tags
    all_urls = html_page.find_all("a")

    The find_all() method returns a list of all the <a> tags present in html_page. As we want to separate the internal and external URLs present on the webpage, let's define two empty Python sets, namely internal_urls and external_urls.

    internal_urls = set()
    external_urls = set()
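
    We use sets rather than lists because the same link often appears multiple times on a page, and a set silently ignores duplicates. A quick sketch of this behavior:

    s = set()
    s.add("https://example.com/")
    s.add("https://example.com/")    #duplicate is ignored
    print(len(s))    #prints 1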

    Next, we will loop through every <a> tag in the all_urls list and read its href attribute value using the get() method, because the href attribute holds the link's URL.

    for link in all_urls:
        href = link.get('href')
        
        if href:
            if "techgeekbuzz.com" in href:    #internal link
                internal_urls.add(href)
    
            elif href.startswith("#"):    #same-page anchor link
                internal_urls.add(f"{url}{href}")
    
            elif href.startswith("/"):    #relative internal link
                internal_urls.add(f"{url.rstrip('/')}{href}")
                
            else:    #external link
                external_urls.add(href)

    add() is the set method that adds an element to the set object. Now, let's print all the internal URLs with a green background and the external links with a red background.

    print(Back.MAGENTA + f"Total Internal URLs: {len(internal_urls)}\n")
    for url in internal_urls:
        print(Back.GREEN + f"Internal URL {url}")
    
    
    print(Back.MAGENTA + f"\n\nTotal External URLs: {len(external_urls)}\n")
    for url in external_urls:
        print(Back.RED + f"External URL {url}")
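
    A quick caveat before we put everything together: checking whether the domain name appears somewhere in the href is a rough heuristic; for example, an external URL that merely mentions techgeekbuzz.com in its query string would be classified as internal. A more robust approach resolves every href against the page URL with Python's standard urllib.parse module. Here is a sketch with a hypothetical classify() helper, not part of the tutorial's program:

    from urllib.parse import urljoin, urlparse
    
    def classify(base_url, href):
        #resolve relative links ("/about/", "#section") against the base URL
        full_url = urljoin(base_url, href)
        #compare network locations to decide internal vs. external
        if urlparse(full_url).netloc == urlparse(base_url).netloc:
            return "internal", full_url
        return "external", full_url
    
    print(classify("https://www.techgeekbuzz.com/", "#contact"))
    #('internal', 'https://www.techgeekbuzz.com/#contact')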

    Now, let's put all the code together and execute it.

    Python Program to Extract URLs from the Webpage

    #modules
    from colorama import Back
    import requests
    from bs4 import BeautifulSoup
    
    #for windows 
    from colorama import init
    init()
    
    #page url
    url = "https://www.techgeekbuzz.com/"
    
    #send get request
    response = requests.get(url)
    
    #parse html page
    html_page = BeautifulSoup(response.text, "html.parser")
    
    #get all <a> tags
    all_urls = html_page.find_all("a")
    
    internal_urls = set()
    external_urls = set()
    
    for link in all_urls:
        href = link.get('href')
        
        if href:
            if "techgeekbuzz.com" in href:    #internal link
                internal_urls.add(href)
    
            elif href.startswith("#"):    #same-page anchor link
                internal_urls.add(f"{url}{href}")
    
            elif href.startswith("/"):    #relative internal link
                internal_urls.add(f"{url.rstrip('/')}{href}")
                
            else:    #external link
                external_urls.add(href)
    
    print(Back.MAGENTA + f"Total Internal URLs: {len(internal_urls)}\n")
    for url in internal_urls:
        print(Back.GREEN + f"Internal URL {url}")
    
    
    print(Back.MAGENTA + f"\n\nTotal External URLs: {len(external_urls)}\n")
    for url in external_urls:
        print(Back.RED + f"External URL {url}")

    Output

    [Screenshot: the program prints each internal URL on a green background and each external URL on a red background, with the totals shown on a magenta background.]

    Conclusion

    In this tutorial, you learned how to extract all the website links from a webpage in Python. The above program is an application of web scraping with Python. We recommend reading the official documentation of beautifulsoup4 and requests to learn more about web data extraction with Python.
