In this tutorial, we will discuss how to extract all PDF links in Python. Portable Document Format (PDF) is a file format that generally contains text and image data. The text can also include links that lead to websites or web pages.
There are many Python libraries that can read and write PDF files, but only a few of them are useful for extracting specific data, such as images and links.
In this Python tutorial, we will walk you through a Python program that extracts all the external links from a PDF. A PDF can also have internal links that lead the user to a specific page or section of the document. We do not cover internal links in detail, but the program below includes commented-out code that shows how to access them. Before diving into the program, let's install the required library.
Install Required Library to Extract All PDF Links in Python
To extract all PDF links in Python, we will use the open-source PyMuPDF library, a powerful and straightforward tool for reading PDFs and other book formats. To install the PyMuPDF library, run the following pip command in your terminal or command prompt:
pip install PyMuPDF
You will also need a PDF from which to extract the links. We suggest storing the PDF file in the same directory as your Python script so that you can load it with a relative path. Otherwise, you have to specify the absolute path to the PDF file. Now you are all set. Open your favorite Python IDE or text editor and start coding.
How to Extract All PDF Links in Python?
Let's begin with importing the required module.
import fitz # PyMuPDF
Now specify the filename in string format.
#filename
filename = "book.pdf"
Here, our PDF file, "book.pdf", resides in the same directory as the Python script, which is why we specify a relative path. If your PDF file is located in some other directory or drive, then you need to specify the absolute path. You can also use a relative path to another directory, but it has to be precise. Now open the PDF file with the fitz.open() method.
with fitz.open(filename) as my_pdf_file:
    # loop through every page
    for page_number in range(1, len(my_pdf_file)+1):
        # access individual page
        page = my_pdf_file[page_number-1]
        for link in page.links():
            # if the link is an external link with a URI (http, https, or mailto)
            if "uri" in link:
                # access link url
                url = link["uri"]
                print(f'Link: "{url}" found on page number --> {page_number}')
            # if the link is internal or points to a file, with no URI
            else:
                pass
                # if "page" in link:
                #     print("Internal page linking to page no", link["page"])
                # else:
                #     print("File linking", link["file"])
- The fitz.open(filename) as my_pdf_file statement opens the PDF file.
- The page.links() method returns a generator that yields every link present on the page.
- link is a dictionary object with keys such as uri, page, file, kind, and so on.
- A link has the uri key when it points to a Uniform Resource Identifier (URI), such as a target that starts with http, https, or mailto.
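The key checks described above can be sketched as a small helper that turns one link dictionary into a readable label. The name describe_link is hypothetical and not part of PyMuPDF. The sketch assumes that the page value is a 0-based page number and that a link pointing to another file carries a file key, so it checks file before page.

```python
def describe_link(link):
    """Return a short label for one link dictionary (hypothetical helper)."""
    if "uri" in link:
        # external link such as http, https, or mailto
        return f'external: {link["uri"]}'
    if "file" in link:
        # link that points to another file
        return f'file: {link["file"]}'
    if "page" in link:
        # assumption: "page" is 0-based, so add 1 for display
        return f'internal: page {link["page"] + 1}'
    return "other"
```

Inside the loop over page.links(), you could call print(describe_link(link)) instead of handling only the uri case.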
Now, put all the code together and execute.
# A Simple Program to Extract All PDF Links in Python
import fitz # PyMuPDF
# filename
filename = "book.pdf"
with fitz.open(filename) as my_pdf_file:
    # loop through every page
    for page_number in range(1, len(my_pdf_file)+1):
        # access individual page
        page = my_pdf_file[page_number-1]
        for link in page.links():
            # if the link is an external link with a URI (http, https, or mailto)
            if "uri" in link:
                url = link["uri"]
                print(f'Link: "{url}" found on page number --> {page_number}')
            # if the link is internal or points to a file, with no URI
            else:
                pass
                # if "page" in link:
                #     print("Internal page linking to page no", link["page"])
                # else:
                #     print("File linking", link["file"])
Output
Link: "https://twoscoopspress.com" found on page number --> 4
Link: "http://2scoops.co/malcolm-tredinnick-memorial" found on page number --> 7
Link: "https://github.com/twoscoops/two-scoops-of-django-1.8/issues" found on page number --> 32
Link: "http://www.2scoops.co/1.8-errata/" found on page number --> 32
Link: "https://docs.djangoproject.com/en/1.8/intro/tutorial01/" found on page number --> 33
Link: "http://www.2scoops.co/1.8-code-examples/" found on page number --> 34
Link: "https://docs.djangoproject.com/en/1.8/misc/design-philosophies/" found on page number --> 36
Link: "http://12factor.net" found on page number --> 37
Link: "http://www.2scoops.co/1.8-change-list/" found on page number --> 37
Link: "https://github.com/twoscoops/two-scoops-of-django-1.8/issues" found on page number --> 38
Link: "https://github.com/twoscoops/two-scoops-of-django-1.8/issues" found on page number --> 38
Link: "http://www.python.org/dev/peps/pep-0008/" found on page number --> 40
Link: "http://2scoops.co/hobgoblin-of-little-minds" found on page number --> 40
Link: "http://www.python.org/dev/peps/pep-0008/#maximum-line-length" found on page number --> 41
Link: "http://2scoops.co/guido-on-pep-8-vs-pep-328" found on page number --> 45
Link: "http://www.python.org/dev/peps/pep-0328/" found on page number --> 45
Link: "http://2scoops.co/1.8-coding-style" found on page number --> 47
Link: "https://github.com/rwaldron/idiomatic.js/" found on page number --> 48
Link: "https://github.com/madrobby/pragmatic.js" found on page number --> 48
Link: "https://github.com/airbnb/javascript" found on page number --> 48
...
Link: "http://ponycheckup.com/" found on page number --> 506
From the above output, you can see that we extracted only the URI links, that is, the external links or URLs that start with http, https, or mailto.
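The output also shows that the same URL can be reported more than once, for example twice on page 38. If you only need each link once, a minimal sketch is to collect (url, page_number) pairs instead of printing them and then group the pages by URL. The helper name group_pages_by_url is hypothetical and is not part of the program above.

```python
def group_pages_by_url(found_links):
    """Map each URL to the list of page numbers it appears on, without repeats."""
    pages_by_url = {}
    for url, page_number in found_links:
        pages = pages_by_url.setdefault(url, [])
        # skip a page that is already recorded for this URL
        if page_number not in pages:
            pages.append(page_number)
    return pages_by_url


# sample pairs taken from the output above
found = [
    ("https://github.com/twoscoops/two-scoops-of-django-1.8/issues", 32),
    ("https://github.com/twoscoops/two-scoops-of-django-1.8/issues", 38),
    ("https://github.com/twoscoops/two-scoops-of-django-1.8/issues", 38),
    ("http://12factor.net", 37),
]
for url, pages in group_pages_by_url(found).items():
    print(url, "-->", pages)
```

To use it with the program, replace the print call in the loop with found.append((url, page_number)).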
Conclusion
In this Python tutorial, we learned how to extract all PDF links in Python. You can also extract links from a single page: instead of looping over every page, access only the page you want and read its links.
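As a rough sketch of that tweak, the two helpers below return the external URLs from a single page. Both function names are hypothetical. The page number is assumed to be 1-based, as in the output above, and fitz is imported inside the second function so that the first one works on any already opened document.

```python
def uris_on_page(document, page_number):
    """Return the external URLs found on one 1-based page of an opened document."""
    if not 1 <= page_number <= len(document):
        raise ValueError(f"page_number must be between 1 and {len(document)}")
    # pages are indexed from 0, so subtract 1
    page = document[page_number - 1]
    return [link["uri"] for link in page.links() if "uri" in link]


def uris_on_page_of_file(filename, page_number):
    """Open the PDF at filename and return the external URLs on one page."""
    import fitz  # PyMuPDF

    with fitz.open(filename) as my_pdf_file:
        return uris_on_page(my_pdf_file, page_number)
```

For example, uris_on_page_of_file("book.pdf", 4) would return the external links on page 4 of the book.pdf file used in this tutorial.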
We have also written a Python tutorial on how to extract images from a PDF using Python and the PyMuPDF library. We suggest you read it if you want to work with Python and PDFs.