How to Extract URLs from Raw Text

How URL Extraction Works

URLs can be extracted from raw text in several ways. This example uses regular expressions.

The following pseudocode outlines the approach:

compile a regular expression that matches strings which
    start with http or https
    include ://
    contain a . (dot)
    may contain a port after :
apply the regular expression to the raw text
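
As a rough illustration, each pseudocode step can be mapped to one part of a verbose regular expression. This is a simplified sketch, not the pattern used in the rest of this article. The name SIMPLE_URL, the sample text, and the example hosts are assumptions made for the demonstration.

```python
import re

# Simplified, illustrative pattern; it is NOT the pattern used later on.
SIMPLE_URL = re.compile(
    r"""
    https?          # starts with http, optionally followed by s
    ://             # includes ://
    [\w-]+          # first host label
    (?:\.[\w-]+)+   # contains at least one dot followed by another label
    (?::\d+)?       # optional port, e.g. :8080
    (?:/[^\s]*)?    # optional path, up to the next whitespace
    """,
    re.VERBOSE,
)

# Hypothetical sample text
text = "Docs at https://example.com:8080/path and http://test.org."

# Apply the compiled pattern to the raw text
urls = [m.group(0) for m in SIMPLE_URL.finditer(text)]
print(urls)
```

With re.VERBOSE, whitespace and comments inside the pattern are ignored, so every step of the pseudocode can sit on its own annotated line.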

Python Code for URL Extraction

First, import Python's regular expression module:

import re

Next, compile a regular expression that targets URLs starting with http or https:

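# Raw string, so the backslashes reach the regex engine unchanged.
# The compiled pattern is stored in link_regex for the findall call below.
link_regex = re.compile(
    r"((https?):((//)|(\\))+([\w\d:#@%/;$()~_?\+-=\\.&](#!)?)*)",
    re.DOTALL,
)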

Next, all URLs are extracted with re.findall. Because the pattern contains capture groups, each match is returned as a tuple whose first element is the full URL. Storing the results in a set keeps only the unique URLs.

all_urls = {link[0] for link in re.findall(link_regex, raw_text)}
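
Put together, the steps so far can be tried on a small input. This is a minimal sketch: the sample raw_text and its hosts are made up for illustration, and the pattern is the one compiled above, written as a raw string.

```python
import re

link_regex = re.compile(
    r"((https?):((//)|(\\))+([\w\d:#@%/;$()~_?\+-=\\.&](#!)?)*)",
    re.DOTALL,
)

# Hypothetical input containing one duplicate URL
raw_text = (
    "See https://example.com/docs and http://test.org:8080/a?b=1 "
    "and https://example.com/docs again"
)

# The pattern has seven capture groups, so findall returns 7-tuples;
# index 0 is the outermost group, which holds the whole URL.
matches = re.findall(link_regex, raw_text)
all_urls = {link[0] for link in matches}

print(sorted(all_urls))
```

Note that the character class accepts "." and, through the "+" to "=" range, also ",", so punctuation that directly follows a URL in running text is captured as part of it.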

Flask Route for URL Extraction

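import re

from flask import Flask, render_template, request

# Assumed app setup; the route below uses `app`, which the snippet did not define.
app = Flask(__name__)


@app.route("/url-extractor/", methods=["GET", "POST"])
def url_extractor():
    if request.method == "POST":
        raw_text = request.form.get("rawText")
        if raw_text:
            # Same pattern as above, written as a raw string.
            link_regex = re.compile(
                r"((https?):((//)|(\\))+([\w\d:#@%/;$()~_?\+-=\\.&](#!)?)*)",
                re.DOTALL,
            )
            results = "No URLs found"
            # findall returns one tuple per match; element 0 is the full URL.
            all_links = {link[0] for link in re.findall(link_regex, raw_text)}
            if all_links:
                # Show the extracted URLs, one per line, in place of the input.
                raw_text = "\n".join(all_links)
                results = f"Successfully extracted {len(all_links)} urls"
            return render_template(
                "url-extractor.html", raw_text=raw_text, result=results
            )

    # GET requests and empty submissions render a blank form.
    return render_template("url-extractor.html", raw_text="", result="")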
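
One possible refinement is to move the extraction logic out of the route into a plain function, so it can be exercised without a running Flask app. The sketch below is an assumption, not part of the original route: the names LINK_REGEX and extract_urls are hypothetical, and the URLs are sorted so the output order is stable, whereas the route joins the set in arbitrary order.

```python
import re

# Compiled once at import time instead of on every request
LINK_REGEX = re.compile(
    r"((https?):((//)|(\\))+([\w\d:#@%/;$()~_?\+-=\\.&](#!)?)*)",
    re.DOTALL,
)


def extract_urls(raw_text):
    """Return (text, message), the two values the route passes to the template."""
    # Element 0 of each findall tuple is the full URL
    all_links = {link[0] for link in LINK_REGEX.findall(raw_text)}
    if not all_links:
        # Nothing matched: leave the input text untouched
        return raw_text, "No URLs found"
    # Sorted for a deterministic, one-URL-per-line result
    text = "\n".join(sorted(all_links))
    return text, f"Successfully extracted {len(all_links)} urls"


# Hypothetical usage
text, message = extract_urls("a https://example.com b http://test.org")
print(message)
print(text)
```

The route could then call extract_urls(raw_text) and pass both values straight to render_template.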