How to Extract URLs from Raw Text?

URL Extractor

Faisal Shahzad 17-02-2023 Text Processing Tools

How URL Extraction Works

URLs can be extracted from raw text in several ways. This example uses regular expressions.

Pseudocode for extracting URLs from raw text looks like this:

  • Compile a regular expression that matches strings which:
      • start with http or https
      • include ://
      • can contain dots (.)
      • can contain a port (:)
  • Apply the regular expression to the raw text.
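The steps above can be sketched in Python as follows. This is a deliberately simplified pattern for illustration, not the expression used later in this article; the names `SIMPLE_URL` and `sample` are assumptions made for this sketch.

```python
import re

# Simplified pattern that mirrors the steps above (illustrative only):
#   https?      -> starts with http or https
#   ://         -> scheme separator
#   [\w.-]+     -> host name, which can contain dots
#   (?::\d+)?   -> optional port such as :8080
#   [^\s]*      -> the rest of the URL, up to the next whitespace
SIMPLE_URL = re.compile(r"https?://[\w.-]+(?::\d+)?[^\s]*")

sample = "Docs at https://example.com/docs and API at http://localhost:8080/api today"

# The pattern has no capturing groups, so findall returns plain strings.
found = SIMPLE_URL.findall(sample)
print(found)
```

Because this pattern stops only at whitespace, punctuation that directly follows a URL would be captured as well, so it is a starting point rather than a complete solution.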

Python Code for URL Extraction

First, import Python's regular expression module:

import re

Next, compile a regular expression that targets URLs starting with http or https:

# A raw string (r"...") passes the backslashes to the regex engine unchanged.
link_regex = re.compile(
    r"((https?):((//)|(\\))+([\w\d:#@%/;$()~_?\+-=\\.&](#!)?)*)",
    re.DOTALL,
)

Next, all URLs are extracted with re.findall. Because the pattern contains capturing groups, each match is returned as a tuple whose first element is the full URL. Storing the results in a set keeps only the unique URLs.

all_urls = {link[0] for link in re.findall(link_regex, raw_text)}
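As a minimal end-to-end sketch, the pieces above can be wrapped in a helper function. The names `extract_urls`, `LINK_REGEX`, and `sample` are hypothetical and not part of the original tool. One deliberate difference: this sketch uses `dict.fromkeys` instead of a set, so duplicates are removed while the order of first appearance is kept.

```python
import re

# Same pattern as in the article, written as a raw string so the
# backslashes reach the regex engine unchanged.
LINK_REGEX = re.compile(
    r"((https?):((//)|(\\))+([\w\d:#@%/;$()~_?\+-=\\.&](#!)?)*)",
    re.DOTALL,
)


def extract_urls(raw_text):
    """Return the unique URLs in raw_text, in order of first appearance."""
    # findall yields one tuple per match (one item per capturing group);
    # item 0 is the outermost group, i.e. the whole URL.
    matches = (groups[0] for groups in LINK_REGEX.findall(raw_text))
    # dict.fromkeys drops duplicates while keeping insertion order.
    return list(dict.fromkeys(matches))


sample = (
    "See https://example.com/a and http://example.org/b?x=1 "
    "then https://example.com/a again"
)
urls = extract_urls(sample)
print(urls)
```

Note that characters such as `.`, `,` and `)` fall inside the pattern's character class, so punctuation that directly follows a URL in running text is captured as part of it. Stripping trailing punctuation afterwards may be worthwhile, depending on the input.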

Flask Route for URL Extraction

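import re

from flask import Flask, render_template, request

app = Flask(__name__)


@app.route("/url-extractor/", methods=["GET", "POST"])
def url_extractor():
    if request.method == "POST":
        raw_text = request.form.get("rawText")
        if raw_text:
            # Raw string so the backslashes reach the regex engine unchanged.
            link_regex = re.compile(
                r"((https?):((//)|(\\))+([\w\d:#@%/;$()~_?\+-=\\.&](#!)?)*)",
                re.DOTALL,
            )
            results = "No URLs found"
            # Each match is a tuple of groups; item 0 is the full URL.
            all_links = {link[0] for link in link_regex.findall(raw_text)}
            if all_links:
                # Show the extracted URLs, one per line, in place of the input.
                raw_text = "\n".join(all_links)
                results = f"Successfully extracted {len(all_links)} URLs"
            return render_template(
                "url-extractor.html", raw_text=raw_text, result=results
            )

    # GET requests and empty submissions render the blank form.
    return render_template("url-extractor.html", raw_text="", result="")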