How Does URL Extraction Work?
URLs can be extracted from raw text in several ways. This example uses regular expressions.
Pseudocode for extracting URLs from raw text looks like this:
- compile a regular expression that extracts strings which
  - start with http
  - include ://
  - contain .
  - may contain a port, introduced by :
- apply the regular expression to the desired raw text.
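The steps above can be sketched one pattern piece per bullet. This is a deliberately simplified illustration, not the pattern used in the rest of this article; the names SIMPLE_URL and find_simple_urls are hypothetical, and the trailing path piece is an addition that the pseudocode does not list.

```python
import re

# Each fragment corresponds to one bullet of the pseudocode (simplified sketch).
SIMPLE_URL = re.compile(
    r"https?"                # starts with http (the s is optional)
    r"://"                   # includes ://
    r"[\w-]+(?:\.[\w-]+)+"   # host name that contains at least one .
    r"(?::\d+)?"             # may contain a port, introduced by :
    r"(?:/[^\s]*)?"          # optional path, up to the next whitespace
)


def find_simple_urls(text):
    """Apply the compiled pattern to raw text and return every match."""
    # The pattern has no capturing groups, so findall returns whole matches.
    return SIMPLE_URL.findall(text)


sample = "Open http://example.com:8080/home or https://docs.example.org today"
print(find_simple_urls(sample))
```

Because this sketch requires a dot in the host name, a dot-less address such as http://localhost:8080 is skipped.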
Python Code for URL Extraction
First, import Python's regular expression module:
import re
Now compile the regular expression, which targets URLs starting with http or https:
# A raw string passes the regex escapes through unchanged
link_regex = re.compile(
    r"((https?):((//)|(\\))+([\w\d:#@%/;$()~_?\+-=\\.&](#!)?)*)",
    re.DOTALL,
)
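Next, all URLs are extracted with re.findall. Because the pattern contains capture groups, re.findall returns one tuple per match, and the first element of each tuple is the full URL. The results are stored in a set, so each URL appears only once.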
all_urls = set([link[0] for link in re.findall(link_regex, raw_text)])
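To see these pieces work together, here is a small self-contained run. The sample text is made up for illustration, and the pattern is the one from this article written as a raw string; treat it as a sketch rather than a complete URL parser.

```python
import re

# The article's pattern, written as a raw string.
link_regex = re.compile(
    r"((https?):((//)|(\\))+([\w\d:#@%/;$()~_?\+-=\\.&](#!)?)*)",
    re.DOTALL,
)

# Hypothetical sample input; the first URL appears twice on purpose.
raw_text = (
    "Visit https://example.com/docs for the manual\n"
    "or try http://localhost:8080/status when running locally\n"
    "Visit https://example.com/docs again"
)

# findall returns one tuple of groups per match; index 0 is the whole URL.
all_urls = {link[0] for link in re.findall(link_regex, raw_text)}

# A set has no defined order, so sort before printing.
for url in sorted(all_urls):
    print(url)
```

The duplicate URL is collapsed by the set, so only two lines are printed.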
Flask Route for URL Extraction
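import re

from flask import Flask, render_template, request

app = Flask(__name__)


@app.route("/url-extractor/", methods=["GET", "POST"])
def url_extractor():
    if request.method == "POST":
        # Text submitted through the form field named "rawText"
        raw_text = request.form.get("rawText")
        if raw_text:
            link_regex = re.compile(
                r"((https?):((//)|(\\))+([\w\d:#@%/;$()~_?\+-=\\.&](#!)?)*)",
                re.DOTALL,
            )
            results = "No URLs found"
            # Element 0 of each findall tuple is the full URL; the set removes duplicates
            all_links = set([link[0] for link in re.findall(link_regex, raw_text)])
            if all_links:
                # Show the extracted URLs one per line in place of the input
                raw_text = "\n".join(list(all_links))
                results = f"Successfully extracted {len(all_links)} URLs"
            return render_template(
                "url-extractor.html", raw_text=raw_text, result=results
            )
    # GET request or empty submission: render an empty form
    return render_template("url-extractor.html", raw_text="", result="")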
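The route combines extraction with request handling, which makes the logic hard to test without a running Flask app. As a rough sketch, the extraction could be moved into plain functions. The names LINK_REGEX, extract_urls and build_result are hypothetical and not part of the original code; the pattern is the article's regex written as a raw string. Sorting the set is an added assumption, made so that the output order stays the same between runs.

```python
import re

# Compiled once at import time instead of on every request.
LINK_REGEX = re.compile(
    r"((https?):((//)|(\\))+([\w\d:#@%/;$()~_?\+-=\\.&](#!)?)*)",
    re.DOTALL,
)


def extract_urls(raw_text):
    """Return the unique URLs found in raw_text, sorted for a stable order."""
    # finditer + group(0) gives the whole match without indexing group tuples.
    return sorted({match.group(0) for match in LINK_REGEX.finditer(raw_text or "")})


def build_result(raw_text):
    """Mirror the route's branches: return (text_to_display, status_message)."""
    if not raw_text:
        # Empty submission: the route renders an empty form.
        return "", ""
    urls = extract_urls(raw_text)
    if not urls:
        return raw_text, "No URLs found"
    return "\n".join(urls), f"Successfully extracted {len(urls)} URLs"


text, message = build_result(
    "see http://a.example/x and https://b.example:8443/y and http://a.example/x"
)
print(message)
print(text)
```

Inside the route, the two values returned by build_result could then be passed straight to render_template as raw_text and result.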