Merge pull request #2176 from jupyter-naas/2175-python-remove-html-ta…
…gs-from-text

feat: Python - Remove HTML tags from text
FlorentLvr authored Aug 25, 2023
2 parents 8ab5d36 + cbdb119 commit 84ab35b
Showing 1 changed file with 265 additions and 0 deletions.
265 changes: 265 additions & 0 deletions RegEx/RegEx_Remove_HTML_tags_from_text.ipynb
@@ -0,0 +1,265 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "254da872-f152-48f2-89be-228691c74a96",
"metadata": {
"papermill": {},
"tags": []
},
"source": [
"<img width=\"10%\" alt=\"Naas\" src=\"https://landen.imgix.net/jtci2pxwjczr/assets/5ice39g4.png?w=160\"/>"
]
},
{
"cell_type": "markdown",
"id": "9ebb49d4-ee78-445e-ade0-ca238000607f",
"metadata": {
"papermill": {},
"tags": []
},
"source": [
"# RegEx - Remove HTML tags from text"
]
},
{
"cell_type": "markdown",
"id": "0c028211-9124-472c-ad76-a982652038b6",
"metadata": {
"papermill": {},
"tags": []
},
"source": [
"**Tags:** #regex #python #html #text #remove #tags #string"
]
},
{
"cell_type": "markdown",
"id": "6907d86c-89e8-469a-9e51-02281dcafd3e",
"metadata": {
"papermill": {},
"tags": []
},
"source": [
"**Author:** [Florent Ravenel](https://www.linkedin.com/in/florent-ravenel)"
]
},
{
"cell_type": "markdown",
"id": "39b79b12-92cf-4046-820c-21eab09d2e93",
"metadata": {
"papermill": {},
"tags": []
},
"source": [
"**Last update:** 2023-08-25 (Created: 2023-08-25)"
]
},
{
"cell_type": "markdown",
"id": "2638c7c0-377b-4598-a4f3-9e6e294704b2",
"metadata": {
"papermill": {},
"tags": []
},
"source": [
"**Description:** This notebook shows how to remove HTML tags from text using Python. It is useful for organizations that need to strip HTML markup from text before using it."
]
},
{
"cell_type": "markdown",
"id": "89f9ab67-b189-4a42-bae3-c024313ba70c",
"metadata": {
"papermill": {},
"tags": []
},
"source": [
"**References:**\n",
"- [Python - Remove HTML tags from text](https://www.geeksforgeeks.org/python-remove-html-tags-from-text/)\n",
"- [Python - Regular Expressions](https://docs.python.org/3/library/re.html)"
]
},
{
"cell_type": "markdown",
"id": "a72458da-3130-48cf-b321-b3b196941aac",
"metadata": {
"papermill": {},
"tags": []
},
"source": [
"## Input"
]
},
{
"cell_type": "markdown",
"id": "92ed98c9-8f9c-4126-b514-35a985130ed0",
"metadata": {
"papermill": {},
"tags": []
},
"source": [
"### Import libraries"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "40102e20-55f1-4fce-8ceb-3289ec71941e",
"metadata": {
"papermill": {},
"tags": []
},
"outputs": [],
"source": [
"import re"
]
},
{
"cell_type": "markdown",
"id": "1f9dab78-35ed-4659-ac87-de3ad1d7e6e3",
"metadata": {
"papermill": {},
"tags": []
},
"source": [
"### Setup variables\n",
"- `text`: Text containing HTML tags"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5356174a-0a81-429a-9076-90a253e504b7",
"metadata": {
"papermill": {},
"tags": []
},
"outputs": [],
"source": [
"text = \"<html><head><title>Test</title></head><body><h1>Hello World!</h1></body></html>\""
]
},
{
"cell_type": "markdown",
"id": "5d948ec9-cd40-4ba0-a7e3-088922b9b2ab",
"metadata": {
"papermill": {},
"tags": []
},
"source": [
"## Model"
]
},
{
"cell_type": "markdown",
"id": "74d17d4c-f01d-40a8-ad28-4b967658d9ab",
"metadata": {
"papermill": {},
"tags": []
},
"source": [
"### Remove HTML tags"
]
},
{
"cell_type": "markdown",
"id": "708caab6-5805-42a1-b363-58538c531afd",
"metadata": {
"papermill": {},
"tags": []
},
"source": [
"This function uses the regular expression `<.*?>` to remove every substring that starts with `<` and ends at the next `>`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "09039b3c-53bd-4766-973d-b370ccd64e0e",
"metadata": {
"papermill": {},
"tags": []
},
"outputs": [],
"source": [
"def remove_html_tags(text):\n",
"    # Non-greedy match: '<', as few characters as possible, then '>'\n",
"    clean = re.compile(\"<.*?>\")\n",
"    return clean.sub(\"\", text)"
]
},
{
"cell_type": "markdown",
"id": "99dc8041-7515-46b1-9b89-a28f9eb4dc2c",
"metadata": {
"papermill": {},
"tags": []
},
"source": [
"## Output"
]
},
{
"cell_type": "markdown",
"id": "f8ede354-c5bf-4aff-9538-bea3257878cc",
"metadata": {
"papermill": {},
"tags": []
},
"source": [
"### Display result"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "10ec01d9-00b8-4e1e-9aaa-2e6ecdc8bf77",
"metadata": {
"papermill": {},
"tags": []
},
"outputs": [],
"source": [
"print(remove_html_tags(text))"
]
},
{
"cell_type": "markdown",
"id": "9e1e4416-d66b-4d73-863e-ca13263e1231",
"metadata": {
"papermill": {},
"tags": []
},
"source": [
" "
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.6"
},
"widgets": {
"application/vnd.jupyter.widget-state+json": {
"state": {},
"version_major": 2,
"version_minor": 0
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}
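
For reference, the notebook's three code cells can be sketched as one standalone script. This is a minimal sketch, not the notebook itself: the names `TAG_PATTERN`, `TAG_PATTERN_DOTALL`, and `multi_line` are assumptions introduced here for illustration. It also shows one limitation of `<.*?>`: by default `.` does not match a newline, so a tag that spans two lines is left in place unless `re.DOTALL` is passed.

```python
import re

# Same pattern as the notebook: '<', as few characters as possible, then '>'.
TAG_PATTERN = re.compile(r"<.*?>")

# Hypothetical variant (not in the notebook): re.DOTALL lets '.' match newlines,
# so tags that span several lines are removed as well.
TAG_PATTERN_DOTALL = re.compile(r"<.*?>", re.DOTALL)


def remove_html_tags(text, pattern=TAG_PATTERN):
    """Return `text` with every substring matched by `pattern` removed."""
    return pattern.sub("", text)


text = "<html><head><title>Test</title></head><body><h1>Hello World!</h1></body></html>"
result = remove_html_tags(text)
print(result)  # TestHello World!

# A tag split across two lines survives the default pattern.
multi_line = "<a\nhref='x'>link</a>"
default_result = remove_html_tags(multi_line)
dotall_result = remove_html_tags(multi_line, TAG_PATTERN_DOTALL)
print(repr(default_result))  # "<a\nhref='x'>link"
print(repr(dotall_result))   # 'link'
```

Note that the text of adjacent elements is joined without a separator, which is why the sample input yields `TestHello World!`.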

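The regular expression above works on simple markup but treats any `<` ... `>` pair as a tag and leaves character references such as `&amp;` untouched. As an alternative to the notebook's approach (not something this commit contains), the same result can be sketched with the standard library's `html.parser.HTMLParser`. The class name `TextExtractor` and the function name `strip_tags_with_parser` are hypothetical, chosen for this example only.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML document (illustrative helper)."""

    def __init__(self):
        super().__init__()  # convert_charrefs defaults to True
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; tags themselves never reach this method.
        self.chunks.append(data)


def strip_tags_with_parser(html_text):
    """Return the text content of `html_text` with all tags dropped."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


sample = "<html><head><title>Test</title></head><body><h1>Hello World!</h1></body></html>"
parsed = strip_tags_with_parser(sample)
print(parsed)
```

Because `convert_charrefs` is enabled by default, character references in the text are decoded as well, so `<p>a &amp; b</p>` becomes `a & b`.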