- Installation
- Creating a simple XML document
- The Element class
- The SubElement class
- Setting text and attributes
- Parse an XML file using LXML in Python
- Finding elements in XML
- Handling HTML with lxml.html
- lxml web scraping tutorial
- Conclusion
In this lxml Python tutorial, we will explore the lxml library. We will go through the basics of creating XML documents and then move on to processing XML and HTML documents. Finally, we will put all the pieces together and see how to extract data using lxml.
For a detailed explanation, see our blog post.
The best way to download and install the lxml library is to use the pip package manager. This works on Windows, Mac, and Linux:
pip3 install lxml
A very simple XML document would look like this:
<root>
  <branch>
    <branch_one>
    </branch_one>
    <branch_one>
    </branch_one>
  </branch>
</root>
To create an XML document using Python lxml, the first step is to import the etree module of lxml:
>>> from lxml import etree
In this example, we will create an HTML document that is also valid XML. It means that the root element will be named html:
>>> root = etree.Element("html")
Similarly, every html will have a head and a body:
>>> head = etree.Element("head")
>>> body = etree.Element("body")
To create parent-child relationships, we can simply use the append() method.
>>> root.append(head)
>>> root.append(body)
This document can be serialized and printed to the terminal with the help of the tostring() function:
>>> print(etree.tostring(root, pretty_print=True).decode())
Creating an Element object and calling the append() function for every node can make the code messy and unreadable. The easiest way is to use the SubElement factory:
body = etree.Element("body")
root.append(body)
# is the same as
body = etree.SubElement(root, "body")
Here is an example of setting the text of an element:
para = etree.SubElement(body, "p")
para.text = "Hello World!"
Similarly, attributes can be set using the key-value convention:
para.set("style", "font-size:20pt")
One thing to note here is that attributes can also be passed in the constructor of SubElement:
para = etree.SubElement(body, "p", style="font-size:20pt", id="firstPara")
para.text = "Hello World!"
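The counterpart of set() is get(), and all attributes of an element are exposed through its attrib mapping. A short sketch of reading them back:

```python
from lxml import etree

body = etree.Element("body")
para = etree.SubElement(body, "p", style="font-size:20pt", id="firstPara")
para.text = "Hello World!"

# get() reads a single attribute; attrib behaves like a dictionary
print(para.get("style"))   # font-size:20pt
print(para.get("id"))      # firstPara
```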
Here is the complete code:
from lxml import etree
root = etree.Element("html")
head = etree.SubElement(root, "head")
title = etree.SubElement(head, "title")
title.text = "This is Page Title"
body = etree.SubElement(root, "body")
heading = etree.SubElement(body, "h1", style="font-size:20pt", id="head")
heading.text = "Hello World!"
para = etree.SubElement(body, "p", id="firstPara")
para.text = "This HTML is XML Compliant!"
para = etree.SubElement(body, "p", id="secondPara")
para.text = "This is the second paragraph."
etree.dump(root) # prints everything to console. Use for debug only
Add the following lines at the bottom of the snippet and run it again:
with open('input.html', 'wb') as f:
    f.write(etree.tostring(root, pretty_print=True))
Running the snippet again creates input.html with the following contents:
<html>
<head>
<title>This is Page Title</title>
</head>
<body>
<h1 style="font-size:20pt" id="head">Hello World!</h1>
<p id="firstPara">This HTML is XML Compliant!</p>
<p id="secondPara">This is the second paragraph.</p>
</body>
</html>
To parse the file, call the parse() function; then, to get the root element, simply call the getroot() method:
from lxml import etree
tree = etree.parse('input.html')
elem = tree.getroot()
etree.dump(elem)  # prints file contents to console
The lxml.etree module exposes another function, fromstring(), that can be used to parse contents from a valid XML string:
xml = '<html><body>Hello</body></html>'
root = etree.fromstring(xml)
etree.dump(root)
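Note that fromstring() is strict: anything that is not well-formed XML raises an exception. A small sketch of handling that, using the etree.XMLSyntaxError exception class:

```python
from lxml import etree

# A mismatched close tag makes this markup ill-formed XML
broken = '<html><body>Hello</html>'

try:
    etree.fromstring(broken)
except etree.XMLSyntaxError as err:
    # The exception message points at the offending tag and position
    print('Parse failed:', err)
```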
If you want to dig deeper into parsing, we have already written a tutorial on BeautifulSoup, a Python package used for parsing HTML and XML documents.
Broadly, there are two ways of finding elements using the Python lxml library: the querying languages ElementPath and XPath. The first example uses ElementPath via the find() method, which returns the first matching element:
tree = etree.parse('input.html')
elem = tree.getroot()
para = elem.find('body/p')
etree.dump(para)
# Output
# <p id="firstPara">This HTML is XML Compliant!</p>
Similarly, the findall() method will return a list of all the elements matching the selector:
elem = tree.getroot()
para = elem.findall('body/p')
for e in para:
    etree.dump(e)
# Outputs
# <p id="firstPara">This HTML is XML Compliant!</p>
# <p id="secondPara">This is the second paragraph.</p>
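Besides find() and findall(), a tree can also be traversed with the iter() method, optionally filtered by tag name. A quick sketch on the same markup:

```python
from lxml import etree

xml = ('<html><body>'
       '<p id="firstPara">This HTML is XML Compliant!</p>'
       '<p id="secondPara">This is the second paragraph.</p>'
       '</body></html>')
root = etree.fromstring(xml)

# iter('p') walks the whole subtree, yielding only <p> elements
ids = [p.get('id') for p in root.iter('p')]
print(ids)   # ['firstPara', 'secondPara']
```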
Another way of selecting elements is by using XPath directly:
para = elem.xpath('//p/text()')
for e in para:
    print(e)
# Output
# This HTML is XML Compliant!
# This is the second paragraph.
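XPath can also filter elements by attribute using predicates in square brackets. A small sketch on the same markup:

```python
from lxml import etree

xml = ('<html><body>'
       '<p id="firstPara">This HTML is XML Compliant!</p>'
       '<p id="secondPara">This is the second paragraph.</p>'
       '</body></html>')
root = etree.fromstring(xml)

# [@id="..."] keeps only elements whose id attribute matches
text = root.xpath('//p[@id="secondPara"]/text()')[0]
print(text)   # This is the second paragraph.
```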
Here is the code to print all the paragraphs from the same HTML file, this time parsed with the lxml.html module:
from lxml import html
with open('input.html') as f:
    html_string = f.read()

tree = html.fromstring(html_string)
para = tree.xpath('//p/text()')
for e in para:
    print(e)
# Output
# This HTML is XML Compliant!
# This is the second paragraph.
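The main advantage of lxml.html over the strict XML parser is that it tolerates the slightly broken markup found on real web pages. A small sketch:

```python
from lxml import html

# The HTML parser repairs markup that strict XML parsing would reject,
# e.g. <p> tags that are never explicitly closed
broken = '<html><body><p id="one">First<p id="two">Second</body></html>'
tree = html.fromstring(broken)

texts = tree.xpath('//p/text()')
print(texts)   # ['First', 'Second']
```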
Now that we know how to parse and find elements in XML and HTML, the only missing piece is getting the HTML of a web page.
For this, the Requests library is a great choice:
pip install requests
Once the requests library is installed, the HTML of any web page can be retrieved using the get() method. Here is an example:
import requests
response = requests.get('http://books.toscrape.com/')
print(response.text)
# prints source HTML
Here is a quick example that prints a list of countries from Wikipedia:
import requests
from lxml import html
response = requests.get('https://en.wikipedia.org/wiki/List_of_countries_by_population_in_2010')
tree = html.fromstring(response.text)
countries = tree.xpath('//span[@class="flagicon"]')
for country in countries:
    print(country.xpath('./following-sibling::a/text()')[0])
The following modified code prints each country name together with the image URL of its flag:
for country in countries:
    flag = country.xpath('./img/@src')[0]
    name = country.xpath('./following-sibling::a/text()')[0]
    print(name, flag)
If you wish to find out more about XML Processing and Web Scraping With lxml, see our blog post.