What is XML?
XML stands for eXtensible Markup Language. It is a subset of the Standard Generalized Markup Language (SGML), designed to mark up electronic documents to give them structure. You can learn more through the XML Tutorial on PMeve.
XML is primarily designed for storing and transporting data.
It is a set of rules that defines semantic tags, which divide documents into several parts and assign identifiers to those parts. XML is also a meta-markup language, meaning it defines the syntax for other domain-specific, semantic, structured markup languages.
Parsing XML in Python
Common XML programming interfaces include DOM and SAX, which handle XML files in different ways, depending on the use case.
Python provides three methods for parsing XML: ElementTree, SAX, and DOM.
1. ElementTree
xml.etree.ElementTree
is a module in Python's standard library designed to handle XML, offering a simple and efficient API for parsing and generating XML documents.
2. SAX (Simple API for XML)
Python’s standard library includes a SAX parser, which uses an event-driven model to handle XML files by triggering events during the parsing process and calling user-defined callback functions.
3. DOM (Document Object Model)
This method parses XML data into a tree in memory, allowing for manipulation of the XML through tree-based operations.
Example XML File
Here’s an example of an XML file movies.xml
used in the chapter:
<collection shelf="New Arrivals"> <movie title="Enemy Behind"> <type>War, Thriller</type> <format>DVD</format> <year>2003</year> <rating>PG</rating> <stars>10</stars> <description>Talk about a US-Japan war</description> </movie> <movie title="Transformers"> <type>Anime, Science Fiction</type> <format>DVD</format> <year>1989</year> <rating>R</rating> <stars>8</stars> <description>A scientific fiction</description> </movie> <movie title="Trigun"> <type>Anime, Action</type> <format>DVD</format> <episodes>4</episodes> <rating>PG</rating> <stars>10</stars> <description>Vash the Stampede!</description> </movie> <movie title="Ishtar"> <type>Comedy</type> <format>VHS</format> <rating>PG</rating> <stars>2</stars> <description>Viewable boredom</description> </movie> </collection>
Parsing XML with ElementTree in Python
xml.etree.ElementTree
is a module in Python’s standard library for working with XML.
Here are some key concepts and usages:
ElementTree and Element objects:
ElementTree
: A class representing an entire XML document tree.Element
: Represents an individual element in the XML document. Each element has a tag, a set of attributes, and zero or more child elements.
Parsing XML
fromstring() method: Converts an XML string into an
Element
object.
Example:
import xml.etree.ElementTree as ET xml_string = '<root><element>Some data</element></root>' root = ET.fromstring(xml_string)
parse() method: Used to parse an XML document from a file.
tree = ET.parse('example.xml') root = tree.getroot()
Traversing the XML Tree
find() method: Finds the first child element with the specified tag.
title_element = root.find('title')
findall() method: Finds all child elements with the specified tag.
book_elements = root.findall('book')
Accessing Attributes and Text Content
attrib attribute: Accesses an element's attributes.
price = book_element.attrib['price']
text attribute: Accesses an element's text content.
title_text = title_element.text
Creating XML
Element() constructor: Creates a new element.
pnew_element = ET.Element('new_element')
SubElement() function: Creates a child element with the specified tag.
new_sub_element = ET.SubElement(root, 'new_sub_element')
Modifying XML
Modify the attributes and text content of elements by directly changing the
attrib
andtext
properties.remove() method: Removes an element.
python複製程式碼root.remove(title_element)
Simple XML Reading Example:
import xml.etree.ElementTree as ET # Define an XML string xml_string = ''' <bookstore> <book> <title>Introduction to Python</title> <author>John Doe</author> <price>29.99</price> </book> <book> <title>Data Science with Python</title> <author>Jane Smith</author> <price>39.95</price> </book> </bookstore> ''' # Parse the XML string using ElementTree root = ET.fromstring(xml_string) # Traverse the XML tree for book in root.findall('book'): title = book.find('title').text author = book.find('author').text price = book.find('price').text print(f'Title: {title}, Author: {author}, Price: {price}')
Output:
Title: Introduction to Python, Author: John Doe, Price: 29.99 Title: Data Science with Python, Author: Jane Smith, Price: 39.95
Creating and Parsing XML Example
import xml.etree.ElementTree as ET # Create an XML document root = ET.Element('bookstore') # Add the first book book1 = ET.SubElement(root, 'book') title1 = ET.SubElement(book1, 'title') title1.text = 'Introduction to Python' author1 = ET.SubElement(book1, 'author') author1.text = 'John Doe' price1 = ET.SubElement(book1, 'price') price1.text = '29.99' # Add the second book book2 = ET.SubElement(root, 'book') title2 = ET.SubElement(book2, 'title') title2.text = 'Data Science with Python' author2 = ET.SubElement(book2, 'author') author2.text = 'Jane Smith' price2 = ET.SubElement(book2, 'price') price2.text = '39.95' # Save the XML document to a file tree = ET.ElementTree(root) tree.write('books.xml') # Parse the XML document from the file parsed_tree = ET.parse('books.xml') parsed_root = parsed_tree.getroot() # Traverse the XML tree and print book information for book in parsed_root.findall('book'): title = book.find('title').text author = book.find('author').text price = book.find('price').text print(f'Title: {title}, Author: {author}, Price: {price}')
In this example, we first create an XML document containing information about two books. The document is saved to a file called books.xml
. Then, we use the ET.parse()
method to parse the XML document and extract and print the title, author, and price of each book.
Python Using SAX to Parse XML (2024)
SAX is an event-driven API used to parse XML documents. It involves two main components: the parser and the event handler.
The parser reads the XML document and triggers events (such as the start and end of elements) that are handled by the event handler, which processes the XML data in response to those events.
SAX is especially useful in these cases:
For processing large files.
When only part of the document or specific information is needed.
When building a custom object model is desired.
In Python, to work with SAX for XML processing, you need to import the parse
function from xml.sax
, and the ContentHandler
from xml.sax.handler
.
ContentHandler Class Methods
characters(content)
Method
When it's called:
When characters are found before encountering a tag, the value of
content
is the string before the tag.When characters are found between two tags,
content
contains those strings.When characters are found after a tag but before an end-of-line character,
content
contains those strings.
The tag could either be a start tag or an end tag.
startDocument()
Method
Called when the document starts.
endDocument()
Method
Called when the parser reaches the end of the document.
startElement(name, attrs)
Method
Called when the parser encounters an XML start tag. name
is the name of the tag, and attrs
is a dictionary of the tag’s attributes.
endElement(name)
Method
Called when the parser encounters an XML end tag.
make_parser
Method
This method creates and returns a new parser object.
xml.sax.make_parser([parser_list])
parser_list: An optional parameter that specifies a list of parsers.
parse
Method
This method creates a SAX parser and parses an XML document:
xml.sax.parse(xmlfile, contenthandler[, errorhandler])
xmlfile: The XML file name.
contenthandler: Must be an object of
ContentHandler
.errorhandler: If specified, must be a
SAX ErrorHandler
object.
parseString
Method
This method creates a SAX parser and parses an XML string:
xml.sax.parseString(xmlstring, contenthandler[, errorhandler])
xmlstring: The XML string.
contenthandler: Must be an object of
ContentHandler
.errorhandler: If specified, must be a
SAX ErrorHandler
object.
Python SAX Parsing Example
Here’s an example of how to use Python to parse XML using SAX:
#!/usr/bin/python3 import xml.sax class MovieHandler(xml.sax.ContentHandler): def __init__(self): self.CurrentData = "" self.type = "" self.format = "" self.year = "" self.rating = "" self.stars = "" self.description = "" # Element start def startElement(self, tag, attributes): self.CurrentData = tag if tag == "movie": print("*****Movie*****") title = attributes["title"] print("Title:", title) # Element end def endElement(self, tag): if self.CurrentData == "type": print("Type:", self.type) elif self.CurrentData == "format": print("Format:", self.format) elif self.CurrentData == "year": print("Year:", self.year) elif self.CurrentData == "rating": print("Rating:", self.rating) elif self.CurrentData == "stars": print("Stars:", self.stars) elif self.CurrentData == "description": print("Description:", self.description) self.CurrentData = "" # Read characters def characters(self, content): if self.CurrentData == "type": self.type = content elif self.CurrentData == "format": self.format = content elif self.CurrentData == "year": self.year = content elif self.CurrentData == "rating": self.rating = content elif self.CurrentData == "stars": self.stars = content elif self.CurrentData == "description": self.description = content if __name__ == "__main__": # Create an XMLReader parser = xml.sax.make_parser() # Disable namespace support parser.setFeature(xml.sax.handler.feature_namespaces, 0) # Override ContentHandler Handler = MovieHandler() parser.setContentHandler(Handler) parser.parse("movies.xml")
Output:
*****Movie***** Title: Enemy Behind Type: War, Thriller Format: DVD Year: 2003 Rating: PG Stars: 10 Description: Talk about a US-Japan war *****Movie***** Title: Transformers Type: Anime, Science Fiction Format: DVD Year: 1989 Rating: R Stars: 8 Description: A scientific fiction *****Movie***** Title: Trigun Type: Anime, Action Format: DVD Rating: PG Stars: 10 Description: Vash the Stampede! *****Movie***** Title: Ishtar Type: Comedy Format: VHS Rating: PG Stars: 2 Description: Viewable boredom
For more details, refer to the full SAX API documentation in Python SAX APIs.
Using xml.dom
to Parse XML
The Document Object Model (DOM) is a standard programming interface for processing XML documents, recommended by W3C. A DOM parser reads the entire XML document and stores it in memory as a tree structure. This allows you to read or modify the contents and structure of the document, and even save the changes back to the XML file.
In Python, you can use xml.dom.minidom
to parse XML files, as shown in the following example:
#!/usr/bin/python3 from xml.dom.minidom import parse import xml.dom.minidom # Use minidom parser to open the XML document DOMTree = xml.dom.minidom.parse("movies.xml") collection = DOMTree.documentElement if collection.hasAttribute("shelf"): print("Root element: %s" % collection.getAttribute("shelf")) # Get all movies from the collection movies = collection.getElementsByTagName("movie") # Print movie details for movie in movies: print("*****Movie*****") if movie.hasAttribute("title"): print("Title: %s" % movie.getAttribute("title")) type = movie.getElementsByTagName('type')[0] print("Type: %s" % type.childNodes[0].data) format = movie.getElementsByTagName('format')[0] print("Format: %s" % format.childNodes[0].data) rating = movie.getElementsByTagName('rating')[0] print("Rating: %s" % rating.childNodes[0].data) description = movie.getElementsByTagName('description')[0] print("Description: %s" % description.childNodes[0].data)
Output:
Root element: New Arrivals *****Movie***** Title: Enemy Behind Type: War, Thriller Format: DVD Rating: PG Description: Talk about a US-Japan war *****Movie***** Title: Transformers Type: Anime, Science Fiction Format: DVD Rating: R Description: A scientific fiction *****Movie***** Title: Trigun Type: Anime, Action Format: DVD Rating: PG Description: Vash the Stampede! *****Movie***** Title: Ishtar Type: Comedy Format: VHS Rating: PG Description: Viewable boredom
This example parses the movies.xml
file using xml.dom.minidom
and prints detailed information about each movie.