Python Tutorial (37) -XML parsing

Time: Column:Python views:222

What is XML?

XML stands for eXtensible Markup Language. It is a subset of the Standard Generalized Markup Language (SGML), designed to mark up electronic documents to give them structure. You can learn more through the XML Tutorial on PMeve.

XML is primarily designed for storing and transporting data.

It is a set of rules that defines semantic tags, which divide documents into several parts and assign identifiers to those parts. XML is also a meta-markup language, meaning it defines the syntax for other domain-specific, semantic, structured markup languages.

Parsing XML in Python

Common XML programming interfaces include DOM and SAX, which handle XML files in different ways, depending on the use case.

Python provides three methods for parsing XML: ElementTree, SAX, and DOM.

1. ElementTree

xml.etree.ElementTree is a module in Python's standard library designed to handle XML, offering a simple and efficient API for parsing and generating XML documents.

2. SAX (Simple API for XML)

Python’s standard library includes a SAX parser, which uses an event-driven model to handle XML files by triggering events during the parsing process and calling user-defined callback functions.

3. DOM (Document Object Model)

This method parses XML data into a tree in memory, allowing for manipulation of the XML through tree-based operations.

Example XML File

Here’s an example of an XML file movies.xml used in the chapter:

<collection shelf="New Arrivals">
    <movie title="Enemy Behind">
        <type>War, Thriller</type>
        <format>DVD</format>
        <year>2003</year>
        <rating>PG</rating>
        <stars>10</stars>
        <description>Talk about a US-Japan war</description>
    </movie>
    <movie title="Transformers">
        <type>Anime, Science Fiction</type>
        <format>DVD</format>
        <year>1989</year>
        <rating>R</rating>
        <stars>8</stars>
        <description>A scientific fiction</description>
    </movie>
    <movie title="Trigun">
        <type>Anime, Action</type>
        <format>DVD</format>
        <episodes>4</episodes>
        <rating>PG</rating>
        <stars>10</stars>
        <description>Vash the Stampede!</description>
    </movie>
    <movie title="Ishtar">
        <type>Comedy</type>
        <format>VHS</format>
        <rating>PG</rating>
        <stars>2</stars>
        <description>Viewable boredom</description>
    </movie>
</collection>

Parsing XML with ElementTree in Python

xml.etree.ElementTree is a module in Python’s standard library for working with XML.

Here are some key concepts and usages:

  • ElementTree and Element objects:

    • ElementTree: A class representing an entire XML document tree.

    • Element: Represents an individual element in the XML document. Each element has a tag, a set of attributes, and zero or more child elements.

Parsing XML

  • fromstring() method: Converts an XML string into an Element object.

Example:

import xml.etree.ElementTree as ET

xml_string = '<root><element>Some data</element></root>'
root = ET.fromstring(xml_string)
  • parse() method: Used to parse an XML document from a file.

tree = ET.parse('example.xml')
root = tree.getroot()

Traversing the XML Tree

  • find() method: Finds the first child element with the specified tag.

title_element = root.find('title')
  • findall() method: Finds all child elements with the specified tag.

book_elements = root.findall('book')

Accessing Attributes and Text Content

  • attrib attribute: Accesses an element's attributes.

price = book_element.attrib['price']
  • text attribute: Accesses an element's text content.

title_text = title_element.text

Creating XML

  • Element() constructor: Creates a new element.

pnew_element = ET.Element('new_element')
  • SubElement() function: Creates a child element with the specified tag.

new_sub_element = ET.SubElement(root, 'new_sub_element')

Modifying XML

  • Modify the attributes and text content of elements by directly changing the attrib and text properties.

  • remove() method: Removes an element.

python複製程式碼root.remove(title_element)

Simple XML Reading Example:

import xml.etree.ElementTree as ET

# Define an XML string
xml_string = '''
<bookstore>
    <book>
        <title>Introduction to Python</title>
        <author>John Doe</author>
        <price>29.99</price>
    </book>
    <book>
        <title>Data Science with Python</title>
        <author>Jane Smith</author>
        <price>39.95</price>
    </book>
</bookstore>
'''

# Parse the XML string using ElementTree
root = ET.fromstring(xml_string)

# Traverse the XML tree
for book in root.findall('book'):
    title = book.find('title').text
    author = book.find('author').text
    price = book.find('price').text
    print(f'Title: {title}, Author: {author}, Price: {price}')

Output:

Title: Introduction to Python, Author: John Doe, Price: 29.99
Title: Data Science with Python, Author: Jane Smith, Price: 39.95

Creating and Parsing XML Example

import xml.etree.ElementTree as ET

# Create an XML document
root = ET.Element('bookstore')

# Add the first book
book1 = ET.SubElement(root, 'book')
title1 = ET.SubElement(book1, 'title')
title1.text = 'Introduction to Python'
author1 = ET.SubElement(book1, 'author')
author1.text = 'John Doe'
price1 = ET.SubElement(book1, 'price')
price1.text = '29.99'

# Add the second book
book2 = ET.SubElement(root, 'book')
title2 = ET.SubElement(book2, 'title')
title2.text = 'Data Science with Python'
author2 = ET.SubElement(book2, 'author')
author2.text = 'Jane Smith'
price2 = ET.SubElement(book2, 'price')
price2.text = '39.95'

# Save the XML document to a file
tree = ET.ElementTree(root)
tree.write('books.xml')

# Parse the XML document from the file
parsed_tree = ET.parse('books.xml')
parsed_root = parsed_tree.getroot()

# Traverse the XML tree and print book information
for book in parsed_root.findall('book'):
    title = book.find('title').text
    author = book.find('author').text
    price = book.find('price').text
    print(f'Title: {title}, Author: {author}, Price: {price}')

In this example, we first create an XML document containing information about two books. The document is saved to a file called books.xml. Then, we use the ET.parse() method to parse the XML document and extract and print the title, author, and price of each book.


Python Using SAX to Parse XML (2024)

SAX is an event-driven API used to parse XML documents. It involves two main components: the parser and the event handler.

The parser reads the XML document and triggers events (such as the start and end of elements) that are handled by the event handler, which processes the XML data in response to those events.

SAX is especially useful in these cases:

  1. For processing large files.

  2. When only part of the document or specific information is needed.

  3. When building a custom object model is desired.

In Python, to work with SAX for XML processing, you need to import the parse function from xml.sax, and the ContentHandler from xml.sax.handler.

ContentHandler Class Methods

characters(content) Method

When it's called:

  • When characters are found before encountering a tag, the value of content is the string before the tag.

  • When characters are found between two tags, content contains those strings.

  • When characters are found after a tag but before an end-of-line character, content contains those strings.

The tag could either be a start tag or an end tag.

startDocument() Method

Called when the document starts.

endDocument() Method

Called when the parser reaches the end of the document.

startElement(name, attrs) Method

Called when the parser encounters an XML start tag. name is the name of the tag, and attrs is a dictionary of the tag’s attributes.

endElement(name) Method

Called when the parser encounters an XML end tag.

make_parser Method

This method creates and returns a new parser object.

xml.sax.make_parser([parser_list])
  • parser_list: An optional parameter that specifies a list of parsers.

parse Method

This method creates a SAX parser and parses an XML document:

xml.sax.parse(xmlfile, contenthandler[, errorhandler])
  • xmlfile: The XML file name.

  • contenthandler: Must be an object of ContentHandler.

  • errorhandler: If specified, must be a SAX ErrorHandler object.

parseString Method

This method creates a SAX parser and parses an XML string:

xml.sax.parseString(xmlstring, contenthandler[, errorhandler])
  • xmlstring: The XML string.

  • contenthandler: Must be an object of ContentHandler.

  • errorhandler: If specified, must be a SAX ErrorHandler object.


Python SAX Parsing Example

Here’s an example of how to use Python to parse XML using SAX:

#!/usr/bin/python3
import xml.sax

class MovieHandler(xml.sax.ContentHandler):
   def __init__(self):
      self.CurrentData = ""
      self.type = ""
      self.format = ""
      self.year = ""
      self.rating = ""
      self.stars = ""
      self.description = ""

   # Element start
   def startElement(self, tag, attributes):
      self.CurrentData = tag
      if tag == "movie":
         print("*****Movie*****")
         title = attributes["title"]
         print("Title:", title)

   # Element end
   def endElement(self, tag):
      if self.CurrentData == "type":
         print("Type:", self.type)
      elif self.CurrentData == "format":
         print("Format:", self.format)
      elif self.CurrentData == "year":
         print("Year:", self.year)
      elif self.CurrentData == "rating":
         print("Rating:", self.rating)
      elif self.CurrentData == "stars":
         print("Stars:", self.stars)
      elif self.CurrentData == "description":
         print("Description:", self.description)
      self.CurrentData = ""

   # Read characters
   def characters(self, content):
      if self.CurrentData == "type":
         self.type = content
      elif self.CurrentData == "format":
         self.format = content
      elif self.CurrentData == "year":
         self.year = content
      elif self.CurrentData == "rating":
         self.rating = content
      elif self.CurrentData == "stars":
         self.stars = content
      elif self.CurrentData == "description":
         self.description = content

if __name__ == "__main__":
   # Create an XMLReader
   parser = xml.sax.make_parser()
   # Disable namespace support
   parser.setFeature(xml.sax.handler.feature_namespaces, 0)

   # Override ContentHandler
   Handler = MovieHandler()
   parser.setContentHandler(Handler)

   parser.parse("movies.xml")

Output:

*****Movie*****
Title: Enemy Behind
Type: War, Thriller
Format: DVD
Year: 2003
Rating: PG
Stars: 10
Description: Talk about a US-Japan war
*****Movie*****
Title: Transformers
Type: Anime, Science Fiction
Format: DVD
Year: 1989
Rating: R
Stars: 8
Description: A scientific fiction
*****Movie*****
Title: Trigun
Type: Anime, Action
Format: DVD
Rating: PG
Stars: 10
Description: Vash the Stampede!
*****Movie*****
Title: Ishtar
Type: Comedy
Format: VHS
Rating: PG
Stars: 2
Description: Viewable boredom

For more details, refer to the full SAX API documentation in Python SAX APIs.


Using xml.dom to Parse XML

The Document Object Model (DOM) is a standard programming interface for processing XML documents, recommended by W3C. A DOM parser reads the entire XML document and stores it in memory as a tree structure. This allows you to read or modify the contents and structure of the document, and even save the changes back to the XML file.

In Python, you can use xml.dom.minidom to parse XML files, as shown in the following example:

#!/usr/bin/python3
from xml.dom.minidom import parse
import xml.dom.minidom

# Use minidom parser to open the XML document
DOMTree = xml.dom.minidom.parse("movies.xml")
collection = DOMTree.documentElement
if collection.hasAttribute("shelf"):
   print("Root element: %s" % collection.getAttribute("shelf"))

# Get all movies from the collection
movies = collection.getElementsByTagName("movie")

# Print movie details
for movie in movies:
   print("*****Movie*****")
   if movie.hasAttribute("title"):
      print("Title: %s" % movie.getAttribute("title"))

   type = movie.getElementsByTagName('type')[0]
   print("Type: %s" % type.childNodes[0].data)
   format = movie.getElementsByTagName('format')[0]
   print("Format: %s" % format.childNodes[0].data)
   rating = movie.getElementsByTagName('rating')[0]
   print("Rating: %s" % rating.childNodes[0].data)
   description = movie.getElementsByTagName('description')[0]
   print("Description: %s" % description.childNodes[0].data)

Output:

Root element: New Arrivals
*****Movie*****
Title: Enemy Behind
Type: War, Thriller
Format: DVD
Rating: PG
Description: Talk about a US-Japan war
*****Movie*****
Title: Transformers
Type: Anime, Science Fiction
Format: DVD
Rating: R
Description: A scientific fiction
*****Movie*****
Title: Trigun
Type: Anime, Action
Format: DVD
Rating: PG
Description: Vash the Stampede!
*****Movie*****
Title: Ishtar
Type: Comedy
Format: VHS
Rating: PG
Description: Viewable boredom

This example parses the movies.xml file using xml.dom.minidom and prints detailed information about each movie.