Tag Archives: Python

Transforming and Validating XML with Python and lxml

XML isn’t nearly as sexy as JSON these days, but it’s still out there in the wild. And it is powerful. For example, it’s pretty awesome that you can assemble an XSL transform to parse XML and turn it into newly formatted XML. It’s also pretty awesome that you can verify XML against a schema to ensure the XML meets all requirements (say, for example, that an ID be unique across all instances) — that the XML is “valid.”

If you are a front-end developer, chances are that you make a series of HTTP requests and receive data — it’s a pretty common thing. For the purposes of this post, we’ll assume that data is XML. But, there’s a problem: the XML is not using the tags you need for your application. So, you apply an XSL transform. Your application makes many assumptions about the format of this massaged data, so you employ a schema or XSD to validate each assumption.

There’s also a pretty good chance that the folks maintaining these services want to tinker. So it would be immensely helpful to be able to quickly test out each URL to be confident that changes made to services won’t negatively affect your application. It would be wise to structure these as actual unit tests, but that is beyond the scope of my focus here.

Commence the tool-making! This seemed like a perfect candidate for Python, so I hopped to it. After some googling, I quickly got the impression that lxml was the perfect library for the job, able to handle both XML transforms and XSD validation. It couldn’t have been easier to work with.

I whipped up a Python script to read URLs from a designated text file, iterate over each one, hit the URL, transform the XML, validate the XML, and write any validation errors to a log file. Pretty straight-forward, and I can now validate all of my URLs at a moment’s notice, and have a full report generated in seconds.

Below is my script:

More Python Fun

It’s no secret that, recently, I’ve been teaching myself Python. A couple of weeks ago, I wrote a Python script to convert a CSV file to an XML file, and that wet my appetite for more.

Earlier today, I discovered Anaconda from Continuum Analytics, which comes with IPython Notebook. Not only is it a really nice tool for learning Python, but you can also plot points! This would have made Calculus way more fun 15 years ago!

At any rate, I started fooling around with some basic list slicing, list comprehension and the functional favorites: filter, map and reduce. IPython Notebook made this incredibly simple. Wanting to tackle something a bit more complicated, I sought out a coding interview problem.

The problem is such that you’re provided an initial collection of integers, and you are to produce a sum of the highest, non-adjacent integers in the collection. It sounds challenging, but when you break it up into smaller pieces, it’s pretty trivial.

I started by building a min heap of the original collection such that I could pop off the largest values in order. A max heap is technically more appropriate, but the Python heapq module that turns a list into a heap only supports min. As for the values themselves, I simply inverted them by multiplying each by -1.

The index of each item is also critical in determining whether adjacent items have already been applied toward the sum. So instead of pushing the raw value onto the heap, I pushed a tuple containing the value and its index.

With the heap fully constructed, the next thing needed was some way of keeping track of which items were used toward the sum. I chose the simple solution of creating a list of boolean values, each initialized to False, such that when an item at the same index is used toward the sum, its value is changed to True.

While popping items off the heap, each item’s neighbors are examined to determine whether it’s a candidate for the sum. If it is, its value is added to a final list, from which a sum can easily be reduced.

Here’s the full script:

Could this problem be solved other ways, either by reducing allocations or increasing speed? Quite possibly, but remember, this was just an exercise to flex my new Python muscles.

Boy Meets Python

Last week I needed a quick solution to convert a CSV file to an XML file, and because C# is my primary language, I was able to throw this together in less than 10 minutes:

So what does this have to do with Python? Well, this weekend, I had the sudden urge to learn some Python. I wanted to build something that a.) would force me to learn a few things about the language, and b.) had value. The CSV to XML converter was fresh in my mind, and so I thought it would be a great way to begin my Python journey.

To start, I installed Python on Windows. I downloaded the installer from http://www.python.org/getit/, and was writing Python in just a few minutes. Pretty painless.

Writing Python was slightly awkward at first, but I quickly got the hang of things. Having taken the time to learn LINQ and lambda expressions a few years ago certainly helped.

Command line arguments were a breeze using argparse. Within minutes I had a way to specify the CSV input file and the XML output file. It isn’t absolutely necessary, but argparse makes specifying expected parameters easy, and comes prepackaged with --help. Nice.

Next, I stumbled upon csv, which was certainly helpful. But, again, I’m pretty sure I could have survived without it, treating the input file as a standard text file and reading one line at a time.

A long time ago I got into the habit of encapsulating file I/O with using() in C#. It felt awkward acquiring a file handle and having to call close() on it explicitly, but once I discovered Python’s with keyword, I felt right at home.

The rest of the script, which is really the meat of the conversion, required me to learn a little bit about lists and strings. I’m an avid user of string.Format(...) in C# and was happy to see that I could call format(...) in Python.

I began by reading in the first line, which always contains the headers. I wanted to form a string format something to the effect of <row col0="{0}" col1="{1}"/> that I could use when processing each subsequent line. I discovered the join() method on the string, and thought that might allow me to dynamically assemble the attributes. Calling join() on the string " " and passing to it a collection of strings generated by an iterator that iterates over the headers cleverly assembles the format string — in one line of code! (I felt pretty stupid when I realized that the string in C# also has this feature.)

The last remaining piece was processing each line of the CSV file. This was trivial once I generated a format string, with one exception. For each line, I thought I could call format() on the format string, pass in the list of values from the line, and write the newly constructed string to the file. The problem was, format() is expecting comma-delimited parameters, and I was holding onto a list of values as strings. Simply passing the reference to the list, line, was not sufficient. To my surprise, I discovered that I could essentially dereference the list (as such: *line), satisfying format().

And that completed the exercise! I won’t admit how long it took me to write, but let’s just say it took longer than 10 minutes.

Below is the script:

(Having spent the time to set up the row format in Python, I thought I should go back and use the same approach in C# , complete with using Join(), for a more apples-to-apples comparison.)

The mere fact that I did all of this on Windows felt slightly sacrilegious, so I decided to go back and conduct the same exercise, this time on Linux — Ubuntu 13.04 to be exact.

Ubuntu ships with Python installed, so technically there were even fewer steps to get started. But, it ships with v2.7.4, and the script I wrote on Windows apparently uses language features that didn’t exist until v3.x. So, I grabbed Python 3.3.2 for Linux from http://www.python.org/getit/, and followed these excellent instructions so that I could have both v2.7.4 and v3.3.2 installed simultaneously. Once installed, the script I wrote on Windows ran equally well on Linux.

It was clear during this exercise that I merely scratched the surface with Python. It appears to have quite an exhaustive API, contains many of the same constructs that I’m used to in C#, and I will not hesitate to use it for all of my future scripting needs.