20 May 2004
Copyright © 2000, 2001, 2002, 2003, 2004 Mark Pilgrim
This book lives at http://diveintopython.org/. If you're reading it somewhere else, you may not have the latest version.
Permission is granted to copy, distribute, and/or modify this document under the terms of the GNU Free Documentation License, Version 1.1 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in Appendix G, GNU Free Documentation License.
The example programs in this book are free software; you can redistribute and/or modify them under the terms of the Python license as published by the Python Software Foundation. A copy of the license is included in Appendix H, Python license.
Table of Contents
Welcome to Python. Let's dive in. In this chapter, you'll install the version of Python that's right for you.
The first thing you need to do with Python is install it. Or do you?
If you're using an account on a hosted server, your ISP may have already installed Python. Most popular Linux distributions come with Python in the default installation. Mac OS X 10.2 and later includes a command-line version of Python, although you'll probably want to install a version that includes a more Mac-like graphical interface.
Windows does not come with any version of Python, but don't despair! There are several ways to point-and-click your way to Python on Windows.
As you can see already, Python runs on a great many operating systems. The full list includes Windows, Mac OS, Mac OS X, and all varieties of free UNIX-compatible systems like Linux. There are also versions that run on Sun Solaris, AS/400, Amiga, OS/2, BeOS, and a plethora of other platforms you've probably never even heard of.
What's more, Python programs written on one platform can, with a little care, run on any supported platform. For instance, I regularly develop Python programs on Windows and later deploy them on Linux.
So back to the question that started this section, “Which Python is right for you?” The answer is whichever one runs on the computer you already have.
On Windows, you have a couple choices for installing Python.
ActiveState makes a Windows installer for Python called ActivePython, which includes a complete version of Python, an IDE with a Python-aware code editor, plus some Windows extensions for Python that allow complete access to Windows-specific services, APIs, and the Windows Registry.
ActivePython is freely downloadable, although it is not open source. It is the IDE I used to learn Python, and I recommend you try it unless you have a specific reason not to. One such reason might be that ActiveState is generally several months behind in updating their ActivePython installer when new version of Python are released. If you absolutely need the latest version of Python and ActivePython is still a version behind as you read this, you'll want to use the second option for installing Python on Windows.
The second option is the “official” Python installer, distributed by the people who develop Python itself. It is freely downloadable and open source, and it is always current with the latest version of Python.
Here is the procedure for installing ActivePython:
Download ActivePython from http://www.activestate.com/Products/ActivePython/.
If you are using Windows 95, Windows 98, or Windows ME, you will also need to download and install Windows Installer 2.0 before installing ActivePython.
Double-click the installer, ActivePython-2.2.2-224-win32-ix86.msi.
Step through the installer program.
If space is tight, you can do a custom installation and deselect the documentation, but I don't recommend this unless you absolutely can't spare the 14MB.
After the installation is complete, close the installer and choose
-> -> -> . You'll see something like the following:
PythonWin 2.2.2 (#37, Nov 26 2002, 10:24:37) [MSC 32 bit (Intel)] on win32.
Portions Copyright 1994-2001 Mark Hammond (mhammond@skippinet.com.au) -
see 'Help/About PythonWin' for further copyright information.
>>>
Download the latest Python Windows installer by going to http://www.python.org/ftp/python/ and selecting the highest version number listed, then downloading the .exe installer.
Double-click the installer, Python-2.xxx.yyy.exe. The name will depend on the version of Python available when you read this.
Step through the installer program.
If disk space is tight, you can deselect the HTMLHelp file, the utility scripts (Tools/), and/or the test suite (Lib/test/).
If you do not have administrative rights on your machine, you can select Non-Admin Install. This just affects where Registry entries and Start menu shortcuts are created.
, then chooseAfter the installation is complete, close the installer and select
-> -> -> . You'll see something like the following:
Python 2.3.2 (#49, Oct 2 2003, 20:02:00) [MSC v.1200 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
****************************************************************
Personal firewall software may warn about the connection IDLE
makes to its subprocess using this computer's internal loopback
interface. This connection is not visible on any external
interface and no data is sent to or received from the Internet.
****************************************************************
IDLE 1.0
>>>
On Mac OS X, you have two choices for installing Python: install it, or don't install it. You probably want to install it.
Mac OS X 10.2 and later comes with a command-line version of Python preinstalled. If you are comfortable with the command line, you can use this version for the first third of the book. However, the preinstalled version does not come with an XML parser, so when you get to the XML chapter, you'll need to install the full version.
Rather than using the preinstalled version, you'll probably want to install the latest version, which also comes with a graphical interactive shell.
To use the preinstalled version of Python, follow these steps:
Open the /Applications folder.
Open the Utilities folder.
Double-click Terminal to open a terminal window and get to a command line.
Type python at the command prompt.
Try it out:
Welcome to Darwin! [localhost:~] you% python Python 2.2 (#1, 07/14/02, 23:25:09) [GCC Apple cpp-precomp 6.14] on darwin Type "help", "copyright", "credits", or "license" for more information. >>> [press Ctrl+D to get back to the command prompt] [localhost:~] you%
Follow these steps to download and install the latest version of Python:
Download the MacPython-OSX disk image from http://homepages.cwi.nl/~jack/macpython/download.html.
If your browser has not already done so, double-click MacPython-OSX-2.3-1.dmg to mount the disk image on your desktop.
Double-click the installer, MacPython-OSX.pkg.
The installer will prompt you for your administrative username and password.
Step through the installer program.
After installation is complete, close the installer and open the /Applications folder.
Open the MacPython-2.3 folder
Double-click PythonIDE to launch Python.
The MacPython IDE should display a splash screen, then take you to the interactive shell. If the interactive shell does not appear, select -> (Cmd-0). The opening window will look something like this:
Python 2.3 (#2, Jul 30 2003, 11:45:28)
[GCC 3.1 20020420 (prerelease)]
Type "copyright", "credits" or "license" for more information.
MacPython IDE 1.0.1
>>>
Note that once you install the latest version, the pre-installed version is still present. If you are running scripts from the command line, you need to be aware which version of Python you are using.
[localhost:~] you% python Python 2.2 (#1, 07/14/02, 23:25:09) [GCC Apple cpp-precomp 6.14] on darwin Type "help", "copyright", "credits", or "license" for more information. >>> [press Ctrl+D to get back to the command prompt] [localhost:~] you% /usr/local/bin/python Python 2.3 (#2, Jul 30 2003, 11:45:28) [GCC 3.1 20020420 (prerelease)] on darwin Type "help", "copyright", "credits", or "license" for more information. >>> [press Ctrl+D to get back to the command prompt] [localhost:~] you%
Mac OS 9 does not come with any version of Python, but installation is very simple, and there is only one choice.
Follow these steps to install Python on Mac OS 9:
Download the MacPython23full.bin file from http://homepages.cwi.nl/~jack/macpython/download.html.
If your browser does not decompress the file automatically, double-click MacPython23full.bin to decompress the file with Stuffit Expander.
Double-click the installer, MacPython23full.
Step through the installer program.
AFter installation is complete, close the installer and open the /Applications folder.
Open the MacPython-OS9 2.3 folder.
Double-click Python IDE to launch Python.
The MacPython IDE should display a splash screen, and then take you to the interactive shell. If the interactive shell does not appear, select -> (Cmd-0). You'll see a screen like this:
Python 2.3 (#2, Jul 30 2003, 11:45:28)
[GCC 3.1 20020420 (prerelease)]
Type "copyright", "credits" or "license" for more information.
MacPython IDE 1.0.1
>>>
Installing under UNIX-compatible operating systems such as Linux is easy if you're willing to install a binary package. Pre-built binary packages are available for most popular Linux distributions. Or you can always compile from source.
Download the latest Python RPM by going to http://www.python.org/ftp/python/ and selecting the highest version number listed, then selecting the rpms/ directory within that. Then download the RPM with the highest version number. You can install it with the rpm command, as shown here:
localhost:~$ su - Password: [enter your root password] [root@localhost root]# wget http://python.org/ftp/python/2.3/rpms/redhat-9/python2.3-2.3-5pydotorg.i386.rpm Resolving python.org... done. Connecting to python.org[194.109.137.226]:80... connected. HTTP request sent, awaiting response... 200 OK Length: 7,495,111 [application/octet-stream] ... [root@localhost root]# rpm -Uvh python2.3-2.3-5pydotorg.i386.rpm Preparing... ########################################### [100%] 1:python2.3 ########################################### [100%] [root@localhost root]# pythonPython 2.2.2 (#1, Feb 24 2003, 19:13:11) [GCC 3.2.2 20030222 (Red Hat Linux 3.2.2-4)] on linux2 Type "help", "copyright", "credits", or "license" for more information. >>> [press Ctrl+D to exit] [root@localhost root]# python2.3
Python 2.3 (#1, Sep 12 2003, 10:53:56) [GCC 3.2.2 20030222 (Red Hat Linux 3.2.2-5)] on linux2 Type "help", "copyright", "credits", or "license" for more information. >>> [press Ctrl+D to exit] [root@localhost root]# which python2.3
/usr/bin/python2.3
If you are lucky enough to be running Debian GNU/Linux, you install Python through the apt command.
localhost:~$ su - Password: [enter your root password] localhost:~# apt-get install python Reading Package Lists... Done Building Dependency Tree... Done The following extra packages will be installed: python2.3 Suggested packages: python-tk python2.3-doc The following NEW packages will be installed: python python2.3 0 upgraded, 2 newly installed, 0 to remove and 3 not upgraded. Need to get 0B/2880kB of archives. After unpacking 9351kB of additional disk space will be used. Do you want to continue? [Y/n] Y Selecting previously deselected package python2.3. (Reading database ... 22848 files and directories currently installed.) Unpacking python2.3 (from .../python2.3_2.3.1-1_i386.deb) ... Selecting previously deselected package python. Unpacking python (from .../python_2.3.1-1_all.deb) ... Setting up python (2.3.1-1) ... Setting up python2.3 (2.3.1-1) ... Compiling python modules in /usr/lib/python2.3 ... Compiling optimized python modules in /usr/lib/python2.3 ... localhost:~# exit logout localhost:~$ python Python 2.3.1 (#2, Sep 24 2003, 11:39:14) [GCC 3.3.2 20030908 (Debian prerelease)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> [press Ctrl+D to exit]
If you prefer to build from source, you can download the Python source code from http://www.python.org/ftp/python/. Select the highest version number listed, download the .tgz file), and then do the usual configure, make, make install dance.
localhost:~$ su - Password: [enter your root password] localhost:~# wget http://www.python.org/ftp/python/2.3/Python-2.3.tgz Resolving www.python.org... done. Connecting to www.python.org[194.109.137.226]:80... connected. HTTP request sent, awaiting response... 200 OK Length: 8,436,880 [application/x-tar] ... localhost:~# tar xfz Python-2.3.tgz localhost:~# cd Python-2.3 localhost:~/Python-2.3# ./configure checking MACHDEP... linux2 checking EXTRAPLATDIR... checking for --without-gcc... no ... localhost:~/Python-2.3# make gcc -pthread -c -fno-strict-aliasing -DNDEBUG -g -O3 -Wall -Wstrict-prototypes -I. -I./Include -DPy_BUILD_CORE -o Modules/python.o Modules/python.c gcc -pthread -c -fno-strict-aliasing -DNDEBUG -g -O3 -Wall -Wstrict-prototypes -I. -I./Include -DPy_BUILD_CORE -o Parser/acceler.o Parser/acceler.c gcc -pthread -c -fno-strict-aliasing -DNDEBUG -g -O3 -Wall -Wstrict-prototypes -I. -I./Include -DPy_BUILD_CORE -o Parser/grammar1.o Parser/grammar1.c ... localhost:~/Python-2.3# make install /usr/bin/install -c python /usr/local/bin/python2.3 ... localhost:~/Python-2.3# exit logout localhost:~$ which python /usr/local/bin/python localhost:~$ python Python 2.3.1 (#2, Sep 24 2003, 11:39:14) [GCC 3.3.2 20030908 (Debian prerelease)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> [press Ctrl+D to get back to the command prompt] localhost:~$
Now that you have Python installed, what's this interactive shell thing you're running?
It's like this: Python leads a double life. It's an interpreter for scripts that you can run from the command line or run like applications, by double-clicking the scripts. But it's also an interactive shell that can evaluate arbitrary statements and expressions. This is extremely useful for debugging, quick hacking, and testing. I even know some people who use the Python interactive shell in lieu of a calculator!
Launch the Python interactive shell in whatever way works on your platform, and let's dive in with the steps shown here:
>>> 1 + 12 >>> print 'hello world'
hello world >>> x = 1
>>> y = 2 >>> x + y 3
You should now have a version of Python installed that works for you.
Depending on your platform, you may have more than one version of Python intsalled. If so, you need to be aware of your paths. If simply typing python on the command line doesn't run the version of Python that you want to use, you may need to enter the full pathname of your preferred version.
Congratulations, and welcome to Python.
You know how other books go on and on about programming fundamentals and finally work up to building a complete, working program? Let's skip all that.
Here is a complete, working Python program.
It probably makes absolutely no sense to you. Don't worry about that, because you're going to dissect it line by line. But read through it first and see what, if anything, you can make of it.
If you have not already done so, you can download this and other examples used in this book.
def buildConnectionString(params): """Build a connection string from a dictionary of parameters. Returns string.""" return ";".join(["%s=%s" % (k, v) for k, v in params.items()]) if __name__ == "__main__": myParams = {"server":"mpilgrim", \ "database":"master", \ "uid":"sa", \ "pwd":"secret" \ } print buildConnectionString(myParams)
Now run this program and see what happens.
![]() |
|
In the ActivePython IDE on Windows, you can run the Python program you're editing by choosing -> (Ctrl-R). Output is displayed in the interactive window. |
![]() |
|
In the Python IDE on Mac OS, you can run a Python program with -> (Cmd-R), but there is an important option you must set first. Open the .py file in the IDE, pop up the options menu by clicking the black triangle in the upper-right corner of the window, and make sure the option is checked. This is a per-file setting, but you'll only need to do it once per file. |
![]() |
|
On UNIX-compatible systems (including Mac OS X), you can run a Python program from the command line: python odbchelper.py |
Python has functions like most other languages, but it does not have separate header files like C++ or interface/implementation sections like Pascal. When you need a function, just declare it, like this:
def buildConnectionString(params):
Note that the keyword def starts the function declaration, followed by the function name, followed by the arguments in parentheses. Multiple arguments (not shown here) are separated with commas.
Also note that the function doesn't define a return datatype. Python functions do not specify the datatype of their return value; they don't even specify whether or not they return a value. In fact, every Python function returns a value; if the function ever executes a return statement, it will return that value, otherwise it will return None, the Python null value.
![]() |
|
In Visual Basic, functions (that return a value) start with function, and subroutines (that do not return a value) start with sub. There are no subroutines in Python. Everything is a function, all functions return a value (even if it's None), and all functions start with def. |
The argument, params, doesn't specify a datatype. In Python, variables are never explicitly typed. Python figures out what type a variable is and keeps track of it internally.
![]() |
|
In Java, C++, and other statically-typed languages, you must specify the datatype of the function return value and each function argument. In Python, you never explicitly specify the datatype of anything. Based on what value you assign, Python keeps track of the datatype internally. |
An erudite reader sent me this explanation of how Python compares to other programming languages:
So Python is both dynamically typed (because it doesn't use explicit datatype declarations) and strongly typed (because once a variable has a datatype, it actually matters).
You can document a Python function by giving it a doc string.
def buildConnectionString(params): """Build a connection string from a dictionary of parameters. Returns string."""
Triple quotes signify a multi-line string. Everything between the start and end quotes is part of a single string, including carriage returns and other quote characters. You can use them anywhere, but you'll see them most often used when defining a doc string.
![]() |
|
Triple quotes are also an easy way to define a string with both single and double quotes, like qq/.../ in Perl. |
Everything between the triple quotes is the function's doc string, which documents what the function does. A doc string, if it exists, must be the first thing defined in a function (that is, the first thing after the colon). You don't technically need to give your function a doc string, but you always should. I know you've heard this in every programming class you've ever taken, but Python gives you an added incentive: the doc string is available at runtime as an attribute of the function.
![]() |
|
Many Python IDEs use the doc string to provide context-sensitive documentation, so that when you type a function name, its doc string appears as a tooltip. This can be incredibly helpful, but it's only as good as the doc strings you write. |
In case you missed it, I just said that Python functions have attributes, and that those attributes are available at runtime.
A function, like everything else in Python, is an object.
Open your favorite Python IDE and follow along:
>>> import odbchelper>>> params = {"server":"mpilgrim", "database":"master", "uid":"sa", "pwd":"secret"} >>> print odbchelper.buildConnectionString(params)
server=mpilgrim;uid=sa;database=master;pwd=secret >>> print odbchelper.buildConnectionString.__doc__
Build a connection string from a dictionary Returns string.
![]() | The first line imports the odbchelper program as a module -- a chunk of code that you can use interactively, or from a larger Python program. (You'll see examples of multi-module Python programs in Chapter 4.) Once you import a module, you can reference any of its public functions, classes, or attributes. Modules can do this to access functionality in other modules, and you can do it in the IDE too. This is an important concept, and you'll talk more about it later. |
![]() | When you want to use functions defined in imported modules, you need to include the module name. So you can't just say buildConnectionString; it must be odbchelper.buildConnectionString. If you've used classes in Java, this should feel vaguely familiar. |
![]() | Instead of calling the function as you would expect to, you asked for one of the function's attributes, __doc__. |
![]() |
|
import in Python is like require in Perl. Once you import a Python module, you access its functions with module.function; once you require a Perl module, you access its functions with module::function. |
Before you go any further, I want to briefly mention the library search path. Python looks in several places when you try to import a module. Specifically, it looks in all the directories defined in sys.path. This is just a list, and you can easily view it or modify it with standard list methods. (You'll learn more about lists later in this chapter.)
>>> import sys>>> sys.path
['', '/usr/local/lib/python2.2', '/usr/local/lib/python2.2/plat-linux2', '/usr/local/lib/python2.2/lib-dynload', '/usr/local/lib/python2.2/site-packages', '/usr/local/lib/python2.2/site-packages/PIL', '/usr/local/lib/python2.2/site-packages/piddle'] >>> sys
<module 'sys' (built-in)> >>> sys.path.append('/my/new/path')
![]() | Importing the sys module makes all of its functions and attributes available. |
![]() | sys.path is a list of directory names that constitute the current search path. (Yours will look different, depending on your operating system, what version of Python you're running, and where it was originally installed.) Python will look through these directories (in this order) for a .py file matching the module name you're trying to import. |
![]() | Actually, I lied; the truth is more complicated than that, because not all modules are stored as .py files. Some, like the sys module, are "built-in modules"; they are actually baked right into Python itself. Built-in modules behave just like regular modules, but their Python source code is not available, because they are not written in Python! (The sys module is written in C.) |
![]() | You can add a new directory to Python's search path at runtime by appending the directory name to sys.path, and then Python will look in that directory as well, whenever you try to import a module. The effect lasts as long as Python is running. (You'll talk more about append and other list methods in Chapter 3.) |
Everything in Python is an object, and almost everything has attributes and methods. All functions have a built-in attribute __doc__, which returns the doc string defined in the function's source code. The sys module is an object which has (among other things) an attribute called path. And so forth.
Still, this begs the question. What is an object? Different programming languages define “object” in different ways. In some, it means that all objects must have attributes and methods; in others, it means that all objects are subclassable. In Python, the definition is looser; some objects have neither attributes nor methods (more on this in Chapter 3), and not all objects are subclassable (more on this in Chapter 5). But everything is an object in the sense that it can be assigned to a variable or passed as an argument to a function (more in this in Chapter 4).
This is so important that I'm going to repeat it in case you missed it the first few times: everything in Python is an object. Strings are objects. Lists are objects. Functions are objects. Even modules are objects.
Python functions have no explicit begin or end, and no curly braces to mark where the function code starts and stops. The only delimiter is a colon (:) and the indentation of the code itself.
def buildConnectionString(params): """Build a connection string from a dictionary of parameters. Returns string.""" return ";".join(["%s=%s" % (k, v) for k, v in params.items()])
Code blocks are defined by their indentation. By "code block", I mean functions, if statements, for loops, while loops, and so forth. Indenting starts a block and unindenting ends it. There are no explicit braces, brackets, or keywords. This means that whitespace is significant, and must be consistent. In this example, the function code (including the doc string) is indented four spaces. It doesn't need to be four spaces, it just needs to be consistent. The first line that is not indented is outside the function.
Example 2.6, “if Statements” shows an example of code indentation with if statements.
def fib(n):print 'n =', n
if n > 1:
return n * fib(n - 1) else:
print 'end of the line' return 1
After some initial protests and several snide analogies to Fortran, you will make peace with this and start seeing its benefits. One major benefit is that all Python programs look similar, since indentation is a language requirement and not a matter of style. This makes it easier to read and understand other people's Python code.
![]() |
|
Python uses carriage returns to separate statements and a colon and indentation to separate code blocks. C++ and Java use semicolons to separate statements and curly braces to separate code blocks. |
Python modules are objects and have several useful attributes. You can use this to easily test your modules as you write them. Here's an example that uses the if __name__ trick.
Some quick observations before you get to the good stuff. First, parentheses are not required around the if expression. Second, the if statement ends with a colon, and is followed by indented code.
![]() |
|
Like C, Python uses == for comparison and = for assignment. Unlike C, Python does not support in-line assignment, so there's no chance of accidentally assigning the value you thought you were comparing. |
So why is this particular if statement a trick? Modules are objects, and all modules have a built-in attribute __name__. A module's __name__ depends on how you're using the module. If you import the module, then __name__ is the module's filename, without a directory path or file extension. But you can also run the module directly as a standalone program, in which case __name__ will be a special default value, __main__.
>>> import odbchelper >>> odbchelper.__name__ 'odbchelper'
Knowing this, you can design a test suite for your module within the module itself by putting it in this if statement. When you run the module directly, __name__ is __main__, so the test suite executes. When you import the module, __name__ is something else, so the test suite is ignored. This makes it easier to develop and debug new modules before integrating them into a larger program.
![]() |
|
On MacPython, there is an additional step to make the if __name__ trick work. Pop up the module's options menu by clicking the black triangle in the upper-right corner of the window, and make sure is checked. |
You'll get back to your first Python program in just a minute. But first, a short digression is in order, because you need to know about dictionaries, tuples, and lists (oh my!). If you're a Perl hacker, you can probably skim the bits about dictionaries and lists, but you should still pay attention to tuples.
One of Python's built-in datatypes is the dictionary, which defines one-to-one relationships between keys and values.
![]() |
|
A dictionary in Python is like a hash in Perl. In Perl, variables that store hashes always start with a % character. In Python, variables can be named anything, and Python keeps track of the datatype internally. |
![]() |
|
A dictionary in Python is like an instance of the Hashtable class in Java. |
![]() |
|
A dictionary in Python is like an instance of the Scripting.Dictionary object in Visual Basic. |
>>> d = {"server":"mpilgrim", "database":"master"}>>> d {'server': 'mpilgrim', 'database': 'master'} >>> d["server"]
'mpilgrim' >>> d["database"]
'master' >>> d["mpilgrim"]
Traceback (innermost last): File "<interactive input>", line 1, in ? KeyError: mpilgrim
>>> d {'server': 'mpilgrim', 'database': 'master'} >>> d["database"] = "pubs">>> d {'server': 'mpilgrim', 'database': 'pubs'} >>> d["uid"] = "sa"
>>> d {'server': 'mpilgrim', 'uid': 'sa', 'database': 'pubs'}
Note that the new element (key 'uid', value 'sa') appears to be in the middle. In fact, it was just a coincidence that the elements appeared to be in order in the first example; it is just as much a coincidence that they appear to be out of order now.
![]() |
|
Dictionaries have no concept of order among elements. It is incorrect to say that the elements are “out of order”; they are simply unordered. This is an important distinction that will annoy you when you want to access the elements of a dictionary in a specific, repeatable order (like alphabetical order by key). There are ways of doing this, but they're not built into the dictionary. |
When working with dictionaries, you need to be aware that dictionary keys are case-sensitive.
>>> d = {} >>> d["key"] = "value" >>> d["key"] = "other value">>> d {'key': 'other value'} >>> d["Key"] = "third value"
>>> d {'Key': 'third value', 'key': 'other value'}
>>> d {'server': 'mpilgrim', 'uid': 'sa', 'database': 'pubs'} >>> d["retrycount"] = 3>>> d {'server': 'mpilgrim', 'uid': 'sa', 'database': 'master', 'retrycount': 3} >>> d[42] = "douglas"
>>> d {'server': 'mpilgrim', 'uid': 'sa', 'database': 'master', 42: 'douglas', 'retrycount': 3}
>>> d {'server': 'mpilgrim', 'uid': 'sa', 'database': 'master', 42: 'douglas', 'retrycount': 3} >>> del d[42]>>> d {'server': 'mpilgrim', 'uid': 'sa', 'database': 'master', 'retrycount': 3} >>> d.clear()
>>> d {}
Lists are Python's workhorse datatype. If your only experience with lists is arrays in Visual Basic or (God forbid) the datastore in Powerbuilder, brace yourself for Python lists.
![]() |
|
A list in Python is like an array in Perl. In Perl, variables that store arrays always start with the @ character; in Python, variables can be named anything, and Python keeps track of the datatype internally. |
![]() |
|
A list in Python is much more than an array in Java (although it can be used as one if that's really all you want out of life). A better analogy would be to the ArrayList class, which can hold arbitrary objects and can expand dynamically as new items are added. |
>>> li = ["a", "b", "mpilgrim", "z", "example"]>>> li ['a', 'b', 'mpilgrim', 'z', 'example'] >>> li[0]
'a' >>> li[4]
'example'
>>> li ['a', 'b', 'mpilgrim', 'z', 'example'] >>> li[-1]'example' >>> li[-3]
'mpilgrim'
>>> li ['a', 'b', 'mpilgrim', 'z', 'example'] >>> li[1:3]['b', 'mpilgrim'] >>> li[1:-1]
['b', 'mpilgrim', 'z'] >>> li[0:3]
['a', 'b', 'mpilgrim']
>>> li ['a', 'b', 'mpilgrim', 'z', 'example'] >>> li[:3]['a', 'b', 'mpilgrim'] >>> li[3:]
![]()
['z', 'example'] >>> li[:]
['a', 'b', 'mpilgrim', 'z', 'example']
![]() | If the left slice index is 0, you can leave it out, and 0 is implied. So li[:3] is the same as li[0:3] from Example 3.8, “Slicing a List”. |
![]() | Similarly, if the right slice index is the length of the list, you can leave it out. So li[3:] is the same as li[3:5], because this list has five elements. |
![]() | Note the symmetry here. In this five-element list, li[:3] returns the first 3 elements, and li[3:] returns the last two elements. In fact, li[:n] will always return the first n elements, and li[n:] will return the rest, regardless of the length of the list. |
![]() | If both slice indices are left out, all elements of the list are included. But this is not the same as the original li list; it is a new list that happens to have all the same elements. li[:] is shorthand for making a complete copy of a list. |
>>> li ['a', 'b', 'mpilgrim', 'z', 'example'] >>> li.append("new")>>> li ['a', 'b', 'mpilgrim', 'z', 'example', 'new'] >>> li.insert(2, "new")
>>> li ['a', 'b', 'new', 'mpilgrim', 'z', 'example', 'new'] >>> li.extend(["two", "elements"])
>>> li ['a', 'b', 'new', 'mpilgrim', 'z', 'example', 'new', 'two', 'elements']
>>> li = ['a', 'b', 'c'] >>> li.extend(['d', 'e', 'f'])>>> li ['a', 'b', 'c', 'd', 'e', 'f'] >>> len(li)
6 >>> li[-1] 'f' >>> li = ['a', 'b', 'c'] >>> li.append(['d', 'e', 'f'])
>>> li ['a', 'b', 'c', ['d', 'e', 'f']] >>> len(li)
4 >>> li[-1] ['d', 'e', 'f']
>>> li ['a', 'b', 'new', 'mpilgrim', 'z', 'example', 'new', 'two', 'elements'] >>> li.index("example")5 >>> li.index("new")
2 >>> li.index("c")
Traceback (innermost last): File "<interactive input>", line 1, in ? ValueError: list.index(x): x not in list >>> "c" in li
False
![]() |
|
Before version 2.2.1, Python had no separate boolean datatype. To compensate for this, Python accepted almost anything in a boolean context (like an if statement), according to the following rules:
|
>>> li ['a', 'b', 'new', 'mpilgrim', 'z', 'example', 'new', 'two', 'elements'] >>> li.remove("z")>>> li ['a', 'b', 'new', 'mpilgrim', 'example', 'new', 'two', 'elements'] >>> li.remove("new")
>>> li ['a', 'b', 'mpilgrim', 'example', 'new', 'two', 'elements'] >>> li.remove("c")
Traceback (innermost last): File "<interactive input>", line 1, in ? ValueError: list.remove(x): x not in list >>> li.pop()
'elements' >>> li ['a', 'b', 'mpilgrim', 'example', 'new', 'two']
>>> li = ['a', 'b', 'mpilgrim'] >>> li = li + ['example', 'new']>>> li ['a', 'b', 'mpilgrim', 'example', 'new'] >>> li += ['two']
>>> li ['a', 'b', 'mpilgrim', 'example', 'new', 'two'] >>> li = [1, 2] * 3
>>> li [1, 2, 1, 2, 1, 2]
![]() | Lists can also be concatenated with the + operator. list = list + otherlist has the same result as list.extend(otherlist). But the + operator returns a new (concatenated) list as a value, whereas extend only alters an existing list. This means that extend is faster, especially for large lists. |
![]() | Python supports the += operator. li += ['two'] is equivalent to li.extend(['two']). The += operator works for lists, strings, and integers, and it can be overloaded to work for user-defined classes as well. (More on classes in Chapter 5.) |
![]() | The * operator works on lists as a repeater. li = [1, 2] * 3 is equivalent to li = [1, 2] + [1, 2] + [1, 2], which concatenates the three lists into one. |
A tuple is an immutable list. A tuple can not be changed in any way once it is created.
>>> t = ("a", "b", "mpilgrim", "z", "example")>>> t ('a', 'b', 'mpilgrim', 'z', 'example') >>> t[0]
'a' >>> t[-1]
'example' >>> t[1:3]
('b', 'mpilgrim')
>>> t ('a', 'b', 'mpilgrim', 'z', 'example') >>> t.append("new")Traceback (innermost last): File "<interactive input>", line 1, in ? AttributeError: 'tuple' object has no attribute 'append' >>> t.remove("z")
Traceback (innermost last): File "<interactive input>", line 1, in ? AttributeError: 'tuple' object has no attribute 'remove' >>> t.index("example")
Traceback (innermost last): File "<interactive input>", line 1, in ? AttributeError: 'tuple' object has no attribute 'index' >>> "z" in t
True
So what are tuples good for?
![]() |
|
Tuples can be converted into lists, and vice-versa. The built-in tuple function takes a list and returns a tuple with the same elements, and the list function takes a tuple and returns a list. In effect, tuple freezes a list, and list thaws a tuple. |
Now that you know something about dictionaries, tuples, and lists (oh my!), let's get back to the sample program from Chapter 2, odbchelper.py.
Python has local and global variables like most other languages, but it has no explicit variable declarations. Variables spring into existence by being assigned a value, and they are automatically destroyed when they go out of scope.
if __name__ == "__main__": myParams = {"server":"mpilgrim", \ "database":"master", \ "uid":"sa", \ "pwd":"secret" \ }
Notice the indentation. An if statement is a code block and needs to be indented just like a function.
Also notice that the variable assignment is one command split over several lines, with a backslash (“\”) serving as a line-continuation marker.
![]() |
|
When a command is split among several lines with the line-continuation marker (“\”), the continued lines can be indented in any manner; Python's normally stringent indentation rules do not apply. If your Python IDE auto-indents the continued line, you should probably accept its default unless you have a burning reason not to. |
Strictly speaking, expressions in parentheses, straight brackets, or curly braces (like defining a dictionary) can be split into multiple lines with or without the line continuation character (“\”). I like to include the backslash even when it's not required because I think it makes the code easier to read, but that's a matter of style.
Third, you never declared the variable myParams, you just assigned a value to it. This is like VBScript without the option explicit option. Luckily, unlike VBScript, Python will not allow you to reference a variable that has never been assigned a value; trying to do so will raise an exception.
>>> x Traceback (innermost last): File "<interactive input>", line 1, in ? NameError: There is no variable named 'x' >>> x = 1 >>> x 1
You will thank Python for this one day.
One of the cooler programming shortcuts in Python is using sequences to assign multiple values at once.
>>> v = ('a', 'b', 'e') >>> (x, y, z) = v>>> x 'a' >>> y 'b' >>> z 'e'
This has all sorts of uses. I often want to assign names to a range of values. In C, you would use enum and manually list each constant and its associated value, which seems especially tedious when the values are consecutive. In Python, you can use the built-in range function with multi-variable assignment to quickly assign consecutive values.
>>> range(7)[0, 1, 2, 3, 4, 5, 6] >>> (MONDAY, TUESDAY, WEDNESDAY, THURSDAY, FRIDAY, SATURDAY, SUNDAY) = range(7)
>>> MONDAY
0 >>> TUESDAY 1 >>> SUNDAY 6
You can also use multi-variable assignment to build functions that return multiple values, simply by returning a tuple of all the values. The caller can treat it as a tuple, or assign the values to individual variables. Many standard Python libraries do this, including the os module, which you'll discuss in Chapter 6.
Python supports formatting values into strings. Although this can include very complicated expressions, the most basic usage is to insert values into a string with the %s placeholder.
![]() |
|
String formatting in Python uses the same syntax as the sprintf function in C. |
>>> k = "uid" >>> v = "sa" >>> "%s=%s" % (k, v)'uid=sa'
Note that (k, v) is a tuple. I told you they were good for something.
You might be thinking that this is a lot of work just to do simple string concatentation, and you would be right, except that string formatting isn't just concatenation. It's not even just formatting. It's also type coercion.
>>> uid = "sa" >>> pwd = "secret" >>> print pwd + " is not a good password for " + uidsecret is not a good password for sa >>> print "%s is not a good password for %s" % (pwd, uid)
secret is not a good password for sa >>> userCount = 6 >>> print "Users connected: %d" % (userCount, )
![]()
Users connected: 6 >>> print "Users connected: " + userCount
Traceback (innermost last): File "<interactive input>", line 1, in ? TypeError: cannot concatenate 'str' and 'int' objects
As with printf in C, string formatting in Python is like a Swiss Army knife. There are options galore, and modifier strings to specially format many different types of values.
>>> print "Today's stock price: %f" % 50.462550.462500 >>> print "Today's stock price: %.2f" % 50.4625
50.46 >>> print "Change since yesterday: %+.2f" % 1.5
+1.50
One of the most powerful features of Python is the list comprehension, which provides a compact way of mapping a list into another list by applying a function to each of the elements of the list.
>>> li = [1, 9, 8, 4] >>> [elem*2 for elem in li][2, 18, 16, 8] >>> li
[1, 9, 8, 4] >>> li = [elem*2 for elem in li]
>>> li [2, 18, 16, 8]
Here are the list comprehensions in the buildConnectionString function that you declared in Chapter 2:
["%s=%s" % (k, v) for k, v in params.items()]
First, notice that you're calling the items function of the params dictionary. This function returns a list of tuples of all the data in the dictionary.
>>> params = {"server":"mpilgrim", "database":"master", "uid":"sa", "pwd":"secret"} >>> params.keys()['server', 'uid', 'database', 'pwd'] >>> params.values()
['mpilgrim', 'sa', 'master', 'secret'] >>> params.items()
[('server', 'mpilgrim'), ('uid', 'sa'), ('database', 'master'), ('pwd', 'secret')]
Now let's see what buildConnectionString does. It takes a list, params.items(), and maps it to a new list by applying string formatting to each element. The new list will have the same number of elements as params.items(), but each element in the new list will be a string that contains both a key and its associated value from the params dictionary.
>>> params = {"server":"mpilgrim", "database":"master", "uid":"sa", "pwd":"secret"} >>> params.items() [('server', 'mpilgrim'), ('uid', 'sa'), ('database', 'master'), ('pwd', 'secret')] >>> [k for k, v in params.items()]['server', 'uid', 'database', 'pwd'] >>> [v for k, v in params.items()]
['mpilgrim', 'sa', 'master', 'secret'] >>> ["%s=%s" % (k, v) for k, v in params.items()]
['server=mpilgrim', 'uid=sa', 'database=master', 'pwd=secret']
![]() | Note that you're using two variables to iterate through the params.items() list. This is another use of multi-variable assignment. The first element of params.items() is ('server', 'mpilgrim'), so in the first iteration of the list comprehension, k will get 'server' and v will get 'mpilgrim'. In this case, you're ignoring the value of v and only including the value of k in the returned list, so this list comprehension ends up being equivalent to params.keys(). |
![]() | Here you're doing the same thing, but ignoring the value of k, so this list comprehension ends up being equivalent to params.values(). |
![]() | Combining the previous two examples with some simple string formatting, you get a list of strings that include both the key and value of each element of the dictionary. This looks suspiciously like the output of the program. All that remains is to join the elements in this list into a single string. |
You have a list of key-value pairs in the form key=value, and you want to join them into a single string. To join any list of strings into a single string, use the join method of a string object.
Here is an example of joining a list from the buildConnectionString function:
return ";".join(["%s=%s" % (k, v) for k, v in params.items()])
One interesting note before you continue. I keep repeating that functions are objects, strings are objects... everything is an object. You might have thought I meant that string variables are objects. But no, look closely at this example and you'll see that the string ";" itself is an object, and you are calling its join method.
The join method joins the elements of the list into a single string, with each element separated by a semi-colon. The delimiter doesn't need to be a semi-colon; it doesn't even need to be a single character. It can be any string.
![]() |
|
join works only on lists of strings; it does not do any type coercion. Joining a list that has one or more non-string elements will raise an exception. |
>>> params = {"server":"mpilgrim", "database":"master", "uid":"sa", "pwd":"secret"} >>> ["%s=%s" % (k, v) for k, v in params.items()] ['server=mpilgrim', 'uid=sa', 'database=master', 'pwd=secret'] >>> ";".join(["%s=%s" % (k, v) for k, v in params.items()]) 'server=mpilgrim;uid=sa;database=master;pwd=secret'
This string is then returned from the odbchelper function and printed by the calling block, which gives you the output that you marveled at when you started reading this chapter.
You're probably wondering if there's an analogous method to split a string into a list. And of course there is, and it's called split.
>>> li = ['server=mpilgrim', 'uid=sa', 'database=master', 'pwd=secret'] >>> s = ";".join(li) >>> s 'server=mpilgrim;uid=sa;database=master;pwd=secret' >>> s.split(";")['server=mpilgrim', 'uid=sa', 'database=master', 'pwd=secret'] >>> s.split(";", 1)
['server=mpilgrim', 'uid=sa;database=master;pwd=secret']
![]() |
|
anystring.split(delimiter, 1) is a useful technique when you want to search a string for a substring and then work with everything before the substring (which ends up in the first element of the returned list) and everything after it (which ends up in the second element). |
When I first learned Python, I expected join to be a method of a list, which would take the delimiter as an argument. Many people feel the same way, and there's a story behind the join method. Prior to Python 1.6, strings didn't have all these useful methods. There was a separate string module that contained all the string functions; each function took a string as its first argument. The functions were deemed important enough to put onto the strings themselves, which made sense for functions like lower, upper, and split. But many hard-core Python programmers objected to the new join method, arguing that it should be a method of the list instead, or that it shouldn't move at all but simply stay a part of the old string module (which still has a lot of useful stuff in it). I use the new join method exclusively, but you will see code written either way, and if it really bothers you, you can use the old string.join function instead.
The odbchelper.py program and its output should now make perfect sense.
def buildConnectionString(params): """Build a connection string from a dictionary of parameters. Returns string.""" return ";".join(["%s=%s" % (k, v) for k, v in params.items()]) if __name__ == "__main__": myParams = {"server":"mpilgrim", \ "database":"master", \ "uid":"sa", \ "pwd":"secret" \ } print buildConnectionString(myParams)
Here is the output of odbchelper.py:
server=mpilgrim;uid=sa;database=master;pwd=secret
Before diving into the next chapter, make sure you're comfortable doing all of these things:
This chapter covers one of Python's strengths: introspection. As you know, everything in Python is an object, and introspection is code looking at other modules and functions in memory as objects, getting information about them, and manipulating them. Along the way, you'll define functions with no name, call functions with arguments out of order, and reference functions whose names you don't even know ahead of time.
Here is a complete, working Python program. You should understand a good deal about it just by looking at it. The numbered lines illustrate concepts covered in Chapter 2, Your First Python Program. Don't worry if the rest of the code looks intimidating; you'll learn all about it throughout this chapter.
If you have not already done so, you can download this and other examples used in this book.
def info(object, spacing=10, collapse=1):![]()
![]()
"""Print methods and doc strings. Takes module, class, list, dictionary, or string.""" methodList = [method for method in dir(object) if callable(getattr(object, method))] processFunc = collapse and (lambda s: " ".join(s.split())) or (lambda s: s) print "\n".join(["%s %s" % (method.ljust(spacing), processFunc(str(getattr(object, method).__doc__))) for method in methodList]) if __name__ == "__main__":
![]()
print info.__doc__
![]() | This module has one function, info. According to its function declaration, it takes three parameters: object, spacing, and collapse. The last two are actually optional parameters, as you'll see shortly. |
![]() | The info function has a multi-line doc string that succinctly describes the function's purpose. Note that no return value is mentioned; this function will be used solely for its effects, rather than its value. |
![]() | Code within the function is indented. |
![]() | The if __name__ trick allows this program do something useful when run by itself, without interfering with its use as a module for other programs. In this case, the program simply prints out the doc string of the info function. |
![]() | if statements use == for comparison, and parentheses are not required. |
The info function is designed to be used by you, the programmer, while working in the Python IDE. It takes any object that has functions or methods (like a module, which has functions, or a list, which has methods) and prints out the functions and their doc strings.
>>> from apihelper import info >>> li = [] >>> info(li) append L.append(object) -- append object to end count L.count(value) -> integer -- return number of occurrences of value extend L.extend(list) -- extend list by appending list elements index L.index(value) -> integer -- return index of first occurrence of value insert L.insert(index, object) -- insert object before index pop L.pop([index]) -> item -- remove and return item at index (default last) remove L.remove(value) -- remove first occurrence of value reverse L.reverse() -- reverse *IN PLACE* sort L.sort([cmpfunc]) -- sort *IN PLACE*; if given, cmpfunc(x, y) -> -1, 0, 1
By default the output is formatted to be easy to read. Multi-line doc strings are collapsed into a single long line, but this option can be changed by specifying 0 for the collapse argument. If the function names are longer than 10 characters, you can specify a larger value for the spacing argument to make the output easier to read.
>>> import odbchelper >>> info(odbchelper) buildConnectionString Build a connection string from a dictionary Returns string. >>> info(odbchelper, 30) buildConnectionString Build a connection string from a dictionary Returns string. >>> info(odbchelper, 30, 0) buildConnectionString Build a connection string from a dictionary Returns string.
Python allows function arguments to have default values; if the function is called without the argument, the argument gets its default value. Futhermore, arguments can be specified in any order by using named arguments. Stored procedures in SQL Server Transact/SQL can do this, so if you're a SQL Server scripting guru, you can skim this part.
Here is an example of info, a function with two optional arguments:
def info(object, spacing=10, collapse=1):
spacing and collapse are optional, because they have default values defined. object is required, because it has no default value. If info is called with only one argument, spacing defaults to 10 and collapse defaults to 1. If info is called with two arguments, collapse still defaults to 1.
Say you want to specify a value for collapse but want to accept the default value for spacing. In most languages, you would be out of luck, because you would need to call the function with three arguments. But in Python, arguments can be specified by name, in any order.
info(odbchelper)info(odbchelper, 12)
info(odbchelper, collapse=0)
info(spacing=15, object=odbchelper)
This looks totally whacked until you realize that arguments are simply a dictionary. The “normal” method of calling functions without argument names is actually just a shorthand where Python matches up the values with the argument names in the order they're specified in the function declaration. And most of the time, you'll call functions the “normal” way, but you always have the additional flexibility if you need it.
![]() |
|
The only thing you need to do to call a function is specify a value (somehow) for each required argument; the manner and order in which you do that is up to you. |
Python has a small set of extremely useful built-in functions. All other functions are partitioned off into modules. This was actually a conscious design decision, to keep the core language from getting bloated like other scripting languages (cough cough, Visual Basic).
The type function returns the datatype of any arbitrary object. The possible types are listed in the types module. This is useful for helper functions that can handle several types of data.
>>> type(1)<type 'int'> >>> li = [] >>> type(li)
<type 'list'> >>> import odbchelper >>> type(odbchelper)
<type 'module'> >>> import types
>>> type(odbchelper) == types.ModuleType True
The str coerces data into a string. Every datatype can be coerced into a string.
>>> str(1)'1' >>> horsemen = ['war', 'pestilence', 'famine'] >>> horsemen ['war', 'pestilence', 'famine'] >>> horsemen.append('Powerbuilder') >>> str(horsemen)
"['war', 'pestilence', 'famine', 'Powerbuilder']" >>> str(odbchelper)
"<module 'odbchelper' from 'c:\\docbook\\dip\\py\\odbchelper.py'>" >>> str(None)
'None'
At the heart of the info function is the powerful dir function. dir returns a list of the attributes and methods of any object: modules, functions, strings, lists, dictionaries... pretty much anything.
>>> li = [] >>> dir(li)['append', 'count', 'extend', 'index', 'insert', 'pop', 'remove', 'reverse', 'sort'] >>> d = {} >>> dir(d)
['clear', 'copy', 'get', 'has_key', 'items', 'keys', 'setdefault', 'update', 'values'] >>> import odbchelper >>> dir(odbchelper)
['__builtins__', '__doc__', '__file__', '__name__', 'buildConnectionString']
![]() | li is a list, so dir(li) returns a list of all the methods of a list. Note that the returned list contains the names of the methods as strings, not the methods themselves. |
![]() | d is a dictionary, so dir(d) returns a list of the names of dictionary methods. At least one of these, keys, should look familiar. |
![]() | This is where it really gets interesting. odbchelper is a module, so dir(odbchelper) returns a list of all kinds of stuff defined in the module, including built-in attributes, like __name__, __doc__, and whatever other attributes and methods you define. In this case, odbchelper has only one user-defined method, the buildConnectionString function described in Chapter 2. |
Finally, the callable function takes any object and returns True if the object can be called, or False otherwise. Callable objects include functions, class methods, even classes themselves. (More on classes in the next chapter.)
>>> import string >>> string.punctuation'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~' >>> string.join
<function join at 00C55A7C> >>> callable(string.punctuation)
False >>> callable(string.join)
True >>> print string.join.__doc__
join(list [,sep]) -> string Return a string composed of the words in list, with intervening occurrences of sep. The default separator is a single space. (joinfields and join are synonymous)
![]() | The functions in the string module are deprecated (although many people still use the join function), but the module contains a lot of useful constants like this string.punctuation, which contains all the standard punctuation characters. |
![]() | string.join is a function that joins a list of strings. |
![]() | string.punctuation is not callable; it is a string. (A string does have callable methods, but the string itself is not callable.) |
![]() | string.join is callable; it's a function that takes two arguments. |
![]() | Any callable object may have a doc string. By using the callable function on each of an object's attributes, you can determine which attributes you care about (methods, functions, classes) and which you want to ignore (constants and so on) without knowing anything about the object ahead of time. |
type, str, dir, and all the rest of Python's built-in functions are grouped into a special module called __builtin__. (That's two underscores before and after.) If it helps, you can think of Python automatically executing from __builtin__ import * on startup, which imports all the “built-in” functions into the namespace so you can use them directly.
The advantage of thinking like this is that you can access all the built-in functions and attributes as a group by getting information about the __builtin__ module. And guess what, Python has a function called info. Try it yourself and skim through the list now. We'll dive into some of the more important functions later. (Some of the built-in error classes, like AttributeError, should already look familiar.)
>>> from apihelper import info >>> import __builtin__ >>> info(__builtin__, 20) ArithmeticError Base class for arithmetic errors. AssertionError Assertion failed. AttributeError Attribute not found. EOFError Read beyond end of file. EnvironmentError Base class for I/O related errors. Exception Common base class for all exceptions. FloatingPointError Floating point operation failed. IOError I/O operation failed. [...snip...]
![]() |
|
Python comes with excellent reference manuals, which you should peruse thoroughly to learn all the modules Python has to offer. But unlike most languages, where you would find yourself referring back to the manuals or man pages to remind yourself how to use these modules, Python is largely self-documenting. |
You already know that Python functions are objects. What you don't know is that you can get a reference to a function without knowing its name until run-time, by using the getattr function.
>>> li = ["Larry", "Curly"] >>> li.pop<built-in method pop of list object at 010DF884> >>> getattr(li, "pop")
<built-in method pop of list object at 010DF884> >>> getattr(li, "append")("Moe")
>>> li ["Larry", "Curly", "Moe"] >>> getattr({}, "clear")
<built-in method clear of dictionary object at 00F113D4> >>> getattr((), "pop")
Traceback (innermost last): File "<interactive input>", line 1, in ? AttributeError: 'tuple' object has no attribute 'pop'
![]() | This gets a reference to the pop method of the list. Note that this is not calling the pop method; that would be li.pop(). This is the method itself. |
![]() | This also returns a reference to the pop method, but this time, the method name is specified as a string argument to the getattr function. getattr is an incredibly useful built-in function that returns any attribute of any object. In this case, the object is a list, and the attribute is the pop method. |
![]() | In case it hasn't sunk in just how incredibly useful this is, try this: the return value of getattr is the method, which you can then call just as if you had said li.append("Moe") directly. But you didn't call the function directly; you specified the function name as a string instead. |
![]() | getattr also works on dictionaries. |
![]() | In theory, getattr would work on tuples, except that tuples have no methods, so getattr will raise an exception no matter what attribute name you give. |
getattr isn't just for built-in datatypes. It also works on modules.
>>> import odbchelper >>> odbchelper.buildConnectionString<function buildConnectionString at 00D18DD4> >>> getattr(odbchelper, "buildConnectionString")
<function buildConnectionString at 00D18DD4> >>> object = odbchelper >>> method = "buildConnectionString" >>> getattr(object, method)
<function buildConnectionString at 00D18DD4> >>> type(getattr(object, method))
<type 'function'> >>> import types >>> type(getattr(object, method)) == types.FunctionType True >>> callable(getattr(object, method))
True
![]() | This returns a reference to the buildConnectionString function in the odbchelper module, which you studied in Chapter 2, Your First Python Program. (The hex address you see is specific to my machine; your output will be different.) |
![]() | Using getattr, you can get the same reference to the same function. In general, getattr(object, "attribute") is equivalent to object.attribute. If object is a module, then attribute can be anything defined in the module: a function, class, or global variable. |
![]() | And this is what you actually use in the info function. object is passed into the function as an argument; method is a string which is the name of a method or function. |
![]() | In this case, method is the name of a function, which you can prove by getting its type. |
![]() | Since method is a function, it is callable. |
A common usage pattern of getattr is as a dispatcher. For example, if you had a program that could output data in a variety of different formats, you could define separate functions for each output format and use a single dispatch function to call the right one.
For example, let's imagine a program that prints site statistics in HTML, XML, and plain text formats. The choice of output format could be specified on the command line, or stored in a configuration file. A statsout module defines three functions, output_html, output_xml, and output_text. Then the main program defines a single output function, like this:
import statsout def output(data, format="text"):output_function = getattr(statsout, "output_%s" % format)
return output_function(data)
![]()
Did you see the bug in the previous example? This is a very loose coupling of strings and functions, and there is no error checking. What happens if the user passes in a format that doesn't have a corresponding function defined in statsout? Well, getattr will return None, which will be assigned to output_function instead of a valid function, and the next line that attempts to call that function will crash and raise an exception. That's bad.
Luckily, getattr takes an optional third argument, a default value.
import statsout def output(data, format="text"): output_function = getattr(statsout, "output_%s" % format, statsout.output_text) return output_function(data)![]()
As you can see, getattr is quite powerful. It is the heart of introspection, and you'll see even more powerful examples of it in later chapters.
As you know, Python has powerful capabilities for mapping lists into other lists, via list comprehensions (Section 3.6, “Mapping Lists”). This can be combined with a filtering mechanism, where some elements in the list are mapped while others are skipped entirely.
Here is the list filtering syntax:
[mapping-expression for element in source-list if filter-expression]
This is an extension of the list comprehensions that you know and love. The first two thirds are the same; the last part, starting with the if, is the filter expression. A filter expression can be any expression that evaluates true or false (which in Python can be almost anything). Any element for which the filter expression evaluates true will be included in the mapping. All other elements are ignored, so they are never put through the mapping expression and are not included in the output list.
>>> li = ["a", "mpilgrim", "foo", "b", "c", "b", "d", "d"] >>> [elem for elem in li if len(elem) > 1]['mpilgrim', 'foo'] >>> [elem for elem in li if elem != "b"]
['a', 'mpilgrim', 'foo', 'c', 'd', 'd'] >>> [elem for elem in li if li.count(elem) == 1]
['a', 'mpilgrim', 'foo', 'c']
Let's get back to this line from apihelper.py:
methodList = [method for method in dir(object) if callable(getattr(object, method))]
This looks complicated, and it is complicated, but the basic structure is the same. The whole filter expression returns a list, which is assigned to the methodList variable. The first half of the expression is the list mapping part. The mapping expression is an identity expression, which it returns the value of each element. dir(object) returns a list of object's attributes and methods -- that's the list you're mapping. So the only new part is the filter expression after the if.
The filter expression looks scary, but it's not. You already know about callable, getattr, and in. As you saw in the previous section, the expression getattr(object, method) returns a function object if object is a module and method is the name of a function in that module.
So this expression takes an object (named object). Then it gets a list of the names of the object's attributes, methods, functions, and a few other things. Then it filters that list to weed out all the stuff that you don't care about. You do the weeding out by taking the name of each attribute/method/function and getting a reference to the real thing, via the getattr function. Then you check to see if that object is callable, which will be any methods and functions, both built-in (like the pop method of a list) and user-defined (like the buildConnectionString function of the odbchelper module). You don't care about other attributes, like the __name__ attribute that's built in to every module.
In Python, and and or perform boolean logic as you would expect, but they do not return boolean values; instead, they return one of the actual values they are comparing.
>>> 'a' and 'b''b' >>> '' and 'b'
'' >>> 'a' and 'b' and 'c'
'c'
![]() | When using and, values are evaluated in a boolean context from left to right. 0, '', [], (), {}, and None are false in a boolean context; everything else is true. Well, almost everything. By default, instances of classes are true in a boolean context, but you can define special methods in your class to make an instance evaluate to false. You'll learn all about classes and special methods in Chapter 5. If all values are true in a boolean context, and returns the last value. In this case, and evaluates 'a', which is true, then 'b', which is true, and returns 'b'. |
![]() | If any value is false in a boolean context, and returns the first false value. In this case, '' is the first false value. |
![]() | All values are true, so and returns the last value, 'c'. |
>>> 'a' or 'b''a' >>> '' or 'b'
'b' >>> '' or [] or {}
{} >>> def sidefx(): ... print "in sidefx()" ... return 1 >>> 'a' or sidefx()
'a'
If you're a C hacker, you are certainly familiar with the bool ? a : b expression, which evaluates to a if bool is true, and b otherwise. Because of the way and and or work in Python, you can accomplish the same thing.
>>> a = "first" >>> b = "second" >>> 1 and a or b'first' >>> 0 and a or b
'second'
However, since this Python expression is simply boolean logic, and not a special construct of the language, there is one extremely important difference between this and-or trick in Python and the bool ? a : b syntax in C. If the value of a is false, the expression will not work as you would expect it to. (Can you tell I was bitten by this? More than once?)
The and-or trick, bool and a or b, will not work like the C expression bool ? a : b when a is false in a boolean context.
The real trick behind the and-or trick, then, is to make sure that the value of a is never false. One common way of doing this is to turn a into [a] and b into [b], then taking the first element of the returned list, which will be either a or b.
>>> a = "" >>> b = "second" >>> (1 and [a] or [b])[0]''
By now, this trick may seem like more trouble than it's worth. You could, after all, accomplish the same thing with an if statement, so why go through all this fuss? Well, in many cases, you are choosing between two constant values, so you can use the simpler syntax and not worry, because you know that the a value will always be true. And even if you need to use the more complicated safe form, there are good reasons to do so. For example, there are some cases in Python where if statements are not allowed, such as in lambda functions.
Python supports an interesting syntax that lets you define one-line mini-functions on the fly. Borrowed from Lisp, these so-called lambda functions can be used anywhere a function is required.
>>> def f(x): ... return x*2 ... >>> f(3) 6 >>> g = lambda x: x*2>>> g(3) 6 >>> (lambda x: x*2)(3)
6
To generalize, a lambda function is a function that takes any number of arguments (including optional arguments) and returns the value of a single expression. lambda functions can not contain commands, and they can not contain more than one expression. Don't try to squeeze too much into a lambda function; if you need something more complex, define a normal function instead and make it as long as you want.
![]() |
|
lambda functions are a matter of style. Using them is never required; anywhere you could use them, you could define a separate normal function and use that instead. I use them in places where I want to encapsulate specific, non-reusable code without littering my code with a lot of little one-line functions. |
Here are the lambda functions in apihelper.py:
processFunc = collapse and (lambda s: " ".join(s.split())) or (lambda s: s)
Notice that this uses the simple form of the and-or trick, which is okay, because a lambda function is always true in a boolean context. (That doesn't mean that a lambda function can't return a false value. The function is always true; its return value could be anything.)
Also notice that you're using the split function with no arguments. You've already seen it used with one or two arguments, but without any arguments it splits on whitespace.
>>> s = "this is\na\ttest">>> print s this is a test >>> print s.split()
['this', 'is', 'a', 'test'] >>> print " ".join(s.split())
'this is a test'
![]() | This is a multiline string, defined by escape characters instead of triple quotes. \n is a carriage return, and \t is a tab character. |
![]() | split without any arguments splits on whitespace. So three spaces, a carriage return, and a tab character are all the same. |
![]() | You can normalize whitespace by splitting a string with split and then rejoining it with join, using a single space as a delimiter. This is what the info function does to collapse multi-line doc strings into a single line. |
So what is the info function actually doing with these lambda functions, splits, and and-or tricks?
processFunc is now a function, but which function it is depends on the value of the collapse variable. If collapse is true, processFunc(string) will collapse whitespace; otherwise, processFunc(string) will return its argument unchanged.
To do this in a less robust language, like Visual Basic, you would probably create a function that took a string and a collapse argument and used an if statement to decide whether to collapse the whitespace or not, then returned the appropriate value. This would be inefficient, because the function would need to handle every possible case. Every time you called it, it would need to decide whether to collapse whitespace before it could give you what you wanted. In Python, you can take that decision logic out of the function and define a lambda function that is custom-tailored to give you exactly (and only) what you want. This is more efficient, more elegant, and less prone to those nasty oh-I-thought-those-arguments-were-reversed kinds of errors.
The last line of code, the only one you haven't deconstructed yet, is the one that does all the work. But by now the work is easy, because everything you need is already set up just the way you need it. All the dominoes are in place; it's time to knock them down.
This is the meat of apihelper.py:
print "\n".join(["%s %s" % (method.ljust(spacing), processFunc(str(getattr(object, method).__doc__))) for method in methodList])
Note that this is one command, split over multiple lines, but it doesn't use the line continuation character (\). Remember when I said that some expressions can be split into multiple lines without using a backslash? A list comprehension is one of those expressions, since the entire expression is contained in square brackets.
Now, let's take it from the end and work backwards. The
for method in methodList
shows that this is a list comprehension. As you know, methodList is a list of all the methods you care about in object. So you're looping through that list with method.
>>> import odbchelper >>> object = odbchelper>>> method = 'buildConnectionString'
>>> getattr(object, method)
<function buildConnectionString at 010D6D74> >>> print getattr(object, method).__doc__
Build a connection string from a dictionary of parameters. Returns string.
![]() | In the info function, object is the object you're getting help on, passed in as an argument. |
![]() | As you're looping through methodList, method is the name of the current method. |
![]() | Using the getattr function, you're getting a reference to the method function in the object module. |
![]() | Now, printing the actual doc string of the method is easy. |
The next piece of the puzzle is the use of str around the doc string. As you may recall, str is a built-in function that coerces data into a string. But a doc string is always a string, so why bother with the str function? The answer is that not every function has a doc string, and if it doesn't, its __doc__ attribute is None.
>>> >>> def foo(): print 2 >>> >>> foo() 2 >>> >>> foo.__doc__>>> foo.__doc__ == None
True >>> str(foo.__doc__)
'None'
![]() |
|
In SQL, you must use IS NULL instead of = NULL to compare a null value. In Python, you can use either == None or is None, but is None is faster. |
Now that you are guaranteed to have a string, you can pass the string to processFunc, which you have already defined as a function that either does or doesn't collapse whitespace. Now you see why it was important to use str to convert a None value into a string representation. processFunc is assuming a string argument and calling its split method, which would crash if you passed it None because None doesn't have a split method.
Stepping back even further, you see that you're using string formatting again to concatenate the return value of processFunc with the return value of method's ljust method. This is a new string method that you haven't seen before.
>>> s = 'buildConnectionString' >>> s.ljust(30)'buildConnectionString ' >>> s.ljust(20)
'buildConnectionString'
You're almost finished. Given the padded method name from the ljust method and the (possibly collapsed) doc string from the call to processFunc, you concatenate the two and get a single string. Since you're mapping methodList, you end up with a list of strings. Using the join method of the string "\n", you join this list into a single string, with each element of the list on a separate line, and print the result.
That's the last piece of the puzzle. You should now understand this code.
print "\n".join(["%s %s" % (method.ljust(spacing), processFunc(str(getattr(object, method).__doc__))) for method in methodList])
The apihelper.py program and its output should now make perfect sense.
def info(object, spacing=10, collapse=1): """Print methods and doc strings. Takes module, class, list, dictionary, or string.""" methodList = [method for method in dir(object) if callable(getattr(object, method))] processFunc = collapse and (lambda s: " ".join(s.split())) or (lambda s: s) print "\n".join(["%s %s" % (method.ljust(spacing), processFunc(str(getattr(object, method).__doc__))) for method in methodList]) if __name__ == "__main__": print info.__doc__
Here is the output of apihelper.py:
>>> from apihelper import info >>> li = [] >>> info(li) append L.append(object) -- append object to end count L.count(value) -> integer -- return number of occurrences of value extend L.extend(list) -- extend list by appending list elements index L.index(value) -> integer -- return index of first occurrence of value insert L.insert(index, object) -- insert object before index pop L.pop([index]) -> item -- remove and return item at index (default last) remove L.remove(value) -- remove first occurrence of value reverse L.reverse() -- reverse *IN PLACE* sort L.sort([cmpfunc]) -- sort *IN PLACE*; if given, cmpfunc(x, y) -> -1, 0, 1
Before diving into the next chapter, make sure you're comfortable doing all of these things:
This chapter, and pretty much every chapter after this, deals with object-oriented Python programming.
Here is a complete, working Python program. Read the doc strings of the module, the classes, and the functions to get an overview of what this program does and how it works. As usual, don't worry about the stuff you don't understand; that's what the rest of the chapter is for.
If you have not already done so, you can download this and other examples used in this book.
"""Framework for getting filetype-specific metadata. Instantiate appropriate class with filename. Returned object acts like a dictionary, with key-value pairs for each piece of metadata. import fileinfo info = fileinfo.MP3FileInfo("/music/ap/mahadeva.mp3") print "\\n".join(["%s=%s" % (k, v) for k, v in info.items()]) Or use listDirectory function to get info on all files in a directory. for info in fileinfo.listDirectory("/music/ap/", [".mp3"]): ... Framework can be extended by adding classes for particular file types, e.g. HTMLFileInfo, MPGFileInfo, DOCFileInfo. Each class is completely responsible for parsing its files appropriately; see MP3FileInfo for example. """ import os import sys from UserDict import UserDict def stripnulls(data): "strip whitespace and nulls" return data.replace("\00", "").strip() class FileInfo(UserDict): "store file metadata" def __init__(self, filename=None): UserDict.__init__(self) self["name"] = filename class MP3FileInfo(FileInfo): "store ID3v1.0 MP3 tags" tagDataMap = {"title" : ( 3, 33, stripnulls), "artist" : ( 33, 63, stripnulls), "album" : ( 63, 93, stripnulls), "year" : ( 93, 97, stripnulls), "comment" : ( 97, 126, stripnulls), "genre" : (127, 128, ord)} def __parse(self, filename): "parse ID3v1.0 tags from MP3 file" self.clear() try: fsock = open(filename, "rb", 0) try: fsock.seek(-128, 2) tagdata = fsock.read(128) finally: fsock.close() if tagdata[:3] == "TAG": for tag, (start, end, parseFunc) in self.tagDataMap.items(): self[tag] = parseFunc(tagdata[start:end]) except IOError: pass def __setitem__(self, key, item): if key == "name" and item: self.__parse(item) FileInfo.__setitem__(self, key, item) def listDirectory(directory, fileExtList): "get list of file info objects for files of particular extensions" fileList = [os.path.normcase(f) for f in os.listdir(directory)] fileList = [os.path.join(directory, f) for f in fileList if os.path.splitext(f)[1] in fileExtList] def getFileInfoClass(filename, module=sys.modules[FileInfo.__module__]): "get file info class from filename extension" subclass = "%sFileInfo" % os.path.splitext(filename)[1].upper()[1:] return hasattr(module, subclass) and getattr(module, subclass) or FileInfo return [getFileInfoClass(f)(f) for f in fileList] if __name__ == "__main__": for info in listDirectory("/music/_singles/", [".mp3"]):print "\n".join(["%s=%s" % (k, v) for k, v in info.items()]) print
This is the output I got on my machine. Your output will be different, unless, by some startling coincidence, you share my exact taste in music.
album=
artist=Ghost in the Machine
title=A Time Long Forgotten (Concept
genre=31
name=/music/_singles/a_time_long_forgotten_con.mp3
year=1999
comment=http://mp3.com/ghostmachine
album=Rave Mix
artist=***DJ MARY-JANE***
title=HELLRAISER****Trance from Hell
genre=31
name=/music/_singles/hellraiser.mp3
year=2000
comment=http://mp3.com/DJMARYJANE
album=Rave Mix
artist=***DJ MARY-JANE***
title=KAIRO****THE BEST GOA
genre=31
name=/music/_singles/kairo.mp3
year=2000
comment=http://mp3.com/DJMARYJANE
album=Journeys
artist=Masters of Balance
title=Long Way Home
genre=31
name=/music/_singles/long_way_home1.mp3
year=2000
comment=http://mp3.com/MastersofBalan
album=
artist=The Cynic Project
title=Sidewinder
genre=18
name=/music/_singles/sidewinder.mp3
year=2000
comment=http://mp3.com/cynicproject
album=Digitosis@128k
artist=VXpanded
title=Spinning
genre=255
name=/music/_singles/spinning.mp3
year=2000
comment=http://mp3.com/artists/95/vxp
Python has two ways of importing modules. Both are useful, and you should know when to use each. One way, import module, you've already seen in Section 2.4, “Everything Is an Object”. The other way accomplishes the same thing, but it has subtle and important differences.
Here is the basic from module import syntax:
from UserDict import UserDict
This is similar to the import module syntax that you know and love, but with an important difference: the attributes and methods of the imported module types are imported directly into the local namespace, so they are available directly, without qualification by module name. You can import individual items or use from module import * to import everything.
![]() |
|
from module import * in Python is like use module in Perl; import module in Python is like require module in Perl. |
![]() |
|
from module import * in Python is like import module.* in Java; import module in Python is like import module in Java. |
>>> import types >>> types.FunctionType<type 'function'> >>> FunctionType
Traceback (innermost last): File "<interactive input>", line 1, in ? NameError: There is no variable named 'FunctionType' >>> from types import FunctionType
>>> FunctionType
<type 'function'>
When should you use from module import?
Other than that, it's just a matter of style, and you will see Python code written both ways.
![]() |
|
Use from module import * sparingly, because it makes it difficult to determine where a particular function or attribute came from, and that makes debugging and refactoring more difficult. |
Python is fully object-oriented: you can define your own classes, inherit from your own or built-in classes, and instantiate the classes you've defined.
Defining a class in Python is simple. As with functions, there is no separate interface definition. Just define the class and start coding. A Python class starts with the reserved word class, followed by the class name. Technically, that's all that's required, since a class doesn't need to inherit from any other class.
class Loaf:pass
![]()
![]() |
|
The pass statement in Python is like an empty set of braces ({}) in Java or C. |
Of course, realistically, most classes will be inherited from other classes, and they will define their own class methods and attributes. But as you've just seen, there is nothing that a class absolutely must have, other than a name. In particular, C++ programmers may find it odd that Python classes don't have explicit constructors and destructors. Python classes do have something similar to a constructor: the __init__ method.
from UserDict import UserDict class FileInfo(UserDict):
![]() | In Python, the ancestor of a class is simply listed in parentheses immediately after the class name. So the FileInfo class is inherited from the UserDict class (which was imported from the UserDict module). UserDict is a class that acts like a dictionary, allowing you to essentially subclass the dictionary datatype and add your own behavior. (There are similar classes UserList and UserString which allow you to subclass lists and strings.) There is a bit of black magic behind this, which you will demystify later in this chapter when you explore the UserDict class in more depth. |
![]() |
|
In Python, the ancestor of a class is simply listed in parentheses immediately after the class name. There is no special keyword like extends in Java. |
Python supports multiple inheritance. In the parentheses following the class name, you can list as many ancestor classes as you like, separated by commas.
This example shows the initialization of the FileInfo class using the __init__ method.
class FileInfo(UserDict): "store file metadata"def __init__(self, filename=None):
![]()
![]()
![]() | Classes can (and should) have doc strings too, just like modules and functions. |
![]() | __init__ is called immediately after an instance of the class is created. It would be tempting but incorrect to call this the constructor of the class. It's tempting, because it looks like a constructor (by convention, __init__ is the first method defined for the class), acts like one (it's the first piece of code executed in a newly created instance of the class), and even sounds like one (“init” certainly suggests a constructor-ish nature). Incorrect, because the object has already been constructed by the time __init__ is called, and you already have a valid reference to the new instance of the class. But __init__ is the closest thing you're going to get to a constructor in Python, and it fills much the same role. |
![]() | The first argument of every class method, including __init__, is always a reference to the current instance of the class. By convention, this argument is always named self. In the __init__ method, self refers to the newly created object; in other class methods, it refers to the instance whose method was called. Although you need to specify self explicitly when defining the method, you do not specify it when calling the method; Python will add it for you automatically. |
![]() | __init__ methods can take any number of arguments, and just like functions, the arguments can be defined with default values, making them optional to the caller. In this case, filename has a default value of None, which is the Python null value. |
![]() |
|
By convention, the first argument of any Python class method (the reference to the current instance) is called self. This argument fills the role of the reserved word this in C++ or Java, but self is not a reserved word in Python, merely a naming convention. Nonetheless, please don't call it anything but self; this is a very strong convention. |
class FileInfo(UserDict): "store file metadata" def __init__(self, filename=None): UserDict.__init__(self)self["name"] = filename
![]()
When defining your class methods, you must explicitly list self as the first argument for each method, including __init__. When you call a method of an ancestor class from within your class, you must include the self argument. But when you call your class method from outside, you do not specify anything for the self argument; you skip it entirely, and Python automatically adds the instance reference for you. I am aware that this is confusing at first; it's not really inconsistent, but it may appear inconsistent because it relies on a distinction (between bound and unbound methods) that you don't know about yet.
Whew. I realize that's a lot to absorb, but you'll get the hang of it. All Python classes work the same way, so once you learn one, you've learned them all. If you forget everything else, remember this one thing, because I promise it will trip you up:
![]() |
|
__init__ methods are optional, but when you define one, you must remember to explicitly call the ancestor's __init__ method (if it defines one). This is more generally true: whenever a descendant wants to extend the behavior of the ancestor, the descendant method must explicitly call the ancestor method at the proper time, with the proper arguments. |
Instantiating classes in Python is straightforward. To instantiate a class, simply call the class as if it were a function, passing the arguments that the __init__ method defines. The return value will be the newly created object.
>>> import fileinfo >>> f = fileinfo.FileInfo("/music/_singles/kairo.mp3")>>> f.__class__
<class fileinfo.FileInfo at 010EC204> >>> f.__doc__
'store file metadata' >>> f
{'name': '/music/_singles/kairo.mp3'}
![]() | You are creating an instance of the FileInfo class (defined in the fileinfo module) and assigning the newly created instance to the variable f. You are passing one parameter, /music/_singles/kairo.mp3, which will end up as the filename argument in FileInfo's __init__ method. |
![]() | Every class instance has a built-in attribute, __class__, which is the object's class. (Note that the representation of this includes the physical address of the instance on my machine; your representation will be different.) Java programmers may be familiar with the Class class, which contains methods like getName and getSuperclass to get metadata information about an object. In Python, this kind of metadata is available directly on the object itself through attributes like __class__, __name__, and __bases__. |
![]() | You can access the instance's doc string just as with a function or a module. All instances of a class share the same doc string. |
![]() | Remember when the __init__ method assigned its filename argument to self["name"]? Well, here's the result. The arguments you pass when you create the class instance get sent right along to the __init__ method (along with the object reference, self, which Python adds for free). |
![]() |
|
In Python, simply call a class as if it were a function to create a new instance of the class. There is no explicit new operator like C++ or Java. |
If creating new instances is easy, destroying them is even easier. In general, there is no need to explicitly free instances, because they are freed automatically when the variables assigned to them go out of scope. Memory leaks are rare in Python.
>>> def leakmem(): ... f = fileinfo.FileInfo('/music/_singles/kairo.mp3')... >>> for i in range(100): ... leakmem()
The technical term for this form of garbage collection is “reference counting”. Python keeps a list of references to every instance created. In the above example, there was only one reference to the FileInfo instance: the local variable f. When the function ends, the variable f goes out of scope, so the reference count drops to 0, and Python destroys the instance automatically.
In previous versions of Python, there were situations where reference counting failed, and Python couldn't clean up after you. If you created two instances that referenced each other (for instance, a doubly-linked list, where each node has a pointer to the previous and next node in the list), neither instance would ever be destroyed automatically because Python (correctly) believed that there is always a reference to each instance. Python 2.0 has an additional form of garbage collection called “mark-and-sweep” which is smart enough to notice this virtual gridlock and clean up circular references correctly.
As a former philosophy major, it disturbs me to think that things disappear when no one is looking at them, but that's exactly what happens in Python. In general, you can simply forget about memory management and let Python clean up after you.
As you've seen, FileInfo is a class that acts like a dictionary. To explore this further, let's look at the UserDict class in the UserDict module, which is the ancestor of the FileInfo class. This is nothing special; the class is written in Python and stored in a .py file, just like any other Python code. In particular, it's stored in the lib directory in your Python installation.
![]() |
|
In the ActivePython IDE on Windows, you can quickly open any module in your library path by selecting -> (Ctrl-L). |
class UserDict:def __init__(self, dict=None):
self.data = {}
if dict is not None: self.update(dict)
![]()
![]()
![]() | Note that UserDict is a base class, not inherited from any other class. |
![]() | This is the __init__ method that you overrode in the FileInfo class. Note that the argument list in this ancestor class is different than the descendant. That's okay; each subclass can have its own set of arguments, as long as it calls the ancestor with the correct arguments. Here the ancestor class has a way to define initial values (by passing a dictionary in the dict argument) which the FileInfo does not use. |
![]() | Python supports data attributes (called “instance variables” in Java and Powerbuilder, and “member variables” in C++). Data attributes are pieces of data held by a specific instance of a class. In this case, each instance of UserDict will have a data attribute data. To reference this attribute from code outside the class, you qualify it with the instance name, instance.data, in the same way that you qualify a function with its module name. To reference a data attribute from within the class, you use self as the qualifier. By convention, all data attributes are initialized to reasonable values in the __init__ method. However, this is not required, since data attributes, like local variables, spring into existence when they are first assigned a value. |
![]() | The update method is a dictionary duplicator: it copies all the keys and values from one dictionary to another. This does not clear the target dictionary first; if the target dictionary already has some keys, the ones from the source dictionary will be overwritten, but others will be left untouched. Think of update as a merge function, not a copy function. |
![]() | This is a syntax you may not have seen before (I haven't used it in the examples in this book). It's an if statement, but instead of having an indented block starting on the next line, there is just a single statement on the same line, after the colon. This is perfectly legal syntax, which is just a shortcut you can use when you have only one statement in a block. (It's like specifying a single statement without braces in C++.) You can use this syntax, or you can have indented code on subsequent lines, but you can't do both for the same block. |
![]() |
|
Java and Powerbuilder support function overloading by argument list, i.e. one class can have multiple methods with the same name but a different number of arguments, or arguments of different types. Other languages (most notably PL/SQL) even support function overloading by argument name; i.e. one class can have multiple methods with the same name and the same number of arguments of the same type but different argument names. Python supports neither of these; it has no form of function overloading whatsoever. Methods are defined solely by their name, and there can be only one method per class with a given name. So if a descendant class has an __init__ method, it always overrides the ancestor __init__ method, even if the descendant defines it with a different argument list. And the same rule applies to any other method. |
![]() |
|
Guido, the original author of Python, explains method overriding this way: "Derived classes may override methods of their base classes. Because methods have no special privileges when calling other methods of the same object, a method of a base class that calls another method defined in the same base class, may in fact end up calling a method of a derived class that overrides it. (For C++ programmers: all methods in Python are effectively virtual.)" If that doesn't make sense to you (it confuses the hell out of me), feel free to ignore it. I just thought I'd pass it along. |
![]() |
|
Always assign an initial value to all of an instance's data attributes in the __init__ method. It will save you hours of debugging later, tracking down AttributeError exceptions because you're referencing uninitialized (and therefore non-existent) attributes. |
def clear(self): self.data.clear()def copy(self):
if self.__class__ is UserDict:
return UserDict(self.data) import copy
return copy.copy(self) def keys(self): return self.data.keys()
def items(self): return self.data.items() def values(self): return self.data.values()
![]() | clear is a normal class method; it is publicly available to be called by anyone at any time. Notice that clear, like all class methods, has self as its first argument. (Remember that you don't include self when you call the method; it's something that Python adds for you.) Also note the basic technique of this wrapper class: store a real dictionary (data) as a data attribute, define all the methods that a real dictionary has, and have each class method redirect to the corresponding method on the real dictionary. (In case you'd forgotten, a dictionary's clear method deletes all of its keys and their associated values.) |
![]() | The copy method of a real dictionary returns a new dictionary that is an exact duplicate of the original (all the same key-value pairs). But UserDict can't simply redirect to self.data.copy, because that method returns a real dictionary, and what you want is to return a new instance that is the same class as self. |
![]() | You use the __class__ attribute to see if self is a UserDict; if so, you're golden, because you know how to copy a UserDict: just create a new UserDict and give it the real dictionary that you've squirreled away in self.data. Then you immediately return the new UserDict you don't even get to the import copy on the next line. |
![]() | If self.__class__ is not UserDict, then self must be some subclass of UserDict (like maybe FileInfo), in which case life gets trickier. UserDict doesn't know how to make an exact copy of one of its descendants; there could, for instance, be other data attributes defined in the subclass, so you would need to iterate through them and make sure to copy all of them. Luckily, Python comes with a module to do exactly this, and it's called copy. I won't go into the details here (though it's a wicked cool module, if you're ever inclined to dive into it on your own). Suffice it to say that copy can copy arbitrary Python objects, and that's how you're using it here. |
![]() | The rest of the methods are straightforward, redirecting the calls to the built-in methods on self.data. |
![]() |
|
In versions of Python prior to 2.2, you could not directly subclass built-in datatypes like strings, lists, and dictionaries. To compensate for this, Python comes with wrapper classes that mimic the behavior of these built-in datatypes: UserString, UserList, and UserDict. Using a combination of normal and special methods, the UserDict class does an excellent imitation of a dictionary. In Python 2.2 and later, you can inherit classes directly from built-in datatypes like dict. An example of this is given in the examples that come with this book, in fileinfo_fromdict.py. |
In Python, you can inherit directly from the dict built-in datatype, as shown in this example. There are three differences here compared to the UserDict version.
class FileInfo(dict):"store file metadata" def __init__(self, filename=None):
self["name"] = filename
In addition to normal class methods, there are a number of special methods that Python classes can define. Instead of being called directly by your code (like normal methods), special methods are called for you by Python in particular circumstances or when specific syntax is used.
As you saw in the previous section, normal methods go a long way towards wrapping a dictionary in a class. But normal methods alone are not enough, because there are a lot of things you can do with dictionaries besides call methods on them. For starters, you can get and set items with a syntax that doesn't include explicitly invoking methods. This is where special class methods come in: they provide a way to map non-method-calling syntax into method calls.
def __getitem__(self, key): return self.data[key]
>>> f = fileinfo.FileInfo("/music/_singles/kairo.mp3") >>> f {'name':'/music/_singles/kairo.mp3'} >>> f.__getitem__("name")'/music/_singles/kairo.mp3' >>> f["name"]
'/music/_singles/kairo.mp3'
![]() | The __getitem__ special method looks simple enough. Like the normal methods clear, keys, and values, it just redirects to the dictionary to return its value. But how does it get called? Well, you can call __getitem__ directly, but in practice you wouldn't actually do that; I'm just doing it here to show you how it works. The right way to use __getitem__ is to get Python to call it for you. |
![]() | This looks just like the syntax you would use to get a dictionary value, and in fact it returns the value you would expect. But here's the missing link: under the covers, Python has converted this syntax to the method call f.__getitem__("name"). That's why __getitem__ is a special class method; not only can you call it yourself, you can get Python to call it for you by using the right syntax. |
Of course, Python has a __setitem__ special method to go along with __getitem__, as shown in the next example.
def __setitem__(self, key, item): self.data[key] = item
>>> f {'name':'/music/_singles/kairo.mp3'} >>> f.__setitem__("genre", 31)>>> f {'name':'/music/_singles/kairo.mp3', 'genre':31} >>> f["genre"] = 32
>>> f {'name':'/music/_singles/kairo.mp3', 'genre':32}
__setitem__ is a special class method because it gets called for you, but it's still a class method. Just as easily as the __setitem__ method was defined in UserDict, you can redefine it in the descendant class to override the ancestor method. This allows you to define classes that act like dictionaries in some ways but define their own behavior above and beyond the built-in dictionary.
This concept is the basis of the entire framework you're studying in this chapter. Each file type can have a handler class that knows how to get metadata from a particular type of file. Once some attributes (like the file's name and location) are known, the handler class knows how to derive other attributes automatically. This is done by overriding the __setitem__ method, checking for particular keys, and adding additional processing when they are found.
For example, MP3FileInfo is a descendant of FileInfo. When an MP3FileInfo's name is set, it doesn't just set the name key (like the ancestor FileInfo does); it also looks in the file itself for MP3 tags and populates a whole set of keys. The next example shows how this works.
def __setitem__(self, key, item):if key == "name" and item:
self.__parse(item)
FileInfo.__setitem__(self, key, item)
![]() | Notice that this __setitem__ method is defined exactly the same way as the ancestor method. This is important, since Python will be calling the method for you, and it expects it to be defined with a certain number of arguments. (Technically speaking, the names of the arguments don't matter; only the number of arguments is important.) |
![]() | Here's the crux of the entire MP3FileInfo class: if you're assigning a value to the name key, you want to do something extra. |
![]() | The extra processing you do for names is encapsulated in the __parse method. This is another class method defined in MP3FileInfo, and when you call it, you qualify it with self. Just calling __parse would look for a normal function defined outside the class, which is not what you want. Calling self.__parse will look for a class method defined within the class. This isn't anything new; you reference data attributes the same way. |
![]() | After doing this extra processing, you want to call the ancestor method. Remember that this is never done for you in Python; you must do it manually. Note that you're calling the immediate ancestor, FileInfo, even though it doesn't have a __setitem__ method. That's okay, because Python will walk up the ancestor tree until it finds a class with the method you're calling, so this line of code will eventually find and call the __setitem__ defined in UserDict. |
![]() |
|
When accessing data attributes within a class, you need to qualify the attribute name: self.attribute. When calling other methods within a class, you need to qualify the method name: self.method. |
>>> import fileinfo >>> mp3file = fileinfo.MP3FileInfo()>>> mp3file {'name':None} >>> mp3file["name"] = "/music/_singles/kairo.mp3"
>>> mp3file {'album': 'Rave Mix', 'artist': '***DJ MARY-JANE***', 'genre': 31, 'title': 'KAIRO****THE BEST GOA', 'name': '/music/_singles/kairo.mp3', 'year': '2000', 'comment': 'http://mp3.com/DJMARYJANE'} >>> mp3file["name"] = "/music/_singles/sidewinder.mp3"
>>> mp3file {'album': '', 'artist': 'The Cynic Project', 'genre': 18, 'title': 'Sidewinder', 'name': '/music/_singles/sidewinder.mp3', 'year': '2000', 'comment': 'http://mp3.com/cynicproject'}
![]() | First, you create an instance of MP3FileInfo, without passing it a filename. (You can get away with this because the filename argument of the __init__ method is optional.) Since MP3FileInfo has no __init__ method of its own, Python walks up the ancestor tree and finds the __init__ method of FileInfo. This __init__ method manually calls the __init__ method of UserDict and then sets the name key to filename, which is None, since you didn't pass a filename. Thus, mp3file initially looks like a dictionary with one key, name, whose value is None. |
![]() | Now the real fun begins. Setting the name key of mp3file triggers the __setitem__ method on MP3FileInfo (not UserDict), which notices that you're setting the name key with a real value and calls self.__parse. Although you haven't traced through the __parse method yet, you can see from the output that it sets several other keys: album, artist, genre, title, year, and comment. |
![]() | Modifying the name key will go through the same process again: Python calls __setitem__, which calls self.__parse, which sets all the other keys. |
Python has more special methods than just __getitem__ and __setitem__. Some of them let you emulate functionality that you may not even know about.
This example shows some of the other special methods in UserDict.
def __repr__(self): return repr(self.data)def __cmp__(self, dict):
if isinstance(dict, UserDict): return cmp(self.data, dict.data) else: return cmp(self.data, dict) def __len__(self): return len(self.data)
def __delitem__(self, key): del self.data[key]
![]() | __repr__ is a special method that is called when you call repr(instance). The repr function is a built-in function that returns a string representation of an object. It works on any object, not just class instances. You're already intimately familiar with repr and you don't even know it. In the interactive window, when you type just a variable name and press the ENTER key, Python uses repr to display the variable's value. Go create a dictionary d with some data and then print repr(d) to see for yourself. |
![]() | __cmp__ is called when you compare class instances. In general, you can compare any two Python objects, not just class instances, by using ==. There are rules that define when built-in datatypes are considered equal; for instance, dictionaries are equal when they have all the same keys and values, and strings are equal when they are the same length and contain the same sequence of characters. For class instances, you can define the __cmp__ method and code the comparison logic yourself, and then you can use == to compare instances of your class and Python will call your __cmp__ special method for you. |
![]() | __len__ is called when you call len(instance). The len function is a built-in function that returns the length of an object. It works on any object that could reasonably be thought of as having a length. The len of a string is its number of characters; the len of a dictionary is its number of keys; the len of a list or tuple is its number of elements. For class instances, define the __len__ method and code the length calculation yourself, and then call len(instance) and Python will call your __len__ special method for you. |
![]() | __delitem__ is called when you call del instance[key], which you may remember as the way to delete individual items from a dictionary. When you use del on a class instance, Python calls the __delitem__ special method for you. |
![]() |
|
In Java, you determine whether two string variables reference the same physical memory location by using str1 == str2. This is called object identity, and it is written in Python as str1 is str2. To compare string values in Java, you would use str1.equals(str2); in Python, you would use str1 == str2. Java programmers who have been taught to believe that the world is a better place because == in Java compares by identity instead of by value may have a difficult time adjusting to Python's lack of such “gotchas”. |
At this point, you may be thinking, “All this work just to do something in a class that I can do with a built-in datatype.” And it's true that life would be easier (and the entire UserDict class would be unnecessary) if you could inherit from built-in datatypes like a dictionary. But even if you could, special methods would still be useful, because they can be used in any class, not just wrapper classes like UserDict.
Special methods mean that any class can store key/value pairs like a dictionary, just by defining the __setitem__ method. Any class can act like a sequence, just by defining the __getitem__ method. Any class that defines the __cmp__ method can be compared with ==. And if your class represents something that has a length, don't define a GetLength method; define the __len__ method and use len(instance).
![]() |
|
While other object-oriented languages only let you define the physical model of an object (“this object has a GetLength method”), Python's special class methods like __len__ allow you to define the logical model of an object (“this object has a length”). |
Python has a lot of other special methods. There's a whole set of them that let classes act like numbers, allowing you to add, subtract, and do other arithmetic operations on class instances. (The canonical example of this is a class that represents complex numbers, numbers with both real and imaginary components.) The __call__ method lets a class act like a function, allowing you to call a class instance directly. And there are other special methods that allow classes to have read-only and write-only data attributes; you'll talk more about those in later chapters.
You already know about data attributes, which are variables owned by a specific instance of a class. Python also supports class attributes, which are variables owned by the class itself.
class MP3FileInfo(FileInfo): "store ID3v1.0 MP3 tags" tagDataMap = {"title" : ( 3, 33, stripnulls), "artist" : ( 33, 63, stripnulls), "album" : ( 63, 93, stripnulls), "year" : ( 93, 97, stripnulls), "comment" : ( 97, 126, stripnulls), "genre" : (127, 128, ord)}
>>> import fileinfo >>> fileinfo.MP3FileInfo<class fileinfo.MP3FileInfo at 01257FDC> >>> fileinfo.MP3FileInfo.tagDataMap
{'title': (3, 33, <function stripnulls at 0260C8D4>), 'genre': (127, 128, <built-in function ord>), 'artist': (33, 63, <function stripnulls at 0260C8D4>), 'year': (93, 97, <function stripnulls at 0260C8D4>), 'comment': (97, 126, <function stripnulls at 0260C8D4>), 'album': (63, 93, <function stripnulls at 0260C8D4>)} >>> m = fileinfo.MP3FileInfo()
>>> m.tagDataMap {'title': (3, 33, <function stripnulls at 0260C8D4>), 'genre': (127, 128, <built-in function ord>), 'artist': (33, 63, <function stripnulls at 0260C8D4>), 'year': (93, 97, <function stripnulls at 0260C8D4>), 'comment': (97, 126, <function stripnulls at 0260C8D4>), 'album': (63, 93, <function stripnulls at 0260C8D4>)}
![]() |
|
In Java, both static variables (called class attributes in Python) and instance variables (called data attributes in Python) are defined immediately after the class definition (one with the static keyword, one without). In Python, only class attributes can be defined here; data attributes are defined in the __init__ method. |
Class attributes can be used as class-level constants (which is how you use them in MP3FileInfo), but they are not really constants. You can also change them.
![]() |
|
There are no constants in Python. Everything can be changed if you try hard enough. This fits with one of the core principles of Python: bad behavior should be discouraged but not banned. If you really want to change the value of None, you can do it, but don't come running to me when your code is impossible to debug. |
>>> class counter: ... count = 0... def __init__(self): ... self.__class__.count += 1
... >>> counter <class __main__.counter at 010EAECC> >>> counter.count
0 >>> c = counter() >>> c.count
1 >>> counter.count 1 >>> d = counter()
>>> d.count 2 >>> c.count 2 >>> counter.count 2
Like most languages, Python has the concept of private elements:
Unlike in most languages, whether a Python function, method, or attribute is private or public is determined entirely by its name.
If the name of a Python function, class method, or attribute starts with (but doesn't end with) two underscores, it's private; everything else is public. Python has no concept of protected class methods (accessible only in their own class and descendant classes). Class methods are either private (accessible only in their own class) or public (accessible from anywhere).
In MP3FileInfo, there are two methods: __parse and __setitem__. As you have already discussed, __setitem__ is a special method; normally, you would call it indirectly by using the dictionary syntax on a class instance, but it is public, and you could call it directly (even from outside the fileinfo module) if you had a really good reason. However, __parse is private, because it has two underscores at the beginning of its name.
![]() |
|
In Python, all special methods (like __setitem__) and built-in attributes (like __doc__) follow a standard naming convention: they both start with and end with two underscores. Don't name your own methods and attributes this way, because it will only confuse you (and others) later. |
>>> import fileinfo >>> m = fileinfo.MP3FileInfo() >>> m.__parse("/music/_singles/kairo.mp3")Traceback (innermost last): File "<interactive input>", line 1, in ? AttributeError: 'MP3FileInfo' instance has no attribute '__parse'
That's it for the hard-core object trickery. You'll see a real-world application of special class methods in Chapter 12, which uses getattr to create a proxy to a remote web service.
The next chapter will continue using this code sample to explore other Python concepts, such as exceptions, file objects, and for loops.
Before diving into the next chapter, make sure you're comfortable doing all of these things:
In this chapter, you will dive into exceptions, file objects, for loops, and the os and sys modules. If you've used exceptions in another programming language, you can skim the first section to get a sense of Python's syntax. Be sure to tune in again for file handling.
Like many other programming languages, Python has exception handling via try...except blocks.
![]() |
|
Python uses try...except to handle exceptions and raise to generate them. Java and C++ use try...catch to handle exceptions, and throw to generate them. |
Exceptions are everywhere in Python. Virtually every module in the standard Python library uses them, and Python itself will raise them in a lot of different circumstances. You've already seen them repeatedly throughout this book.
In each of these cases, you were simply playing around in the Python IDE: an error occurred, the exception was printed (depending on your IDE, perhaps in an intentionally jarring shade of red), and that was that. This is called an unhandled exception. When the exception was raised, there was no code to explicitly notice it and deal with it, so it bubbled its way back to the default behavior built in to Python, which is to spit out some debugging information and give up. In the IDE, that's no big deal, but if that happened while your actual Python program was running, the entire program would come to a screeching halt.
An exception doesn't need result in a complete program crash, though. Exceptions, when raised, can be handled. Sometimes an exception is really because you have a bug in your code (like accessing a variable that doesn't exist), but many times, an exception is something you can anticipate. If you're opening a file, it might not exist. If you're connecting to a database, it might be unavailable, or you might not have the correct security credentials to access it. If you know a line of code may raise an exception, you should handle the exception using a try...except block.
>>> fsock = open("/notthere", "r")Traceback (innermost last): File "<interactive input>", line 1, in ? IOError: [Errno 2] No such file or directory: '/notthere' >>> try: ... fsock = open("/notthere")
... except IOError:
... print "The file does not exist, exiting gracefully" ... print "This line will always print"
The file does not exist, exiting gracefully This line will always print
Exceptions may seem unfriendly (after all, if you don't catch the exception, your entire program will crash), but consider the alternative. Would you rather get back an unusable file object to a non-existent file? You'd need to check its validity somehow anyway, and if you forgot, somewhere down the line, your program would give you strange errors somewhere down the line that you would need to trace back to the source. I'm sure you've experienced this, and you know it's not fun. With exceptions, errors occur immediately, and you can handle them in a standard way at the source of the problem.
There are a lot of other uses for exceptions besides handling actual error conditions. A common use in the standard Python library is to try to import a module, and then check whether it worked. Importing a module that does not exist will raise an ImportError exception. You can use this to define multiple levels of functionality based on which modules are available at run-time, or to support multiple platforms (where platform-specific code is separated into different modules).
You can also define your own exceptions by creating a class that inherits from the built-in Exception class, and then raise your exceptions with the raise command. See the further reading section if you're interested in doing this.
The next example demonstrates how to use an exception to support platform-specific functionality. This code comes from the getpass module, a wrapper module for getting a password from the user. Getting a password is accomplished differently on UNIX, Windows, and Mac OS platforms, but this code encapsulates all of those differences.
# Bind the name getpass to the appropriate function try: import termios, TERMIOSexcept ImportError: try: import msvcrt
except ImportError: try: from EasyDialogs import AskPassword
except ImportError: getpass = default_getpass
else:
getpass = AskPassword else: getpass = win_getpass else: getpass = unix_getpass
Python has a built-in function, open, for opening a file on disk. open returns a file object, which has methods and attributes for getting information about and manipulating the opened file.
>>> f = open("/music/_singles/kairo.mp3", "rb")>>> f
<open file '/music/_singles/kairo.mp3', mode 'rb' at 010E3988> >>> f.mode
'rb' >>> f.name
'/music/_singles/kairo.mp3'
![]() | The open method can take up to three parameters: a filename, a mode, and a buffering parameter. Only the first one, the filename, is required; the other two are optional. If not specified, the file is opened for reading in text mode. Here you are opening the file for reading in binary mode. (print open.__doc__ displays a great explanation of all the possible modes.) |
![]() | The open function returns an object (by now, this should not surprise you). A file object has several useful attributes. |
![]() | The mode attribute of a file object tells you in which mode the file was opened. |
![]() | The name attribute of a file object tells you the name of the file that the file object has open. |
After you open a file, the first thing you'll want to do is read from it, as shown in the next example.
>>> f <open file '/music/_singles/kairo.mp3', mode 'rb' at 010E3988> >>> f.tell()0 >>> f.seek(-128, 2)
>>> f.tell()
7542909 >>> tagData = f.read(128)
>>> tagData 'TAGKAIRO****THE BEST GOA ***DJ MARY-JANE*** Rave Mix 2000http://mp3.com/DJMARYJANE \037' >>> f.tell()
7543037
Open files consume system resources, and depending on the file mode, other programs may not be able to access them. It's important to close files as soon as you're finished with them.
>>> f <open file '/music/_singles/kairo.mp3', mode 'rb' at 010E3988> >>> f.closedFalse >>> f.close()
>>> f <closed file '/music/_singles/kairo.mp3', mode 'rb' at 010E3988> >>> f.closed
True >>> f.seek(0)
Traceback (innermost last): File "<interactive input>", line 1, in ? ValueError: I/O operation on closed file >>> f.tell() Traceback (innermost last): File "<interactive input>", line 1, in ? ValueError: I/O operation on closed file >>> f.read() Traceback (innermost last): File "<interactive input>", line 1, in ? ValueError: I/O operation on closed file >>> f.close()
![]() | The closed attribute of a file object indicates whether the object has a file open or not. In this case, the file is still open (closed is False). |
![]() | To close a file, call the close method of the file object. This frees the lock (if any) that you were holding on the file, flushes buffered writes (if any) that the system hadn't gotten around to actually writing yet, and releases the system resources. |
![]() | The closed attribute confirms that the file is closed. |
![]() | Just because a file is closed doesn't mean that the file object ceases to exist. The variable f will continue to exist until it goes out of scope or gets manually deleted. However, none of the methods that manipulate an open file will work once the file has been closed; they all raise an exception. |
![]() | Calling close on a file object whose file is already closed does not raise an exception; it fails silently. |
Now you've seen enough to understand the file handling code in the fileinfo.py sample code from teh previous chapter. This example shows how to safely open and read from a file and gracefully handle errors.
try:fsock = open(filename, "rb", 0)
try: fsock.seek(-128, 2)
tagdata = fsock.read(128)
finally:
fsock.close() . . . except IOError:
pass
![]() | Because opening and reading files is risky and may raise an exception, all of this code is wrapped in a try...except block. (Hey, isn't standardized indentation great? This is where you start to appreciate it.) |
![]() | The open function may raise an IOError. (Maybe the file doesn't exist.) |
![]() | The seek method may raise an IOError. (Maybe the file is smaller than 128 bytes.) |
![]() | The read method may raise an IOError. (Maybe the disk has a bad sector, or it's on a network drive and the network just went down.) |
![]() | This is new: a try...finally block. Once the file has been opened successfully by the open function, you want to make absolutely sure that you close it, even if an exception is raised by the seek or read methods. That's what a try...finally block is for: code in the finally block will always be executed, even if something in the try block raises an exception. Think of it as code that gets executed on the way out, regardless of what happened before. |
![]() | At last, you handle your IOError exception. This could be the IOError exception raised by the call to open, seek, or read. Here, you really don't care, because all you're going to do is ignore it silently and continue. (Remember, pass is a Python statement that does nothing.) That's perfectly legal; “handling” an exception can mean explicitly doing nothing. It still counts as handled, and processing will continue normally on the next line of code after the try...except block. |
As you would expect, you can also write to files in much the same way that you read from them. There are two basic file modes:
Either mode will create the file automatically if it doesn't already exist, so there's never a need for any sort of fiddly "if the log file doesn't exist yet, create a new empty file just so you can open it for the first time" logic. Just open it and start writing.
>>> logfile = open('test.log', 'w')>>> logfile.write('test succeeded')
>>> logfile.close() >>> print file('test.log').read()
test succeeded >>> logfile = open('test.log', 'a')
>>> logfile.write('line 2') >>> logfile.close() >>> print file('test.log').read()
test succeededline 2
Like most other languages, Python has for loops. The only reason you haven't seen them until now is that Python is good at so many other things that you don't need them as often.
Most other languages don't have a powerful list datatype like Python, so you end up doing a lot of manual work, specifying a start, end, and step to define a range of integers or characters or other iteratable entities. But in Python, a for loop simply iterates over a list, the same way list comprehensions work.
>>> li = ['a', 'b', 'e'] >>> for s in li:... print s
a b e >>> print "\n".join(li)
a b e
![]() | The syntax for a for loop is similar to list comprehensions. li is a list, and s will take the value of each element in turn, starting from the first element. |
![]() | Like an if statement or any other indented block, a for loop can have any number of lines of code in it. |
![]() | This is the reason you haven't seen the for loop yet: you haven't needed it yet. It's amazing how often you use for loops in other languages when all you really want is a join or a list comprehension. |
Doing a “normal” (by Visual Basic standards) counter for loop is also simple.
>>> for i in range(5):... print i 0 1 2 3 4 >>> li = ['a', 'b', 'c', 'd', 'e'] >>> for i in range(len(li)):
... print li[i] a b c d e
![]() | As you saw in Example 3.20, “Assigning Consecutive Values”, range produces a list of integers, which you then loop through. I know it looks a bit odd, but it is occasionally (and I stress occasionally) useful to have a counter loop. |
![]() | Don't ever do this. This is Visual Basic-style thinking. Break out of it. Just iterate through the list, as shown in the previous example. |
for loops are not just for simple counters. They can iterate through all kinds of things. Here is an example of using a for loop to iterate through a dictionary.
>>> import os >>> for k, v in os.environ.items():![]()
... print "%s=%s" % (k, v) USERPROFILE=C:\Documents and Settings\mpilgrim OS=Windows_NT COMPUTERNAME=MPILGRIM USERNAME=mpilgrim [...snip...] >>> print "\n".join(["%s=%s" % (k, v) ... for k, v in os.environ.items()])
USERPROFILE=C:\Documents and Settings\mpilgrim OS=Windows_NT COMPUTERNAME=MPILGRIM USERNAME=mpilgrim [...snip...]
![]() | os.environ is a dictionary of the environment variables defined on your system. In Windows, these are your user and system variables accessible from MS-DOS. In UNIX, they are the variables exported in your shell's startup scripts. In Mac OS, there is no concept of environment variables, so this dictionary is empty. |
![]() | os.environ.items() returns a list of tuples: [(key1, value1), (key2, value2), ...]. The for loop iterates through this list. The first round, it assigns key1 to k and value1 to v, so k = USERPROFILE and v = C:\Documents and Settings\mpilgrim. In the second round, k gets the second key, OS, and v gets the corresponding value, Windows_NT. |
![]() | With multi-variable assignment and list comprehensions, you can replace the entire for loop with a single statement. Whether you actually do this in real code is a matter of personal coding style. I like it because it makes it clear that what I'm doing is mapping a dictionary into a list, then joining the list into a single string. Other programmers prefer to write this out as a for loop. The output is the same in either case, although this version is slightly faster, because there is only one print statement instead of many. |
Now we can look at the for loop in MP3FileInfo, from the sample fileinfo.py program introduced in Chapter 5.
tagDataMap = {"title" : ( 3, 33, stripnulls), "artist" : ( 33, 63, stripnulls), "album" : ( 63, 93, stripnulls), "year" : ( 93, 97, stripnulls), "comment" : ( 97, 126, stripnulls), "genre" : (127, 128, ord)}. . . if tagdata[:3] == "TAG": for tag, (start, end, parseFunc) in self.tagDataMap.items():
self[tag] = parseFunc(tagdata[start:end])
![]() | tagDataMap is a class attribute that defines the tags you're looking for in an MP3 file. Tags are stored in fixed-length fields. Once you read the last 128 bytes of the file, bytes 3 through 32 of those are always the song title, 33 through 62 are always the artist name, 63 through 92 are the album name, and so forth. Note that tagDataMap is a dictionary of tuples, and each tuple contains two integers and a function reference. |
![]() | This looks complicated, but it's not. The structure of the for variables matches the structure of the elements of the list returned by items. Remember that items returns a list of tuples of the form (key, value). The first element of that list is ("title", (3, 33, <function stripnulls>)), so the first time around the loop, tag gets "title", start gets 3, end gets 33, and parseFunc gets the function stripnulls. |
![]() | Now that you've extracted all the parameters for a single MP3 tag, saving the tag data is easy. You slice tagdata from start to end to get the actual data for this tag, call parseFunc to post-process the data, and assign this as the value for the key tag in the pseudo-dictionary self. After iterating through all the elements in tagDataMap, self has the values for all the tags, and you know what that looks like. |
Modules, like everything else in Python, are objects. Once imported, you can always get a reference to a module through the global dictionary sys.modules.
>>> import sys>>> print '\n'.join(sys.modules.keys())
win32api os.path os exceptions __main__ ntpath nt sys __builtin__ site signal UserDict stat
This example demonstrates how to use sys.modules.
>>> import fileinfo>>> print '\n'.join(sys.modules.keys()) win32api os.path os fileinfo exceptions __main__ ntpath nt sys __builtin__ site signal UserDict stat >>> fileinfo <module 'fileinfo' from 'fileinfo.pyc'> >>> sys.modules["fileinfo"]
<module 'fileinfo' from 'fileinfo.pyc'>
The next example shows how to use the __module__ class attribute with the sys.modules dictionary to get a reference to the module in which a class is defined.
>>> from fileinfo import MP3FileInfo >>> MP3FileInfo.__module__'fileinfo' >>> sys.modules[MP3FileInfo.__module__]
<module 'fileinfo' from 'fileinfo.pyc'>
![]() | Every Python class has a built-in class attribute __module__, which is the name of the module in which the class is defined. |
![]() | Combining this with the sys.modules dictionary, you can get a reference to the module in which a class is defined. |
Now you're ready to see how sys.modules is used in fileinfo.py, the sample program introduced in Chapter 5. This example shows that portion of the code.
def getFileInfoClass(filename, module=sys.modules[FileInfo.__module__]):"get file info class from filename extension" subclass = "%sFileInfo" % os.path.splitext(filename)[1].upper()[1:]
return hasattr(module, subclass) and getattr(module, subclass) or FileInfo
![]() | This is a function with two arguments; filename is required, but module is optional and defaults to the module that contains the FileInfo class. This looks inefficient, because you might expect Python to evaluate the sys.modules expression every time the function is called. In fact, Python evaluates default expressions only once, the first time the module is imported. As you'll see later, you never call this function with a module argument, so module serves as a function-level constant. |
![]() | You'll plow through this line later, after you dive into the os module. For now, take it on faith that subclass ends up as the name of a class, like MP3FileInfo. |
![]() | You already know about getattr, which gets a reference to an object by name. hasattr is a complementary function that checks whether an object has a particular attribute; in this case, whether a module has a particular class (although it works for any object and any attribute, just like getattr). In English, this line of code says, “If this module has the class named by subclass then return it, otherwise return the base class FileInfo.” |
The os.path module has several functions for manipulating files and directories. Here, we're looking at handling pathnames and listing the contents of a directory.
>>> import os >>> os.path.join("c:\\music\\ap\\", "mahadeva.mp3")![]()
'c:\\music\\ap\\mahadeva.mp3' >>> os.path.join("c:\\music\\ap", "mahadeva.mp3")
'c:\\music\\ap\\mahadeva.mp3' >>> os.path.expanduser("~")
'c:\\Documents and Settings\\mpilgrim\\My Documents' >>> os.path.join(os.path.expanduser("~"), "Python")
'c:\\Documents and Settings\\mpilgrim\\My Documents\\Python'
![]() | os.path is a reference to a module -- which module depends on your platform. Just as getpass encapsulates differences between platforms by setting getpass to a platform-specific function, os encapsulates differences between platforms by setting path to a platform-specific module. |
![]() | The join function of os.path constructs a pathname out of one or more partial pathnames. In this case, it simply concatenates strings. (Note that dealing with pathnames on Windows is annoying because the backslash character must be escaped.) |
![]() | In this slightly less trivial case, join will add an extra backslash to the pathname before joining it to the filename. I was overjoyed when I discovered this, since addSlashIfNecessary is one of the stupid little functions I always need to write when building up my toolbox in a new language. Do not write this stupid little function in Python; smart people have already taken care of it for you. |
![]() | expanduser will expand a pathname that uses ~ to represent the current user's home directory. This works on any platform where users have a home directory, like Windows, UNIX, and Mac OS X; it has no effect on Mac OS. |
![]() | Combining these techniques, you can easily construct pathnames for directories and files under the user's home directory. |
>>> os.path.split("c:\\music\\ap\\mahadeva.mp3")('c:\\music\\ap', 'mahadeva.mp3') >>> (filepath, filename) = os.path.split("c:\\music\\ap\\mahadeva.mp3")
>>> filepath
'c:\\music\\ap' >>> filename
'mahadeva.mp3' >>> (shortname, extension) = os.path.splitext(filename)
>>> shortname 'mahadeva' >>> extension '.mp3'
![]() | The split function splits a full pathname and returns a tuple containing the path and filename. Remember when I said you could use multi-variable assignment to return multiple values from a function? Well, split is such a function. |
![]() | You assign the return value of the split function into a tuple of two variables. Each variable receives the value of the corresponding element of the returned tuple. |
![]() | The first variable, filepath, receives the value of the first element of the tuple returned from split, the file path. |
![]() | The second variable, filename, receives the value of the second element of the tuple returned from split, the filename. |
![]() | os.path also contains a function splitext, which splits a filename and returns a tuple containing the filename and the file extension. You use the same technique to assign each of them to separate variables. |
>>> os.listdir("c:\\music\\_singles\\")['a_time_long_forgotten_con.mp3', 'hellraiser.mp3', 'kairo.mp3', 'long_way_home1.mp3', 'sidewinder.mp3', 'spinning.mp3'] >>> dirname = "c:\\" >>> os.listdir(dirname)
['AUTOEXEC.BAT', 'boot.ini', 'CONFIG.SYS', 'cygwin', 'docbook', 'Documents and Settings', 'Incoming', 'Inetpub', 'IO.SYS', 'MSDOS.SYS', 'Music', 'NTDETECT.COM', 'ntldr', 'pagefile.sys', 'Program Files', 'Python20', 'RECYCLER', 'System Volume Information', 'TEMP', 'WINNT'] >>> [f for f in os.listdir(dirname) ... if os.path.isfile(os.path.join(dirname, f))]
['AUTOEXEC.BAT', 'boot.ini', 'CONFIG.SYS', 'IO.SYS', 'MSDOS.SYS', 'NTDETECT.COM', 'ntldr', 'pagefile.sys'] >>> [f for f in os.listdir(dirname) ... if os.path.isdir(os.path.join(dirname, f))]
['cygwin', 'docbook', 'Documents and Settings', 'Incoming', 'Inetpub', 'Music', 'Program Files', 'Python20', 'RECYCLER', 'System Volume Information', 'TEMP', 'WINNT']
![]() | The listdir function takes a pathname and returns a list of the contents of the directory. |
![]() | listdir returns both files and folders, with no indication of which is which. |
![]() | You can use list filtering and the isfile function of the os.path module to separate the files from the folders. isfile takes a pathname and returns 1 if the path represents a file, and 0 otherwise. Here you're using os.path.join to ensure a full pathname, but isfile also works with a partial path, relative to the current working directory. You can use os.getcwd() to get the current working directory. |
![]() | os.path also has a isdir function which returns 1 if the path represents a directory, and 0 otherwise. You can use this to get a list of the subdirectories within a directory. |
def listDirectory(directory, fileExtList): "get list of file info objects for files of particular extensions" fileList = [os.path.normcase(f) for f in os.listdir(directory)]![]()
fileList = [os.path.join(directory, f) for f in fileList if os.path.splitext(f)[1] in fileExtList]
![]()
![]()
![]() |
|
Whenever possible, you should use the functions in os and os.path for file, directory, and path manipulations. These modules are wrappers for platform-specific modules, so functions like os.path.split work on UNIX, Windows, Mac OS, and any other platform supported by Python. |
There is one other way to get the contents of a directory. It's very powerful, and it uses the sort of wildcards that you may already be familiar with from working on the command line.
>>> os.listdir("c:\\music\\_singles\\")['a_time_long_forgotten_con.mp3', 'hellraiser.mp3', 'kairo.mp3', 'long_way_home1.mp3', 'sidewinder.mp3', 'spinning.mp3'] >>> import glob >>> glob.glob('c:\\music\\_singles\\*.mp3')
['c:\\music\\_singles\\a_time_long_forgotten_con.mp3', 'c:\\music\\_singles\\hellraiser.mp3', 'c:\\music\\_singles\\kairo.mp3', 'c:\\music\\_singles\\long_way_home1.mp3', 'c:\\music\\_singles\\sidewinder.mp3', 'c:\\music\\_singles\\spinning.mp3'] >>> glob.glob('c:\\music\\_singles\\s*.mp3')
['c:\\music\\_singles\\sidewinder.mp3', 'c:\\music\\_singles\\spinning.mp3'] >>> glob.glob('c:\\music\\*\\*.mp3')
![]()
Once again, all the dominoes are in place. You've seen how each line of code works. Now let's step back and see how it all fits together.
def listDirectory(directory, fileExtList):"get list of file info objects for files of particular extensions" fileList = [os.path.normcase(f) for f in os.listdir(directory)] fileList = [os.path.join(directory, f) for f in fileList if os.path.splitext(f)[1] in fileExtList]
def getFileInfoClass(filename, module=sys.modules[FileInfo.__module__]):
"get file info class from filename extension" subclass = "%sFileInfo" % os.path.splitext(filename)[1].upper()[1:]
return hasattr(module, subclass) and getattr(module, subclass) or FileInfo
return [getFileInfoClass(f)(f) for f in fileList]
![]() | listDirectory is the main attraction of this entire module. It takes a directory (like c:\music\_singles\ in my case) and a list of interesting file extensions (like ['.mp3']), and it returns a list of class instances that act like dictionaries that contain metadata about each interesting file in that directory. And it does it in just a few straightforward lines of code. |
![]() | As you saw in the previous section, this line of code gets a list of the full pathnames of all the files in directory that have an interesting file extension (as specified by fileExtList). |
![]() | Old-school Pascal programmers may be familiar with them, but most people give me a blank stare when I tell them that Python supports nested functions -- literally, a function within a function. The nested function getFileInfoClass can be called only from the function in which it is defined, listDirectory. As with any other function, you don't need an interface declaration or anything fancy; just define the function and code it. |
![]() | Now that you've seen the os module, this line should make more sense. It gets the extension of the file (os.path.splitext(filename)[1]), forces it to uppercase (.upper()), slices off the dot ([1:]), and constructs a class name out of it with string formatting. So c:\music\ap\mahadeva.mp3 becomes .mp3 becomes .MP3 becomes MP3 becomes MP3FileInfo. |
![]() | Having constructed the name of the handler class that would handle this file, you check to see if that handler class actually exists in this module. If it does, you return the class, otherwise you return the base class FileInfo. This is a very important point: this function returns a class. Not an instance of a class, but the class itself. |
![]() | For each file in the “interesting files” list (fileList), you call getFileInfoClass with the filename (f). Calling getFileInfoClass(f) returns a class; you don't know exactly which class, but you don't care. You then create an instance of this class (whatever it is) and pass the filename (f again), to the __init__ method. As you saw earlier in this chapter, the __init__ method of FileInfo sets self["name"], which triggers __setitem__, which is overridden in the descendant (MP3FileInfo) to parse the file appropriately to pull out the file's metadata. You do all that for each interesting file and return a list of the resulting instances. |
Note that listDirectory is completely generic. It doesn't know ahead of time which types of files it will be getting, or which classes are defined that could potentially handle those files. It inspects the directory for the files to process, and then introspects its own module to see what special handler classes (like MP3FileInfo) are defined. You can extend this program to handle other types of files simply by defining an appropriately-named class: HTMLFileInfo for HTML files, DOCFileInfo for Word .doc files, and so forth. listDirectory will handle them all, without modification, by handing off the real work to the appropriate classes and collating the results.
The fileinfo.py program introduced in Chapter 5 should now make perfect sense.
"""Framework for getting filetype-specific metadata. Instantiate appropriate class with filename. Returned object acts like a dictionary, with key-value pairs for each piece of metadata. import fileinfo info = fileinfo.MP3FileInfo("/music/ap/mahadeva.mp3") print "\\n".join(["%s=%s" % (k, v) for k, v in info.items()]) Or use listDirectory function to get info on all files in a directory. for info in fileinfo.listDirectory("/music/ap/", [".mp3"]): ... Framework can be extended by adding classes for particular file types, e.g. HTMLFileInfo, MPGFileInfo, DOCFileInfo. Each class is completely responsible for parsing its files appropriately; see MP3FileInfo for example. """ import os import sys from UserDict import UserDict def stripnulls(data): "strip whitespace and nulls" return data.replace("\00", "").strip() class FileInfo(UserDict): "store file metadata" def __init__(self, filename=None): UserDict.__init__(self) self["name"] = filename class MP3FileInfo(FileInfo): "store ID3v1.0 MP3 tags" tagDataMap = {"title" : ( 3, 33, stripnulls), "artist" : ( 33, 63, stripnulls), "album" : ( 63, 93, stripnulls), "year" : ( 93, 97, stripnulls), "comment" : ( 97, 126, stripnulls), "genre" : (127, 128, ord)} def __parse(self, filename): "parse ID3v1.0 tags from MP3 file" self.clear() try: fsock = open(filename, "rb", 0) try: fsock.seek(-128, 2) tagdata = fsock.read(128) finally: fsock.close() if tagdata[:3] == "TAG": for tag, (start, end, parseFunc) in self.tagDataMap.items(): self[tag] = parseFunc(tagdata[start:end]) except IOError: pass def __setitem__(self, key, item): if key == "name" and item: self.__parse(item) FileInfo.__setitem__(self, key, item) def listDirectory(directory, fileExtList): "get list of file info objects for files of particular extensions" fileList = [os.path.normcase(f) for f in os.listdir(directory)] fileList = [os.path.join(directory, f) for f in fileList if os.path.splitext(f)[1] in fileExtList] def getFileInfoClass(filename, module=sys.modules[FileInfo.__module__]): "get file info class from filename extension" subclass = "%sFileInfo" % os.path.splitext(filename)[1].upper()[1:] return hasattr(module, subclass) and getattr(module, subclass) or FileInfo return [getFileInfoClass(f)(f) for f in fileList] if __name__ == "__main__": for info in listDirectory("/music/_singles/", [".mp3"]): print "\n".join(["%s=%s" % (k, v) for k, v in info.items()]) print
Before diving into the next chapter, make sure you're comfortable doing the following things:
Regular expressions are a powerful and standardized way of searching, replacing, and parsing text with complex patterns of characters. If you've used regular expressions in other languages (like Perl), the syntax will be very familiar, and you get by just reading the summary of the re module to get an overview of the available functions and their arguments.
Strings have methods for searching (index, find, and count), replacing (replace), and parsing (split), but they are limited to the simplest of cases. The search methods look for a single, hard-coded substring, and they are always case-sensitive. To do case-insensitive searches of a string s, you must call s.lower() or s.upper() and make sure your search strings are the appropriate case to match. The replace and split methods have the same limitations.
If what you're trying to do can be accomplished with string functions, you should use them. They're fast and simple and easy to read, and there's a lot to be said for fast, simple, readable code. But if you find yourself using a lot of different string functions with if statements to handle special cases, or if you're combining them with split and join and list comprehensions in weird unreadable ways, you may need to move up to regular expressions.
Although the regular expression syntax is tight and unlike normal code, the result can end up being more readable than a hand-rolled solution that uses a long chain of string functions. There are even ways of embedding comments within regular expressions to make them practically self-documenting.
This series of examples was inspired by a real-life problem I had in my day job several years ago, when I needed to scrub and standardize street addresses exported from a legacy system before importing them into a newer system. (See, I don't just make this stuff up; it's actually useful.) This example shows how I approached the problem.
>>> s = '100 NORTH MAIN ROAD' >>> s.replace('ROAD', 'RD.')'100 NORTH MAIN RD.' >>> s = '100 NORTH BROAD ROAD' >>> s.replace('ROAD', 'RD.')
'100 NORTH BRD. RD.' >>> s[:-4] + s[-4:].replace('ROAD', 'RD.')
'100 NORTH BROAD RD.' >>> import re
>>> re.sub('ROAD$', 'RD.', s)
![]()
'100 NORTH BROAD RD.'
Continuing with my story of scrubbing addresses, I soon discovered that the previous example, matching 'ROAD' at the end of the address, was not good enough, because not all addresses included a street designation at all; some just ended with the street name. Most of the time, I got away with it, but if the street name was 'BROAD', then the regular expression would match 'ROAD' at the end of the string as part of the word 'BROAD', which is not what I wanted.
>>> s = '100 BROAD' >>> re.sub('ROAD$', 'RD.', s) '100 BRD.' >>> re.sub('\\bROAD$', 'RD.', s)'100 BROAD' >>> re.sub(r'\bROAD$', 'RD.', s)
'100 BROAD' >>> s = '100 BROAD ROAD APT. 3' >>> re.sub(r'\bROAD$', 'RD.', s)
'100 BROAD ROAD APT. 3' >>> re.sub(r'\bROAD\b', 'RD.', s)
'100 BROAD RD. APT 3'
You've most likely seen Roman numerals, even if you didn't recognize them. You may have seen them in copyrights of old movies and television shows (“Copyright MCMXLVI” instead of “Copyright 1946”), or on the dedication walls of libraries or universities (“established MDCCCLXXXVIII” instead of “established 1888”). You may also have seen them in outlines and bibliographical references. It's a system of representing numbers that really does date back to the ancient Roman empire (hence the name).
In Roman numerals, there are seven characters that are repeated and combined in various ways to represent numbers.
The following are some general rules for constructing Roman numerals:
What would it take to validate that an arbitrary string is a valid Roman numeral? Let's take it one digit at a time. Since Roman numerals are always written highest to lowest, let's start with the highest: the thousands place. For numbers 1000 and higher, the thousands are represented by a series of M characters.
>>> import re >>> pattern = '^M?M?M?$'>>> re.search(pattern, 'M')
<SRE_Match object at 0106FB58> >>> re.search(pattern, 'MM')
<SRE_Match object at 0106C290> >>> re.search(pattern, 'MMM')
<SRE_Match object at 0106AA38> >>> re.search(pattern, 'MMMM')
>>> re.search(pattern, '')
<SRE_Match object at 0106F4A8>
The hundreds place is more difficult than the thousands, because there are several mutually exclusive ways it could be expressed, depending on its value.
So there are four possible patterns:
The last two patterns can be combined:
This example shows how to validate the hundreds place of a Roman numeral.
>>> import re >>> pattern = '^M?M?M?(CM|CD|D?C?C?C?)$'>>> re.search(pattern, 'MCM')
<SRE_Match object at 01070390> >>> re.search(pattern, 'MD')
<SRE_Match object at 01073A50> >>> re.search(pattern, 'MMMCCC')
<SRE_Match object at 010748A8> >>> re.search(pattern, 'MCMC')
>>> re.search(pattern, '')
<SRE_Match object at 01071D98>
Whew! See how quickly regular expressions can get nasty? And you've only covered the thousands and hundreds places of Roman numerals. But if you followed all that, the tens and ones places are easy, because they're exactly the same pattern. But let's look at another way to express the pattern.
In the previous section, you were dealing with a pattern where the same character could be repeated up to three times. There is another way to express this in regular expressions, which some people find more readable. First look at the method we already used in the previous example.
>>> import re >>> pattern = '^M?M?M?$' >>> re.search(pattern, 'M')<_sre.SRE_Match object at 0x008EE090> >>> pattern = '^M?M?M?$' >>> re.search(pattern, 'MM')
<_sre.SRE_Match object at 0x008EEB48> >>> pattern = '^M?M?M?$' >>> re.search(pattern, 'MMM')
<_sre.SRE_Match object at 0x008EE090> >>> re.search(pattern, 'MMMM')
>>>
>>> pattern = '^M{0,3}$'>>> re.search(pattern, 'M')
<_sre.SRE_Match object at 0x008EEB48> >>> re.search(pattern, 'MM')
<_sre.SRE_Match object at 0x008EE090> >>> re.search(pattern, 'MMM')
<_sre.SRE_Match object at 0x008EEDA8> >>> re.search(pattern, 'MMMM')
>>>
![]() |
|
There is no way to programmatically determine that two regular expressions are equivalent. The best you can do is write a lot of test cases to make sure they behave the same way on all relevant inputs. You'll talk more about writing test cases later in this book. |
Now let's expand the Roman numeral regular expression to cover the tens and ones place. This example shows the check for tens.
>>> pattern = '^M?M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)$' >>> re.search(pattern, 'MCMXL')<_sre.SRE_Match object at 0x008EEB48> >>> re.search(pattern, 'MCML')
<_sre.SRE_Match object at 0x008EEB48> >>> re.search(pattern, 'MCMLX')
<_sre.SRE_Match object at 0x008EEB48> >>> re.search(pattern, 'MCMLXXX')
<_sre.SRE_Match object at 0x008EEB48> >>> re.search(pattern, 'MCMLXXXX')
>>>
The expression for the ones place follows the same pattern. I'll spare you the details and show you the end result.
>>> pattern = '^M?M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$'
So what does that look like using this alternate {n,m} syntax? This example shows the new syntax.
>>> pattern = '^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$' >>> re.search(pattern, 'MDLV')<_sre.SRE_Match object at 0x008EEB48> >>> re.search(pattern, 'MMDCLXVI')
<_sre.SRE_Match object at 0x008EEB48> >>> re.search(pattern, 'MMMMDCCCLXXXVIII')
<_sre.SRE_Match object at 0x008EEB48> >>> re.search(pattern, 'I')
<_sre.SRE_Match object at 0x008EEB48>
If you followed all that and understood it on the first try, you're doing better than I did. Now imagine trying to understand someone else's regular expressions, in the middle of a critical function of a large program. Or even imagine coming back to your own regular expressions a few months later. I've done it, and it's not a pretty sight.
In the next section you'll explore an alternate syntax that can help keep your expressions maintainable.
So far you've just been dealing with what I'll call “compact” regular expressions. As you've seen, they are difficult to read, and even if you figure out what one does, that's no guarantee that you'll be able to understand it six months later. What you really need is inline documentation.
Python allows you to do this with something called verbose regular expressions. A verbose regular expression is different from a compact regular expression in two ways:
This will be more clear with an example. Let's revisit the compact regular expression you've been working with, and make it a verbose regular expression. This example shows how.
>>> pattern = """ ^ # beginning of string M{0,4} # thousands - 0 to 4 M's (CM|CD|D?C{0,3}) # hundreds - 900 (CM), 400 (CD), 0-300 (0 to 3 C's), # or 500-800 (D, followed by 0 to 3 C's) (XC|XL|L?X{0,3}) # tens - 90 (XC), 40 (XL), 0-30 (0 to 3 X's), # or 50-80 (L, followed by 0 to 3 X's) (IX|IV|V?I{0,3}) # ones - 9 (IX), 4 (IV), 0-3 (0 to 3 I's), # or 5-8 (V, followed by 0 to 3 I's) $ # end of string """ >>> re.search(pattern, 'M', re.VERBOSE)<_sre.SRE_Match object at 0x008EEB48> >>> re.search(pattern, 'MCMLXXXIX', re.VERBOSE)
<_sre.SRE_Match object at 0x008EEB48> >>> re.search(pattern, 'MMMMDCCCLXXXVIII', re.VERBOSE)
<_sre.SRE_Match object at 0x008EEB48> >>> re.search(pattern, 'M')
![]()
![]() | The most important thing to remember when using verbose regular expressions is that you need to pass an extra argument when working with them: re.VERBOSE is a constant defined in the re module that signals that the pattern should be treated as a verbose regular expression. As you can see, this pattern has quite a bit of whitespace (all of which is ignored), and several comments (all of which are ignored). Once you ignore the whitespace and the comments, this is exactly the same regular expression as you saw in the previous section, but it's a lot more readable. |
![]() | This matches the start of the string, then one of a possible four M, then CM, then L and three of a possible three X, then IX, then the end of the string. |
![]() | This matches the start of the string, then four of a possible four M, then D and three of a possible three C, then L and three of a possible three X, then V and three of a possible three I, then the end of the string. |
![]() | This does not match. Why? Because it doesn't have the re.VERBOSE flag, so the re.search function is treating the pattern as a compact regular expression, with significant whitespace and literal hash marks. Python can't auto-detect whether a regular expression is verbose or not. Python assumes every regular expression is compact unless you explicitly state that it is verbose. |
So far you've concentrated on matching whole patterns. Either the pattern matches, or it doesn't. But regular expressions are much more powerful than that. When a regular expression does match, you can pick out specific pieces of it. You can find out what matched where.
This example came from another real-world problem I encountered, again from a previous day job. The problem: parsing an American phone number. The client wanted to be able to enter the number free-form (in a single field), but then wanted to store the area code, trunk, number, and optionally an extension separately in the company's database. I scoured the Web and found many examples of regular expressions that purported to do this, but none of them were permissive enough.
Here are the phone numbers I needed to be able to accept:
Quite a variety! In each of these cases, I need to know that the area code was 800, the trunk was 555, and the rest of the phone number was 1212. For those with an extension, I need to know that the extension was 1234.
Let's work through developing a solution for phone number parsing. This example shows the first step.
>>> phonePattern = re.compile(r'^(\d{3})-(\d{3})-(\d{4})$')>>> phonePattern.search('800-555-1212').groups()
('800', '555', '1212') >>> phonePattern.search('800-555-1212-1234')
>>>
![]() | Always read regular expressions from left to right. This one matches the beginning of the string, and then (\d{3}). What's \d{3}? Well, the {3} means “match exactly three numeric digits”; it's a variation on the {n,m} syntax you saw earlier. \d means “any numeric digit” (0 through 9). Putting it in parentheses means “match exactly three numeric digits, and then remember them as a group that I can ask for later”. Then match a literal hyphen. Then match another group of exactly three digits. Then another literal hyphen. Then another group of exactly four digits. Then match the end of the string. |
![]() | To get access to the groups that the regular expression parser remembered along the way, use the groups() method on the object that the search function returns. It will return a tuple of however many groups were defined in the regular expression. In this case, you defined three groups, one with three digits, one with three digits, and one with four digits. |
![]() | This regular expression is not the final answer, because it doesn't handle a phone number with an extension on the end. For that, you'll need to expand the regular expression. |
>>> phonePattern = re.compile(r'^(\d{3})-(\d{3})-(\d{4})-(\d+)$')>>> phonePattern.search('800-555-1212-1234').groups()
('800', '555', '1212', '1234') >>> phonePattern.search('800 555 1212 1234')
>>> >>> phonePattern.search('800-555-1212')
>>>
The next example shows the regular expression to handle separators between the different parts of the phone number.
>>> phonePattern = re.compile(r'^(\d{3})\D+(\d{3})\D+(\d{4})\D+(\d+)$')>>> phonePattern.search('800 555 1212 1234').groups()
('800', '555', '1212', '1234') >>> phonePattern.search('800-555-1212-1234').groups()
('800', '555', '1212', '1234') >>> phonePattern.search('80055512121234')
>>> >>> phonePattern.search('800-555-1212')
>>>
The next example shows the regular expression for handling phone numbers without separators.
>>> phonePattern = re.compile(r'^(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')>>> phonePattern.search('80055512121234').groups()
('800', '555', '1212', '1234') >>> phonePattern.search('800.555.1212 x1234').groups()
('800', '555', '1212', '1234') >>> phonePattern.search('800-555-1212').groups()
('800', '555', '1212', '') >>> phonePattern.search('(800)5551212 x1234')
>>>
The next example shows how to handle leading characters in phone numbers.
>>> phonePattern = re.compile(r'^\D*(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')>>> phonePattern.search('(800)5551212 ext. 1234').groups()
('800', '555', '1212', '1234') >>> phonePattern.search('800-555-1212').groups()
('800', '555', '1212', '') >>> phonePattern.search('work 1-(800) 555.1212 #1234')
>>>
Let's back up for a second. So far the regular expressions have all matched from the beginning of the string. But now you see that there may be an indeterminate amount of stuff at the beginning of the string that you want to ignore. Rather than trying to match it all just so you can skip over it, let's take a different approach: don't explicitly match the beginning of the string at all. This approach is shown in the next example.
>>> phonePattern = re.compile(r'(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')>>> phonePattern.search('work 1-(800) 555.1212 #1234').groups()
('800', '555', '1212', '1234') >>> phonePattern.search('800-555-1212')
('800', '555', '1212', '') >>> phonePattern.search('80055512121234')
('800', '555', '1212', '1234')
See how quickly a regular expression can get out of control? Take a quick glance at any of the previous iterations. Can you tell the difference between one and the next?
While you still understand the final answer (and it is the final answer; if you've discovered a case it doesn't handle, I don't want to know about it), let's write it out as a verbose regular expression, before you forget why you made the choices you made.
>>> phonePattern = re.compile(r''' # don't match beginning of string, number can start anywhere (\d{3}) # area code is 3 digits (e.g. '800') \D* # optional separator is any number of non-digits (\d{3}) # trunk is 3 digits (e.g. '555') \D* # optional separator (\d{4}) # rest of number is 4 digits (e.g. '1212') \D* # optional separator (\d*) # extension is optional and can be any number of digits $ # end of string ''', re.VERBOSE) >>> phonePattern.search('work 1-(800) 555.1212 #1234').groups()('800', '555', '1212', '1234') >>> phonePattern.search('800-555-1212')
('800', '555', '1212', '')
This is just the tiniest tip of the iceberg of what regular expressions can do. In other words, even though you're completely overwhelmed by them now, believe me, you ain't seen nothing yet.
You should now be familiar with the following techniques:
Regular expressions are extremely powerful, but they are not the correct solution for every problem. You should learn enough about them to know when they are appropriate, when they will solve your problems, and when they will cause more problems than they solve.
Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems. | ||
--Jamie Zawinski, in comp.emacs.xemacs |
I often see questions on comp.lang.python like “How can I list all the [headers|images|links] in my HTML document?” “How do I parse/translate/munge the text of my HTML document but leave the tags alone?” “How can I add/remove/quote attributes of all my HTML tags at once?” This chapter will answer all of these questions.
Here is a complete, working Python program in two parts. The first part, BaseHTMLProcessor.py, is a generic tool to help you process HTML files by walking through the tags and text blocks. The second part, dialect.py, is an example of how to use BaseHTMLProcessor.py to translate the text of an HTML document but leave the tags alone. Read the doc strings and comments to get an overview of what's going on. Most of it will seem like black magic, because it's not obvious how any of these class methods ever get called. Don't worry, all will be revealed in due time.
If you have not already done so, you can download this and other examples used in this book.
from sgmllib import SGMLParser import htmlentitydefs class BaseHTMLProcessor(SGMLParser): def reset(self): # extend (called by SGMLParser.__init__) self.pieces = [] SGMLParser.reset(self) def unknown_starttag(self, tag, attrs): # called for each start tag # attrs is a list of (attr, value) tuples # e.g. for <pre class="screen">, tag="pre", attrs=[("class", "screen")] # Ideally we would like to reconstruct original tag and attributes, but # we may end up quoting attribute values that weren't quoted in the source # document, or we may change the type of quotes around the attribute value # (single to double quotes). # Note that improperly embedded non-HTML code (like client-side Javascript) # may be parsed incorrectly by the ancestor, causing runtime script errors. # All non-HTML code must be enclosed in HTML comment tags (<!-- code -->) # to ensure that it will pass through this parser unaltered (in handle_comment). strattrs = "".join([' %s="%s"' % (key, value) for key, value in attrs]) self.pieces.append("<%(tag)s%(strattrs)s>" % locals()) def unknown_endtag(self, tag): # called for each end tag, e.g. for </pre>, tag will be "pre" # Reconstruct the original end tag. self.pieces.append("</%(tag)s>" % locals()) def handle_charref(self, ref): # called for each character reference, e.g. for " ", ref will be "160" # Reconstruct the original character reference. self.pieces.append("&#%(ref)s;" % locals()) def handle_entityref(self, ref): # called for each entity reference, e.g. for "©", ref will be "copy" # Reconstruct the original entity reference. self.pieces.append("&%(ref)s" % locals()) # standard HTML entities are closed with a semicolon; other entities are not if htmlentitydefs.entitydefs.has_key(ref): self.pieces.append(";") def handle_data(self, text): # called for each block of plain text, i.e. outside of any tag and # not containing any character or entity references # Store the original text verbatim. self.pieces.append(text) def handle_comment(self, text): # called for each HTML comment, e.g. <!-- insert Javascript code here --> # Reconstruct the original comment. # It is especially important that the source document enclose client-side # code (like Javascript) within comments so it can pass through this # processor undisturbed; see comments in unknown_starttag for details. self.pieces.append("<!--%(text)s-->" % locals()) def handle_pi(self, text): # called for each processing instruction, e.g. <?instruction> # Reconstruct original processing instruction. self.pieces.append("<?%(text)s>" % locals()) def handle_decl(self, text): # called for the DOCTYPE, if present, e.g. # <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" # "http://www.w3.org/TR/html4/loose.dtd"> # Reconstruct original DOCTYPE self.pieces.append("<!%(text)s>" % locals()) def output(self): """Return processed HTML as a single string""" return "".join(self.pieces)
import re from BaseHTMLProcessor import BaseHTMLProcessor class Dialectizer(BaseHTMLProcessor): subs = () def reset(self): # extend (called from __init__ in ancestor) # Reset all data attributes self.verbatim = 0 BaseHTMLProcessor.reset(self) def start_pre(self, attrs): # called for every <pre> tag in HTML source # Increment verbatim mode count, then handle tag like normal self.verbatim += 1 self.unknown_starttag("pre", attrs) def end_pre(self): # called for every </pre> tag in HTML source # Decrement verbatim mode count self.unknown_endtag("pre") self.verbatim -= 1 def handle_data(self, text): # override # called for every block of text in HTML source # If in verbatim mode, save text unaltered; # otherwise process the text with a series of substitutions self.pieces.append(self.verbatim and text or self.process(text)) def process(self, text): # called from handle_data # Process text block by performing series of regular expression # substitutions (actual substitions are defined in descendant) for fromPattern, toPattern in self.subs: text = re.sub(fromPattern, toPattern, text) return text class ChefDialectizer(Dialectizer): """convert HTML to Swedish Chef-speak based on the classic chef.x, copyright (c) 1992, 1993 John Hagerman """ subs = ((r'a([nu])', r'u\1'), (r'A([nu])', r'U\1'), (r'a\B', r'e'), (r'A\B', r'E'), (r'en\b', r'ee'), (r'\Bew', r'oo'), (r'\Be\b', r'e-a'), (r'\be', r'i'), (r'\bE', r'I'), (r'\Bf', r'ff'), (r'\Bir', r'ur'), (r'(\w*?)i(\w*?)$', r'\1ee\2'), (r'\bow', r'oo'), (r'\bo', r'oo'), (r'\bO', r'Oo'), (r'the', r'zee'), (r'The', r'Zee'), (r'th\b', r't'), (r'\Btion', r'shun'), (r'\Bu', r'oo'), (r'\BU', r'Oo'), (r'v', r'f'), (r'V', r'F'), (r'w', r'w'), (r'W', r'W'), (r'([a-z])[.]', r'\1. Bork Bork Bork!')) class FuddDialectizer(Dialectizer): """convert HTML to Elmer Fudd-speak""" subs = ((r'[rl]', r'w'), (r'qu', r'qw'), (r'th\b', r'f'), (r'th', r'd'), (r'n[.]', r'n, uh-hah-hah-hah.')) class OldeDialectizer(Dialectizer): """convert HTML to mock Middle English""" subs = ((r'i([bcdfghjklmnpqrstvwxyz])e\b', r'y\1'), (r'i([bcdfghjklmnpqrstvwxyz])e', r'y\1\1e'), (r'ick\b', r'yk'), (r'ia([bcdfghjklmnpqrstvwxyz])', r'e\1e'), (r'e[ea]([bcdfghjklmnpqrstvwxyz])', r'e\1e'), (r'([bcdfghjklmnpqrstvwxyz])y', r'\1ee'), (r'([bcdfghjklmnpqrstvwxyz])er', r'\1re'), (r'([aeiou])re\b', r'\1r'), (r'ia([bcdfghjklmnpqrstvwxyz])', r'i\1e'), (r'tion\b', r'cioun'), (r'ion\b', r'ioun'), (r'aid', r'ayde'), (r'ai', r'ey'), (r'ay\b', r'y'), (r'ay', r'ey'), (r'ant', r'aunt'), (r'ea', r'ee'), (r'oa', r'oo'), (r'ue', r'e'), (r'oe', r'o'), (r'ou', r'ow'), (r'ow', r'ou'), (r'\bhe', r'hi'), (r've\b', r'veth'), (r'se\b', r'e'), (r"'s\b", r'es'), (r'ic\b', r'ick'), (r'ics\b', r'icc'), (r'ical\b', r'ick'), (r'tle\b', r'til'), (r'll\b', r'l'), (r'ould\b', r'olde'), (r'own\b', r'oune'), (r'un\b', r'onne'), (r'rry\b', r'rye'), (r'est\b', r'este'), (r'pt\b', r'pte'), (r'th\b', r'the'), (r'ch\b', r'che'), (r'ss\b', r'sse'), (r'([wybdp])\b', r'\1e'), (r'([rnt])\b', r'\1\1e'), (r'from', r'fro'), (r'when', r'whan')) def translate(url, dialectName="chef"): """fetch URL and translate using dialect dialect in ("chef", "fudd", "olde")""" import urllib sock = urllib.urlopen(url) htmlSource = sock.read() sock.close() parserName = "%sDialectizer" % dialectName.capitalize() parserClass = globals()[parserName] parser = parserClass() parser.feed(htmlSource) parser.close() return parser.output() def test(url): """test all dialects against URL""" for dialect in ("chef", "fudd", "olde"): outfile = "%s.html" % dialect fsock = open(outfile, "wb") fsock.write(translate(url, dialect)) fsock.close() import webbrowser webbrowser.open_new(outfile) if __name__ == "__main__": test("http://diveintopython.org/odbchelper_list.html")
Running this script will translate Section 3.2, “Introducing Lists” into mock Swedish Chef-speak (from The Muppets), mock Elmer Fudd-speak (from Bugs Bunny cartoons), and mock Middle English (loosely based on Chaucer's The Canterbury Tales). If you look at the HTML source of the output pages, you'll see that all the HTML tags and attributes are untouched, but the text between the tags has been “translated” into the mock language. If you look closer, you'll see that, in fact, only the titles and paragraphs were translated; the code listings and screen examples were left untouched.
<div class="abstract"> <p>Lists awe <span class="application">Pydon</span>'s wowkhowse datatype. If youw onwy expewience wif wists is awways in <span class="application">Visuaw Basic</span> ow (God fowbid) de datastowe in <span class="application">Powewbuiwdew</span>, bwace youwsewf fow <span class="application">Pydon</span> wists.</p> </div>
HTML processing is broken into three steps: breaking down the HTML into its constituent pieces, fiddling with the pieces, and reconstructing the pieces into HTML again. The first step is done by sgmllib.py, a part of the standard Python library.
The key to understanding this chapter is to realize that HTML is not just text, it is structured text. The structure is derived from the more-or-less-hierarchical sequence of start tags and end tags. Usually you don't work with HTML this way; you work with it textually in a text editor, or visually in a web browser or web authoring tool. sgmllib.py presents HTML structurally.
sgmllib.py contains one important class: SGMLParser. SGMLParser parses HTML into useful pieces, like start tags and end tags. As soon as it succeeds in breaking down some data into a useful piece, it calls a method on itself based on what it found. In order to use the parser, you subclass the SGMLParser class and override these methods. This is what I meant when I said that it presents HTML structurally: the structure of the HTML determines the sequence of method calls and the arguments passed to each method.
SGMLParser parses HTML into 8 kinds of data, and calls a separate method for each of them:
![]() |
|
Python 2.0 had a bug where SGMLParser would not recognize declarations at all (handle_decl would never be called), which meant that DOCTYPEs were silently ignored. This is fixed in Python 2.1. |
sgmllib.py comes with a test suite to illustrate this. You can run sgmllib.py, passing the name of an HTML file on the command line, and it will print out the tags and other elements as it parses them. It does this by subclassing the SGMLParser class and defining unknown_starttag, unknown_endtag, handle_data and other methods which simply print their arguments.
![]() |
|
In the ActivePython IDE on Windows, you can specify command line arguments in the “Run script” dialog. Separate multiple arguments with spaces. |
Here is a snippet from the table of contents of the HTML version of this book. Of course your paths may vary. (If you haven't downloaded the HTML version of the book, you can do so at http://diveintopython.org/.
c:\python23\lib> type "c:\downloads\diveintopython\html\toc\index.html"
<!DOCTYPE html
PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<title>Dive Into Python</title>
<link rel="stylesheet" href="diveintopython.css" type="text/css">
... rest of file omitted for brevity ...
Running this through the test suite of sgmllib.py yields this output:
c:\python23\lib> python sgmllib.py "c:\downloads\diveintopython\html\toc\index.html" data: '\n\n' start tag: <html lang="en" > data: '\n ' start tag: <head> data: '\n ' start tag: <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" > data: '\n \n ' start tag: <title> data: 'Dive Into Python' end tag: </title> data: '\n ' start tag: <link rel="stylesheet" href="diveintopython.css" type="text/css" > data: '\n ' ... rest of output omitted for brevity ...
Here's the roadmap for the rest of the chapter:
Along the way, you'll also learn about locals, globals, and dictionary-based string formatting.
To extract data from HTML documents, subclass the SGMLParser class and define methods for each tag or entity you want to capture.
The first step to extracting data from an HTML document is getting some HTML. If you have some HTML lying around on your hard drive, you can use file functions to read it, but the real fun begins when you get HTML from live web pages.
>>> import urllib>>> sock = urllib.urlopen("http://diveintopython.org/")
>>> htmlSource = sock.read()
>>> sock.close()
>>> print htmlSource
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head> <meta http-equiv='Content-Type' content='text/html; charset=ISO-8859-1'> <title>Dive Into Python</title> <link rel='stylesheet' href='diveintopython.css' type='text/css'> <link rev='made' href='mailto:mark@diveintopython.org'> <meta name='keywords' content='Python, Dive Into Python, tutorial, object-oriented, programming, documentation, book, free'> <meta name='description' content='a free Python tutorial for experienced programmers'> </head> <body bgcolor='white' text='black' link='#0000FF' vlink='#840084' alink='#0000FF'> <table cellpadding='0' cellspacing='0' border='0' width='100%'> <tr><td class='header' width='1%' valign='top'>diveintopython.org</td> <td width='99%' align='right'><hr size='1' noshade></td></tr> <tr><td class='tagline' colspan='2'>Python for experienced programmers</td></tr> [...snip...]
![]() | The urllib module is part of the standard Python library. It contains functions for getting information about and actually retrieving data from Internet-based URLs (mainly web pages). |
![]() | The simplest use of urllib is to retrieve the entire text of a web page using the urlopen function. Opening a URL is similar to opening a file. The return value of urlopen is a file-like object, which has some of the same methods as a file object. |
![]() | The simplest thing to do with the file-like object returned by urlopen is read, which reads the entire HTML of the web page into a single string. The object also supports readlines, which reads the text line by line into a list. |
![]() | When you're done with the object, make sure to close it, just like a normal file object. |
![]() | You now have the complete HTML of the home page of http://diveintopython.org/ in a string, and you're ready to parse it. |
If you have not already done so, you can download this and other examples used in this book.
from sgmllib import SGMLParser class URLLister(SGMLParser): def reset(self):SGMLParser.reset(self) self.urls = [] def start_a(self, attrs):
href = [v for k, v in attrs if k=='href']
![]()
if href: self.urls.extend(href)
![]() | reset is called by the __init__ method of SGMLParser, and it can also be called manually once an instance of the parser has been created. So if you need to do any initialization, do it in reset, not in __init__, so that it will be re-initialized properly when someone re-uses a parser instance. |
![]() | start_a is called by SGMLParser whenever it finds an <a> tag. The tag may contain an href attribute, and/or other attributes, like name or title. The attrs parameter is a list of tuples, [(attribute, value), (attribute, value), ...]. Or it may be just an <a>, a valid (if useless) HTML tag, in which case attrs would be an empty list. |
![]() | You can find out whether this <a> tag has an href attribute with a simple multi-variable list comprehension. |
![]() | String comparisons like k=='href' are always case-sensitive, but that's safe in this case, because SGMLParser converts attribute names to lowercase while building attrs. |
>>> import urllib, urllister >>> usock = urllib.urlopen("http://diveintopython.org/") >>> parser = urllister.URLLister() >>> parser.feed(usock.read())>>> usock.close()
>>> parser.close()
>>> for url in parser.urls: print url
toc/index.html #download #languages toc/index.html appendix/history.html download/diveintopython-html-5.0.zip download/diveintopython-pdf-5.0.zip download/diveintopython-word-5.0.zip download/diveintopython-text-5.0.zip download/diveintopython-html-flat-5.0.zip download/diveintopython-xml-5.0.zip download/diveintopython-common-5.0.zip ... rest of output omitted for brevity ...
![]() | Call the feed method, defined in SGMLParser, to get HTML into the parser.[1] It takes a string, which is what usock.read() returns. |
![]() | Like files, you should close your URL objects as soon as you're done with them. |
![]() | You should close your parser object, too, but for a different reason. You've read all the data and fed it to the parser, but the feed method isn't guaranteed to have actually processed all the HTML you give it; it may buffer it, waiting for more. Be sure to call close to flush the buffer and force everything to be fully parsed. |
![]() | Once the parser is closed, the parsing is complete, and parser.urls contains a list of all the linked URLs in the HTML document. (Your output may look different, if the download links have been updated by the time you read this.) |
SGMLParser doesn't produce anything by itself. It parses and parses and parses, and it calls a method for each interesting thing it finds, but the methods don't do anything. SGMLParser is an HTML consumer: it takes HTML and breaks it down into small, structured pieces. As you saw in the previous section, you can subclass SGMLParser to define classes that catch specific tags and produce useful things, like a list of all the links on a web page. Now you'll take this one step further by defining a class that catches everything SGMLParser throws at it and reconstructs the complete HTML document. In technical terms, this class will be an HTML producer.
BaseHTMLProcessor subclasses SGMLParser and provides all 8 essential handler methods: unknown_starttag, unknown_endtag, handle_charref, handle_entityref, handle_comment, handle_pi, handle_decl, and handle_data.
class BaseHTMLProcessor(SGMLParser): def reset(self):self.pieces = [] SGMLParser.reset(self) def unknown_starttag(self, tag, attrs):
strattrs = "".join([' %s="%s"' % (key, value) for key, value in attrs]) self.pieces.append("<%(tag)s%(strattrs)s>" % locals()) def unknown_endtag(self, tag):
self.pieces.append("</%(tag)s>" % locals()) def handle_charref(self, ref):
self.pieces.append("&#%(ref)s;" % locals()) def handle_entityref(self, ref):
self.pieces.append("&%(ref)s" % locals()) if htmlentitydefs.entitydefs.has_key(ref): self.pieces.append(";") def handle_data(self, text):
self.pieces.append(text) def handle_comment(self, text):
self.pieces.append("<!--%(text)s-->" % locals()) def handle_pi(self, text):
self.pieces.append("<?%(text)s>" % locals()) def handle_decl(self, text): self.pieces.append("<!%(text)s>" % locals())
![]() | reset, called by SGMLParser.__init__, initializes self.pieces as an empty list before calling the ancestor method. self.pieces is a data attribute which will hold the pieces of the HTML document you're constructing. Each handler method will reconstruct the HTML that SGMLParser parsed, and each method will append that string to self.pieces. Note that self.pieces is a list. You might be tempted to define it as a string and just keep appending each piece to it. That would work, but Python is much more efficient at dealing with lists.[2] |
![]() | Since BaseHTMLProcessor does not define any methods for specific tags (like the start_a method in URLLister), SGMLParser will call unknown_starttag for every start tag. This method takes the tag (tag) and the list of attribute name/value pairs (attrs), reconstructs the original HTML, and appends it to self.pieces. The string formatting here is a little strange; you'll untangle that (and also the odd-looking locals function) later in this chapter. |
![]() | Reconstructing end tags is much simpler; just take the tag name and wrap it in the </...> brackets. |
![]() | When SGMLParser finds a character reference, it calls handle_charref with the bare reference. If the HTML document contains the reference  , ref will be 160. Reconstructing the original complete character reference just involves wrapping ref in &#...; characters. |
![]() | Entity references are similar to character references, but without the hash mark. Reconstructing the original entity reference requires wrapping ref in &...; characters. (Actually, as an erudite reader pointed out to me, it's slightly more complicated than this. Only certain standard HTML entites end in a semicolon; other similar-looking entities do not. Luckily for us, the set of standard HTML entities is defined in a dictionary in a Python module called htmlentitydefs. Hence the extra if statement.) |
![]() | Blocks of text are simply appended to self.pieces unaltered. |
![]() | HTML comments are wrapped in <!--...--> characters. |
![]() | Processing instructions are wrapped in <?...> characters. |
![]() |
|
The HTML specification requires that all non-HTML (like client-side JavaScript) must be enclosed in HTML comments, but not all web pages do this properly (and all modern web browsers are forgiving if they don't). BaseHTMLProcessor is not forgiving; if script is improperly embedded, it will be parsed as if it were HTML. For instance, if the script contains less-than and equals signs, SGMLParser may incorrectly think that it has found tags and attributes. SGMLParser always converts tags and attribute names to lowercase, which may break the script, and BaseHTMLProcessor always encloses attribute values in double quotes (even if the original HTML document used single quotes or no quotes), which will certainly break the script. Always protect your client-side script within HTML comments. |
def output(self):"""Return processed HTML as a single string""" return "".join(self.pieces)
Let's digress from HTML processing for a minute and talk about how Python handles variables. Python has two built-in functions, locals and globals, which provide dictionary-based access to local and global variables.
Remember locals? You first saw it here:
def unknown_starttag(self, tag, attrs): strattrs = "".join([' %s="%s"' % (key, value) for key, value in attrs]) self.pieces.append("<%(tag)s%(strattrs)s>" % locals())
No, wait, you can't learn about locals yet. First, you need to learn about namespaces. This is dry stuff, but it's important, so pay attention.
Python uses what are called namespaces to keep track of variables. A namespace is just like a dictionary where the keys are names of variables and the dictionary values are the values of those variables. In fact, you can access a namespace as a Python dictionary, as you'll see in a minute.
At any particular point in a Python program, there are several namespaces available. Each function has its own namespace, called the local namespace, which keeps track of the function's variables, including function arguments and locally defined variables. Each module has its own namespace, called the global namespace, which keeps track of the module's variables, including functions, classes, any other imported modules, and module-level variables and constants. And there is the built-in namespace, accessible from any module, which holds built-in functions and exceptions.
When a line of code asks for the value of a variable x, Python will search for that variable in all the available namespaces, in order:
If Python doesn't find x in any of these namespaces, it gives up and raises a NameError with the message There is no variable named 'x', which you saw back in Example 3.18, “Referencing an Unbound Variable”, but you didn't appreciate how much work Python was doing before giving you that error.
![]() |
|
Python 2.2 introduced a subtle but important change that affects the namespace search order: nested scopes. In versions of Python prior to 2.2, when you reference a variable within a nested function or lambda function, Python will search for that variable in the current (nested or lambda) function's namespace, then in the module's namespace. Python 2.2 will search for the variable in the current (nested or lambda) function's namespace, then in the parent function's namespace, then in the module's namespace. Python 2.1 can work either way; by default, it works like Python 2.0, but you can add the following line of code at the top of your module to make your module work like Python 2.2:from __future__ import nested_scopes |
Are you confused yet? Don't despair! This is really cool, I promise. Like many things in Python, namespaces are directly accessible at run-time. How? Well, the local namespace is accessible via the built-in locals function, and the global (module level) namespace is accessible via the built-in globals function.
>>> def foo(arg):... x = 1 ... print locals() ... >>> foo(7)
{'arg': 7, 'x': 1} >>> foo('bar')
{'arg': 'bar', 'x': 1}
What locals does for the local (function) namespace, globals does for the global (module) namespace. globals is more exciting, though, because a module's namespace is more exciting.[3] Not only does the module's namespace include module-level variables and constants, it includes all the functions and classes defined in the module. Plus, it includes anything that was imported into the module.
Remember the difference between from module import and import module? With import module, the module itself is imported, but it retains its own namespace, which is why you need to use the module name to access any of its functions or attributes: module.function. But with from module import, you're actually importing specific functions and attributes from another module into your own namespace, which is why you access them directly without referencing the original module they came from. With the globals function, you can actually see this happen.
Look at the following block of code at the bottom of BaseHTMLProcessor.py:
if __name__ == "__main__": for k, v in globals().items():print k, "=", v
![]() | Just so you don't get intimidated, remember that you've seen all this before. The globals function returns a dictionary, and you're iterating through the dictionary using the items method and multi-variable assignment. The only thing new here is the globals function. |
Now running the script from the command line gives this output (note that your output may be slightly different, depending on your platform and where you installed Python):
c:\docbook\dip\py> python BaseHTMLProcessor.py
SGMLParser = sgmllib.SGMLParserhtmlentitydefs = <module 'htmlentitydefs' from 'C:\Python23\lib\htmlentitydefs.py'>
BaseHTMLProcessor = __main__.BaseHTMLProcessor
__name__ = __main__
... rest of output omitted for brevity...
![]() | SGMLParser was imported from sgmllib, using from module import. That means that it was imported directly into the module's namespace, and here it is. |
![]() | Contrast this with htmlentitydefs, which was imported using import. That means that the htmlentitydefs module itself is in the namespace, but the entitydefs variable defined within htmlentitydefs is not. |
![]() | This module only defines one class, BaseHTMLProcessor, and here it is. Note that the value here is the class itself, not a specific instance of the class. |
![]() | Remember the if __name__ trick? When running a module (as opposed to importing it from another module), the built-in __name__ attribute is a special value, __main__. Since you ran this module as a script from the command line, __name__ is __main__, which is why the little test code to print the globals got executed. |
![]() |
|
Using the locals and globals functions, you can get the value of arbitrary variables dynamically, providing the variable name as a string. This mirrors the functionality of the getattr function, which allows you to access arbitrary functions dynamically by providing the function name as a string. |
There is one other important difference between the locals and globals functions, which you should learn now before it bites you. It will bite you anyway, but at least then you'll remember learning it.
def foo(arg): x = 1 print locals()locals()["x"] = 2
print "x=",x
z = 7 print "z=",z foo(3) globals()["z"] = 8
print "z=",z
![]()
Why did you learn about locals and globals? So you can learn about dictionary-based string formatting. As you recall, regular string formatting provides an easy way to insert values into strings. Values are listed in a tuple and inserted in order into the string in place of each formatting marker. While this is efficient, it is not always the easiest code to read, especially when multiple values are being inserted. You can't simply scan through the string in one pass and understand what the result will be; you're constantly switching between reading the string and reading the tuple of values.
There is an alternative form of string formatting that uses dictionaries instead of tuples of values.
>>> params = {"server":"mpilgrim", "database":"master", "uid":"sa", "pwd":"secret"} >>> "%(pwd)s" % params'secret' >>> "%(pwd)s is not a good password for %(uid)s" % params
'secret is not a good password for sa' >>> "%(database)s of mind, %(database)s of body" % params
'master of mind, master of body'
So why would you use dictionary-based string formatting? Well, it does seem like overkill to set up a dictionary of keys and values simply to do string formatting in the next line; it's really most useful when you happen to have a dictionary of meaningful keys and values already. Like locals.
def handle_comment(self, text): self.pieces.append("<!--%(text)s-->" % locals())![]()
def unknown_starttag(self, tag, attrs): strattrs = "".join([' %s="%s"' % (key, value) for key, value in attrs])self.pieces.append("<%(tag)s%(strattrs)s>" % locals())
![]()
![]() | When this method is called, attrs is a list of key/value tuples, just like the items of a dictionary, which means you can use multi-variable assignment to iterate through it. This should be a familiar pattern by now, but there's a lot going on here, so let's break it down:
|
![]() | Now, using dictionary-based string formatting, you insert the value of tag and strattrs into a string. So if tag is 'a', the final result would be '<a href="index.html" title="Go to home page">', and that is what gets appended to self.pieces. |
![]() |
|
Using dictionary-based string formatting with locals is a convenient way of making complex string formatting expressions more readable, but it comes with a price. There is a slight performance hit in making the call to locals, since locals builds a copy of the local namespace. |
A common question on comp.lang.python is “I have a bunch of HTML documents with unquoted attribute values, and I want to properly quote them all. How can I do this?”[4] (This is generally precipitated by a project manager who has found the HTML-is-a-standard religion joining a large project and proclaiming that all pages must validate against an HTML validator. Unquoted attribute values are a common violation of the HTML standard.) Whatever the reason, unquoted attribute values are easy to fix by feeding HTML through BaseHTMLProcessor.
BaseHTMLProcessor consumes HTML (since it's descended from SGMLParser) and produces equivalent HTML, but the HTML output is not identical to the input. Tags and attribute names will end up in lowercase, even if they started in uppercase or mixed case, and attribute values will be enclosed in double quotes, even if they started in single quotes or with no quotes at all. It is this last side effect that you can take advantage of.
>>> htmlSource = """... <html> ... <head> ... <title>Test page</title> ... </head> ... <body> ... <ul> ... <li><a href=index.html>Home</a></li> ... <li><a href=toc.html>Table of contents</a></li> ... <li><a href=history.html>Revision history</a></li> ... </body> ... </html> ... """ >>> from BaseHTMLProcessor import BaseHTMLProcessor >>> parser = BaseHTMLProcessor() >>> parser.feed(htmlSource)
>>> print parser.output()
<html> <head> <title>Test page</title> </head> <body> <ul> <li><a href="index.html">Home</a></li> <li><a href="toc.html">Table of contents</a></li> <li><a href="history.html">Revision history</a></li> </body> </html>
![]() | Note that the attribute values of the href attributes in the <a> tags are not properly quoted. (Also note that you're using triple quotes for something other than a doc string. And directly in the IDE, no less. They're very useful.) |
![]() | Feed the parser. |
![]() | Using the output function defined in BaseHTMLProcessor, you get the output as a single string, complete with quoted attribute values. While this may seem anti-climactic, think about how much has actually happened here: SGMLParser parsed the entire HTML document, breaking it down into tags, refs, data, and so forth; BaseHTMLProcessor used those elements to reconstruct pieces of HTML (which are still stored in parser.pieces, if you want to see them); finally, you called parser.output, which joined all the pieces of HTML into one string. |
Dialectizer is a simple (and silly) descendant of BaseHTMLProcessor. It runs blocks of text through a series of substitutions, but it makes sure that anything within a <pre>...</pre> block passes through unaltered.
To handle the <pre> blocks, you define two methods in Dialectizer: start_pre and end_pre.
def start_pre(self, attrs):self.verbatim += 1
self.unknown_starttag("pre", attrs)
def end_pre(self):
self.unknown_endtag("pre")
self.verbatim -= 1
![]() | start_pre is called every time SGMLParser finds a <pre> tag in the HTML source. (In a minute, you'll see exactly how this happens.) The method takes a single parameter, attrs, which contains the attributes of the tag (if any). attrs is a list of key/value tuples, just like unknown_starttag takes. |
![]() | In the reset method, you initialize a data attribute that serves as a counter for <pre> tags. Every time you hit a <pre> tag, you increment the counter; every time you hit a </pre> tag, you'll decrement the counter. (You could just use this as a flag and set it to 1 and reset it to 0, but it's just as easy to do it this way, and this handles the odd (but possible) case of nested <pre> tags.) In a minute, you'll see how this counter is put to good use. |
![]() | That's it, that's the only special processing you do for <pre> tags. Now you pass the list of attributes along to unknown_starttag so it can do the default processing. |
![]() | end_pre is called every time SGMLParser finds a </pre> tag. Since end tags can not contain attributes, the method takes no parameters. |
![]() | First, you want to do the default processing, just like any other end tag. |
![]() | Second, you decrement your counter to signal that this <pre> block has been closed. |
At this point, it's worth digging a little further into SGMLParser. I've claimed repeatedly (and you've taken it on faith so far) that SGMLParser looks for and calls specific methods for each tag, if they exist. For instance, you just saw the definition of start_pre and end_pre to handle <pre> and </pre>. But how does this happen? Well, it's not magic, it's just good Python coding.
def finish_starttag(self, tag, attrs):try: method = getattr(self, 'start_' + tag)
except AttributeError:
try: method = getattr(self, 'do_' + tag)
except AttributeError: self.unknown_starttag(tag, attrs)
return -1 else: self.handle_starttag(tag, method, attrs)
return 0 else: self.stack.append(tag) self.handle_starttag(tag, method, attrs) return 1
def handle_starttag(self, tag, method, attrs): method(attrs)
![]() | At this point, SGMLParser has already found a start tag and parsed the attribute list. The only thing left to do is figure out whether there is a specific handler method for this tag, or whether you should fall back on the default method (unknown_starttag). |
![]() | The “magic” of SGMLParser is nothing more than your old friend, getattr. What you may not have realized before is that getattr will find methods defined in descendants of an object as well as the object itself. Here the object is self, the current instance. So if tag is 'pre', this call to getattr will look for a start_pre method on the current instance, which is an instance of the Dialectizer class. |
![]() | getattr raises an AttributeError if the method it's looking for doesn't exist in the object (or any of its descendants), but that's okay, because you wrapped the call to getattr inside a try...except block and explicitly caught the AttributeError. |
![]() | Since you didn't find a start_xxx method, you'll also look for a do_xxx method before giving up. This alternate naming scheme is generally used for standalone tags, like <br>, which have no corresponding end tag. But you can use either naming scheme; as you can see, SGMLParser tries both for every tag. (You shouldn't define both a start_xxx and do_xxx handler method for the same tag, though; only the start_xxx method will get called.) |
![]() | Another AttributeError, which means that the call to getattr failed with do_xxx. Since you found neither a start_xxx nor a do_xxx method for this tag, you catch the exception and fall back on the default method, unknown_starttag. |
![]() | Remember, try...except blocks can have an else clause, which is called if no exception is raised during the try...except block. Logically, that means that you did find a do_xxx method for this tag, so you're going to call it. |
![]() | By the way, don't worry about these different return values; in theory they mean something, but they're never actually used. Don't worry about the self.stack.append(tag) either; SGMLParser keeps track internally of whether your start tags are balanced by appropriate end tags, but it doesn't do anything with this information either. In theory, you could use this module to validate that your tags were fully balanced, but it's probably not worth it, and it's beyond the scope of this chapter. You have better things to worry about right now. |
![]() | start_xxx and do_xxx methods are not called directly; the tag, method, and attributes are passed to this function, handle_starttag, so that descendants can override it and change the way all start tags are dispatched. You don't need that level of control, so you just let this method do its thing, which is to call the method (start_xxx or do_xxx) with the list of attributes. Remember, method is a function, returned from getattr, and functions are objects. (I know you're getting tired of hearing it, and I promise I'll stop saying it as soon as I run out of ways to use it to my advantage.) Here, the function object is passed into this dispatch method as an argument, and this method turns around and calls the function. At this point, you don't need to know what the function is, what it's named, or where it's defined; the only thing you need to know about the function is that it is called with one argument, attrs. |
Now back to our regularly scheduled program: Dialectizer. When you left, you were in the process of defining specific handler methods for <pre> and </pre> tags. There's only one thing left to do, and that is to process text blocks with the pre-defined substitutions. For that, you need to override the handle_data method.
def handle_data(self, text):self.pieces.append(self.verbatim and text or self.process(text))
![]() | handle_data is called with only one argument, the text to process. |
![]() | In the ancestor BaseHTMLProcessor, the handle_data method simply appended the text to the output buffer, self.pieces. Here the logic is only slightly more complicated. If you're in the middle of a <pre>...</pre> block, self.verbatim will be some value greater than 0, and you want to put the text in the output buffer unaltered. Otherwise, you will call a separate method to process the substitutions, then put the result of that into the output buffer. In Python, this is a one-liner, using the and-or trick. |
You're close to completely understanding Dialectizer. The only missing link is the nature of the text substitutions themselves. If you know any Perl, you know that when complex text substitutions are required, the only real solution is regular expressions. The classes later in dialect.py define a series of regular expressions that operate on the text between the HTML tags. But you just had a whole chapter on regular expressions. You don't really want to slog through regular expressions again, do you? God knows I don't. I think you've learned enough for one chapter.
It's time to put everything you've learned so far to good use. I hope you were paying attention.
def translate(url, dialectName="chef"):import urllib
sock = urllib.urlopen(url)
htmlSource = sock.read() sock.close()
![]() | The translate function has an optional argument dialectName, which is a string that specifies the dialect you'll be using. You'll see how this is used in a minute. |
![]() | Hey, wait a minute, there's an import statement in this function! That's perfectly legal in Python. You're used to seeing import statements at the top of a program, which means that the imported module is available anywhere in the program. But you can also import modules within a function, which means that the imported module is only available within the function. If you have a module that is only ever used in one function, this is an easy way to make your code more modular. (When you find that your weekend hack has turned into an 800-line work of art and decide to split it up into a dozen reusable modules, you'll appreciate this.) |
![]() | Now you get the source of the given URL. |
parserName = "%sDialectizer" % dialectName.capitalize()parserClass = globals()[parserName]
parser = parserClass()
![]()
![]() | capitalize is a string method you haven't seen before; it simply capitalizes the first letter of a string and forces everything else to lowercase. Combined with some string formatting, you've taken the name of a dialect and transformed it into the name of the corresponding Dialectizer class. If dialectName is the string 'chef', parserName will be the string 'ChefDialectizer'. |
![]() | You have the name of a class as a string (parserName), and you have the global namespace as a dictionary (globals()). Combined, you can get a reference to the class which the string names. (Remember, classes are objects, and they can be assigned to variables just like any other object.) If parserName is the string 'ChefDialectizer', parserClass will be the class ChefDialectizer. |
![]() | Finally, you have a class object (parserClass), and you want an instance of the class. Well, you already know how to do that: call the class like a function. The fact that the class is being stored in a local variable makes absolutely no difference; you just call the local variable like a function, and out pops an instance of the class. If parserClass is the class ChefDialectizer, parser will be an instance of the class ChefDialectizer. |
Why bother? After all, there are only 3 Dialectizer classes; why not just use a case statement? (Well, there's no case statement in Python, but why not just use a series of if statements?) One reason: extensibility. The translate function has absolutely no idea how many Dialectizer classes you've defined. Imagine if you defined a new FooDialectizer tomorrow; translate would work by passing 'foo' as the dialectName.
Even better, imagine putting FooDialectizer in a separate module, and importing it with from module import. You've already seen that this includes it in globals(), so translate would still work without modification, even though FooDialectizer was in a separate file.
Now imagine that the name of the dialect is coming from somewhere outside the program, maybe from a database or from a user-inputted value on a form. You can use any number of server-side Python scripting architectures to dynamically generate web pages; this function could take a URL and a dialect name (both strings) in the query string of a web page request, and output the “translated” web page.
Finally, imagine a Dialectizer framework with a plug-in architecture. You could put each Dialectizer class in a separate file, leaving only the translate function in dialect.py. Assuming a consistent naming scheme, the translate function could dynamic import the appropiate class from the appropriate file, given nothing but the dialect name. (You haven't seen dynamic importing yet, but I promise to cover it in a later chapter.) To add a new dialect, you would simply add an appropriately-named file in the plug-ins directory (like foodialect.py which contains the FooDialectizer class). Calling the translate function with the dialect name 'foo' would find the module foodialect.py, import the class FooDialectizer, and away you go.
parser.feed(htmlSource)parser.close()
return parser.output()
![]()
![]() | After all that imagining, this is going to seem pretty boring, but the feed function is what does the entire transformation. You had the entire HTML source in a single string, so you only had to call feed once. However, you can call feed as often as you want, and the parser will just keep parsing. So if you were worried about memory usage (or you knew you were going to be dealing with very large HTML pages), you could set this up in a loop, where you read a few bytes of HTML and fed it to the parser. The result would be the same. |
![]() | Because feed maintains an internal buffer, you should always call the parser's close method when you're done (even if you fed it all at once, like you did). Otherwise you may find that your output is missing the last few bytes. |
![]() | Remember, output is the function you defined on BaseHTMLProcessor that joins all the pieces of output you've buffered and returns them in a single string. |
And just like that, you've “translated” a web page, given nothing but a URL and the name of a dialect.
Python provides you with a powerful tool, sgmllib.py, to manipulate HTML by turning its structure into an object model. You can use this tool in many different ways.
Along with these examples, you should be comfortable doing all of the following things:
[1] The technical term for a parser like SGMLParser is a consumer: it consumes HTML and breaks it down. Presumably, the name feed was chosen to fit into the whole “consumer” motif. Personally, it makes me think of an exhibit in the zoo where there's just a dark cage with no trees or plants or evidence of life of any kind, but if you stand perfectly still and look really closely you can make out two beady eyes staring back at you from the far left corner, but you convince yourself that that's just your mind playing tricks on you, and the only way you can tell that the whole thing isn't just an empty cage is a small innocuous sign on the railing that reads, “Do not feed the parser.” But maybe that's just me. In any event, it's an interesting mental image.
[2] The reason Python is better at lists than strings is that lists are mutable but strings are immutable. This means that appending to a list just adds the element and updates the index. Since strings can not be changed after they are created, code like s = s + newpiece will create an entirely new string out of the concatenation of the original and the new piece, then throw away the original string. This involves a lot of expensive memory management, and the amount of effort involved increases as the string gets longer, so doing s = s + newpiece in a loop is deadly. In technical terms, appending n items to a list is O(n), while appending n items to a string is O(n2).
[3] I don't get out much.
[4] All right, it's not that common a question. It's not up there with “What editor should I use to write Python code?” (answer: Emacs) or “Is Python better or worse than Perl?” (answer: “Perl is worse than Python because people wanted it worse.” -Larry Wall, 10/14/1998) But questions about HTML processing pop up in one form or another about once a month, and among those questions, this is a popular one.
These next two chapters are about XML processing in Python. It would be helpful if you already knew what an XML document looks like, that it's made up of structured tags to form a hierarchy of elements, and so on. If this doesn't make sense to you, there are many XML tutorials that can explain the basics.
If you're not particularly interested in XML, you should still read these chapters, which cover important topics like Python packages, Unicode, command line arguments, and how to use getattr for method dispatching.
Being a philosophy major is not required, although if you have ever had the misfortune of being subjected to the writings of Immanuel Kant, you will appreciate the example program a lot more than if you majored in something useful, like computer science.
There are two basic ways to work with XML. One is called SAX (“Simple API for XML”), and it works by reading the XML a little bit at a time and calling a method for each element it finds. (If you read Chapter 8, HTML Processing, this should sound familiar, because that's how the sgmllib module works.) The other is called DOM (“Document Object Model”), and it works by reading in the entire XML document at once and creating an internal representation of it using native Python classes linked in a tree structure. Python has standard modules for both kinds of parsing, but this chapter will only deal with using the DOM.
The following is a complete Python program which generates pseudo-random output based on a context-free grammar defined in an XML format. Don't worry yet if you don't understand what that means; you'll examine both the program's input and its output in more depth throughout these next two chapters.
If you have not already done so, you can download this and other examples used in this book.
"""Kant Generator for Python Generates mock philosophy based on a context-free grammar Usage: python kgp.py [options] [source] Options: -g ..., --grammar=... use specified grammar file or URL -h, --help show this help -d show debugging information while parsing Examples: kgp.py generates several paragraphs of Kantian philosophy kgp.py -g husserl.xml generates several paragraphs of Husserl kpg.py "<xref id='paragraph'/>" generates a paragraph of Kant kgp.py template.xml reads from template.xml to decide what to generate """ from xml.dom import minidom import random import toolbox import sys import getopt _debug = 0 class NoSourceError(Exception): pass class KantGenerator: """generates mock philosophy based on a context-free grammar""" def __init__(self, grammar, source=None): self.loadGrammar(grammar) self.loadSource(source and source or self.getDefaultSource()) self.refresh() def _load(self, source): """load XML input source, return parsed XML document - a URL of a remote XML file ("http://diveintopython.org/kant.xml") - a filename of a local XML file ("~/diveintopython/common/py/kant.xml") - standard input ("-") - the actual XML document, as a string """ sock = toolbox.openAnything(source) xmldoc = minidom.parse(sock).documentElement sock.close() return xmldoc def loadGrammar(self, grammar): """load context-free grammar""" self.grammar = self._load(grammar) self.refs = {} for ref in self.grammar.getElementsByTagName("ref"): self.refs[ref.attributes["id"].value] = ref def loadSource(self, source): """load source""" self.source = self._load(source) def getDefaultSource(self): """guess default source of the current grammar The default source will be one of the <ref>s that is not cross-referenced. This sounds complicated but it's not. Example: The default source for kant.xml is "<xref id='section'/>", because 'section' is the one <ref> that is not <xref>'d anywhere in the grammar. In most grammars, the default source will produce the longest (and most interesting) output. """ xrefs = {} for xref in self.grammar.getElementsByTagName("xref"): xrefs[xref.attributes["id"].value] = 1 xrefs = xrefs.keys() standaloneXrefs = [e for e in self.refs.keys() if e not in xrefs] if not standaloneXrefs: raise NoSourceError, "can't guess source, and no source specified" return '<xref id="%s"/>' % random.choice(standaloneXrefs) def reset(self): """reset parser""" self.pieces = [] self.capitalizeNextWord = 0 def refresh(self): """reset output buffer, re-parse entire source file, and return output Since parsing involves a good deal of randomness, this is an easy way to get new output without having to reload a grammar file each time. """ self.reset() self.parse(self.source) return self.output() def output(self): """output generated text""" return "".join(self.pieces) def randomChildElement(self, node): """choose a random child element of a node This is a utility method used by do_xref and do_choice. """ choices = [e for e in node.childNodes if e.nodeType == e.ELEMENT_NODE] chosen = random.choice(choices) if _debug: sys.stderr.write('%s available choices: %s\n' % \ (len(choices), [e.toxml() for e in choices])) sys.stderr.write('Chosen: %s\n' % chosen.toxml()) return chosen def parse(self, node): """parse a single XML node A parsed XML document (from minidom.parse) is a tree of nodes of various types. Each node is represented by an instance of the corresponding Python class (Element for a tag, Text for text data, Document for the top-level document). The following statement constructs the name of a class method based on the type of node we're parsing ("parse_Element" for an Element node, "parse_Text" for a Text node, etc.) and then calls the method. """ parseMethod = getattr(self, "parse_%s" % node.__class__.__name__) parseMethod(node) def parse_Document(self, node): """parse the document node The document node by itself isn't interesting (to us), but its only child, node.documentElement, is: it's the root node of the grammar. """ self.parse(node.documentElement) def parse_Text(self, node): """parse a text node The text of a text node is usually added to the output buffer verbatim. The one exception is that <p class='sentence'> sets a flag to capitalize the first letter of the next word. If that flag is set, we capitalize the text and reset the flag. """ text = node.data if self.capitalizeNextWord: self.pieces.append(text[0].upper()) self.pieces.append(text[1:]) self.capitalizeNextWord = 0 else: self.pieces.append(text) def parse_Element(self, node): """parse an element An XML element corresponds to an actual tag in the source: <xref id='...'>, <p chance='...'>, <choice>, etc. Each element type is handled in its own method. Like we did in parse(), we construct a method name based on the name of the element ("do_xref" for an <xref> tag, etc.) and call the method. """ handlerMethod = getattr(self, "do_%s" % node.tagName) handlerMethod(node) def parse_Comment(self, node): """parse a comment The grammar can contain XML comments, but we ignore them """ pass def do_xref(self, node): """handle <xref id='...'> tag An <xref id='...'> tag is a cross-reference to a <ref id='...'> tag. <xref id='sentence'/> evaluates to a randomly chosen child of <ref id='sentence'>. """ id = node.attributes["id"].value self.parse(self.randomChildElement(self.refs[id])) def do_p(self, node): """handle <p> tag The <p> tag is the core of the grammar. It can contain almost anything: freeform text, <choice> tags, <xref> tags, even other <p> tags. If a "class='sentence'" attribute is found, a flag is set and the next word will be capitalized. If a "chance='X'" attribute is found, there is an X% chance that the tag will be evaluated (and therefore a (100-X)% chance that it will be completely ignored) """ keys = node.attributes.keys() if "class" in keys: if node.attributes["class"].value == "sentence": self.capitalizeNextWord = 1 if "chance" in keys: chance = int(node.attributes["chance"].value) doit = (chance > random.randrange(100)) else: doit = 1 if doit: for child in node.childNodes: self.parse(child) def do_choice(self, node): """handle <choice> tag A <choice> tag contains one or more <p> tags. One <p> tag is chosen at random and evaluated; the rest are ignored. """ self.parse(self.randomChildElement(node)) def usage(): print __doc__ def main(argv): grammar = "kant.xml" try: opts, args = getopt.getopt(argv, "hg:d", ["help", "grammar="]) except getopt.GetoptError: usage() sys.exit(2) for opt, arg in opts: if opt in ("-h", "--help"): usage() sys.exit() elif opt == '-d': global _debug _debug = 1 elif opt in ("-g", "--grammar"): grammar = arg source = "".join(args) k = KantGenerator(grammar, source) print k.output() if __name__ == "__main__": main(sys.argv[1:])
"""Miscellaneous utility functions""" def openAnything(source): """URI, filename, or string --> stream This function lets you define parsers that take any input source (URL, pathname to local or network file, or actual data as a string) and deal with it in a uniform manner. Returned object is guaranteed to have all the basic stdio read methods (read, readline, readlines). Just .close() the object when you're done with it. Examples: >>> from xml.dom import minidom >>> sock = openAnything("http://localhost/kant.xml") >>> doc = minidom.parse(sock) >>> sock.close() >>> sock = openAnything("c:\\inetpub\\wwwroot\\kant.xml") >>> doc = minidom.parse(sock) >>> sock.close() >>> sock = openAnything("<ref id='conjunction'><text>and</text><text>or</text></ref>") >>> doc = minidom.parse(sock) >>> sock.close() """ if hasattr(source, "read"): return source if source == '-': import sys return sys.stdin # try to open with urllib (if source is http, ftp, or file URL) import urllib try: return urllib.urlopen(source) except (IOError, OSError): pass # try to open with native open function (if source is pathname) try: return open(source) except (IOError, OSError): pass # treat source as string import StringIO return StringIO.StringIO(str(source))
Run the program kgp.py by itself, and it will parse the default XML-based grammar, in kant.xml, and print several paragraphs worth of philosophy in the style of Immanuel Kant.
[you@localhost kgp]$ python kgp.py
As is shown in the writings of Hume, our a priori concepts, in
reference to ends, abstract from all content of knowledge; in the study
of space, the discipline of human reason, in accordance with the
principles of philosophy, is the clue to the discovery of the
Transcendental Deduction. The transcendental aesthetic, in all
theoretical sciences, occupies part of the sphere of human reason
concerning the existence of our ideas in general; still, the
never-ending regress in the series of empirical conditions constitutes
the whole content for the transcendental unity of apperception. What
we have alone been able to show is that, even as this relates to the
architectonic of human reason, the Ideal may not contradict itself, but
it is still possible that it may be in contradictions with the
employment of the pure employment of our hypothetical judgements, but
natural causes (and I assert that this is the case) prove the validity
of the discipline of pure reason. As we have already seen, time (and
it is obvious that this is true) proves the validity of time, and the
architectonic of human reason, in the full sense of these terms,
abstracts from all content of knowledge. I assert, in the case of the
discipline of practical reason, that the Antinomies are just as
necessary as natural causes, since knowledge of the phenomena is a
posteriori.
The discipline of human reason, as I have elsewhere shown, is by
its very nature contradictory, but our ideas exclude the possibility of
the Antinomies. We can deduce that, on the contrary, the pure
employment of philosophy, on the contrary, is by its very nature
contradictory, but our sense perceptions are a representation of, in
the case of space, metaphysics. The thing in itself is a
representation of philosophy. Applied logic is the clue to the
discovery of natural causes. However, what we have alone been able to
show is that our ideas, in other words, should only be used as a canon
for the Ideal, because of our necessary ignorance of the conditions.
[...snip...]
This is, of course, complete gibberish. Well, not complete gibberish. It is syntactically and grammatically correct (although very verbose -- Kant wasn't what you would call a get-to-the-point kind of guy). Some of it may actually be true (or at least the sort of thing that Kant would have agreed with), some of it is blatantly false, and most of it is simply incoherent. But all of it is in the style of Immanuel Kant.
Let me repeat that this is much, much funnier if you are now or have ever been a philosophy major.
The interesting thing about this program is that there is nothing Kant-specific about it. All the content in the previous example was derived from the grammar file, kant.xml. If you tell the program to use a different grammar file (which you can specify on the command line), the output will be completely different.
[you@localhost kgp]$ python kgp.py -g binary.xml 00101001 [you@localhost kgp]$ python kgp.py -g binary.xml 10110100
You will take a closer look at the structure of the grammar file later in this chapter. For now, all you need to know is that the grammar file defines the structure of the output, and the kgp.py program reads through the grammar and makes random decisions about which words to plug in where.
Actually parsing an XML document is very simple: one line of code. However, before you get to that line of code, you need to take a short detour to talk about packages.
>>> from xml.dom import minidom>>> xmldoc = minidom.parse('~/diveintopython/common/py/kgp/binary.xml')
That sounds complicated, but it's really not. Looking at the actual implementation may help. Packages are little more than directories of modules; nested packages are subdirectories. The modules within a package (or a nested package) are still just .py files, like always, except that they're in a subdirectory instead of the main lib/ directory of your Python installation.
Python21/ root Python installation (home of the executable)
|
+--lib/ library directory (home of the standard library modules)
|
+-- xml/ xml package (really just a directory with other stuff in it)
|
+--sax/ xml.sax package (again, just a directory)
|
+--dom/ xml.dom package (contains minidom.py)
|
+--parsers/ xml.parsers package (used internally)
So when you say from xml.dom import minidom, Python figures out that that means “look in the xml directory for a dom directory, and look in that for the minidom module, and import it as minidom”. But Python is even smarter than that; not only can you import entire modules contained within a package, you can selectively import specific classes or functions from a module contained within a package. You can also import the package itself as a module. The syntax is all the same; Python figures out what you mean based on the file layout of the package, and automatically does the right thing.
>>> from xml.dom import minidom>>> minidom <module 'xml.dom.minidom' from 'C:\Python21\lib\xml\dom\minidom.pyc'> >>> minidom.Element <class xml.dom.minidom.Element at 01095744> >>> from xml.dom.minidom import Element
>>> Element <class xml.dom.minidom.Element at 01095744> >>> minidom.Element <class xml.dom.minidom.Element at 01095744> >>> from xml import dom
>>> dom <module 'xml.dom' from 'C:\Python21\lib\xml\dom\__init__.pyc'> >>> import xml
>>> xml <module 'xml' from 'C:\Python21\lib\xml\__init__.pyc'>
![]() | Here you're importing a module (minidom) from a nested package (xml.dom). The result is that minidom is imported into your namespace, and in order to reference classes within the minidom module (like Element), you need to preface them with the module name. |
![]() | Here you are importing a class (Element) from a module (minidom) from a nested package (xml.dom). The result is that Element is imported directly into your namespace. Note that this does not interfere with the previous import; the Element class can now be referenced in two ways (but it's all still the same class). |
![]() | Here you are importing the dom package (a nested package of xml) as a module in and of itself. Any level of a package can be treated as a module, as you'll see in a moment. It can even have its own attributes and methods, just the modules you've seen before. |
![]() | Here you are importing the root level xml package as a module. |
So how can a package (which is just a directory on disk) be imported and treated as a module (which is always a file on disk)? The answer is the magical __init__.py file. You see, packages are not simply directories; they are directories with a specific file, __init__.py, inside. This file defines the attributes and methods of the package. For instance, xml.dom contains a Node class, which is defined in xml/dom/__init__.py. When you import a package as a module (like dom from xml), you're really importing its __init__.py file.
![]() |
|
A package is a directory with the special __init__.py file in it. The __init__.py file defines the attributes and methods of the package. It doesn't need to define anything; it can just be an empty file, but it has to exist. But if __init__.py doesn't exist, the directory is just a directory, not a package, and it can't be imported or contain modules or nested packages. |
So why bother with packages? Well, they provide a way to logically group related modules. Instead of having an xml package with sax and dom packages inside, the authors could have chosen to put all the sax functionality in xmlsax.py and all the dom functionality in xmldom.py, or even put all of it in a single module. But that would have been unwieldy (as of this writing, the XML package has over 3000 lines of code) and difficult to manage (separate source files mean multiple people can work on different areas simultaneously).
If you ever find yourself writing a large subsystem in Python (or, more likely, when you realize that your small subsystem has grown into a large one), invest some time designing a good package architecture. It's one of the many things Python is good at, so take advantage of it.
As I was saying, actually parsing an XML document is very simple: one line of code. Where you go from there is up to you.
>>> from xml.dom import minidom>>> xmldoc = minidom.parse('~/diveintopython/common/py/kgp/binary.xml')
>>> xmldoc
<xml.dom.minidom.Document instance at 010BE87C> >>> print xmldoc.toxml()
<?xml version="1.0" ?> <grammar> <ref id="bit"> <p>0</p> <p>1</p> </ref> <ref id="byte"> <p><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/>\ <xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/></p> </ref> </grammar>
![]() | As you saw in the previous section, this imports the minidom module from the xml.dom package. |
![]() | Here is the one line of code that does all the work: minidom.parse takes one argument and returns a parsed representation of the XML document. The argument can be many things; in this case, it's simply a filename of an XML document on my local disk. (To follow along, you'll need to change the path to point to your downloaded examples directory.) But you can also pass a file object, or even a file-like object. You'll take advantage of this flexibility later in this chapter. |
![]() | The object returned from minidom.parse is a Document object, a descendant of the Node class. This Document object is the root level of a complex tree-like structure of interlocking Python objects that completely represent the XML document you passed to minidom.parse. |
![]() | toxml is a method of the Node class (and is therefore available on the Document object you got from minidom.parse). toxml prints out the XML that this Node represents. For the Document node, this prints out the entire XML document. |
Now that you have an XML document in memory, you can start traversing through it.
>>> xmldoc.childNodes[<DOM Element: grammar at 17538908>] >>> xmldoc.childNodes[0]
<DOM Element: grammar at 17538908> >>> xmldoc.firstChild
<DOM Element: grammar at 17538908>
>>> grammarNode = xmldoc.firstChild >>> print grammarNode.toxml()<grammar> <ref id="bit"> <p>0</p> <p>1</p> </ref> <ref id="byte"> <p><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/>\ <xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/></p> </ref> </grammar>
>>> grammarNode.childNodes[<DOM Text node "\n">, <DOM Element: ref at 17533332>, \ <DOM Text node "\n">, <DOM Element: ref at 17549660>, <DOM Text node "\n">] >>> print grammarNode.firstChild.toxml()
>>> print grammarNode.childNodes[1].toxml()
<ref id="bit"> <p>0</p> <p>1</p> </ref> >>> print grammarNode.childNodes[3].toxml()
<ref id="byte"> <p><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/>\ <xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/></p> </ref> >>> print grammarNode.lastChild.toxml()
![]()
>>> grammarNode <DOM Element: grammar at 19167148> >>> refNode = grammarNode.childNodes[1]>>> refNode <DOM Element: ref at 17987740> >>> refNode.childNodes
[<DOM Text node "\n">, <DOM Text node " ">, <DOM Element: p at 19315844>, \ <DOM Text node "\n">, <DOM Text node " ">, \ <DOM Element: p at 19462036>, <DOM Text node "\n">] >>> pNode = refNode.childNodes[2] >>> pNode <DOM Element: p at 19315844> >>> print pNode.toxml()
<p>0</p> >>> pNode.firstChild
<DOM Text node "0"> >>> pNode.firstChild.data
u'0'
Unicode is a system to represent characters from all the world's different languages. When Python parses an XML document, all data is stored in memory as unicode.
You'll get to all that in a minute, but first, some background.
Historical note. Before unicode, there were separate character encoding systems for each language, each using the same numbers (0-255) to represent that language's characters. Some languages (like Russian) have multiple conflicting standards about how to represent the same characters; other languages (like Japanese) have so many characters that they require multiple-byte character sets. Exchanging documents between systems was difficult because there was no way for a computer to tell for certain which character encoding scheme the document author had used; the computer only saw numbers, and the numbers could mean different things. Then think about trying to store these documents in the same place (like in the same database table); you would need to store the character encoding alongside each piece of text, and make sure to pass it around whenever you passed the text around. Then think about multilingual documents, with characters from multiple languages in the same document. (They typically used escape codes to switch modes; poof, you're in Russian koi8-r mode, so character 241 means this; poof, now you're in Mac Greek mode, so character 241 means something else. And so on.) These are the problems which unicode was designed to solve.
To solve these problems, unicode represents each character as a 2-byte number, from 0 to 65535.[5] Each 2-byte number represents a unique character used in at least one of the world's languages. (Characters that are used in multiple languages have the same numeric code.) There is exactly 1 number per character, and exactly 1 character per number. Unicode data is never ambiguous.
Of course, there is still the matter of all these legacy encoding systems. 7-bit ASCII, for instance, which stores English characters as numbers ranging from 0 to 127. (65 is capital “A”, 97 is lowercase “a”, and so forth.) English has a very simple alphabet, so it can be completely expressed in 7-bit ASCII. Western European languages like French, Spanish, and German all use an encoding system called ISO-8859-1 (also called “latin-1”), which uses the 7-bit ASCII characters for the numbers 0 through 127, but then extends into the 128-255 range for characters like n-with-a-tilde-over-it (241), and u-with-two-dots-over-it (252). And unicode uses the same characters as 7-bit ASCII for 0 through 127, and the same characters as ISO-8859-1 for 128 through 255, and then extends from there into characters for other languages with the remaining numbers, 256 through 65535.
When dealing with unicode data, you may at some point need to convert the data back into one of these other legacy encoding systems. For instance, to integrate with some other computer system which expects its data in a specific 1-byte encoding scheme, or to print it to a non-unicode-aware terminal or printer. Or to store it in an XML document which explicitly specifies the encoding scheme.
And on that note, let's get back to Python.
Python has had unicode support throughout the language since version 2.0. The XML package uses unicode to store all parsed XML data, but you can use unicode anywhere.
>>> s = u'Dive in'>>> s u'Dive in' >>> print s
Dive in
>>> s = u'La Pe\xf1a'>>> print s
Traceback (innermost last): File "<interactive input>", line 1, in ? UnicodeError: ASCII encoding error: ordinal not in range(128) >>> print s.encode('latin-1')
La Peña
Remember I said Python usually converted unicode to ASCII whenever it needed to make a regular string out of a unicode string? Well, this default encoding scheme is an option which you can customize.
# sitecustomize.py# this file can be anywhere in your Python path, # but it usually goes in ${pythondir}/lib/site-packages/ import sys sys.setdefaultencoding('iso-8859-1')
![]()
>>> import sys >>> sys.getdefaultencoding()'iso-8859-1' >>> s = u'La Pe\xf1a' >>> print s
La Peña
If you are going to be storing non-ASCII strings within your Python code, you'll need to specify the encoding of each individual .py file by putting an encoding declaration at the top of each file. This declaration defines the .py file to be UTF-8:
#!/usr/bin/env python # -*- coding: UTF-8 -*-
Now, what about XML? Well, every XML document is in a specific encoding. Again, ISO-8859-1 is a popular encoding for data in Western European languages. KOI8-R is popular for Russian texts. The encoding, if specified, is in the header of the XML document.
<?xml version="1.0" encoding="koi8-r"?><preface> <title>Предисловие</title>
</preface>
>>> from xml.dom import minidom >>> xmldoc = minidom.parse('russiansample.xml')>>> title = xmldoc.getElementsByTagName('title')[0].firstChild.data >>> title
u'\u041f\u0440\u0435\u0434\u0438\u0441\u043b\u043e\u0432\u0438\u0435' >>> print title
Traceback (innermost last): File "<interactive input>", line 1, in ? UnicodeError: ASCII encoding error: ordinal not in range(128) >>> convertedtitle = title.encode('koi8-r')
>>> convertedtitle '\xf0\xd2\xc5\xc4\xc9\xd3\xcc\xcf\xd7\xc9\xc5' >>> print convertedtitle
Предисловие
To sum up, unicode itself is a bit intimidating if you've never seen it before, but unicode data is really very easy to handle in Python. If your XML documents are all 7-bit ASCII (like the examples in this chapter), you will literally never think about unicode. Python will convert the ASCII data in the XML documents into unicode while parsing, and auto-coerce it back to ASCII whenever necessary, and you'll never even notice. But if you need to deal with that in other languages, Python is ready.
Traversing XML documents by stepping through each node can be tedious. If you're looking for something in particular, buried deep within your XML document, there is a shortcut you can use to find it quickly: getElementsByTagName.
For this section, you'll be using the binary.xml grammar file, which looks like this:
<?xml version="1.0"?>
<!DOCTYPE grammar PUBLIC "-//diveintopython.org//DTD Kant Generator Pro v1.0//EN" "kgp.dtd">
<grammar>
<ref id="bit">
<p>0</p>
<p>1</p>
</ref>
<ref id="byte">
<p><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/>\
<xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/></p>
</ref>
</grammar>
It has two refs, 'bit' and 'byte'. A bit is either a '0' or '1', and a byte is 8 bits.
>>> from xml.dom import minidom >>> xmldoc = minidom.parse('binary.xml') >>> reflist = xmldoc.getElementsByTagName('ref')>>> reflist [<DOM Element: ref at 136138108>, <DOM Element: ref at 136144292>] >>> print reflist[0].toxml() <ref id="bit"> <p>0</p> <p>1</p> </ref> >>> print reflist[1].toxml() <ref id="byte"> <p><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/>\ <xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/></p> </ref>
>>> firstref = reflist[0]>>> print firstref.toxml() <ref id="bit"> <p>0</p> <p>1</p> </ref> >>> plist = firstref.getElementsByTagName("p")
>>> plist [<DOM Element: p at 136140116>, <DOM Element: p at 136142172>] >>> print plist[0].toxml()
<p>0</p> >>> print plist[1].toxml() <p>1</p>
>>> plist = xmldoc.getElementsByTagName("p")>>> plist [<DOM Element: p at 136140116>, <DOM Element: p at 136142172>, <DOM Element: p at 136146124>] >>> plist[0].toxml()
'<p>0</p>' >>> plist[1].toxml() '<p>1</p>' >>> plist[2].toxml()
'<p><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/>\ <xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/></p>'
XML elements can have one or more attributes, and it is incredibly simple to access them once you have parsed an XML document.
For this section, you'll be using the binary.xml grammar file that you saw in the previous section.
![]() |
|
This section may be a little confusing, because of some overlapping terminology. Elements in an XML document have attributes, and Python objects also have attributes. When you parse an XML document, you get a bunch of Python objects that represent all the pieces of the XML document, and some of these Python objects represent attributes of the XML elements. But the (Python) objects that represent the (XML) attributes also have (Python) attributes, which are used to access various parts of the (XML) attribute that the object represents. I told you it was confusing. I am open to suggestions on how to distinguish these more clearly. |
>>> xmldoc = minidom.parse('binary.xml') >>> reflist = xmldoc.getElementsByTagName('ref') >>> bitref = reflist[0] >>> print bitref.toxml() <ref id="bit"> <p>0</p> <p>1</p> </ref> >>> bitref.attributes<xml.dom.minidom.NamedNodeMap instance at 0x81e0c9c> >>> bitref.attributes.keys()
![]()
[u'id'] >>> bitref.attributes.values()
[<xml.dom.minidom.Attr instance at 0x81d5044>] >>> bitref.attributes["id"]
<xml.dom.minidom.Attr instance at 0x81d5044>
![]() | Each Element object has an attribute called attributes, which is a NamedNodeMap object. This sounds scary, but it's not, because a NamedNodeMap is an object that acts like a dictionary, so you already know how to use it. |
![]() | Treating the NamedNodeMap as a dictionary, you can get a list of the names of the attributes of this element by using attributes.keys(). This element has only one attribute, 'id'. |
![]() | Attribute names, like all other text in an XML document, are stored in unicode. |
![]() | Again treating the NamedNodeMap as a dictionary, you can get a list of the values of the attributes by using attributes.values(). The values are themselves objects, of type Attr. You'll see how to get useful information out of this object in the next example. |
![]() | Still treating the NamedNodeMap as a dictionary, you can access an individual attribute by name, using normal dictionary syntax. (Readers who have been paying extra-close attention will already know how the NamedNodeMap class accomplishes this neat trick: by defining a __getitem__ special method. Other readers can take comfort in the fact that they don't need to understand how it works in order to use it effectively.) |
>>> a = bitref.attributes["id"] >>> a <xml.dom.minidom.Attr instance at 0x81d5044> >>> a.nameu'id' >>> a.value
u'bit'
![]() |
|
Like a dictionary, attributes of an XML element have no ordering. Attributes may happen to be listed in a certain order in the original XML document, and the Attr objects may happen to be listed in a certain order when the XML document is parsed into Python objects, but these orders are arbitrary and should carry no special meaning. You should always access individual attributes by name, like the keys of a dictionary. |
OK, that's it for the hard-core XML stuff. The next chapter will continue to use these same example programs, but focus on other aspects that make the program more flexible: using streams for input processing, using getattr for method dispatching, and using command-line flags to allow users to reconfigure the program without changing the code.
Before moving on to the next chapter, you should be comfortable doing all of these things:
[5] This, sadly, is still an oversimplification. Unicode now has been extended to handle ancient Chinese, Korean, and Japanese texts, which had so many different characters that the 2-byte unicode system could not represent them all. But Python doesn't currently support that out of the box, and I don't know if there is a project afoot to add it. You've reached the limits of my expertise, sorry.
One of Python's greatest strengths is its dynamic binding, and one powerful use of dynamic binding is the file-like object.
Many functions which require an input source could simply take a filename, go open the file for reading, read it, and close it when they're done. But they don't. Instead, they take a file-like object.
In the simplest case, a file-like object is any object with a read method with an optional size parameter, which returns a string. When called with no size parameter, it reads everything there is to read from the input source and returns all the data as a single string. When called with a size parameter, it reads that much from the input source and returns that much data; when called again, it picks up where it left off and returns the next chunk of data.
This is how reading from real files works; the difference is that you're not limiting yourself to real files. The input source could be anything: a file on disk, a web page, even a hard-coded string. As long as you pass a file-like object to the function, and the function simply calls the object's read method, the function can handle any kind of input source without specific code to handle each kind.
In case you were wondering how this relates to XML processing, minidom.parse is one such function which can take a file-like object.
>>> from xml.dom import minidom >>> fsock = open('binary.xml')>>> xmldoc = minidom.parse(fsock)
>>> fsock.close()
>>> print xmldoc.toxml()
<?xml version="1.0" ?> <grammar> <ref id="bit"> <p>0</p> <p>1</p> </ref> <ref id="byte"> <p><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/>\ <xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/></p> </ref> </grammar>
![]() | First, you open the file on disk. This gives you a file object. |
![]() | You pass the file object to minidom.parse, which calls the read method of fsock and reads the XML document from the file on disk. |
![]() | Be sure to call the close method of the file object after you're done with it. minidom.parse will not do this for you. |
![]() | Calling the toxml() method on the returned XML document prints out the entire thing. |
Well, that all seems like a colossal waste of time. After all, you've already seen that minidom.parse can simply take the filename and do all the opening and closing nonsense automatically. And it's true that if you know you're just going to be parsing a local file, you can pass the filename and minidom.parse is smart enough to Do The Right Thing™. But notice how similar -- and easy -- it is to parse an XML document straight from the Internet.
>>> import urllib >>> usock = urllib.urlopen('http://slashdot.org/slashdot.rdf')>>> xmldoc = minidom.parse(usock)
>>> usock.close()
>>> print xmldoc.toxml()
<?xml version="1.0" ?> <rdf:RDF xmlns="http://my.netscape.com/rdf/simple/0.9/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"> <channel> <title>Slashdot</title> <link>http://slashdot.org/</link> <description>News for nerds, stuff that matters</description> </channel> <image> <title>Slashdot</title> <url>http://images.slashdot.org/topics/topicslashdot.gif</url> <link>http://slashdot.org/</link> </image> <item> <title>To HDTV or Not to HDTV?</title> <link>http://slashdot.org/article.pl?sid=01/12/28/0421241</link> </item> [...snip...]
![]() | As you saw in a previous chapter, urlopen takes a web page URL and returns a file-like object. Most importantly, this object has a read method which returns the HTML source of the web page. |
![]() | Now you pass the file-like object to minidom.parse, which obediently calls the read method of the object and parses the XML data that the read method returns. The fact that this XML data is now coming straight from a web page is completely irrelevant. minidom.parse doesn't know about web pages, and it doesn't care about web pages; it just knows about file-like objects. |
![]() | As soon as you're done with it, be sure to close the file-like object that urlopen gives you. |
![]() | By the way, this URL is real, and it really is XML. It's an XML representation of the current headlines on Slashdot, a technical news and gossip site. |
>>> contents = "<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>" >>> xmldoc = minidom.parseString(contents)>>> print xmldoc.toxml() <?xml version="1.0" ?> <grammar><ref id="bit"><p>0</p><p>1</p></ref></grammar>
OK, so you can use the minidom.parse function for parsing both local files and remote URLs, but for parsing strings, you use... a different function. That means that if you want to be able to take input from a file, a URL, or a string, you'll need special logic to check whether it's a string, and call the parseString function instead. How unsatisfying.
If there were a way to turn a string into a file-like object, then you could simply pass this object to minidom.parse. And in fact, there is a module specifically designed for doing just that: StringIO.
>>> contents = "<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>" >>> import StringIO >>> ssock = StringIO.StringIO(contents)>>> ssock.read()
"<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>" >>> ssock.read()
'' >>> ssock.seek(0)
>>> ssock.read(15)
'<grammar><ref i' >>> ssock.read(15) "d='bit'><p>0</p" >>> ssock.read() '><p>1</p></ref></grammar>' >>> ssock.close()
>>> contents = "<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>" >>> ssock = StringIO.StringIO(contents) >>> xmldoc = minidom.parse(ssock)>>> ssock.close() >>> print xmldoc.toxml() <?xml version="1.0" ?> <grammar><ref id="bit"><p>0</p><p>1</p></ref></grammar>
So now you know how to use a single function, minidom.parse, to parse an XML document stored on a web page, in a local file, or in a hard-coded string. For a web page, you use urlopen to get a file-like object; for a local file, you use open; and for a string, you use StringIO. Now let's take it one step further and generalize these differences as well.
def openAnything(source):# try to open with urllib (if source is http, ftp, or file URL) import urllib try: return urllib.urlopen(source)
except (IOError, OSError): pass # try to open with native open function (if source is pathname) try: return open(source)
except (IOError, OSError): pass # treat source as string import StringIO return StringIO.StringIO(str(source))
![]() | The openAnything function takes a single parameter, source, and returns a file-like object. source is a string of some sort; it can either be a URL (like 'http://slashdot.org/slashdot.rdf'), a full or partial pathname to a local file (like 'binary.xml'), or a string that contains actual XML data to be parsed. |
![]() | First, you see if source is a URL. You do this through brute force: you try to open it as a URL and silently ignore errors caused by trying to open something which is not a URL. This is actually elegant in the sense that, if urllib ever supports new types of URLs in the future, you will also support them without recoding. If urllib is able to open source, then the return kicks you out of the function immediately and the following try statements never execute. |
![]() | On the other hand, if urllib yelled at you and told you that source wasn't a valid URL, you assume it's a path to a file on disk and try to open it. Again, you don't do anything fancy to check whether source is a valid filename or not (the rules for valid filenames vary wildly between different platforms anyway, so you'd probably get them wrong anyway). Instead, you just blindly open the file, and silently trap any errors. |
![]() | By this point, you need to assume that source is a string that has hard-coded data in it (since nothing else worked), so you use StringIO to create a file-like object out of it and return that. (In fact, since you're using the str function, source doesn't even need to be a string; it could be any object, and you'll use its string representation, as defined by its __str__ special method.) |
Now you can use this openAnything function in conjunction with minidom.parse to make a function that takes a source that refers to an XML document somehow (either as a URL, or a local filename, or a hard-coded XML document in a string) and parses it.
UNIX users are already familiar with the concept of standard input, standard output, and standard error. This section is for the rest of you.
Standard output and standard error (commonly abbreviated stdout and stderr) are pipes that are built into every UNIX system. When you print something, it goes to the stdout pipe; when your program crashes and prints out debugging information (like a traceback in Python), it goes to the stderr pipe. Both of these pipes are ordinarily just connected to the terminal window where you are working, so when a program prints, you see the output, and when a program crashes, you see the debugging information. (If you're working on a system with a window-based Python IDE, stdout and stderr default to your “Interactive Window”.)
>>> for i in range(3): ... print 'Dive in'Dive in Dive in Dive in >>> import sys >>> for i in range(3): ... sys.stdout.write('Dive in')
Dive inDive inDive in >>> for i in range(3): ... sys.stderr.write('Dive in')
Dive inDive inDive in
![]() | As you saw in Example 6.9, “Simple Counters”, you can use Python's built-in range function to build simple counter loops that repeat something a set number of times. |
![]() | stdout is a file-like object; calling its write function will print out whatever string you give it. In fact, this is what the print function really does; it adds a carriage return to the end of the string you're printing, and calls sys.stdout.write. |
![]() | In the simplest case, stdout and stderr send their output to the same place: the Python IDE (if you're in one), or the terminal (if you're running Python from the command line). Like stdout, stderr does not add carriage returns for you; if you want them, add them yourself. |
stdout and stderr are both file-like objects, like the ones you discussed in Section 10.1, “Abstracting input sources”, but they are both write-only. They have no read method, only write. Still, they are file-like objects, and you can assign any other file- or file-like object to them to redirect their output.
[you@localhost kgp]$ python stdout.py Dive in [you@localhost kgp]$ cat out.log This message will be logged instead of displayed
(On Windows, you can use type instead of cat to display the contents of a file.)
If you have not already done so, you can download this and other examples used in this book.
#stdout.py import sys print 'Dive in'saveout = sys.stdout
fsock = open('out.log', 'w')
sys.stdout = fsock
print 'This message will be logged instead of displayed'
sys.stdout = saveout
fsock.close()
![]()
Redirecting stderr works exactly the same way, using sys.stderr instead of sys.stdout.
[you@localhost kgp]$ python stderr.py [you@localhost kgp]$ cat error.log Traceback (most recent line last): File "stderr.py", line 5, in ? raise Exception, 'this error will be logged' Exception: this error will be logged
If you have not already done so, you can download this and other examples used in this book.
#stderr.py import sys fsock = open('error.log', 'w')sys.stderr = fsock
raise Exception, 'this error will be logged'
![]()
![]()
Since it is so common to write error messages to standard error, there is a shorthand syntax that can be used instead of going through the hassle of redirecting it outright.
>>> print 'entering function' entering function >>> import sys >>> print >> sys.stderr, 'entering function'entering function
Standard input, on the other hand, is a read-only file object, and it represents the data flowing into the program from some previous program. This will likely not make much sense to classic Mac OS users, or even Windows users unless you were ever fluent on the MS-DOS command line. The way it works is that you can construct a chain of commands in a single line, so that one program's output becomes the input for the next program in the chain. The first program simply outputs to standard output (without doing any special redirecting itself, just doing normal print statements or whatever), and the next program reads from standard input, and the operating system takes care of connecting one program's output to the next program's input.
[you@localhost kgp]$ python kgp.py -g binary.xml01100111 [you@localhost kgp]$ cat binary.xml
<?xml version="1.0"?> <!DOCTYPE grammar PUBLIC "-//diveintopython.org//DTD Kant Generator Pro v1.0//EN" "kgp.dtd"> <grammar> <ref id="bit"> <p>0</p> <p>1</p> </ref> <ref id="byte"> <p><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/>\ <xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/></p> </ref> </grammar> [you@localhost kgp]$ cat binary.xml | python kgp.py -g -
![]()
10110001
![]() | As you saw in Section 9.1, “Diving in”, this will print a string of eight random bits, 0 or 1. |
![]() | This simply prints out the entire contents of binary.xml. (Windows users should use type instead of cat.) |
![]() | This prints the contents of binary.xml, but the “|” character, called the “pipe” character, means that the contents will not be printed to the screen. Instead, they will become the standard input of the next command, which in this case calls your Python script. |
![]() | Instead of specifying a module (like binary.xml), you specify “-”, which causes your script to load the grammar from standard input instead of from a specific file on disk. (More on how this happens in the next example.) So the effect is the same as the first syntax, where you specified the grammar filename directly, but think of the expansion possibilities here. Instead of simply doing cat binary.xml, you could run a script that dynamically generates the grammar, then you can pipe it into your script. It could come from anywhere: a database, or some grammar-generating meta-script, or whatever. The point is that you don't need to change your kgp.py script at all to incorporate any of this functionality. All you need to do is be able to take grammar files from standard input, and you can separate all the other logic into another program. |
So how does the script “know” to read from standard input when the grammar file is “-”? It's not magic; it's just code.
def openAnything(source): if source == "-":import sys return sys.stdin # try to open with urllib (if source is http, ftp, or file URL) import urllib try: [... snip ...]
![]() | This is the openAnything function from toolbox.py, which you previously examined in Section 10.1, “Abstracting input sources”. All you've done is add three lines of code at the beginning of the function to check if the source is “-”; if so, you return sys.stdin. Really, that's it! Remember, stdin is a file-like object with a read method, so the rest of the code (in kgp.py, where you call openAnything) doesn't change a bit. |
kgp.py employs several tricks which may or may not be useful to you in your XML processing. The first one takes advantage of the consistent structure of the input documents to build a cache of nodes.
A grammar file defines a series of ref elements. Each ref contains one or more p elements, which can contain a lot of different things, including xrefs. Whenever you encounter an xref, you look for a corresponding ref element with the same id attribute, and choose one of the ref element's children and parse it. (You'll see how this random choice is made in the next section.)
This is how you build up the grammar: define ref elements for the smallest pieces, then define ref elements which "include" the first ref elements by using xref, and so forth. Then you parse the "largest" reference and follow each xref, and eventually output real text. The text you output depends on the (random) decisions you make each time you fill in an xref, so the output is different each time.
This is all very flexible, but there is one downside: performance. When you find an xref and need to find the corresponding ref element, you have a problem. The xref has an id attribute, and you want to find the ref element that has that same id attribute, but there is no easy way to do that. The slow way to do it would be to get the entire list of ref elements each time, then manually loop through and look at each id attribute. The fast way is to do that once and build a cache, in the form of a dictionary.
def loadGrammar(self, grammar): self.grammar = self._load(grammar) self.refs = {}for ref in self.grammar.getElementsByTagName("ref"):
self.refs[ref.attributes["id"].value] = ref
![]()
![]() | Start by creating an empty dictionary, self.refs. |
![]() | As you saw in Section 9.5, “Searching for elements”, getElementsByTagName returns a list of all the elements of a particular name. You easily can get a list of all the ref elements, then simply loop through that list. |
![]() | As you saw in Section 9.6, “Accessing element attributes”, you can access individual attributes of an element by name, using standard dictionary syntax. So the keys of the self.refs dictionary will be the values of the id attribute of each ref element. |
![]() | The values of the self.refs dictionary will be the ref elements themselves. As you saw in Section 9.3, “Parsing XML”, each element, each node, each comment, each piece of text in a parsed XML document is an object. |
Once you build this cache, whenever you come across an xref and need to find the ref element with the same id attribute, you can simply look it up in self.refs.
def do_xref(self, node): id = node.attributes["id"].value self.parse(self.randomChildElement(self.refs[id]))
You'll explore the randomChildElement function in the next section.
Another useful techique when parsing XML documents is finding all the direct child elements of a particular element. For instance, in the grammar files, a ref element can have several p elements, each of which can contain many things, including other p elements. You want to find just the p elements that are children of the ref, not p elements that are children of other p elements.
You might think you could simply use getElementsByTagName for this, but you can't. getElementsByTagName searches recursively and returns a single list for all the elements it finds. Since p elements can contain other p elements, you can't use getElementsByTagName, because it would return nested p elements that you don't want. To find only direct child elements, you'll need to do it yourself.
def randomChildElement(self, node): choices = [e for e in node.childNodes if e.nodeType == e.ELEMENT_NODE]![]()
![]()
chosen = random.choice(choices)
return chosen
![]() | As you saw in Example 9.9, “Getting child nodes”, the childNodes attribute returns a list of all the child nodes of an element. |
![]() | However, as you saw in Example 9.11, “Child nodes can be text”, the list returned by childNodes contains all different types of nodes, including text nodes. That's not what you're looking for here. You only want the children that are elements. |
![]() | Each node has a nodeType attribute, which can be ELEMENT_NODE, TEXT_NODE, COMMENT_NODE, or any number of other values. The complete list of possible values is in the __init__.py file in the xml.dom package. (See Section 9.2, “Packages” for more on packages.) But you're just interested in nodes that are elements, so you can filter the list to only include those nodes whose nodeType is ELEMENT_NODE. |
![]() | Once you have a list of actual elements, choosing a random one is easy. Python comes with a module called random which includes several useful functions. The random.choice function takes a list of any number of items and returns a random item. For example, if the ref elements contains several p elements, then choices would be a list of p elements, and chosen would end up being assigned exactly one of them, selected at random. |
The third useful XML processing tip involves separating your code into logical functions, based on node types and element names. Parsed XML documents are made up of various types of nodes, each represented by a Python object. The root level of the document itself is represented by a Document object. The Document then contains one or more Element objects (for actual XML tags), each of which may contain other Element objects, Text objects (for bits of text), or Comment objects (for embedded comments). Python makes it easy to write a dispatcher to separate the logic for each node type.
>>> from xml.dom import minidom >>> xmldoc = minidom.parse('kant.xml')>>> xmldoc <xml.dom.minidom.Document instance at 0x01359DE8> >>> xmldoc.__class__
<class xml.dom.minidom.Document at 0x01105D40> >>> xmldoc.__class__.__name__
'Document'
![]() | Assume for a moment that kant.xml is in the current directory. |
![]() | As you saw in Section 9.2, “Packages”, the object returned by parsing an XML document is a Document object, as defined in the minidom.py in the xml.dom package. As you saw in Section 5.4, “Instantiating Classes”, __class__ is built-in attribute of every Python object. |
![]() | Furthermore, __name__ is a built-in attribute of every Python class, and it is a string. This string is not mysterious; it's the same as the class name you type when you define a class yourself. (See Section 5.3, “Defining Classes”.) |
Fine, so now you can get the class name of any particular XML node (since each XML node is represented as a Python object). How can you use this to your advantage to separate the logic of parsing each node type? The answer is getattr, which you first saw in Section 4.4, “Getting Object References With getattr”.
def parse(self, node): parseMethod = getattr(self, "parse_%s" % node.__class__.__name__)![]()
parseMethod(node)
def parse_Document(self, node):self.parse(node.documentElement) def parse_Text(self, node):
text = node.data if self.capitalizeNextWord: self.pieces.append(text[0].upper()) self.pieces.append(text[1:]) self.capitalizeNextWord = 0 else: self.pieces.append(text) def parse_Comment(self, node):
pass def parse_Element(self, node):
handlerMethod = getattr(self, "do_%s" % node.tagName) handlerMethod(node)
In this example, the dispatch functions parse and parse_Element simply find other methods in the same class. If your processing is very complex (or you have many different tag names), you could break up your code into separate modules, and use dynamic importing to import each module and call whatever functions you needed. Dynamic importing will be discussed in Chapter 16, Functional Programming.
Python fully supports creating programs that can be run on the command line, complete with command-line arguments and either short- or long-style flags to specify various options. None of this is XML-specific, but this script makes good use of command-line processing, so it seemed like a good time to mention it.
It's difficult to talk about command-line processing without understanding how command-line arguments are exposed to your Python program, so let's write a simple program to see them.
If you have not already done so, you can download this and other examples used in this book.
#argecho.py import sys for arg in sys.argv:print arg
[you@localhost py]$ python argecho.pyargecho.py [you@localhost py]$ python argecho.py abc def
argecho.py abc def [you@localhost py]$ python argecho.py --help
argecho.py --help [you@localhost py]$ python argecho.py -m kant.xml
argecho.py -m kant.xml
![]() | The first thing to know about sys.argv is that it contains the name of the script you're calling. You will actually use this knowledge to your advantage later, in Chapter 16, Functional Programming. Don't worry about it for now. |
![]() | Command-line arguments are separated by spaces, and each shows up as a separate element in the sys.argv list. |
![]() | Command-line flags, like --help, also show up as their own element in the sys.argv list. |
![]() | To make things even more interesting, some command-line flags themselves take arguments. For instance, here you have a flag (-m) which takes an argument (kant.xml). Both the flag itself and the flag's argument are simply sequential elements in the sys.argv list. No attempt is made to associate one with the other; all you get is a list. |
So as you can see, you certainly have all the information passed on the command line, but then again, it doesn't look like it's going to be all that easy to actually use it. For simple programs that only take a single argument and have no flags, you can simply use sys.argv[1] to access the argument. There's no shame in this; I do it all the time. For more complex programs, you need the getopt module.
def main(argv): grammar = "kant.xml"try: opts, args = getopt.getopt(argv, "hg:d", ["help", "grammar="])
except getopt.GetoptError:
usage()
sys.exit(2) ... if __name__ == "__main__": main(sys.argv[1:])
So what are all those parameters you pass to the getopt function? Well, the first one is simply the raw list of command-line flags and arguments (not including the first element, the script name, which you already chopped off before calling the main function). The second is the list of short command-line flags that the script accepts.
The first and third flags are simply standalone flags; you specify them or you don't, and they do things (print help) or change state (turn on debugging). However, the second flag (-g) must be followed by an argument, which is the name of the grammar file to read from. In fact it can be a filename or a web address, and you don't know which yet (you'll figure it out later), but you know it has to be something. So you tell getopt this by putting a colon after the g in that second parameter to the getopt function.
To further complicate things, the script accepts either short flags (like -h) or long flags (like --help), and you want them to do the same thing. This is what the third parameter to getopt is for, to specify a list of the long flags that correspond to the short flags you specified in the second parameter.
Three things of note here:
Confused yet? Let's look at the actual code and see if it makes sense in context.
def main(argv):grammar = "kant.xml" try: opts, args = getopt.getopt(argv, "hg:d", ["help", "grammar="]) except getopt.GetoptError: usage() sys.exit(2) for opt, arg in opts:
if opt in ("-h", "--help"):
usage() sys.exit() elif opt == '-d':
global _debug _debug = 1 elif opt in ("-g", "--grammar"):
grammar = arg source = "".join(args)
k = KantGenerator(grammar, source) print k.output()
You've covered a lot of ground. Let's step back and see how all the pieces fit together.
To start with, this is a script that takes its arguments on the command line, using the getopt module.
def main(argv): ... try: opts, args = getopt.getopt(argv, "hg:d", ["help", "grammar="]) except getopt.GetoptError: ... for opt, arg in opts: ...
You create a new instance of the KantGenerator class, and pass it the grammar file and source that may or may not have been specified on the command line.
k = KantGenerator(grammar, source)
The KantGenerator instance automatically loads the grammar, which is an XML file. You use your custom openAnything function to open the file (which could be stored in a local file or a remote web server), then use the built-in minidom parsing functions to parse the XML into a tree of Python objects.
def _load(self, source): sock = toolbox.openAnything(source) xmldoc = minidom.parse(sock).documentElement sock.close()
Oh, and along the way, you take advantage of your knowledge of the structure of the XML document to set up a little cache of references, which are just elements in the XML document.
def loadGrammar(self, grammar): for ref in self.grammar.getElementsByTagName("ref"): self.refs[ref.attributes["id"].value] = ref
If you specified some source material on the command line, you use that; otherwise you rip through the grammar looking for the "top-level" reference (that isn't referenced by anything else) and use that as a starting point.
def getDefaultSource(self): xrefs = {} for xref in self.grammar.getElementsByTagName("xref"): xrefs[xref.attributes["id"].value] = 1 xrefs = xrefs.keys() standaloneXrefs = [e for e in self.refs.keys() if e not in xrefs] return '<xref id="%s"/>' % random.choice(standaloneXrefs)
Now you rip through the source material. The source material is also XML, and you parse it one node at a time. To keep the code separated and more maintainable, you use separate handlers for each node type.
def parse_Element(self, node): handlerMethod = getattr(self, "do_%s" % node.tagName) handlerMethod(node)
You bounce through the grammar, parsing all the children of each p element,
def do_p(self, node): ... if doit: for child in node.childNodes: self.parse(child)
replacing choice elements with a random child,
def do_choice(self, node): self.parse(self.randomChildElement(node))
and replacing xref elements with a random child of the corresponding ref element, which you previously cached.
def do_xref(self, node): id = node.attributes["id"].value self.parse(self.randomChildElement(self.refs[id]))
Eventually, you parse your way down to plain text,
def parse_Text(self, node): text = node.data ... self.pieces.append(text)
which you print out.
def main(argv): ... k = KantGenerator(grammar, source) print k.output()
Python comes with powerful libraries for parsing and manipulating XML documents. The minidom takes an XML file and parses it into Python objects, providing for random access to arbitrary elements. Furthermore, this chapter shows how Python can be used to create a "real" standalone command-line script, complete with command-line flags, command-line arguments, error handling, even the ability to take input from the piped result of a previous program.
Before moving on to the next chapter, you should be comfortable doing all of these things:
You've learned about HTML processing and XML processing, and along the way you saw how to download a web page and how to parse XML from a URL, but let's dive into the more general topic of HTTP web services.
Simply stated, HTTP web services are programmatic ways of sending and receiving data from remote servers using the operations of HTTP directly. If you want to get data from the server, use a straight HTTP GET; if you want to send new data to the server, use HTTP POST. (Some more advanced HTTP web service APIs also define ways of modifying existing data and deleting data, using HTTP PUT and HTTP DELETE.) In other words, the “verbs” built into the HTTP protocol (GET, POST, PUT, and DELETE) map directly to application-level operations for receiving, sending, modifying, and deleting data.
The main advantage of this approach is simplicity, and its simplicity has proven popular with a lot of different sites. Data -- usually XML data -- can be built and stored statically, or generated dynamically by a server-side script, and all major languages include an HTTP library for downloading it. Debugging is also easier, because you can load up the web service in any web browser and see the raw data. Modern browsers will even nicely format and pretty-print XML data for you, to allow you to quickly navigate through it.
Examples of pure XML-over-HTTP web services:
In later chapters, you'll explore APIs which use HTTP as a transport for sending and receiving data, but don't map application semantics to the underlying HTTP semantics. (They tunnel everything over HTTP POST.) But this chapter will concentrate on using HTTP GET to get data from a remote server, and you'll explore several HTTP features you can use to get the maximum benefit out of pure HTTP web services.
Here is a more advanced version of the openanything module that you saw in the previous chapter:
If you have not already done so, you can download this and other examples used in this book.
import urllib2, urlparse, gzip from StringIO import StringIO USER_AGENT = 'OpenAnything/1.0 +http://diveintopython.org/http_web_services/' class SmartRedirectHandler(urllib2.HTTPRedirectHandler): def http_error_301(self, req, fp, code, msg, headers): result = urllib2.HTTPRedirectHandler.http_error_301( self, req, fp, code, msg, headers) result.status = code return result def http_error_302(self, req, fp, code, msg, headers): result = urllib2.HTTPRedirectHandler.http_error_302( self, req, fp, code, msg, headers) result.status = code return result class DefaultErrorHandler(urllib2.HTTPDefaultErrorHandler): def http_error_default(self, req, fp, code, msg, headers): result = urllib2.HTTPError( req.get_full_url(), code, msg, headers, fp) result.status = code return result def openAnything(source, etag=None, lastmodified=None, agent=USER_AGENT): '''URL, filename, or string --> stream This function lets you define parsers that take any input source (URL, pathname to local or network file, or actual data as a string) and deal with it in a uniform manner. Returned object is guaranteed to have all the basic stdio read methods (read, readline, readlines). Just .close() the object when you're done with it. If the etag argument is supplied, it will be used as the value of an If-None-Match request header. If the lastmodified argument is supplied, it must be a formatted date/time string in GMT (as returned in the Last-Modified header of a previous request). The formatted date/time will be used as the value of an If-Modified-Since request header. If the agent argument is supplied, it will be used as the value of a User-Agent request header. ''' if hasattr(source, 'read'): return source if source == '-': return sys.stdin if urlparse.urlparse(source)[0] == 'http': # open URL with urllib2 request = urllib2.Request(source) request.add_header('User-Agent', agent) if etag: request.add_header('If-None-Match', etag) if lastmodified: request.add_header('If-Modified-Since', lastmodified) request.add_header('Accept-encoding', 'gzip') opener = urllib2.build_opener(SmartRedirectHandler(), DefaultErrorHandler()) return opener.open(request) # try to open with native open function (if source is a filename) try: return open(source) except (IOError, OSError): pass # treat source as string return StringIO(str(source)) def fetch(source, etag=None, last_modified=None, agent=USER_AGENT): '''Fetch data and metadata from a URL, file, stream, or string''' result = {} f = openAnything(source, etag, last_modified, agent) result['data'] = f.read() if hasattr(f, 'headers'): # save ETag, if the server sent one result['etag'] = f.headers.get('ETag') # save Last-Modified header, if the server sent one result['lastmodified'] = f.headers.get('Last-Modified') if f.headers.get('content-encoding', '') == 'gzip': # data came back gzip-compressed, decompress it result['data'] = gzip.GzipFile(fileobj=StringIO(result['data']])).read() if hasattr(f, 'url'): result['url'] = f.url result['status'] = 200 if hasattr(f, 'status'): result['status'] = f.status f.close() return result
Let's say you want to download a resource over HTTP, such as a syndicated Atom feed. But you don't just want to download it once; you want to download it over and over again, every hour, to get the latest news from the site that's offering the news feed. Let's do it the quick-and-dirty way first, and then see how you can do better.
>>> import urllib >>> data = urllib.urlopen('http://diveintomark.org/xml/atom.xml').read()>>> print data <?xml version="1.0" encoding="iso-8859-1"?> <feed version="0.3" xmlns="http://purl.org/atom/ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xml:lang="en"> <title mode="escaped">dive into mark</title> <link rel="alternate" type="text/html" href="http://diveintomark.org/"/> <-- rest of feed omitted for brevity -->
So what's wrong with this? Well, for a quick one-off during testing or development, there's nothing wrong with it. I do it all the time. I wanted the contents of the feed, and I got the contents of the feed. The same technique works for any web page. But once you start thinking in terms of a web service that you want to access on a regular basis -- and remember, you said you were planning on retrieving this syndicated feed once an hour -- then you're being inefficient, and you're being rude.
Let's talk about some of the basic features of HTTP.
There are five important features of HTTP which you should support.
The User-Agent is simply a way for a client to tell a server who it is when it requests a web page, a syndicated feed, or any sort of web service over HTTP. When the client requests a resource, it should always announce who it is, as specifically as possible. This allows the server-side administrator to get in touch with the client-side developer if anything is going fantastically wrong.
By default, Python sends a generic User-Agent: Python-urllib/1.15. In the next section, you'll see how to change this to something more specific.
Sometimes resources move around. Web sites get reorganized, pages move to new addresses. Even web services can reorganize. A syndicated feed at http://example.com/index.xml might be moved to http://example.com/xml/atom.xml. Or an entire domain might move, as an organization expands and reorganizes; for instance, http://www.example.com/index.xml might be redirected to http://server-farm-1.example.com/index.xml.
Every time you request any kind of resource from an HTTP server, the server includes a status code in its response. Status code 200 means “everything's normal, here's the page you asked for”. Status code 404 means “page not found”. (You've probably seen 404 errors while browsing the web.)
HTTP has two different ways of signifying that a resource has moved. Status code 302 is a temporary redirect; it means “oops, that got moved over here temporarily” (and then gives the temporary address in a Location: header). Status code 301 is a permanent redirect; it means “oops, that got moved permanently” (and then gives the new address in a Location: header). If you get a 302 status code and a new address, the HTTP specification says you should use the new address to get what you asked for, but the next time you want to access the same resource, you should retry the old address. But if you get a 301 status code and a new address, you're supposed to use the new address from then on.
urllib.urlopen will automatically “follow” redirects when it receives the appropriate status code from the HTTP server, but unfortunately, it doesn't tell you when it does so. You'll end up getting data you asked for, but you'll never know that the underlying library “helpfully” followed a redirect for you. So you'll continue pounding away at the old address, and each time you'll get redirected to the new address. That's two round trips instead of one: not very efficient! Later in this chapter, you'll see how to work around this so you can deal with permanent redirects properly and efficiently.
Some data changes all the time. The home page of CNN.com is constantly updating every few minutes. On the other hand, the home page of Google.com only changes once every few weeks (when they put up a special holiday logo, or advertise a new service). Web services are no different; usually the server knows when the data you requested last changed, and HTTP provides a way for the server to include this last-modified date along with the data you requested.
If you ask for the same data a second time (or third, or fourth), you can tell the server the last-modified date that you got last time: you send an If-Modified-Since header with your request, with the date you got back from the server last time. If the data hasn't changed since then, the server sends back a special HTTP status code 304, which means “this data hasn't changed since the last time you asked for it”. Why is this an improvement? Because when the server sends a 304, it doesn't re-send the data. All you get is the status code. So you don't need to download the same data over and over again if it hasn't changed; the server assumes you have the data cached locally.
All modern web browsers support last-modified date checking. If you've ever visited a page, re-visited the same page a day later and found that it hadn't changed, and wondered why it loaded so quickly the second time -- this could be why. Your web browser cached the contents of the page locally the first time, and when you visited the second time, your browser automatically sent the last-modified date it got from the server the first time. The server simply says 304: Not Modified, so your browser knows to load the page from its cache. Web services can be this smart too.
Python's URL library has no built-in support for last-modified date checking, but since you can add arbitrary headers to each request and read arbitrary headers in each response, you can add support for it yourself.
ETags are an alternate way to accomplish the same thing as the last-modified date checking: don't re-download data that hasn't changed. The way it works is, the server sends some sort of hash of the data (in an ETag header) along with the data you requested. Exactly how this hash is determined is entirely up to the server. The second time you request the same data, you include the ETag hash in an If-None-Match: header, and if the data hasn't changed, the server will send you back a 304 status code. As with the last-modified date checking, the server just sends the 304; it doesn't send you the same data a second time. By including the ETag hash in your second request, you're telling the server that there's no need to re-send the same data if it still matches this hash, since you still have the data from the last time.
Python's URL library has no built-in support for ETags, but you'll see how to add it later in this chapter.
The last important HTTP feature is gzip compression. When you talk about HTTP web services, you're almost always talking about moving XML back and forth over the wire. XML is text, and quite verbose text at that, and text generally compresses well. When you request a resource over HTTP, you can ask the server that, if it has any new data to send you, to please send it in compressed format. You include the Accept-encoding: gzip header in your request, and if the server supports compression, it will send you back gzip-compressed data and mark it with a Content-encoding: gzip header.
Python's URL library has no built-in support for gzip compression per se, but you can add arbitrary headers to the request. And Python comes with a separate gzip module, which has functions you can use to decompress the data yourself.
Note that our little one-line script to download a syndicated feed did not support any of these HTTP features. Let's see how you can improve it.
First, let's turn on the debugging features of Python's HTTP library and see what's being sent over the wire. This will be useful throughout the chapter, as you add more and more features.
>>> import httplib >>> httplib.HTTPConnection.debuglevel = 1>>> import urllib >>> feeddata = urllib.urlopen('http://diveintomark.org/xml/atom.xml').read() connect: (diveintomark.org, 80)
send: ' GET /xml/atom.xml HTTP/1.0
Host: diveintomark.org
User-agent: Python-urllib/1.15
' reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Wed, 14 Apr 2004 22:27:30 GMT header: Server: Apache/2.0.49 (Debian GNU/Linux) header: Content-Type: application/atom+xml header: Last-Modified: Wed, 14 Apr 2004 22:14:38 GMT
header: ETag: "e8284-68e0-4de30f80"
header: Accept-Ranges: bytes header: Content-Length: 26848 header: Connection: close
The first step to improving your HTTP web services client is to identify yourself properly with a User-Agent. To do that, you need to move beyond the basic urllib and dive into urllib2.
>>> import httplib >>> httplib.HTTPConnection.debuglevel = 1>>> import urllib2 >>> request = urllib2.Request('http://diveintomark.org/xml/atom.xml')
>>> opener = urllib2.build_opener()
>>> feeddata = opener.open(request).read()
connect: (diveintomark.org, 80) send: ' GET /xml/atom.xml HTTP/1.0 Host: diveintomark.org User-agent: Python-urllib/2.1 ' reply: 'HTTP/1.1 200 OK\r\n' header: Date: Wed, 14 Apr 2004 23:23:12 GMT header: Server: Apache/2.0.49 (Debian GNU/Linux) header: Content-Type: application/atom+xml header: Last-Modified: Wed, 14 Apr 2004 22:14:38 GMT header: ETag: "e8284-68e0-4de30f80" header: Accept-Ranges: bytes header: Content-Length: 26848 header: Connection: close
![]() | If you still have your Python IDE open from the previous section's example, you can skip this, but this turns on HTTP debugging so you can see what you're actually sending over the wire, and what gets sent back. |
![]() | Fetching an HTTP resource with urllib2 is a three-step process, for good reasons that will become clear shortly. The first step is to create a Request object, which takes the URL of the resource you'll eventually get around to retrieving. Note that this step doesn't actually retrieve anything yet. |
![]() | The second step is to build a URL opener. This can take any number of handlers, which control how responses are handled. But you can also build an opener without any custom handlers, which is what you're doing here. You'll see how to define and use custom handlers later in this chapter when you explore redirects. |
![]() | The final step is to tell the opener to open the URL, using the Request object you created. As you can see from all the debugging information that gets printed, this step actually retrieves the resource and stores the returned data in feeddata. |
>>> request<urllib2.Request instance at 0x00250AA8> >>> request.get_full_url() http://diveintomark.org/xml/atom.xml >>> request.add_header('User-Agent', ... 'OpenAnything/1.0 +http://diveintopython.org/')
>>> feeddata = opener.open(request).read()
connect: (diveintomark.org, 80) send: ' GET /xml/atom.xml HTTP/1.0 Host: diveintomark.org User-agent: OpenAnything/1.0 +http://diveintopython.org/
' reply: 'HTTP/1.1 200 OK\r\n' header: Date: Wed, 14 Apr 2004 23:45:17 GMT header: Server: Apache/2.0.49 (Debian GNU/Linux) header: Content-Type: application/atom+xml header: Last-Modified: Wed, 14 Apr 2004 22:14:38 GMT header: ETag: "e8284-68e0-4de30f80" header: Accept-Ranges: bytes header: Content-Length: 26848 header: Connection: close
Now that you know how to add custom HTTP headers to your web service requests, let's look at adding support for Last-Modified and ETag headers.
These examples show the output with debugging turned off. If you still have it turned on from the previous section, you can turn it off by setting httplib.HTTPConnection.debuglevel = 0. Or you can just leave debugging on, if that helps you.
>>> import urllib2 >>> request = urllib2.Request('http://diveintomark.org/xml/atom.xml') >>> opener = urllib2.build_opener() >>> firstdatastream = opener.open(request) >>> firstdatastream.headers.dict{'date': 'Thu, 15 Apr 2004 20:42:41 GMT', 'server': 'Apache/2.0.49 (Debian GNU/Linux)', 'content-type': 'application/atom+xml', 'last-modified': 'Thu, 15 Apr 2004 19:45:21 GMT', 'etag': '"e842a-3e53-55d97640"', 'content-length': '15955', 'accept-ranges': 'bytes', 'connection': 'close'} >>> request.add_header('If-Modified-Since', ... firstdatastream.headers.get('Last-Modified'))
>>> seconddatastream = opener.open(request)
Traceback (most recent call last): File "<stdin>", line 1, in ? File "c:\python23\lib\urllib2.py", line 326, in open '_open', req) File "c:\python23\lib\urllib2.py", line 306, in _call_chain result = func(*args) File "c:\python23\lib\urllib2.py", line 901, in http_open return self.do_open(httplib.HTTP, req) File "c:\python23\lib\urllib2.py", line 895, in do_open return self.parent.error('http', req, fp, code, msg, hdrs) File "c:\python23\lib\urllib2.py", line 352, in error return self._call_chain(*args) File "c:\python23\lib\urllib2.py", line 306, in _call_chain result = func(*args) File "c:\python23\lib\urllib2.py", line 412, in http_error_default raise HTTPError(req.get_full_url(), code, msg, hdrs, fp) urllib2.HTTPError: HTTP Error 304: Not Modified
![]() | Remember all those HTTP headers you saw printed out when you turned on debugging? This is how you can get access to them programmatically: firstdatastream.headers is an object that acts like a dictionary and allows you to get any of the individual headers returned from the HTTP server. |
![]() | On the second request, you add the If-Modified-Since header with the last-modified date from the first request. If the data hasn't changed, the server should return a 304 status code. |
![]() | Sure enough, the data hasn't changed. You can see from the traceback that urllib2 throws a special exception, HTTPError, in response to the 304 status code. This is a little unusual, and not entirely helpful. After all, it's not an error; you specifically asked the server not to send you any data if it hadn't changed, and the data didn't change, so the server told you it wasn't sending you any data. That's not an error; that's exactly what you were hoping for. |
urllib2 also raises an HTTPError exception for conditions that you would think of as errors, such as 404 (page not found). In fact, it will raise HTTPError for any status code other than 200 (OK), 301 (permanent redirect), or 302 (temporary redirect). It would be more helpful for your purposes to capture the status code and simply return it, without throwing an exception. To do that, you'll need to define a custom URL handler.
This custom URL handler is part of openanything.py.
class DefaultErrorHandler(urllib2.HTTPDefaultErrorHandler):def http_error_default(self, req, fp, code, msg, headers):
result = urllib2.HTTPError( req.get_full_url(), code, msg, headers, fp) result.status = code
return result
![]() | urllib2 is designed around URL handlers. Each handler is just a class that can define any number of methods. When something happens -- like an HTTP error, or even a 304 code -- urllib2 introspects into the list of defined handlers for a method that can handle it. You used a similar introspection in Chapter 9, XML Processing to define handlers for different node types, but urllib2 is more flexible, and introspects over as many handlers as are defined for the current request. |
![]() | urllib2 searches through the defined handlers and calls the http_error_default method when it encounters a 304 status code from the server. By defining a custom error handler, you can prevent urllib2 from raising an exception. Instead, you create the HTTPError object, but return it instead of raising it. |
![]() | This is the key part: before returning, you save the status code returned by the HTTP server. This will allow you easy access to it from the calling program. |
>>> request.headers{'If-modified-since': 'Thu, 15 Apr 2004 19:45:21 GMT'} >>> import openanything >>> opener = urllib2.build_opener( ... openanything.DefaultErrorHandler())
>>> seconddatastream = opener.open(request) >>> seconddatastream.status
304 >>> seconddatastream.read()
''
Handling ETag works much the same way, but instead of checking for Last-Modified and sending If-Modified-Since, you check for ETag and send If-None-Match. Let's start with a fresh IDE session.
>>> import urllib2, openanything >>> request = urllib2.Request('http://diveintomark.org/xml/atom.xml') >>> opener = urllib2.build_opener( ... openanything.DefaultErrorHandler()) >>> firstdatastream = opener.open(request) >>> firstdatastream.headers.get('ETag')'"e842a-3e53-55d97640"' >>> firstdata = firstdatastream.read() >>> print firstdata
<?xml version="1.0" encoding="iso-8859-1"?> <feed version="0.3" xmlns="http://purl.org/atom/ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xml:lang="en"> <title mode="escaped">dive into mark</title> <link rel="alternate" type="text/html" href="http://diveintomark.org/"/> <-- rest of feed omitted for brevity --> >>> request.add_header('If-None-Match', ... firstdatastream.headers.get('ETag'))
>>> seconddatastream = opener.open(request) >>> seconddatastream.status
304 >>> seconddatastream.read()
''
![]() |
|
In these examples, the HTTP server has supported both Last-Modified and ETag headers, but not all servers do. As a web services client, you should be prepared to support both, but you must code defensively in case a server only supports one or the other, or neither. |
You can support permanent and temporary redirects using a different kind of custom URL handler.
First, let's see why a redirect handler is necessary in the first place.
>>> import urllib2, httplib >>> httplib.HTTPConnection.debuglevel = 1>>> request = urllib2.Request( ... 'http://diveintomark.org/redir/example301.xml')
>>> opener = urllib2.build_opener() >>> f = opener.open(request) connect: (diveintomark.org, 80) send: ' GET /redir/example301.xml HTTP/1.0 Host: diveintomark.org User-agent: Python-urllib/2.1 ' reply: 'HTTP/1.1 301 Moved Permanently\r\n'
header: Date: Thu, 15 Apr 2004 22:06:25 GMT header: Server: Apache/2.0.49 (Debian GNU/Linux) header: Location: http://diveintomark.org/xml/atom.xml
header: Content-Length: 338 header: Connection: close header: Content-Type: text/html; charset=iso-8859-1 connect: (diveintomark.org, 80) send: ' GET /xml/atom.xml HTTP/1.0
Host: diveintomark.org User-agent: Python-urllib/2.1 ' reply: 'HTTP/1.1 200 OK\r\n' header: Date: Thu, 15 Apr 2004 22:06:25 GMT header: Server: Apache/2.0.49 (Debian GNU/Linux) header: Last-Modified: Thu, 15 Apr 2004 19:45:21 GMT header: ETag: "e842a-3e53-55d97640" header: Accept-Ranges: bytes header: Content-Length: 15955 header: Connection: close header: Content-Type: application/atom+xml >>> f.url
'http://diveintomark.org/xml/atom.xml' >>> f.headers.dict {'content-length': '15955', 'accept-ranges': 'bytes', 'server': 'Apache/2.0.49 (Debian GNU/Linux)', 'last-modified': 'Thu, 15 Apr 2004 19:45:21 GMT', 'connection': 'close', 'etag': '"e842a-3e53-55d97640"', 'date': 'Thu, 15 Apr 2004 22:06:25 GMT', 'content-type': 'application/atom+xml'} >>> f.status Traceback (most recent call last): File "<stdin>", line 1, in ? AttributeError: addinfourl instance has no attribute 'status'
This is suboptimal, but easy to fix. urllib2 doesn't behave exactly as you want it to when it encounters a 301 or 302, so let's override its behavior. How? With a custom URL handler, just like you did to handle 304 codes.
This class is defined in openanything.py.
class SmartRedirectHandler(urllib2.HTTPRedirectHandler):def http_error_301(self, req, fp, code, msg, headers): result = urllib2.HTTPRedirectHandler.http_error_301(
self, req, fp, code, msg, headers) result.status = code
return result def http_error_302(self, req, fp, code, msg, headers):
result = urllib2.HTTPRedirectHandler.http_error_302( self, req, fp, code, msg, headers) result.status = code return result
So what has this bought us? You can now build a URL opener with the custom redirect handler, and it will still automatically follow redirects, but now it will also expose the redirect status code.
>>> request = urllib2.Request('http://diveintomark.org/redir/example301.xml') >>> import openanything, httplib >>> httplib.HTTPConnection.debuglevel = 1 >>> opener = urllib2.build_opener( ... openanything.SmartRedirectHandler())>>> f = opener.open(request) connect: (diveintomark.org, 80) send: 'GET /redir/example301.xml HTTP/1.0 Host: diveintomark.org User-agent: Python-urllib/2.1 ' reply: 'HTTP/1.1 301 Moved Permanently\r\n'
header: Date: Thu, 15 Apr 2004 22:13:21 GMT header: Server: Apache/2.0.49 (Debian GNU/Linux) header: Location: http://diveintomark.org/xml/atom.xml header: Content-Length: 338 header: Connection: close header: Content-Type: text/html; charset=iso-8859-1 connect: (diveintomark.org, 80) send: ' GET /xml/atom.xml HTTP/1.0 Host: diveintomark.org User-agent: Python-urllib/2.1 ' reply: 'HTTP/1.1 200 OK\r\n' header: Date: Thu, 15 Apr 2004 22:13:21 GMT header: Server: Apache/2.0.49 (Debian GNU/Linux) header: Last-Modified: Thu, 15 Apr 2004 19:45:21 GMT header: ETag: "e842a-3e53-55d97640" header: Accept-Ranges: bytes header: Content-Length: 15955 header: Connection: close header: Content-Type: application/atom+xml >>> f.status
301 >>> f.url 'http://diveintomark.org/xml/atom.xml'
The same redirect handler can also tell you that you shouldn't update your address book.
>>> request = urllib2.Request( ... 'http://diveintomark.org/redir/example302.xml')>>> f = opener.open(request) connect: (diveintomark.org, 80) send: ' GET /redir/example302.xml HTTP/1.0 Host: diveintomark.org User-agent: Python-urllib/2.1 ' reply: 'HTTP/1.1 302 Found\r\n'
header: Date: Thu, 15 Apr 2004 22:18:21 GMT header: Server: Apache/2.0.49 (Debian GNU/Linux) header: Location: http://diveintomark.org/xml/atom.xml header: Content-Length: 314 header: Connection: close header: Content-Type: text/html; charset=iso-8859-1 connect: (diveintomark.org, 80) send: ' GET /xml/atom.xml HTTP/1.0
Host: diveintomark.org User-agent: Python-urllib/2.1 ' reply: 'HTTP/1.1 200 OK\r\n' header: Date: Thu, 15 Apr 2004 22:18:21 GMT header: Server: Apache/2.0.49 (Debian GNU/Linux) header: Last-Modified: Thu, 15 Apr 2004 19:45:21 GMT header: ETag: "e842a-3e53-55d97640" header: Accept-Ranges: bytes header: Content-Length: 15955 header: Connection: close header: Content-Type: application/atom+xml >>> f.status
302 >>> f.url http://diveintomark.org/xml/atom.xml
The last important HTTP feature you want to support is compression. Many web services have the ability to send data compressed, which can cut down the amount of data sent over the wire by 60% or more. This is especially true of XML web services, since XML data compresses very well.
Servers won't give you compressed data unless you tell them you can handle it.
>>> import urllib2, httplib >>> httplib.HTTPConnection.debuglevel = 1 >>> request = urllib2.Request('http://diveintomark.org/xml/atom.xml') >>> request.add_header('Accept-encoding', 'gzip')>>> opener = urllib2.build_opener() >>> f = opener.open(request) connect: (diveintomark.org, 80) send: ' GET /xml/atom.xml HTTP/1.0 Host: diveintomark.org User-agent: Python-urllib/2.1 Accept-encoding: gzip
' reply: 'HTTP/1.1 200 OK\r\n' header: Date: Thu, 15 Apr 2004 22:24:39 GMT header: Server: Apache/2.0.49 (Debian GNU/Linux) header: Last-Modified: Thu, 15 Apr 2004 19:45:21 GMT header: ETag: "e842a-3e53-55d97640" header: Accept-Ranges: bytes header: Vary: Accept-Encoding header: Content-Encoding: gzip
header: Content-Length: 6289
header: Connection: close header: Content-Type: application/atom+xml
>>> compresseddata = f.read()>>> len(compresseddata) 6289 >>> import StringIO >>> compressedstream = StringIO.StringIO(compresseddata)
>>> import gzip >>> gzipper = gzip.GzipFile(fileobj=compressedstream)
>>> data = gzipper.read()
>>> print data
<?xml version="1.0" encoding="iso-8859-1"?> <feed version="0.3" xmlns="http://purl.org/atom/ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xml:lang="en"> <title mode="escaped">dive into mark</title> <link rel="alternate" type="text/html" href="http://diveintomark.org/"/> <-- rest of feed omitted for brevity --> >>> len(data) 15955
![]() | Continuing from the previous example, f is the file-like object returned from the URL opener. Using its read() method would ordinarily get you the uncompressed data, but since this data has been gzip-compressed, this is just the first step towards getting the data you really want. |
![]() | OK, this step is a little bit of messy workaround. Python has a gzip module, which reads (and actually writes) gzip-compressed files on disk. But you don't have a file on disk, you have a gzip-compressed buffer in memory, and you don't want to write out a temporary file just so you can uncompress it. So what you're going to do is create a file-like object out of the in-memory data (compresseddata), using the StringIO module. You first saw the StringIO module in the previous chapter, but now you've found another use for it. |
![]() | Now you can create an instance of GzipFile, and tell it that its “file” is the file-like object compressedstream. |
![]() | This is the line that does all the actual work: “reading” from GzipFile will decompress the data. Strange? Yes, but it makes sense in a twisted kind of way. gzipper is a file-like object which represents a gzip-compressed file. That “file” is not a real file on disk, though; gzipper is really just “reading” from the file-like object you created with StringIO to wrap the compressed data, which is only in memory in the variable compresseddata. And where did that compressed data come from? You originally downloaded it from a remote HTTP server by “reading” from the file-like object you built with urllib2.build_opener. And amazingly, this all just works. Every step in the chain has no idea that the previous step is faking it. |
![]() | Look ma, real data. (15955 bytes of it, in fact.) |
“But wait!” I hear you cry. “This could be even easier!” I know what you're thinking. You're thinking that opener.open returns a file-like object, so why not cut out the StringIO middleman and just pass f directly to GzipFile? OK, maybe you weren't thinking that, but don't worry about it, because it doesn't work.
>>> f = opener.open(request)>>> f.headers.get('Content-Encoding')
'gzip' >>> data = gzip.GzipFile(fileobj=f).read()
Traceback (most recent call last): File "<stdin>", line 1, in ? File "c:\python23\lib\gzip.py", line 217, in read self._read(readsize) File "c:\python23\lib\gzip.py", line 252, in _read pos = self.fileobj.tell() # Save current position AttributeError: addinfourl instance has no attribute 'tell'
You've seen all the pieces for building an intelligent HTTP web services client. Now let's see how they all fit together.
This function is defined in openanything.py.
def openAnything(source, etag=None, lastmodified=None, agent=USER_AGENT): # non-HTTP code omitted for brevity if urlparse.urlparse(source)[0] == 'http':# open URL with urllib2 request = urllib2.Request(source) request.add_header('User-Agent', agent)
if etag: request.add_header('If-None-Match', etag)
if lastmodified: request.add_header('If-Modified-Since', lastmodified)
request.add_header('Accept-encoding', 'gzip')
opener = urllib2.build_opener(SmartRedirectHandler(), DefaultErrorHandler())
return opener.open(request)
![]()
This function is defined in openanything.py.
def fetch(source, etag=None, last_modified=None, agent=USER_AGENT): '''Fetch data and metadata from a URL, file, stream, or string''' result = {} f = openAnything(source, etag, last_modified, agent)result['data'] = f.read()
if hasattr(f, 'headers'): # save ETag, if the server sent one result['etag'] = f.headers.get('ETag')
# save Last-Modified header, if the server sent one result['lastmodified'] = f.headers.get('Last-Modified')
if f.headers.get('content-encoding', '') == 'gzip':
# data came back gzip-compressed, decompress it result['data'] = gzip.GzipFile(fileobj=StringIO(result['data']])).read() if hasattr(f, 'url'):
result['url'] = f.url result['status'] = 200 if hasattr(f, 'status'):
result['status'] = f.status f.close() return result
>>> import openanything >>> useragent = 'MyHTTPWebServicesApp/1.0' >>> url = 'http://diveintopython.org/redir/example301.xml' >>> params = openanything.fetch(url, agent=useragent)>>> params
{'url': 'http://diveintomark.org/xml/atom.xml', 'lastmodified': 'Thu, 15 Apr 2004 19:45:21 GMT', 'etag': '"e842a-3e53-55d97640"', 'status': 301, 'data': '<?xml version="1.0" encoding="iso-8859-1"?> <feed version="0.3" <-- rest of data omitted for brevity -->'} >>> if params['status'] == 301:
... url = params['url'] >>> newparams = openanything.fetch( ... url, params['etag'], params['lastmodified'], useragent)
>>> newparams {'url': 'http://diveintomark.org/xml/atom.xml', 'lastmodified': None, 'etag': '"e842a-3e53-55d97640"', 'status': 304, 'data': ''}
![]()
![]() | The very first time you fetch a resource, you don't have an ETag hash or Last-Modified date, so you'll leave those out. (They're optional parameters.) |
![]() | What you get back is a dictionary of several useful headers, the HTTP status code, and the actual data returned from the server. openanything handles the gzip compression internally; you don't care about that at this level. |
![]() | If you ever get a 301 status code, that's a permanent redirect, and you need to update your URL to the new address. |
![]() | The second time you fetch the same resource, you have all sorts of information to pass back: a (possibly updated) URL, the ETag from the last time, the Last-Modified date from the last time, and of course your User-Agent. |
![]() | What you get back is again a dictionary, but the data hasn't changed, so all you got was a 304 status code and no data. |
The openanything.py and its functions should now make perfect sense.
There are 5 important features of HTTP web services that every client should support:
Chapter 11 focused on document-oriented web services over HTTP. The “input parameter” was the URL, and the “return value” was an actual XML document which it was your responsibility to parse.
This chapter will focus on SOAP web services, which take a more structured approach. Rather than dealing with HTTP requests and XML documents directly, SOAP allows you to simulate calling functions that return native data types. As you will see, the illusion is almost perfect; you can “call” a function through a SOAP library, with the standard Python calling syntax, and the function appears to return Python objects and values. But under the covers, the SOAP library has actually performed a complex transaction involving multiple XML documents and a remote server.
SOAP is a complex specification, and it is somewhat misleading to say that SOAP is all about calling remote functions. Some people would pipe up to add that SOAP allows for one-way asynchronous message passing, and document-oriented web services. And those people would be correct; SOAP can be used that way, and in many different ways. But this chapter will focus on so-called “RPC-style” SOAP -- calling a remote function and getting results back.
You use Google, right? It's a popular search engine. Have you ever wished you could programmatically access Google search results? Now you can. Here is a program to search Google from Python.
from SOAPpy import WSDL # you'll need to configure these two values; # see http://www.google.com/apis/ WSDLFILE = '/path/to/copy/of/GoogleSearch.wsdl' APIKEY = 'YOUR_GOOGLE_API_KEY' _server = WSDL.Proxy(WSDLFILE) def search(q): """Search Google and return list of {title, link, description}""" results = _server.doGoogleSearch( APIKEY, q, 0, 10, False, "", False, "", "utf-8", "utf-8") return [{"title": r.title.encode("utf-8"), "link": r.URL.encode("utf-8"), "description": r.snippet.encode("utf-8")} for r in results.resultElements] if __name__ == '__main__': import sys for r in search(sys.argv[1])[:5]: print r['title'] print r['link'] print r['description'] print
You can import this as a module and use it from a larger program, or you can run the script from the command line. On the command line, you give the search query as a command-line argument, and it prints out the URL, title, and description of the top five Google search results.
Here is the sample output for a search for the word “python”.
C:\diveintopython\common\py> python search.py "python" <b>Python</b> Programming Language http://www.python.org/ Home page for <b>Python</b>, an interpreted, interactive, object-oriented, extensible<br> programming language. <b>...</b> <b>Python</b> is OSI Certified Open Source: OSI Certified. <b>Python</b> Documentation Index http://www.python.org/doc/ <b>...</b> New-style classes (aka descrintro). Regular expressions. Database API. Email Us.<br> docs@<b>python</b>.org. (c) 2004. <b>Python</b> Software Foundation. <b>Python</b> Documentation. <b>...</b> Download <b>Python</b> Software http://www.python.org/download/ Download Standard <b>Python</b> Software. <b>Python</b> 2.3.3 is the current production<br> version of <b>Python</b>. <b>...</b> <b>Python</b> is OSI Certified Open Source: Pythonline http://www.pythonline.com/ Dive Into <b>Python</b> http://diveintopython.org/ Dive Into <b>Python</b>. <b>Python</b> from novice to pro. Find: <b>...</b> It is also available in multiple<br> languages. Read Dive Into <b>Python</b>. This book is still being written. <b>...</b>
Unlike the other code in this book, this chapter relies on libraries that do not come pre-installed with Python.
Before you can dive into SOAP web services, you'll need to install three libraries: PyXML, fpconst, and SOAPpy.
The first library you need is PyXML, an advanced set of XML libraries that provide more functionality than the built-in XML libraries we studied in Chapter 9.
Here is the procedure for installing PyXML:
Go to http://pyxml.sourceforge.net/, click Downloads, and download the latest version for your operating system.
If you are using Windows, there are several choices. Make sure to download the version of PyXML that matches the version of Python you are using.
Double-click the installer. If you download PyXML 0.8.3 for Windows and Python 2.3, the installer program will be PyXML-0.8.3.win32-py2.3.exe.
Step through the installer program.
After the installation is complete, close the installer. There will not be any visible indication of success (no programs installed on the Start Menu or shortcuts installed on the desktop). PyXML is simply a collection of XML libraries used by other programs.
To verify that you installed PyXML correctly, run your Python IDE and check the version of the XML libraries you have installed, as shown here.
The second library you need is fpconst, a set of constants and functions for working with IEEE754 double-precision special values. This provides support for the special values Not-a-Number (NaN), Positive Infinity (Inf), and Negative Infinity (-Inf), which are part of the SOAP datatype specification.
Here is the procedure for installing fpconst:
Download the latest version of fpconst from http://www.analytics.washington.edu/statcomp/projects/rzope/fpconst/.
There are two downloads available, one in .tar.gz format, the other in .zip format. If you are using Windows, download the .zip file; otherwise, download the .tar.gz file.
Decompress the downloaded file. On Windows XP, you can right-click on the file and choose Extract All; on earlier versions of Windows, you will need a third-party program such as WinZip. On Mac OS X, you can double-click the compressed file to decompress it with Stuffit Expander.
Open a command prompt and navigate to the directory where you decompressed the fpconst files.
Type python setup.py install to run the installation program.
To verify that you installed fpconst correctly, run your Python IDE and check the version number.
The third and final requirement is the SOAP library itself: SOAPpy.
Here is the procedure for installing SOAPpy:
Go to http://pywebsvcs.sourceforge.net/ and select Latest Official Release under the SOAPpy section.
There are two downloads available. If you are using Windows, download the .zip file; otherwise, download the .tar.gz file.
Decompress the downloaded file, just as you did with fpconst.
Open a command prompt and navigate to the directory where you decompressed the SOAPpy files.
Type python setup.py install to run the installation program.
To verify that you installed SOAPpy correctly, run your Python IDE and check the version number.
The heart of SOAP is the ability to call remote functions. There are a number of public access SOAP servers that provide simple functions for demonstration purposes.
The most popular public access SOAP server is http://www.xmethods.net/. This example uses a demonstration function that takes a United States zip code and returns the current temperature in that region.
>>> from SOAPpy import SOAPProxy>>> url = 'http://services.xmethods.net:80/soap/servlet/rpcrouter' >>> namespace = 'urn:xmethods-Temperature'
>>> server = SOAPProxy(url, namespace)
>>> server.getTemp('27502')
80.0
![]() | You access the remote SOAP server through a proxy class, SOAPProxy. The proxy handles all the internals of SOAP for you, including creating the XML request document out of the function name and argument list, sending the request over HTTP to the remote SOAP server, parsing the XML response document, and creating native Python values to return. You'll see what these XML documents look like in the next section. |
![]() | Every SOAP service has a URL which handles all the requests. The same URL is used for all function calls. This particular service only has a single function, but later in this chapter you'll see examples of the Google API, which has several functions. The service URL is shared by all functions.Each SOAP service also has a namespace, which is defined by the server and is completely arbitrary. It's simply part of the configuration required to call SOAP methods. It allows the server to share a single service URL and route requests between several unrelated services. It's like dividing Python modules into packages. |
![]() | You're creating the SOAPProxy with the service URL and the service namespace. This doesn't make any connection to the SOAP server; it simply creates a local Python object. |
![]() | Now with everything configured properly, you can actually call remote SOAP methods as if they were local functions. You pass arguments just like a normal function, and you get a return value just like a normal function. But under the covers, there's a heck of a lot going on. |
Let's peek under those covers.
The SOAP libraries provide an easy way to see what's going on behind the scenes.
Turning on debugging is a simple matter of setting two flags in the SOAPProxy's configuration.
>>> from SOAPpy import SOAPProxy >>> url = 'http://services.xmethods.net:80/soap/servlet/rpcrouter' >>> n = 'urn:xmethods-Temperature' >>> server = SOAPProxy(url, namespace=n)>>> server.config.dumpSOAPOut = 1
>>> server.config.dumpSOAPIn = 1 >>> temperature = server.getTemp('27502')
*** Outgoing SOAP ****************************************************** <?xml version="1.0" encoding="UTF-8"?> <SOAP-ENV:Envelope SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/" xmlns:SOAP-ENC="http://schemas.xmlsoap.org/soap/encoding/" xmlns:xsi="http://www.w3.org/1999/XMLSchema-instance" xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsd="http://www.w3.org/1999/XMLSchema"> <SOAP-ENV:Body> <ns1:getTemp xmlns:ns1="urn:xmethods-Temperature" SOAP-ENC:root="1"> <v1 xsi:type="xsd:string">27502</v1> </ns1:getTemp> </SOAP-ENV:Body> </SOAP-ENV:Envelope> ************************************************************************ *** Incoming SOAP ****************************************************** <?xml version='1.0' encoding='UTF-8'?> <SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema"> <SOAP-ENV:Body> <ns1:getTempResponse xmlns:ns1="urn:xmethods-Temperature" SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/"> <return xsi:type="xsd:float">80.0</return> </ns1:getTempResponse> </SOAP-ENV:Body> </SOAP-ENV:Envelope> ************************************************************************ >>> temperature 80.0
Most of the XML request document that gets sent to the server is just boilerplate. Ignore all the namespace declarations; they're going to be the same (or similar) for all SOAP calls. The heart of the “function call” is this fragment within the <Body> element:
<ns1:getTempxmlns:ns1="urn:xmethods-Temperature"
SOAP-ENC:root="1"> <v1 xsi:type="xsd:string">27502</v1>
</ns1:getTemp>
![]() | The element name is the function name, getTemp. SOAPProxy uses getattr as a dispatcher. Instead of calling separate local methods based on the method name, it actually uses the method name to construct the XML request document. |
![]() | The function's XML element is contained in a specific namespace, which is the namespace you specified when you created the SOAPProxy object. Don't worry about the SOAP-ENC:root; that's boilerplate too. |
![]() | The arguments of the function also got translated into XML. SOAPProxy introspects each argument to determine its datatype (in this case it's a string). The argument datatype goes into the xsi:type attribute, followed by the actual string value. |
The XML return document is equally easy to understand, once you know what to ignore. Focus on this fragment within the <Body>:
<ns1:getTempResponsexmlns:ns1="urn:xmethods-Temperature"
SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/"> <return xsi:type="xsd:float">80.0</return>
</ns1:getTempResponse>
The SOAPProxy class proxies local method calls and transparently turns then into invocations of remote SOAP methods. As you've seen, this is a lot of work, and SOAPProxy does it quickly and transparently. What it doesn't do is provide any means of method introspection.
Consider this: the previous two sections showed an example of calling a simple remote SOAP method with one argument and one return value, both of simple data types. This required knowing, and keeping track of, the service URL, the service namespace, the function name, the number of arguments, and the datatype of each argument. If any of these is missing or wrong, the whole thing falls apart.
That shouldn't come as a big surprise. If I wanted to call a local function, I would need to know what package or module it was in (the equivalent of service URL and namespace). I would need to know the correct function name and the correct number of arguments. Python deftly handles datatyping without explicit types, but I would still need to know how many argument to pass, and how many return values to expect.
The big difference is introspection. As you saw in Chapter 4, Python excels at letting you discover things about modules and functions at runtime. You can list the available functions within a module, and with a little work, drill down to individual function declarations and arguments.
WSDL lets you do that with SOAP web services. WSDL stands for “Web Services Description Language”. Although designed to be flexible enough to describe many types of web services, it is most often used to describe SOAP web services.
A WSDL file is just that: a file. More specifically, it's an XML file. It usually lives on the same server you use to access the SOAP web services it describes, although there's nothing special about it. Later in this chapter, we'll download the WSDL file for the Google API and use it locally. That doesn't mean we're calling Google locally; the WSDL file still describes the remote functions sitting on Google's server.
A WSDL file contains a description of everything involved in calling a SOAP web service:
In other words, a WSDL file tells you everything you need to know to be able to call a SOAP web service.
Like many things in the web services arena, WSDL has a long and checkered history, full of political strife and intrigue. I will skip over this history entirely, since it bores me to tears. There were other standards that tried to do similar things, but WSDL won, so let's learn how to use it.
The most fundamental thing that WSDL allows you to do is discover the available methods offered by a SOAP server.
>>> from SOAPpy import WSDL>>> wsdlFile = 'http://www.xmethods.net/sd/2001/TemperatureService.wsdl') >>> server = WSDL.Proxy(wsdlFile)
>>> server.methods.keys()
[u'getTemp']
Okay, so you know that this SOAP server offers a single method: getTemp. But how do you call it? The WSDL proxy object can tell you that too.
>>> callInfo = server.methods['getTemp']>>> callInfo.inparams
[<SOAPpy.wstools.WSDLTools.ParameterInfo instance at 0x00CF3AD0>] >>> callInfo.inparams[0].name
u'zipcode' >>> callInfo.inparams[0].type
(u'http://www.w3.org/2001/XMLSchema', u'string')
WSDL also lets you introspect into a function's return values.
>>> callInfo.outparams[<SOAPpy.wstools.WSDLTools.ParameterInfo instance at 0x00CF3AF8>] >>> callInfo.outparams[0].name
u'return' >>> callInfo.outparams[0].type (u'http://www.w3.org/2001/XMLSchema', u'float')
Let's put it all together, and call a SOAP web service through a WSDL proxy.
>>> from SOAPpy import WSDL >>> wsdlFile = 'http://www.xmethods.net/sd/2001/TemperatureService.wsdl') >>> server = WSDL.Proxy(wsdlFile)>>> server.getTemp('90210')
66.0 >>> server.soapproxy.config.dumpSOAPOut = 1
>>> server.soapproxy.config.dumpSOAPIn = 1 >>> temperature = server.getTemp('90210') *** Outgoing SOAP ****************************************************** <?xml version="1.0" encoding="UTF-8"?> <SOAP-ENV:Envelope SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/" xmlns:SOAP-ENC="http://schemas.xmlsoap.org/soap/encoding/" xmlns:xsi="http://www.w3.org/1999/XMLSchema-instance" xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsd="http://www.w3.org/1999/XMLSchema"> <SOAP-ENV:Body> <ns1:getTemp xmlns:ns1="urn:xmethods-Temperature" SOAP-ENC:root="1"> <v1 xsi:type="xsd:string">90210</v1> </ns1:getTemp> </SOAP-ENV:Body> </SOAP-ENV:Envelope> ************************************************************************ *** Incoming SOAP ****************************************************** <?xml version='1.0' encoding='UTF-8'?> <SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema"> <SOAP-ENV:Body> <ns1:getTempResponse xmlns:ns1="urn:xmethods-Temperature" SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/"> <return xsi:type="xsd:float">66.0</return> </ns1:getTempResponse> </SOAP-ENV:Body> </SOAP-ENV:Envelope> ************************************************************************ >>> temperature 66.0
Let's finally turn to the sample code that you saw that the beginning of this chapter, which does something more useful and exciting than get the current temperature.
Google provides a SOAP API for programmatically accessing Google search results. To use it, you will need to sign up for Google Web Services.
Go to http://www.google.com/apis/ and create a Google account. This requires only an email address. After you sign up you will receive your Google API license key by email. You will need this key to pass as a parameter whenever you call Google's search functions.
Also on http://www.google.com/apis/, download the Google Web APIs developer kit. This includes some sample code in several programming languages (but not Python), and more importantly, it includes the WSDL file.
Decompress the developer kit file and find GoogleSearch.wsdl. Copy this file to some permanent location on your local drive. You will need it later in this chapter.
Once you have your developer key and your Google WSDL file in a known place, you can start poking around with Google Web Services.
>>> from SOAPpy import WSDL >>> server = WSDL.Proxy('/path/to/your/GoogleSearch.wsdl')>>> server.methods.keys()
[u'doGoogleSearch', u'doGetCachedPage', u'doSpellingSuggestion'] >>> callInfo = server.methods['doGoogleSearch'] >>> for arg in callInfo.inparams:
... print arg.name.ljust(15), arg.type key (u'http://www.w3.org/2001/XMLSchema', u'string') q (u'http://www.w3.org/2001/XMLSchema', u'string') start (u'http://www.w3.org/2001/XMLSchema', u'int') maxResults (u'http://www.w3.org/2001/XMLSchema', u'int') filter (u'http://www.w3.org/2001/XMLSchema', u'boolean') restrict (u'http://www.w3.org/2001/XMLSchema', u'string') safeSearch (u'http://www.w3.org/2001/XMLSchema', u'boolean') lr (u'http://www.w3.org/2001/XMLSchema', u'string') ie (u'http://www.w3.org/2001/XMLSchema', u'string') oe (u'http://www.w3.org/2001/XMLSchema', u'string')
Here is a brief synopsis of all the parameters to the doGoogleSearch function:
>>> from SOAPpy import WSDL >>> server = WSDL.Proxy('/path/to/your/GoogleSearch.wsdl') >>> key = 'YOUR_GOOGLE_API_KEY' >>> results = server.doGoogleSearch(key, 'mark', 0, 10, False, "", ... False, "", "utf-8", "utf-8")>>> len(results.resultElements)
10 >>> results.resultElements[0].URL
'http://diveintomark.org/' >>> results.resultElements[0].title 'dive into <b>mark</b>'
The results object contains more than the actual search results. It also contains information about the search itself, such as how long it took and how many results were found (even though only 10 were returned). The Google web interface shows this information, and you can access it programmatically too.
>>> results.searchTime0.224919 >>> results.estimatedTotalResultsCount
29800000 >>> results.directoryCategories
[<SOAPpy.Types.structType item at 14367400>: {'fullViewableName': 'Top/Arts/Literature/World_Literature/American/19th_Century/Twain,_Mark', 'specialEncoding': ''}] >>> results.directoryCategories[0].fullViewableName 'Top/Arts/Literature/World_Literature/American/19th_Century/Twain,_Mark'
![]() | This search took 0.224919 seconds. That does not include the time spent sending and receiving the actual SOAP XML documents. It's just the time that Google spent processing your request once it received it. |
![]() | In total, there were approximately 30 million results. You can access them 10 at a time by changing the start parameter and calling server.doGoogleSearch again. |
![]() | For some queries, Google also returns a list of related categories in the Google Directory. You can append these URLs to http://directory.google.com/ to construct the link to the directory category page. |
Of course, the world of SOAP web services is not all happiness and light. Sometimes things go wrong.
As you've seen throughout this chapter, SOAP involves several layers. There's the HTTP layer, since SOAP is sending XML documents to, and receiving XML documents from, an HTTP server. So all the debugging techniques you learned in Chapter 11, HTTP Web Services come into play here. You can import httplib and then set httplib.HTTPConnection.debuglevel = 1 to see the underlying HTTP traffic.
Beyond the underlying HTTP layer, there are a number of things that can go wrong. SOAPpy does an admirable job hiding the SOAP syntax from you, but that also means it can be difficult to determine where the problem is when things don't work.
Here are a few examples of common mistakes that I've made in using SOAP web services, and the errors they generated.
>>> from SOAPpy import SOAPProxy >>> url = 'http://services.xmethods.net:80/soap/servlet/rpcrouter' >>> server = SOAPProxy(url)>>> server.getTemp('27502')
<Fault SOAP-ENV:Server.BadTargetObjectURI: Unable to determine object id from call: is the method element namespaced?> Traceback (most recent call last): File "<stdin>", line 1, in ? File "c:\python23\Lib\site-packages\SOAPpy\Client.py", line 453, in __call__ return self.__r_call(*args, **kw) File "c:\python23\Lib\site-packages\SOAPpy\Client.py", line 475, in __r_call self.__hd, self.__ma) File "c:\python23\Lib\site-packages\SOAPpy\Client.py", line 389, in __call raise p SOAPpy.Types.faultType: <Fault SOAP-ENV:Server.BadTargetObjectURI: Unable to determine object id from call: is the method element namespaced?>
Misconfiguring the basic elements of the SOAP service is one of the problems that WSDL aims to solve. The WSDL file contains the service URL and namespace, so you can't get it wrong. Of course, there are still other things you can get wrong.
>>> wsdlFile = 'http://www.xmethods.net/sd/2001/TemperatureService.wsdl' >>> server = WSDL.Proxy(wsdlFile) >>> temperature = server.getTemp(27502)<Fault SOAP-ENV:Server: Exception while handling service request: services.temperature.TempService.getTemp(int) -- no signature match>
Traceback (most recent call last): File "<stdin>", line 1, in ? File "c:\python23\Lib\site-packages\SOAPpy\Client.py", line 453, in __call__ return self.__r_call(*args, **kw) File "c:\python23\Lib\site-packages\SOAPpy\Client.py", line 475, in __r_call self.__hd, self.__ma) File "c:\python23\Lib\site-packages\SOAPpy\Client.py", line 389, in __call raise p SOAPpy.Types.faultType: <Fault SOAP-ENV:Server: Exception while handling service request: services.temperature.TempService.getTemp(int) -- no signature match>
It's also possible to write Python code that expects a different number of return values than the remote function actually returns.
>>> wsdlFile = 'http://www.xmethods.net/sd/2001/TemperatureService.wsdl' >>> server = WSDL.Proxy(wsdlFile) >>> (city, temperature) = server.getTemp(27502)Traceback (most recent call last): File "<stdin>", line 1, in ? TypeError: unpack non-sequence
What about Google's web service? The most common problem I've had with it is that I forget to set the application key properly.
>>> from SOAPpy import WSDL >>> server = WSDL.Proxy(r'/path/to/local/GoogleSearch.wsdl') >>> results = server.doGoogleSearch('foo', 'mark', 0, 10, False, "",... False, "", "utf-8", "utf-8") <Fault SOAP-ENV:Server:
Exception from service object: Invalid authorization key: foo: <SOAPpy.Types.structType detail at 14164616>: {'stackTrace': 'com.google.soap.search.GoogleSearchFault: Invalid authorization key: foo at com.google.soap.search.QueryLimits.lookUpAndLoadFromINSIfNeedBe( QueryLimits.java:220) at com.google.soap.search.QueryLimits.validateKey(QueryLimits.java:127) at com.google.soap.search.GoogleSearchService.doPublicMethodChecks( GoogleSearchService.java:825) at com.google.soap.search.GoogleSearchService.doGoogleSearch( GoogleSearchService.java:121) at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at org.apache.soap.server.RPCRouter.invoke(RPCRouter.java:146) at org.apache.soap.providers.RPCJavaProvider.invoke( RPCJavaProvider.java:129) at org.apache.soap.server.http.RPCRouterServlet.doPost( RPCRouterServlet.java:288) at javax.servlet.http.HttpServlet.service(HttpServlet.java:760) at javax.servlet.http.HttpServlet.service(HttpServlet.java:853) at com.google.gse.HttpConnection.runServlet(HttpConnection.java:237) at com.google.gse.HttpConnection.run(HttpConnection.java:195) at com.google.gse.DispatchQueue$WorkerThread.run(DispatchQueue.java:201) Caused by: com.google.soap.search.UserKeyInvalidException: Key was of wrong size. at com.google.soap.search.UserKey.<init>(UserKey.java:59) at com.google.soap.search.QueryLimits.lookUpAndLoadFromINSIfNeedBe( QueryLimits.java:217) ... 14 more '}> Traceback (most recent call last): File "<stdin>", line 1, in ? File "c:\python23\Lib\site-packages\SOAPpy\Client.py", line 453, in __call__ return self.__r_call(*args, **kw) File "c:\python23\Lib\site-packages\SOAPpy\Client.py", line 475, in __r_call self.__hd, self.__ma) File "c:\python23\Lib\site-packages\SOAPpy\Client.py", line 389, in __call raise p SOAPpy.Types.faultType: <Fault SOAP-ENV:Server: Exception from service object: Invalid authorization key: foo: <SOAPpy.Types.structType detail at 14164616>: {'stackTrace': 'com.google.soap.search.GoogleSearchFault: Invalid authorization key: foo at com.google.soap.search.QueryLimits.lookUpAndLoadFromINSIfNeedBe( QueryLimits.java:220) at com.google.soap.search.QueryLimits.validateKey(QueryLimits.java:127) at com.google.soap.search.GoogleSearchService.doPublicMethodChecks( GoogleSearchService.java:825) at com.google.soap.search.GoogleSearchService.doGoogleSearch( GoogleSearchService.java:121) at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at org.apache.soap.server.RPCRouter.invoke(RPCRouter.java:146) at org.apache.soap.providers.RPCJavaProvider.invoke( RPCJavaProvider.java:129) at org.apache.soap.server.http.RPCRouterServlet.doPost( RPCRouterServlet.java:288) at javax.servlet.http.HttpServlet.service(HttpServlet.java:760) at javax.servlet.http.HttpServlet.service(HttpServlet.java:853) at com.google.gse.HttpConnection.runServlet(HttpConnection.java:237) at com.google.gse.HttpConnection.run(HttpConnection.java:195) at com.google.gse.DispatchQueue$WorkerThread.run(DispatchQueue.java:201) Caused by: com.google.soap.search.UserKeyInvalidException: Key was of wrong size. at com.google.soap.search.UserKey.<init>(UserKey.java:59) at com.google.soap.search.QueryLimits.lookUpAndLoadFromINSIfNeedBe( QueryLimits.java:217) ... 14 more '}>
SOAP web services are very complicated. The specification is very ambitious and tries to cover many different use cases for web services. This chapter has touched on some of the simpler use cases.
Before diving into the next chapter, make sure you're comfortable doing all of these things:
In previous chapters, you “dived in” by immediately looking at code and trying to understand it as quickly as possible. Now that you have some Python under your belt, you're going to step back and look at the steps that happen before the code gets written.
In the next few chapters, you're going to write, debug, and optimize a set of utility functions to convert to and from Roman numerals. You saw the mechanics of constructing and validating Roman numerals in Section 7.3, “Case Study: Roman Numerals”, but now let's step back and consider what it would take to expand that into a two-way utility.
The rules for Roman numerals lead to a number of interesting observations:
Given all of this, what would you expect out of a set of functions to convert to and from Roman numerals?
Now that you've completely defined the behavior you expect from your conversion functions, you're going to do something a little unexpected: you're going to write a test suite that puts these functions through their paces and makes sure that they behave the way you want them to. You read that right: you're going to write code that tests code that you haven't written yet.
This is called unit testing, since the set of two conversion functions can be written and tested as a unit, separate from any larger program they may become part of later. Python has a framework for unit testing, the appropriately-named unittest module.
![]() |
|
unittest is included with Python 2.1 and later. Python 2.0 users can download it from pyunit.sourceforge.net. |
Unit testing is an important part of an overall testing-centric development strategy. If you write unit tests, it is important to write them early (preferably before writing the code that they test), and to keep them updated as code and requirements change. Unit testing is not a replacement for higher-level functional or system testing, but it is important in all phases of development:
This is the complete test suite for your Roman numeral conversion functions, which are yet to be written but will eventually be in roman.py. It is not immediately obvious how it all fits together; none of these classes or methods reference any of the others. There are good reasons for this, as you'll see shortly.
If you have not already done so, you can download this and other examples used in this book.
"""Unit test for roman.py""" import roman import unittest class KnownValues(unittest.TestCase): knownValues = ( (1, 'I'), (2, 'II'), (3, 'III'), (4, 'IV'), (5, 'V'), (6, 'VI'), (7, 'VII'), (8, 'VIII'), (9, 'IX'), (10, 'X'), (50, 'L'), (100, 'C'), (500, 'D'), (1000, 'M'), (31, 'XXXI'), (148, 'CXLVIII'), (294, 'CCXCIV'), (312, 'CCCXII'), (421, 'CDXXI'), (528, 'DXXVIII'), (621, 'DCXXI'), (782, 'DCCLXXXII'), (870, 'DCCCLXX'), (941, 'CMXLI'), (1043, 'MXLIII'), (1110, 'MCX'), (1226, 'MCCXXVI'), (1301, 'MCCCI'), (1485, 'MCDLXXXV'), (1509, 'MDIX'), (1607, 'MDCVII'), (1754, 'MDCCLIV'), (1832, 'MDCCCXXXII'), (1993, 'MCMXCIII'), (2074, 'MMLXXIV'), (2152, 'MMCLII'), (2212, 'MMCCXII'), (2343, 'MMCCCXLIII'), (2499, 'MMCDXCIX'), (2574, 'MMDLXXIV'), (2646, 'MMDCXLVI'), (2723, 'MMDCCXXIII'), (2892, 'MMDCCCXCII'), (2975, 'MMCMLXXV'), (3051, 'MMMLI'), (3185, 'MMMCLXXXV'), (3250, 'MMMCCL'), (3313, 'MMMCCCXIII'), (3408, 'MMMCDVIII'), (3501, 'MMMDI'), (3610, 'MMMDCX'), (3743, 'MMMDCCXLIII'), (3844, 'MMMDCCCXLIV'), (3888, 'MMMDCCCLXXXVIII'), (3940, 'MMMCMXL'), (3999, 'MMMCMXCIX')) def testToRomanKnownValues(self): """toRoman should give known result with known input""" for integer, numeral in self.knownValues: result = roman.toRoman(integer) self.assertEqual(numeral, result) def testFromRomanKnownValues(self): """fromRoman should give known result with known input""" for integer, numeral in self.knownValues: result = roman.fromRoman(numeral) self.assertEqual(integer, result) class ToRomanBadInput(unittest.TestCase): def testTooLarge(self): """toRoman should fail with large input""" self.assertRaises(roman.OutOfRangeError, roman.toRoman, 4000) def testZero(self): """toRoman should fail with 0 input""" self.assertRaises(roman.OutOfRangeError, roman.toRoman, 0) def testNegative(self): """toRoman should fail with negative input""" self.assertRaises(roman.OutOfRangeError, roman.toRoman, -1) def testNonInteger(self): """toRoman should fail with non-integer input""" self.assertRaises(roman.NotIntegerError, roman.toRoman, 0.5) class FromRomanBadInput(unittest.TestCase): def testTooManyRepeatedNumerals(self): """fromRoman should fail with too many repeated numerals""" for s in ('MMMM', 'DD', 'CCCC', 'LL', 'XXXX', 'VV', 'IIII'): self.assertRaises(roman.InvalidRomanNumeralError, roman.fromRoman, s) def testRepeatedPairs(self): """fromRoman should fail with repeated pairs of numerals""" for s in ('CMCM', 'CDCD', 'XCXC', 'XLXL', 'IXIX', 'IVIV'): self.assertRaises(roman.InvalidRomanNumeralError, roman.fromRoman, s) def testMalformedAntecedent(self): """fromRoman should fail with malformed antecedents""" for s in ('IIMXCC', 'VX', 'DCM', 'CMM', 'IXIV', 'MCMC', 'XCX', 'IVI', 'LM', 'LD', 'LC'): self.assertRaises(roman.InvalidRomanNumeralError, roman.fromRoman, s) class SanityCheck(unittest.TestCase): def testSanity(self): """fromRoman(toRoman(n))==n for all n""" for integer in range(1, 4000): numeral = roman.toRoman(integer) result = roman.fromRoman(numeral) self.assertEqual(integer, result) class CaseCheck(unittest.TestCase): def testToRomanCase(self): """toRoman should always return uppercase""" for integer in range(1, 4000): numeral = roman.toRoman(integer) self.assertEqual(numeral, numeral.upper()) def testFromRomanCase(self): """fromRoman should only accept uppercase input""" for integer in range(1, 4000): numeral = roman.toRoman(integer) roman.fromRoman(numeral.upper()) self.assertRaises(roman.InvalidRomanNumeralError, roman.fromRoman, numeral.lower()) if __name__ == "__main__": unittest.main()
The most fundamental part of unit testing is constructing individual test cases. A test case answers a single question about the code it is testing.
A test case should be able to...
Given that, let's build the first test case. You have the following requirement:
class KnownValues(unittest.TestCase):knownValues = ( (1, 'I'), (2, 'II'), (3, 'III'), (4, 'IV'), (5, 'V'), (6, 'VI'), (7, 'VII'), (8, 'VIII'), (9, 'IX'), (10, 'X'), (50, 'L'), (100, 'C'), (500, 'D'), (1000, 'M'), (31, 'XXXI'), (148, 'CXLVIII'), (294, 'CCXCIV'), (312, 'CCCXII'), (421, 'CDXXI'), (528, 'DXXVIII'), (621, 'DCXXI'), (782, 'DCCLXXXII'), (870, 'DCCCLXX'), (941, 'CMXLI'), (1043, 'MXLIII'), (1110, 'MCX'), (1226, 'MCCXXVI'), (1301, 'MCCCI'), (1485, 'MCDLXXXV'), (1509, 'MDIX'), (1607, 'MDCVII'), (1754, 'MDCCLIV'), (1832, 'MDCCCXXXII'), (1993, 'MCMXCIII'), (2074, 'MMLXXIV'), (2152, 'MMCLII'), (2212, 'MMCCXII'), (2343, 'MMCCCXLIII'), (2499, 'MMCDXCIX'), (2574, 'MMDLXXIV'), (2646, 'MMDCXLVI'), (2723, 'MMDCCXXIII'), (2892, 'MMDCCCXCII'), (2975, 'MMCMLXXV'), (3051, 'MMMLI'), (3185, 'MMMCLXXXV'), (3250, 'MMMCCL'), (3313, 'MMMCCCXIII'), (3408, 'MMMCDVIII'), (3501, 'MMMDI'), (3610, 'MMMDCX'), (3743, 'MMMDCCXLIII'), (3844, 'MMMDCCCXLIV'), (3888, 'MMMDCCCLXXXVIII'), (3940, 'MMMCMXL'), (3999, 'MMMCMXCIX'))
def testToRomanKnownValues(self):
"""toRoman should give known result with known input""" for integer, numeral in self.knownValues: result = roman.toRoman(integer)
![]()
self.assertEqual(numeral, result)
It is not enough to test that functions succeed when given good input; you must also test that they fail when given bad input. And not just any sort of failure; they must fail in the way you expect.
Remember the other requirements for toRoman:
In Python, functions indicate failure by raising exceptions, and the unittest module provides methods for testing whether a function raises a particular exception when given bad input.
class ToRomanBadInput(unittest.TestCase): def testTooLarge(self): """toRoman should fail with large input""" self.assertRaises(roman.OutOfRangeError, roman.toRoman, 4000)def testZero(self): """toRoman should fail with 0 input""" self.assertRaises(roman.OutOfRangeError, roman.toRoman, 0)
def testNegative(self): """toRoman should fail with negative input""" self.assertRaises(roman.OutOfRangeError, roman.toRoman, -1) def testNonInteger(self): """toRoman should fail with non-integer input""" self.assertRaises(roman.NotIntegerError, roman.toRoman, 0.5)
![]() | The TestCase class of the unittest provides the assertRaises method, which takes the following arguments: the exception you're expecting, the function you're testing, and the arguments you're passing that function. (If the function you're testing takes more than one argument, pass them all to assertRaises, in order, and it will pass them right along to the function you're testing.) Pay close attention to what you're doing here: instead of calling toRoman directly and manually checking that it raises a particular exception (by wrapping it in a try...except block), assertRaises has encapsulated all of that for us. All you do is give it the exception (roman.OutOfRangeError), the function (toRoman), and toRoman's arguments (4000), and assertRaises takes care of calling toRoman and checking to make sure that it raises roman.OutOfRangeError. (Also note that you're passing the toRoman function itself as an argument; you're not calling it, and you're not passing the name of it as a string. Have I mentioned recently how handy it is that everything in Python is an object, including functions and exceptions?) |
![]() | Along with testing numbers that are too large, you need to test numbers that are too small. Remember, Roman numerals cannot express 0 or negative numbers, so you have a test case for each of those (testZero and testNegative). In testZero, you are testing that toRoman raises a roman.OutOfRangeError exception when called with 0; if it does not raise a roman.OutOfRangeError (either because it returns an actual value, or because it raises some other exception), this test is considered failed. |
![]() | Requirement #3 specifies that toRoman cannot accept a non-integer number, so here you test to make sure that toRoman raises a roman.NotIntegerError exception when called with 0.5. If toRoman does not raise a roman.NotIntegerError, this test is considered failed. |
The next two requirements are similar to the first three, except they apply to fromRoman instead of toRoman:
Requirement #4 is handled in the same way as requirement #1, iterating through a sampling of known values and testing each in turn. Requirement #5 is handled in the same way as requirements #2 and #3, by testing a series of bad inputs and making sure fromRoman raises the appropriate exception.
class FromRomanBadInput(unittest.TestCase): def testTooManyRepeatedNumerals(self): """fromRoman should fail with too many repeated numerals""" for s in ('MMMM', 'DD', 'CCCC', 'LL', 'XXXX', 'VV', 'IIII'): self.assertRaises(roman.InvalidRomanNumeralError, roman.fromRoman, s)def testRepeatedPairs(self): """fromRoman should fail with repeated pairs of numerals""" for s in ('CMCM', 'CDCD', 'XCXC', 'XLXL', 'IXIX', 'IVIV'): self.assertRaises(roman.InvalidRomanNumeralError, roman.fromRoman, s) def testMalformedAntecedent(self): """fromRoman should fail with malformed antecedents""" for s in ('IIMXCC', 'VX', 'DCM', 'CMM', 'IXIV', 'MCMC', 'XCX', 'IVI', 'LM', 'LD', 'LC'): self.assertRaises(roman.InvalidRomanNumeralError, roman.fromRoman, s)
Often, you will find that a unit of code contains a set of reciprocal functions, usually in the form of conversion functions where one converts A to B and the other converts B to A. In these cases, it is useful to create a “sanity check” to make sure that you can convert A to B and back to A without losing precision, incurring rounding errors, or triggering any other sort of bug.
Consider this requirement:
class SanityCheck(unittest.TestCase): def testSanity(self): """fromRoman(toRoman(n))==n for all n""" for integer in range(1, 4000):![]()
numeral = roman.toRoman(integer) result = roman.fromRoman(numeral) self.assertEqual(integer, result)
![]() | You've seen the range function before, but here it is called with two arguments, which returns a list of integers starting at the first argument (1) and counting consecutively up to but not including the second argument (4000). Thus, 1..3999, which is the valid range for converting to Roman numerals. |
![]() | I just wanted to mention in passing that integer is not a keyword in Python; here it's just a variable name like any other. |
![]() | The actual testing logic here is straightforward: take a number (integer), convert it to a Roman numeral (numeral), then convert it back to a number (result) and make sure you end up with the same number you started with. If not, assertEqual will raise an exception and the test will immediately be considered failed. If all the numbers match, assertEqual will always return silently, the entire testSanity method will eventually return silently, and the test will be considered passed. |
The last two requirements are different from the others because they seem both arbitrary and trivial:
In fact, they are somewhat arbitrary. You could, for instance, have stipulated that fromRoman accept lowercase and mixed case input. But they are not completely arbitrary; if toRoman is always returning uppercase output, then fromRoman must at least accept uppercase input, or the “sanity check” (requirement #6) would fail. The fact that it only accepts uppercase input is arbitrary, but as any systems integrator will tell you, case always matters, so it's worth specifying the behavior up front. And if it's worth specifying, it's worth testing.
class CaseCheck(unittest.TestCase): def testToRomanCase(self): """toRoman should always return uppercase""" for integer in range(1, 4000): numeral = roman.toRoman(integer) self.assertEqual(numeral, numeral.upper())def testFromRomanCase(self): """fromRoman should only accept uppercase input""" for integer in range(1, 4000): numeral = roman.toRoman(integer) roman.fromRoman(numeral.upper())
![]()
self.assertRaises(roman.InvalidRomanNumeralError, roman.fromRoman, numeral.lower())
![]() | The most interesting thing about this test case is all the things it doesn't test. It doesn't test that the value returned from toRoman is right or even consistent; those questions are answered by separate test cases. You have a whole test case just to test for uppercase-ness. You might be tempted to combine this with the sanity check, since both run through the entire range of values and call toRoman.[6] But that would violate one of the fundamental rules: each test case should answer only a single question. Imagine that you combined this case check with the sanity check, and then that test case failed. You would need to do further analysis to figure out which part of the test case failed to determine what the problem was. If you need to analyze the results of your unit testing just to figure out what they mean, it's a sure sign that you've mis-designed your test cases. |
![]() | There's a similar lesson to be learned here: even though “you know” that toRoman always returns uppercase, you are explicitly converting its return value to uppercase here to test that fromRoman accepts uppercase input. Why? Because the fact that toRoman always returns uppercase is an independent requirement. If you changed that requirement so that, for instance, it always returned lowercase, the testToRomanCase test case would need to change, but this test case would still work. This was another of the fundamental rules: each test case must be able to work in isolation from any of the others. Every test case is an island. |
![]() | Note that you're not assigning the return value of fromRoman to anything. This is legal syntax in Python; if a function returns a value but nobody's listening, Python just throws away the return value. In this case, that's what you want. This test case doesn't test anything about the return value; it just tests that fromRoman accepts the uppercase input without raising an exception. |
![]() | This is a complicated line, but it's very similar to what you did in the ToRomanBadInput and FromRomanBadInput tests. You are testing to make sure that calling a particular function (roman.fromRoman) with a particular value (numeral.lower(), the lowercase version of the current Roman numeral in the loop) raises a particular exception (roman.InvalidRomanNumeralError). If it does (each time through the loop), the test passes; if even one time it does something else (like raises a different exception, or returning a value without raising an exception at all), the test fails. |
In the next chapter, you'll see how to write code that passes these tests.
Now that the unit tests are complete, it's time to start writing the code that the test cases are attempting to test. You're going to do this in stages, so you can see all the unit tests fail, then watch them pass one by one as you fill in the gaps in roman.py.
This file is available in py/roman/stage1/ in the examples directory.
If you have not already done so, you can download this and other examples used in this book.
"""Convert to and from Roman numerals""" #Define exceptions class RomanError(Exception): passclass OutOfRangeError(RomanError): pass
class NotIntegerError(RomanError): pass class InvalidRomanNumeralError(RomanError): pass
def toRoman(n): """convert integer to Roman numeral""" pass
def fromRoman(s): """convert Roman numeral to integer""" pass
![]() | This is how you define your own custom exceptions in Python. Exceptions are classes, and you create your own by subclassing existing exceptions. It is strongly recommended (but not required) that you subclass Exception, which is the base class that all built-in exceptions inherit from. Here I am defining RomanError (inherited from Exception) to act as the base class for all my other custom exceptions to follow. This is a matter of style; I could just as easily have inherited each individual exception from the Exception class directly. |
![]() | The OutOfRangeError and NotIntegerError exceptions will eventually be used by toRoman to flag various forms of invalid input, as specified in ToRomanBadInput. |
![]() | The InvalidRomanNumeralError exception will eventually be used by fromRoman to flag invalid input, as specified in FromRomanBadInput. |
![]() | At this stage, you want to define the API of each of your functions, but you don't want to code them yet, so you stub them out using the Python reserved word pass. |
Now for the big moment (drum roll please): you're finally going to run the unit test against this stubby little module. At this point, every test case should fail. In fact, if any test case passes in stage 1, you should go back to romantest.py and re-evaluate why you coded a test so useless that it passes with do-nothing functions.
Run romantest1.py with the -v command-line option, which will give more verbose output so you can see exactly what's going on as each test case runs. With any luck, your output should look like this:
fromRoman should only accept uppercase input ... ERROR toRoman should always return uppercase ... ERROR fromRoman should fail with malformed antecedents ... FAIL fromRoman should fail with repeated pairs of numerals ... FAIL fromRoman should fail with too many repeated numerals ... FAIL fromRoman should give known result with known input ... FAIL toRoman should give known result with known input ... FAIL fromRoman(toRoman(n))==n for all n ... FAIL toRoman should fail with non-integer input ... FAIL toRoman should fail with negative input ... FAIL toRoman should fail with large input ... FAIL toRoman should fail with 0 input ... FAIL ====================================================================== ERROR: fromRoman should only accept uppercase input ---------------------------------------------------------------------- Traceback (most recent call last): File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 154, in testFromRomanCase roman1.fromRoman(numeral.upper()) AttributeError: 'None' object has no attribute 'upper' ====================================================================== ERROR: toRoman should always return uppercase ---------------------------------------------------------------------- Traceback (most recent call last): File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 148, in testToRomanCase self.assertEqual(numeral, numeral.upper()) AttributeError: 'None' object has no attribute 'upper' ====================================================================== FAIL: fromRoman should fail with malformed antecedents ---------------------------------------------------------------------- Traceback (most recent call last): File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 133, in testMalformedAntecedent self.assertRaises(roman1.InvalidRomanNumeralError, roman1.fromRoman, s) File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises raise self.failureException, excName AssertionError: InvalidRomanNumeralError ====================================================================== FAIL: fromRoman should fail with repeated pairs of numerals ---------------------------------------------------------------------- Traceback (most recent call last): File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 127, in testRepeatedPairs self.assertRaises(roman1.InvalidRomanNumeralError, roman1.fromRoman, s) File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises raise self.failureException, excName AssertionError: InvalidRomanNumeralError ====================================================================== FAIL: fromRoman should fail with too many repeated numerals ---------------------------------------------------------------------- Traceback (most recent call last): File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 122, in testTooManyRepeatedNumerals self.assertRaises(roman1.InvalidRomanNumeralError, roman1.fromRoman, s) File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises raise self.failureException, excName AssertionError: InvalidRomanNumeralError ====================================================================== FAIL: fromRoman should give known result with known input ---------------------------------------------------------------------- Traceback (most recent call last): File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 99, in testFromRomanKnownValues self.assertEqual(integer, result) File "c:\python21\lib\unittest.py", line 273, in failUnlessEqual raise self.failureException, (msg or '%s != %s' % (first, second)) AssertionError: 1 != None ====================================================================== FAIL: toRoman should give known result with known input ---------------------------------------------------------------------- Traceback (most recent call last): File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 93, in testToRomanKnownValues self.assertEqual(numeral, result) File "c:\python21\lib\unittest.py", line 273, in failUnlessEqual raise self.failureException, (msg or '%s != %s' % (first, second)) AssertionError: I != None ====================================================================== FAIL: fromRoman(toRoman(n))==n for all n ---------------------------------------------------------------------- Traceback (most recent call last): File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 141, in testSanity self.assertEqual(integer, result) File "c:\python21\lib\unittest.py", line 273, in failUnlessEqual raise self.failureException, (msg or '%s != %s' % (first, second)) AssertionError: 1 != None ====================================================================== FAIL: toRoman should fail with non-integer input ---------------------------------------------------------------------- Traceback (most recent call last): File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 116, in testNonInteger self.assertRaises(roman1.NotIntegerError, roman1.toRoman, 0.5) File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises raise self.failureException, excName AssertionError: NotIntegerError ====================================================================== FAIL: toRoman should fail with negative input ---------------------------------------------------------------------- Traceback (most recent call last): File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 112, in testNegative self.assertRaises(roman1.OutOfRangeError, roman1.toRoman, -1) File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises raise self.failureException, excName AssertionError: OutOfRangeError ====================================================================== FAIL: toRoman should fail with large input ---------------------------------------------------------------------- Traceback (most recent call last): File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 104, in testTooLarge self.assertRaises(roman1.OutOfRangeError, roman1.toRoman, 4000) File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises raise self.failureException, excName AssertionError: OutOfRangeError ====================================================================== FAIL: toRoman should fail with 0 input---------------------------------------------------------------------- Traceback (most recent call last): File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 108, in testZero self.assertRaises(roman1.OutOfRangeError, roman1.toRoman, 0) File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises raise self.failureException, excName AssertionError: OutOfRangeError
---------------------------------------------------------------------- Ran 12 tests in 0.040s
FAILED (failures=10, errors=2)
Now that you have the framework of the roman module laid out, it's time to start writing code and passing test cases.
This file is available in py/roman/stage2/ in the examples directory.
If you have not already done so, you can download this and other examples used in this book.
"""Convert to and from Roman numerals""" #Define exceptions class RomanError(Exception): pass class OutOfRangeError(RomanError): pass class NotIntegerError(RomanError): pass class InvalidRomanNumeralError(RomanError): pass #Define digit mapping romanNumeralMap = (('M', 1000),('CM', 900), ('D', 500), ('CD', 400), ('C', 100), ('XC', 90), ('L', 50), ('XL', 40), ('X', 10), ('IX', 9), ('V', 5), ('IV', 4), ('I', 1)) def toRoman(n): """convert integer to Roman numeral""" result = "" for numeral, integer in romanNumeralMap: while n >= integer:
result += numeral n -= integer return result def fromRoman(s): """convert Roman numeral to integer""" pass
If you're not clear how toRoman works, add a print statement to the end of the while loop:
while n >= integer: result += numeral n -= integer print 'subtracting', integer, 'from input, adding', numeral, 'to output'
>>> import roman2 >>> roman2.toRoman(1424) subtracting 1000 from input, adding M to output subtracting 400 from input, adding CD to output subtracting 10 from input, adding X to output subtracting 10 from input, adding X to output subtracting 4 from input, adding IV to output 'MCDXXIV'
So toRoman appears to work, at least in this manual spot check. But will it pass the unit testing? Well no, not entirely.
Remember to run romantest2.py with the -v command-line flag to enable verbose mode.
fromRoman should only accept uppercase input ... FAIL toRoman should always return uppercase ... okfromRoman should fail with malformed antecedents ... FAIL fromRoman should fail with repeated pairs of numerals ... FAIL fromRoman should fail with too many repeated numerals ... FAIL fromRoman should give known result with known input ... FAIL toRoman should give known result with known input ... ok
fromRoman(toRoman(n))==n for all n ... FAIL toRoman should fail with non-integer input ... FAIL
toRoman should fail with negative input ... FAIL toRoman should fail with large input ... FAIL toRoman should fail with 0 input ... FAIL
![]() | toRoman does, in fact, always return uppercase, because romanNumeralMap defines the Roman numeral representations as uppercase. So this test passes already. |
![]() | Here's the big news: this version of the toRoman function passes the known values test. Remember, it's not comprehensive, but it does put the function through its paces with a variety of good inputs, including inputs that produce every single-character Roman numeral, the largest possible input (3999), and the input that produces the longest possible Roman numeral (3888). At this point, you can be reasonably confident that the function works for any good input value you could throw at it. |
![]() | However, the function does not “work” for bad values; it fails every single bad input test. That makes sense, because you didn't include any checks for bad input. Those test cases look for specific exceptions to be raised (via assertRaises), and you're never raising them. You'll do that in the next stage. |
Here's the rest of the output of the unit test, listing the details of all the failures. You're down to 10.
====================================================================== FAIL: fromRoman should only accept uppercase input ---------------------------------------------------------------------- Traceback (most recent call last): File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 156, in testFromRomanCase roman2.fromRoman, numeral.lower()) File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises raise self.failureException, excName AssertionError: InvalidRomanNumeralError ====================================================================== FAIL: fromRoman should fail with malformed antecedents ---------------------------------------------------------------------- Traceback (most recent call last): File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 133, in testMalformedAntecedent self.assertRaises(roman2.InvalidRomanNumeralError, roman2.fromRoman, s) File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises raise self.failureException, excName AssertionError: InvalidRomanNumeralError ====================================================================== FAIL: fromRoman should fail with repeated pairs of numerals ---------------------------------------------------------------------- Traceback (most recent call last): File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 127, in testRepeatedPairs self.assertRaises(roman2.InvalidRomanNumeralError, roman2.fromRoman, s) File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises raise self.failureException, excName AssertionError: InvalidRomanNumeralError ====================================================================== FAIL: fromRoman should fail with too many repeated numerals ---------------------------------------------------------------------- Traceback (most recent call last): File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 122, in testTooManyRepeatedNumerals self.assertRaises(roman2.InvalidRomanNumeralError, roman2.fromRoman, s) File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises raise self.failureException, excName AssertionError: InvalidRomanNumeralError ====================================================================== FAIL: fromRoman should give known result with known input ---------------------------------------------------------------------- Traceback (most recent call last): File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 99, in testFromRomanKnownValues self.assertEqual(integer, result) File "c:\python21\lib\unittest.py", line 273, in failUnlessEqual raise self.failureException, (msg or '%s != %s' % (first, second)) AssertionError: 1 != None ====================================================================== FAIL: fromRoman(toRoman(n))==n for all n ---------------------------------------------------------------------- Traceback (most recent call last): File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 141, in testSanity self.assertEqual(integer, result) File "c:\python21\lib\unittest.py", line 273, in failUnlessEqual raise self.failureException, (msg or '%s != %s' % (first, second)) AssertionError: 1 != None ====================================================================== FAIL: toRoman should fail with non-integer input ---------------------------------------------------------------------- Traceback (most recent call last): File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 116, in testNonInteger self.assertRaises(roman2.NotIntegerError, roman2.toRoman, 0.5) File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises raise self.failureException, excName AssertionError: NotIntegerError ====================================================================== FAIL: toRoman should fail with negative input ---------------------------------------------------------------------- Traceback (most recent call last): File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 112, in testNegative self.assertRaises(roman2.OutOfRangeError, roman2.toRoman, -1) File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises raise self.failureException, excName AssertionError: OutOfRangeError ====================================================================== FAIL: toRoman should fail with large input ---------------------------------------------------------------------- Traceback (most recent call last): File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 104, in testTooLarge self.assertRaises(roman2.OutOfRangeError, roman2.toRoman, 4000) File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises raise self.failureException, excName AssertionError: OutOfRangeError ====================================================================== FAIL: toRoman should fail with 0 input ---------------------------------------------------------------------- Traceback (most recent call last): File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 108, in testZero self.assertRaises(roman2.OutOfRangeError, roman2.toRoman, 0) File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises raise self.failureException, excName AssertionError: OutOfRangeError ---------------------------------------------------------------------- Ran 12 tests in 0.320s FAILED (failures=10)
Now that toRoman behaves correctly with good input (integers from 1 to 3999), it's time to make it behave correctly with bad input (everything else).
This file is available in py/roman/stage3/ in the examples directory.
If you have not already done so, you can download this and other examples used in this book.
"""Convert to and from Roman numerals""" #Define exceptions class RomanError(Exception): pass class OutOfRangeError(RomanError): pass class NotIntegerError(RomanError): pass class InvalidRomanNumeralError(RomanError): pass #Define digit mapping romanNumeralMap = (('M', 1000), ('CM', 900), ('D', 500), ('CD', 400), ('C', 100), ('XC', 90), ('L', 50), ('XL', 40), ('X', 10), ('IX', 9), ('V', 5), ('IV', 4), ('I', 1)) def toRoman(n): """convert integer to Roman numeral""" if not (0 < n < 4000):raise OutOfRangeError, "number out of range (must be 1..3999)"
if int(n) <> n:
raise NotIntegerError, "non-integers can not be converted" result = ""
for numeral, integer in romanNumeralMap: while n >= integer: result += numeral n -= integer return result def fromRoman(s): """convert Roman numeral to integer""" pass
>>> import roman3 >>> roman3.toRoman(4000) Traceback (most recent call last): File "<interactive input>", line 1, in ? File "roman3.py", line 27, in toRoman raise OutOfRangeError, "number out of range (must be 1..3999)" OutOfRangeError: number out of range (must be 1..3999) >>> roman3.toRoman(1.5) Traceback (most recent call last): File "<interactive input>", line 1, in ? File "roman3.py", line 29, in toRoman raise NotIntegerError, "non-integers can not be converted" NotIntegerError: non-integers can not be converted
fromRoman should only accept uppercase input ... FAIL toRoman should always return uppercase ... ok fromRoman should fail with malformed antecedents ... FAIL fromRoman should fail with repeated pairs of numerals ... FAIL fromRoman should fail with too many repeated numerals ... FAIL fromRoman should give known result with known input ... FAIL toRoman should give known result with known input ... okfromRoman(toRoman(n))==n for all n ... FAIL toRoman should fail with non-integer input ... ok
toRoman should fail with negative input ... ok
toRoman should fail with large input ... ok toRoman should fail with 0 input ... ok
![]() | toRoman still passes the known values test, which is comforting. All the tests that passed in stage 2 still pass, so the latest code hasn't broken anything. |
![]() | More exciting is the fact that all of the bad input tests now pass. This test, testNonInteger, passes because of the int(n) <> n check. When a non-integer is passed to toRoman, the int(n) <> n check notices it and raises the NotIntegerError exception, which is what testNonInteger is looking for. |
![]() | This test, testNegative, passes because of the not (0 < n < 4000) check, which raises an OutOfRangeError exception, which is what testNegative is looking for. |
====================================================================== FAIL: fromRoman should only accept uppercase input ---------------------------------------------------------------------- Traceback (most recent call last): File "C:\docbook\dip\py\roman\stage3\romantest3.py", line 156, in testFromRomanCase roman3.fromRoman, numeral.lower()) File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises raise self.failureException, excName AssertionError: InvalidRomanNumeralError ====================================================================== FAIL: fromRoman should fail with malformed antecedents ---------------------------------------------------------------------- Traceback (most recent call last): File "C:\docbook\dip\py\roman\stage3\romantest3.py", line 133, in testMalformedAntecedent self.assertRaises(roman3.InvalidRomanNumeralError, roman3.fromRoman, s) File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises raise self.failureException, excName AssertionError: InvalidRomanNumeralError ====================================================================== FAIL: fromRoman should fail with repeated pairs of numerals ---------------------------------------------------------------------- Traceback (most recent call last): File "C:\docbook\dip\py\roman\stage3\romantest3.py", line 127, in testRepeatedPairs self.assertRaises(roman3.InvalidRomanNumeralError, roman3.fromRoman, s) File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises raise self.failureException, excName AssertionError: InvalidRomanNumeralError ====================================================================== FAIL: fromRoman should fail with too many repeated numerals ---------------------------------------------------------------------- Traceback (most recent call last): File "C:\docbook\dip\py\roman\stage3\romantest3.py", line 122, in testTooManyRepeatedNumerals self.assertRaises(roman3.InvalidRomanNumeralError, roman3.fromRoman, s) File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises raise self.failureException, excName AssertionError: InvalidRomanNumeralError ====================================================================== FAIL: fromRoman should give known result with known input ---------------------------------------------------------------------- Traceback (most recent call last): File "C:\docbook\dip\py\roman\stage3\romantest3.py", line 99, in testFromRomanKnownValues self.assertEqual(integer, result) File "c:\python21\lib\unittest.py", line 273, in failUnlessEqual raise self.failureException, (msg or '%s != %s' % (first, second)) AssertionError: 1 != None ====================================================================== FAIL: fromRoman(toRoman(n))==n for all n ---------------------------------------------------------------------- Traceback (most recent call last): File "C:\docbook\dip\py\roman\stage3\romantest3.py", line 141, in testSanity self.assertEqual(integer, result) File "c:\python21\lib\unittest.py", line 273, in failUnlessEqual raise self.failureException, (msg or '%s != %s' % (first, second)) AssertionError: 1 != None ---------------------------------------------------------------------- Ran 12 tests in 0.401s FAILED (failures=6)
![]() |
|
The most important thing that comprehensive unit testing can tell you is when to stop coding. When all the unit tests for a function pass, stop coding the function. When all the unit tests for an entire module pass, stop coding the module. |
Now that toRoman is done, it's time to start coding fromRoman. Thanks to the rich data structure that maps individual Roman numerals to integer values, this is no more difficult than the toRoman function.
This file is available in py/roman/stage4/ in the examples directory.
If you have not already done so, you can download this and other examples used in this book.
"""Convert to and from Roman numerals""" #Define exceptions class RomanError(Exception): pass class OutOfRangeError(RomanError): pass class NotIntegerError(RomanError): pass class InvalidRomanNumeralError(RomanError): pass #Define digit mapping romanNumeralMap = (('M', 1000), ('CM', 900), ('D', 500), ('CD', 400), ('C', 100), ('XC', 90), ('L', 50), ('XL', 40), ('X', 10), ('IX', 9), ('V', 5), ('IV', 4), ('I', 1)) # toRoman function omitted for clarity (it hasn't changed) def fromRoman(s): """convert Roman numeral to integer""" result = 0 index = 0 for numeral, integer in romanNumeralMap: while s[index:index+len(numeral)] == numeral:result += integer index += len(numeral) return result
![]() | The pattern here is the same as toRoman. You iterate through your Roman numeral data structure (a tuple of tuples), and instead of matching the highest integer values as often as possible, you match the “highest” Roman numeral character strings as often as possible. |
If you're not clear how fromRoman works, add a print statement to the end of the while loop:
while s[index:index+len(numeral)] == numeral: result += integer index += len(numeral) print 'found', numeral, 'of length', len(numeral), ', adding', integer
>>> import roman4 >>> roman4.fromRoman('MCMLXXII') found M , of length 1, adding 1000 found CM , of length 2, adding 900 found L , of length 1, adding 50 found X , of length 1, adding 10 found X , of length 1, adding 10 found I , of length 1, adding 1 found I , of length 1, adding 1 1972
fromRoman should only accept uppercase input ... FAIL toRoman should always return uppercase ... ok fromRoman should fail with malformed antecedents ... FAIL fromRoman should fail with repeated pairs of numerals ... FAIL fromRoman should fail with too many repeated numerals ... FAIL fromRoman should give known result with known input ... oktoRoman should give known result with known input ... ok fromRoman(toRoman(n))==n for all n ... ok
toRoman should fail with non-integer input ... ok toRoman should fail with negative input ... ok toRoman should fail with large input ... ok toRoman should fail with 0 input ... ok
![]() | Two pieces of exciting news here. The first is that fromRoman works for good input, at least for all the known values you test. |
![]() | The second is that the sanity check also passed. Combined with the known values tests, you can be reasonably sure that both toRoman and fromRoman work properly for all possible good values. (This is not guaranteed; it is theoretically possible that toRoman has a bug that produces the wrong Roman numeral for some particular set of inputs, and that fromRoman has a reciprocal bug that produces the same wrong integer values for exactly that set of Roman numerals that toRoman generated incorrectly. Depending on your application and your requirements, this possibility may bother you; if so, write more comprehensive test cases until it doesn't bother you.) |
====================================================================== FAIL: fromRoman should only accept uppercase input ---------------------------------------------------------------------- Traceback (most recent call last): File "C:\docbook\dip\py\roman\stage4\romantest4.py", line 156, in testFromRomanCase roman4.fromRoman, numeral.lower()) File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises raise self.failureException, excName AssertionError: InvalidRomanNumeralError ====================================================================== FAIL: fromRoman should fail with malformed antecedents ---------------------------------------------------------------------- Traceback (most recent call last): File "C:\docbook\dip\py\roman\stage4\romantest4.py", line 133, in testMalformedAntecedent self.assertRaises(roman4.InvalidRomanNumeralError, roman4.fromRoman, s) File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises raise self.failureException, excName AssertionError: InvalidRomanNumeralError ====================================================================== FAIL: fromRoman should fail with repeated pairs of numerals ---------------------------------------------------------------------- Traceback (most recent call last): File "C:\docbook\dip\py\roman\stage4\romantest4.py", line 127, in testRepeatedPairs self.assertRaises(roman4.InvalidRomanNumeralError, roman4.fromRoman, s) File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises raise self.failureException, excName AssertionError: InvalidRomanNumeralError ====================================================================== FAIL: fromRoman should fail with too many repeated numerals ---------------------------------------------------------------------- Traceback (most recent call last): File "C:\docbook\dip\py\roman\stage4\romantest4.py", line 122, in testTooManyRepeatedNumerals self.assertRaises(roman4.InvalidRomanNumeralError, roman4.fromRoman, s) File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises raise self.failureException, excName AssertionError: InvalidRomanNumeralError ---------------------------------------------------------------------- Ran 12 tests in 1.222s FAILED (failures=4)
Now that fromRoman works properly with good input, it's time to fit in the last piece of the puzzle: making it work properly with bad input. That means finding a way to look at a string and determine if it's a valid Roman numeral. This is inherently more difficult than validating numeric input in toRoman, but you have a powerful tool at your disposal: regular expressions.
If you're not familiar with regular expressions and didn't read Chapter 7, Regular Expressions, now would be a good time.
As you saw in Section 7.3, “Case Study: Roman Numerals”, there are several simple rules for constructing a Roman numeral, using the letters M, D, C, L, X, V, and I. Let's review the rules:
This file is available in py/roman/stage5/ in the examples directory.
If you have not already done so, you can download this and other examples used in this book.
"""Convert to and from Roman numerals""" import re #Define exceptions class RomanError(Exception): pass class OutOfRangeError(RomanError): pass class NotIntegerError(RomanError): pass class InvalidRomanNumeralError(RomanError): pass #Define digit mapping romanNumeralMap = (('M', 1000), ('CM', 900), ('D', 500), ('CD', 400), ('C', 100), ('XC', 90), ('L', 50), ('XL', 40), ('X', 10), ('IX', 9), ('V', 5), ('IV', 4), ('I', 1)) def toRoman(n): """convert integer to Roman numeral""" if not (0 < n < 4000): raise OutOfRangeError, "number out of range (must be 1..3999)" if int(n) <> n: raise NotIntegerError, "non-integers can not be converted" result = "" for numeral, integer in romanNumeralMap: while n >= integer: result += numeral n -= integer return result #Define pattern to detect valid Roman numerals romanNumeralPattern = '^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$'def fromRoman(s): """convert Roman numeral to integer""" if not re.search(romanNumeralPattern, s):
raise InvalidRomanNumeralError, 'Invalid Roman numeral: %s' % s result = 0 index = 0 for numeral, integer in romanNumeralMap: while s[index:index+len(numeral)] == numeral: result += integer index += len(numeral) return result
![]() | This is just a continuation of the pattern you discussed in Section 7.3, “Case Study: Roman Numerals”. The tens places is either XC (90), XL (40), or an optional L followed by 0 to 3 optional X characters. The ones place is either IX (9), IV (4), or an optional V followed by 0 to 3 optional I characters. |
![]() | Having encoded all that logic into a regular expression, the code to check for invalid Roman numerals becomes trivial. If re.search returns an object, then the regular expression matched and the input is valid; otherwise, the input is invalid. |
At this point, you are allowed to be skeptical that that big ugly regular expression could possibly catch all the types of invalid Roman numerals. But don't take my word for it, look at the results:
fromRoman should only accept uppercase input ... oktoRoman should always return uppercase ... ok fromRoman should fail with malformed antecedents ... ok
fromRoman should fail with repeated pairs of numerals ... ok
fromRoman should fail with too many repeated numerals ... ok fromRoman should give known result with known input ... ok toRoman should give known result with known input ... ok fromRoman(toRoman(n))==n for all n ... ok toRoman should fail with non-integer input ... ok toRoman should fail with negative input ... ok toRoman should fail with large input ... ok toRoman should fail with 0 input ... ok ---------------------------------------------------------------------- Ran 12 tests in 2.864s OK
![]() |
|
When all of your tests pass, stop coding. |
Despite your best efforts to write comprehensive unit tests, bugs happen. What do I mean by “bug”? A bug is a test case you haven't written yet.
>>> import roman5 >>> roman5.fromRoman("")0
![]() | Remember in the previous section when you kept seeing that an empty string would match the regular expression you were using to check for valid Roman numerals? Well, it turns out that this is still true for the final version of the regular expression. And that's a bug; you want an empty string to raise an InvalidRomanNumeralError exception just like any other sequence of characters that don't represent a valid Roman numeral. |
After reproducing the bug, and before fixing it, you should write a test case that fails, thus illustrating the bug.
class FromRomanBadInput(unittest.TestCase): # previous test cases omitted for clarity (they haven't changed) def testBlank(self): """fromRoman should fail with blank string""" self.assertRaises(roman.InvalidRomanNumeralError, roman.fromRoman, "")![]()
Since your code has a bug, and you now have a test case that tests this bug, the test case will fail:
fromRoman should only accept uppercase input ... ok toRoman should always return uppercase ... ok fromRoman should fail with blank string ... FAIL fromRoman should fail with malformed antecedents ... ok fromRoman should fail with repeated pairs of numerals ... ok fromRoman should fail with too many repeated numerals ... ok fromRoman should give known result with known input ... ok toRoman should give known result with known input ... ok fromRoman(toRoman(n))==n for all n ... ok toRoman should fail with non-integer input ... ok toRoman should fail with negative input ... ok toRoman should fail with large input ... ok toRoman should fail with 0 input ... ok ====================================================================== FAIL: fromRoman should fail with blank string ---------------------------------------------------------------------- Traceback (most recent call last): File "C:\docbook\dip\py\roman\stage6\romantest61.py", line 137, in testBlank self.assertRaises(roman61.InvalidRomanNumeralError, roman61.fromRoman, "") File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises raise self.failureException, excName AssertionError: InvalidRomanNumeralError ---------------------------------------------------------------------- Ran 13 tests in 2.864s FAILED (failures=1)
Now you can fix the bug.
This file is available in py/roman/stage6/ in the examples directory.
def fromRoman(s): """convert Roman numeral to integer""" if not s:raise InvalidRomanNumeralError, 'Input can not be blank' if not re.search(romanNumeralPattern, s): raise InvalidRomanNumeralError, 'Invalid Roman numeral: %s' % s result = 0 index = 0 for numeral, integer in romanNumeralMap: while s[index:index+len(numeral)] == numeral: result += integer index += len(numeral) return result
fromRoman should only accept uppercase input ... ok toRoman should always return uppercase ... ok fromRoman should fail with blank string ... okfromRoman should fail with malformed antecedents ... ok fromRoman should fail with repeated pairs of numerals ... ok fromRoman should fail with too many repeated numerals ... ok fromRoman should give known result with known input ... ok toRoman should give known result with known input ... ok fromRoman(toRoman(n))==n for all n ... ok toRoman should fail with non-integer input ... ok toRoman should fail with negative input ... ok toRoman should fail with large input ... ok toRoman should fail with 0 input ... ok ---------------------------------------------------------------------- Ran 13 tests in 2.834s OK
Coding this way does not make fixing bugs any easier. Simple bugs (like this one) require simple test cases; complex bugs will require complex test cases. In a testing-centric environment, it may seem like it takes longer to fix a bug, since you need to articulate in code exactly what the bug is (to write the test case), then fix the bug itself. Then if the test case doesn't pass right away, you need to figure out whether the fix was wrong, or whether the test case itself has a bug in it. However, in the long run, this back-and-forth between test code and code tested pays for itself, because it makes it more likely that bugs are fixed correctly the first time. Also, since you can easily re-run all the test cases along with your new one, you are much less likely to break old code when fixing new code. Today's unit test is tomorrow's regression test.
Despite your best efforts to pin your customers to the ground and extract exact requirements from them on pain of horrible nasty things involving scissors and hot wax, requirements will change. Most customers don't know what they want until they see it, and even if they do, they aren't that good at articulating what they want precisely enough to be useful. And even if they do, they'll want more in the next release anyway. So be prepared to update your test cases as requirements change.
Suppose, for instance, that you wanted to expand the range of the Roman numeral conversion functions. Remember the rule that said that no character could be repeated more than three times? Well, the Romans were willing to make an exception to that rule by having 4 M characters in a row to represent 4000. If you make this change, you'll be able to expand the range of convertible numbers from 1..3999 to 1..4999. But first, you need to make some changes to the test cases.
This file is available in py/roman/stage7/ in the examples directory.
If you have not already done so, you can download this and other examples used in this book.
import roman71 import unittest class KnownValues(unittest.TestCase): knownValues = ( (1, 'I'), (2, 'II'), (3, 'III'), (4, 'IV'), (5, 'V'), (6, 'VI'), (7, 'VII'), (8, 'VIII'), (9, 'IX'), (10, 'X'), (50, 'L'), (100, 'C'), (500, 'D'), (1000, 'M'), (31, 'XXXI'), (148, 'CXLVIII'), (294, 'CCXCIV'), (312, 'CCCXII'), (421, 'CDXXI'), (528, 'DXXVIII'), (621, 'DCXXI'), (782, 'DCCLXXXII'), (870, 'DCCCLXX'), (941, 'CMXLI'), (1043, 'MXLIII'), (1110, 'MCX'), (1226, 'MCCXXVI'), (1301, 'MCCCI'), (1485, 'MCDLXXXV'), (1509, 'MDIX'), (1607, 'MDCVII'), (1754, 'MDCCLIV'), (1832, 'MDCCCXXXII'), (1993, 'MCMXCIII'), (2074, 'MMLXXIV'), (2152, 'MMCLII'), (2212, 'MMCCXII'), (2343, 'MMCCCXLIII'), (2499, 'MMCDXCIX'), (2574, 'MMDLXXIV'), (2646, 'MMDCXLVI'), (2723, 'MMDCCXXIII'), (2892, 'MMDCCCXCII'), (2975, 'MMCMLXXV'), (3051, 'MMMLI'), (3185, 'MMMCLXXXV'), (3250, 'MMMCCL'), (3313, 'MMMCCCXIII'), (3408, 'MMMCDVIII'), (3501, 'MMMDI'), (3610, 'MMMDCX'), (3743, 'MMMDCCXLIII'), (3844, 'MMMDCCCXLIV'), (3888, 'MMMDCCCLXXXVIII'), (3940, 'MMMCMXL'), (3999, 'MMMCMXCIX'), (4000, 'MMMM'),(4500, 'MMMMD'), (4888, 'MMMMDCCCLXXXVIII'), (4999, 'MMMMCMXCIX')) def testToRomanKnownValues(self): """toRoman should give known result with known input""" for integer, numeral in self.knownValues: result = roman71.toRoman(integer) self.assertEqual(numeral, result) def testFromRomanKnownValues(self): """fromRoman should give known result with known input""" for integer, numeral in self.knownValues: result = roman71.fromRoman(numeral) self.assertEqual(integer, result) class ToRomanBadInput(unittest.TestCase): def testTooLarge(self): """toRoman should fail with large input""" self.assertRaises(roman71.OutOfRangeError, roman71.toRoman, 5000)
def testZero(self): """toRoman should fail with 0 input""" self.assertRaises(roman71.OutOfRangeError, roman71.toRoman, 0) def testNegative(self): """toRoman should fail with negative input""" self.assertRaises(roman71.OutOfRangeError, roman71.toRoman, -1) def testNonInteger(self): """toRoman should fail with non-integer input""" self.assertRaises(roman71.NotIntegerError, roman71.toRoman, 0.5) class FromRomanBadInput(unittest.TestCase): def testTooManyRepeatedNumerals(self): """fromRoman should fail with too many repeated numerals""" for s in ('MMMMM', 'DD', 'CCCC', 'LL', 'XXXX', 'VV', 'IIII'):
self.assertRaises(roman71.InvalidRomanNumeralError, roman71.fromRoman, s) def testRepeatedPairs(self): """fromRoman should fail with repeated pairs of numerals""" for s in ('CMCM', 'CDCD', 'XCXC', 'XLXL', 'IXIX', 'IVIV'): self.assertRaises(roman71.InvalidRomanNumeralError, roman71.fromRoman, s) def testMalformedAntecedent(self): """fromRoman should fail with malformed antecedents""" for s in ('IIMXCC', 'VX', 'DCM', 'CMM', 'IXIV', 'MCMC', 'XCX', 'IVI', 'LM', 'LD', 'LC'): self.assertRaises(roman71.InvalidRomanNumeralError, roman71.fromRoman, s) def testBlank(self): """fromRoman should fail with blank string""" self.assertRaises(roman71.InvalidRomanNumeralError, roman71.fromRoman, "") class SanityCheck(unittest.TestCase): def testSanity(self): """fromRoman(toRoman(n))==n for all n""" for integer in range(1, 5000):
numeral = roman71.toRoman(integer) result = roman71.fromRoman(numeral) self.assertEqual(integer, result) class CaseCheck(unittest.TestCase): def testToRomanCase(self): """toRoman should always return uppercase""" for integer in range(1, 5000): numeral = roman71.toRoman(integer) self.assertEqual(numeral, numeral.upper()) def testFromRomanCase(self): """fromRoman should only accept uppercase input""" for integer in range(1, 5000): numeral = roman71.toRoman(integer) roman71.fromRoman(numeral.upper()) self.assertRaises(roman71.InvalidRomanNumeralError, roman71.fromRoman, numeral.lower()) if __name__ == "__main__": unittest.main()
Now your test cases are up to date with the new requirements, but your code is not, so you expect several of the test cases to fail.
fromRoman should only accept uppercase input ... ERRORtoRoman should always return uppercase ... ERROR fromRoman should fail with blank string ... ok fromRoman should fail with malformed antecedents ... ok fromRoman should fail with repeated pairs of numerals ... ok fromRoman should fail with too many repeated numerals ... ok fromRoman should give known result with known input ... ERROR
toRoman should give known result with known input ... ERROR
fromRoman(toRoman(n))==n for all n ... ERROR
toRoman should fail with non-integer input ... ok toRoman should fail with negative input ... ok toRoman should fail with large input ... ok toRoman should fail with 0 input ... ok
====================================================================== ERROR: fromRoman should only accept uppercase input ---------------------------------------------------------------------- Traceback (most recent call last): File "C:\docbook\dip\py\roman\stage7\romantest71.py", line 161, in testFromRomanCase numeral = roman71.toRoman(integer) File "roman71.py", line 28, in toRoman raise OutOfRangeError, "number out of range (must be 1..3999)" OutOfRangeError: number out of range (must be 1..3999) ====================================================================== ERROR: toRoman should always return uppercase ---------------------------------------------------------------------- Traceback (most recent call last): File "C:\docbook\dip\py\roman\stage7\romantest71.py", line 155, in testToRomanCase numeral = roman71.toRoman(integer) File "roman71.py", line 28, in toRoman raise OutOfRangeError, "number out of range (must be 1..3999)" OutOfRangeError: number out of range (must be 1..3999) ====================================================================== ERROR: fromRoman should give known result with known input ---------------------------------------------------------------------- Traceback (most recent call last): File "C:\docbook\dip\py\roman\stage7\romantest71.py", line 102, in testFromRomanKnownValues result = roman71.fromRoman(numeral) File "roman71.py", line 47, in fromRoman raise InvalidRomanNumeralError, 'Invalid Roman numeral: %s' % s InvalidRomanNumeralError: Invalid Roman numeral: MMMM ====================================================================== ERROR: toRoman should give known result with known input ---------------------------------------------------------------------- Traceback (most recent call last): File "C:\docbook\dip\py\roman\stage7\romantest71.py", line 96, in testToRomanKnownValues result = roman71.toRoman(integer) File "roman71.py", line 28, in toRoman raise OutOfRangeError, "number out of range (must be 1..3999)" OutOfRangeError: number out of range (must be 1..3999) ====================================================================== ERROR: fromRoman(toRoman(n))==n for all n ---------------------------------------------------------------------- Traceback (most recent call last): File "C:\docbook\dip\py\roman\stage7\romantest71.py", line 147, in testSanity numeral = roman71.toRoman(integer) File "roman71.py", line 28, in toRoman raise OutOfRangeError, "number out of range (must be 1..3999)" OutOfRangeError: number out of range (must be 1..3999) ---------------------------------------------------------------------- Ran 13 tests in 2.213s FAILED (errors=5)
Now that you have test cases that fail due to the new requirements, you can think about fixing the code to bring it in line with the test cases. (One thing that takes some getting used to when you first start coding unit tests is that the code being tested is never “ahead” of the test cases. While it's behind, you still have some work to do, and as soon as it catches up to the test cases, you stop coding.)
This file is available in py/roman/stage7/ in the examples directory.
"""Convert to and from Roman numerals""" import re #Define exceptions class RomanError(Exception): pass class OutOfRangeError(RomanError): pass class NotIntegerError(RomanError): pass class InvalidRomanNumeralError(RomanError): pass #Define digit mapping romanNumeralMap = (('M', 1000), ('CM', 900), ('D', 500), ('CD', 400), ('C', 100), ('XC', 90), ('L', 50), ('XL', 40), ('X', 10), ('IX', 9), ('V', 5), ('IV', 4), ('I', 1)) def toRoman(n): """convert integer to Roman numeral""" if not (0 < n < 5000):raise OutOfRangeError, "number out of range (must be 1..4999)" if int(n) <> n: raise NotIntegerError, "non-integers can not be converted" result = "" for numeral, integer in romanNumeralMap: while n >= integer: result += numeral n -= integer return result #Define pattern to detect valid Roman numerals romanNumeralPattern = '^M?M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$'
def fromRoman(s): """convert Roman numeral to integer""" if not s: raise InvalidRomanNumeralError, 'Input can not be blank' if not re.search(romanNumeralPattern, s): raise InvalidRomanNumeralError, 'Invalid Roman numeral: %s' % s result = 0 index = 0 for numeral, integer in romanNumeralMap: while s[index:index+len(numeral)] == numeral: result += integer index += len(numeral) return result
You may be skeptical that these two small changes are all that you need. Hey, don't take my word for it; see for yourself:
fromRoman should only accept uppercase input ... ok toRoman should always return uppercase ... ok fromRoman should fail with blank string ... ok fromRoman should fail with malformed antecedents ... ok fromRoman should fail with repeated pairs of numerals ... ok fromRoman should fail with too many repeated numerals ... ok fromRoman should give known result with known input ... ok toRoman should give known result with known input ... ok fromRoman(toRoman(n))==n for all n ... ok toRoman should fail with non-integer input ... ok toRoman should fail with negative input ... ok toRoman should fail with large input ... ok toRoman should fail with 0 input ... ok ---------------------------------------------------------------------- Ran 13 tests in 3.685s OK
Comprehensive unit testing means never having to rely on a programmer who says “Trust me.”
The best thing about comprehensive unit testing is not the feeling you get when all your test cases finally pass, or even the feeling you get when someone else blames you for breaking their code and you can actually prove that you didn't. The best thing about unit testing is that it gives you the freedom to refactor mercilessly.
Refactoring is the process of taking working code and making it work better. Usually, “better” means “faster”, although it can also mean “using less memory”, or “using less disk space”, or simply “more elegantly”. Whatever it means to you, to your project, in your environment, refactoring is important to the long-term health of any program.
Here, “better” means “faster”. Specifically, the fromRoman function is slower than it needs to be, because of that big nasty regular expression that you use to validate Roman numerals. It's probably not worth trying to do away with the regular expression altogether (it would be difficult, and it might not end up any faster), but you can speed up the function by precompiling the regular expression.
>>> import re >>> pattern = '^M?M?M?$' >>> re.search(pattern, 'M')<SRE_Match object at 01090490> >>> compiledPattern = re.compile(pattern)
>>> compiledPattern <SRE_Pattern object at 00F06E28> >>> dir(compiledPattern)
['findall', 'match', 'scanner', 'search', 'split', 'sub', 'subn'] >>> compiledPattern.search('M')
<SRE_Match object at 01104928>
![]() |
|
Whenever you are going to use a regular expression more than once, you should compile it to get a pattern object, then call the methods on the pattern object directly. |
This file is available in py/roman/stage8/ in the examples directory.
If you have not already done so, you can download this and other examples used in this book.
# toRoman and rest of module omitted for clarity romanNumeralPattern = \ re.compile('^M?M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$')def fromRoman(s): """convert Roman numeral to integer""" if not s: raise InvalidRomanNumeralError, 'Input can not be blank' if not romanNumeralPattern.search(s):
raise InvalidRomanNumeralError, 'Invalid Roman numeral: %s' % s result = 0 index = 0 for numeral, integer in romanNumeralMap: while s[index:index+len(numeral)] == numeral: result += integer index += len(numeral) return result
So how much faster is it to compile regular expressions? See for yourself:
.............---------------------------------------------------------------------- Ran 13 tests in 3.385s
OK
![]() | Just a note in passing here: this time, I ran the unit test without the -v option, so instead of the full doc string for each test, you only get a dot for each test that passes. (If a test failed, you'd get an F, and if it had an error, you'd get an E. You'd still get complete tracebacks for each failure and error, so you could track down any problems.) |
![]() | You ran 13 tests in 3.385 seconds, compared to 3.685 seconds without precompiling the regular expressions. That's an 8% improvement overall, and remember that most of the time spent during the unit test is spent doing other things. (Separately, I time-tested the regular expressions by themselves, apart from the rest of the unit tests, and found that compiling this regular expression speeds up the search by an average of 54%.) Not bad for such a simple fix. |
![]() | Oh, and in case you were wondering, precompiling the regular expression didn't break anything, and you just proved it. |
There is one other performance optimization that I want to try. Given the complexity of regular expression syntax, it should come as no surprise that there is frequently more than one way to write the same expression. After some discussion about this module on comp.lang.python, someone suggested that I try using the {m,n} syntax for the optional repeated characters.
This file is available in py/roman/stage8/ in the examples directory.
If you have not already done so, you can download this and other examples used in this book.
# rest of program omitted for clarity #old version #romanNumeralPattern = \ # re.compile('^M?M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$') #new version romanNumeralPattern = \ re.compile('^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$')![]()
This form of the regular expression is a little shorter (though not any more readable). The big question is, is it any faster?
............. ---------------------------------------------------------------------- Ran 13 tests in 3.315sOK
One other tweak I would like to make, and then I promise I'll stop refactoring and put this module to bed. As you've seen repeatedly, regular expressions can get pretty hairy and unreadable pretty quickly. I wouldn't like to come back to this module in six months and try to maintain it. Sure, the test cases pass, so I know that it works, but if I can't figure out how it works, it's still going to be difficult to add new features, fix new bugs, or otherwise maintain it. As you saw in Section 7.5, “Verbose Regular Expressions”, Python provides a way to document your logic line-by-line.
This file is available in py/roman/stage8/ in the examples directory.
If you have not already done so, you can download this and other examples used in this book.
# rest of program omitted for clarity #old version #romanNumeralPattern = \ # re.compile('^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$') #new version romanNumeralPattern = re.compile(''' ^ # beginning of string M{0,4} # thousands - 0 to 4 M's (CM|CD|D?C{0,3}) # hundreds - 900 (CM), 400 (CD), 0-300 (0 to 3 C's), # or 500-800 (D, followed by 0 to 3 C's) (XC|XL|L?X{0,3}) # tens - 90 (XC), 40 (XL), 0-30 (0 to 3 X's), # or 50-80 (L, followed by 0 to 3 X's) (IX|IV|V?I{0,3}) # ones - 9 (IX), 4 (IV), 0-3 (0 to 3 I's), # or 5-8 (V, followed by 0 to 3 I's) $ # end of string ''', re.VERBOSE)![]()
............. ---------------------------------------------------------------------- Ran 13 tests in 3.315sOK
A clever reader read the previous section and took it to the next level. The biggest headache (and performance drain) in the program as it is currently written is the regular expression, which is required because you have no other way of breaking down a Roman numeral. But there's only 5000 of them; why don't you just build a lookup table once, then simply read that? This idea gets even better when you realize that you don't need to use regular expressions at all. As you build the lookup table for converting integers to Roman numerals, you can build the reverse lookup table to convert Roman numerals to integers.
And best of all, he already had a complete set of unit tests. He changed over half the code in the module, but the unit tests stayed the same, so he could prove that his code worked just as well as the original.
This file is available in py/roman/stage9/ in the examples directory.
If you have not already done so, you can download this and other examples used in this book.
#Define exceptions class RomanError(Exception): pass class OutOfRangeError(RomanError): pass class NotIntegerError(RomanError): pass class InvalidRomanNumeralError(RomanError): pass #Roman numerals must be less than 5000 MAX_ROMAN_NUMERAL = 4999 #Define digit mapping romanNumeralMap = (('M', 1000), ('CM', 900), ('D', 500), ('CD', 400), ('C', 100), ('XC', 90), ('L', 50), ('XL', 40), ('X', 10), ('IX', 9), ('V', 5), ('IV', 4), ('I', 1)) #Create tables for fast conversion of roman numerals. #See fillLookupTables() below. toRomanTable = [ None ] # Skip an index since Roman numerals have no zero fromRomanTable = {} def toRoman(n): """convert integer to Roman numeral""" if not (0 < n <= MAX_ROMAN_NUMERAL): raise OutOfRangeError, "number out of range (must be 1..%s)" % MAX_ROMAN_NUMERAL if int(n) <> n: raise NotIntegerError, "non-integers can not be converted" return toRomanTable[n] def fromRoman(s): """convert Roman numeral to integer""" if not s: raise InvalidRomanNumeralError, "Input can not be blank" if not fromRomanTable.has_key(s): raise InvalidRomanNumeralError, "Invalid Roman numeral: %s" % s return fromRomanTable[s] def toRomanDynamic(n): """convert integer to Roman numeral using dynamic programming""" result = "" for numeral, integer in romanNumeralMap: if n >= integer: result = numeral n -= integer break if n > 0: result += toRomanTable[n] return result def fillLookupTables(): """compute all the possible roman numerals""" #Save the values in two global tables to convert to and from integers. for integer in range(1, MAX_ROMAN_NUMERAL + 1): romanNumber = toRomanDynamic(integer) toRomanTable.append(romanNumber) fromRomanTable[romanNumber] = integer fillLookupTables()
So how fast is it?
.............
----------------------------------------------------------------------
Ran 13 tests in 0.791s
OK
Remember, the best performance you ever got in the original version was 13 tests in 3.315 seconds. Of course, it's not entirely a fair comparison, because this version will take longer to import (when it fills the lookup tables). But since import is only done once, this is negligible in the long run.
The moral of the story?
Unit testing is a powerful concept which, if properly implemented, can both reduce maintenance costs and increase flexibility in any long-term project. It is also important to understand that unit testing is not a panacea, a Magic Problem Solver, or a silver bullet. Writing good test cases is hard, and keeping them up to date takes discipline (especially when customers are screaming for critical bug fixes). Unit testing is not a replacement for other forms of testing, including functional testing, integration testing, and user acceptance testing. But it is feasible, and it does work, and once you've seen it work, you'll wonder how you ever got along without it.
This chapter covered a lot of ground, and much of it wasn't even Python-specific. There are unit testing frameworks for many languages, all of which require you to understand the same basic concepts:
Additionally, you should be comfortable doing all of the following Python-specific things:
In Chapter 13, Unit Testing, you learned about the philosophy of unit testing. In Chapter 14, Test-First Programming, you stepped through the implementation of basic unit tests in Python. In Chapter 15, Refactoring, you saw how unit testing makes large-scale refactoring easier. This chapter will build on those sample programs, but here we will focus more on advanced Python-specific techniques, rather than on unit testing itself.
The following is a complete Python program that acts as a cheap and simple regression testing framework. It takes unit tests that you've written for individual modules, collects them all into one big test suite, and runs them all at once. I actually use this script as part of the build process for this book; I have unit tests for several of the example programs (not just the roman.py module featured in Chapter 13, Unit Testing), and the first thing my automated build script does is run this program to make sure all my examples still work. If this regression test fails, the build immediately stops. I don't want to release non-working examples any more than you want to download them and sit around scratching your head and yelling at your monitor and wondering why they don't work.
If you have not already done so, you can download this and other examples used in this book.
"""Regression testing framework This module will search for scripts in the same directory named XYZtest.py. Each such script should be a test suite that tests a module through PyUnit. (As of Python 2.1, PyUnit is included in the standard library as "unittest".) This script will aggregate all found test suites into one big test suite and run them all at once. """ import sys, os, re, unittest def regressionTest(): path = os.path.abspath(os.path.dirname(sys.argv[0])) files = os.listdir(path) test = re.compile("test\.py$", re.IGNORECASE) files = filter(test.search, files) filenameToModuleName = lambda f: os.path.splitext(f)[0] moduleNames = map(filenameToModuleName, files) modules = map(__import__, moduleNames) load = unittest.defaultTestLoader.loadTestsFromModule return unittest.TestSuite(map(load, modules)) if __name__ == "__main__": unittest.main(defaultTest="regressionTest")
Running this script in the same directory as the rest of the example scripts that come with this book will find all the unit tests, named moduletest.py, run them as a single test, and pass or fail them all at once.
[you@localhost py]$ python regression.py -v help should fail with no object ... okhelp should return known result for apihelper ... ok help should honor collapse argument ... ok help should honor spacing argument ... ok buildConnectionString should fail with list input ... ok
buildConnectionString should fail with string input ... ok buildConnectionString should fail with tuple input ... ok buildConnectionString handles empty dictionary ... ok buildConnectionString returns known result with known input ... ok fromRoman should only accept uppercase input ... ok
toRoman should always return uppercase ... ok fromRoman should fail with blank string ... ok fromRoman should fail with malformed antecedents ... ok fromRoman should fail with repeated pairs of numerals ... ok fromRoman should fail with too many repeated numerals ... ok fromRoman should give known result with known input ... ok toRoman should give known result with known input ... ok fromRoman(toRoman(n))==n for all n ... ok toRoman should fail with non-integer input ... ok toRoman should fail with negative input ... ok toRoman should fail with large input ... ok toRoman should fail with 0 input ... ok kgp a ref test ... ok kgp b ref test ... ok kgp c ref test ... ok kgp d ref test ... ok kgp e ref test ... ok kgp f ref test ... ok kgp g ref test ... ok ---------------------------------------------------------------------- Ran 29 tests in 2.799s OK
![]() | The first 5 tests are from apihelpertest.py, which tests the example script from Chapter 4, The Power Of Introspection. |
![]() | The next 5 tests are from odbchelpertest.py, which tests the example script from Chapter 2, Your First Python Program. |
![]() | The rest are from romantest.py, which you studied in depth in Chapter 13, Unit Testing. |
When running Python scripts from the command line, it is sometimes useful to know where the currently running script is located on disk.
This is one of those obscure little tricks that is virtually impossible to figure out on your own, but simple to remember once you see it. The key to it is sys.argv. As you saw in Chapter 9, XML Processing, this is a list that holds the list of command-line arguments. However, it also holds the name of the running script, exactly as it was called from the command line, and this is enough information to determine its location.
If you have not already done so, you can download this and other examples used in this book.
import sys, os print 'sys.argv[0] =', sys.argv[0]pathname = os.path.dirname(sys.argv[0])
print 'path =', pathname print 'full path =', os.path.abspath(pathname)
os.path.abspath deserves further explanation. It is very flexible; it can take any kind of pathname.
>>> import os >>> os.getcwd()/home/you >>> os.path.abspath('')
/home/you >>> os.path.abspath('.ssh')
/home/you/.ssh >>> os.path.abspath('/home/you/.ssh')
/home/you/.ssh >>> os.path.abspath('.ssh/../foo/')
/home/you/foo
![]() |
|
The pathnames and filenames you pass to os.path.abspath do not need to exist. |
![]() |
|
os.path.abspath not only constructs full path names, it also normalizes them. That means that if you are in the /usr/ directory, os.path.abspath('bin/../local/bin') will return /usr/local/bin. It normalizes the path by making it as simple as possible. If you just want to normalize a pathname like this without turning it into a full pathname, use os.path.normpath instead. |
[you@localhost py]$ python /home/you/diveintopython/common/py/fullpath.pysys.argv[0] = /home/you/diveintopython/common/py/fullpath.py path = /home/you/diveintopython/common/py full path = /home/you/diveintopython/common/py [you@localhost diveintopython]$ python common/py/fullpath.py
sys.argv[0] = common/py/fullpath.py path = common/py full path = /home/you/diveintopython/common/py [you@localhost diveintopython]$ cd common/py [you@localhost py]$ python fullpath.py
sys.argv[0] = fullpath.py path = full path = /home/you/diveintopython/common/py
![]() |
|
Like the other functions in the os and os.path modules, os.path.abspath is cross-platform. Your results will look slightly different than my examples if you're running on Windows (which uses backslash as a path separator) or Mac OS (which uses colons), but they'll still work. That's the whole point of the os module. |
Addendum. One reader was dissatisfied with this solution, and wanted to be able to run all the unit tests in the current directory, not the directory where regression.py is located. He suggests this approach instead:
import sys, os, re, unittest def regressionTest(): path = os.getcwd()sys.path.append(path)
files = os.listdir(path)
![]()
This technique will allow you to re-use this regression.py script on multiple projects. Just put the script in a common directory, then change to the project's directory before running it. All of that project's unit tests will be found and tested, instead of the unit tests in the common directory where regression.py is located.
You're already familiar with using list comprehensions to filter lists. There is another way to accomplish this same thing, which some people feel is more expressive.
Python has a built-in filter function which takes two arguments, a function and a list, and returns a list.[7] The function passed as the first argument to filter must itself take one argument, and the list that filter returns will contain all the elements from the list passed to filter for which the function passed to filter returns true.
Got all that? It's not as difficult as it sounds.
>>> def odd(n):... return n % 2 ... >>> li = [1, 2, 3, 5, 9, 10, 256, -3] >>> filter(odd, li)
[1, 3, 5, 9, -3] >>> [e for e in li if odd(e)]
>>> filteredList = [] >>> for n in li:
... if odd(n): ... filteredList.append(n) ... >>> filteredList [1, 3, 5, 9, -3]
![]() | odd uses the built-in mod function “%” to return True if n is odd and False if n is even. |
![]() | filter takes two arguments, a function (odd) and a list (li). It loops through the list and calls odd with each element. If odd returns a true value (remember, any non-zero value is true in Python), then the element is included in the returned list, otherwise it is filtered out. The result is a list of only the odd numbers from the original list, in the same order as they appeared in the original. |
![]() | You could accomplish the same thing using list comprehensions, as you saw in Section 4.5, “Filtering Lists”. |
![]() | You could also accomplish the same thing with a for loop. Depending on your programming background, this may seem more “straightforward”, but functions like filter are much more expressive. Not only is it easier to write, it's easier to read, too. Reading the for loop is like standing too close to a painting; you see all the details, but it may take a few seconds to be able to step back and see the bigger picture: “Oh, you're just filtering the list!” |
files = os.listdir(path)test = re.compile("test\.py$", re.IGNORECASE)
files = filter(test.search, files)
![]() | As you saw in Section 16.2, “Finding the path”, path may contain the full or partial pathname of the directory of the currently running script, or it may contain an empty string if the script is being run from the current directory. Either way, files will end up with the names of the files in the same directory as this script you're running. |
![]() | This is a compiled regular expression. As you saw in Section 15.3, “Refactoring”, if you're going to use the same regular expression over and over, you should compile it for faster performance. The compiled object has a search method which takes a single argument, the string to search. If the regular expression matches the string, the search method returns a Match object containing information about the regular expression match; otherwise it returns None, the Python null value. |
![]() | For each element in the files list, you're going to call the search method of the compiled regular expression object, test. If the regular expression matches, the method will return a Match object, which Python considers to be true, so the element will be included in the list returned by filter. If the regular expression does not match, the search method will return None, which Python considers to be false, so the element will not be included. |
Historical note. Versions of Python prior to 2.0 did not have list comprehensions, so you couldn't filter using list comprehensions; the filter function was the only game in town. Even with the introduction of list comprehensions in 2.0, some people still prefer the old-style filter (and its companion function, map, which you'll see later in this chapter). Both techniques work at the moment, so which one you use is a matter of style. There is discussion that map and filter might be deprecated in a future version of Python, but no decision has been made.
You're already familiar with using list comprehensions to map one list into another. There is another way to accomplish the same thing, using the built-in map function. It works much the same way as the filter function.
>>> def double(n): ... return n*2 ... >>> li = [1, 2, 3, 5, 9, 10, 256, -3] >>> map(double, li)[2, 4, 6, 10, 18, 20, 512, -6] >>> [double(n) for n in li]
[2, 4, 6, 10, 18, 20, 512, -6] >>> newlist = [] >>> for n in li:
... newlist.append(double(n)) ... >>> newlist [2, 4, 6, 10, 18, 20, 512, -6]
![]() | map takes a function and a list[8] and returns a new list by calling the function with each element of the list in order. In this case, the function simply multiplies each element by 2. |
![]() | You could accomplish the same thing with a list comprehension. List comprehensions were first introduced in Python 2.0; map has been around forever. |
![]() | You could, if you insist on thinking like a Visual Basic programmer, use a for loop to accomplish the same thing. |
>>> li = [5, 'a', (2, 'b')] >>> map(double, li)[10, 'aa', (2, 'b', 2, 'b')]
All right, enough play time. Let's look at some real code.
filenameToModuleName = lambda f: os.path.splitext(f)[0]moduleNames = map(filenameToModuleName, files)
![]() | As you saw in Section 4.7, “Using lambda Functions”, lambda defines an inline function. And as you saw in Example 6.17, “Splitting Pathnames”, os.path.splitext takes a filename and returns a tuple (name, extension). So filenameToModuleName is a function which will take a filename and strip off the file extension, and return just the name. |
![]() | Calling map takes each filename listed in files, passes it to the function filenameToModuleName, and returns a list of the return values of each of those function calls. In other words, you strip the file extension off of each filename, and store the list of all those stripped filenames in moduleNames. |
As you'll see in the rest of the chapter, you can extend this type of data-centric thinking all the way to the final goal, which is to define and execute a single test suite that contains the tests from all of those individual test suites.
By now you're probably scratching your head wondering why this is better than using for loops and straight function calls. And that's a perfectly valid question. Mostly, it's a matter of perspective. Using map and filter forces you to center your thinking around your data.
In this case, you started with no data at all; the first thing you did was get the directory path of the current script, and got a list of files in that directory. That was the bootstrap, and it gave you real data to work with: a list of filenames.
However, you knew you didn't care about all of those files, only the ones that were actually test suites. You had too much data, so you needed to filter it. How did you know which data to keep? You needed a test to decide, so you defined one and passed it to the filter function. In this case you used a regular expression to decide, but the concept would be the same regardless of how you constructed the test.
Now you had the filenames of each of the test suites (and only the test suites, since everything else had been filtered out), but you really wanted module names instead. You had the right amount of data, but it was in the wrong format. So you defined a function that would transform a single filename into a module name, and you mapped that function onto the entire list. From one filename, you can get a module name; from a list of filenames, you can get a list of module names.
Instead of filter, you could have used a for loop with an if statement. Instead of map, you could have used a for loop with a function call. But using for loops like that is busywork. At best, it simply wastes time; at worst, it introduces obscure bugs. For instance, you need to figure out how to test for the condition “is this file a test suite?” anyway; that's the application-specific logic, and no language can write that for us. But once you've figured that out, do you really want go to all the trouble of defining a new empty list and writing a for loop and an if statement and manually calling append to add each element to the new list if it passes the condition and then keeping track of which variable holds the new filtered data and which one holds the old unfiltered data? Why not just define the test condition, then let Python do the rest of that work for us?
Oh sure, you could try to be fancy and delete elements in place without creating a new list. But you've been burned by that before. Trying to modify a data structure that you're looping through can be tricky. You delete an element, then loop to the next element, and suddenly you've skipped one. Is Python one of the languages that works that way? How long would it take you to figure it out? Would you remember for certain whether it was safe the next time you tried? Programmers spend so much time and make so many mistakes dealing with purely technical issues like this, and it's all pointless. It doesn't advance your program at all; it's just busywork.
I resisted list comprehensions when I first learned Python, and I resisted filter and map even longer. I insisted on making my life more difficult, sticking to the familiar way of for loops and if statements and step-by-step code-centric programming. And my Python programs looked a lot like Visual Basic programs, detailing every step of every operation in every function. And they had all the same types of little problems and obscure bugs. And it was all pointless.
Let it all go. Busywork code is not important. Data is important. And data is not difficult. It's only data. If you have too much, filter it. If it's not what you want, map it. Focus on the data; leave the busywork behind.
OK, enough philosophizing. Let's talk about dynamically importing modules.
First, let's look at how you normally import modules. The import module syntax looks in the search path for the named module and imports it by name. You can even import multiple modules at once this way, with a comma-separated list. You did this on the very first line of this chapter's script.
Now let's do the same thing, but with dynamic imports.
>>> sys = __import__('sys')>>> os = __import__('os') >>> re = __import__('re') >>> unittest = __import__('unittest') >>> sys
>>> <module 'sys' (built-in)> >>> os >>> <module 'os' from '/usr/local/lib/python2.2/os.pyc'>
So __import__ imports a module, but takes a string argument to do it. In this case the module you imported was just a hard-coded string, but it could just as easily be a variable, or the result of a function call. And the variable that you assign the module to doesn't need to match the module name, either. You could import a series of modules and assign them to a list.
>>> moduleNames = ['sys', 'os', 're', 'unittest']>>> moduleNames ['sys', 'os', 're', 'unittest'] >>> modules = map(__import__, moduleNames)
>>> modules
[<module 'sys' (built-in)>, <module 'os' from 'c:\Python22\lib\os.pyc'>, <module 're' from 'c:\Python22\lib\re.pyc'>, <module 'unittest' from 'c:\Python22\lib\unittest.pyc'>] >>> modules[0].version
'2.2.2 (#37, Nov 26 2002, 10:24:37) [MSC 32 bit (Intel)]' >>> import sys >>> sys.version '2.2.2 (#37, Nov 26 2002, 10:24:37) [MSC 32 bit (Intel)]'
Now you should be able to put this all together and figure out what most of this chapter's code sample is doing.
You've learned enough now to deconstruct the first seven lines of this chapter's code sample: reading a directory and importing selected modules within it.
def regressionTest(): path = os.path.abspath(os.path.dirname(sys.argv[0])) files = os.listdir(path) test = re.compile("test\.py$", re.IGNORECASE) files = filter(test.search, files) filenameToModuleName = lambda f: os.path.splitext(f)[0] moduleNames = map(filenameToModuleName, files) modules = map(__import__, moduleNames) load = unittest.defaultTestLoader.loadTestsFromModule return unittest.TestSuite(map(load, modules))
Let's look at it line by line, interactively. Assume that the current directory is c:\diveintopython\py, which contains the examples that come with this book, including this chapter's script. As you saw in Section 16.2, “Finding the path”, the script directory will end up in the path variable, so let's start hard-code that and go from there.
>>> import sys, os, re, unittest >>> path = r'c:\diveintopython\py' >>> files = os.listdir(path) >>> files['BaseHTMLProcessor.py', 'LICENSE.txt', 'apihelper.py', 'apihelpertest.py', 'argecho.py', 'autosize.py', 'builddialectexamples.py', 'dialect.py', 'fileinfo.py', 'fullpath.py', 'kgptest.py', 'makerealworddoc.py', 'odbchelper.py', 'odbchelpertest.py', 'parsephone.py', 'piglatin.py', 'plural.py', 'pluraltest.py', 'pyfontify.py', 'regression.py', 'roman.py', 'romantest.py', 'uncurly.py', 'unicode2koi8r.py', 'urllister.py', 'kgp', 'plural', 'roman', 'colorize.py']
>>> test = re.compile("test\.py$", re.IGNORECASE)>>> files = filter(test.search, files)
>>> files
['apihelpertest.py', 'kgptest.py', 'odbchelpertest.py', 'pluraltest.py', 'romantest.py']
>>> filenameToModuleName = lambda f: os.path.splitext(f)[0]>>> filenameToModuleName('romantest.py')
'romantest' >>> filenameToModuleName('odchelpertest.py') 'odbchelpertest' >>> moduleNames = map(filenameToModuleName, files)
>>> moduleNames
['apihelpertest', 'kgptest', 'odbchelpertest', 'pluraltest', 'romantest']
![]() | As you saw in Section 4.7, “Using lambda Functions”, lambda is a quick-and-dirty way of creating an inline, one-line function. This one takes a filename with an extension and returns just the filename part, using the standard library function os.path.splitext that you saw in Example 6.17, “Splitting Pathnames”. |
![]() | filenameToModuleName is a function. There's nothing magic about lambda functions as opposed to regular functions that you define with a def statement. You can call the filenameToModuleName function like any other, and it does just what you wanted it to do: strips the file extension off of its argument. |
![]() | Now you can apply this function to each file in the list of unit test files, using map. |
![]() | And the result is just what you wanted: a list of modules, as strings. |
>>> modules = map(__import__, moduleNames)>>> modules
[<module 'apihelpertest' from 'apihelpertest.py'>, <module 'kgptest' from 'kgptest.py'>, <module 'odbchelpertest' from 'odbchelpertest.py'>, <module 'pluraltest' from 'pluraltest.py'>, <module 'romantest' from 'romantest.py'>] >>> modules[-1]
<module 'romantest' from 'romantest.py'>
![]() | As you saw in Section 16.6, “Dynamically importing modules”, you can use a combination of map and __import__ to map a list of module names (as strings) into actual modules (which you can call or access like any other module). |
![]() | modules is now a list of modules, fully accessible like any other module. |
![]() | The last module in the list is the romantest module, just as if you had said import romantest. |
>>> load = unittest.defaultTestLoader.loadTestsFromModule >>> map(load, modules)[<unittest.TestSuite tests=[ <unittest.TestSuite tests=[<apihelpertest.BadInput testMethod=testNoObject>]>, <unittest.TestSuite tests=[<apihelpertest.KnownValues testMethod=testApiHelper>]>, <unittest.TestSuite tests=[ <apihelpertest.ParamChecks testMethod=testCollapse>, <apihelpertest.ParamChecks testMethod=testSpacing>]>, ... ] ] >>> unittest.TestSuite(map(load, modules))
![]()
This introspection process is what the unittest module usually does for us. Remember that magic-looking unittest.main() function that our individual test modules called to kick the whole thing off? unittest.main() actually creates an instance of unittest.TestProgram, which in turn creates an instance of a unittest.defaultTestLoader and loads it up with the module that called it. (How does it get a reference to the module that called it if you don't give it one? By using the equally-magic __import__('__main__') command, which dynamically imports the currently-running module. I could write a book on all the tricks and techniques used in the unittest module, but then I'd never finish this one.)
if __name__ == "__main__": unittest.main(defaultTest="regressionTest")![]()
The regression.py program and its output should now make perfect sense.
You should now feel comfortable doing all of these things:
[7] Technically, the second argument to filter can be any sequence, including lists, tuples, and custom classes that act like lists by defining the __getitem__ special method. If possible, filter will return the same datatype as you give it, so filtering a list returns a list, but filtering a tuple returns a tuple.
[8] Again, I should point out that map can take a list, a tuple, or any object that acts like a sequence. See previous footnote about filter.
I want to talk about plural nouns. Also, functions that return other functions, advanced regular expressions, and generators. Generators are new in Python 2.3. But first, let's talk about how to make plural nouns.
If you haven't read Chapter 7, Regular Expressions, now would be a good time. This chapter assumes you understand the basics of regular expressions, and quickly descends into more advanced uses.
English is a schizophrenic language that borrows from a lot of other languages, and the rules for making singular nouns into plural nouns are varied and complex. There are rules, and then there are exceptions to those rules, and then there are exceptions to the exceptions.
If you grew up in an English-speaking country or learned English in a formal school setting, you're probably familiar with the basic rules:
(I know, there are a lot of exceptions. “Man” becomes “men” and “woman” becomes “women”, but “human” becomes “humans”. “Mouse” becomes “mice” and “louse” becomes “lice”, but “house” becomes “houses”. “Knife” becomes “knives” and “wife” becomes “wives”, but “lowlife” becomes “lowlifes”. And don't even get me started on words that are their own plural, like “sheep”, “deer”, and “haiku”.)
Other languages are, of course, completely different.
Let's design a module that pluralizes nouns. Start with just English nouns, and just these four rules, but keep in mind that you'll inevitably need to add more rules, and you may eventually need to add more languages.
So you're looking at words, which at least in English are strings of characters. And you have rules that say you need to find different combinations of characters, and then do different things to them. This sounds like a job for regular expressions.
import re def plural(noun): if re.search('[sxz]$', noun):return re.sub('$', 'es', noun)
elif re.search('[^aeioudgkprt]h$', noun): return re.sub('$', 'es', noun) elif re.search('[^aeiou]y$', noun): return re.sub('y$', 'ies', noun) else: return noun + 's'
![]() | OK, this is a regular expression, but it uses a syntax you didn't see in Chapter 7, Regular Expressions. The square brackets mean “match exactly one of these characters”. So [sxz] means “s, or x, or z”, but only one of them. The $ should be familiar; it matches the end of string. So you're checking to see if noun ends with s, x, or z. |
![]() | This re.sub function performs regular expression-based string substitutions. Let's look at it in more detail. |
>>> import re >>> re.search('[abc]', 'Mark')<_sre.SRE_Match object at 0x001C1FA8> >>> re.sub('[abc]', 'o', 'Mark')
'Mork' >>> re.sub('[abc]', 'o', 'rock')
'rook' >>> re.sub('[abc]', 'o', 'caps')
'oops'
import re def plural(noun): if re.search('[sxz]$', noun): return re.sub('$', 'es', noun)elif re.search('[^aeioudgkprt]h$', noun):
return re.sub('$', 'es', noun)
elif re.search('[^aeiou]y$', noun): return re.sub('y$', 'ies', noun) else: return noun + 's'
>>> import re >>> re.search('[^aeiou]y$', 'vacancy')<_sre.SRE_Match object at 0x001C1FA8> >>> re.search('[^aeiou]y$', 'boy')
>>> >>> re.search('[^aeiou]y$', 'day') >>> >>> re.search('[^aeiou]y$', 'pita')
>>>
>>> re.sub('y$', 'ies', 'vacancy')'vacancies' >>> re.sub('y$', 'ies', 'agency') 'agencies' >>> re.sub('([^aeiou])y$', r'\1ies', 'vacancy')
'vacancies'
![]() | This regular expression turns vacancy into vacancies and agency into agencies, which is what you wanted. Note that it would also turn boy into boies, but that will never happen in the function because you did that re.search first to find out whether you should do this re.sub. |
![]() | Just in passing, I want to point out that it is possible to combine these two regular expressions (one to find out if the rule applies, and another to actually apply it) into a single regular expression. Here's what that would look like. Most of it should look familiar: you're using a remembered group, which you learned in Section 7.6, “Case study: Parsing Phone Numbers”, to remember the character before the y. Then in the substitution string, you use a new syntax, \1, which means “hey, that first group you remembered? put it here”. In this case, you remember the c before the y, and then when you do the substitution, you substitute c in place of c, and ies in place of y. (If you have more than one remembered group, you can use \2 and \3 and so on.) |
Regular expression substitutions are extremely powerful, and the \1 syntax makes them even more powerful. But combining the entire operation into one regular expression is also much harder to read, and it doesn't directly map to the way you first described the pluralizing rules. You originally laid out rules like “if the word ends in S, X, or Z, then add ES”. And if you look at this function, you have two lines of code that say “if the word ends in S, X, or Z, then add ES”. It doesn't get much more direct than that.
Now you're going to add a level of abstraction. You started by defining a list of rules: if this, then do that, otherwise go to the next rule. Let's temporarily complicate part of the program so you can simplify another part.
import re def match_sxz(noun): return re.search('[sxz]$', noun) def apply_sxz(noun): return re.sub('$', 'es', noun) def match_h(noun): return re.search('[^aeioudgkprt]h$', noun) def apply_h(noun): return re.sub('$', 'es', noun) def match_y(noun): return re.search('[^aeiou]y$', noun) def apply_y(noun): return re.sub('y$', 'ies', noun) def match_default(noun): return 1 def apply_default(noun): return noun + 's' rules = ((match_sxz, apply_sxz), (match_h, apply_h), (match_y, apply_y), (match_default, apply_default) )def plural(noun): for matchesRule, applyRule in rules:
if matchesRule(noun):
return applyRule(noun)
![]()
![]() | This version looks more complicated (it's certainly longer), but it does exactly the same thing: try to match four different rules, in order, and apply the appropriate regular expression when a match is found. The difference is that each individual match and apply rule is defined in its own function, and the functions are then listed in this rules variable, which is a tuple of tuples. |
![]() | Using a for loop, you can pull out the match and apply rules two at a time (one match, one apply) from the rules tuple. On the first iteration of the for loop, matchesRule will get match_sxz, and applyRule will get apply_sxz. On the second iteration (assuming you get that far), matchesRule will be assigned match_h, and applyRule will be assigned apply_h. |
![]() | Remember that everything in Python is an object, including functions. rules contains actual functions; not names of functions, but actual functions. When they get assigned in the for loop, then matchesRule and applyRule are actual functions that you can call. So on the first iteration of the for loop, this is equivalent to calling matches_sxz(noun). |
![]() | On the first iteration of the for loop, this is equivalent to calling apply_sxz(noun), and so forth. |
If this additional level of abstraction is confusing, try unrolling the function to see the equivalence. This for loop is equivalent to the following:
def plural(noun): if match_sxz(noun): return apply_sxz(noun) if match_h(noun): return apply_h(noun) if match_y(noun): return apply_y(noun) if match_default(noun): return apply_default(noun)
The benefit here is that that plural function is now simplified. It takes a list of rules, defined elsewhere, and iterates through them in a generic fashion. Get a match rule; does it match? Then call the apply rule. The rules could be defined anywhere, in any way. The plural function doesn't care.
Now, was adding this level of abstraction worth it? Well, not yet. Let's consider what it would take to add a new rule to the function. Well, in the previous example, it would require adding an if statement to the plural function. In this example, it would require adding two functions, match_foo and apply_foo, and then updating the rules list to specify where in the order the new match and apply functions should be called relative to the other rules.
This is really just a stepping stone to the next section. Let's move on.
Defining separate named functions for each match and apply rule isn't really necessary. You never call them directly; you define them in the rules list and call them through there. Let's streamline the rules definition by anonymizing those functions.
import re rules = \ ( ( lambda word: re.search('[sxz]$', word), lambda word: re.sub('$', 'es', word) ), ( lambda word: re.search('[^aeioudgkprt]h$', word), lambda word: re.sub('$', 'es', word) ), ( lambda word: re.search('[^aeiou]y$', word), lambda word: re.sub('y$', 'ies', word) ), ( lambda word: re.search('$', word), lambda word: re.sub('$', 's', word) ) )def plural(noun): for matchesRule, applyRule in rules:
if matchesRule(noun): return applyRule(noun)
![]() | This is the same set of rules as you defined in stage 2. The only difference is that instead of defining named functions like match_sxz and apply_sxz, you have “inlined” those function definitions directly into the rules list itself, using lambda functions. |
![]() | Note that the plural function hasn't changed at all. It iterates through a set of rule functions, checks the first rule, and if it returns a true value, calls the second rule and returns the value. Same as above, word for word. The only difference is that the rule functions were defined inline, anonymously, using lambda functions. But the plural function doesn't care how they were defined; it just gets a list of rules and blindly works through them. |
Now to add a new rule, all you need to do is define the functions directly in the rules list itself: one match rule, and one apply rule. But defining the rule functions inline like this makes it very clear that you have some unnecessary duplication here. You have four pairs of functions, and they all follow the same pattern. The match function is a single call to re.search, and the apply function is a single call to re.sub. Let's factor out these similarities.
Let's factor out the duplication in the code so that defining new rules can be easier.
import re def buildMatchAndApplyFunctions((pattern, search, replace)): matchFunction = lambda word: re.search(pattern, word)applyFunction = lambda word: re.sub(search, replace, word)
return (matchFunction, applyFunction)
![]()
If this is incredibly confusing (and it should be, this is weird stuff), it may become clearer when you see how to use it.
patterns = \ ( ('[sxz]$', '$', 'es'), ('[^aeioudgkprt]h$', '$', 'es'), ('(qu|[^aeiou])y$', 'y$', 'ies'), ('$', '$', 's') )rules = map(buildMatchAndApplyFunctions, patterns)
![]()
I swear I am not making this up: rules ends up with exactly the same list of functions as the previous example. Unroll the rules definition, and you'll get this:
rules = \ ( ( lambda word: re.search('[sxz]$', word), lambda word: re.sub('$', 'es', word) ), ( lambda word: re.search('[^aeioudgkprt]h$', word), lambda word: re.sub('$', 'es', word) ), ( lambda word: re.search('[^aeiou]y$', word), lambda word: re.sub('y$', 'ies', word) ), ( lambda word: re.search('$', word), lambda word: re.sub('$', 's', word) ) )
def plural(noun): for matchesRule, applyRule in rules:if matchesRule(noun): return applyRule(noun)
![]() | Since the rules list is the same as the previous example, it should come as no surprise that the plural function hasn't changed. Remember, it's completely generic; it takes a list of rule functions and calls them in order. It doesn't care how the rules are defined. In stage 2, they were defined as seperate named functions. In stage 3, they were defined as anonymous lambda functions. Now in stage 4, they are built dynamically by mapping the buildMatchAndApplyFunctions function onto a list of raw strings. Doesn't matter; the plural function still works the same way. |
Just in case that wasn't mind-blowing enough, I must confess that there was a subtlety in the definition of buildMatchAndApplyFunctions that I skipped over. Let's go back and take another look.
def buildMatchAndApplyFunctions((pattern, search, replace)):![]()
>>> def foo((a, b, c)): ... print c ... print b ... print a >>> parameters = ('apple', 'bear', 'catnap') >>> foo(parameters)catnap bear apple
![]() | The proper way to call the function foo is with a tuple of three elements. When the function is called, the elements are assigned to different local variables within foo. |
Now let's go back and see why this auto-tuple-expansion trick was necessary. patterns was a list of tuples, and each tuple had three elements. When you called map(buildMatchAndApplyFunctions, patterns), that means that buildMatchAndApplyFunctions is not getting called with three parameters. Using map to map a single list onto a function always calls the function with a single parameter: each element of the list. In the case of patterns, each element of the list is a tuple, so buildMatchAndApplyFunctions always gets called with the tuple, and you use the auto-tuple-expansion trick in the definition of buildMatchAndApplyFunctions to assign the elements of that tuple to named variables that you can work with.
You've factored out all the duplicate code and added enough abstractions so that the pluralization rules are defined in a list of strings. The next logical step is to take these strings and put them in a separate file, where they can be maintained separately from the code that uses them.
First, let's create a text file that contains the rules you want. No fancy data structures, just space- (or tab-)delimited strings in three columns. You'll call it rules.en; “en” stands for English. These are the rules for pluralizing English nouns. You could add other rule files for other languages later.
Now let's see how you can use this rules file.
import re import string def buildRule((pattern, search, replace)): return lambda word: re.search(pattern, word) and re.sub(search, replace, word)def plural(noun, language='en'):
lines = file('rules.%s' % language).readlines()
patterns = map(string.split, lines)
rules = map(buildRule, patterns)
for rule in rules: result = rule(noun)
if result: return result
![]() | You're still using the closures technique here (building a function dynamically that uses variables defined outside the function), but now you've combined the separate match and apply functions into one. (The reason for this change will become clear in the next section.) This will let you accomplish the same thing as having two functions, but you'll need to call it differently, as you'll see in a minute. |
![]() | Our plural function now takes an optional second parameter, language, which defaults to en. |
![]() | You use the language parameter to construct a filename, then open the file and read the contents into a list. If language is en, then you'll open the rules.en file, read the entire thing, break it up by carriage returns, and return a list. Each line of the file will be one element in the list. |
![]() | As you saw, each line in the file really has three values, but they're separated by whitespace (tabs or spaces, it makes no difference). Mapping the string.split function onto this list will create a new list where each element is a tuple of three strings. So a line like [sxz]$ $ es will be broken up into the tuple ('[sxz]$', '$', 'es'). This means that patterns will end up as a list of tuples, just like you hard-coded it in stage 4. |
![]() | If patterns is a list of tuples, then rules will be a list of the functions created dynamically by each call to buildRule. Calling buildRule(('[sxz]$', '$', 'es')) returns a function that takes a single parameter, word. When this returned function is called, it will execute re.search('[sxz]$', word) and re.sub('$', 'es', word). |
![]() | Because you're now building a combined match-and-apply function, you need to call it differently. Just call the function, and if it returns something, then that's the plural; if it returns nothing (None), then the rule didn't match and you need to try another rule. |
So the improvement here is that you've completely separated the pluralization rules into an external file. Not only can the file be maintained separately from the code, but you've set up a naming scheme where the same plural function can use different rule files, based on the language parameter.
The downside here is that you're reading that file every time you call the plural function. I thought I could get through this entire book without using the phrase “left as an exercise for the reader”, but here you go: building a caching mechanism for the language-specific rule files that auto-refreshes itself if the rule files change between calls is left as an exercise for the reader. Have fun.
Now you're ready to talk about generators.
import re def rules(language): for line in file('rules.%s' % language): pattern, search, replace = line.split() yield lambda word: re.search(pattern, word) and re.sub(search, replace, word) def plural(noun, language='en'): for applyRule in rules(language): result = applyRule(noun) if result: return result
This uses a technique called generators, which I'm not even going to try to explain until you look at a simpler example first.
>>> def make_counter(x): ... print 'entering make_counter' ... while 1: ... yield x... print 'incrementing x' ... x = x + 1 ... >>> counter = make_counter(2)
>>> counter
<generator object at 0x001C9C10> >>> counter.next()
entering make_counter 2 >>> counter.next()
incrementing x 3 >>> counter.next()
incrementing x 4
def fibonacci(max): a, b = 0, 1while a < max: yield a
a, b = b, a+b
![]()
So you have a function that spits out successive Fibonacci numbers. Sure, you could do that with recursion, but this way is easier to read. Also, it works well with for loops.
>>> for n in fibonacci(1000):... print n,
0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987
OK, let's go back to the plural function and see how you're using this.
def rules(language): for line in file('rules.%s' % language):pattern, search, replace = line.split()
yield lambda word: re.search(pattern, word) and re.sub(search, replace, word)
def plural(noun, language='en'): for applyRule in rules(language):
result = applyRule(noun) if result: return result
What have you gained over stage 5? In stage 5, you read the entire rules file and built a list of all the possible rules before you even tried the first one. Now with generators, you can do everything lazily: you open the first and read the first rule and create a function to try it, but if that works you don't ever read the rest of the file or create any other functions.
You talked about several different advanced techniques in this chapter. Not all of them are appropriate for every situation.
You should now be comfortable with all of these techniques:
Adding abstractions, building functions dynamically, building closures, and using generators can all make your code simpler, more readable, and more flexible. But they can also end up making it more difficult to debug later. It's up to you to find the right balance between simplicity and power.
Performance tuning is a many-splendored thing. Just because Python is an interpreted language doesn't mean you shouldn't worry about code optimization. But don't worry about it too much.
There are so many pitfalls involved in optimizing your code, it's hard to know where to start.
Let's start here: are you sure you need to do it at all? Is your code really so bad? Is it worth the time to tune it? Over the lifetime of your application, how much time is going to be spent running that code, compared to the time spent waiting for a remote database server, or waiting for user input?
Second, are you sure you're done coding? Premature optimization is like spreading frosting on a half-baked cake. You spend hours or days (or more) optimizing your code for performance, only to discover it doesn't do what you need it to do. That's time down the drain.
This is not to say that code optimization is worthless, but you need to look at the whole system and decide whether it's the best use of your time. Every minute you spend optimizing code is a minute you're not spending adding new features, or writing documentation, or playing with your kids, or writing unit tests.
Oh yes, unit tests. It should go without saying that you need a complete set of unit tests before you begin performance tuning. The last thing you need is to introduce new bugs while fiddling with your algorithms.
With these caveats in place, let's look at some techniques for optimizing Python code. The code in question is an implementation of the Soundex algorithm. Soundex was a method used in the early 20th century for categorizing surnames in the United States census. It grouped similar-sounding names together, so even if a name was misspelled, researchers had a chance of finding it. Soundex is still used today for much the same reason, although of course we use computerized database servers now. Most database servers include a Soundex function.
There are several subtle variations of the Soundex algorithm. This is the one used in this chapter:
For example, my name, Pilgrim, becomes P942695. That has no consecutive duplicates, so nothing to do there. Then you remove the 9s, leaving P4265. That's too long, so you discard the excess character, leaving P426.
Another example: Woo becomes W99, which becomes W9, which becomes W, which gets padded with zeros to become W000.
Here's a first attempt at a Soundex function:
If you have not already done so, you can download this and other examples used in this book.
import string, re charToSoundex = {"A": "9", "B": "1", "C": "2", "D": "3", "E": "9", "F": "1", "G": "2", "H": "9", "I": "9", "J": "2", "K": "2", "L": "4", "M": "5", "N": "5", "O": "9", "P": "1", "Q": "2", "R": "6", "S": "2", "T": "3", "U": "9", "V": "1", "W": "9", "X": "2", "Y": "9", "Z": "2"} def soundex(source): "convert string to Soundex equivalent" # Soundex requirements: # source string must be at least 1 character # and must consist entirely of letters allChars = string.uppercase + string.lowercase if not re.search('^[%s]+$' % allChars, source): return "0000" # Soundex algorithm: # 1. make first character uppercase source = source[0].upper() + source[1:] # 2. translate all other characters to Soundex digits digits = source[0] for s in source[1:]: s = s.upper() digits += charToSoundex[s] # 3. remove consecutive duplicates digits2 = digits[0] for d in digits[1:]: if digits2[-1] != d: digits2 += d # 4. remove all "9"s digits3 = re.sub('9', '', digits2) # 5. pad end with "0"s to 4 characters while len(digits3) < 4: digits3 += "0" # 6. return first 4 characters return digits3[:4] if __name__ == '__main__': from timeit import Timer names = ('Woo', 'Pilgrim', 'Flingjingwaller') for name in names: statement = "soundex('%s')" % name t = Timer(statement, "from __main__ import soundex") print name.ljust(15), soundex(name), min(t.repeat())
The most important thing you need to know about optimizing Python code is that you shouldn't write your own timing function.
Timing short pieces of code is incredibly complex. How much processor time is your computer devoting to running this code? Are there things running in the background? Are you sure? Every modern computer has background processes running, some all the time, some intermittently. Cron jobs fire off at consistent intervals; background services occasionally “wake up” to do useful things like check for new mail, connect to instant messaging servers, check for application updates, scan for viruses, check whether a disk has been inserted into your CD drive in the last 100 nanoseconds, and so on. Before you start your timing tests, turn everything off and disconnect from the network. Then turn off all the things you forgot to turn off the first time, then turn off the service that's incessantly checking whether the network has come back yet, then ...
And then there's the matter of the variations introduced by the timing framework itself. Does the Python interpreter cache method name lookups? Does it cache code block compilations? Regular expressions? Will your code have side effects if run more than once? Don't forget that you're dealing with small fractions of a second, so small mistakes in your timing framework will irreparably skew your results.
The Python community has a saying: “Python comes with batteries included.” Don't write your own timing framework. Python 2.3 comes with a perfectly good one called timeit.
If you have not already done so, you can download this and other examples used in this book.
>>> import timeit >>> t = timeit.Timer("soundex.soundex('Pilgrim')", ... "import soundex")>>> t.timeit()
8.21683733547 >>> t.repeat(3, 2000000)
[16.48319309109, 16.46128984923, 16.44203948912]
![]() |
|
You can use the timeit module on the command line to test an existing Python program, without modifying the code. See http://docs.python.org/lib/node396.html for documentation on the command-line flags. |
Note that repeat() returns a list of times. The times will almost never be identical, due to slight variations in how much processor time the Python interpreter is getting (and those pesky background processes that you can't get rid of). Your first thought might be to say “Let's take the average and call that The True Number.”
In fact, that's almost certainly wrong. The tests that took longer didn't take longer because of variations in your code or in the Python interpreter; they took longer because of those pesky background processes, or other factors outside of the Python interpreter that you can't fully eliminate. If the different timing results differ by more than a few percent, you still have too much variability to trust the results. Otherwise, take the minimum time and discard the rest.
Python has a handy min function that takes a list and returns the smallest value:
>>> min(t.repeat(3, 1000000)) 8.22203948912
![]() |
|
The timeit module only works if you already know what piece of code you need to optimize. If you have a larger Python program and don't know where your performance problems are, check out the hotshot module. |
The first thing the Soundex function checks is whether the input is a non-empty string of letters. What's the best way to do this?
If you answered “regular expressions”, go sit in the corner and contemplate your bad instincts. Regular expressions are almost never the right answer; they should be avoided whenever possible. Not only for performance reasons, but simply because they're difficult to debug and maintain. Also for performance reasons.
This code fragment from soundex/stage1/soundex1a.py checks whether the function argument source is a word made entirely of letters, with at least one letter (not the empty string):
allChars = string.uppercase + string.lowercase if not re.search('^[%s]+$' % allChars, source): return "0000"
How does soundex1a.py perform? For convenience, the __main__ section of the script contains this code that calls the timeit module, sets up a timing test with three different names, tests each name three times, and displays the minimum time for each:
if __name__ == '__main__': from timeit import Timer names = ('Woo', 'Pilgrim', 'Flingjingwaller') for name in names: statement = "soundex('%s')" % name t = Timer(statement, "from __main__ import soundex") print name.ljust(15), soundex(name), min(t.repeat())
So how does soundex1a.py perform with this regular expression?
C:\samples\soundex\stage1>python soundex1a.py Woo W000 19.3356647283 Pilgrim P426 24.0772053431 Flingjingwaller F452 35.0463220884
As you might expect, the algorithm takes significantly longer when called with longer names. There will be a few things we can do to narrow that gap (make the function take less relative time for longer input), but the nature of the algorithm dictates that it will never run in constant time.
The other thing to keep in mind is that we are testing a representative sample of names. Woo is a kind of trivial case, in that it gets shorted down to a single letter and then padded with zeros. Pilgrim is a normal case, of average length and a mixture of significant and ignored letters. Flingjingwaller is extraordinarily long and contains consecutive duplicates. Other tests might also be helpful, but this hits a good range of different cases.
So what about that regular expression? Well, it's inefficient. Since the expression is testing for ranges of characters (A-Z in uppercase, and a-z in lowercase), we can use a shorthand regular expression syntax. Here is soundex/stage1/soundex1b.py:
if not re.search('^[A-Za-z]+$', source): return "0000"
timeit says soundex1b.py is slightly faster than soundex1a.py, but nothing to get terribly excited about:
C:\samples\soundex\stage1>python soundex1b.py Woo W000 17.1361133887 Pilgrim P426 21.8201693232 Flingjingwaller F452 32.7262294509
We saw in Section 15.3, “Refactoring” that regular expressions can be compiled and reused for faster results. Since this regular expression never changes across function calls, we can compile it once and use the compiled version. Here is soundex/stage1/soundex1c.py:
isOnlyChars = re.compile('^[A-Za-z]+$').search def soundex(source): if not isOnlyChars(source): return "0000"
Using a compiled regular expression in soundex1c.py is significantly faster:
C:\samples\soundex\stage1>python soundex1c.py Woo W000 14.5348347346 Pilgrim P426 19.2784703084 Flingjingwaller F452 30.0893873383
But is this the wrong path? The logic here is simple: the input source needs to be non-empty, and it needs to be composed entirely of letters. Wouldn't it be faster to write a loop checking each character, and do away with regular expressions altogether?
Here is soundex/stage1/soundex1d.py:
if not source: return "0000" for c in source: if not ('A' <= c <= 'Z') and not ('a' <= c <= 'z'): return "0000"
It turns out that this technique in soundex1d.py is not faster than using a compiled regular expression (although it is faster than using a non-compiled regular expression):
C:\samples\soundex\stage1>python soundex1d.py Woo W000 15.4065058548 Pilgrim P426 22.2753567842 Flingjingwaller F452 37.5845122774
Why isn't soundex1d.py faster? The answer lies in the interpreted nature of Python. The regular expression engine is written in C, and compiled to run natively on your computer. On the other hand, this loop is written in Python, and runs through the Python interpreter. Even though the loop is relatively simple, it's not simple enough to make up for the overhead of being interpreted. Regular expressions are never the right answer... except when they are.
It turns out that Python offers an obscure string method. You can be excused for not knowing about it, since it's never been mentioned in this book. The method is called isalpha(), and it checks whether a string contains only letters.
This is soundex/stage1/soundex1e.py:
if (not source) and (not source.isalpha()): return "0000"
How much did we gain by using this specific method in soundex1e.py? Quite a bit.
C:\samples\soundex\stage1>python soundex1e.py Woo W000 13.5069504644 Pilgrim P426 18.2199394057 Flingjingwaller F452 28.9975225902
import string, re charToSoundex = {"A": "9", "B": "1", "C": "2", "D": "3", "E": "9", "F": "1", "G": "2", "H": "9", "I": "9", "J": "2", "K": "2", "L": "4", "M": "5", "N": "5", "O": "9", "P": "1", "Q": "2", "R": "6", "S": "2", "T": "3", "U": "9", "V": "1", "W": "9", "X": "2", "Y": "9", "Z": "2"} def soundex(source): if (not source) and (not source.isalpha()): return "0000" source = source[0].upper() + source[1:] digits = source[0] for s in source[1:]: s = s.upper() digits += charToSoundex[s] digits2 = digits[0] for d in digits[1:]: if digits2[-1] != d: digits2 += d digits3 = re.sub('9', '', digits2) while len(digits3) < 4: digits3 += "0" return digits3[:4] if __name__ == '__main__': from timeit import Timer names = ('Woo', 'Pilgrim', 'Flingjingwaller') for name in names: statement = "soundex('%s')" % name t = Timer(statement, "from __main__ import soundex") print name.ljust(15), soundex(name), min(t.repeat())
The second step of the Soundex algorithm is to convert characters to digits in a specific pattern. What's the best way to do this?
The most obvious solution is to define a dictionary with individual characters as keys and their corresponding digits as values, and do dictionary lookups on each character. This is what we have in soundex/stage1/soundex1c.py (the current best result so far):
charToSoundex = {"A": "9", "B": "1", "C": "2", "D": "3", "E": "9", "F": "1", "G": "2", "H": "9", "I": "9", "J": "2", "K": "2", "L": "4", "M": "5", "N": "5", "O": "9", "P": "1", "Q": "2", "R": "6", "S": "2", "T": "3", "U": "9", "V": "1", "W": "9", "X": "2", "Y": "9", "Z": "2"} def soundex(source): # ... input check omitted for brevity ... source = source[0].upper() + source[1:] digits = source[0] for s in source[1:]: s = s.upper() digits += charToSoundex[s]
You timed soundex1c.py already; this is how it performs:
C:\samples\soundex\stage1>python soundex1c.py Woo W000 14.5341678901 Pilgrim P426 19.2650071448 Flingjingwaller F452 30.1003563302
This code is straightforward, but is it the best solution? Calling upper() on each individual character seems inefficient; it would probably be better to call upper() once on the entire string.
Then there's the matter of incrementally building the digits string. Incrementally building strings like this is horribly inefficient; internally, the Python interpreter needs to create a new string each time through the loop, then discard the old one.
Python is good at lists, though. It can treat a string as a list of characters automatically. And lists are easy to combine into strings again, using the string method join().
Here is soundex/stage2/soundex2a.py, which converts letters to digits by using ↦ and lambda:
def soundex(source): # ... source = source.upper() digits = source[0] + "".join(map(lambda c: charToSoundex[c], source[1:]))
Surprisingly, soundex2a.py is not faster:
C:\samples\soundex\stage2>python soundex2a.py Woo W000 15.0097526362 Pilgrim P426 19.254806407 Flingjingwaller F452 29.3790847719
The overhead of the anonymous lambda function kills any performance you gain by dealing with the string as a list of characters.
soundex/stage2/soundex2b.py uses a list comprehension instead of ↦ and lambda:
source = source.upper() digits = source[0] + "".join([charToSoundex[c] for c in source[1:]])
Using a list comprehension in soundex2b.py is faster than using ↦ and lambda in soundex2a.py, but still not faster than the original code (incrementally building a string in soundex1c.py):
C:\samples\soundex\stage2>python soundex2b.py Woo W000 13.4221324219 Pilgrim P426 16.4901234654 Flingjingwaller F452 25.8186157738
It's time for a radically different approach. Dictionary lookups are a general purpose tool. Dictionary keys can be any length string (or many other data types), but in this case we are only dealing with single-character keys and single-character values. It turns out that Python has a specialized function for handling exactly this situation: the string.maketrans function.
This is soundex/stage2/soundex2c.py:
allChar = string.uppercase + string.lowercase charToSoundex = string.maketrans(allChar, "91239129922455912623919292" * 2) def soundex(source): # ... digits = source[0].upper() + source[1:].translate(charToSoundex)
What the heck is going on here? string.maketrans creates a translation matrix between two strings: the first argument and the second argument. In this case, the first argument is the string ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz, and the second argument is the string 9123912992245591262391929291239129922455912623919292. See the pattern? It's the same conversion pattern we were setting up longhand with a dictionary. A maps to 9, B maps to 1, C maps to 2, and so forth. But it's not a dictionary; it's a specialized data structure that you can access using the string method translate, which translates each character into the corresponding digit, according to the matrix defined by string.maketrans.
timeit shows that soundex2c.py is significantly faster than defining a dictionary and looping through the input and building the output incrementally:
C:\samples\soundex\stage2>python soundex2c.py Woo W000 11.437645008 Pilgrim P426 13.2825062962 Flingjingwaller F452 18.5570110168
You're not going to get much better than that. Python has a specialized function that does exactly what you want to do; use it and move on.
import string, re allChar = string.uppercase + string.lowercase charToSoundex = string.maketrans(allChar, "91239129922455912623919292" * 2) isOnlyChars = re.compile('^[A-Za-z]+$').search def soundex(source): if not isOnlyChars(source): return "0000" digits = source[0].upper() + source[1:].translate(charToSoundex) digits2 = digits[0] for d in digits[1:]: if digits2[-1] != d: digits2 += d digits3 = re.sub('9', '', digits2) while len(digits3) < 4: digits3 += "0" return digits3[:4] if __name__ == '__main__': from timeit import Timer names = ('Woo', 'Pilgrim', 'Flingjingwaller') for name in names: statement = "soundex('%s')" % name t = Timer(statement, "from __main__ import soundex") print name.ljust(15), soundex(name), min(t.repeat())
The third step in the Soundex algorithm is eliminating consecutive duplicate digits. What's the best way to do this?
Here's the code we have so far, in soundex/stage2/soundex2c.py:
digits2 = digits[0] for d in digits[1:]: if digits2[-1] != d: digits2 += d
Here are the performance results for soundex2c.py:
C:\samples\soundex\stage2>python soundex2c.py Woo W000 12.6070768771 Pilgrim P426 14.4033353401 Flingjingwaller F452 19.7774882003
The first thing to consider is whether it's efficient to check digits[-1] each time through the loop. Are list indexes expensive? Would we be better off maintaining the last digit in a separate variable, and checking that instead?
To answer this question, here is soundex/stage3/soundex3a.py:
digits2 = '' last_digit = '' for d in digits: if d != last_digit: digits2 += d last_digit = d
soundex3a.py does not run any faster than soundex2c.py, and may even be slightly slower (although it's not enough of a difference to say for sure):
C:\samples\soundex\stage3>python soundex3a.py Woo W000 11.5346048171 Pilgrim P426 13.3950636184 Flingjingwaller F452 18.6108927252
Why isn't soundex3a.py faster? It turns out that list indexes in Python are extremely efficient. Repeatedly accessing digits2[-1] is no problem at all. On the other hand, manually maintaining the last seen digit in a separate variable means we have two variable assignments for each digit we're storing, which wipes out any small gains we might have gotten from eliminating the list lookup.
Let's try something radically different. If it's possible to treat a string as a list of characters, it should be possible to use a list comprehension to iterate through the list. The problem is, the code needs access to the previous character in the list, and that's not easy to do with a straightforward list comprehension.
However, it is possible to create a list of index numbers using the built-in range() function, and use those index numbers to progressively search through the list and pull out each character that is different from the previous character. That will give you a list of characters, and you can use the string method join() to reconstruct a string from that.
Here is soundex/stage3/soundex3b.py:
digits2 = "".join([digits[i] for i in range(len(digits)) if i == 0 or digits[i-1] != digits[i]])
Is this faster? In a word, no.
C:\samples\soundex\stage3>python soundex3b.py Woo W000 14.2245271396 Pilgrim P426 17.8337165757 Flingjingwaller F452 25.9954005327
It's possible that the techniques so far as have been “string-centric”. Python can convert a string into a list of characters with a single command: list('abc') returns ['a', 'b', 'c']. Furthermore, lists can be modified in place very quickly. Instead of incrementally building a new list (or string) out of the source string, why not move elements around within a single list?
Here is soundex/stage3/soundex3c.py, which modifies a list in place to remove consecutive duplicate elements:
digits = list(source[0].upper() + source[1:].translate(charToSoundex)) i=0 for item in digits: if item==digits[i]: continue i+=1 digits[i]=item del digits[i+1:] digits2 = "".join(digits)
Is this faster than soundex3a.py or soundex3b.py? No, in fact it's the slowest method yet:
C:\samples\soundex\stage3>python soundex3c.py Woo W000 14.1662554878 Pilgrim P426 16.0397885765 Flingjingwaller F452 22.1789341942
We haven't made any progress here at all, except to try and rule out several “clever” techniques. The fastest code we've seen so far was the original, most straightforward method (soundex2c.py). Sometimes it doesn't pay to be clever.
import string, re allChar = string.uppercase + string.lowercase charToSoundex = string.maketrans(allChar, "91239129922455912623919292" * 2) isOnlyChars = re.compile('^[A-Za-z]+$').search def soundex(source): if not isOnlyChars(source): return "0000" digits = source[0].upper() + source[1:].translate(charToSoundex) digits2 = digits[0] for d in digits[1:]: if digits2[-1] != d: digits2 += d digits3 = re.sub('9', '', digits2) while len(digits3) < 4: digits3 += "0" return digits3[:4] if __name__ == '__main__': from timeit import Timer names = ('Woo', 'Pilgrim', 'Flingjingwaller') for name in names: statement = "soundex('%s')" % name t = Timer(statement, "from __main__ import soundex") print name.ljust(15), soundex(name), min(t.repeat())
The final step of the Soundex algorithm is padding short results with zeros, and truncating long results. What is the best way to do this?
This is what we have so far, taken from soundex/stage2/soundex2c.py:
digits3 = re.sub('9', '', digits2) while len(digits3) < 4: digits3 += "0" return digits3[:4]
These are the results for soundex2c.py:
C:\samples\soundex\stage2>python soundex2c.py Woo W000 12.6070768771 Pilgrim P426 14.4033353401 Flingjingwaller F452 19.7774882003
The first thing to consider is replacing that regular expression with a loop. This code is from soundex/stage4/soundex4a.py:
digits3 = '' for d in digits2: if d != '9': digits3 += d
Is soundex4a.py faster? Yes it is:
C:\samples\soundex\stage4>python soundex4a.py Woo W000 6.62865531792 Pilgrim P426 9.02247576158 Flingjingwaller F452 13.6328416042
But wait a minute. A loop to remove characters from a string? We can use a simple string method for that. Here's soundex/stage4/soundex4b.py:
digits3 = digits2.replace('9', '')
Is soundex4b.py faster? That's an interesting question. It depends on the input:
C:\samples\soundex\stage4>python soundex4b.py Woo W000 6.75477414029 Pilgrim P426 7.56652144337 Flingjingwaller F452 10.8727729362
The string method in soundex4b.py is faster than the loop for most names, but it's actually slightly slower than soundex4a.py in the trivial case (of a very short name). Performance optimizations aren't always uniform; tuning that makes one case faster can sometimes make other cases slower. In this case, the majority of cases will benefit from the change, so let's leave it at that, but the principle is an important one to remember.
Last but not least, let's examine the final two steps of the algorithm: padding short results with zeros, and truncating long results to four characters. The code you see in soundex4b.py does just that, but it's horribly inefficient. Take a look at soundex/stage4/soundex4c.py to see why:
digits3 += '000' return digits3[:4]
Why do we need a while loop to pad out the result? We know in advance that we're going to truncate the result to four characters, and we know that we already have at least one character (the initial letter, which is passed unchanged from the original source variable). That means we can simply add three zeros to the output, then truncate it. Don't get stuck in a rut over the exact wording of the problem; looking at the problem slightly differently can lead to a simpler solution.
How much speed do we gain in soundex4c.py by dropping the while loop? It's significant:
C:\samples\soundex\stage4>python soundex4c.py Woo W000 4.89129791636 Pilgrim P426 7.30642134685 Flingjingwaller F452 10.689832367
Finally, there is still one more thing you can do to these three lines of code to make them faster: you can combine them into one line. Take a look at soundex/stage4/soundex4d.py:
return (digits2.replace('9', '') + '000')[:4]
Putting all this code on one line in soundex4d.py is barely faster than soundex4c.py:
C:\samples\soundex\stage4>python soundex4d.py Woo W000 4.93624105857 Pilgrim P426 7.19747593619 Flingjingwaller F452 10.5490700634
It is also significantly less readable, and for not much performance gain. Is that worth it? I hope you have good comments. Performance isn't everything. Your optimization efforts must always be balanced against threats to your program's readability and maintainability.
This chapter has illustrated several important aspects of performance tuning in Python, and performance tuning in general.
I can't emphasize that last point strongly enough. Over the course of this chapter, you made this function three times faster and saved 20 seconds over 1 million function calls. Great. Now think: over the course of those million function calls, how many seconds will your surrounding application wait for a database connection? Or wait for disk I/O? Or wait for user input? Don't spend too much time over-optimizing one algorithm, or you'll ignore obvious improvements somewhere else. Develop an instinct for the sort of code that Python runs well, correct obvious blunders if you find them, and leave the rest alone.
Chapter 2. Your First Python Program
Chapter 4. The Power Of Introspection
Chapter 5. Objects and Object-Orientation
Chapter 6. Exceptions and File Handling
Chapter 7. Regular Expressions
Chapter 10. Scripts and Streams
Chapter 14. Test-First Programming
Chapter 16. Functional Programming
Chapter 18. Performance Tuning
The first thing you need to do with Python is install it. Or do you?
On Windows, you have a couple choices for installing Python.
On Mac OS X, you have two choices for installing Python: install it, or don't install it. You probably want to install it.
Mac OS 9 does not come with any version of Python, but installation is very simple, and there is only one choice.
Download the latest Python RPM by going to http://www.python.org/ftp/python/ and selecting the highest version number listed, then selecting the rpms/ directory within that. Then download the RPM with the highest version number. You can install it with the rpm command, as shown here:
If you are lucky enough to be running Debian GNU/Linux, you install Python through the apt command.
If you prefer to build from source, you can download the Python source code from http://www.python.org/ftp/python/. Select the highest version number listed, download the .tgz file), and then do the usual configure, make, make install dance.
Now that you have Python installed, what's this interactive shell thing you're running?
You should now have a version of Python installed that works for you.
Chapter 2. Your First Python Program
Here is a complete, working Python program.
Python has functions like most other languages, but it does not have separate header files like C++ or interface/implementation sections like Pascal. When you need a function, just declare it, like this:
You can document a Python function by giving it a doc string.
A function, like everything else in Python, is an object.
Python functions have no explicit begin or end, and no curly braces to mark where the function code starts and stops. The only delimiter is a colon (:) and the indentation of the code itself.
Python modules are objects and have several useful attributes. You can use this to easily test your modules as you write them. Here's an example that uses the if __name__ trick.
One of Python's built-in datatypes is the dictionary, which defines one-to-one relationships between keys and values.
Lists are Python's workhorse datatype. If your only experience with lists is arrays in Visual Basic or (God forbid) the datastore in Powerbuilder, brace yourself for Python lists.
A tuple is an immutable list. A tuple can not be changed in any way once it is created.
Python has local and global variables like most other languages, but it has no explicit variable declarations. Variables spring into existence by being assigned a value, and they are automatically destroyed when they go out of scope.
Python supports formatting values into strings. Although this can include very complicated expressions, the most basic usage is to insert values into a string with the %s placeholder.
One of the most powerful features of Python is the list comprehension, which provides a compact way of mapping a list into another list by applying a function to each of the elements of the list.
You have a list of key-value pairs in the form key=value, and you want to join them into a single string. To join any list of strings into a single string, use the join method of a string object.
The odbchelper.py program and its output should now make perfect sense.
Chapter 4. The Power Of Introspection
Here is a complete, working Python program. You should understand a good deal about it just by looking at it. The numbered lines illustrate concepts covered in Chapter 2, Your First Python Program. Don't worry if the rest of the code looks intimidating; you'll learn all about it throughout this chapter.
Python allows function arguments to have default values; if the function is called without the argument, the argument gets its default value. Futhermore, arguments can be specified in any order by using named arguments. Stored procedures in SQL Server Transact/SQL can do this, so if you're a SQL Server scripting guru, you can skim this part.
Python has a small set of extremely useful built-in functions. All other functions are partitioned off into modules. This was actually a conscious design decision, to keep the core language from getting bloated like other scripting languages (cough cough, Visual Basic).
You already know that Python functions are objects. What you don't know is that you can get a reference to a function without knowing its name until run-time, by using the getattr function.
As you know, Python has powerful capabilities for mapping lists into other lists, via list comprehensions (Section 3.6, “Mapping Lists”). This can be combined with a filtering mechanism, where some elements in the list are mapped while others are skipped entirely.
In Python, and and or perform boolean logic as you would expect, but they do not return boolean values; instead, they return one of the actual values they are comparing.
Python supports an interesting syntax that lets you define one-line mini-functions on the fly. Borrowed from Lisp, these so-called lambda functions can be used anywhere a function is required.
The last line of code, the only one you haven't deconstructed yet, is the one that does all the work. But by now the work is easy, because everything you need is already set up just the way you need it. All the dominoes are in place; it's time to knock them down.
The apihelper.py program and its output should now make perfect sense.
Chapter 5. Objects and Object-Orientation
Here is a complete, working Python program. Read the doc strings of the module, the classes, and the functions to get an overview of what this program does and how it works. As usual, don't worry about the stuff you don't understand; that's what the rest of the chapter is for.
Python has two ways of importing modules. Both are useful, and you should know when to use each. One way, import module, you've already seen in Section 2.4, “Everything Is an Object”. The other way accomplishes the same thing, but it has subtle and important differences.
Python is fully object-oriented: you can define your own classes, inherit from your own or built-in classes, and instantiate the classes you've defined.
Instantiating classes in Python is straightforward. To instantiate a class, simply call the class as if it were a function, passing the arguments that the __init__ method defines. The return value will be the newly created object.
As you've seen, FileInfo is a class that acts like a dictionary. To explore this further, let's look at the UserDict class in the UserDict module, which is the ancestor of the FileInfo class. This is nothing special; the class is written in Python and stored in a .py file, just like any other Python code. In particular, it's stored in the lib directory in your Python installation.
In addition to normal class methods, there are a number of special methods that Python classes can define. Instead of being called directly by your code (like normal methods), special methods are called for you by Python in particular circumstances or when specific syntax is used.
Python has more special methods than just __getitem__ and __setitem__. Some of them let you emulate functionality that you may not even know about.
You already know about data attributes, which are variables owned by a specific instance of a class. Python also supports class attributes, which are variables owned by the class itself.
Unlike in most languages, whether a Python function, method, or attribute is private or public is determined entirely by its name.
That's it for the hard-core object trickery. You'll see a real-world application of special class methods in Chapter 12, which uses getattr to create a proxy to a remote web service.
Chapter 6. Exceptions and File Handling
Like many other programming languages, Python has exception handling via try...except blocks.
Python has a built-in function, open, for opening a file on disk. open returns a file object, which has methods and attributes for getting information about and manipulating the opened file.
Like most other languages, Python has for loops. The only reason you haven't seen them until now is that Python is good at so many other things that you don't need them as often.
Modules, like everything else in Python, are objects. Once imported, you can always get a reference to a module through the global dictionary sys.modules.
The os.path module has several functions for manipulating files and directories. Here, we're looking at handling pathnames and listing the contents of a directory.
Once again, all the dominoes are in place. You've seen how each line of code works. Now let's step back and see how it all fits together.
The fileinfo.py program introduced in Chapter 5 should now make perfect sense.
Chapter 7. Regular Expressions
If what you're trying to do can be accomplished with string functions, you should use them. They're fast and simple and easy to read, and there's a lot to be said for fast, simple, readable code. But if you find yourself using a lot of different string functions with if statements to handle special cases, or if you're combining them with split and join and list comprehensions in weird unreadable ways, you may need to move up to regular expressions.
This series of examples was inspired by a real-life problem I had in my day job several years ago, when I needed to scrub and standardize street addresses exported from a legacy system before importing them into a newer system. (See, I don't just make this stuff up; it's actually useful.) This example shows how I approached the problem.
You've most likely seen Roman numerals, even if you didn't recognize them. You may have seen them in copyrights of old movies and television shows (“Copyright MCMXLVI” instead of “Copyright 1946”), or on the dedication walls of libraries or universities (“established MDCCCLXXXVIII” instead of “established 1888”). You may also have seen them in outlines and bibliographical references. It's a system of representing numbers that really does date back to the ancient Roman empire (hence the name).
In the previous section, you were dealing with a pattern where the same character could be repeated up to three times. There is another way to express this in regular expressions, which some people find more readable. First look at the method we already used in the previous example.
So far you've just been dealing with what I'll call “compact” regular expressions. As you've seen, they are difficult to read, and even if you figure out what one does, that's no guarantee that you'll be able to understand it six months later. What you really need is inline documentation.
So far you've concentrated on matching whole patterns. Either the pattern matches, or it doesn't. But regular expressions are much more powerful than that. When a regular expression does match, you can pick out specific pieces of it. You can find out what matched where.
This is just the tiniest tip of the iceberg of what regular expressions can do. In other words, even though you're completely overwhelmed by them now, believe me, you ain't seen nothing yet.
I often see questions on comp.lang.python like “How can I list all the [headers|images|links] in my HTML document?” “How do I parse/translate/munge the text of my HTML document but leave the tags alone?” “How can I add/remove/quote attributes of all my HTML tags at once?” This chapter will answer all of these questions.
HTML processing is broken into three steps: breaking down the HTML into its constituent pieces, fiddling with the pieces, and reconstructing the pieces into HTML again. The first step is done by sgmllib.py, a part of the standard Python library.
To extract data from HTML documents, subclass the SGMLParser class and define methods for each tag or entity you want to capture.
SGMLParser doesn't produce anything by itself. It parses and parses and parses, and it calls a method for each interesting thing it finds, but the methods don't do anything. SGMLParser is an HTML consumer: it takes HTML and breaks it down into small, structured pieces. As you saw in the previous section, you can subclass SGMLParser to define classes that catch specific tags and produce useful things, like a list of all the links on a web page. Now you'll take this one step further by defining a class that catches everything SGMLParser throws at it and reconstructs the complete HTML document. In technical terms, this class will be an HTML producer.
Let's digress from HTML processing for a minute and talk about how Python handles variables. Python has two built-in functions, locals and globals, which provide dictionary-based access to local and global variables.
There is an alternative form of string formatting that uses dictionaries instead of tuples of values.
A common question on comp.lang.python is “I have a bunch of HTML documents with unquoted attribute values, and I want to properly quote them all. How can I do this?”[4] (This is generally precipitated by a project manager who has found the HTML-is-a-standard religion joining a large project and proclaiming that all pages must validate against an HTML validator. Unquoted attribute values are a common violation of the HTML standard.) Whatever the reason, unquoted attribute values are easy to fix by feeding HTML through BaseHTMLProcessor.
Dialectizer is a simple (and silly) descendant of BaseHTMLProcessor. It runs blocks of text through a series of substitutions, but it makes sure that anything within a <pre>...</pre> block passes through unaltered.
It's time to put everything you've learned so far to good use. I hope you were paying attention.
Python provides you with a powerful tool, sgmllib.py, to manipulate HTML by turning its structure into an object model. You can use this tool in many different ways.
There are two basic ways to work with XML. One is called SAX (“Simple API for XML”), and it works by reading the XML a little bit at a time and calling a method for each element it finds. (If you read Chapter 8, HTML Processing, this should sound familiar, because that's how the sgmllib module works.) The other is called DOM (“Document Object Model”), and it works by reading in the entire XML document at once and creating an internal representation of it using native Python classes linked in a tree structure. Python has standard modules for both kinds of parsing, but this chapter will only deal with using the DOM.
Actually parsing an XML document is very simple: one line of code. However, before you get to that line of code, you need to take a short detour to talk about packages.
As I was saying, actually parsing an XML document is very simple: one line of code. Where you go from there is up to you.
Unicode is a system to represent characters from all the world's different languages. When Python parses an XML document, all data is stored in memory as unicode.
Traversing XML documents by stepping through each node can be tedious. If you're looking for something in particular, buried deep within your XML document, there is a shortcut you can use to find it quickly: getElementsByTagName.
XML elements can have one or more attributes, and it is incredibly simple to access them once you have parsed an XML document.
OK, that's it for the hard-core XML stuff. The next chapter will continue to use these same example programs, but focus on other aspects that make the program more flexible: using streams for input processing, using getattr for method dispatching, and using command-line flags to allow users to reconfigure the program without changing the code.
Chapter 10. Scripts and Streams
One of Python's greatest strengths is its dynamic binding, and one powerful use of dynamic binding is the file-like object.
UNIX users are already familiar with the concept of standard input, standard output, and standard error. This section is for the rest of you.
kgp.py employs several tricks which may or may not be useful to you in your XML processing. The first one takes advantage of the consistent structure of the input documents to build a cache of nodes.
Another useful techique when parsing XML documents is finding all the direct child elements of a particular element. For instance, in the grammar files, a ref element can have several p elements, each of which can contain many things, including other p elements. You want to find just the p elements that are children of the ref, not p elements that are children of other p elements.
The third useful XML processing tip involves separating your code into logical functions, based on node types and element names. Parsed XML documents are made up of various types of nodes, each represented by a Python object. The root level of the document itself is represented by a Document object. The Document then contains one or more Element objects (for actual XML tags), each of which may contain other Element objects, Text objects (for bits of text), or Comment objects (for embedded comments). Python makes it easy to write a dispatcher to separate the logic for each node type.
Python fully supports creating programs that can be run on the command line, complete with command-line arguments and either short- or long-style flags to specify various options. None of this is XML-specific, but this script makes good use of command-line processing, so it seemed like a good time to mention it.
You've covered a lot of ground. Let's step back and see how all the pieces fit together.
Python comes with powerful libraries for parsing and manipulating XML documents. The minidom takes an XML file and parses it into Python objects, providing for random access to arbitrary elements. Furthermore, this chapter shows how Python can be used to create a "real" standalone command-line script, complete with command-line flags, command-line arguments, error handling, even the ability to take input from the piped result of a previous program.
You've learned about HTML processing and XML processing, and along the way you saw how to download a web page and how to parse XML from a URL, but let's dive into the more general topic of HTTP web services.
Let's say you want to download a resource over HTTP, such as a syndicated Atom feed. But you don't just want to download it once; you want to download it over and over again, every hour, to get the latest news from the site that's offering the news feed. Let's do it the quick-and-dirty way first, and then see how you can do better.
There are five important features of HTTP which you should support.
First, let's turn on the debugging features of Python's HTTP library and see what's being sent over the wire. This will be useful throughout the chapter, as you add more and more features.
The first step to improving your HTTP web services client is to identify yourself properly with a User-Agent. To do that, you need to move beyond the basic urllib and dive into urllib2.
Now that you know how to add custom HTTP headers to your web service requests, let's look at adding support for Last-Modified and ETag headers.
You can support permanent and temporary redirects using a different kind of custom URL handler.
The last important HTTP feature you want to support is compression. Many web services have the ability to send data compressed, which can cut down the amount of data sent over the wire by 60% or more. This is especially true of XML web services, since XML data compresses very well.
You've seen all the pieces for building an intelligent HTTP web services client. Now let's see how they all fit together.
The openanything.py and its functions should now make perfect sense.
You use Google, right? It's a popular search engine. Have you ever wished you could programmatically access Google search results? Now you can. Here is a program to search Google from Python.
Unlike the other code in this book, this chapter relies on libraries that do not come pre-installed with Python.
The heart of SOAP is the ability to call remote functions. There are a number of public access SOAP servers that provide simple functions for demonstration purposes.
The SOAP libraries provide an easy way to see what's going on behind the scenes.
The SOAPProxy class proxies local method calls and transparently turns then into invocations of remote SOAP methods. As you've seen, this is a lot of work, and SOAPProxy does it quickly and transparently. What it doesn't do is provide any means of method introspection.
Like many things in the web services arena, WSDL has a long and checkered history, full of political strife and intrigue. I will skip over this history entirely, since it bores me to tears. There were other standards that tried to do similar things, but WSDL won, so let's learn how to use it.
Let's finally turn to the sample code that you saw that the beginning of this chapter, which does something more useful and exciting than get the current temperature.
Of course, the world of SOAP web services is not all happiness and light. Sometimes things go wrong.
SOAP web services are very complicated. The specification is very ambitious and tries to cover many different use cases for web services. This chapter has touched on some of the simpler use cases.
In previous chapters, you “dived in” by immediately looking at code and trying to understand it as quickly as possible. Now that you have some Python under your belt, you're going to step back and look at the steps that happen before the code gets written.
Now that you've completely defined the behavior you expect from your conversion functions, you're going to do something a little unexpected: you're going to write a test suite that puts these functions through their paces and makes sure that they behave the way you want them to. You read that right: you're going to write code that tests code that you haven't written yet.
This is the complete test suite for your Roman numeral conversion functions, which are yet to be written but will eventually be in roman.py. It is not immediately obvious how it all fits together; none of these classes or methods reference any of the others. There are good reasons for this, as you'll see shortly.
The most fundamental part of unit testing is constructing individual test cases. A test case answers a single question about the code it is testing.
It is not enough to test that functions succeed when given good input; you must also test that they fail when given bad input. And not just any sort of failure; they must fail in the way you expect.
Often, you will find that a unit of code contains a set of reciprocal functions, usually in the form of conversion functions where one converts A to B and the other converts B to A. In these cases, it is useful to create a “sanity check” to make sure that you can convert A to B and back to A without losing precision, incurring rounding errors, or triggering any other sort of bug.
Chapter 14. Test-First Programming
Now that the unit tests are complete, it's time to start writing the code that the test cases are attempting to test. You're going to do this in stages, so you can see all the unit tests fail, then watch them pass one by one as you fill in the gaps in roman.py.
Now that you have the framework of the roman module laid out, it's time to start writing code and passing test cases.
Now that toRoman behaves correctly with good input (integers from 1 to 3999), it's time to make it behave correctly with bad input (everything else).
Now that toRoman is done, it's time to start coding fromRoman. Thanks to the rich data structure that maps individual Roman numerals to integer values, this is no more difficult than the toRoman function.
Now that fromRoman works properly with good input, it's time to fit in the last piece of the puzzle: making it work properly with bad input. That means finding a way to look at a string and determine if it's a valid Roman numeral. This is inherently more difficult than validating numeric input in toRoman, but you have a powerful tool at your disposal: regular expressions.
Despite your best efforts to write comprehensive unit tests, bugs happen. What do I mean by “bug”? A bug is a test case you haven't written yet.
Despite your best efforts to pin your customers to the ground and extract exact requirements from them on pain of horrible nasty things involving scissors and hot wax, requirements will change. Most customers don't know what they want until they see it, and even if they do, they aren't that good at articulating what they want precisely enough to be useful. And even if they do, they'll want more in the next release anyway. So be prepared to update your test cases as requirements change.
The best thing about comprehensive unit testing is not the feeling you get when all your test cases finally pass, or even the feeling you get when someone else blames you for breaking their code and you can actually prove that you didn't. The best thing about unit testing is that it gives you the freedom to refactor mercilessly.
A clever reader read the previous section and took it to the next level. The biggest headache (and performance drain) in the program as it is currently written is the regular expression, which is required because you have no other way of breaking down a Roman numeral. But there's only 5000 of them; why don't you just build a lookup table once, then simply read that? This idea gets even better when you realize that you don't need to use regular expressions at all. As you build the lookup table for converting integers to Roman numerals, you can build the reverse lookup table to convert Roman numerals to integers.
Unit testing is a powerful concept which, if properly implemented, can both reduce maintenance costs and increase flexibility in any long-term project. It is also important to understand that unit testing is not a panacea, a Magic Problem Solver, or a silver bullet. Writing good test cases is hard, and keeping them up to date takes discipline (especially when customers are screaming for critical bug fixes). Unit testing is not a replacement for other forms of testing, including functional testing, integration testing, and user acceptance testing. But it is feasible, and it does work, and once you've seen it work, you'll wonder how you ever got along without it.
Chapter 16. Functional Programming
In Chapter 13, Unit Testing, you learned about the philosophy of unit testing. In Chapter 14, Test-First Programming, you stepped through the implementation of basic unit tests in Python. In Chapter 15, Refactoring, you saw how unit testing makes large-scale refactoring easier. This chapter will build on those sample programs, but here we will focus more on advanced Python-specific techniques, rather than on unit testing itself.
When running Python scripts from the command line, it is sometimes useful to know where the currently running script is located on disk.
You're already familiar with using list comprehensions to filter lists. There is another way to accomplish this same thing, which some people feel is more expressive.
You're already familiar with using list comprehensions to map one list into another. There is another way to accomplish the same thing, using the built-in map function. It works much the same way as the filter function.
By now you're probably scratching your head wondering why this is better than using for loops and straight function calls. And that's a perfectly valid question. Mostly, it's a matter of perspective. Using map and filter forces you to center your thinking around your data.
OK, enough philosophizing. Let's talk about dynamically importing modules.
You've learned enough now to deconstruct the first seven lines of this chapter's code sample: reading a directory and importing selected modules within it.
The regression.py program and its output should now make perfect sense.
I want to talk about plural nouns. Also, functions that return other functions, advanced regular expressions, and generators. Generators are new in Python 2.3. But first, let's talk about how to make plural nouns.
So you're looking at words, which at least in English are strings of characters. And you have rules that say you need to find different combinations of characters, and then do different things to them. This sounds like a job for regular expressions.
Now you're going to add a level of abstraction. You started by defining a list of rules: if this, then do that, otherwise go to the next rule. Let's temporarily complicate part of the program so you can simplify another part.
Defining separate named functions for each match and apply rule isn't really necessary. You never call them directly; you define them in the rules list and call them through there. Let's streamline the rules definition by anonymizing those functions.
Let's factor out the duplication in the code so that defining new rules can be easier.
You've factored out all the duplicate code and added enough abstractions so that the pluralization rules are defined in a list of strings. The next logical step is to take these strings and put them in a separate file, where they can be maintained separately from the code that uses them.
Now you're ready to talk about generators.
You talked about several different advanced techniques in this chapter. Not all of them are appropriate for every situation.
Chapter 18. Performance Tuning
There are so many pitfalls involved in optimizing your code, it's hard to know where to start.
The most important thing you need to know about optimizing Python code is that you shouldn't write your own timing function.
The first thing the Soundex function checks is whether the input is a non-empty string of letters. What's the best way to do this?
The second step of the Soundex algorithm is to convert characters to digits in a specific pattern. What's the best way to do this?
The third step in the Soundex algorithm is eliminating consecutive duplicate digits. What's the best way to do this?
The final step of the Soundex algorithm is padding short results with zeros, and truncating long results. What is the best way to do this?
This chapter has illustrated several important aspects of performance tuning in Python, and performance tuning in general.
Chapter 2. Your First Python Program
![]() |
|
In the ActivePython IDE on Windows, you can run the Python program you're editing by choosing -> (Ctrl-R). Output is displayed in the interactive window. |
![]() |
|
In the Python IDE on Mac OS, you can run a Python program with -> (Cmd-R), but there is an important option you must set first. Open the .py file in the IDE, pop up the options menu by clicking the black triangle in the upper-right corner of the window, and make sure the option is checked. This is a per-file setting, but you'll only need to do it once per file. |
![]() |
|
On UNIX-compatible systems (including Mac OS X), you can run a Python program from the command line: python odbchelper.py |
![]() |
|
In Visual Basic, functions (that return a value) start with function, and subroutines (that do not return a value) start with sub. There are no subroutines in Python. Everything is a function, all functions return a value (even if it's None), and all functions start with def. |
![]() |
|
In Java, C++, and other statically-typed languages, you must specify the datatype of the function return value and each function argument. In Python, you never explicitly specify the datatype of anything. Based on what value you assign, Python keeps track of the datatype internally. |
![]() |
|
Triple quotes are also an easy way to define a string with both single and double quotes, like qq/.../ in Perl. |
![]() |
|
Many Python IDEs use the doc string to provide context-sensitive documentation, so that when you type a function name, its doc string appears as a tooltip. This can be incredibly helpful, but it's only as good as the doc strings you write. |
![]() |
|
import in Python is like require in Perl. Once you import a Python module, you access its functions with module.function; once you require a Perl module, you access its functions with module::function. |
![]() |
|
Python uses carriage returns to separate statements and a colon and indentation to separate code blocks. C++ and Java use semicolons to separate statements and curly braces to separate code blocks. |
![]() |
|
Like C, Python uses == for comparison and = for assignment. Unlike C, Python does not support in-line assignment, so there's no chance of accidentally assigning the value you thought you were comparing. |
![]() |
|
On MacPython, there is an additional step to make the if __name__ trick work. Pop up the module's options menu by clicking the black triangle in the upper-right corner of the window, and make sure is checked. |
![]() |
|
A dictionary in Python is like a hash in Perl. In Perl, variables that store hashes always start with a % character. In Python, variables can be named anything, and Python keeps track of the datatype internally. |
![]() |
|
A dictionary in Python is like an instance of the Hashtable class in Java. |
![]() |
|
A dictionary in Python is like an instance of the Scripting.Dictionary object in Visual Basic. |
![]() |
|
Dictionaries have no concept of order among elements. It is incorrect to say that the elements are “out of order”; they are simply unordered. This is an important distinction that will annoy you when you want to access the elements of a dictionary in a specific, repeatable order (like alphabetical order by key). There are ways of doing this, but they're not built into the dictionary. |
![]() |
|
A list in Python is like an array in Perl. In Perl, variables that store arrays always start with the @ character; in Python, variables can be named anything, and Python keeps track of the datatype internally. |
![]() |
|
A list in Python is much more than an array in Java (although it can be used as one if that's really all you want out of life). A better analogy would be to the ArrayList class, which can hold arbitrary objects and can expand dynamically as new items are added. |
![]() |
|
Before version 2.2.1, Python had no separate boolean datatype. To compensate for this, Python accepted almost anything in a boolean context (like an if statement), according to the following rules:
|
![]() |
|
Tuples can be converted into lists, and vice-versa. The built-in tuple function takes a list and returns a tuple with the same elements, and the list function takes a tuple and returns a list. In effect, tuple freezes a list, and list thaws a tuple. |
![]() |
|
When a command is split among several lines with the line-continuation marker (“\”), the continued lines can be indented in any manner; Python's normally stringent indentation rules do not apply. If your Python IDE auto-indents the continued line, you should probably accept its default unless you have a burning reason not to. |
![]() |
|
String formatting in Python uses the same syntax as the sprintf function in C. |
![]() |
|
join works only on lists of strings; it does not do any type coercion. Joining a list that has one or more non-string elements will raise an exception. |
![]() |
|
anystring.split(delimiter, 1) is a useful technique when you want to search a string for a substring and then work with everything before the substring (which ends up in the first element of the returned list) and everything after it (which ends up in the second element). |
Chapter 4. The Power Of Introspection
![]() |
|
The only thing you need to do to call a function is specify a value (somehow) for each required argument; the manner and order in which you do that is up to you. |
![]() |
|
Python comes with excellent reference manuals, which you should peruse thoroughly to learn all the modules Python has to offer. But unlike most languages, where you would find yourself referring back to the manuals or man pages to remind yourself how to use these modules, Python is largely self-documenting. |
![]() |
|
lambda functions are a matter of style. Using them is never required; anywhere you could use them, you could define a separate normal function and use that instead. I use them in places where I want to encapsulate specific, non-reusable code without littering my code with a lot of little one-line functions. |
![]() |
|
In SQL, you must use IS NULL instead of = NULL to compare a null value. In Python, you can use either == None or is None, but is None is faster. |
Chapter 5. Objects and Object-Orientation
![]() |
|
from module import * in Python is like use module in Perl; import module in Python is like require module in Perl. |
![]() |
|
from module import * in Python is like import module.* in Java; import module in Python is like import module in Java. |
![]() |
|
Use from module import * sparingly, because it makes it difficult to determine where a particular function or attribute came from, and that makes debugging and refactoring more difficult. |
![]() |
|
The pass statement in Python is like an empty set of braces ({}) in Java or C. |
![]() |
|
In Python, the ancestor of a class is simply listed in parentheses immediately after the class name. There is no special keyword like extends in Java. |
![]() |
|
By convention, the first argument of any Python class method (the reference to the current instance) is called self. This argument fills the role of the reserved word this in C++ or Java, but self is not a reserved word in Python, merely a naming convention. Nonetheless, please don't call it anything but self; this is a very strong convention. |
![]() |
|
__init__ methods are optional, but when you define one, you must remember to explicitly call the ancestor's __init__ method (if it defines one). This is more generally true: whenever a descendant wants to extend the behavior of the ancestor, the descendant method must explicitly call the ancestor method at the proper time, with the proper arguments. |
![]() |
|
In Python, simply call a class as if it were a function to create a new instance of the class. There is no explicit new operator like C++ or Java. |
![]() |
|
In the ActivePython IDE on Windows, you can quickly open any module in your library path by selecting -> (Ctrl-L). |
![]() |
|
Java and Powerbuilder support function overloading by argument list, i.e. one class can have multiple methods with the same name but a different number of arguments, or arguments of different types. Other languages (most notably PL/SQL) even support function overloading by argument name; i.e. one class can have multiple methods with the same name and the same number of arguments of the same type but different argument names. Python supports neither of these; it has no form of function overloading whatsoever. Methods are defined solely by their name, and there can be only one method per class with a given name. So if a descendant class has an __init__ method, it always overrides the ancestor __init__ method, even if the descendant defines it with a different argument list. And the same rule applies to any other method. |
![]() |
|
Guido, the original author of Python, explains method overriding this way: "Derived classes may override methods of their base classes. Because methods have no special privileges when calling other methods of the same object, a method of a base class that calls another method defined in the same base class, may in fact end up calling a method of a derived class that overrides it. (For C++ programmers: all methods in Python are effectively virtual.)" If that doesn't make sense to you (it confuses the hell out of me), feel free to ignore it. I just thought I'd pass it along. |
![]() |
|
Always assign an initial value to all of an instance's data attributes in the __init__ method. It will save you hours of debugging later, tracking down AttributeError exceptions because you're referencing uninitialized (and therefore non-existent) attributes. |
![]() |
|
In versions of Python prior to 2.2, you could not directly subclass built-in datatypes like strings, lists, and dictionaries. To compensate for this, Python comes with wrapper classes that mimic the behavior of these built-in datatypes: UserString, UserList, and UserDict. Using a combination of normal and special methods, the UserDict class does an excellent imitation of a dictionary. In Python 2.2 and later, you can inherit classes directly from built-in datatypes like dict. An example of this is given in the examples that come with this book, in fileinfo_fromdict.py. |
![]() |
|
When accessing data attributes within a class, you need to qualify the attribute name: self.attribute. When calling other methods within a class, you need to qualify the method name: self.method. |
![]() |
|
In Java, you determine whether two string variables reference the same physical memory location by using str1 == str2. This is called object identity, and it is written in Python as str1 is str2. To compare string values in Java, you would use str1.equals(str2); in Python, you would use str1 == str2. Java programmers who have been taught to believe that the world is a better place because == in Java compares by identity instead of by value may have a difficult time adjusting to Python's lack of such “gotchas”. |
![]() |
|
While other object-oriented languages only let you define the physical model of an object (“this object has a GetLength method”), Python's special class methods like __len__ allow you to define the logical model of an object (“this object has a length”). |
![]() |
|
In Java, both static variables (called class attributes in Python) and instance variables (called data attributes in Python) are defined immediately after the class definition (one with the static keyword, one without). In Python, only class attributes can be defined here; data attributes are defined in the __init__ method. |
![]() |
|
There are no constants in Python. Everything can be changed if you try hard enough. This fits with one of the core principles of Python: bad behavior should be discouraged but not banned. If you really want to change the value of None, you can do it, but don't come running to me when your code is impossible to debug. |
![]() |
|
In Python, all special methods (like __setitem__) and built-in attributes (like __doc__) follow a standard naming convention: they both start with and end with two underscores. Don't name your own methods and attributes this way, because it will only confuse you (and others) later. |
Chapter 6. Exceptions and File Handling
![]() |
|
Python uses try...except to handle exceptions and raise to generate them. Java and C++ use try...catch to handle exceptions, and throw to generate them. |
![]() |
|
Whenever possible, you should use the functions in os and os.path for file, directory, and path manipulations. These modules are wrappers for platform-specific modules, so functions like os.path.split work on UNIX, Windows, Mac OS, and any other platform supported by Python. |
Chapter 7. Regular Expressions
![]() |
|
There is no way to programmatically determine that two regular expressions are equivalent. The best you can do is write a lot of test cases to make sure they behave the same way on all relevant inputs. You'll talk more about writing test cases later in this book. |
![]() |
|
Python 2.0 had a bug where SGMLParser would not recognize declarations at all (handle_decl would never be called), which meant that DOCTYPEs were silently ignored. This is fixed in Python 2.1. |
![]() |
|
In the ActivePython IDE on Windows, you can specify command line arguments in the “Run script” dialog. Separate multiple arguments with spaces. |
![]() |
|
The HTML specification requires that all non-HTML (like client-side JavaScript) must be enclosed in HTML comments, but not all web pages do this properly (and all modern web browsers are forgiving if they don't). BaseHTMLProcessor is not forgiving; if script is improperly embedded, it will be parsed as if it were HTML. For instance, if the script contains less-than and equals signs, SGMLParser may incorrectly think that it has found tags and attributes. SGMLParser always converts tags and attribute names to lowercase, which may break the script, and BaseHTMLProcessor always encloses attribute values in double quotes (even if the original HTML document used single quotes or no quotes), which will certainly break the script. Always protect your client-side script within HTML comments. |
![]() |
|
Python 2.2 introduced a subtle but important change that affects the namespace search order: nested scopes. In versions of Python prior to 2.2, when you reference a variable within a nested function or lambda function, Python will search for that variable in the current (nested or lambda) function's namespace, then in the module's namespace. Python 2.2 will search for the variable in the current (nested or lambda) function's namespace, then in the parent function's namespace, then in the module's namespace. Python 2.1 can work either way; by default, it works like Python 2.0, but you can add the following line of code at the top of your module to make your module work like Python 2.2:from __future__ import nested_scopes |
![]() |
|
Using the locals and globals functions, you can get the value of arbitrary variables dynamically, providing the variable name as a string. This mirrors the functionality of the getattr function, which allows you to access arbitrary functions dynamically by providing the function name as a string. |
![]() |
|
Using dictionary-based string formatting with locals is a convenient way of making complex string formatting expressions more readable, but it comes with a price. There is a slight performance hit in making the call to locals, since locals builds a copy of the local namespace. |
![]() |
|
A package is a directory with the special __init__.py file in it. The __init__.py file defines the attributes and methods of the package. It doesn't need to define anything; it can just be an empty file, but it has to exist. But if __init__.py doesn't exist, the directory is just a directory, not a package, and it can't be imported or contain modules or nested packages. |
![]() |
|
This section may be a little confusing, because of some overlapping terminology. Elements in an XML document have attributes, and Python objects also have attributes. When you parse an XML document, you get a bunch of Python objects that represent all the pieces of the XML document, and some of these Python objects represent attributes of the XML elements. But the (Python) objects that represent the (XML) attributes also have (Python) attributes, which are used to access various parts of the (XML) attribute that the object represents. I told you it was confusing. I am open to suggestions on how to distinguish these more clearly. |
![]() |
|
Like a dictionary, attributes of an XML element have no ordering. Attributes may happen to be listed in a certain order in the original XML document, and the Attr objects may happen to be listed in a certain order when the XML document is parsed into Python objects, but these orders are arbitrary and should carry no special meaning. You should always access individual attributes by name, like the keys of a dictionary. |
Chapter 10. Scripts and Streams
![]() |
|
In these examples, the HTTP server has supported both Last-Modified and ETag headers, but not all servers do. As a web services client, you should be prepared to support both, but you must code defensively in case a server only supports one or the other, or neither. |
![]() |
|
unittest is included with Python 2.1 and later. Python 2.0 users can download it from pyunit.sourceforge.net. |
Chapter 14. Test-First Programming
![]() |
|
The most important thing that comprehensive unit testing can tell you is when to stop coding. When all the unit tests for a function pass, stop coding the function. When all the unit tests for an entire module pass, stop coding the module. |
![]() |
|
When all of your tests pass, stop coding. |
![]() |
|
Whenever you are going to use a regular expression more than once, you should compile it to get a pattern object, then call the methods on the pattern object directly. |
Chapter 16. Functional Programming
![]() |
|
The pathnames and filenames you pass to os.path.abspath do not need to exist. |
![]() |
|
os.path.abspath not only constructs full path names, it also normalizes them. That means that if you are in the /usr/ directory, os.path.abspath('bin/../local/bin') will return /usr/local/bin. It normalizes the path by making it as simple as possible. If you just want to normalize a pathname like this without turning it into a full pathname, use os.path.normpath instead. |
![]() |
|
Like the other functions in the os and os.path modules, os.path.abspath is cross-platform. Your results will look slightly different than my examples if you're running on Windows (which uses backslash as a path separator) or Mac OS (which uses colons), but they'll still work. That's the whole point of the os module. |
Chapter 18. Performance Tuning
![]() |
|
You can use the timeit module on the command line to test an existing Python program, without modifying the code. See http://docs.python.org/lib/node396.html for documentation on the command-line flags. |
![]() |
|
The timeit module only works if you already know what piece of code you need to optimize. If you have a larger Python program and don't know where your performance problems are, check out the hotshot module. |
Chapter 2. Your First Python Program
Chapter 4. The Power Of Introspection
Chapter 5. Objects and Object-Orientation
Chapter 6. Exceptions and File Handling
Chapter 7. Regular Expressions
Chapter 10. Scripts and Streams
Chapter 14. Test-First Programming
Chapter 16. Functional Programming
Revision History | |
---|---|
Revision 5.4 | 2004-05-20 |
|
|
Revision 5.3 | 2004-05-12 |
|
|
Revision 5.2 | 2004-05-09 |
|
|
Revision 5.1 | 2004-05-05 |
|
|
Revision 5.0 | 2004-04-16 |
|
|
Revision 4.9 | 2004-03-25 |
|
|
Revision 4.8 | 2004-03-25 |
|
|
Revision 4.7 | 2004-03-21 |
|
|
Revision 4.6 | 2004-03-14 |
|
|
Revision 4.5 | 2004-03-07 |
|
|
Revision 4.4 | 2003-10-08 |
|
|
Revision 4.3 | 2003-09-28 |
|
|
Revision 4.2.1 | 2003-09-17 |
|
|
Revision 4.2 | 2003-09-12 |
|
|
Revision 4.1 | 2002-07-28 |
|
|
Revision 4.0-2 | 2002-04-26 |
|
|
Revision 4.0 | 2002-04-19 |
|
|
Revision 3.9 | 2002-01-01 |
|
|
Revision 3.8 | 2001-11-18 |
|
|
Revision 3.7 | 2001-09-30 |
|
|
Revision 3.6.4 | 2001-09-06 |
|
|
Revision 3.6.3 | 2001-09-04 |
|
|
Revision 3.6.2 | 2001-08-31 |
|
|
Revision 3.6 | 2001-08-31 |
|
|
Revision 3.5 | 2001-06-26 |
|
|
Revision 3.4 | 2001-05-31 |
|
|
Revision 3.3 | 2001-05-24 |
|
|
Revision 3.2 | 2001-05-03 |
|
|
Revision 3.1 | 2001-04-18 |
|
|
Revision 3.0 | 2001-04-16 |
|
|
Revision 2.9 | 2001-04-13 |
|
|
Revision 2.8 | 2001-03-26 |
|
|
Revision 2.7 | 2001-03-16 |
|
|
Revision 2.6 | 2001-02-28 |
|
|
Revision 2.5 | 2001-02-23 |
|
|
Revision 2.4.1 | 2001-02-12 |
|
|
Revision 2.4 | 2001-02-12 |
|
|
Revision 2.3 | 2001-02-09 |
|
|
Revision 2.2 | 2001-02-02 |
|
|
Revision 2.1 | 2001-02-01 |
|
|
Revision 2.0 | 2001-01-31 |
|
|
Revision 1.9 | 2001-01-15 |
|
|
Revision 1.8 | 2001-01-12 |
|
|
Revision 1.71 | 2001-01-03 |
|
|
Revision 1.7 | 2001-01-02 |
|
|
Revision 1.6 | 2000-12-11 |
Revision 1.5 | 2000-11-22 |
|
|
Revision 1.4 | 2000-11-14 |
|
|
Revision 1.3 | 2000-11-09 |
|
|
Revision 1.2 | 2000-11-06 |
|
|
Revision 1.1 | 2000-10-31 |
|
|
Revision 1.0 | 2000-10-30 |
|
This book was written in DocBook XML using Emacs, and converted to HTML using the SAXON XSLT processor from Michael Kay with a customized version of Norman Walsh's XSL stylesheets. From there, it was converted to PDF using HTMLDoc, and to plain text using w3m. Program listings and examples were colorized using an updated version of Just van Rossum's pyfontify.py, which is included in the example scripts.
If you're interested in learning more about DocBook for technical writing, you can download the XML source and the build scripts, which include the customized XSL stylesheets used to create all the different formats of the book. You should also read the canonical book, DocBook: The Definitive Guide. If you're going to do any serious writing in DocBook, I would recommend subscribing to the DocBook mailing lists.
Version 1.1, March 2000
Copyright (C) 2000 Free Software Foundation, Inc. 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.
The purpose of this License is to make a manual, textbook, or other written document "free" in the sense of freedom: to assure everyone the effective freedom to copy and redistribute it, with or without modifying it, either commercially or noncommercially. Secondarily, this License preserves for the author and publisher a way to get credit for their work, while not being considered responsible for modifications made by others.
This License is a kind of "copyleft", which means that derivative works of the document must themselves be free in the same sense. It complements the GNU General Public License, which is a copyleft license designed for free software.
We have designed this License in order to use it for manuals for free software, because free software needs free documentation: a free program should come with manuals providing the same freedoms that the software does. But this License is not limited to software manuals; it can be used for any textual work, regardless of subject matter or whether it is published as a printed book. We recommend this License principally for works whose purpose is instruction or reference.
This License applies to any manual or other work that contains a notice placed by the copyright holder saying it can be distributed under the terms of this License. The "Document", below, refers to any such manual or work. Any member of the public is a licensee, and is addressed as "you".
A "Modified Version" of the Document means any work containing the Document or a portion of it, either copied verbatim, or with modifications and/or translated into another language.
A "Secondary Section" is a named appendix or a front-matter section of the Document that deals exclusively with the relationship of the publishers or authors of the Document to the Document's overall subject (or to related matters) and contains nothing that could fall directly within that overall subject. (For example, if the Document is in part a textbook of mathematics, a Secondary Section may not explain any mathematics.) The relationship could be a matter of historical connection with the subject or with related matters, or of legal, commercial, philosophical, ethical or political position regarding them.
The "Invariant Sections" are certain Secondary Sections whose titles are designated, as being those of Invariant Sections, in the notice that says that the Document is released under this License.
The "Cover Texts" are certain short passages of text that are listed, as Front-Cover Texts or Back-Cover Texts, in the notice that says that the Document is released under this License.
A "Transparent" copy of the Document means a machine-readable copy, represented in a format whose specification is available to the general public, whose contents can be viewed and edited directly and straightforwardly with generic text editors or (for images composed of pixels) generic paint programs or (for drawings) some widely available drawing editor, and that is suitable for input to text formatters or for automatic translation to a variety of formats suitable for input to text formatters. A copy made in an otherwise Transparent file format whose markup has been designed to thwart or discourage subsequent modification by readers is not Transparent. A copy that is not "Transparent" is called "Opaque".
Examples of suitable formats for Transparent copies include plain ASCII without markup, Texinfo input format, LaTeX input format, SGML or XML using a publicly available DTD, and standard-conforming simple HTML designed for human modification. Opaque formats include PostScript, PDF, proprietary formats that can be read and edited only by proprietary word processors, SGML or XML for which the DTD and/or processing tools are not generally available, and the machine-generated HTML produced by some word processors for output purposes only.
The "Title Page" means, for a printed book, the title page itself, plus such following pages as are needed to hold, legibly, the material this License requires to appear in the title page. For works in formats which do not have any title page as such, "Title Page" means the text near the most prominent appearance of the work's title, preceding the beginning of the body of the text.
You may copy and distribute the Document in any medium, either commercially or noncommercially, provided that this License, the copyright notices, and the license notice saying this License applies to the Document are reproduced in all copies, and that you add no other conditions whatsoever to those of this License. You may not use technical measures to obstruct or control the reading or further copying of the copies you make or distribute. However, you may accept compensation in exchange for copies. If you distribute a large enough number of copies you must also follow the conditions in section 3.
You may also lend copies, under the same conditions stated above, and you may publicly display copies.
If you publish printed copies of the Document numbering more than 100, and the Document's license notice requires Cover Texts, you must enclose the copies in covers that carry, clearly and legibly, all these Cover Texts: Front-Cover Texts on the front cover, and Back-Cover Texts on the back cover. Both covers must also clearly and legibly identify you as the publisher of these copies. The front cover must present the full title with all words of the title equally prominent and visible. You may add other material on the covers in addition. Copying with changes limited to the covers, as long as they preserve the title of the Document and satisfy these conditions, can be treated as verbatim copying in other respects.
If the required texts for either cover are too voluminous to fit legibly, you should put the first ones listed (as many as fit reasonably) on the actual cover, and continue the rest onto adjacent pages.
If you publish or distribute Opaque copies of the Document numbering more than 100, you must either include a machine-readable Transparent copy along with each Opaque copy, or state in or with each Opaque copy a publicly-accessible computer-network location containing a complete Transparent copy of the Document, free of added material, which the general network-using public has access to download anonymously at no charge using public-standard network protocols. If you use the latter option, you must take reasonably prudent steps, when you begin distribution of Opaque copies in quantity, to ensure that this Transparent copy will remain thus accessible at the stated location until at least one year after the last time you distribute an Opaque copy (directly or through your agents or retailers) of that edition to the public.
It is requested, but not required, that you contact the authors of the Document well before redistributing any large number of copies, to give them a chance to provide you with an updated version of the Document.
You may copy and distribute a Modified Version of the Document under the conditions of sections 2 and 3 above, provided that you release the Modified Version under precisely this License, with the Modified Version filling the role of the Document, thus licensing distribution and modification of the Modified Version to whoever possesses a copy of it. In addition, you must do these things in the Modified Version:
If the Modified Version includes new front-matter sections or appendices that qualify as Secondary Sections and contain no material copied from the Document, you may at your option designate some or all of these sections as invariant. To do this, add their titles to the list of Invariant Sections in the Modified Version's license notice. These titles must be distinct from any other section titles.
You may add a section entitled "Endorsements", provided it contains nothing but endorsements of your Modified Version by various parties--for example, statements of peer review or that the text has been approved by an organization as the authoritative definition of a standard.
You may add a passage of up to five words as a Front-Cover Text, and a passage of up to 25 words as a Back-Cover Text, to the end of the list of Cover Texts in the Modified Version. Only one passage of Front-Cover Text and one of Back-Cover Text may be added by (or through arrangements made by) any one entity. If the Document already includes a cover text for the same cover, previously added by you or by arrangement made by the same entity you are acting on behalf of, you may not add another; but you may replace the old one, on explicit permission from the previous publisher that added the old one.
The author(s) and publisher(s) of the Document do not by this License give permission to use their names for publicity for or to assert or imply endorsement of any Modified Version.
You may combine the Document with other documents released under this License, under the terms defined in section 4 above for modified versions, provided that you include in the combination all of the Invariant Sections of all of the original documents, unmodified, and list them all as Invariant Sections of your combined work in its license notice.
The combined work need only contain one copy of this License, and multiple identical Invariant Sections may be replaced with a single copy. If there are multiple Invariant Sections with the same name but different contents, make the title of each such section unique by adding at the end of it, in parentheses, the name of the original author or publisher of that section if known, or else a unique number. Make the same adjustment to the section titles in the list of Invariant Sections in the license notice of the combined work.
In the combination, you must combine any sections entitled "History" in the various original documents, forming one section entitled "History"; likewise combine any sections entitled "Acknowledgements", and any sections entitled "Dedications". You must delete all sections entitled "Endorsements."
You may make a collection consisting of the Document and other documents released under this License, and replace the individual copies of this License in the various documents with a single copy that is included in the collection, provided that you follow the rules of this License for verbatim copying of each of the documents in all other respects.
You may extract a single document from such a collection, and distribute it individually under this License, provided you insert a copy of this License into the extracted document, and follow this License in all other respects regarding verbatim copying of that document.
A compilation of the Document or its derivatives with other separate and independent documents or works, in or on a volume of a storage or distribution medium, does not as a whole count as a Modified Version of the Document, provided no compilation copyright is claimed for the compilation. Such a compilation is called an "aggregate", and this License does not apply to the other self-contained works thus compiled with the Document, on account of their being thus compiled, if they are not themselves derivative works of the Document.
If the Cover Text requirement of section 3 is applicable to these copies of the Document, then if the Document is less than one quarter of the entire aggregate, the Document's Cover Texts may be placed on covers that surround only the Document within the aggregate. Otherwise they must appear on covers around the whole aggregate.
Translation is considered a kind of modification, so you may distribute translations of the Document under the terms of section 4. Replacing Invariant Sections with translations requires special permission from their copyright holders, but you may include translations of some or all Invariant Sections in addition to the original versions of these Invariant Sections. You may include a translation of this License provided that you also include the original English version of this License. In case of a disagreement between the translation and the original English version of this License, the original English version will prevail.
You may not copy, modify, sublicense, or distribute the Document except as expressly provided for under this License. Any other attempt to copy, modify, sublicense or distribute the Document is void, and will automatically terminate your rights under this License. However, parties who have received copies, or rights, from you under this License will not have their licenses terminated so long as such parties remain in full compliance.
The Free Software Foundation may publish new, revised versions of the GNU Free Documentation License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. See http://www.gnu.org/copyleft/.
Each version of the License is given a distinguishing version number. If the Document specifies that a particular numbered version of this License "or any later version" applies to it, you have the option of following the terms and conditions either of that specified version or of any later version that has been published (not as a draft) by the Free Software Foundation. If the Document does not specify a version number of this License, you may choose any version ever published (not as a draft) by the Free Software Foundation.
To use this License in a document you have written, include a copy of the License in the document and put the following copyright and license notices just after the title page:
Copyright (c) YEAR YOUR NAME. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.1 or any later version published by the Free Software Foundation; with the Invariant Sections being LIST THEIR TITLES, with the Front-Cover Texts being LIST, and with the Back-Cover Texts being LIST. A copy of the license is included in the section entitled "GNU Free Documentation License".
If you have no Invariant Sections, write "with no Invariant Sections" instead of saying which ones are invariant. If you have no Front-Cover Texts, write "no Front-Cover Texts" instead of "Front-Cover Texts being LIST"; likewise for Back-Cover Texts.
If your document contains nontrivial examples of program code, we recommend releasing these examples in parallel under your choice of free software license, such as the GNU General Public License, to permit their use in free software.
Python was created in the early 1990s by Guido van Rossum at Stichting Mathematisch Centrum (CWI) in the Netherlands as a successor of a language called ABC. Guido is Python's principal author, although it includes many contributions from others. The last version released from CWI was Python 1.2. In 1995, Guido continued his work on Python at the Corporation for National Research Initiatives (CNRI) in Reston, Virginia where he released several versions of the software. Python 1.6 was the last of the versions released by CNRI. In 2000, Guido and the Python core development team moved to BeOpen.com to form the BeOpen PythonLabs team. Python 2.0 was the first and only release from BeOpen.com.
Following the release of Python 1.6, and after Guido van Rossum left CNRI to work with commercial software developers, it became clear that the ability to use Python with software available under the GNU Public License (GPL) was very desirable. CNRI and the Free Software Foundation (FSF) interacted to develop enabling wording changes to the Python license. Python 1.6.1 is essentially the same as Python 1.6, with a few minor bug fixes, and with a different license that enables later versions to be GPL-compatible. Python 2.1 is a derivative work of Python 1.6.1, as well as of Python 2.0.
After Python 2.0 was released by BeOpen.com, Guido van Rossum and the other PythonLabs developers joined Digital Creations. All intellectual property added from this point on, starting with Python 2.1 and its alpha and beta releases, is owned by the Python Software Foundation (PSF), a non-profit modeled after the Apache Software Foundation. See http://www.python.org/psf/ for more information about the PSF.
Thanks to the many outside volunteers who have worked under Guido's direction to make these releases possible.
Copyright (c) 1991 - 1995, Stichting Mathematisch Centrum Amsterdam, The Netherlands. All rights reserved.
Permission to use, copy, modify, and distribute this software and its documentation for any purpose and without fee is hereby granted, provided that the above copyright notice appear in all copies and that both that copyright notice and this permission notice appear in supporting documentation, and that the name of Stichting Mathematisch Centrum or CWI not be used in advertising or publicity pertaining to distribution of the software without specific, written prior permission.
STICHTING MATHEMATISCH CENTRUM DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS, IN NO EVENT SHALL STICHTING MATHEMATISCH CENTRUM BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.