Thursday, January 5, 2017

Introduction to Python presentation



In the June 2009 meeting of RSSC, I gave a class on "Introduction to Python".


HOWTO: freeze PyOpenGL programs with py2exe



I recently posted a request for help with freezing PyOpenGL programs with cx_Freeze. With help from Dan Helfman and Mike C Fletcher, I was able to freeze my PyOpenGL program with py2exe. I have not tried this with cx_Freeze, but it will probably work similarly. I'll describe here what I did to make it work, in case it helps someone out there.

In order to freeze PyOpenGL applications, you need to exclude the OpenGL library from the freeze process and add the library manually to the distributable directory. To make this work, you need to tell your application where to look for the OpenGL module. To do this, insert the current directory (.) at the front of the sys.path search list.


1) make sure you have py2exe installed

2) in your main source file, include the following at the very top (before all the other imports):



import sys
sys.path.insert( 0, '.' )



3) create a file called setup.py in your source folder with the following content:


from distutils.core import setup
import py2exe

setup(windows=['<yourprogramhere>.py'],
      options={"py2exe": {"includes": ["ctypes", "logging"],
                          "excludes": ["OpenGL"],
                         }
              }
      )

note: not sure about ctypes and logging. You might need more.


3) on the command line in your source folder:
python setup.py py2exe

This creates a folder called dist.

4)
Examine the output of py2exe. In my case, it put out something like:

The following modules appear to be missing
['OpenGL.GL', 'OpenGL.GLU', 'OpenGL.GLUT', 'OpenGL.arrays.ctypesarrays',
'OpenGL.arrays.ctypesparameters', 'OpenGL.arrays.ctypespointers',
'OpenGL.arrays.lists', 'OpenGL.arrays.nones', 'OpenGL.arrays.numbers',
'OpenGL.arrays.strings', 'OpenGL.platform.win32', 'OpenGL.raw.GL',
'dummy.Process', 'email.Generator', 'email.Iterators', 'numpy']


In your main Python source file, make sure you import most of the OpenGL-related ones. I omitted the odd ones like 'dummy.Process', 'email.Generator', 'email.Iterators' and 'numpy'; that didn't have an impact. In my case, I put the following imports under the "import sys" described in point 2. In your case, some other OpenGL-related modules might appear. You might want to import those too.

import OpenGL.platform.win32
import OpenGL.arrays.ctypesarrays
import OpenGL.arrays.ctypesparameters
import OpenGL.arrays.ctypespointers
import OpenGL.arrays.lists
import OpenGL.arrays.nones
import OpenGL.arrays.numbers
import OpenGL.arrays.strings
import OpenGL.platform.win32
import OpenGL.raw.GL
#import dummy.Process
#import email.Generator
#import email.Iterators
#import numpy
import OpenGL.GL
import OpenGL.GLU
import OpenGL.GLUT
from OpenGL.GL import *
from OpenGL.GLUT import *
from OpenGL.GLU import *


if you use numpy, you need the following too:


import OpenGL.arrays.numpymodule


5)
re-run py2exe (step 3)

6)
From C:\Python26\Lib\site-packages, copy the folder OpenGL into the newly created dist folder. You can delete the .pyc and .pyo files to reduce the footprint.
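Step 6 can also be scripted. The following is only a sketch, not something from the original recipe: the function name copy_package_to_dist is made up for illustration, and the paths in the usage note are assumptions you will have to adapt to your own installation.

```python
import os
import shutil

def copy_package_to_dist(package_dir, dist_dir):
    """Copy a package folder into dist_dir, leaving out .pyc and .pyo files."""
    name = os.path.basename(os.path.normpath(package_dir))
    target = os.path.join(dist_dir, name)
    if os.path.exists(target):
        shutil.rmtree(target)          # start from a clean copy when re-running
    shutil.copytree(package_dir, target,
                    ignore=shutil.ignore_patterns("*.pyc", "*.pyo"))
    return target
```

A call such as copy_package_to_dist(r"C:\Python26\Lib\site-packages\OpenGL", "dist") after running py2exe should then leave a dist\OpenGL folder without the compiled files.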



On a side note:
Pyglet programs seem to be less cumbersome to freeze. I was able to freeze them with cx_Freeze with only one minor modification in the Pyglet code (commenting out an offending import). According to Mike, this is because Pyglet does not support plugins, which makes the implementation simpler. So, in case you really can't make it work under PyOpenGL, take a look at Pyglet. I almost made the switch before I got it working as described above.

Special thanks to Mike and Dan.

Python 2.6 'DLL Hell'



I recently figured out a problem that came up with the latest versions of Python and cx-Freeze. I thought I'd post it here in case it is useful to someone.

The problem was that, when I switched to Python 2.6.x and cx-Freeze-4.0.1.win32-py2.6.msi, the executables that were produced ran perfectly on my PC but not any other PC. They did not print any useful information whatsoever.

Googling around, I learned that it had to do with Visual Studio being installed on my PC and not on the other PCs. The common solution that worked for other people was to put the redistributable manifest and DLLs from C:\Program Files\Microsoft Visual Studio 9.0\VC\redist\x86\Microsoft.VC90.CRT into the same directory as the generated EXE.

This didn't work for me. With the help of Anthony Tuininga, I finally tracked it down to the DLL version.

In the good ol' days, when your application was lacking the C runtime library DLL in its search path, it gave a clear warning.
e.g. "can not find msvcr71.dll" or something like that.
The only thing you needed to do was to go look for one (either on your PC or on the internet), and pluck whatever DLL with that name into the search path, typically in the same directory as your exe or windows system32. It didn't care about different versions of DLL's.


Now with the advent of at least Visual Studio 2008 (and probably 2005, but I skipped that one), Microsoft, in its infinite wisdom, decided that that method was way too easy for everyone and came up with yet another way to torment everyone who dares to develop software in anything other than Microsoft monstrosities. I am referring to these mysterious things with names like "side by side installation", "SxS", "manifests", "assemblies", ...

Why does the noble community of enlightened scholars developing Python care at all? Because Python 2.6 binaries for Windows are now compiled using Visual Studio 2008. In essence, that should not mean more than replacing your msvcr71.dll with msvcr90.dll. Unfortunately, you inherit this "assemblies" crap with it.

The new terminology invented with this and the MSDN articles on the subject only make it more obscure and complicated than it should be. In a nutshell, DLLs now carry a version number. Executables generated with VS2008 are restricted to run only with a predefined, specific version of a DLL. This allows you to run executable A with version X of a DLL and another exe with version Y of that same DLL without being affected by obscure bugs caused by pairing the wrong version of a DLL with an exe (the infamous DLL hell).

How do you pair an exe with a particular DLL version? That is done with manifest files: basically a small XML file listing the exact version number and some obscure hash numbers for integrity checking. I haven't figured out how to produce these manually (VS2008 does this for you), but fortunately for us Python developers, we don't need to care. We just have to use the same one the Python distribution was compiled with.

How do I know which version Python is compiled with? The easiest way is to install Python with the option "install just for me". This has the result that the msvcrxx.dll used by Python is copied into the c:\pythonxx directory instead of a common windows\system32 folder (or something like that). This was my first mistake: I used the other option, "install for all users".

This should not be a problem. cx-Freeze is clever enough to go out and find this DLL for you. It will probably find the right one, PROVIDING you have the right version of the DLL's in your search path.


Now back to Microsoft. Up until around 9/16/2008, there was only one version of the msvcr DLLs, namely 9.0.21022.8, and that was good, because it happens to be the same version Python was built with. The only thing you had to do was to accompany your generated exe with a manifest with the correct hieroglyphs, copy this DLL into the same directory, and you were done.
(You actually need 2 manifest files: one that accompanies the exe and states which DLL version it needs, and another that specifies the version of the DLL. The exe manifest can be embedded in the exe.)

Poor old Anthony Tuininga had to figure this out the hard way. He makes our lives a lot easier by already embedding the correct exe manifest into the executable. You can check this by opening the generated binary with a hex editor and scrolling down until you see some XML. That is the manifest. It will tell you which DLLs it needs and which version (namely 9.0.21022.8). It needs to be this version because of the version of Visual Studio 2008 used to compile Python itself.
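If you would rather not scroll through a hex editor, the same check can be roughly scripted. This is a sketch under assumptions: it presumes the embedded manifest is stored as plain single-byte text (as it appears in the hex editor), and the function name find_assembly_dependencies is invented for this example.

```python
import re

# An <assemblyIdentity ...> element and the attributes inside it.
ASSEMBLY_RE = re.compile(br"<assemblyIdentity\b([^>]*)>", re.IGNORECASE)
ATTRIBUTE_RE = re.compile(br"""(\w+)\s*=\s*["']([^"']*)["']""")

def find_assembly_dependencies(data):
    """Return (name, version) pairs for every assemblyIdentity found in data."""
    found = []
    for element in ASSEMBLY_RE.finditer(data):
        attributes = dict(ATTRIBUTE_RE.findall(element.group(1)))
        name = attributes.get(b"name")
        if name:
            version = attributes.get(b"version", b"")
            found.append((name.decode("ascii", "replace"),
                          version.decode("ascii", "replace")))
    return found
```

Feeding it the bytes of a frozen executable, read with open(path, "rb").read(), should list Microsoft.VC90.CRT together with the version the exe asks for.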

Having Visual Studio 2008 installed on your PC while freezing your Python apps should not be a problem. Even if you didn't specify "install just for me", cx-Freeze will probably find msvcr90.dll somewhere.

But then came Visual Studio 2008 SP1. With it came an updated msvcr90.dll with version 9.00.30729.1. That is what I had installed and things went south from there.

Recent versions of Windows now come with the directory C:\WINDOWS\WinSxS, where you find all the versions of these DLLs (this is what they refer to as "side-by-side installation", i.e. instead of copying these common DLLs into the system32 folder, every version is now copied into a directory with a particular name containing the version number, and all manifest files are copied into C:\WINDOWS\WinSxS\Manifests).

When I installed VS 2008 SP1, it copied a later version (9.00.30729.1) of msvcr in there "side-by-side" with the original version (9.0.21022.8) which came from the original VS2008 and which I needed to get my frozen Python26 exe's to run. For some reason, Windows couldn't figure out which DLL to choose.

I followed the advice I found on the internet to find the c:\Program Files\Microsoft Visual Studio 9.0\VC\redist\x86\Microsoft.VC90.CRT folder and copy its content (being Microsoft.VC90.CRT.manifest and msvcr90.dll) into my bin directory. This didn't work. Unfortunately, I had already updated my Visual Studio 2008 to SP1, so this folder contained the wrong version (9.00.30729.1). Had I not updated to SP1, I would not have had this problem.

Not understanding the problem, it looked like I was forced to have all my users install the Microsoft Visual C++ 2008 Redistributable Package before running my tools.

Fortunately, this is not necessary. The solution is very simple:
after having generated your application with cx-Freeze, copy the following files (and only these files) to your executable directory:

C:\WINDOWS\WinSxS\Manifests\x86-Microsoft.VC90.CRT-1fc8b3b9a1e18e3b-9.0.21022.8-x-ww-d08d0375.manifest

C:\WINDOWS\WinSxS\x86-Microsoft.VC90.CRT-1fc8b3b9a1e18e3b-9.0.21022.8-x-ww-d08d0375\
msvcm90.dll
msvcp90.dll
msvcr90.dll

I think this applies to py2exe as well.

If you do not have these files, you can download these from the following link:
Microsoft Visual C++ 2008 Redistributable Package (x86) (11/29/2007)

note:
Beware, there is now an SP1 version of this, which carries the (incorrect) 9.00.30729.1 version. I am not sure it carries the (correct) 9.0.21022.8 as well. If it doesn't, you need the other installation. You can find SP1 from:
Microsoft Visual C++ 2008 SP1 Redistributable Package (x86)


note: I don't understand why Microsoft doesn't distribute the C runtime libraries (msvcrxx.dll) with the next Windows Update and squirt that puppy in every Windows PC on the planet.

Hope this helps.


Parsers and Regular Expressions



Here are links to the files you need for this article:

Scope


This is a small tutorial on writing parsers in Python using regular expressions. We'll cover parsers specific to processing timestamp-based log files, and stateful parsers for processing various reports. Both parser types are very common in testing embedded systems.


introduction


Regular expressions are a godsend when you have to process text-based files. In my world, this happens a lot in embedded system testing. They are also very handy in automating various software-development related tasks that involve processing text-based files (e.g. Makefiles, HEX files, project files, ...).

There are many ways of exchanging data between applications. It can be done via communication links (e.g. pipes), memory (shared memory) or files. The format of these files can be categorized as binary or text-based. Binary file handling is usually easier to implement in software, but more error-prone. Binary files are practically impossible for humans to read. When the content of these interchange files is written in text form (referred to as ASCII data), the developer can open them with any text editor and inspect their content. They are much easier to read. They are usually just as easy to generate as binary files. But they are much harder to parse back into the program, especially when these files can be modified by humans. Text-based files are much more prone to varying whitespace, punctuation, capitalization, newlines, page breaks, aliases for data elements, optional data elements and data representation formats. E.g. floating point values can be represented in many ways in text format (-0e45, 123.4, 0.0000003, ...).
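To make the point about floating point notation concrete, here is a small sketch (the names samples, values and is_float are only for this illustration). Python's built-in float() copes with all of the notations mentioned above, but raises a ValueError on anything it does not recognize, which is why text input needs checking.

```python
samples = ["-0e45", "123.4", "0.0000003", "1.5E-3", " +7 "]
values = [float(text) for text in samples]    # every one of these converts

def is_float(text):
    """Return True if float() can convert the text, False otherwise."""
    try:
        float(text)
        return True
    except ValueError:
        return False
```

A field such as "12,5" (decimal comma) or "1.2.3" makes is_float() return False, so a parser has to decide what to do with such values.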

Binary files are much harder (and sometimes even impossible) to reverse-engineer. You need a detailed file format description to implement a program that can parse a binary file. Not so with text-based files. When given a couple of example files, it is usually not that hard to reverse-engineer their format, just by inspecting the example files. 

Writing software from the ground up that can parse these textual input files and deal with all the possible phenomena is challenging and very error-prone. That is why a lot of software prefers to use binary file formats.

That is where regular expressions come in. They are an extremely handy tool that makes the life of the programmer much easier. With regular expressions, it even becomes fun to implement these parsers.

Regular expressions are strings in which you use some predefined symbols to describe the format of substrings that can occur in an input string (this string can come from a file, be typed in by the user, or be part of a stream of data coming in via a communications link). You provide this expression together with the input string to the API calls, and they will chop the input string up according to the regular expression. If all the symbols can be mapped to portions of the input string, it is said to match. You can then extract pieces of the string according to the symbols they match.
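As a first taste of what that looks like in Python, here is a minimal sketch using the standard re module. The pattern and the input strings are invented for illustration; the symbols themselves are explained further on.

```python
import re

# A name, an equals sign with optional blanks around it, and a number.
pattern = re.compile(r"(\w+)\s*=\s*(\d+)")

match = pattern.match("V1 = 500")
name = match.group(1)      # the piece matched by the first (...)
value = match.group(2)     # the piece matched by the second (...)

no_match = pattern.match("no assignment here")   # None: the symbols cannot be mapped
```

When the expression cannot be mapped onto the input, match() returns None instead of a match object, so the result should always be tested before extracting pieces.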

UNIX tools like AWK and Lex use regular expressions very heavily. The format used in Perl is also used in Python. The Python standard library has a module called re for processing Perl-like regular expressions. It is described really well in the Standard Python help files.


In this tutorial, we will demonstrate two different ways regular expressions can be used in the world of embedded system development. The first will demonstrate how timestamped log files can be parsed. The second example will parse the output report of a test program to calculate statistics.


parsing log files


In the embedded system world and also in networking, many timestamped log files are constantly generated, either as a normal function of the device, or as a result of testing the system by sniffing the traffic on communication links or by data acquisition. These logs almost always have a form where every event is represented by a single line. This line is usually prefixed by a representation of the absolute time the event occurred, followed by some fields that give more information about the event (e.g. a message ID, a module ID, a severity indicator, ...), followed finally by a textual description of the event. This description can be a standard string, or it can have some variable data elements embedded.

In the following example, the event occurred on March 10th of 2010, 11 seconds after 11 o'clock. The C2 can indicate a module. The INFO can denote the severity of the event. The value 012E can be a hexadecimal representation of the message ID. The rest of the line, "read A/D V1=500, A1=0.23", can be the textual description of the event. It has two variable data elements, "500" and "0.23".


2010-03-10 11:00:11.53  C2  INFO  012E  read A/D V1=500, A1=0.23

A simple parsing program consists of a loop in which every iteration reads a line, matches it against one or more regular expression templates and then executes the set of actions associated with the template that matched. In our example, we will write a little program that parses every line of a log file, looks for the lines with the format "read A/D V1=xxx, A1=yyy", extracts the values for xxx and yyy and writes them line-by-line to a CSV file (another text-based file). Here is a sample input file:

TESTDATA system ABC
2010-03-10 11:00:05.23  C1  INFO  ACB5  system startup
2010-03-10 11:00:05.24  C1  INFO  ACB6  checking internal configuration
2010-03-10 11:00:10.99  C1  INFO  DE52  system check passes


2010-03-10 11:00:11.53  C2  INFO  012E  read A/D V1=500, A1=0.23
2010-03-10 11:00:16.83  C2  INFO  012E  read A/D V1=510, A1=0.18
2010-03-10 11:00:17.01  C1  WARN  EEF3  reference voltage low
2010-03-10 11:00:17.02  B1  ERROR EE04  alarm input channel, code 005
2010-03-10 11:00:21.23  C2  INFO  012E  read A/D V1=520, A1=0.19
2010-03-10 11:00:26.24  C2  INFO  012E  read A/D V1=530, A1=0.15


Examine the Python program regex1.py (attached below):

We first import the standard Python module re

import re

A regular expression template (which is a string) needs to be parsed itself and broken down into a sequence of individual symbols. This could be done each and every time an input line is parsed, but that takes time. The Python re module allows you to 'compile' a regular expression template into a data structure that contains all it needs to know to parse input data. This is done with the function re.compile(). The data structure it generates is testdata_regex. The last parameter of re.compile() used here (re.IGNORECASE) instructs the parser to treat upper and lower case alphabetic characters as the same. In 90% of your applications, this is enough. The less strict you make your regular expression template, the more easily it will parse all the data, but the more prone you are to incorrect parsing. Finding the 'just-enough' regular expression is more a black art than a science. Just trust me for the time being that the incomprehensible string of hieroglyphs is an appropriate regular expression for the example given above. We will cover the regular expression template later in this tutorial. For now, we'll focus on the architecture of the parser.
Please note that I included an example string in a comment right above. You do not have to do this. It helps me to better understand the regular expression. I also inserted a comment line with the numbers 1 .. 7 right above the regular expression template. These numbers denote the groups (enclosed in parentheses) that can be extracted from a match object. More on that later.


# example test data line: 2010-03-10 11:00:11.53  C2  INFO  012E  read A/D V1=500, A1=0.23

#                                1          2             3       4       5                             6                    7
testdata_regex = re.compile(r"\s*([\d-]+)\s+([\d\:\.]+)\s+(\w+)\s+(\w+)\s+(\w+)\s+read\s*A/D\s+V1\s*=\s*(\d+)\s*,\s*A1\s*=\s*([\.\d\-\+]+)", re.IGNORECASE)


We open an input file to get the string data to parse. We also create a new CSV file to store the results.


fh1 = open("testdata.log", "r")         # open the input file for textual read
fh2 = open("testresults.csv", "w")      # create the output file for write
fh2.write("Voltage,Current\n")          # write the field titles to the output file


Repeat for every line in the file:

line = ""
eof = False
while not eof:                          # repeat until the last line of the input file is reached
    line = fh1.readline()               # read a single text line from the input file
    eof = (line == '')                  # was this the last line ?
    if not eof:


Pass the input string to the regular expression object and see if it matches:


        testdata_match = testdata_regex.match(line) # does this line match our regular expression template


If there is a match, i.e. every symbol in the regular expression could be matched to a section in the input string:


        if testdata_match:                          # does it match ?


In this example we extract the string sections that match the groups specified in the regular expression and simply print them to the screen. You can do whatever you want with them. Beware that at this stage, they are still in string form. Depending on how your regular expression was formulated, you might need to strip leading and trailing blanks and convert them to all upper or lower case. You might also want to convert them to numeric data (integer, float). Be careful when you convert them. They might contain characters that are not allowed and cause an exception to be thrown. E.g. a hexadecimal string will cause an exception when converted to int without specifying the base 16. Or the hexadecimal field could have the 0x prefix. Make sure you know all the ways these numeric fields can be depicted and pre-format your string data before converting them.
It is also possible that you want to submit your extracted field to other regular expression objects to parse it even further. You have the option to include these sub regular expressions in the top-level regular expression, or to describe the compound field with a simple one. This is common with date and time fields. A simple regular expression that extracts the whole date field might be: ([\d\-]+). This can extract a date in the format "2010-03-10". Once extracted, you can apply a more refined regular expression to it to extract the parts: (\d\d\d\d)-(\d\d)-(\d\d), which allows you to extract the year, month and day fields. In this example we explicitly specify that the year field must be 4 digits. The regular expression (\d+) is more tolerant of variable length year fields, which allows for 2- and 4-digit years. But it will also capture a string with 1, 3, 5 or more digits, which most likely is not a year field. Once extracted, you will have to further validate that the values have the right magnitude, range, etc. E.g. the year 1830 is a very unlikely year for a living person to be born in. This is all very application specific.


            # print the extracted fields 
            print "found match !"
            print "line:       ", testdata_match.group(0)
            print "date:       ", testdata_match.group(1)
            print "time:       ", testdata_match.group(2)
            print "module:     ", testdata_match.group(3)
            print "severity:   ", testdata_match.group(4)
            print "message id: ", testdata_match.group(5)
            print "voltage:    ", testdata_match.group(6)
            print "current:    ", testdata_match.group(7)
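The conversion pitfalls described above can be handled with a few small helpers. This is only a sketch: the names parse_hex and parse_date are made up here and are not part of regex1.py, and using datetime.date for the range check is my own choice for the example.

```python
import datetime
import re

DATE_REGEX = re.compile(r"(\d\d\d\d)-(\d\d)-(\d\d)")

def parse_hex(text):
    """Convert a hexadecimal field such as '012E' or '0x012e' to an int."""
    return int(text.strip(), 16)    # base 16 also accepts an optional 0x prefix

def parse_date(text):
    """Return a datetime.date, or None if the text is not a valid date."""
    match = DATE_REGEX.match(text.strip())
    if not match:
        return None
    year, month, day = [int(part) for part in match.groups()]
    try:
        return datetime.date(year, month, day)   # rejects month 13, day 40, ...
    except ValueError:
        return None
```

With these, parse_hex(testdata_match.group(5)) would give the message ID as a number, and parse_date(testdata_match.group(1)) a date object whose year can be range-checked further as your application requires.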


Here we give an example of what can be done with the information extracted. We write some fields to a CSV file, so that it can be read into Excel to show a graph.

            # write results to CSV file
            fh2.write(testdata_match.group(6))
            fh2.write(",")
            fh2.write(testdata_match.group(7))
            fh2.write("\n")

        else:
            print "no match: ", line

Make sure you close the opened files at the end of the program.

fh1.close()     # close the input file
fh2.close()     # close the output file



Regular expression templates


In the previous example, we used a fairly complex-looking regular expression to parse out a particular event type from the example log file. We skipped over explaining its guts.

Please take some time to study the help files on the re module in the Python Standard Library. It is in the online Python documentation or here.

Regular expressions are actually easy to write, but a bitch to read.  There are some useful tools out there that can help you to comprehend and construct regular expressions. The regular expression format for Python and Perl is the same, so any tool that can cover either language will do.

I like to use "The Regex Coach". Please download this (or another tool) for this tutorial.
If my following little tutorial on regular expressions confuses you more than it should, then do not despair. Regular expressions are widely covered in other tutorials. Here are a couple of them:

http://en.wikipedia.org/wiki/Regular_expression
http://www.zytrax.com/tech/web/regex.htm
http://www.amk.ca/python/howto/regex/
http://diveintopython.org/regular_expressions/index.html
http://docs.activestate.com/komodo/4.4/regex-intro.html


We'll work our way gradually up from the basics until the point you can construct and comprehend the regular expression used in the previous example. The above referenced help file will describe every regular expression symbol.
Here we will give examples of them. We will not go into the nitty-gritty legalistics of them. That is what the help files are for.

Start up the Regex Coach. You should see something like in the following screen shot:



The screen is divided in 3 areas. On the top is a big edit box for Regular expressions. Here you can type or copy/paste your candidate regular expressions.
In the middle is a big edit box for Target strings. Here you can type or copy/paste example strings you want to parse.
On the bottom is a multi-pane dialog. The first is called "Control". Disregard the others for now. This is a handy control for checking the groups inside the string.


In the target string field, type aaaa.
Now in the regular expression, type a single a. You will notice that the first a of the Target string is colored yellow. This means that your first regular expression (a) matches the first character.

Now add more a's to the regular expression. For every a you append, another a in the target string becomes yellow.

The regular expression of aaaa simply states that the target string needs to be aaaa. This is a very silly example. Regular expressions do much more.

Replace the regular expression aaaa with \w. This also highlights the first a. Now augment the regular expression to \w+. All of a sudden, this highlights the full target string.

The + indicates 'one or more'. Thus, \w+ means 'one alphanumeric character or more'.


Now type the following Target string (forget the quotes):  'aaaa    01234ABCalice in wonderland|'

The \w+ template still covers the first aaaa.

Now give the following regular expression: \w+\s+
The \s+ covers the blanks. \s denotes 'whitespace' characters like blanks, tabs, newlines, ...

Now expand the regular expression to \w+\s+\w+
The first 2 strings are highlighted. Now let's say the 2nd string is actually 2 consecutive fields that you want to capture separately. You have several options to achieve this. Which one you choose is a mix of what you prefer and what the format of that field actually is.

It is given here as 01234ABC. That could mean '5 alphanumeric characters - followed by 3 alphanumeric characters'. It could also be more restricted, to '5 numeric characters - followed by 3 alphabetic characters' or 'any number of numeric characters - followed by 3 characters which can only be A, B or C'.

Depending on how it is defined, you can come up with any of the following regular expressions:

\w+\s+\w{5}\w{3}        '5 alphanumeric characters - followed by 3 alphanumeric characters'
\w+\s+\d{5}[a-zA-Z]{3}  '5 numeric characters - followed by 3 alphabetic characters'
\w+\s+\d+[ABC]{3}       'any number of numeric characters - followed by 3 characters which can only be A, B or C'

Now to capture the last portion: it could be defined as simply 'the rest'. Add a .+ to the regular expression, making it: \w+\s+\d+[ABC]{3}.+ This highlights the rest of the string, including the last | character.

The definition could also be specified as: 'any character, until the |'. This can be achieved with the expression [^|]. The ^ means 'excluding' from the set. [^|] means 'anything but |'. So the whole regular expression becomes: \w+\s+\d+[ABC]{3}[^|]+

So we have learned:

\w denotes alphanumeric characters (and the underscore)
\d denotes numeric characters
[a-zA-Z] means 'any character between a and z or A and Z'
[^|] means 'anything but |'
a + indicates 'one or more of the preceding symbol'.
{5}  means 'exactly 5'. You could also specify an upper and lower range, as in {3,5} which would mean 'from 3 to 5'.
. means 'any character'
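If you do not have the Regex Coach at hand, the same experiments can be tried in the Python interpreter. A small sketch (the helper matched_text is made up for this illustration) that shows which part of the target string each expression covers:

```python
import re

target = "aaaa    01234ABCalice in wonderland|"

def matched_text(pattern):
    """Return the part of target that the pattern matches, or None."""
    match = re.match(pattern, target)
    return match.group(0) if match else None

word = matched_text(r"\w+")                           # one or more word characters
word_and_blanks = matched_text(r"\w+\s+")             # ... followed by whitespace
code = matched_text(r"\w+\s+\d{5}[a-zA-Z]{3}")        # 5 digits, then 3 letters
up_to_bar = matched_text(r"\w+\s+\d+[ABC]{3}[^|]+")   # stops in front of the |
whole_line = matched_text(r"\w+\s+\d+[ABC]{3}.+")     # . also matches the |
bounded = matched_text(r"\w{3,5}")                    # from 3 to 5 word characters
```

Printing these variables shows the same portions that the Regex Coach highlights in yellow.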


Now, we want to be able to extract these fields. This can be done by enclosing each subsection from the regular expression that you want to extract in parentheses, like:  (\w+)\s+(\d+)([ABC]{3})([^|]+)

You will notice that the first four radio buttons in the control pane below become selectable. Select any of them. You should notice that they result in different fields being highlighted in orange.

This is really nice to have. The numbers correspond with the index of the group. In this current form, group (4) corresponds with the sub string 'alice in wonderland'.

If you remove the parentheses of the first element, \w+\s+(\d+)([ABC]{3})([^|]+) then 'alice in wonderland' becomes linked with group (3).

These group numbers are important in the Python program. The member function group() of a match object returns the sub string for the group index you specify.
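In Python, the group numbers shown by the radio buttons map directly onto group(). A short sketch using the same target string as above:

```python
import re

target = "aaaa    01234ABCalice in wonderland|"

# Four groups: the numbering follows the opening parentheses, left to right.
match = re.match(r"(\w+)\s+(\d+)([ABC]{3})([^|]+)", target)
fields = match.groups()            # all groups at once, as a tuple
last_field = match.group(4)

# Without parentheses around the first element, the numbering shifts by one.
shifted = re.match(r"\w+\s+(\d+)([ABC]{3})([^|]+)", target)
same_field = shifted.group(3)
```

group(0) is always the complete matched text, so numbering of your own groups starts at 1.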


Now let's say, this sub string consists of 2 fields, the first one being only a single, optional digit, followed by anything else. The regular expression becomes \w+\s+(\d?)(\w+)([^|]+). We introduced the ?, which means 'none or one'. Now the first group corresponds with the character 0 and group 2 corresponds with the rest.


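Another possibility is that the first field can be either a's or b's. The | between 2 regular expressions means 'or'. Because | applies to everything on either side of it, the two alternatives must be enclosed in parentheses: (?:[a]+|[b]+)\s+(\d?)(\w+)([^|]+) The (?: ... ) form groups the alternatives without creating a numbered group. Without it, the expression would mean 'either [a]+ on its own, or [b]+ followed by all the rest'.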
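One thing to watch with | is how far it reaches: it separates everything to its left from everything to its right, unless parentheses limit it. The sketch below compares the two readings on the same target string; (?: ... ) is the standard way in the re module to group without creating a numbered group.

```python
import re

target = "aaaa    01234ABCalice in wonderland|"

# Without grouping: "[a]+" OR "[b]+\s+(\d?)(\w+)([^|]+)".
ungrouped = re.match(r"[a]+|[b]+\s+(\d?)(\w+)([^|]+)", target)
ungrouped_text = ungrouped.group(0)      # only the a's are matched
ungrouped_first = ungrouped.group(1)     # None: that alternative was not used

# With grouping: "([a]+ OR [b]+)" followed by the rest of the expression.
grouped = re.match(r"(?:[a]+|[b]+)\s+(\d?)(\w+)([^|]+)", target)
grouped_fields = grouped.groups()
```

In the ungrouped form the match stops after the a's and none of the numbered groups is filled in, which is rarely what was intended.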


Now, it could be that we are not even interested in the first part of the string. Omit the first \w+ to get \s+(\d?)(\w+)([^|]+)
Now the string after aaaa becomes highlighted. This demonstrates that regular expressions can match a sub string further inside a target string.
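A point to keep in mind when moving from the Regex Coach to Python: the match() method only looks for a match starting at the very beginning of the string, while search() scans forward through the string the way the Regex Coach does. A short sketch of the difference:

```python
import re

target = "aaaa    01234ABCalice in wonderland|"
pattern = re.compile(r"\s+(\d?)(\w+)([^|]+)")

at_start = pattern.match(target)     # None: the string does not start with whitespace
anywhere = pattern.search(target)    # finds the first place where the pattern fits

start_position = anywhere.start()    # index of the first matched character
fields = anywhere.groups()
```

The log file parser earlier in this tutorial uses match(), which is why its expression describes the line from the first character on.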



This did not cover every possibility in regular expressions, but it does cover most of what you will ever need. With this, you should be able to understand the more esoteric aspects like greedy matching and (?...) style expressions.


the regular expression in regex1.py

Now finally, we should have enough background to understand the 'hieroglyphs' we used in our example above. Copy/paste the following line in the Target string box:

2010-03-10 11:00:16.83  C2  INFO  012E  read A/D V1=510, A1=0.18

And copy/paste the following into the Regular expression box:

\s*([\d-]+)\s+([\d\:\.]+)\s+(\w+)\s+(\w+)\s+(\w+)\s+read\s*A/D\s+V1\s*=\s*(\d+)\s*,\s*A1\s*=\s*([\.\d\-\+]+)

Here is the breakdown:

\s*
I usually prefix all my regular expressions with \s* to allow for any leading blanks.

([\d-]+)
We group the following section (parentheses), which allows us to extract the date. Here we assume dates can have digits and dashes (e.g. 2010-03-10). The regular expression is very flexible. It will also match any string that has any combination of these (e.g. "1---234-5", or "4345", ...). This is usually sufficient, provided you trust your input data to be well-behaved. If you want to be more rigid in your checking, you might want to specify something like: (\d\d\d\d)\-(\d\d)\-(\d\d) or (\d{4})-(\d{2})-(\d{2}). I hope you get the picture that there are many possible solutions. Pick one that is good enough.

\s+
In these log files, the date is always followed by at least one space character. If you know exactly how many, you can specify the exact number, but that is usually over-restrictive, and you might come across occurrences where there happens to be one extra. Even if the system does not generate the extra space, since these are human-readable log files, some people have the tendency to add whitespace to these log files as they read them, to improve legibility or to highlight certain events of interest. Your parser might be presented with such an 'annotated' log file instead of a 'virgin' log file. In my experience with parsing log files that are generated by embedded systems, simply specifying that "at least one space" can follow almost always does the job.
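The difference is easy to see in a quick experiment. This is only a sketch, assuming Python 3. The two patterns and the hand-edited sample string are invented for the illustration.

```python
import re

# exactly one space between date and time, versus "at least one"
exact_regex    = re.compile(r"([\d-]+) ([\d:.]+)")
tolerant_regex = re.compile(r"([\d-]+)\s+([\d:.]+)")

original  = "2010-03-10 11:00:16.83"
annotated = "2010-03-10    11:00:16.83"   # someone added spaces by hand

exact_hits    = [bool(exact_regex.match(s)) for s in (original, annotated)]
tolerant_hits = [bool(tolerant_regex.match(s)) for s in (original, annotated)]

print(exact_hits)     # [True, False]
print(tolerant_hits)  # [True, True]
```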

([\d\:\.]+)\s+
Here we specify that the following field (timestamp) can consist of digits, colons and dots. 

(\w+)\s+
matches the substring C2. No need to make it any stricter.

(\w+)\s+
matches the substring INFO.

(\w+)\s+
This is the regular expression for a field that can be interpreted as the event code (012E). In this example we show it with digits and alphabetic characters, as in hexadecimal codes. The \w classifier is very broad. You could narrow it down to something like [0-9A-F]+ or [0-9A-Fa-f]+.
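If you do narrow the field down to hexadecimal digits, converting it to a number is one extra step. The sketch below assumes Python 3. HEX_CODE and parse_event_code are hypothetical names, not something from the scripts in this tutorial.

```python
import re

HEX_CODE = re.compile(r"[0-9A-Fa-f]+")   # hypothetical name

def parse_event_code(field):
    """Return the event code as an integer, or None if it is not hexadecimal."""
    if HEX_CODE.fullmatch(field):
        return int(field, 16)
    return None

print(parse_event_code("012E"))   # 302
print(parse_event_code("INFO"))   # None
```

Keep in mind that a field like C2 is also valid hexadecimal, so the narrower class does not tell the fields apart on its own. Their position in the line still does that.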


read\s*A/D\s+V1\s*=\s*(\d+)\s*,\s*A1\s*=\s*([\.\d\-\+]+)

This section allows us to match strings that have the format "read A/D V1=xxx, A1=yyy", where xxx can be any sequence of at least one digit and yyy can be digits, minus, plus and dot characters (as usual in floating point values).
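Putting the pieces of the breakdown together, the whole expression can be tried against the sample line in a few lines of Python. This is a minimal sketch, assuming Python 3. It is not the regex1.py listing itself, and the variable names (unit, level, event, ...) are my own labels for the groups.

```python
import re

log_regex = re.compile(
    r"\s*([\d-]+)\s+([\d\:\.]+)\s+(\w+)\s+(\w+)\s+(\w+)\s+"
    r"read\s*A/D\s+V1\s*=\s*(\d+)\s*,\s*A1\s*=\s*([\.\d\-\+]+)"
)

line = "2010-03-10 11:00:16.83  C2  INFO  012E  read A/D V1=510, A1=0.18"
match = log_regex.match(line)

# one variable per pair of parentheses, in order of appearance
date, time, unit, level, event, v1, a1 = match.groups()
v1 = int(v1)      # captured groups are always strings
a1 = float(a1)

print(date, time, unit, level, event, v1, a1)
```

Note that the last group accepts any mix of digits, dots and signs, so float() can still raise a ValueError on malformed input such as "1.2.3".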


Please note that I replaced all spaces with \s+ and that I inserted \s* wherever 2 fields followed each other but conceivably could have one or more spaces inserted. This is just defensive programming based on experience.
I suggest you adopt this practice. It makes the regular expressions a little uglier, but it is worth it for the robustness.


Notice that we did not use a lot of fancy regular expression constructs like $, ?, {m,n}, (?=...). These are rarely needed. Remember, make your regular expression good enough, but avoid overdoing it. The more specific a regular expression is, the less flexible it is in dealing with unanticipated formats.


parsing reports


To finalize this tutorial, we'll describe a more complex example of a parser, one that requires some form of state machine to help track the location in the input file.

Here is an example of a textual report that can be parsed with regular expressions. It contains a header, and then repeats a table with test results twice. Every table is preceded by a timestamp and a short description of the platform it ran on. If you simply match every line with a regular expression, you will not be able to distinguish between the test data for the Windows platform and the Linux platform.


===========================
= TEST REPORT
===========================

timestamp:   2010-03-10 10:20am 
platform:    Windows

TIMESTAMP              | TESTCASE | STEP | EXPECTED | OBTAINED |  RESULT
-----------------------+----------+------+----------+----------+---------
2010-03-10 11:00:05.23 | TC01     | 001  | 1        | 1        | PASS
2010-03-10 12:10:20.23 | TC01     | 002  | 123.4    | 123.5    | PASS
2010-04-01 05:00:05.23 | TC02     | 001  | AA       | a        | FAIL
-----------------------+----------+------+----------+----------+---------

timestamp:  2010-03-10 11:08pm 
platform:   Linux

TIMESTAMP              | TESTCASE | STEP | EXPECTED | OBTAINED |  RESULT
-----------------------+----------+------+----------+----------+---------
2010-03-10 11:00:05.23 | TC01     | 001  | 1        | 1        | PASS
2010-03-10 12:10:20.23 | TC01     | 002  | 123.4    | 123.5    | PASS
2010-04-01 05:00:05.23 | TC02     | 001  | AA       | aa       | PASS
-----------------------+----------+------+----------+----------+---------


You need to distinguish between the data from the first table and the 2nd table. This can be done with a state machine. This technique consists of finding a 'trigger', some condition that can tell you when you are in the 2nd table.

In the following example, we use the table header 'TIMESTAMP              | TESTCASE | STEP | EXPECTED | OBTAINED |  RESULT' as a trigger that we are about to enter a table. From this example, it is a bit tricky to find out when we have left the table. The table is terminated by a line of dashes, but it is the same as the line starting the table.
The only thing you can do is what a human would do: count the occurrences of the dashed lines. If you see the first, then you have entered the table. If you see the 2nd occurrence, you have left the table. There are other ways too. It all depends on how certain you are of what this parser can expect.

When the cursor leaves the table, the extracted results can be collated. This example also shows how to deal with multiple regular expressions:


#regex2.py

import re
import collections      # utility module

# we can have multiple regular expression templates

# raw strings (r"...") keep Python from interpreting the backslashes itself
date_regex        = re.compile(r"\s*timestamp\s*\:\s*(.+)", re.IGNORECASE)
platform_regex    = re.compile(r"\s*platform\s*\:\s*(.+)", re.IGNORECASE)
hdr_regex         = re.compile(r"\s*TIMESTAMP\s+\|\s+TESTCASE\s*\|", re.IGNORECASE)
divider_regex     = re.compile(r"\s*------------", re.IGNORECASE)

# example line: 2010-03-10 11:00:05.23 | TC01     | 001  | 1        | 1        | PASS
# groups:       1          2             3          4      5          6          7
test_result_regex = re.compile(r"\s*([\d-]+)\s+([\d\:\.]+)\s+\|\s*([^|]+)\|\s*([^|]+)\|\s*([^|]+)\|\s*([^|]+)\|\s*(.+)", re.IGNORECASE)

# define a named tuple for storage of the intermediate results
TestResult = collections.namedtuple('TestResult', 'timestamp result')

fh1            = open("testreport.txt", "r")        # open the input file
line           = ""
eof            = False
in_table       = False
divider_cnt    = 0
total_pass     = "PASS"
platform       = ""
test_timestamp = ""
test_results   = {}

while not eof:
    line = fh1.readline()
    eof = (line == '')
    if not eof:


Here we match the input line against every regular expression. It is possible to get multiple matches. The order in which the matches are processed further down determines which action is taken.

There is some execution time penalty associated with every match. You could optimize by selectively matching under certain conditions. But that is very application-specific. The example below is very generic. 

        # compare the input line with all the possible regular expression templates
        date_match        = date_regex.match(line)
        platform_match    = platform_regex.match(line)
        hdr_match         = hdr_regex.match(line)
        divider_match     = divider_regex.match(line)
        test_result_match = test_result_regex.match(line)


At this stage, you could have none, one or more matches. Most common is a mutually exclusive handling, i.e. only handle one match-action at a time, ignoring other potential matches. In this case, it is important to get the order of the if-statements right. Not every match is used to extract fields. A match could be used as a trigger to indicate that certain portions of the input have been reached (e.g. divider_match below). The actions associated with this match can change state variables that control the access to other match actions.


        # execute the section of code that is associated with the matched regular expression
        if date_match:
            test_timestamp = date_match.group(1).strip()


It is usually a good idea to strip leading and trailing spaces from extracted fields.

            
        elif platform_match:
            platform = platform_match.group(1).strip()


Whenever a header occurs, a flag is set to indicate the cursor is in a table. We set a global variable to "PASS". Whenever at least one "FAIL" is found, the global result is set to FAIL, regardless of subsequent matches with "PASS".

            
        elif hdr_match:
            # whenever the table header is found, a flag is set to indicate the cursor
            # is inside the table.
            in_table = True
            divider_cnt = 0
            # we set a global variable to PASS. We then try to find a FAIL. At least
            # one occurrence of a FAIL will result in a FAIL for the whole platform.
            total_pass = "PASS"


Whenever the 2nd divider is found, the test results are collated for this table.

        elif divider_match:
            divider_cnt += 1
            # when we find the 2nd occurrence of a divider, we know we are just past a table
            if divider_cnt == 2:
                # reset the counter and in_table flag
                divider_cnt = 0
                in_table = False
                # record the result found for this platform
                test_results[platform] = TestResult(test_timestamp, total_pass)
                
        elif in_table and test_result_match:
            # whenever a FAIL is found, this will result in a FAIL for the whole platform.
            if test_result_match.group(7).upper().strip() == "FAIL":
                total_pass = "FAIL"

fh1.close()

# print an overview of the parsed data
template = "%-10s  %-18s  %s"
print(template % ("PLATFORM", "TIME", "RESULT"))
for platform in test_results:
    print(template % (platform, test_results[platform].timestamp, test_results[platform].result))
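For comparison, the same state machine can be condensed into a function that works on any list of lines, which makes it easy to try without a file on disk. This is only a sketch, assuming Python 3. The names summarize, HEADER, DIVIDER, ROW and FIELD are hypothetical, and the embedded report is a shortened copy of the one above. Unlike the generic listing, each regular expression is only tried when the earlier ones did not match, which is one way to reduce the matching overhead mentioned earlier.

```python
import re

HEADER  = re.compile(r"\s*TIMESTAMP\s+\|\s+TESTCASE\s*\|", re.IGNORECASE)
DIVIDER = re.compile(r"\s*------------")
ROW     = re.compile(r"\s*[\d-]+\s+[\d:.]+\s+\|.*\|\s*(\w+)")
FIELD   = re.compile(r"\s*(timestamp|platform)\s*:\s*(.+)", re.IGNORECASE)

def summarize(lines):
    """Return {platform: (timestamp, "PASS" or "FAIL")} for a report."""
    results = {}
    fields = {"timestamp": "", "platform": ""}
    in_table, dividers, verdict = False, 0, "PASS"
    for line in lines:
        if HEADER.match(line):
            # trigger: a table is about to start
            in_table, dividers, verdict = True, 0, "PASS"
        elif in_table and DIVIDER.match(line):
            dividers += 1
            if dividers == 2:            # second divider closes the table
                in_table = False
                results[fields["platform"]] = (fields["timestamp"], verdict)
        elif in_table:
            # the only group in ROW is the last column (the result)
            row = ROW.match(line)
            if row and row.group(1).upper() == "FAIL":
                verdict = "FAIL"
        else:
            field = FIELD.match(line)
            if field:
                fields[field.group(1).lower()] = field.group(2).strip()
    return results

report = """\
timestamp:   2010-03-10 10:20am
platform:    Windows

TIMESTAMP              | TESTCASE | STEP | EXPECTED | OBTAINED |  RESULT
-----------------------+----------+------+----------+----------+---------
2010-03-10 11:00:05.23 | TC01     | 001  | 1        | 1        | PASS
2010-04-01 05:00:05.23 | TC02     | 001  | AA       | a        | FAIL
-----------------------+----------+------+----------+----------+---------

timestamp:  2010-03-10 11:08pm
platform:   Linux

TIMESTAMP              | TESTCASE | STEP | EXPECTED | OBTAINED |  RESULT
-----------------------+----------+------+----------+----------+---------
2010-03-10 11:00:05.23 | TC01     | 001  | 1        | 1        | PASS
2010-04-01 05:00:05.23 | TC02     | 001  | AA       | aa       | PASS
-----------------------+----------+------+----------+----------+---------
"""

summary = summarize(report.splitlines())
for platform, (timestamp, verdict) in summary.items():
    print("%-10s  %-18s  %s" % (platform, timestamp, verdict))
```

Because the function takes lines rather than a file name, the same code can be fed from open("testreport.txt") or from a string in a unit test.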


conclusion


I hope this clarifies Python regular expressions and gives you a good idea of how to implement parsers in the context of parsing log files and report files. This approach is the basis of an automated verification system that parses all the gathered log files, automatically verifies them against the requirements, and generates test reports. Future tutorials will show this mechanism. Stay tuned ...