The extract_lines text extraction module for Python


The problem

Engineers often need to work with data sets that have been published as text or embedded in the source code of other projects. These data sets are often formatted or embedded in such a way that manual intervention is required to extract them, which slows down reuse of the data in another program or in a spreadsheet.

As an example, consider the following table of constants (declared as a list of individual constants) from a C program (taken from the uClibc source code):

static const double
u00  = -7.38042951086872317523e-02, /* 0xBFB2E4D6, 0x99CBD01F */
u01  =  1.76666452509181115538e-01, /* 0x3FC69D01, 0x9DE9E3FC */
u02  = -1.38185671945596898896e-02, /* 0xBF8C4CE8, 0xB16CFA97 */
u03  =  3.47453432093683650238e-04, /* 0x3F36C54D, 0x20B29B6B */
u04  = -3.81407053724364161125e-06, /* 0xBECFFEA7, 0x73D25CAD */
u05  =  1.95590137035022920206e-08, /* 0x3E550057, 0x3B4EABD4 */
u06  = -3.98205194132103398453e-11, /* 0xBDC5E43D, 0x693FB3C8 */
v01  =  1.27304834834123699328e-02, /* 0x3F8A1270, 0x91C9C71A */
v02  =  7.60068627350353253702e-05, /* 0x3F13ECBB, 0xF578C6C1 */
v03  =  2.59150851840457805467e-07, /* 0x3E91642D, 0x7FF202FD */
v04  =  4.41110311332675467403e-10; /* 0x3DFE5018, 0x3BD6D9EF */

We would like to obtain a file listing the data as follows:

u00,-7.38042951086872317523e-02
u01,1.76666452509181115538e-01
u02,-1.38185671945596898896e-02
u03,3.47453432093683650238e-04
u04,-3.81407053724364161125e-06
u05,1.95590137035022920206e-08
u06,-3.98205194132103398453e-11
v01,1.27304834834123699328e-02
v02,7.60068627350353253702e-05
v03,2.59150851840457805467e-07
v04,4.41110311332675467403e-10

There are many tricks and programs that can be used to extract this data.

Some methods, such as using a spreadsheet and importing as text, or through a text editor, involve significant amounts of manual labor.

Other methods employ programs such as sed, awk, cut and grep in a UNIX pipeline.

For example, the following pipeline extracts the table from the source code:

$ cat data.c | egrep '.*([a-z][0-9][0-9]) += +([-+]?[0-9]+[.][0-9]+e[-+]?[0-9]+).*' | cut -d "," -f 1 | cut -d "=" -f 1,2 --output-delimiter="," | cut -d ';' -f 1 | sed 's/ \+//g'

Admittedly, this pipeline is not very easy to understand or maintain. Furthermore, although the text is extracted, using it immediately (and automatically) in a calculation would be much more difficult, since the text would need to be parsed a second time by a custom program.

In some cases, the data is buried deep within a text file, in a format that makes it difficult to build the pipeline required to extract the desired information. This is the case for the following (simple) example:

int tens[8] = { 11, 12, 13,
        14, 15, 16, 17, 18};
 
int twenties[12] = { 21, 22, 23, 24, 25,
26, 27, 28, 29, 20, 21, 22};
 
int thirties[4] = { 34, 35, 36, 37 };

In this case, suppose we want a list of the twenties, one per line, recast in hexadecimal representation. Extraction is complicated for the following reasons:

  • There is an arbitrary number of items on a line instead of a single item
  • There are several arrays with the same format in the set
  • We want an output representation that requires arithmetic computation on the text extracted before it can be obtained (goodbye sed !)
  • We want one output value on each line
  • The array size (“twenties[12]”) is of the same format as the data to extract

Of course, this problem could be solved with awk, but at this point, I think a different approach can be taken.

The solution: the extract_lines Python module

extract_lines is a general-purpose text extraction tool that can be used within Python scripts.

Prerequisites

To use the extract_lines module, basic knowledge of Python is required. Additionally, knowledge of regular expressions (regexps) is essential for anyone wishing to do advanced text processing. Here are some resources to get you started.

Learning Python

The following resources should get you started:

Learning Regular Expressions:

Using extract_lines.py

The extract_lines module is a single-file module containing a single function (not even a class !). It is amazing what can be done with that versatile function.

The extract_lines() function has the following signature:

extract_lines(lines, startPat, endPat, extractPat, subPat, removeStart = True)

It returns a list of lines where substitution has been done according to the following parameters:

  • lines: a list of lines (strings) upon which to operate
  • startPat: regexp pattern to match before starting the extraction and substitution phase.
  • endPat: regexp pattern to match before ending the extraction and substitution phase.
  • extractPat: regexp pattern to match for substitution.
  • subPat: substitution string to replace every match of extractPat. This parameter can also be a string-returning function which will then be called for every substitution with the re.Match object of the extractPat match (see examples below).
  • removeStart: if True (the default), the initial match from startPat is removed from the lines before starting substitution. In most cases, this should remain True.
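
The two forms of subPat resemble the two forms of the replacement argument of Python's re.sub(). The sketch below calls re.sub() directly rather than extract_lines, whose internals are not shown here, so the correspondence is an assumption. The sample line and the to_hex helper are made up for illustration.

```python
import re

line = "26, 27, 28"

# String form: backreferences such as \1 refer to groups of the pattern.
bracketed = re.sub(r"([0-9]+)", r"<\1>", line)

# Function form: called once per match with the match object,
# and expected to return the replacement string.
def to_hex(matchobj):
    return hex(int(matchobj.group(1)))

as_hex = re.sub(r"([0-9]+)", to_hex, line)

print(bracketed)
print(as_hex)
```

The function form is what makes arithmetic on the extracted text possible, which a plain substitution string cannot do.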

Even though the function returns a list of lines, some easy tricks can be used to extract data in arbitrarily complex data structures. This will be shown in the examples below. The number of lines need not be more than one, either.

How it works

The extract_lines module uses a simple state machine, driven by regular expressions. The algorithm is the following:

FOR EACH line IN lines:
    IN CASE OF “wait for start match” state (initial state):
        Try to match startPat against the line;
        IF startPat is matched THEN
            IF removeStart is True THEN
                Eliminate start match from line;
            END IF
            Switch to “extract” state;
        ELSE
            Remain in “wait for start match” state;
        END IF
    IN CASE OF “extract” state:
        Substitute subPat for every occurrence of extractPat in the current line;
        Save the current line to the output list;
        Try to match endPat against the line;
        IF endPat is matched THEN
            Stop iterating through lines;
        ELSE
            Remain in “extract” state;
        END IF
END: Return the output list of matched lines with substitutions done

An interesting aspect of the state machine is that the substitution can happen several times within a single line. If subPat is a function, then it will be called for every match. The algorithm is thus line-based to weed out undesired data, but pattern-based for extraction.
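
To make the state machine concrete, here is a minimal sketch written from the pseudocode alone. It is not the module's actual source: the name extract_lines_sketch is hypothetical, and two details are assumptions (marked in the comments), namely that the line matching startPat is itself processed in the “extract” state, and that endPat is tested against the line as it was before substitution.

```python
import re

def extract_lines_sketch(lines, start_pat, end_pat, extract_pat, sub_pat,
                         remove_start=True):
    """Illustrative re-implementation of the pseudocode; not the module's code."""
    output = []
    extracting = False  # False means "wait for start match" state
    for line in lines:
        if not extracting:
            start = re.search(start_pat, line)
            if start is None:
                continue  # remain in "wait for start match" state
            if remove_start:
                line = line[:start.start()] + line[start.end():]
            extracting = True
            # Assumption: the start line itself is also processed below.
        # "extract" state: substitute every occurrence, then save the line.
        output.append(re.sub(extract_pat, sub_pat, line))
        # Assumption: endPat is tested against the line before substitution.
        if re.search(end_pat, line):
            break
    return output

sample = [
    "static const double",
    "u00  = -7.5e-02, /* first */",
    "v04  =  4.0e-10; /* last */",
    "int unrelated = 3;",
]
pattern = r".*([a-z][0-9][0-9]) += +([-+]?[0-9]+[.][0-9]+e[-+]?[0-9]+).*"
result = extract_lines_sketch(sample, r"u00", r";", pattern, r"\1,\2",
                              remove_start=False)
print(result)
```

Both assumptions are chosen so that the first example below behaves as described: its start line carries data, and its extraction pattern consumes the “;” that ends the block.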

The caveat of this algorithm is that patterns cannot span more than a single line. This makes it difficult to extract data that spans several lines, such as data within long expressions in source code. For most applications, this limitation is acceptable. For languages such as C, where statements are terminated by a semicolon (“;”), it can be worked around with a preprocessing step that splits the text on semicolons and removes newlines.
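
Such a preprocessing step could look like the sketch below. The split_statements helper is hypothetical and deliberately naive: it treats every semicolon as a statement terminator, so semicolons inside string literals, comments or for(;;) headers would be split incorrectly.

```python
def split_statements(source):
    """Return one string per C statement, with line breaks collapsed."""
    # Collapse newlines and runs of whitespace into single spaces.
    flattened = " ".join(source.split())
    parts = [part.strip() for part in flattened.split(";")]
    # Put the terminating semicolon back and drop empty fragments.
    return [part + ";" for part in parts if part]

source = """int tens[8] = { 11, 12, 13,
        14, 15, 16, 17, 18};

int twenties[12] = { 21, 22, 23, 24, 25,
26, 27, 28, 29, 20, 21, 22};

int thirties[4] = { 34, 35, 36, 37 };
"""

statements = split_statements(source)
for statement in statements:
    print(statement)
```

Each array declaration then sits on a single line, so a single-line pattern can see all of it.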

Examples

Example 1: Basic usage

The following code uses the extract_lines function to perform the same extraction as the pipeline shown earlier. The full source code of the example is in the file ex_coeffs.py in the archive.

from extract_lines import extract_lines
 
if __name__ == "__main__":
    data = """
    static const double
    u00  = -7.38042951086872317523e-02, /* 0xBFB2E4D6, 0x99CBD01F */
    u01  =  1.76666452509181115538e-01, /* 0x3FC69D01, 0x9DE9E3FC */
    u02  = -1.38185671945596898896e-02, /* 0xBF8C4CE8, 0xB16CFA97 */
    u03  =  3.47453432093683650238e-04, /* 0x3F36C54D, 0x20B29B6B */
    u04  = -3.81407053724364161125e-06, /* 0xBECFFEA7, 0x73D25CAD */
    u05  =  1.95590137035022920206e-08, /* 0x3E550057, 0x3B4EABD4 */
    u06  = -3.98205194132103398453e-11, /* 0xBDC5E43D, 0x693FB3C8 */
    v01  =  1.27304834834123699328e-02, /* 0x3F8A1270, 0x91C9C71A */
    v02  =  7.60068627350353253702e-05, /* 0x3F13ECBB, 0xF578C6C1 */
    v03  =  2.59150851840457805467e-07, /* 0x3E91642D, 0x7FF202FD */
    v04  =  4.41110311332675467403e-10; /* 0x3DFE5018, 0x3BD6D9EF */
    """
    # Split the lines
    # Could do this instead: lines = open("data.c", "r").readlines()
    lines = data.split("\n")
 
    # Extract and substitute
    newLines = extract_lines(lines,
        r"u00", # Find first occurrence of "u00"
        r";", # End on line with ";"
        # Extraction pattern (identifier and floating point numbers):
        r".*([a-z][0-9][0-9]) += +([-+]?[0-9]+[.][0-9]+e[-+]?[0-9]+).*",
        r"\1,\2", # Keep groups 1 and 2 and separate with comma
        removeStart = False) # Keep start match since it contains u00
 
    # Show the result
    for line in newLines:
        print(line)

This will yield the same data listing as previously. However, no external programs were called and many extraction passes could have been done without having to reload the (possibly lengthy) file.
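
Because the result is already a Python list, the second parsing pass mentioned earlier becomes a few lines of code. The sketch below is an illustration rather than part of ex_coeffs.py: it starts from a hand-copied excerpt of the expected output, and the coeffs dictionary is a hypothetical name.

```python
# Excerpt of the lines that the extraction is expected to return.
new_lines = [
    "u00,-7.38042951086872317523e-02",
    "u01,1.76666452509181115538e-01",
    "u02,-1.38185671945596898896e-02",
]

coeffs = {}
for line in new_lines:
    name, value = line.split(",")
    coeffs[name] = float(value)

# The values are now ordinary floats, ready for arithmetic.
total = sum(coeffs.values())
print(coeffs["u00"], total)
```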

Example 2: Extracting values from a C array

Let’s return to the earlier example and extract the decimal values of the “twenties” array, storing them in a new flat list as integers instead of strings.

The extract_lines invocation would be:

newData = []
extract_lines(lines_containing_arrays,
    r"twenties\[.*\]", # Find start of twenties array
    r";", # End on line with ";"
    '([0-9]+)', # Extract decimal values
    lambda matchobj: newData.append(int(matchobj.group(1))) )
 
for item in newData:
    print(item)

The complete source of this example is in file ex_twenties1.py in the archive.

Here, the last argument is an anonymous function (a Python lambda) that converts the first matching group to an integer and appends it to a list. Notice that we did not use the return value of extract_lines. If we look at it, we will see that all the numbers have been deleted (replaced with nothing), since the lambda function does not return anything!

We used the substitution function’s side-effects to extract data in a new structure.
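
The side-effect technique, and the hexadecimal output asked for in the problem statement, can be tried without the module. In the sketch below, re.sub() stands in for extract_lines (an assumption about how the module applies subPat), the start match is removed by hand to mimic removeStart, and store_decimal and hex_lines are hypothetical names.

```python
import re

lines = [
    "int twenties[12] = { 21, 22, 23, 24, 25,",
    "26, 27, 28, 29, 20, 21, 22};",
]

# Remove the start match so the "12" in "twenties[12]" is not taken for data.
start = re.search(r"twenties\[.*\]", lines[0])
lines[0] = lines[0][:start.start()] + lines[0][start.end():]

new_data = []

def store_decimal(matchobj):
    # Side effect: collect the value as an integer.
    new_data.append(int(matchobj.group(1)))
    return ""  # explicit empty replacement; the substituted text is unused

for line in lines:
    re.sub(r"([0-9]+)", store_decimal, line)

# One hexadecimal value per line.
hex_lines = ["0x%02X" % item for item in new_data]
print("\n".join(hex_lines))
```

Without the removal step, the array size would be collected as a thirteenth value, which is the pitfall noted in the problem statement.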

Instead of a lambda function, we could have passed a reference to a regular function. A lambda is still needed as an adapter in order to pass other local parameters along with the match object; otherwise, the list receiving the data would need to be in global scope. The following snippet shows this case (see the file ex_twenties2.py in the archive for the full example source):

# Function definition
def storeDecimal(container, matchobj):
    container.append(int(matchobj.group(1))) 
 
### later in the script ###
newData = []
 
newLines = extract_lines(lines,
    r"twenties\[.*\]", # Find start of twenties array
    r";", # End on line with ";"
    '([0-9]+)', # Extract decimal values
    lambda matchobj: storeDecimal(newData, matchobj))
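
As a possible alternative to the lambda adapter, functools.partial from the standard library can bind the container argument ahead of time. This variant is not in the archive. The sketch uses re.sub() as a stand-in for extract_lines, on the assumption that any callable accepting the match object is acceptable as subPat; store_decimal is a renamed copy of storeDecimal that returns an empty string explicitly.

```python
import re
from functools import partial

def store_decimal(container, matchobj):
    container.append(int(matchobj.group(1)))
    return ""  # explicit empty replacement

new_data = []

# partial() yields a one-argument callable with the container already bound.
adapter = partial(store_decimal, new_data)
re.sub(r"([0-9]+)", adapter, "{ 34, 35, 36, 37 };")

print(new_data)
```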

Example 3: Extracting binary vectors from a VHDL testbench of a FIR filter

This example illustrates how the extract_lines module can be used for a typical automated data extraction task.

In this case, we want to extract the coefficients for an FIR digital filter from VHDL source code. The VHDL code was written by an automated tool (HDL Coder in Filter Design and Analysis tool of MATLAB). The source file in this example contains 16 constants for coefficients within a source of about 250 lines. The desired data is encoded as follows:

----------------------------------------------------------------
--Module Architecture: filter
----------------------------------------------------------------
ARCHITECTURE rtl OF filter IS
  -- Local Functions
  -- Type Definitions
  TYPE delay_pipeline_type IS ARRAY (NATURAL range <>) OF signed(11 DOWNTO 0); -- sfix12_En11
  -- Constants
  CONSTANT coeff1   : signed(12 DOWNTO 0) := to_signed(-198, 13); -- sfix13_En12
  CONSTANT coeff2   : signed(12 DOWNTO 0) := to_signed(130, 13); -- sfix13_En12
  CONSTANT coeff3   : signed(12 DOWNTO 0) := to_signed(272, 13); -- sfix13_En12
  CONSTANT coeff4   : signed(12 DOWNTO 0) := to_signed(66, 13); -- sfix13_En12
  CONSTANT coeff5   : signed(12 DOWNTO 0) := to_signed(-312, 13); -- sfix13_En12
  CONSTANT coeff6   : signed(12 DOWNTO 0) := to_signed(-171, 13); -- sfix13_En12
  CONSTANT coeff7   : signed(12 DOWNTO 0) := to_signed(753, 13); -- sfix13_En12
  CONSTANT coeff8   : signed(12 DOWNTO 0) := to_signed(1709, 13); -- sfix13_En12
  CONSTANT coeff9   : signed(12 DOWNTO 0) := to_signed(1709, 13); -- sfix13_En12
  CONSTANT coeff10 : signed(12 DOWNTO 0) := to_signed(753, 13); -- sfix13_En12
  CONSTANT coeff11 : signed(12 DOWNTO 0) := to_signed(-171, 13); -- sfix13_En12
  CONSTANT coeff12 : signed(12 DOWNTO 0) := to_signed(-312, 13); -- sfix13_En12
  CONSTANT coeff13 : signed(12 DOWNTO 0) := to_signed(66, 13); -- sfix13_En12
  CONSTANT coeff14 : signed(12 DOWNTO 0) := to_signed(272, 13); -- sfix13_En12
  CONSTANT coeff15 : signed(12 DOWNTO 0) := to_signed(130, 13); -- sfix13_En12
  CONSTANT coeff16 : signed(12 DOWNTO 0) := to_signed(-198, 13); -- sfix13_En12
 
-- Signals
  SIGNAL delay_pipeline : delay_pipeline_type(0 TO 15); -- sfix12_En11
 
-- etc, etc, etc

We want to extract the values of coeff1 through coeff16, which are stored as signed decimal numbers of arbitrary size. When writing the data output, we want to store them as two’s complement binary coefficients with the proper width, as specified in the to_signed() conversions.
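
The conversion itself is small. Here is a minimal sketch, separate from the program shown further down, that masks the value to the requested width and formats it in binary. The to_twos_complement name is hypothetical, and values that do not fit in the width are silently truncated by the mask.

```python
def to_twos_complement(value, width):
    """Return value as a two's complement bit string of the given width."""
    mask = (1 << width) - 1
    # Masking maps negative values onto their two's complement encoding.
    return format(value & mask, "0%db" % width)

for value in (-198, 130, 272):
    print(to_twos_complement(value, 13))
```

This relies on Python integers behaving as if they had an infinite two's complement representation under the & operator.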

The first lines of the resulting file would look like this:

# Filter coefficients from file filter.vhd
# -----K(n)----
  1111100111010
  0000010000010
  0000100010000
  0000001000010
  1111011001000
  1111101010101

In order to fully automate the task of extraction for arbitrary filters, we want to be able to read a VHDL file, process it and write the results to a coefficient output file.

The following program (see ex_tbextract.py in the archive) performs this task:

from extract_lines import extract_lines
import sys
 
def to_std_logic(num, length):
    """
    Convert the "num" numerical argument to a binary string
    over "length" bits. If num is None, make the string undefined
    ("UUUUUU.....UUUU")
    """
    out = []
    for i in range(length):
        if num is None:
            out.append('U')
        else:
            if num & (1 << i):
                out.append('1')
            else:
                out.append('0')
    out.reverse()
    return "".join(out)
 
def subst_coeff(matchobj):
    # Convert the signed decimal string to an integer
    value = int(matchobj.group(1))
    # Extract word width
    width = int(matchobj.group(2))
 
    return to_std_logic(value, width)
 
def extract_coeffs(inFile, outFile):
    # Read-in filter architecture file
    fin = open(inFile, "r")
    lines = fin.readlines()
    fin.close()
 
    # Extract filter coefficients
    coeffs = extract_lines(lines,
        "-- Constants", # Find coeff table declaration
        "^$", # End on empty line
        # Extract value and width
        r".*to_signed\(([-+]?\b\d+\b), +([-+]?\b\d+\b)\).*",
        subst_coeff) # Use substitution function to reformat value
 
    # Find bit width from string width
    width = len(coeffs[0])
 
    # Save coefficients
    fout = open(outFile, "w")
    fout.write("# Filter coefficients from file %s\n" % inFile)
    fout.write("# " + "K(n)".center(width).replace(' ','-') + "\n")
 
    for coeff in coeffs:
        fout.write("  %s\n" % coeff)
    fout.close()
 
if __name__ == "__main__":
    extract_coeffs("filter.vhd", "coeffs.txt")

The program is built with the following blocks:

  • function to_std_logic(): takes an integer and a bit width and returns its binary representation as a string.
  • function subst_coeff(): passed to extract_lines() as the substitution function. Takes a match containing 2 groups and returns the string from a call to to_std_logic() with the groups extracted.
  • function extract_coeffs(): reads the lines from the filter VHDL source code, extracts the coefficient data and stores the result in a binary table file.

Notice how straightforward the task becomes when the extract_lines module is used.

Conclusion

The extract_lines module can be used to automate a significant number of different text extraction and substitution cases. While it is not a universal solution, it is highly applicable to engineering data in tabular or machine-generated human-readable formats. The simplicity of the solution and availability of the source code allow you to modify it to suit more advanced needs.

Downloads

The extract_lines module is Public Domain. You do what you want with it. This code is offered without any warranty of fitness for any purpose.

