Extracting SQL code from SSIS dtsx packages with Python lxml

Lately, I’ve been refactoring a sprawling SSIS (SQL Server Integration Services) package that ineffectually wrestles with large XML files. In this programmer’s opinion using SSIS for heavy-duty XML parsing is geeky self-abuse so I’ve opted to replace an eye-ball straining[1] SSIS package with half a dozen, “as simple as possible but no simpler”, Python scripts. If the Python is fast enough for production great! If not the scripts will serve as a clear model[2] for something faster.

I’m only refactoring[3] part of a larger ETL process so whatever I do it must mesh with the rest of the mess.

So where is the rest of the SSIS mess?

SSIS’s visual editor does a wonderful job of hiding the damn code!

This is a problem!

If only there was a simple way to troll through large sprawling SSIS spider-webby packages and extract the good bits. Fortunately, Python’s XML parsing tools can be easily applied to SSIS dtsx files. SSIS dtsx files are XML files. The following code snippets illustrate how to hack these files.

First import the required Python modules. lxml is not always included in Python distributions. Use the pip or conda tools to install this module.

# imports
import os
from lxml import etree

Set an output directory. I’m running on a Windows machine. If you’re on a Mac or Linux machine adjust the path.

# set sql output directory
sql_out = r"C:\temp\dtsxsql"
if not os.path.isdir(sql_out):
    os.makedirs(sql_out)

Point to the dtsx package you want to extract code from.

# dtsx files
dtsx_path = r'C:\Users\john\AnacondaProjects\testfolder\bixml'
ssis_dtsx = dtsx_path + r'\ParseXML.dtsx'
ssis_dtsx

Read and parse the SSIS package.

tree = etree.parse(ssis_dtsx)
root = tree.getroot()
root.tag
'{www.microsoft.com/SqlServer/Dts}Executable'

lxml renders XML namespace tags like <DTS:Executable as {www.microsoft.com/SqlServer/Dts}Executable. The following
shows all the transformed element tags in the dtsx package.

# collect unique element tags in dtsx
ele_set = set()
for ele in root.xpath(".//*"):
    ele_set.add(ele.tag)    
print(ele_set)
print(len(ele_set))
{'InnerObject', '{www.microsoft.com/SqlServer/Dts}PrecedenceConstraint', '{www.microsoft.com/SqlServer/Dts}ObjectData', '{www.microsoft.com/SqlServer/Dts}PackageParameter', '{www.microsoft.com/SqlServer/Dts}LogProvider', '{http://schemas.xmlsoap.org/soap/envelope/}Envelope', '{www.microsoft.com/SqlServer/Dts}Executable', 'ExpressionTask', '{http://schemas.xmlsoap.org/soap/envelope/}Body', '{www.microsoft.com/SqlServer/Dts}Variable', '{www.microsoft.com/SqlServer/Dts}ForEachVariableMapping', '{www.microsoft.com/SqlServer/Dts}PrecedenceConstraints', '{www.microsoft.com/SqlServer/Dts}SelectedLogProviders', 'ForEachFileEnumeratorProperties', '{www.microsoft.com/SqlServer/Dts}ForEachVariableMappings', '{www.microsoft.com/SqlServer/Dts}ForEachEnumerator', '{www.microsoft.com/SqlServer/Dts}SelectedLogProvider', '{www.microsoft.com/SqlServer/Dts}DesignTimeProperties', '{www.microsoft.com/SqlServer/Dts}LogProviders', '{www.microsoft.com/SqlServer/Dts}LoggingOptions', '{www.microsoft.com/SqlServer/Dts}Variables', 'FEFEProperty', 'FileSystemData', 'ProjectItem', '{www.microsoft.com/SqlServer/Dts}Property', '{http://www.w3.org/2001/XMLSchema}anyType', '{www.microsoft.com/SqlServer/Dts}Executables', '{www.microsoft.com/sqlserver/dts/tasks/sqltask}ParameterBinding', '{www.microsoft.com/sqlserver/dts/tasks/sqltask}ResultBinding', '{www.microsoft.com/SqlServer/Dts}VariableValue', 'FEEADO', 'BinaryItem', '{www.microsoft.com/SqlServer/Dts}PackageParameters', '{www.microsoft.com/SqlServer/Dts}PropertyExpression', '{www.microsoft.com/sqlserver/dts/tasks/sqltask}SqlTaskData', 'ScriptProject'}
36

Using transformed element tags of interest blast over the dtsx and suck out the bits of interest.

# extract sql code in source statements and write to *.sql files 
total_bytes = 0
package_name = root.attrib['{www.microsoft.com/SqlServer/Dts}ObjectName'].replace(" ","")
for cnt, ele in enumerate(root.xpath(".//*")):
    if ele.tag == "{www.microsoft.com/SqlServer/Dts}Executable":
        attr = ele.attrib
        for child0 in ele:
            if child0.tag == "{www.microsoft.com/SqlServer/Dts}ObjectData":
                for child1 in child0:
                    sql_comment = attr["{www.microsoft.com/SqlServer/Dts}ObjectName"].strip()
                    if child1.tag == "{www.microsoft.com/sqlserver/dts/tasks/sqltask}SqlTaskData":
                        dtsx_sql = child1.attrib["{www.microsoft.com/sqlserver/dts/tasks/sqltask}SqlStatementSource"]
                        dtsx_sql = "-- " + sql_comment + "\n" + dtsx_sql
                        sql_file = sql_out + "\\" + package_name + str(cnt) + ".sql"
                        total_bytes += len(dtsx_sql)
                        print((len(dtsx_sql), sql_comment, sql_file))
                        with open(sql_file, "w") as file:
                            file.write(dtsx_sql)
print(('total sql code bytes',total_bytes))
   (2817, 'Add Record to ZipProcessLog', 'C:\\temp\\dtsxsql\\2_ParseXML225.sql')
    (48, 'Dummy SQL - End Loop for ZipFileName', 'C:\\temp\\dtsxsql\\2_ParseXML268.sql')
    (1327, 'Add Record to XMLProcessLog', 'C:\\temp\\dtsxsql\\2_ParseXML293.sql')
    (546, 'Delete Prior Loads in XMLProcessLog Table for Looped XML', 'C:\\temp\\dtsxsql\\2_ParseXML304.sql')
    (759, 'Delete Prior Loads to Node Table for Looped XML', 'C:\\temp\\dtsxsql\\2_ParseXML312.sql')
    (48, 'Dummy SQL - End Loop for XMLFileName', 'C:\\temp\\dtsxsql\\2_ParseXML320.sql')
    (1862, 'Set Variable DeletePriorImportNodeFlag', 'C:\\temp\\dtsxsql\\2_ParseXML356.sql')
    (55, 'Shred XML to Node Table', 'C:\\temp\\dtsxsql\\2_ParseXML365.sql')
    (1011, 'Update LoadEndDatetime and XMLRecordCount in XMLProcessLog', 'C:\\temp\\dtsxsql\\2_ParseXML371.sql')
    (1060, 'Update LoadEndDatetime and XMLRecordCount in XMLProcessLog - Shred Failure', 'C:\\temp\\dtsxsql\\2_ParseXML382.sql')
    (675, 'Load object VariablesList (Nodes to process for each XML File Category)', 'C:\\temp\\dtsxsql\\2_ParseXML412.sql')
    (1175, 'Set Variable ZipProcessFlag - Has Zip Had A Prior Successful Run', 'C:\\temp\\dtsxsql\\2_ParseXML461.sql')
    (224, 'Set ZipProcessed Status (End Zip)', 'C:\\temp\\dtsxsql\\2_ParseXML474.sql')
    (238, 'Set ZipProcessed Status (Zip Already Processed)', 'C:\\temp\\dtsxsql\\2_ParseXML480.sql')
    (231, 'Set ZipProcessing Status (Zip Starting)', 'C:\\temp\\dtsxsql\\2_ParseXML486.sql')
    (609, 'Update LoadEndDatetime in ZipProcessLog', 'C:\\temp\\dtsxsql\\2_ParseXML506.sql')
    (613, 'Update ZipLog UnzipCompletedDateTime Equal to GETDATE', 'C:\\temp\\dtsxsql\\2_ParseXML514.sql')
    (1610, 'Update ZipProcessLog ExtractedFileCount', 'C:\\temp\\dtsxsql\\2_ParseXML522.sql')
    ('total sql code bytes', 14908)

The code snippets in this post are available in this Jupyter notebook: Extracting SQL code from SSIS dtsx packages with Python lxml. Download and tweak for your dtsx nightmare!



  1. I frequently run into SSIS packages that cannot be viewed on 4K monitors when fully zoomed out.
  2. Python’s readability is a major asset when disentangling mess-ware.
  3. Yes, I’ve railed about the word “refactoring” in the past but I’ve moved on and so should you. “A foolish consistency is the hobgoblin of little minds.”

NumPy another Iverson Ghost

During my recent SmugMug API and Python adventures I was haunted by an Iverson ghost: NumPy

An Iverson ghost is an embedding of APL like array programming features in nonAPL languages and tools.

You would be surprised at how often Iverson ghosts appear. Whenever programmers are challenged with processing large numeric arrays they rediscover bits of APL. Often they’re unaware of the rich heritage of array processing languages but in NumPy's case, they indirectly acknowledged the debt. In Numerical Python the authors wrote:

“The languages which were used to guide the development of NumPy include the infamous APL family of languages, Basis, MATLAB, FORTRAN, S and S+, and others.”

I consider “infamous” an upgrade from “a mistake carried through to perfection.”

Not only do developers frequently conjure up Iverson ghosts. They also invariably turn into little apostles of array programming that won’t shut up about how cutting down on all those goddamn loops clarifies and simplifies algorithms. How learning to think about operating on entire arrays, versus one dinky number at a time, frees the mind. Why it’s almost as if array programming is a tool of thought.

Where have I heard this before?

Ahh, I’ve got it, when I first encountered APL almost fifty years ago.

Yes, I am an old programmer, a fossil, a living relic. My brain is a putrid pool of punky programming languages. Python is just the latest in a longish line of languages. Some people collect stamps. I collect programming languages. And, just like stamp collectors have favorite stamps, I find some programming languages more attractive than others. For example, I recognize the undeniable utility of C/C++, for many tasks they are the only serious options, yet as useful and pervasive as C/C++ are they have never tickled my fancy. The notation is ugly! Yeah, I said it; suck on it C people. Similarly, the world’s most commonly used programming language JavaScript is equally ugly. Again, JavaScript is so damn useful that programmers put up with its many warts. Some have even made a few bucks writing books about its meager good parts.

I have similar inflammatory opinions about other widely used languages. The one that is making me miserable now is SQL, particularly Microsoft’s variant T-SQL. On purely aesthetic grounds I find well-formed SQL queries less appalling than your average C pointer fest. Core SQL is fairly elegant but the macro programming features that have grown up around it are depraved. I feel dirty when forced to use them which is just about every other day.

At the end of my programming day, I want to look on something that is beautiful. I don’t particularly care about how useful a chunk of code is or how much money it might make, or what silly little business problem it solves. If the damn code is ugly I don’t want to see it.

People keep rediscovering array programming, best described in Ken Iverson’s 1962 book A Programming Language, for two basic reasons:

  1. It’s an efficient way to handle an important class of problems.
  2. It’s a step away from the ugly and back towards the beautiful.

Both of these reasons manifest in NumPy‘s resounding success in the Python world.

As usual, efficiency led the way. The authors of Numerical Python note:

Why are these extensions needed? The core reason is a very prosaic one, and that is that manipulating a set of a million numbers in Python with the standard data structures such as lists, tuples or classes is much too slow and uses too much space.

Faced with a “does not compute” situation you can either try something else or fix what you have. The Python people fixed Python with NumPy. Pythonistas reluctantly embraced NumPy but quickly went apostolic! Now books like Elegant SciPy and the entire SciPy toolset that been built on NumPy take it for granted.

Is there anything in NumPy for programmers that have been drinking the array processing Kool-Aid for decades? The answer is yes! J programmers, in particular, are in for a treat with the new Python3 addon that’s been released with the latest J 8.07 beta. This addon directly supports NumPy arrays making it easy to swap data in and out of the J/Python environments. It’s one of those best of both worlds things.

The following NumPy examples are from the SciPy.org NumPy quick start tutorial. For each NumPy statement, I have provided a J equivalent. J is a descendant of APL. It was largely designed by the same man: Ken Iverson. A scumbag lawyer or greedy patent troll might consider suing NumPy‘s creators after looking at these examples. APL’s influence is obvious. Fortunately, Ken Iverson was more interested in promoting good ideas that profiting from them. I suspect he would be flattered that APL has mutated and colonized strange new worlds and I think even zealous Pythonistas will agree that Python is a delightfully strange world.

Some Numpy and J examples

Selected Examples from https://docs.scipy.org/doc/numpy-dev/user/quickstart.html Output has been suppressed here. For a more detailed look at these examples browse the Jupyter notebook:  NumPy and J Make Sweet Array Love.

Creating simple arrays

 
 # numpy
 a = np.arange(15).reshape(3, 5)
 
 NB. J
 a =. 3 5 $ i. 15

 # numpy
 a = np.array([2,3,4])
 
 NB. J
 a =. 2 3 4
 
 # numpy
 b = np.array([(1.5,2,3), (4,5,6)])
 
 NB. J
 b =. 1.5 2 3 ,: 4 5 6

 # numpy
 c = np.array( [ [1,2], [3,4] ], dtype=complex )
 
 NB. J
 j. 1 2 ,: 3 4
 
 # numpy
 np.zeros( (3,4) )
 
 NB. J
 3 4 $ 0
 
 # numpy - allocates array with whatever is in memory
 np.empty( (2,3) )
 
 NB. J - uses fill - safer but slower than numpy's trust memory method
 2 3 $ 0.0001 

Basic operations

 
 # numpy
 a = np.array( [20,30,40,50] )
 b = np.arange( 4 )
 c = a - b
 
 NB. J
 a =. 20 30 40 50
 b =. i. 4
 c =. a - b
 
 # numpy - uses previously defined (b)
 b ** 2
 
 NB. J
 b ^ 2

 # numpy - uses previously defined (a)
 10 * np.sin(a)
 
 NB. J
 10 * 1 o. a
 
 # numpy - booleans are True and False
 a < 35
 
 NB. J - booleans are 1 and 0
 a < 35

Array processing

 
 # numpy
 a = np.array( [[1,1], [0,1]] )
 b = np.array( [[2,0], [3,4]] )
 # elementwise product
 a * b

 NB. J
 a =. 1 1 ,: 0 1
 b =. 2 0 ,: 3 4
 a * b

 # numpy - matrix product
 np.dot(a, b)

 NB. J - matrix product
 a +/ . * b  
 
 # numpy - uniform pseudo random
 a = np.random.random( (2,3) )
 
 NB. J - uniform pseudo random
 a =. ? 2 3 $ 0
 
 # numpy - sum all array elements - implicit ravel
 a.sum(a)
 
 NB. J - sum all array elements - explicit ravel
 +/ , a
 
 # numpy
 b = np.arange(12).reshape(3,4)
 # sum of each column
 b.sum(axis=0)
 # min of each row
 b.min(axis=1)
 # cumulative sum along each row
 b.cumsum(axis=1)
 # transpose
 b.T     

 NB. J 
 b =. 3 4 $ i. 12
 NB. sum of each column
 +/ b
 NB. min of each row
 <./"1 b
 NB. cumulative sum along each row
 +/\"0 1 b
 NB. transpose
 |: b

Indexing and slicing

 
 # numpy 
 a = np.arange(10) ** 3 
 a[2]
 a[2:5]
 a[ : :-1]   # reversal

 NB. J
 a =. (i. 10) ^ 3
 2 { a
 (2 + i. 3) { a
 |. a