Extracting SQL code from SSIS dtsx packages with Python lxml

Lately, I’ve been refactoring a sprawling SSIS (SQL Server Integration Services) package that ineffectually wrestles with large XML files. In this programmer’s opinion using SSIS for heavy-duty XML parsing is geeky self-abuse so I’ve opted to replace an eye-ball straining1 SSIS package with half a dozen, “as simple as possible but no simpler”, Python scripts. If the Python is fast enough for production great! If not the scripts will serve as a clear model2 for something faster.

I’m only refactoring3 part of a larger ETL process so whatever I do it must mesh with the rest of the mess.

So where is the rest of the SSIS mess?

SSIS’s visual editor does a wonderful job of hiding the damn code!

This is a problem!

If only there was a simple way to troll through large sprawling SSIS spider-webby packages and extract the good bits. Fortunately, Python’s XML parsing tools can be easily applied to SSIS dtsx files. SSIS dtsx files are XML files. The following code snippets illustrate how to hack these files.

First import the required Python modules. lxml is not always included in Python distributions. Use the pip or conda tools to install this module.

# imports
import os
from lxml import etree

Set an output directory. I’m running on a Windows machine. If you’re on a Mac or Linux machine adjust the path.

# set sql output directory
sql_out = r"C:\temp\dtsxsql"
if not os.path.isdir(sql_out):
    os.makedirs(sql_out)

Point to the dtsx package you want to extract code from.

# dtsx files
dtsx_path = r'C:\Users\john\AnacondaProjects\testfolder\bixml'
ssis_dtsx = dtsx_path + r'\ParseXML.dtsx'

Read and parse the SSIS package.

tree = etree.parse(ssis_dtsx)
root = tree.getroot()
root.tag

lxml renders XML namespace tags like: 

<DTS:Executable as {www.microsoft.com/SqlServer/Dts}Executable.

The following shows all the transformed element tags in the dtsx package.

# collect unique element tags in dtsx
ele_set = set()
for ele in root.xpath(".//*"):
    ele_set.add(ele.tag)    
print(ele_set)
print(len(ele_set))

Using transformed element tags of interest blast over the dtsx and suck out the bits of interest.

pfx = '{www.microsoft.com/'
exe_tag = pfx + 'SqlServer/Dts}Executable'
obj_tag = pfx + 'SqlServer/Dts}ObjectName'
dat_tag = pfx + 'SqlServer/Dts}ObjectData'
tsk_tag = pfx + 'sqlserver/dts/tasks/sqltask}SqlTaskData'
src_tag = pfx + \
  'sqlserver/dts/tasks/sqltask}SqlStatementSource'

# extract sql source statements and write to *.sql files 
total_bytes = 0
package_name = root.attrib[obj_tag].replace(" ","")
for cnt, ele in enumerate(root.xpath(".//*")):
  if ele.tag == exe_tag:
    attr = ele.attrib
    for child0 in ele:
      if child0.tag == dat_tag:
        for child1 in child0:
          sql_comment = attr[obj_tag].strip()
          if child1.tag == tsk_tag:
            dtsx_sql = child1.attrib[src_tag]
            dtsx_sql = "-- " + \
                sql_comment + "\n" + dtsx_sql
            sql_file = sql_out + "\\" \
                 + package_name + str(cnt) + ".sql"
            total_bytes += len(dtsx_sql)
            print((len(dtsx_sql), 
                 sql_comment, sql_file))
            with open(sql_file, "w") as file:
              file.write(dtsx_sql)
print(('total bytes',total_bytes))

The code snippets in this post are available in this Jupyter notebook: Extracting SQL code from SSIS dtsx packages with Python lxml. Download and tweak for your dtsx nightmare!


  1. I frequently run into SSIS packages that cannot be viewed on 4K monitors when fully zoomed out.↩︎
  2. Python’s readability is a major asset when disentangling mess-ware↩︎.
  3. Yes, I’ve railed about the word “refactoring” in the past but I’ve moved on and so should you. “A foolish consistency is the hobgoblin of little minds.”↩︎

3 thoughts on “Extracting SQL code from SSIS dtsx packages with Python lxml

  1. Python scriptz … YIKES babe! … if I understand the problem domain correctly, you still have to run the SSIS program so I’ld write a simple C# script task that serializes incoming XML to a data source object and have that as the source for the data flow in question. C# has lots n lots of great XML features making serializing XML to objects very easy … no need to hack SSIS XML …. however, very cool that you can do this … thought you were retired, cut this shit out already

    1. Hi Glen,

      I expect what you have outlined will be the final solution. Right now I am studying the problem. I’ve always disliked SSIS it’s basically “flaw chart” on steroids. Just browsing all the little process boxes is tedious and still results in yellow triangle hell.

      PS. I am semi retired just doing a bit of hourly consulting.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.