Category Archives: SQL

Extracting SQL code from SSIS dtsx packages with Python lxml

Lately, I’ve been refactoring a sprawling SSIS (SQL Server Integration Services) package that ineffectually wrestles with large XML files. In this programmer’s opinion using SSIS for heavy-duty XML parsing is geeky self-abuse so I’ve opted to replace an eye-ball straining[1] SSIS package with half a dozen, “as simple as possible but no simpler”, Python scripts. If the Python is fast enough for production great! If not the scripts will serve as a clear model[2] for something faster.

I’m only refactoring[3] part of a larger ETL process so whatever I do it must mesh with the rest of the mess.

So where is the rest of the SSIS mess?

SSIS’s visual editor does a wonderful job of hiding the damn code!

This is a problem!

If only there was a simple way to troll through large sprawling SSIS spider-webby packages and extract the good bits. Fortunately, Python’s XML parsing tools can be easily applied to SSIS dtsx files. SSIS dtsx files are XML files. The following code snippets illustrate how to hack these files.

First import the required Python modules. lxml is not always included in Python distributions. Use the pip or conda tools to install this module.

# imports
import os
from lxml import etree

Set an output directory. I’m running on a Windows machine. If you’re on a Mac or Linux machine adjust the path.

# set sql output directory
sql_out = r"C:\temp\dtsxsql"
if not os.path.isdir(sql_out):

Point to the dtsx package you want to extract code from.

# dtsx files
dtsx_path = r'C:\Users\john\AnacondaProjects\testfolder\bixml'
ssis_dtsx = dtsx_path + r'\ParseXML.dtsx'

Read and parse the SSIS package.

tree = etree.parse(ssis_dtsx)
root = tree.getroot()

lxml renders XML namespace tags like <DTS:Executable as {}Executable. The following
shows all the transformed element tags in the dtsx package.

# collect unique element tags in dtsx
ele_set = set()
for ele in root.xpath(".//*"):
{'InnerObject', '{}PrecedenceConstraint', '{}ObjectData', '{}PackageParameter', '{}LogProvider', '{}Envelope', '{}Executable', 'ExpressionTask', '{}Body', '{}Variable', '{}ForEachVariableMapping', '{}PrecedenceConstraints', '{}SelectedLogProviders', 'ForEachFileEnumeratorProperties', '{}ForEachVariableMappings', '{}ForEachEnumerator', '{}SelectedLogProvider', '{}DesignTimeProperties', '{}LogProviders', '{}LoggingOptions', '{}Variables', 'FEFEProperty', 'FileSystemData', 'ProjectItem', '{}Property', '{}anyType', '{}Executables', '{}ParameterBinding', '{}ResultBinding', '{}VariableValue', 'FEEADO', 'BinaryItem', '{}PackageParameters', '{}PropertyExpression', '{}SqlTaskData', 'ScriptProject'}

Using transformed element tags of interest blast over the dtsx and suck out the bits of interest.

# extract sql code in source statements and write to *.sql files 
total_bytes = 0
package_name = root.attrib['{}ObjectName'].replace(" ","")
for cnt, ele in enumerate(root.xpath(".//*")):
    if ele.tag == "{}Executable":
        attr = ele.attrib
        for child0 in ele:
            if child0.tag == "{}ObjectData":
                for child1 in child0:
                    sql_comment = attr["{}ObjectName"].strip()
                    if child1.tag == "{}SqlTaskData":
                        dtsx_sql = child1.attrib["{}SqlStatementSource"]
                        dtsx_sql = "-- " + sql_comment + "\n" + dtsx_sql
                        sql_file = sql_out + "\\" + package_name + str(cnt) + ".sql"
                        total_bytes += len(dtsx_sql)
                        print((len(dtsx_sql), sql_comment, sql_file))
                        with open(sql_file, "w") as file:
print(('total sql code bytes',total_bytes))
   (2817, 'Add Record to ZipProcessLog', 'C:\\temp\\dtsxsql\\2_ParseXML225.sql')
    (48, 'Dummy SQL - End Loop for ZipFileName', 'C:\\temp\\dtsxsql\\2_ParseXML268.sql')
    (1327, 'Add Record to XMLProcessLog', 'C:\\temp\\dtsxsql\\2_ParseXML293.sql')
    (546, 'Delete Prior Loads in XMLProcessLog Table for Looped XML', 'C:\\temp\\dtsxsql\\2_ParseXML304.sql')
    (759, 'Delete Prior Loads to Node Table for Looped XML', 'C:\\temp\\dtsxsql\\2_ParseXML312.sql')
    (48, 'Dummy SQL - End Loop for XMLFileName', 'C:\\temp\\dtsxsql\\2_ParseXML320.sql')
    (1862, 'Set Variable DeletePriorImportNodeFlag', 'C:\\temp\\dtsxsql\\2_ParseXML356.sql')
    (55, 'Shred XML to Node Table', 'C:\\temp\\dtsxsql\\2_ParseXML365.sql')
    (1011, 'Update LoadEndDatetime and XMLRecordCount in XMLProcessLog', 'C:\\temp\\dtsxsql\\2_ParseXML371.sql')
    (1060, 'Update LoadEndDatetime and XMLRecordCount in XMLProcessLog - Shred Failure', 'C:\\temp\\dtsxsql\\2_ParseXML382.sql')
    (675, 'Load object VariablesList (Nodes to process for each XML File Category)', 'C:\\temp\\dtsxsql\\2_ParseXML412.sql')
    (1175, 'Set Variable ZipProcessFlag - Has Zip Had A Prior Successful Run', 'C:\\temp\\dtsxsql\\2_ParseXML461.sql')
    (224, 'Set ZipProcessed Status (End Zip)', 'C:\\temp\\dtsxsql\\2_ParseXML474.sql')
    (238, 'Set ZipProcessed Status (Zip Already Processed)', 'C:\\temp\\dtsxsql\\2_ParseXML480.sql')
    (231, 'Set ZipProcessing Status (Zip Starting)', 'C:\\temp\\dtsxsql\\2_ParseXML486.sql')
    (609, 'Update LoadEndDatetime in ZipProcessLog', 'C:\\temp\\dtsxsql\\2_ParseXML506.sql')
    (613, 'Update ZipLog UnzipCompletedDateTime Equal to GETDATE', 'C:\\temp\\dtsxsql\\2_ParseXML514.sql')
    (1610, 'Update ZipProcessLog ExtractedFileCount', 'C:\\temp\\dtsxsql\\2_ParseXML522.sql')
    ('total sql code bytes', 14908)

The code snippets in this post are available in this Jupyter notebook: Extracting SQL code from SSIS dtsx packages with Python lxml. Download and tweak for your dtsx nightmare!

  1. I frequently run into SSIS packages that cannot be viewed on 4K monitors when fully zoomed out.
  2. Python’s readability is a major asset when disentangling mess-ware.
  3. Yes, I’ve railed about the word “refactoring” in the past but I’ve moved on and so should you. “A foolish consistency is the hobgoblin of little minds.”

Common Table Expression (CTE) SQLServer Queries with J

I’ve been blogging long enough to observe that your awesome posts are often ignored while your little “one-offs” sometimes strike hit gold. This is particularly true for “code example” posts. When I’m trolling for code — the geek equivalent of trolling for babes — I don’t want to read about the author’s programming philosophy or what the hell he thinks about the idiot in the White House. Just show me the code and shut up!

So, taking my own advice,  the following is an example of a SQLSever Common Table Expression (CTE) query.

WITH cte1
     AS (SELECT c.obj_id                            AS obj_id,
                a.created                           AS creation_date,
                Datediff(DAY, a.created, Getdate()) AS day_cnt
         FROM   [HIST].[dbo].[History] a
                JOIN [HIST].[dbo].[HistorySummary] b
                  ON a.rowid = b.rowid
                JOIN [HIST].[dbo].[ObjectNames] c
                  ON c.obj_name = b.obj_name),
     AS (SELECT cte1.obj_id AS obj_id,
                  WHEN cte1.day_cnt <= 30 THEN 1
                  ELSE 0
                END         AS d0,
                  WHEN ( cte1.day_cnt > 30 )
                       AND ( cte1.day_cnt <= 60 ) THEN 1
                  ELSE 0
                END         AS d1,
                  WHEN ( cte1.day_cnt > 60 )
                       AND ( cte1.day_cnt <= 90 ) THEN 1
                  ELSE 0
                END         AS d2,
                  WHEN ( cte1.day_cnt > 90 ) THEN 1
                  ELSE 0
                END         AS d3
         FROM   cte1)
SELECT obj_id,
       SUM(d0) AS OneMonth,
       SUM(d1) AS TwoMonths,
       SUM(d2) AS ThreeMonths,
       SUM(d3) AS Overdue
FROM   cte2
GROUP  BY obj_id

CTE queries essentially create temporary virtual tables during query execution. They are similar to nested SQL queries but are easier to code, more general, (see recursive CTE queries), and often perform better than their convoluted equivalents. This example creates two virtual tables cte1 and cte2 that are then used to compute a quick histogram.

You can call CTE queries with the J ODBC interface. The following assumes a SQLServer ODBC connection dsn history.

NB. odbc interface
require 'dd'

NB. read CTE query
HistoryAgeSQL=. read 'c:/temp/HistoryAge.sql'

NB. connect sqlserver database
ch =. ddcon 'dsn=history'

NB. select with CTE query
sh =. HistoryAgeSQL ddsel ch

NB. fetch results
data=. ddfet sh,_1

As a final note it’s worth comparing the SQL CTE histogram code with J equivalents.  The two following J verbs taken from the J wiki, (here and here), compute histograms.

NB. computes histograms uses right open intervals
histogram=:<:@(#/.~)@(i.@#@[ , I.)

NB. variation on (histogram) uses left open intervals
histogram2=:<:@(#/.~)@(i.@>:@#@[ , |.@[ (#@[ - I.) ])

They scale to tens of millions of data points; returning results in a few seconds on my laptop. The SQL CTE, shown above, takes about three seconds running on 180,000 row tables on my employer’s full warp servers.  If I could convince the masses to adopt languages like J a large part of my job would disappear. Fortunately, the world is hostile to terse elegance!