More J Pandoc Syntax HighLighting

Syntax highlighting is essential for blogging program code. Many blog hosts recognize this and provide tools for highlighting programming languages. WordPress.com (this host) has a nifty highlighting tool that handles dozens of mainstream programming languages. Unfortunately, one of my favorite programming languages, J, (yes it’s a single letter name), is way out of the mainstream and is not supported.

There are a few ways to deal with this problem.

  1. Eschew J highlighting.
  2. Upgrade1 your WordPress.com subscription and install custom syntax highlighters that can handle arbitrary language definitions.
  3. Find another blog host that freely supports custom highlighters.
  4. Roll your own or customize an existing highlighter.

A few years ago I went with the fourth option and hacked the superb open-source tool pandoc. The grim details are described in this blog post. My hack produced a customized version of pandoc with J highlighting. I still use my hacked version and I’d probably stick with it if current pandoc versions had not introduced must-have features like converting Jupyter notebooks to Markdown, PDF, LaTeX and HTML. Jupyter is my default thinking-things-through programming environment. I’ve even taken to blogging with Jupyter notebooks. If you write and explain code you owe it to yourself to give Jupyter a try.

Unwilling to eschew J highlighting or forgo Jupyter I was on the verge of re-hacking pandoc when I read the current pandoc (version 2.9.1.1) documentation and saw that J is now officially supported by pandoc. You can verify this with the shell commands.

pandoc --version
pandoc --list-highlight-languages

The pandoc developers made my day! I felt like Wayne meeting a rock star.

Highlighting J is now a simple matter of placing J code in markdown blocks like:

  ~~~~ { .j }
      ... code code code ...
  ~~~~

and issuing shell commands like:

pandoc --highlight-style tango --metadata title="J test" -s jpdh.md -o jpdh.html

The previous command generated the HTML of this post which I pasted into the WordPress.com Classic Editor. Not only do I get J code highlighting on the cheap I also get footnotes which, for god freaking sakes,2 are not supported by the new WordPress block editor for low budget blogs.

The source markdown used for this post is available here – enjoy!


NB. Some J code I am currently using to test TAB
NB. delimited text files before loading them with SSIS.

NB. read TAB delimited table files as symbols - see long document
readtd2s=:[: s:@<;._2&> (9{a.) ,&.>~ [: <;._2 [: (] , ((10{a.)"_ = {:) }. (10{a.)"_) (13{a.) -.~ 1!:1&(]`<@.(32&>@(3!:0)))

tdkeytest=:4 : 0

NB.*tdkeytest v-- test natural key columns  of TAB delimited text
NB. files.
NB.
NB. Many of the raw tables of the ETL process depend on  compound
NB. primary keys. This verb applies a basic  test of primary  key
NB. columns. Passing this test  makes it very  likely  the  table
NB. will load  without key constraint  violations.  Failures  are
NB. still possible depending  on how  text  data is converted  to
NB. other  datatypes. Failure of this test indicates  a very high
NB. chance of key constraint violations.
NB.
NB. dyad:  il =. blclColnames tdkeytest clFile
NB.
NB.   f0=. 'C:\temp\dailytsv\raw_ST_BU.txt'
NB.   k0=. ;:'BuId XMLFileDate'
NB.   k0 tdkeytest f0
NB.
NB.   f1=. 'C:\temp\dailytsv\raw_ST_Item.txt'
NB.   k1=. ;:'BuId ItemId XMLFileDate'
NB.   k1 tdkeytest f1

NB. first row is header
h=. 0{d=. readtd2s y

NB. key column positions
'header key column(s) missing' assert -.(#h) e. p=. h i. s: x

c=. #d=. }. d
b=. ~:p {"1 d

NB. columns unique, rowcnt, nonunique rowcnt
if. r=. c = +/b do.
  r , c , 0
else.
  NB. there are duplicates show some sorted duplicate keys
  k=. p {"1 d
  d=. d {~ I. k e. k #~ -.b
  d=. (/: p {"1 d) { d
  b=. ~:p {"1 d
  m=. +/b
  smoutput (":m),' duplicate key blocks'
  n=. DUPSHOW <. m
  smoutput 'first ',(":n),' duplicate row key blocks'
  smoutput (<p { h) ,&.> n {. ,. b <;.1 p {"1 d
  r , c , #d
end.
)

  1. The pay more option is always available.
  2. WordPress.com is beginning to remind me of Adobe. Stop taking away longstanding features when upgrading!

Extracting SQL code from SSIS dtsx packages with Python lxml

Lately, I’ve been refactoring a sprawling SSIS (SQL Server Integration Services) package that ineffectually wrestles with large XML files. In this programmer’s opinion using SSIS for heavy-duty XML parsing is geeky self-abuse so I’ve opted to replace an eye-ball straining[1] SSIS package with half a dozen, “as simple as possible but no simpler”, Python scripts. If the Python is fast enough for production great! If not the scripts will serve as a clear model[2] for something faster.

I’m only refactoring[3] part of a larger ETL process so whatever I do it must mesh with the rest of the mess.

So where is the rest of the SSIS mess?

SSIS’s visual editor does a wonderful job of hiding the damn code!

This is a problem!

If only there was a simple way to troll through large sprawling SSIS spider-webby packages and extract the good bits. Fortunately, Python’s XML parsing tools can be easily applied to SSIS dtsx files. SSIS dtsx files are XML files. The following code snippets illustrate how to hack these files.

First import the required Python modules. lxml is not always included in Python distributions. Use the pip or conda tools to install this module.

# imports
import os
from lxml import etree

Set an output directory. I’m running on a Windows machine. If you’re on a Mac or Linux machine adjust the path.

# set sql output directory
sql_out = r"C:\temp\dtsxsql"
if not os.path.isdir(sql_out):
    os.makedirs(sql_out)

Point to the dtsx package you want to extract code from.

# dtsx files
dtsx_path = r'C:\Users\john\AnacondaProjects\testfolder\bixml'
ssis_dtsx = dtsx_path + r'\ParseXML.dtsx'
ssis_dtsx

Read and parse the SSIS package.

tree = etree.parse(ssis_dtsx)
root = tree.getroot()
root.tag
'{www.microsoft.com/SqlServer/Dts}Executable'

lxml renders XML namespace tags like <DTS:Executable as {www.microsoft.com/SqlServer/Dts}Executable. The following
shows all the transformed element tags in the dtsx package.

# collect unique element tags in dtsx
ele_set = set()
for ele in root.xpath(".//*"):
    ele_set.add(ele.tag)    
print(ele_set)
print(len(ele_set))
{'InnerObject', '{www.microsoft.com/SqlServer/Dts}PrecedenceConstraint', '{www.microsoft.com/SqlServer/Dts}ObjectData', '{www.microsoft.com/SqlServer/Dts}PackageParameter', '{www.microsoft.com/SqlServer/Dts}LogProvider', '{http://schemas.xmlsoap.org/soap/envelope/}Envelope', '{www.microsoft.com/SqlServer/Dts}Executable', 'ExpressionTask', '{http://schemas.xmlsoap.org/soap/envelope/}Body', '{www.microsoft.com/SqlServer/Dts}Variable', '{www.microsoft.com/SqlServer/Dts}ForEachVariableMapping', '{www.microsoft.com/SqlServer/Dts}PrecedenceConstraints', '{www.microsoft.com/SqlServer/Dts}SelectedLogProviders', 'ForEachFileEnumeratorProperties', '{www.microsoft.com/SqlServer/Dts}ForEachVariableMappings', '{www.microsoft.com/SqlServer/Dts}ForEachEnumerator', '{www.microsoft.com/SqlServer/Dts}SelectedLogProvider', '{www.microsoft.com/SqlServer/Dts}DesignTimeProperties', '{www.microsoft.com/SqlServer/Dts}LogProviders', '{www.microsoft.com/SqlServer/Dts}LoggingOptions', '{www.microsoft.com/SqlServer/Dts}Variables', 'FEFEProperty', 'FileSystemData', 'ProjectItem', '{www.microsoft.com/SqlServer/Dts}Property', '{http://www.w3.org/2001/XMLSchema}anyType', '{www.microsoft.com/SqlServer/Dts}Executables', '{www.microsoft.com/sqlserver/dts/tasks/sqltask}ParameterBinding', '{www.microsoft.com/sqlserver/dts/tasks/sqltask}ResultBinding', '{www.microsoft.com/SqlServer/Dts}VariableValue', 'FEEADO', 'BinaryItem', '{www.microsoft.com/SqlServer/Dts}PackageParameters', '{www.microsoft.com/SqlServer/Dts}PropertyExpression', '{www.microsoft.com/sqlserver/dts/tasks/sqltask}SqlTaskData', 'ScriptProject'}
36

Using transformed element tags of interest blast over the dtsx and suck out the bits of interest.

# extract sql code in source statements and write to *.sql files 
total_bytes = 0
package_name = root.attrib['{www.microsoft.com/SqlServer/Dts}ObjectName'].replace(" ","")
for cnt, ele in enumerate(root.xpath(".//*")):
    if ele.tag == "{www.microsoft.com/SqlServer/Dts}Executable":
        attr = ele.attrib
        for child0 in ele:
            if child0.tag == "{www.microsoft.com/SqlServer/Dts}ObjectData":
                for child1 in child0:
                    sql_comment = attr["{www.microsoft.com/SqlServer/Dts}ObjectName"].strip()
                    if child1.tag == "{www.microsoft.com/sqlserver/dts/tasks/sqltask}SqlTaskData":
                        dtsx_sql = child1.attrib["{www.microsoft.com/sqlserver/dts/tasks/sqltask}SqlStatementSource"]
                        dtsx_sql = "-- " + sql_comment + "\n" + dtsx_sql
                        sql_file = sql_out + "\\" + package_name + str(cnt) + ".sql"
                        total_bytes += len(dtsx_sql)
                        print((len(dtsx_sql), sql_comment, sql_file))
                        with open(sql_file, "w") as file:
                            file.write(dtsx_sql)
print(('total sql code bytes',total_bytes))
   (2817, 'Add Record to ZipProcessLog', 'C:\\temp\\dtsxsql\\2_ParseXML225.sql')
    (48, 'Dummy SQL - End Loop for ZipFileName', 'C:\\temp\\dtsxsql\\2_ParseXML268.sql')
    (1327, 'Add Record to XMLProcessLog', 'C:\\temp\\dtsxsql\\2_ParseXML293.sql')
    (546, 'Delete Prior Loads in XMLProcessLog Table for Looped XML', 'C:\\temp\\dtsxsql\\2_ParseXML304.sql')
    (759, 'Delete Prior Loads to Node Table for Looped XML', 'C:\\temp\\dtsxsql\\2_ParseXML312.sql')
    (48, 'Dummy SQL - End Loop for XMLFileName', 'C:\\temp\\dtsxsql\\2_ParseXML320.sql')
    (1862, 'Set Variable DeletePriorImportNodeFlag', 'C:\\temp\\dtsxsql\\2_ParseXML356.sql')
    (55, 'Shred XML to Node Table', 'C:\\temp\\dtsxsql\\2_ParseXML365.sql')
    (1011, 'Update LoadEndDatetime and XMLRecordCount in XMLProcessLog', 'C:\\temp\\dtsxsql\\2_ParseXML371.sql')
    (1060, 'Update LoadEndDatetime and XMLRecordCount in XMLProcessLog - Shred Failure', 'C:\\temp\\dtsxsql\\2_ParseXML382.sql')
    (675, 'Load object VariablesList (Nodes to process for each XML File Category)', 'C:\\temp\\dtsxsql\\2_ParseXML412.sql')
    (1175, 'Set Variable ZipProcessFlag - Has Zip Had A Prior Successful Run', 'C:\\temp\\dtsxsql\\2_ParseXML461.sql')
    (224, 'Set ZipProcessed Status (End Zip)', 'C:\\temp\\dtsxsql\\2_ParseXML474.sql')
    (238, 'Set ZipProcessed Status (Zip Already Processed)', 'C:\\temp\\dtsxsql\\2_ParseXML480.sql')
    (231, 'Set ZipProcessing Status (Zip Starting)', 'C:\\temp\\dtsxsql\\2_ParseXML486.sql')
    (609, 'Update LoadEndDatetime in ZipProcessLog', 'C:\\temp\\dtsxsql\\2_ParseXML506.sql')
    (613, 'Update ZipLog UnzipCompletedDateTime Equal to GETDATE', 'C:\\temp\\dtsxsql\\2_ParseXML514.sql')
    (1610, 'Update ZipProcessLog ExtractedFileCount', 'C:\\temp\\dtsxsql\\2_ParseXML522.sql')
    ('total sql code bytes', 14908)

The code snippets in this post are available in this Jupyter notebook: Extracting SQL code from SSIS dtsx packages with Python lxml. Download and tweak for your dtsx nightmare!



  1. I frequently run into SSIS packages that cannot be viewed on 4K monitors when fully zoomed out.
  2. Python’s readability is a major asset when disentangling mess-ware.
  3. Yes, I’ve railed about the word “refactoring” in the past but I’ve moved on and so should you. “A foolish consistency is the hobgoblin of little minds.”