Python Source Obfuscation using ASTs

Introduction

For one of the challenges of the Hack in the Box Capture the Flag
game last week, I decided to release an obfuscated and compiled Python class.
After doing some research on the internet about this particular topic, it
appeared there is no real up-to-date tool for this. I mostly found paid
software and/or software that has been outdated for several years, or at least
looks like it is.

Well, that’s great. This allows me to make something new-ish.

The Actual Challenge

The challenge itself was rather easy: given a team name and a flag on the
command line, the Python script verifies whether the flag is correct. By
placing a few print statements in the right spots, the challenge can be
solved within minutes. Which is why we obfuscate it! After all, we obviously
want the teams playing our CTF to get some headaches :)

Abstract Syntax Trees

According to Wikipedia, an Abstract Syntax Tree is a tree representation of
the abstract syntactic structure of source code written in a programming
language. In other words, an AST represents the original source code as a
tree.

Fortunately for us, Python provides a built-in ast module which is able to
parse Python source into an AST (under the hood it uses the built-in compile()
function.) Besides that, the ast module gives us access to all available AST
node types (e.g., Call, BinOp, etc.)
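
To get a feel for what such a tree looks like, here is a minimal sketch that parses a one-liner and lists the node types it contains. It assumes Python 3 syntax (the post itself targets Python 2), and the sample source is made up for illustration.

```python
import ast

# Python 3 syntax is assumed here; the post itself targets Python 2.
source = "print('Hello AST')"

tree = ast.parse(source)   # returns an ast.Module node
print(ast.dump(tree))      # textual representation of the whole tree

# ast.walk() yields every node in the tree, in no particular order.
node_types = [type(node).__name__ for node in ast.walk(tree)]
print(node_types)
```

The dump shows the nesting (a Module containing an Expr containing a Call, and so on), which is handy when deciding which visit_ functions an obfuscator needs.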

With this knowledge, we’re able to rewrite the original AST using the
ast.NodeTransformer class. For a brief example of what this
looks like, please refer to the official documentation.

Finally, after rewriting the AST, we can do two things. We can generate a
compiled Python object directly from the AST. Unfortunately, I did not find a
way to do this (or a library, for that matter); if you know of one, feel free
to point me to it :) The other option is to generate Python code from the AST
again and to compile it from there. For this step, one would use the
codegen.py module. (Note that I submitted a pull request, as the current
version gave me an error with regards to the omission of parentheses for
binary operations.)
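
For reference, the built-in compile() function accepts an AST object in place of a source string, so a code object can be produced straight from a tree. The following is only a minimal sketch, assuming Python 3.9 or later (needed for ast.unparse, which turns a tree back into source text); the sample source and variable names are made up.

```python
import ast

source = "result = 6 * 7"

tree = ast.parse(source)
# Nodes created by hand lack line numbers; this fills them in so that
# compile() accepts the tree. It is harmless on an unmodified tree.
ast.fix_missing_locations(tree)

# compile() takes an AST object in place of a source string.
code = compile(tree, filename="<ast>", mode="exec")

namespace = {}
exec(code, namespace)
print(namespace["result"])

# On Python 3.9+, ast.unparse() turns a tree back into source text.
round_trip = ast.unparse(tree)
print(round_trip)
```

When the tree has been rewritten by a NodeTransformer, the call to ast.fix_missing_locations is what keeps compile() from complaining about nodes without position information.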

We’re now at the point where we can parse Python code into an AST, rewrite it,
and write a new Python source from the rewritten AST. The final step for my
challenge was to compile the created Python source into an object, which can
be done by executing the following command on the commandline. (I’m sure it’s
also possible using a Python function, but this works just fine for the
moment.)

$ python -mcompileall .
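
The same step can be done from within Python using the standard py_compile module. This is a rough sketch, assuming Python 3; the file name example.py and the temporary directory are placeholders, not the challenge files.

```python
import os
import py_compile
import tempfile

# Write a throwaway module so the example is self-contained.
workdir = tempfile.mkdtemp()
source_path = os.path.join(workdir, "example.py")  # hypothetical file name
with open(source_path, "w") as handle:
    handle.write("print('Hello AST')\n")

# Compile one file; cfile chooses where the bytecode is written.
compiled_path = py_compile.compile(
    source_path, cfile=source_path + "c", doraise=True
)
print(compiled_path)
```

For a whole directory, compileall.compile_dir() is the function behind the command shown above.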
    

Decompilation

Before diving into the obfuscations I performed on the AST, I’d like to note
that the compiled Python object can be decompiled using Mysterie’s
python decompiler, uncompyle2.

Obfuscation through ASTs

So basically I only did a few simple obfuscations, which already proved to be
painful enough, but it’s a nice start for anyone that’s looking into doing
something similar.

The ast.NodeTransformer class, which was mentioned earlier in this blogpost,
allows one to visit each AST node in the tree, with the possibility to
modify or delete them. We can do this by implementing visit_ functions named
after the node type. For example, in order to analyze/modify/delete certain
Name nodes, which are used for variable lookups etc., we implement a
visit_Name function in our obfuscation class (which, btw, extends
ast.NodeTransformer.)
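
As an illustration of a visit_Name function, here is a small sketch that renames identifiers according to a fixed mapping. It uses ast.NodeTransformer (a subclass of ast.NodeVisitor) and assumes Python 3.9 or later; the NameRenamer class and the names in the mapping are hypothetical and not taken from the actual obfuscator.

```python
import ast

class NameRenamer(ast.NodeTransformer):
    """Rename identifiers according to a fixed mapping (hypothetical helper)."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        # Called for every Name node, whether it is read or assigned.
        if node.id in self.mapping:
            return ast.copy_location(
                ast.Name(id=self.mapping[node.id], ctx=node.ctx), node
            )
        return node

tree = ast.parse("secret = 42\nprint(secret)")
tree = NameRenamer({"secret": "qzxv"}).visit(tree)
renamed_source = ast.unparse(tree)
print(renamed_source)
```

A real obfuscator would first have to work out which names are safe to rename (builtins such as print, for instance, are Name nodes too).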

Modifying AST nodes using the NodeTransformer

The NodeTransformer can modify an existing AST in a fairly simple way. By
returning the original node, the AST remains untouched, as is shown in the
following example code.

from ast import NodeTransformer

class Example01(NodeTransformer):
    def visit_Str(self, node):
        return node
    

One can modify an AST node by returning a new node. For example, to replace
all strings with an empty string, see the following snippet.

from ast import NodeTransformer, Str

class Example02(NodeTransformer):
    def visit_Str(self, node):
        return Str(s='')
    

And, finally, to delete a node, simply return None in the visit_ function,
although this can give weird situations in which the new AST is not valid
anymore.
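
A minimal sketch of node deletion, assuming Python 3.9 or later: the hypothetical AssertRemover below drops every assert statement, and the second half shows one of those weird situations, where the removed statement was the only one in a function body.

```python
import ast

class AssertRemover(ast.NodeTransformer):
    def visit_Assert(self, node):
        # Returning None removes the statement from its parent's body.
        return None

source = "x = 1\nassert x == 1\nprint(x)"
tree = AssertRemover().visit(ast.parse(source))
stripped_source = ast.unparse(tree)
print(stripped_source)

# A body consisting only of the removed statement is left empty,
# which is no longer a valid tree for compile().
broken = AssertRemover().visit(ast.parse("def f():\n    assert True"))
body_is_empty = broken.body[0].body == []
print(body_is_empty)
```

One way around the empty body is to return a Pass node instead of None.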

Example Obfuscation – Strings

In the AST, constant String nodes are represented with Str nodes. These
Str nodes have one interesting field, namely the s field, which contains
the actual string. For example, in the AST of the following Python source,
there will be exactly one Str node with the s field set to “Hello AST”.

print 'Hello AST'
    

For the challenge, I implemented a handful of simple string obfuscations.
Take for example the following code (rewritten a bit, but similar to the code
in the obfuscator.)

from ast import NodeTransformer, BinOp, Str, Add

class StringObfuscator(NodeTransformer):
    def visit_Str(self, node):
        # Split the string in half and join the halves with an addition.
        half = len(node.s) // 2
        return BinOp(left=Str(s=node.s[:half]),
                     op=Add(),
                     right=Str(s=node.s[half:]))
    

Noteworthy in this example code is that BinOp is a node representing a
binary operation, in this case addition (because of the Add node.) A binary
operation takes a left operand and a right one. On the left we put the first
half of the actual string, and on the right we put the second half of the
string. When running this “obfuscator” on our example once, we get the
following code. (Note that you can run such an obfuscator multiple times to
achieve extra painful code. This is what I did for the challenge :p)

print ('Hell' + 'o AST')
    

Other string obfuscations included reversing a string, i.e., "abc" ->
"cba"[::-1], and converting single-character strings (which you’ll get soon
enough when recursively running the obfuscator a few times) into a chr()
call (i.e., "a" -> chr(0x61).)
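
These two rewrites can be sketched as follows. Note the assumptions: this targets Python 3.9 or later, where string literals are Constant nodes rather than the Python 2 Str nodes used in this post, and the class name StringObfuscator3 is made up. It is not the code from the actual obfuscator.

```python
import ast

class StringObfuscator3(ast.NodeTransformer):
    """Sketch for Python 3.9+, where string literals are Constant nodes."""

    def visit_Constant(self, node):
        if not isinstance(node.value, str) or not node.value:
            return node
        if len(node.value) == 1:
            # "a" -> chr(97)
            return ast.Call(
                func=ast.Name(id="chr", ctx=ast.Load()),
                args=[ast.Constant(value=ord(node.value))],
                keywords=[],
            )
        # "abc" -> "cba"[::-1]
        return ast.Subscript(
            value=ast.Constant(value=node.value[::-1]),
            slice=ast.Slice(
                lower=None,
                upper=None,
                step=ast.UnaryOp(op=ast.USub(), operand=ast.Constant(value=1)),
            ),
            ctx=ast.Load(),
        )

tree = ast.parse("greeting = 'Hello AST'\nletter = 'a'")
tree = ast.fix_missing_locations(StringObfuscator3().visit(tree))
obfuscated_source = ast.unparse(tree)
print(obfuscated_source)

# The rewritten tree still evaluates to the original strings.
namespace = {}
exec(compile(tree, "<ast>", "exec"), namespace)
print(namespace["greeting"], namespace["letter"])
```

The sketch ignores f-strings and docstrings, both of which contain string constants that cannot be replaced this blindly.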

The Obfuscated Challenge

After running the original challenge a few times through
the obfuscator, which, in addition to obfuscating strings, also
obfuscates integers, import statements, and global variable names, we get
our actual challenge.
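
To give an idea of the integer and attribute rewrites, here is a rough sketch. The patterns are inferred from the obfuscated output quoted in the comments below (e.g. (0 * 237) + 34 and getattr(module, 'sleep')), not copied from the obfuscator's source. The class names, the range for k, and the sample input are assumptions, and the code targets Python 3.9 or later.

```python
import ast
import random

class IntObfuscator(ast.NodeTransformer):
    """Rewrite n as (q * k) + r, where q, r = divmod(n, k)."""

    def visit_Constant(self, node):
        # bool is a subclass of int, so exclude True/False explicitly.
        if type(node.value) is not int:
            return node
        k = random.randint(128, 255)  # arbitrary range, an assumption
        q, r = divmod(node.value, k)
        product = ast.BinOp(left=ast.Constant(value=q), op=ast.Mult(),
                            right=ast.Constant(value=k))
        return ast.BinOp(left=product, op=ast.Add(),
                         right=ast.Constant(value=r))

class AttrObfuscator(ast.NodeTransformer):
    """Rewrite obj.attr reads as getattr(obj, 'attr')."""

    def visit_Attribute(self, node):
        self.generic_visit(node)  # handle nested attributes such as a.b.c
        if not isinstance(node.ctx, ast.Load):
            # Assignment targets stay as they are: a call cannot be assigned to.
            return node
        return ast.Call(func=ast.Name(id="getattr", ctx=ast.Load()),
                        args=[node.value, ast.Constant(value=node.attr)],
                        keywords=[])

source = "import math\nvalue = math.floor(3.7) + 40"
tree = ast.parse(source)
tree = AttrObfuscator().visit(tree)
tree = IntObfuscator().visit(tree)
ast.fix_missing_locations(tree)
obfuscated_source = ast.unparse(tree)
print(obfuscated_source)

namespace = {}
exec(compile(tree, "<ast>", "exec"), namespace)
print(namespace["value"])
```

Leaving attribute nodes in a store context untouched matters: rewriting the left-hand side of an assignment into a getattr() call produces code that does not compile.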

And, yes, running the obfuscator several times does indeed look like the
following.

$ python hitbctfobf.py hitbctforig.py|python hitbctfobf.py -|...
    

Outro

Having pasted the original challenge in the blogpost, there’s not much left of
the challenge itself. However, I found the methods behind the obfuscation
fairly interesting, and perhaps so does somebody else.. :)

14 thoughts on “Python Source Obfuscation using ASTs”

  1. Partial reversal of your obfuscation: https://gist.github.com/iksteen/89cd25c753ac91ff785b

    It doesn’t de-obfuscate the import statements because I currently don’t have time to write it. A method to do so would be to first scan the AST for assignments where the right-hand side is a call to __import__, then store the mapping of left-hand id -> module name in a dict and replace occurrences of the left-hand name with the import name during the translation phase.

  2. Pingback: AppSec: Myths about Obfuscation and Reversing Python « Simon Roses Femerling – Blog

  3. Followed all the instructions… (i.e. Downloaded your code from http://jbremer.org/wp-posts/hitbctfobf.py and your modified codegen module from https://github.com/andreif/codegen/pull/4/files)

    Ran it on the following code: (python ./pyastobfuscate.py test1.py > test1-obf.py)

    #!/usr/bin/env python
    
    import os
    import time
    import test2
    
    time.sleep(1)
    os.system("echo \"test2.teststr = " + test2.teststr + "\"")
    

    It generated the following code:

    getattr(sjmtzucx, 'sleep')(((((0 * 149) + 0) * ((0 * 150) + 123)) + ((0 * 145) + 1)))
    getattr(aulyxosqs, ''.join((gthq for gthq in reversed('metsys'))))(((''.join((xtocewpr for xtocewpr in reversed(' = rtstset.2tset" ohce'))) + getattr(yusbnby, ('tes' + 'tstr'))) + chr(((0 * 237) + 34))))
    

    … which when I attempted to run it produced the following error?

    Traceback (most recent call last):
      File "./test1-obf.py", line 1, in <module>
        getattr(sjmtzucx, 'sleep')(((((0 * 149) + 0) * ((0 * 150) + 123)) + ((0 * 145) + 1)))
    NameError: name 'sjmtzucx' is not defined
    

    What am I missing here? Also it isn’t clear to me how to apply this to a collection of modules, some of which import the others?

    Would appreciate some clarification thanks…

    • The obfuscator requires the following line to indicate the main routine, i.e., not function declarations etc:

      if __name__ == '__main__':
      

      Thus your sample code should be obfuscated correctly when written as follows (I haven’t tested that though):

      import os
      import time
      import test2
      
      if __name__ == '__main__':
          time.sleep(1)
          os.system('echo "test2.teststr = ' + test2.teststr + '"')
      
  4. Thanks for your prompt reply…
    I tested your changes and all worked fine.

    I then modified the “test2.py” module accordingly:

    #!/usr/bin/env python

    if __name__ == "__main__":
        teststr = "test"

    …which resulted in the following error:

    Traceback (most recent call last):
      File "./test1-obf.py", line 9, in <module>
        getattr(glnx, ('sys' + 'tem'))(((' = rtstset.2tset" ohce'[::((-1 * 140) + 139)] + getattr(lxjg, 'teststr')) + chr(((0 * 250) + 34))))
    AttributeError: 'module' object has no attribute 'teststr'

    This implies a kind of “ordered dependency” scenario when attempting to obfuscate a collection of modules, some of which import others.
    The order in which modules are obfuscated becomes critical. i.e. Firstly obfuscate all modules that are imported by others and who import no others.
    Then obfuscate modules that only import the previous already obfuscated group. Continue this process iteratively until all modules have been obfuscated.
    Of course this approach will fail if there are any “cyclic dependencies”.

    Do you think I am on the right track here, or have I completely missed the point?
    I’m also not sure how to deal with imported modules that are simply collections of methods and variables and which execute no code?

  5. BTW: I wrote a script to apply my suggested obfuscation order and noticed that the obfuscator was removing all the “except:” statements?

  6. FYI: The missing “except:” statements were due to a minor bug in the “codegen.py” module… on line 566, “visit_excepthandler” should be “visit_ExceptHandler”.

    Now I am getting a lot of these types of syntax errors: “can’t assign to function call” wherever the obfuscated code has generated something like this:
    “getattr(jxjz, ‘currenttime’) = …” which was generated from “common.currenttime = …”. This looks as if the imports obfuscation isn’t considering assignment to imported variables?

  7. Belatedly realised that to properly obfuscate a collection of related modules I would also need to track reference changes to imported methods and variables… giving up on this approach now… Thanks for your time.
