-
Hello rocky. Can I hire you? Make it support pyc 3.9... I can pay high salaries!
-
Periodically I intend to post status here towards the 3.9 decompiler (which won't be made public for a while), shooting towards the December target of allowing some 3.9 code (specifically code in comprehensions) to be decompiled. This weekend I made changes to https://github.com/rocky/python-control-flow/tree/loop-break-op to assist in classifying jumps whose target is either a loop or a for-loop instruction. With the addition of pseudo operations in the token stream, the parsing should be able to remove the ambiguity of rules like those described in "Use of a context-free grammar". Therefore reduction rules should not be needed. With this change, the first instruction of a while loop is made explicit in the token sequence. However, this change will require redoing the grammar in the private repository that uses this kind of control flow. Also needed is something similar for detecting a jump to the end of a Python block. Loop jumps are not in this category. Also not in this category would be a jump to, say, an "else" clause from inside another block.
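To make this concrete, here is a hedged sketch of the kind of disambiguation meant; the pseudo-token name LOOP_START and the rule shapes are hypothetical illustrations, not the actual grammar:

# Hypothetical SPARK-style grammar fragment. python-control-flow
# would emit a pseudo token (called LOOP_START here) before the
# first instruction of a loop:
#
#   while_stmt ::= LOOP_START testexpr while_body JUMP_LOOP
#
# Without such a pseudo token, the same instruction sequence also
# matches other rules that end in a backward jump, and an ad-hoc
# reduction-rule check is needed to tell them apart.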
-
I spent more time revising the decompilation grammar (for 3.8, for now) to make use of the extended loop pseudo operations that control-flow emits in the loop-break-op branch mentioned above. This all looks promising and much needed. However, it is also a bit time-consuming and tedious to do. There are just a lot of cases to consider and make small adjustments for. The hope is that if I take the pain in revising the 3.8 grammar, where I already have a good base to compare against, then the work of adapting it to 3.9 and beyond should be easier. Trying to do this directly on 3.9 is too difficult for me.
-
For this week, you will see a merge in the control-flow repository of the code mentioned last week. Jumps to loops and the loop targets are, I think, pretty well covered and carry more than enough information to classify them. It was a bit of tedious work to revise the decompile grammar to match these additional pseudo opcodes. What remains is a similar kind of task: classifying the non-loop jumps into those that jump to the joins or ends of blocks and those that don't. A jump to a "then", "else", or "finally" is not a jump to such a join. But the other non-looping jumps are jumps to the logical end of some control structure, such as the end of an "if/else", the end of a "for" or "while" loop, or the end of a "try". However, as much as I would like to get that under control (which will mean another tedious revision of the 3.8 decompile grammar), I paused that, since I said that I'd have some kind of 3.9 decompilation possible starting in December. So today the big news is that I've revised the private 3.8-3.9 decompiler so that it accepts Python 3.9 bytecode. Most of this was boilerplate stuff in scanner, parser, and semantic-action code to let it know that another bytecode version is allowed. In doing this, a bug was found in the control-flow GitHub project that prevented it from working across bytecode versions. So you'll see another commit in that repository to fix this. Right now, the most basic kinds of lambdas work. And in going over this, I learned that between 3.8 and 3.9 Python has yet again changed the call setup pattern. Python has changed these patterns probably 6 (or more) times. So when you hear that a decompiler "aims" to support all bytecode, that means it has to choose how to handle each of these many variations. Similarly, if you think you can take a 3.7-only decompiler and adapt it to 3.9, this is one of the things you'll have to adapt. And in the back of my mind I wonder: really, does it take this many tries to come up with a reasonable solution to calling functions? Many (most) compilers manage to get by with maybe 3 such changes. That said, I think (or hope) that the new convention is simpler to follow. And that said, I see that Python 3.11 has yet another new opcode here. The net effect, though, is that this will require a bit of work to revise for 3.9, and then it will have to get revised again for 3.11 (if not 3.10 as well). They say that "experience is what you get when you didn't get what you wanted" (here, for me, what I wanted was no more changes to the call setup code). So the experience of these code bases is that we have tried to isolate things so that we can handle n different ways of code setup for something as routine as calling a function.
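If you want to see this churn for yourself, here is a small check using only the standard library; run it under different CPython versions and compare the opcodes emitted for the same call:

import dis
import sys

# The opcodes used to set up this one call differ noticeably
# across CPython 3.x versions.
print(sys.version)
dis.dis(compile("f(a, b, *rest, key=1)", "<example>", "eval"))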
-
Rocky, will the new 3.8-3.9 decompiler be public? I want to help with testing and bug reports. I wish you good luck!
-
I have been meaning to give status for the last week or so. Decompiling 3.9 lambda functions has been going fine. So far the main change found between 3.8 and 3.9 has been in the code generated for calling a function, which uses a new opcode. I seem to recall bug reports in other decompilers asking for this to be "supported". I will let someone else check whether decompilation of 3.9 bytecode works on existing 3.9 decompilers when there are calls involving keyword arguments. (Note: I don't recall the details that much, but I think this changes again in 3.10 or 3.11 or both. Thanks, Python!) Right now, the main thing that is annoying is that control-flow dominator information between 3.8 and 3.9 is slightly different, and I don't yet see why that should be the case. Right now this is causing boilerplate grammar differences where it feels like there should be none. That needs to be understood, and if this is a bug in python-control-flow, it should be fixed. The next step is to get the branch joins, or "ends" of blocks, and the jumps to them classified so that more precise pattern matching for control flow works. This is conceptually simple: we need to use reverse dominator information at the join points to see that the reverse dominators dominate all of the branches to that join point. In practice, however, this is going to take code on both the python-control-flow side and the decompiler side. So it will take a while. In general, in the next revision of the decompiler, the goal is to reduce the vast number of reduction-rule checks involving control flow. This should make that aspect more reliable and faster, because it replaces multiple ad hoc scans of parse trees (which means multiple tree-fragment creations) with a single scan of the instructions, followed by standard basic-block and dominator information gathering.
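Since "reverse dominators" may be unfamiliar: they are just dominators computed over the reversed control-flow graph. Here is a minimal sketch of the reversal step (the graph encoding is assumed for illustration, not python-control-flow's actual API):

def reverse_cfg(succs):
    # succs maps block -> list of successors; feeding the reversed
    # graph to an ordinary dominator computation yields the reverse
    # (post-) dominators used in the join check described above.
    rev = {b: [] for b in succs}
    for b, successors in succs.items():
        for s in successors:
            rev[s].append(b)
    return rev

print(reverse_cfg({"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}))
# {'A': [], 'B': ['A'], 'C': ['A'], 'D': ['B', 'C']}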
-
Recently I've been mostly working on Mathics3, the open-source Mathematica-like project, to get out another release. This will probably go on for another few weeks. However, yesterday I prodded myself to look at the much-needed revision of decompyle3 using python-control-flow. It took me a while to get into the zone, but at the end of it I was kind of happy. For the geeky, here is a little test I did. First the source code:

lambda a, b, c, d: a or b or c or d

And now the 3.9 decompilation parse tree:
While from the outside this may seem like a trivial example, I think of this like solving a Rubik's cube: to the outsider it seems like not much headway is being made in unscrambling the thing. But to an insider who understands that there are subgroups to the group of permutations, what looks like meager progress in fact isn't. There is a lot I could write about this. But not now. Suffice it to say that, personally, I am rather pleased with the above, compared to the situation that currently exists in decompyle3. First, note that the decompile parse rules for "or" and "branch_op" end in a "BLOCK_END_JOIN". Roughly, you can think of this as a kind of "end" to some sort of "begin", and here the "begin" is the "or" expression. Even though "or" is algebraic logic, in its Python bytecode implementation "a or b or c or d" is more like this short-circuit sequence (in pseudo Algol):
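result := a; if result then goto join
result := b; if result then goto join
result := c; if result then goto join
result := d
join: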
Second, note that we have both basic block and dominator information. The arguments in the pseudo instructions are Python objects from python-control-flow. This just means that there may be more operations or things that can be done with those instruction objects than what is shown in the print representation above. The grammar that I have been using in uncompyle6 was largely designed in 2000-2003 (way before Python had defined an AST to suggest a pattern to follow). More importantly, there are mysterious bits about the grammar, from the beginning on, that have felt more heuristic than explanatory. So now, for the first time, I have been doing a more clean-room design of the grammar, slowly and carefully, and in a way that makes sense to me and that I can explain. The bad news, though, is that this is tantamount to an entire rewrite of the grammar, which is a slow and tedious process. But if that original grammar could serve the project for 20 years, I am hoping that this rewrite will serve the project for another 20 years.
-
The bulk of the Mathics3 6.0.0 release went out last week or so. I spent a little time going back to decompilation and moving forward on handling boolean expressions. Not much to report, other than that it is the kind of slow slog that I've been through before. However, it is satisfying to see that the grammar is properly tracking the ends and interiors of boolean expressions, and that the parse tree reflects this.
In the parse trees I am now getting, notice that BB_END belongs to the block with the jump, while "BB_START" is part of the construct that connects or gathers "and" or "or" expressions. And these are at different levels, or attached to different left-hand sides. This is different from all of the prior uncompyle/decompyle parse trees. This finally makes sense and is correct. I still have long slogs ahead before I finish off expressions and get to control flow. But it is my belief that boolean expressions in fact have more intricate and more complex variations than control flow, even though it doesn't seem that way when you look at Python programs. Since my goal right now isn't to have a complete decompilation program but instead a solid one, proceeding as I have been, completing the expression grammar subset before the statement grammar, makes things easiest to work on.
-
This weekend I spent a little more time. Basically, the direction is good. There is just a mountain of work. Right now, it looks like it is going to take a very long time to revise everything. But if that is done, the way things deparse will be on much more solid ground, and it will be easier to follow how we get what we got. I initially considered writing out detailed parse trees here, but I find these are too long and complicated. So below I simplify the parse trees so we can focus on the important difference. Suppose the decompiled code is:
The parse tree is roughly:
while before we had:
In the reduced example the difference seems really small and trivial. One difference is that we mark the end of a "block" in the new code, while in the old code we simply know that some instruction jumps to the instruction after the construct. Why is this important? Suppose instead we added an "if" condition:
Then, in the new parse, the parse structure would look like:
while in the old parse:
The important thing here is that we are tracking the number of block ends. There are two: one for the set comprehension, and one for the block associated with the "comprehension-if". In the current decompyle3 code, we only know that there is a jump to the instruction after the set comprehension. And it is possible to have situations where there is no jump to that instruction at all. In the experimental code, we track the block ends themselves and where they occur. In the old code, there was a lot of looseness and flexibility in which grammar rule a given jump attaches to. But this looseness comes with a price. This grammar flexibility (or imprecision) also allows many kinds of grammar parses that are not valid. So in decompyle3 and uncompyle6 we need to check every single reduction to see if the block ends align with the actual code. Generally this means we have to build the parse-tree fragment for the rule we are about to reduce, and then inspect parts of it to make sure that the jumps inside indeed go to the right instruction.
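To poke at the underlying jumps yourself, the comprehension's nested code object can be disassembled with just the standard library (this works on CPython 3.8-3.11; 3.12 inlines comprehensions, so the nested code object is gone there):

import dis

# The comprehension body compiles to a nested code object stored as
# the first constant of the enclosing code object.
code = compile("{x for x in xs if p(x)}", "<example>", "eval")
dis.dis(code.co_consts[0])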
-
The last few weeks have again been refactoring; basically, the optional tree-transformation phase is no longer optional. What this now does is turn a parse tree into something that is more like an abstract syntax tree. Recall that the trend in this rewrite has been to add pseudo operations for the basic block and dominator information. I think of this as sort of like adding parentheses and commas between instructions, so we know when something is nested as opposed to sequenced. In the very final phase, where we walk the tree to produce string output, all of these basic-block operations are extraneous — they don't contribute at all to the final output. After parsing, these pseudo tokens are unnecessary because all of this is captured inside the parse tree. One goal of these decompilers has always been clarity. (I am not saying we have succeeded, just that this is a goal.) As an aside, in my opinion one difference between this decompiler and other Python decompilers is that as things progress — that is, in the later code more than the earlier code — typically one can understand the reason why we do what we do. In earlier code there was more of the "it seems to work if we do things this way". It is true I can't fully comment on the other decompilers because I don't fully understand them and, frankly, not much has been written about how they work. However, when I look inside the code, there aren't comments that do much in the way of explanation. My limited understanding, though, is that they seem to walk instructions, maybe building up some trees along the way and maybe doing a little bit of symbolic interpretation and printing along the way. Back to the recent work. Aside from clarity, the other reason why a more abstract syntax tree is nice is that the final printing phase is not that well integrated with the parser or grammar rules. So all of this stuff has to be specified by hand: if a parse-tree node has 6 children, we generally have to refer to those children by number in the semantic actions that create the final string. If we can reduce the garbage nodes that don't have anything to do with the final constructs, say down to three children, that code is now simpler and less likely to change if we decide or need to change pseudo tokens. And that phase is clearer, since it allows me to focus more on the essentials with the noise stripped away. Here is an example of what I am talking about. First some Python:

lambda a, b, c: (a and b) or c

The parse tree for this lambda follows. Note that we can now ask to decompile only the lambda body and not the main program setup that calls it:
And the new "abstract" syntax tree is:
You will see above that there is still some residue left to clean up. As I have said numerous times, this is slow and tedious work.
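To give a concrete feel for the transform step, here is a minimal sketch; the node shape and the pseudo-token names are hypothetical stand-ins, not the actual decompyle3 data structures:

PSEUDO = {"BB_START", "BB_END", "BLOCK_END_JOIN"}

class Node:
    def __init__(self, kind, children=()):
        self.kind, self.children = kind, list(children)
    def __repr__(self):
        if not self.children:
            return self.kind
        return "%s(%s)" % (self.kind, ", ".join(map(repr, self.children)))

def strip_pseudo(node):
    # Drop basic-block pseudo tokens; the nesting they marked is
    # already captured by the tree structure itself.
    node.children = [strip_pseudo(c) for c in node.children
                     if c.kind not in PSEUDO]
    return node

tree = Node("or", [Node("expr"), Node("BB_END"),
                   Node("expr"), Node("BLOCK_END_JOIN")])
print(strip_pseudo(tree))   # or(expr, expr)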
-
I had to take some time off for my day job.

Roadmap

Things are progressing fine revising code for a new, more precise, more logical, and better documented grammar. As I mentioned above, I start with 3.8 and, after that works, I adapt for 3.9 and 3.10; in truth there has been very little that needs adapting for 3.9 or 3.10. That may come a little later, though. There is still a lot to do just to catch up to where I was before I added the current pseudo instructions that mark block sequencing versus block ends. So here is where things stand... I think that, for the most part, simple expressions — those with no control flow, like arithmetic operations and calls — are done. Earlier this week I finished going over boolean expressions. What's interesting here is that all of the control-flow issues in compound statements are found in this microcosm. Next I'll try the rest of the expressions that have control flow, such as:
Next are comprehensions:
Some of these I have done a little bit already. After this we basically have everything that can be seen in lambda expressions. Lambda expressions form a significant subgrammar of the entire Python grammar. After lambda expressions we have statements. All of the simple statements — those without control flow — are pretty much done. This includes class and function declarations. The compound statements that remain are:
I may have forgotten something above. At any rate, the takeaway here is that there is just a lot of work to do. And I am doing this part time, and somewhat on a whim.
-
It has been a little while since the last update. Here is what has been happening decompiler-wise. In moving forward on the grammar rewrite for Python 3.8-3.10, I was seeing a bug in the control-flow analysis for 3.10. Also, this code had been geared more towards library use — getting information from a function's code object — than standalone use, where one provides, for example, a Python bytecode file. Because of this, the code could make use of the xdis "dis"-module-compatible code. This weekend I've cleaned up the interfaces so that we can handle cross-version bytecode. However, there is still a 3.10 bytecode bug that I need to work out. Previously, before working on this but after getting blocked by the 3.10 control-flow bug, I was sidetracked working on Python 3.8 try/except handling. Apparently all of this is quite a bit different from Python 3.7. If you look at the other Python decompilers that work for both 3.7 and 3.8 (among other versions), you will find that they don't address the changes in 3.8 either.
-
Short status: a bug was found in detecting block-end joins in the python-control-flow project. I have to go over and make revisions, yet again.

Longer story: I was looking at the decompilation of a chained compare. Here is what I saw around that: the pseudo-token instructions show a block that does not fall through to the next block but instead returns a value. So the subsequent block is not a join point from two or more places. The fix is in python-control-flow. However, this honesty means that the chained-compare grammar rule has to handle an additional case. Although I haven't gotten to that yet, consider:

if x == 5:
    return 5
# join point?
return 6

A programmer generally thinks of an "if" block as having a join point where in other programming languages there would be an "end" of some sort. I have marked that above with "# join point?". But as we see from the code, although there may be a conceptual join from the "then" branch and from when the "if" condition is false, at the code level (for Python 3.8+) such a thing does not exist. For earlier Pythons, the code may be different, and there may indeed be jumps, so that this looks like a join of two branches. Since there is no join point, a logical consequence is that we can't tell the difference between what we wrote above and:

if x == 5:
    return 5
else:
    return 6

And indeed, this is truly the case. It is then up to the grammar to have a rule covering which kind of "if" to use (or something else). The grammar can look at the context and instructions to decide. It is possible that in the bytecode for a given Python version there is some additional quirk in the code that may tip things in favor of one way or the other. I invite folks to see if there is indeed a version of Python that generates exactly the same code for both, and if not, what the differences are. But the decision of whether this quirk is used to distinguish "if" from "if/else" is left up to the parser. As I said above, though, the grammar will have to have a rule for both.
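For anyone taking up that invitation, here is a small self-contained check using only the standard library (the function names are mine); any difference found (for example, a trailing dead "return None") is the kind of version-dependent quirk a grammar might exploit:

import dis

def implicit_join(x):
    if x == 5:
        return 5
    return 6

def explicit_else(x):
    if x == 5:
        return 5
    else:
        return 6

# Compare the raw opcode streams of the two variants.
same = implicit_join.__code__.co_code == explicit_else.__code__.co_code
print("identical bytecode?", same)
if not same:
    dis.dis(implicit_join)
    dis.dis(explicit_else)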
And because of this correction in the control flow, I now have to go back to all of the rules that assumed the erroneous join points existed and matched them in the grammar. This is tedious, and it means going back over a number of grammar rules. However, as I have said many times before, the good news is that so far I can look at the parse tree with its additional pseudo-instruction tokens and demonstrate to myself that the parse is logical and matches how a person might reason when seeing the same instructions. In other words, there is a rigid logic that is followed where previously there had been looseness and vagueness. The more I go over the grammar, the more I believe it will make logical sense. Over time I can also make the names more logical too.
-
Although recently I have worked a little on decompyle3 and the unpublished follow-on decompiler, most of my focus has been on a new idea I had for improving xdis disassembly output. See this xdis example for what I mean by improving disassembler output. Here, I will explain why I think this is useful and important. From a user's perspective, the reality is that Python decompilers have not been keeping up with current Python versions. So there is value in having better disassemblers, ones that do more in the way of reconstructing the source code; a lot of that does not change much from version to version. From my perspective, though, seeing what can easily and reliably be done at disassembly time further clarifies the role of the disassembler versus the decompiler. But it goes further than that... One of the weaknesses of using a context-free grammar is in parsing variable-length instructions. Recall that virtually all bytecode is laid out in reverse Polish order, so an operator comes after all of its operands. But parsing reads the other way. The problem with variable-length opcodes is that we do not know how many operands there are until we hit the operator. One way to address this is to add operator "begin" marks for variable-length instructions. This can be done as part of the tokenization process. But it could also be done as part of disassembly, since this information is also useful in determining how to show the higher-level expression information that you see in the cited example. Finally, I will note that I submitted a paper to the BlackHat conference. Early in September I will find out if the talk proposal is accepted. I rarely give talks, but when I do it is related to some open-source project with a topic that I think deserves a wide audience. The last talk I gave, if I recall correctly, was in 2018. A side effect of presenting a talk is that it tends to focus my attention on the open-source project I will be presenting on. If you have suggestions for conferences on decompiler technology, or the use of these projects in particular, let me know.
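Here is a hedged sketch of the operator-begin idea; the Token shape and the CALL_BEGIN name are made up for illustration, and the simplifying assumption that every non-call instruction pushes exactly one value is loudly wrong for real bytecode (a real version must track stack effects per opcode):

from collections import namedtuple

Token = namedtuple("Token", "opname arg")

def add_call_begin_marks(tokens):
    # starts[i] is the output index where the i-th stacked value's
    # computation begins, so on CALL_FUNCTION n we can insert a
    # begin mark just before the callee's code.
    out, starts = [], []
    for tok in tokens:
        if tok.opname == "CALL_FUNCTION":
            n = tok.arg                   # positional argument count
            begin = starts[-(n + 1)]      # where the callee's code begins
            out.insert(begin, Token("CALL_BEGIN", n))
            starts = [i + 1 if i >= begin else i for i in starts]
            del starts[-(n + 1):]         # call consumes callee + n args
            starts.append(begin)          # result's code begins at the mark
        else:
            starts.append(len(out))       # assume one pushed value per op
        out.append(tok)
    return out

toks = [Token("LOAD_NAME", "f"), Token("LOAD_CONST", 1),
        Token("LOAD_CONST", 2), Token("CALL_FUNCTION", 2)]
for t in add_call_begin_marks(toks):
    print(t.opname, t.arg)
# CALL_BEGIN 2, LOAD_NAME f, LOAD_CONST 1, LOAD_CONST 2, CALL_FUNCTION 2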
-
Here's what's up with moving the currently non-public decompiler for 3.8, 3.9, and 3.10 forward. There was a bug (and there is still more work to do) in computing dominators in control flow. It is important to catch such things early, or else the grammar that looks for patterns will get more complicated as it tries to compensate for the bugs. This is a losing proposition. I fear that some of this pollution is already in the grammar. I have marked it suspicious so that later I can come back and address it. Here is one way I have cleaned up unused grammar rules in the past: I instrumented the parser to save to disk counts of the rules it uses. If a grammar rule is not used after this, it probably can be deleted. Or I should be able to find a test that uses the grammar rule. One thing I have come to learn in fixing up the dominator algorithm is that while there are sample bits of code for computing dominators, there is not much in the way of tests that the dominator code is correct. If someone can point me to a test suite, I would be most interested. Independent of that, I will probably have to devise my own test suite. Sigh. Even if there is a test suite, I would like one that covers the kinds of things found in programs, particularly Python programs, which have their own weird control-flow structures. Another thing I have come to appreciate is that possibly better than the actual control flow is a weaker version of it: one that adds edges that are not in the actual control flow, but that mimic the block structure that must appear in the source code. Let me give an example to clear this up. Suppose my program is:

# Basic block 1
if x:
    # Basic block 2
    some_computation()
    return
elif y:
    # Basic block 3
    raise
# Basic block 4
some_other_computation()
Because of the this, if we do want to recognize parsing as the first way because that is more natural, we need two grammar rules to cover the "if/else" statement. One where both the "then" and the "else" leave the function, and one on the more normal case where things jump to the end. And possibly two more for whether there just one of the "then" or "else" jumps or flows through to the end. If instead, we were to add artificial edges from then ends of each branch, we would not need these additional grammar rules. For now I am opting for precision, so I am adding the additional rules. Someone has privately sponsored the project and asked for some hand 3.11 bytecode decompilation. So I have been doing my first 3.11 hand bytecode decompilation. It has been going along fairly easily since the code is not too long, and the code is not convoluted like much of the code submitted in bug reports. Decompiling Python 3.11 here seems pretty straight-forward. However there are parts that are just different than 3.10. I understand 3.12 will also change things around a bit too. This leads me to the observation I noticed a while ago. Any decompiler that hopes to decompile all Python bytecode versions or even just those in the 3.x series, needs to seriously think about how to modularize and isolate version in more way than just adjusting bytecode. Idioms can vary wildly between versions. So even if a single version idioms are straight-forward, there can be a lot of difference between the idioms of two consecutive versions. |
-
The state of Python 3.9+ decompilers circa 2022
A decent decompiler for Python 3.9 bytecode and above is something that people ask for over and over. Clearly, there is a great desire for it. And on some occasions, a great and legitimate need.
In the uncompyle6/decompyle3 projects, although some support has been shown, on a financial level it comes nowhere near the level of remuneration that can be gotten by a person doing any number of garden-variety programming tasks. Still, there continues to be some support.
Also, as far as I can tell, none of the few projects that have declared intent have shown any measurable progress towards a decompiler for newer Python bytecode. Or if there are reasonable decompilers, they are being kept secret.
I do not consider the effort in either pycdc or the port of an abandoned Python 3.3 decompiler to 3.7 (then modified slightly for 3.9 bytecode as though everything else were the same) to be up to the challenge. If you do, then great: use those or work on those, and you can stop reading here.

The state of Uncompyle6 & Decompyle3 project help circa 2022
Past experience from working on this for the last 7 years is that few have been contributing to uncompyle6 or decompyle3. The few contributors that come along don't stay for any length of time. That's not a criticism, and it is not surprising given the amount of work needed and the amount of reward offered.

But bug reports come in that indicate people use the code, and not just casually. Yet too few are willing to contribute to simplifying bug reports, even to the extent of narrowing or isolating bugs to make it easier for someone more knowledgeable to address and fix them. For my part, this is an endless time sink for something that I have very little personal use for.
In other words, all of my activity on the project feels like the thing one typically pays someone to do; I am doing work that others are unable or unwilling to do, mostly for their benefit. The only justification I can see for why people feel this is "normal" is that this is open-source software. Or maybe this follows the principle that there is no harm in asking, and that asking takes very little effort.
But this doesn't mean I want to contribute my time for free forever at the expense of doing the vast number of other things I could be doing or want to do.
Going forward
In order to support Python 3.9 and above, updated control-flow technology is needed. (The parser and overall system could use an upgrade, but those aren't falling behind so the need isn't as dire there.)
I have been working on both of these decompilers here and there. And I have been working on a new one as well. However, I don't feel compelled to make that public just yet, especially given the history.
I think I have been pretty open about this, but in case you missed it, take this as a straightforward and clear statement.
I do not intend to make this repository public until it feels right. And that may take a while.
However, as I have been asking people to show support for the project, I offer thanks and do want to have some way of showing that work on this is happening.
So my current suggestion and plan is to allow people to send me 3.9 bytecode (starting this December), and I will decompile portions of that and send it back. That way you can know that there is work going on.
Again, I have been working on a private rewrite of decompyle3. Yes, even though decompyle3 was a break-off and purported to be a rewrite of uncompyle6, alas, even that didn't go far enough with new technology. Currently, that private rewrite targets 3.8 to make things easier for me. So far the changes to support 3.9 and 3.10 have been easy.
Handling 3.11 and beyond will take a lot more work though.
The private decompiler currently handles comprehensions and lambdas very well. (Again, I have automated means for checking this)
So after extending that, it would start off by handling something similar for 3.9 bytecode. By December or so, I expect I'll be able to handle some limited form of comprehensions, say set comprehensions, but not list comprehensions, generators, or lambdas.
And then depending on support and will, I'll just keep expanding that. In conjunction I'll probably expand the 3.8 aspect of that decompiler to various kinds of statements.
The next thing up is to start handling statement loops in 3.8 using the newer control-flow technology. I have been working on other projects on the weekends, which is basically the only time I have for this.