add support for run-bug-run #39
https://github.com/giganticode/run_bug_run_data/releases/tag/v0.0.1 Seems like the first release is out.
Happy to take this on.
Sounds good, let me know if you have any questions!
I started looking into this yesterday. A few things about RunBugRun:
When you integrate the benchmark, you define which commands to run when compiling/testing each bug/patch. The only part that is currently hard-coded for Java is the extraction of single functions, removal of comments, etc. See https://github.com/ASSERT-KTH/elle-elle-aime/tree/master/elleelleaime/core/utils/java To integrate a Python benchmark, you'll need to implement similar functions for Python (or even better, using tree-sitter to support more languages).
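For illustration, a minimal sketch of the tree-sitter approach suggested above might look like the following; it assumes the `tree_sitter` and `tree_sitter_python` packages, and the exact `Language`/`Parser` construction varies slightly between versions:

```python
# Sketch only: extract top-level function definitions from Python source with
# tree-sitter, analogous to the Java single-function extraction linked above.
import tree_sitter_python as tspython
from tree_sitter import Language, Parser

PY_LANGUAGE = Language(tspython.language())
parser = Parser(PY_LANGUAGE)


def extract_functions(source: str) -> dict[str, str]:
    """Map each top-level function name to its source text."""
    src = source.encode("utf8")
    tree = parser.parse(src)
    functions = {}
    for node in tree.root_node.children:
        if node.type == "function_definition":
            name_node = node.child_by_field_name("name")
            name = src[name_node.start_byte:name_node.end_byte].decode("utf8")
            functions[name] = src[node.start_byte:node.end_byte].decode("utf8")
    return functions
```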
Sharing progress so far: #166. I'm a little unclear on Bug.failing_tests -- does it map test methods to the resulting error message? In RunBugRun there are simply test inputs and expected outputs, and the buggy code is not always a self-contained function. Also, does the ground_truth diff only come into play when evaluating the LLM-generated fix? Why not simply check whether the tests pass? Similarly, I'm not sure I'm using the checkout logic correctly -- it seems like a drag to have to make a copy each time, so instead I simply read from the original buggy file. Any feedback/corrections welcome!
Exactly, it maps fully qualified test method names to the error messages.
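For illustration only (the test name and message below are invented), the mapping looks roughly like:

```python
# Illustrative example: fully qualified test method name -> observed error message.
failing_tests = {
    "org.example.FooTest::testBar": "junit.framework.AssertionFailedError: expected:<1> but was:<2>",
}
```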
I see the solution you came up with, and that seems reasonable. The only problem will be in extracting the test case (see https://github.com/ASSERT-KTH/elle-elle-aime/blob/master/elleelleaime/core/utils/java/java.py#L269). This means that we need to add a special case for RunBugRun here.
The ground_truth diff is used in two places right now:
Executing tests to check is great, but there is a known problem in program repair called patch overfitting: patches that pass the tests but differ from what the developer intends (see, e.g., "Is the Cure Worse Than the Disease? Overfitting in Automated Program Repair"). For this reason, we use the ground-truth patch as a reference in some evaluation metrics like exact-match or ast-match.
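As a rough sketch of the idea (not the repository's actual implementation), an exact-match style metric compares the candidate fix against the ground-truth fixed code instead of relying on tests alone:

```python
# Sketch only: an "exact match" style check normalizes whitespace and compares
# the candidate fix against the ground-truth fixed function.
import re


def normalize(code: str) -> str:
    return re.sub(r"\s+", " ", code).strip()


def exact_match(candidate_fix: str, ground_truth_fix: str) -> bool:
    return normalize(candidate_fix) == normalize(ground_truth_fix)
```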
It's important to have that logic (every checkout copies the files from an untouched source) because of parallelism: we want to be able to evaluate hundreds or thousands of patches at the same time, and that requires them to be in different locations.
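A hedged sketch of that checkout pattern, with invented names, could look like:

```python
# Sketch only: each patch evaluation copies the untouched buggy sources into
# its own temporary directory, so many evaluations can run in parallel
# without stepping on each other's files.
import shutil
import tempfile
from pathlib import Path


def checkout_copy(original_dir: Path) -> Path:
    workdir = Path(tempfile.mkdtemp(prefix="rbr-eval-"))
    copy = workdir / original_dir.name
    shutil.copytree(original_dir, copy)
    return copy
```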
Could you rebase your PR? I changed the CI config to enable it on PRs. That way we can check if the tests are green. Thanks :)
So, test cases right now are simple asserts about the returned value: febe8e4#diff-3f4ea3e207b6866ea3514390ef0148073207b05d1a8ca4da933d8f926e1be2d5 Got all the other points, will rebase the PR.
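For example (function name and values invented), such a generated test case is just a plain assert on the return value:

```python
# Hypothetical example of a RunBugRun-style test case: compare the function's
# return value against the expected output for a given input.
assert solve("3 5") == "8"
```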
Looks like a good solution, thanks! Let me know if you have any problems with the CI.
PR updated. One tricky issue is that during initialization, getting the failing-test cause requires executing the test cases, which takes a very long time. For now I'm only checking whether a bug has an associated runtime exception (which is stored as part of the dataset), without executing anything. This skips the roughly three quarters of bugs that do not throw an exception but simply produce wrong output. Here is what samples generated for RunBugRun look like:
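The generated samples themselves aren't reproduced here, but as a sketch of the exception-based filtering described above (field and helper names are invented):

```python
# Sketch only: keep the bugs whose dataset entry records a runtime exception,
# so initialization can report a failing cause without executing the tests.
def bugs_with_exception(all_bugs: list[dict]) -> list[dict]:
    return [bug for bug in all_bugs if bug.get("exception")]
```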
One solution is to execute once, store the results, and then always load from them. WDYT? You could store them in a fork of RunBugRun.
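A minimal sketch of that execute-once-and-cache idea (the file name and `run_all_tests` helper are hypothetical):

```python
# Sketch only: run the expensive test execution once, cache the results to a
# JSON file, and load from the cache on later initializations.
import json
from pathlib import Path

CACHE = Path("runbugrun_failing_tests.json")


def get_failing_tests(run_all_tests) -> dict:
    if CACHE.exists():
        return json.loads(CACHE.read_text())
    results = run_all_tests()  # expensive: executes every bug's test cases
    CACHE.write_text(json.dumps(results))
    return results
```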
They look great! Thanks for the work in integrating RunBugRun :)) |
RunBugRun -- An Executable Dataset for Automated Program Repair
https://github.com/giganticode/run_bug_run