Some GH PR references were not ported correctly #22
Comments
Found two more: python/cpython#91294 (comment) references GH-76305 instead of GH-32124. |
It seems like |
Another example. In https://bugs.python.org/issue47075 I explicitly wrote |
During the transfer, https://bugs.python.org/issue32112 was transferred to python/cpython#76293. The transfer tool thought that your GH-32112 was a reference to bpo-32112 (rather than referring to a GitHub PR), and since bpo-32112 is now GH-76293, it converted GH-32112 into GH-76293. |
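The failure mode described in this comment can be reproduced with a small sketch. The mapping table, the regexes, and both function names are hypothetical illustrations, not the transfer tool's actual code. Only the bpo-32112 to GH-76293 pair is taken from this thread.

```python
import re

# Hypothetical excerpt of the bpo-issue -> GitHub-issue mapping built during
# the transfer. Only this pair comes from the discussion above.
BPO_TO_GH = {32112: 76293}

GH_REF = re.compile(r"\bGH-(\d+)\b")
BPO_REF = re.compile(r"\bbpo-(\d+)\b")


def naive_rewrite(text):
    """Treat every GH-N as if N were a bpo issue number (the suspected bug)."""
    def repl(match):
        number = int(match.group(1))
        # A GH-N that already names a GitHub PR gets remapped whenever N
        # also happens to be a bpo issue number.
        return f"GH-{BPO_TO_GH.get(number, number)}"
    return GH_REF.sub(repl, text)


def safe_rewrite(text):
    """Remap only explicit bpo-N references and leave existing GH-N alone."""
    def repl(match):
        number = int(match.group(1))
        if number in BPO_TO_GH:
            return f"GH-{BPO_TO_GH[number]}"
        return match.group(0)  # unknown bpo id: keep the original text
    return BPO_REF.sub(repl, text)
```

With this toy mapping, `naive_rewrite("Fixed in GH-32112")` silently produces a reference to a different item, while `safe_rewrite` leaves the PR reference untouched and only converts `bpo-32112`.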
How could that happen? That seems terrible. :-( |
Any update on whether/how/when this could hopefully get fixed? |
Would it be possible to process all the data one more time to detect the issue and fix it? This issue is quite annoying. The current workaround is to go to the original issue and find the message to look up the correct GH issue/PR number. |
I agree that it is a very serious issue. For now we have incorrect references, and the reader cannot even know that they are incorrect. Can we repeat the translation locally so that it correctly detects GitHub references, compare it with the old incorrect translation, and apply the diff? The diff should be orders of magnitude smaller than the whole data set (but it could still be thousands of messages). |
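The "retranslate, compare, apply the diff" idea could look roughly like the following. This is a minimal sketch under stated assumptions: both translations are assumed to be available locally as dictionaries keyed by a comment id, and `comments_to_fix` is a hypothetical helper, not part of any existing tool.

```python
import re

GH_REF = re.compile(r"\bGH-(\d+)\b")


def comments_to_fix(migrated, retranslated):
    """Return {comment_id: corrected_text} where the GH-N references differ.

    `migrated` holds the text as it currently appears on GitHub and
    `retranslated` holds the output of a corrected local translation.
    Both are assumed to be dicts keyed by the same comment ids.
    """
    fixes = {}
    for comment_id, new_text in retranslated.items():
        old_text = migrated.get(comment_id)
        if old_text is None:
            continue  # comment not present in the migrated data
        # Compare only the ordered list of referenced numbers, so unrelated
        # formatting differences do not trigger an edit.
        if GH_REF.findall(old_text) != GH_REF.findall(new_text):
            fixes[comment_id] = new_text
    return fixes
```

Only the comments returned by this comparison would need to be edited, which keeps the number of rewrites (and of changed "last updated" fields) as small as possible.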
Since this happened at transfer time, one of the test repos that I kept around might still have the correct references and it might be used to fix the links. Doing it from bpo is a bit trickier because there is quite a bit of processing that happened at import time (even though it should be possible to extract only the references). The main issue is that this requires a mass-rewrite which will have to go through thousands of issues and comments and edit them, and this is not trivial to write and test. At this point we will also have to use the public API, which is rate limited. If it is done, the same infrastructure could also be used to replace bpo links with GH links in the first message of the PRs, and to @mention again nosy list members in the first message of issues. I think this will also edit the "last updated" field of all the affected issues/PRs, making it more difficult to find old/stale issues/PRs. |
How many issues are affected? 100, 1000 or 10000? If it is only a few hundred, we can survive editing the "last updated" field. But in the long term it would be better to find a way to edit messages without changing the "last updated" field. We may need it to fix references to old Subversion and Mercurial revisions. |
I don't have a number. When we discussed different options to edit the messages after the fact, they all changed the "last updated" field (technically GitHub could rewrite the messages directly in the DB, but it's not something they would like to do and we would need their help). |
I can get an estimate of this shortly. |
There are about 19270 lines in messages on bpo (as of late March, shortly before the migration) that contain a link with the text
In addition, there are about 5993 lines in bpo messages that contain other links to GH PRs. Roughly half of these are |
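An estimate like the ones above can be obtained by counting message lines that contain a PR link. The exact patterns used for these numbers are not shown in this thread, so the regex and the function name below are assumptions chosen for illustration.

```python
import re

# Assumed pattern: a full URL to a pull request in python/cpython.
PR_URL = re.compile(r"https?://github\.com/python/cpython/pull/(\d+)")


def count_lines_with_pr_links(messages):
    """Count lines, across all message bodies, that contain a PR URL."""
    return sum(
        1
        for message in messages
        for line in message.splitlines()
        if PR_URL.search(line)
    )
```

A line with several links is counted once, which matches an estimate expressed in lines rather than in individual references.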
One thing I suspected regarding those correctly migrated links is that perhaps messages that first contain a bpo link and then a GH PR link worked. This does not appear to be the case: exhibit A, exhibit B. I also found one issue with a weird link on GitHub: this comment was rewritten as |
A few things that might help you:
Maybe they were migrated correctly because the regex used by GitHub to find links/references didn't match when the reference was surrounded by If you want to see the pre-transfer messages, I can give you access to one of the test repos. If you have other questions let me know. |
Hmm, your regex would not match
This makes sense. However, it should not affect PR references since all those PRs were already on GitHub beforehand.
I can't easily filter for |
You are right: I didn't rewrite the
Note that there is quite a bit of overlap between PR ids and bpo ids, so a bpo message that linked to the PR
Instead of scraping the HTML directly (if this is what you are doing), you could use the XMLRPC interface to fetch all the bpo messages. Even if you are scraping HTML, it shouldn't be difficult to isolate the individual messages. If you need more info about this let me know; however, it might be better to look at the test repo to see what the messages look like after the import (I gave you read access to one of them).
Glad that helped! I noticed similar problems while testing before the migration.
This might be a separate bug in the transfer tool:
So my export tool kept references to
Can you confirm that your observations match these statements? |
I don't see the "last updated" change as an issue. It's a good thing that changes are traced and public, no? |
Technically yes, however if we mass-update the issues you won't be able to find old or stale issues anymore since they will all show as recently updated. Before the migration we decided it would be better to preserve the original "last updated" date, but we could revisit that decision. Even if we do, we would still need a script to perform the mass update. |
python/cpython#90908 (comment) has two incorrect references. The first one goes to GH-75453, which is bpo-31270. It should be GH-31270 (https://bugs.python.org/issue46752#msg413260). Note that the reference is correct in the following 'New changeset' comment.
This is the only case I've seen so far, but I haven't systematically looked for more.