Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some GH PR references were not ported correctly #22

Open
JustAnotherArchivist opened this issue Apr 14, 2022 · 19 comments
Open

Some GH PR references were not ported correctly #22

JustAnotherArchivist opened this issue Apr 14, 2022 · 19 comments

Comments

@JustAnotherArchivist
Copy link

python/cpython#90908 (comment) has two incorrect references. The first one goes to GH-75453, which is bpo-31270. It should be GH-31270 (https://bugs.python.org/issue46752#msg413260). Note that the reference is correct in the following 'New changeset' comment.

This is the only case I've seen so far, but I haven't systematically looked for more.

@JustAnotherArchivist
Copy link
Author

Found two more:

python/cpython#91294 (comment) references GH-76305 instead of GH-32124.
python/cpython#91302 (comment) references GH-76242 instead of GH-32061.

@ezio-melotti
Copy link
Member

ezio-melotti commented Apr 14, 2022

It seems like GH-* were interpreted as issue number and converted automatically to the new migrated issue during the transfer. This automatic conversion was supposed to be disabled during the transfer, but apparently it wasn't.

@vstinner
Copy link
Member

Another example. In https://bugs.python.org/issue47075 I explicitly wrote I proposed GH-32112 for that. to get a link to my PR python/cpython#32112 but the converted issue python/cpython#91231 points to python/cpython#76293 i don't understand how GH-32112 was transformed to #76293 which is an unrelated issue (not a PR)

@ezio-melotti
Copy link
Member

i don't understand how GH-32112 was transformed to #76293 which is an unrelated issue (not a PR)

During the transfer, https://bugs.python.org/issue32112 was transferred to python/cpython#76293. The transfer tool thought that your GH-32112 was a reference to bpo-32112 (rather than referring to a GitHub PR), and since bpo-32112 is now GH-76293, it converted GH-32112 into GH-76293.

@gvanrossum
Copy link

The transfer tool thought that your GH-32112 was a reference to bpo-32112 (rather than referring to a GitHub PR)

How could that happen? That seems terrible. :-(

@JustAnotherArchivist
Copy link
Author

Any update on whether/how/when this could hopefully get fixed?

@vstinner
Copy link
Member

The transfer tool thought that your GH-32112 was a reference to bpo-32112

Would it be possible to process all data once more time to detect the issue and fix it? This issue is quite annoying.

The current workaround is to go to the original issue, find the message to look for the correct GH issue/PR number.

@serhiy-storchaka
Copy link

I agree that it is very serious issue. For now we have incorrect references and the reader can not even know that they are incorrect.

Can we repeat translation which correctly detects GitHub references locally, compare it with the old incorrect translation, and apply the diff? The diff should be orders smaller that the whole data (but it still can be thousands messages).

@ezio-melotti
Copy link
Member

Since this happened at transfer time, one of the test repos that I kept around might still have the correct references and it might be used to fix the links. Doing it from bpo is a bit trickier because there is quite a bit of processing that happened at import time (even though it should be possible to extract only the references).

The main issue is that this requires a mass-rewrite which will have to go through thousands of issues and comments and edit them, and this is not trivial to write and test. At this point we will also have to use the public API, which is rate limited. If it is done, the same infrastructure could also be used to replace bpo links with GH links in the first message of the PRs, and to @mention again nosy list members in the first message of issues. I think this will also edit the "last updated" field of all the affected issues/PRs, making it more difficult to find old/stale issues/PRs.

@serhiy-storchaka
Copy link

How many issues are affected? 100, 1000 or 10000? If it is only few 100s, we can survive editing the "last updated" field. But in long term it would be better to find a way to edit messages without changing the "last updated" field. We may need it to fix references to old Subversion and Mercurial revisions.

@ezio-melotti
Copy link
Member

I don't have a number, and when we discussed different options to edit the messages after the fact and they all changed the "last updated" field (technically GitHub could rewrite the messages directly in the DB, but it's not something they would like to do and we would need their help).

@JustAnotherArchivist
Copy link
Author

I can get an estimate of this shortly.

@JustAnotherArchivist
Copy link
Author

There are about 19270 lines in messages on bpo (as of late March, shortly before the migration) that contain a link with the text GH-*. This number includes the 'New changeset' messages, which were migrated correctly, and I'm not sure how to filter them out (without a complete export of the GitHub issues to compare it directly). I don't see any differences between the two in bpo's HTML. For example, the links mentioned in the original issue are:

  • <a href="https://github.com/python/cpython/pull/31270" class="closed" title="GitHub PR 31270: [merged] Merged">GH-31270</a> – Incorrect in GH issue
  • <a href="https://github.com/python/cpython/pull/31313" title="GitHub PR 31313">GH-31313</a> – Incorrect in GH issue
  • <a href="https://github.com/python/cpython/pull/31270" class="closed" title="GitHub PR 31270: [merged] Merged">GH-31270</a> – Correct in GH issue

In addition, there are about 5993 lines in bpo messages that contain other links to GH PRs. Roughly half of these are PR # links (example), the others are mostly just plain URLs (example). A brief sampling seems to indicate that those were transferred correctly.

@JustAnotherArchivist
Copy link
Author

One thing I suspected regarding those correctly migrated links is that perhaps messages that first contain a bpo link and then a GH PR link worked. This does not appear to be the case: exhibit A, exhibit B.

I also found one issue with a weird link on GitHub: this comment was rewritten as python/issues-test-cpython#30631. It should clearly point to python/cpython#30631.

@ezio-melotti
Copy link
Member

A few things that might help you:

  • the migration happened in two stages: the import from bpo to a new GitHub repo, and the transfer from that repo to python/cpython.
  • before the import, I did a lot of processing to fix/convert some links, add MarkDown markup, add the first message with the table, etc. In particular, I converted GH-* and PR-* references (but not plain URLs) using this regex:
      CPYTHON_PR = 'https://github.com/python/cpython/pull/'
      # PR-123, PR 123, PR123, pull request 123, GH 123, GH123
      # use a markdown link to preserve the original text -- the popup works
      (re.compile(r'(?P<text>(\b(?<![-/_])(PR-?|GH|pull\s*request))\s*'
                  r'(?P<pr_no>\d+))\b', re.I),
       r'[\g<text>](%s\g<pr_no>)' % CPYTHON_PR),
    and bpo issues with
      # issue1234, issue 1234, bpo 1234, #1234
      (re.compile(r'(?:(?<!\w)\#|\b(?<![-/_])(?:issue|bpo))'
                  r' ?(1?\d{4,6})\b', re.I),
       r'bpo-\1'),
  • during the transfer, the tool used by GitHub apparently converted some of the links. I know that one limitation of the tool is that it could only convert issues that had already been migrated. For example, if there was an issue with id 1234 that was migrated to e.g. https://github.com/python/cpython/issue/4321, issues that had references like #1234 could have been converted to #4321, but only in messages migrated after issue 1234. This was the reason why we converted all the issue references to bpo-* instead of relying on their link conversion (that was supposed to be disabled, since it wasn't relyable).

This number includes the 'New changeset' messages, which were migrated correctly, and I'm not sure how to filter them out (without a complete export of the GitHub issues to compare it directly).

Maybe they were migrated correctly because the regex used by GitHub to find links/references didn't match when the reference was surrounded by (...), like in the "New changesets" messages. All those message have the same format, so it shouldn't be difficult to filter them out (a msg.startswith('New changeset') is probably enough).

If you want to see the pre-transfer messages, I can give you access to one of the test repos. If you have other questions let me know.

@JustAnotherArchivist
Copy link
Author

Hmm, your regex would not match GH-1234 with a hyphen at all. This would suggest that such references were all rewritten by GitHub...?

I know that one limitation of the tool is that it could only convert issues that had already been migrated.

This makes sense. However, it should not affect PR references since all those PRs were already on GitHub beforehand.

Maybe they were migrated correctly because the regex used by GitHub to find links/references didn't match when the reference was surrounded by (...), like in the "New changesets" messages. All those message have the same format, so it shouldn't be difficult to filter them out (a msg.startswith('New changeset') is probably enough).

I can't easily filter for New changeset (since it's generally on a different line than the PR reference in the HTML served by bpo that I'm processing for this), but the hint about brackets is a good one! Firstly, I found several more of those python/issues-test-cpython examples: https://bugs.python.org/msg404903, https://bugs.python.org/msg364827, https://bugs.python.org/msg410766. I also found this example of the form (GH-# and GH-#) where the former link is correct but the latter goes to that test repo: https://bugs.python.org/msg331261python/cpython#53775 (comment). Overall, there are about 777 GH-# links that are not preceded by (.

@ezio-melotti
Copy link
Member

Hmm, your regex would not match GH-1234 with a hyphen at all. This would suggest that such references were all rewritten by GitHub...?

You are right: I didn't rewrite the GH-* references because those are already recognized by GitHub, but other refs like GH 1234 or GH1234 or PR-1234 aren't, so I used the regex to rewrite them. Instead of converting them into GH-1234 I converted them into MarkDown links in order to preserve the original text.

This makes sense. However, it should not affect PR references since all those PRs were already on GitHub beforehand.

Note that there is quite a bit of overlapping between the PRs ids and bpo ids, so a bpo message that linked to the PR GH-1234 could have been mistaken by the tool for a link to an issue with id 1234 (since issues and PRs share the same namespace on GitHub), and after migrating bpo-1234 to GH-4321 all the references to PR GH-1234 could have been rewritten to GH-4321.

I can't easily filter for New changeset (since it's generally on a different line than the PR reference in the HTML served by bpo that I'm processing for this)

Instead of scraping the HTML directly (if this is what you are doing), you could use the XMLRPC interface to fetch all the bpo messages. Even if you are scraping HTML, it shouldn't be difficult to isolate the individual messages. If you need more info about this let me know, however it might be better to look at the test repo to see how the messages look like after the import (I gave you read access to one of them 1).

but the hint about brackets is a good one!

Glad that helped! I noticed similar problems while testing before the migration.

Firstly, I found several more of those python/issues-test-cpython examples: https://bugs.python.org/msg404903

This might be a separate bug in the transfer tool:

  • the original message on bpo contains the text: we get a resolution in GH-25718 as it will likely conflict.
  • the imported message contains the text: we get a resolution in GH-25718 as it will likely conflict. (since my regex didn't touch the GH-*s)
  • the transferred message contains the text: we get a resolution in python/issues-test-cpython#25718 as it will likely conflict. (note that python/issues-test-cpython is the repo where we imported the issues before the transfer to python/cpython)

So my export tool kept references to GH-* unchanged and converted other references (e.g. GH *, PR *, etc.) to MarkDown links. Then, the transfer tool:

  • converted some (all?) GH-* references to point to a different id
    • this happened when the id matched an already migrated issue id and when the GH-* wasn't preceded by other chars (like ();
    • the GH-* was converted into org/repo#ID;
    • sometimes it used the source repo (python/issues-test-cpython#4321) instead of the target repo (python/cpython#4321) -- not sure why
  • left full links alone (e.g. the first two in Pin Jinja2 to fix docs build python/cpython#91294 (comment) - https://bugs.python.org/issue47138#msg416171)
  • maybe left Markdown links alone too

Can you confirm that your observations match these statements?

Footnotes

  1. I actually realized that the one I gave you access to (issues-test-cypthon-2) was used to test the transfer, so it had the same issues as the python/cpython repo. I now revoked your permissions to that repo and added you to a different repo (issues-test-2) that had the imported messages before they were transferred.

@vstinner
Copy link
Member

vstinner commented Jun 6, 2022

when we discussed different options to edit the messages after the fact and they all changed the "last updated" field (technically GitHub could rewrite the messages directly in the DB, but it's not something they would like to do and we would need their help).

I don't see the "last updated" change as an issue. It's a good thing that changes are traced and public, no?

@ezio-melotti
Copy link
Member

Technically yes, however if we mass-update the issues you won't be able to find old or stale issues anymore since they will all show as recently updated. Before the migration we decided it would be better to preserve the original "last updated" date, but we could revisit that decision. Even if we do, we would still need a script to perform the mass update.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants