[DRAFT] for testing: Fix 4Gb limit for large files on Git for Windows #2179
base: main
Conversation
The DCO bot is in error (at least from my viewpoint). The ref commit was provided by its signed-off-by author. It was a necessary prerequisite, though it does have some obvious conflict fixes with the upstream pu. Not sure if I should just double-sign Torsten's patch?
I went back and rebased the series. Unsure how to get the CI testing started.
force-pushed from 5539b3d to f293144
force-pushed from c686e70 to ed36d4f
force-pushed from ff345ba to 66c432b
force-pushed from ed04f07 to 2e5a910
Probably, but you will want to make sure that the recorded author matches the first Signed-off-by: line.
Could I ask you to re-target this PR to
(You will of course want to run
@dscho I'll rebuild/rebase the series in the next day or three, applying Torsten's patch directly at the start of the series. I'd deliberately targeted pu to at least be ahead of the game for the conflicts.
As a side issue, I'm having problems fathoming how the MSVC=1 build should work, seeing as I need to patch compat\vcbuild\MSVC-DEFS-GEN, which is apparently generated by vcpkg, but I can't find where. See my [email protected] on the googlegroups list https://groups.google.com/forum/#!topic/git-for-windows/Y99a0dzlVJY.
force-pushed from f866bf0 to 71122ee
force-pushed from 71122ee to aa65c89
I've decided against fighting the conflicts until the relevant topic lands (What's cooking in git.git (May 2019, #1; Thu, 9)).
Once that's landed, my rebase should be cleaner and easier - it was too easy to get confused as to which way all the conflicts were going, especially as they are, for the purpose of the series, incidental irrelevances.
Makes a ton of sense to me, @PhilipOakley!
@dscho I see Junio has announced Git v2.22.0-rc0.
Just seen (follow-up to the rc0 announcement):
So I hope to get on it today.
Oh, sorry! I was so heads-down into getting Git for Windows v2.22.0-rc0 out the door (and then on trying to tie up a few loose ends in Git v2.22.0-rc0 in time for -rc1) that I missed this. The update of

You may ask "why?" and it is a very legitimate question. The answer boils down to "I want to keep the door open for Git for Windows v2.21.1, if need be". You see, I am not at all a fan of many release branches. And my automation really is centered around

So the update to

However, hope is near ;-) I do, of course, make those -rc* previews from Git commits, but those live on

And once Git v2.22.0 is close enough (I won't publish a Git for Windows v2.21.1 if I expect to publish a v2.22.0 within a week, unless there are serious security concerns), I will fast-forward

Therefore, I would suggest to simply re-target your branch to
Updated! Let's see where the Pipeline leads us.
I've seen quite a few C99 features in current git source. Indeed, even
And we try to stay compatible even with setups that are ridiculously different from Linux, such as NonStop. There is even work under way to port Git to Plan 9. So yes, while some of us (although not me) are always eager to make use of these shiny features, Git is actually quite conservative (in a good sense, of course).
I left a couple of comments, but the biggest one will be to break down the huge commit (I have no doubt that it will be rejected by the Git mailing list, both because the mail server will bounce it, and reviewers will refuse to even look at it).
The real trick for this entire PR will be to draw appropriate boundaries between logically separate changes.
test_description='test large file handling on windows'
. ./test-lib.sh

test_expect_success SIZE_T_IS_64BIT 'require 64bit size_t' '
The commit message claims that this is for Windows, but the prereq does not actually test that. You will want to use the prereq MINGW for that.

Besides, I think that SIZE_T_IS_64BIT is only part of the remaining prereqs: we probably want to also ensure that sizeof(long) == 4, because otherwise this test case will trivially succeed, even if nothing in Git's source code was fixed.

To that end, I would like to imitate the way we test for time_t-is64bit via the test-tool.

And when we do that, we do not even need the MINGW prereq, nor do we need to mention Windows specifically: it is no longer about Windows, it is about systems where the size of long differs from that of void *.

Finally, I think it would make sense to fold the commit that introduces the SIZE_T_IS_64BIT prereq into this commit, and probably even move it into this here file, as that would make it easier on reviewers (both present as well as future ones).
Yes, the initial test was incomplete, and just sufficient to ensure I could see all the steps that include a long conversion. When I looked at the use of zlib there were 28 places! So a lot of different parts of the test could be impacted, hence checks everywhere.
I also wanted to explicitly test that the zlib compile flags were the problematic ones.
Yes, that's indeed what you should do while developing the test case. Once you're done, it needs to be polished to serve the particular use case of allowing easy debugging of future regressions. And that is where I think the git log --stat and the compile flags and stuff like that won't help, so it needs to go before we merge.
The compile flags should be part of the prerequisite, to detect that size_t is 64 bit but zlib is only 32 bit, which is currently part of the >4Gb failing case. You're right, it's not in there yet, but it should be (not sure how to do it yet, though).
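One possible avenue for that (my assumption, not something already in the patch): zlib's real zlibCompileFlags() function encodes the compiled sizes of uInt (bits 0-1) and uLong (bits 2-3), with 0 = 16 bits, 1 = 32, 2 = 64, so a small helper could detect a 32-bit zlib build directly:

/* Sketch: report zlib's compiled integer widths; exits 0 if uLong is
 * 32 bits wide (the >4Gb-problematic case), 1 otherwise. */
#include <stdio.h>
#include <zlib.h>

int main(void)
{
	uLong flags = zlibCompileFlags();
	int uint_bits = 16 << (flags & 3);         /* 0=16, 1=32, 2=64 */
	int ulong_bits = 16 << ((flags >> 2) & 3);

	printf("uInt: %d bits, uLong: %d bits\n", uint_bits, ulong_bits);
	return ulong_bits == 32 ? 0 : 1;
}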
@@ -0,0 +1,23 @@
#!/bin/sh
After I read this commit message, I wonder how we can munge it to look more like the other commit messages in Git. I.e. if you compare the commit message e.g. of 581d2fd to this:
Add test for large files on windows
Original test by Thomas.
Add the extra fsck to get diagnostics after the add.
Verify the pack at earliest opportunity
Slight confusion as to why index-pack vs verify-pack...
It's -v (verbose) not --verify
Specify an output file to index-pack, otherwise it clashes with
the existing index file.
Check that the sha1 matches the existing value;-)
then it is hard not to spot a stark difference. Maybe we could write something like this instead:
Add a test for files >2GB
There is nothing in the C specification that claims that `long` should
have the same size as `void *`. For that reason, Git's over-use of
`unsigned long` when it tries to store addresses or offsets is
problematic.
This issue is most prominent on Windows, where the size of `long` is
indeed 32 bit, even on 64-bit Windows.
Let's add a test that demonstrates problems with our current code base
when `sizeof(long)` is different from `sizeof(void *)`.
Thoughts?
BTW I have no idea what you mean by this:
Check that the sha1 matches the existing value;-)
I do not see this at all in this here patch.
The sha1 check was one of the reported values in the test, but I only manually checked it against what I'd seen via WSL.
The test (and hence message) should also cover the zlib compile flags (not sure if we can determine the crc32 compile flags).
the test (and hence message) should also cover the zlib compile flags (not sure if we can determine the crc32 compile flags)
Would a change of those flags constitute a regression? If not (which I think is the case), then it should not be tested.
The sha1 check was one of the reported values in the test but I only manually checked it against what I'd seen via WSL.
Ah, okay. I am not sure that is strictly interesting information for the commit message; it is more for the cover letter.
git fsck --verbose --strict --full &&
git commit -m msg file &&
git log --stat &&
git gc &&
Do we really want to git gc both here and 3 lines further down? Would this really help debugging regressions in the future?
Almost every step is error-prone, and some do not reproduce outside of the test environment (differing big-file threshold effects, I think), i.e. pack vs loose object.
Then we really need a more robust test. Otherwise a developer might be tempted to think that they fixed a flawed regression test, all the while a real regression crept in.
So please indulge me: what exactly is this regression test case supposed to test?
This is an expansion of the 'does it perform correctly' test that Thomas B initially developed.
Each of the fsck, verify-pack, log and gc commands can barf. Plus if run manually it actually does a slightly different test (add to loose vs pack), which I didn't get to the bottom of!
In this case he had deliberately set the compression to zero so that the compressed file would exceed the limit. We also 'need' (if we are to be complete) to check packed and loose encodings, etc. (including breakages at various process stages, add vs commit).
I'd agree that the testing still has a long way to go to get a smooth set of tests that satisfy all. It's all low level code issues that can pop up in many places.
It may be that we have a two-phase plan. First make the code patches visible (if the code isn't working...), then make the testing acceptable (to as yet unknown/discussed/decided standards).
git verify-pack -s .git/objects/pack/*.pack &&
git fsck --verbose --strict --full &&
git commit -m msg file &&
git log --stat &&
I'd rather leave this out. It might have made sense for debugging when this patch was developed, but the test case should be optimized for catching, diagnosing and debugging future regressions, nothing more, nothing less.
Depends on the testing criteria.
Maybe we have a full-fledged 'check everything every step' commit immediately followed by a commit that trims it back to 'just detect an end-to-end fail', with a message that states that the debug version is the previous commit?
I regularly investigate test failures in Git's test suite, as I am one of the very few who even looks at our CI builds.
You should try it, too. It's very educative.
I can tell you precisely how I feel about test cases like this one, but you would have to hide younger children first.
In short: no, let's not do this. Let's not make life hard on the developers who will inevitably have to investigate regressions. Let's make it easy instead. The shorter, more concise, and more meaningful we can make the test case, the better.
Remember: regression tests are not so much about finding regressions, as they are about helping regressions to be fixed. That is the criterion you should aim to optimize for. Always.
@@ -370,7 +370,7 @@ static void start_put(struct transfer_request *request)
	/* Set it up */
	git_deflate_init(&stream, zlib_compression_level);
-	size = git_deflate_bound(&stream, len + hdrlen);
+	size = git_deflate_bound(&stream, len + (size_t) hdrlen);
Since len is already of type size_t, this cast is superfluous. The only real difference this would make was if hdrlen was negative. Which cannot be the case because xsnprintf() is explicitly not allowed to return negative values.

So I'd just drop this hunk.
We can't assume the type of len because it (git_deflate_bound) may be a macro, or a function, and it is then subject to the compile flags.
Also, one of my readings of the various 'standards' suggested that it is within implementation-defined behaviour to downcast oversize variables if one (of the computation) was the normal 'right' size (e.g. long).
The whole up/down-cast cascade and dual default function templates thing made me pedantic/explicit!
We can't assume the type of len because it (git_deflate_bound) may be a macro, or a function, and it is then subject to the compile flags.

Yes we can. Because we declared it ourselves, a couple lines above the shown diff context, as size_t len. So we can very much assume that its type is size_t.

it is within the implementation defined definition to downcast oversize variables if one (of the computation) was the normal 'right' size (e.g. long).

Nope. Arithmetic expressions are always evaluated after upcasting narrower data types to larger ones. A compiler that downcasts is buggy.

You probably think about an example like this:

short result;
long a = 1;
int b = 2;
result = a + b;

This is a bit tricky to understand, as the evaluation of the arithmetic expression a + b really upcasts the int to a long (unless they already have the same bit size). Only after that is the result downcast in order to be assigned to the narrower result.

That example has little relationship to the code under discussion, though, except in the case where git_deflate_bound() is defined as a function whose second parameter is declared with a too-narrow datatype. In which case you simply cannot do anything about it, except replace the function by our custom macro (as you did in another patch in this PR).
I agree about the upcast to long. But when long is meant to be the maximum arithmetic calculation, then size_t may be downcast (as per Microsoft's desire that 32-bit code stays unchanged on 64-bit machines, so it only (arithmetically) addresses a 32-bit range, unless one specially changes to full pointer calculation). As I understand it, it's a bit open-ended (which standard is your go-to reference document?).
It may be that the commit message needs to be long to explain the ripple-through consequences, or we make the code informative. It's one of those 'discursive' areas.
We can't assume the type of len

But we can. It is declared as size_t len;. See here:

Line 363 in a4b2bb5: size_t len;
 * On Windows, uInt/uLong are only 32 bits.
 */
extern uLong xcrc32(uLong crc, const unsigned char *buf, size_t bytes);
The crc32() function is declared and defined in zlib. It would be most helpful to mention this in the code comment, along with the hint that zlib.h is already included in cache.h.
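To make the discussion concrete, a hedged sketch of how such a wrapper could feed a size_t-sized buffer through zlib's 32-bit crc32() interface; this is my reading of the intent, not necessarily the patch's exact implementation:

#include <limits.h>
#include <zlib.h>

/* Chunk the buffer so each crc32() call's length fits into uInt,
 * which is only 32 bits wide on LLP64 systems such as Windows. */
uLong xcrc32(uLong crc, const unsigned char *buf, size_t bytes)
{
	while (bytes > UINT_MAX) {
		crc = crc32(crc, buf, UINT_MAX);
		buf += UINT_MAX;
		bytes -= UINT_MAX;
	}
	return crc32(crc, buf, (uInt)bytes);
}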
Hmm, I had thought that we got it from the GNU stuff (i.e. I found its definition in the GNU reference pages, not the zlib pages).
Double-check needed. It's good if you are right (probably are..).
Looks like I was misinformed. The crc32 is part of the zlib library and I had missed that fact.
TBH I did not know either until I looked at which C standard defines the crc32() function, and did not find any.
@@ -352,7 +352,7 @@ static int write_zip_entry(struct archiver_args *args,
	if (!buffer)
		return error(_("cannot read %s"),
			     oid_to_hex(oid));
-	crc = crc32(crc, buffer, size);
+	crc = xcrc32(crc, buffer, size);
I think this makes sense, but the sentence "Zero length initialisations are not converted." is very cryptic.
Any better suggestions? Zero length fits into both 32-bit and 64-bit longs ;-)
I have no idea what you mean by that.
(That sentence makes as much sense to me as "Correct horse battery staple".)
Maybe: "Zero-length initialisations work with both 32-bit and 64-bit uInt len values" ;-) (only slightly less cryptic).
Otherwise it's a lot of noisy changes for no benefit (there are quite a few initialisations..).
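If my reading is right (an assumption), the zero-length initialisations in question are the idiomatic CRC seeds, which are safe with any width of length type and therefore need no conversion:

/* The usual zlib seeding idiom: a length of 0 fits a 32-bit or a
 * 64-bit length type alike, so such call sites can keep plain
 * crc32() instead of an xcrc32() wrapper. */
uLong crc = crc32(0L, Z_NULL, 0);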
unsigned long newpos, newlines;
size_t leading, trailing;
size_t oldpos, oldlines;
size_t newpos, newlines;
I do think that we can, and should, break apart this huge patch. In other words, I don't believe the statement that this is the minimal patch ;-)
minimal patch
Can I just say that the GitHub review interface is really poor here? I have no idea (from the web page) which patch is which, which belongs to which commit, etc.
Assuming this was Torsten's original patch, it did fit on the mailing list as a 'less than sufficient' patch: https://public-inbox.org/git/[email protected]/ (it's how I applied it, from an email on-list ;-).
If the 'minimal' comment is about something else, excuse my mistake.
I have no idea (from the web page) which patch is which, which belongs to which commit etc.
That's an optimization in the web UI, with a very easy way out: click on the file name above the diff, i.e. https://github.com/git-for-windows/git/pull/2179/files/5b60f72b34f213309d5870b7df4b1038914e4fc0..9c2f8de134713f7e04fc2402567c5b0c29605737#diff-3edc96d0eefd86960d342f661171a62c
Assuming this was Torsten's original patch it did fit on the mailing list as a 'less than sufficient' patch
It changes 85 files. Eighty-five! Nobody can review such a patch properly, not @tboegi, not you, not me.
And when I look at the diffstat, I immediately see e.g. that there are plenty of changes in apply.c and plenty in builtin/pack-objects.c. Those two files share preciously little concern. So I am rather certain that it would be easy even to split this at the file boundary.
Which still does not make sense, as it is still not a logically-minimal patch.
to split this at the file boundary.
Sorry, I meant at that file boundary.
And indeed, if I apply only the builtin/pack-objects.c part, for example, it compiles Just Fine. So the commit message is maybe not quite truthful when it says

The smallest step seems to be much bigger than expected.
Or when it says this:
However, when the Git code base is compiled with a compiler that complains that "unsigned long" is different from size_t, we end up in this huge patch, before the code base cleanly compiles.
As I just proved, there is a much smaller patch that already compiles cleanly: just the 70 changed lines of builtin/pack-objects.c. And I bet that we can do more of the same to cut this into nice mouthfuls, without breaking the build.
@PhilipOakley I'd like to focus your attention on the two comments above this one.
@@ -481,7 +481,7 @@ static void *unpack_raw_entry(struct object_entry *obj,
	unsigned char *p;
	size_t size, c;
	off_t base_offset;
-	unsigned shift;
+	size_t shift;
Nicely done, but let's not state that this is for Windows. This is a data type cleanup that benefits 64-bit systems where long is 32 bits wide.
OK
@@ -1057,7 +1057,7 @@ size_t unpack_object_header_buffer(const unsigned char *buf,
	size = c & 15;
	shift = 4;
	while (c & 0x80) {
-		if (len <= used || bitsizeof(long) <= shift) {
+		if (len <= used || bitsizeof(size_t) <= shift) {
This is a good change. The commit message still needs to be changed to answer the question "why?" instead of the question "what?" because the latter leaves the reader puzzled while the former will leave the reader enlightened.
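For future readers, a simplified, self-contained restatement of the decoding loop under discussion (my own paraphrase, with bitsizeof() approximated here rather than taken from git-compat-util.h), showing why the guard must use the accumulator's width:

#include <stddef.h>

#define bitsizeof(x) (8 * sizeof(x))

/* Decode the varint-style size from an object header. With a size_t
 * accumulator, guarding with bitsizeof(long) would reject valid >4Gb
 * sizes on LLP64; shifting past bitsizeof(size_t) would be undefined
 * behaviour. Returns 0 on a truncated or overlong header. */
static size_t decode_size(const unsigned char *buf, size_t len)
{
	size_t used = 0, size, shift;
	unsigned char c;

	if (!len)
		return 0;
	c = buf[used++];
	size = c & 15;
	shift = 4;
	while (c & 0x80) {
		if (len <= used || bitsizeof(size_t) <= shift)
			return 0;
		c = buf[used++];
		size += (size_t)(c & 0x7f) << shift;
		shift += 7;
	}
	return size;
}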
I also wonder whether we should try to pour the learnings from #1848 into a test case: after all, I found a couple of issues with an earlier iteration of these patches, while the "big file" test case seemed to pass already...
I'll just respond to a few of the points that I have immediate recollection of. Many of the casts were about having an abundance of caution regarding the whole 'implementation defined' aspect of this here 'Bug' (i.e. we can't trust anything for the 64 vs 32 bit long issue ;-)
I understand that. Yet I think some of that caution erred on the side of redundancy, and I would rather not spend the time on the mailing list trying to defend unnecessary casts...
@dscho: https://github.com/tboegi/git/tree/tb.190828_convert_size_t_mk_size_t |
Having reviewed many patch series, I much prefer reviewing a dozen relatively short patch series over the course of a dozen weeks to having to review all the changes in one big patch. Remember: there are patches so simple and elegant that there is no obvious place for bugs to hide, and there are patches so gigantic and repetitive that there is no obvious place for bugs to hide (but plenty of non-obvious places). No, I still think we need to do our best job to make this thing easy to review. If you can augment that big patch by a small Coccinelle patch that generates it, that would make it reviewable, of course. But that enormous wall of diff? Not reviewable at all.
@dscho The first size_t commit is not going away. It has always been that big because there are compilers that do not accept less. This is part of the inconsistent behaviours of these implementation-defined behaviours (the type

It may be that someone wants to split it into a few "won't compile on my machine" patches, but I'd rather we stay with what we have. Torsten was compiling for Raspbian (gcc (Raspbian 6.3.0-18+rpi1+deb9u1) 6.3.0 20170516), not GfW anyway.

Martin Koegler's series (mk/use-size-t-in-zlib, still on pu) was started in 2017 as a >4Gb issue (found by searching the archive for

The size_t stuff actually had compile warnings back as far as 2007, see https://public-inbox.org/git/[email protected]/, so at some point maybe we need to bite the bullet and actually do the big change.

... my main problem is the test system, which is something I'm not that familiar with. I now see the tests as being: maybe we need a size-graded test of 1.5Gb, 3.5Gb, 5.5Gb to walk through the two potential barriers at 2Gb and 4Gb.

We also need an easily accessible compiler that has something equivalent to the former
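On the "compiler that complains" point: gcc and clang's -Wconversion warns about exactly this class of narrowing, which could serve as the easily accessible check (my suggestion, not something from the series):

/* Compile with: cc -c -Wconversion demo.c
 * On an LLP64 target such as 64-bit Windows, the implicit conversion
 * below narrows a 64-bit size_t to a 32-bit unsigned long and draws
 * "warning: conversion from 'size_t' ... may change value". */
#include <stddef.h>

unsigned long truncates_on_llp64(size_t n)
{
	return n;
}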
I remember clearly that I was able to split out a part from it, earlier this year, into a self-contained commit. Yes, it is possible to do this incrementally. It absolutely is. You can easily pass
I think I even mentioned this somewhere in this PR or in a related one. And note: I did not even try hard. So I don't buy the argument "this patch is big, it cannot be broken down further". I don't buy that at all. |
Test whether the Git CI system allows, or fails, when LLP64 systems need variable type coercion from size_t back to unsigned long. All the *nix systems should pass, but compile checks for LLP64 systems may fail. This is a crucial information spike for Git files > 4Gb, i.e. Issues: git-lfs git-for-windows#2434, GfW git-for-windows#2179, git-for-windows#1848, git-for-windows#3306. Signed-off-by: Philip Oakley <[email protected]>
Let #3487 serve as a counter-argument: tiny, well-contained changes from @PhilipOakley.

If you are truly interested in pushing this forward, I encourage you to have a look at the end of this PR comment, where I follow the rabbit hole of the call chain of read_object_file(). Granted, that won't be as easily written as a semi-automated search-and-replace. But then, such a semi-automated, huge commit has not a sliver of a chance to get reviewed on the Git mailing list, let alone accepted.
I had a simplistic look at the call tree to see how many times each was referenced.
Following your rabbit hole of the call chain (URLs in link) of read_object_file(): my first wondering was, depth first or breadth first? I noticed there were two identical ones. I may, as a narrow starter, just do
You could try both approaches, and then go with the one that is easier to review.
This patch series should fix the large file limit on Windows where 'long' is only 32 bits.
Hopefully this PR will go through the CI/Test pipeline to check whether all the patches pass the test suite
in a proper incremental manner, plus test the MSVC=1 option.
Signed-off-by: Philip Oakley [email protected]
The series did compile at every step, and rebased cleanly on the latest shears/pu.