[DRAFT] for testing: Fix 4Gb limit for large files on Git for Windows #2179
base: main
Conversation
The DCO bot is in error (at least from my viewpoint). The ref commit was provided by its signed-off-by author. It was a necessary prerequisite, though it does have some obvious conflict fixes with the upstream pu. Not sure if I should just double-sign Torsten's patch?
I went back and rebased the series. Unsure how to get the CI testing started.
force-pushed from 5539b3d to f293144
force-pushed from c686e70 to ed36d4f
force-pushed from ff345ba to 66c432b
force-pushed from ed04f07 to 2e5a910
Probably, but you will want to make sure that the recorded author matches the first Signed-off-by: line.
Could I ask you to re-target this PR to
(You will of course want to run
@dscho I'll rebuild/rebase the series in the next day or three, applying Torsten's patch directly at the start of the series. I'd deliberately targeted pu to at least be ahead of the game for the conflicts.
As a side issue, I'm having problems fathoming how the MSVC=1 build should work, seeing as I need to patch compat\vcbuild\MSVC-DEFS-GEN, which is apparently generated by vcpkg, but I can't find where. See my [email protected] on the googlegroups list https://groups.google.com/forum/#!topic/git-for-windows/Y99a0dzlVJY.
force-pushed from f866bf0 to 71122ee
force-pushed from 71122ee to aa65c89
I've decided against fighting the conflicts until the relevant topic lands (What's cooking in git.git (May 2019, #1; Thu, 9)).
Once that's landed, my rebase should be cleaner and easier - it was too easy to get confused as to which way all the conflicts were going, especially as they are, for the purpose of the series, incidental irrelevances.
Makes a ton of sense to me, @PhilipOakley!
@dscho I see Junio has announced Git v2.22.0-rc0.
Just seen (follow-up to the rc0 announcement):
So I hope to get on it today.
Oh, sorry! I was so heads-down into getting Git for Windows v2.22.0-rc0 out the door (and then on trying to tie up a few loose ends in Git v2.22.0-rc0 in time for -rc1) that I missed this. The update of

You may ask "why?" and it is a very legitimate question. The answer boils down to "I want to keep the door open for Git for Windows v2.21.1, if need be". You see, I am not at all a fan of many release branches. And my automation really is centered around

So the update to

However, hope is near ;-) I do, of course, make those -rc* previews from Git commits, but those live on

And once Git v2.22.0 is close enough (I won't publish a Git for Windows v2.21.1 if I expect to publish a v2.22.0 within a week, unless there are serious security concerns), I will fast-forward

Therefore, I would suggest to simply re-target your branch to
Updated! Let's see where the Pipeline leads us.
I've seen quite a few C99 features in current git source. Indeed, even
And we try to stay compatible even with setups that are ridiculously different from Linux, such as NonStop. There is even work under way to port Git to Plan 9. So yes, while some of us (although not me) are always eager to make use of these shiny features, Git is actually quite conservative (in a good sense, of course).
I left a couple of comments, but the biggest one will be to break down the huge commit (I have no doubt that it will be rejected by the Git mailing list, both because the mail server will bounce it, and reviewers will refuse to even look at it).
The real trick for this entire PR will be to draw appropriate boundaries between logically separate changes.
test_description='test large file handling on windows'
. ./test-lib.sh

test_expect_success SIZE_T_IS_64BIT 'require 64bit size_t' '
The commit message claims that this is for Windows, but the prereq does not actually test that. You will want to use the prereq MINGW for that.

Besides, I think that SIZE_T_IS_64BIT is only part of the remaining prereqs: we probably want to also ensure that sizeof(long) == 4, because otherwise this test case will trivially succeed, even if nothing in Git's source code was fixed.

To that end, I would like to imitate the way we test for time_t-is64bit via the test-tool.

And when we do that, we do not even need the MINGW prereq, nor do we need to mention Windows specifically: it is no longer about Windows, it is about systems where the size of long differs from that of void *.

Finally, I think it would make sense to fold the commit that introduces the SIZE_T_IS_64BIT prereq into this commit, and probably even move it into this here file, as that would make it easier on reviewers (both present as well as future ones).
Yes, the initial test was incomplete, and just sufficient to ensure I could see all the steps that include a long conversion. When I looked at the use of zlib there were 28 places! So a lot of different parts of the test could be impacted, hence checks everywhere.
I also wanted to explicitly test that the zlib compile flags were the problematic ones.
Yes, that's indeed what you should do while developing the test case. Once you're done, it needs to be polished to serve the particular use case of allowing easy debugging of future regressions. And that is where I think the git log --stat and the compile flags and stuff like that won't help, so it needs to go before we merge.
The compile flags should be part of the prerequisite, to detect that size_t is 64 bit but zlib is only 32 bit, which is currently part of the >4Gb failing case. You're right, it's not in there yet, but it should be (not sure how to do it yet, though).
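One possible avenue for that (my assumption, not something already in the patch): zlib's real zlibCompileFlags() function encodes the compiled sizes of uInt (bits 0-1) and uLong (bits 2-3), with 0 = 16 bits, 1 = 32, 2 = 64, so a small helper could detect a 32-bit zlib build directly:

/* Sketch: report zlib's compiled integer widths; exits 0 if uLong is
 * 32 bits wide (the >4Gb-problematic case), 1 otherwise. */
#include <stdio.h>
#include <zlib.h>

int main(void)
{
	uLong flags = zlibCompileFlags();
	int uint_bits = 16 << (flags & 3);         /* 0=16, 1=32, 2=64 */
	int ulong_bits = 16 << ((flags >> 2) & 3);

	printf("uInt: %d bits, uLong: %d bits\n", uint_bits, ulong_bits);
	return ulong_bits == 32 ? 0 : 1;
}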
@@ -0,0 +1,23 @@
#!/bin/sh
After I read this commit message, I wonder how we can munge it to look more like the other commit messages in Git. I.e. if you compare the commit message e.g. of 581d2fd to this:
Add test for large files on windows
Original test by Thomas.
Add the extra fsck to get diagnostics after the add.
Verify the pack at earliest opportunity
Slight confusion as to why index-pack vs verify-pack...
It's -v (verbose) not --verify
Specify an output file to index-pack, otherwise it clashes with
the existing index file.
Check that the sha1 matches the existing value;-)
then it is hard not to spot a stark difference. Maybe we could write something like this instead:
Add a test for files >2GB
There is nothing in the C specification that claims that `long` should
have the same size as `void *`. For that reason, Git's over-use of
`unsigned long` when it tries to store addresses or offsets is
problematic.
This issue is most prominent on Windows, where the size of `long` is
indeed 32 bit, even on 64-bit Windows.
Let's add a test that demonstrates problems with our current code base
when `sizeof(long)` is different from `sizeof(void *)`.
Thoughts?
BTW I have no idea what you mean by this:
Check that the sha1 matches the existing value;-)
I do not see this at all in this here patch.
The sha1 check was one of the reported values in the test, but I only manually checked it against what I'd seen via WSL.
The test (and hence message) should also cover the zlib compile flags (not sure if we can determine the crc32 compile flags).
the test (and hence message) should also cover the zlib compile flags (not sure if we can determine the crc32 compile flags)
Would a change of those flags constitute a regression? If not (which I think is the case), then it should not be tested.
The sha1 check was one of the reported values in the test but I only manually checked it against what I'd seen via WSL.
Ah, okay. I am not sure that is strictly interesting information for the commit message; it is more for the cover letter.
git fsck --verbose --strict --full &&
git commit -m msg file &&
git log --stat &&
git gc &&
Do we really want to git gc both here and 3 lines further down? Would this really help debugging regressions in the future?
Almost every step is error-prone, and some do not reproduce outside of the test environment (differing big-file threshold effects, I think), i.e. pack vs loose object.
Then we really need a more robust test. Otherwise a developer might be tempted to think that they fixed a flawed regression test, all the while a real regression crept in.
So please indulge me: what exactly is this regression test case supposed to test?
This is an expansion of the 'does it perform correctly' test that Thomas B initially developed.
Each of the fsck, verify-pack, log and gc commands can barf. Plus if run manually it actually does a slightly different test (add to loose vs pack), which I didn't get to the bottom of!
In this case he had deliberately set the compression to zero so that the compressed file would exceed the limit. We also 'need' (if we are to be complete) to check packed and loose encodings, etc. (including breakages at various process stages, add vs commit).
I'd agree that the testing still has a long way to go to get a smooth set of tests that satisfy all. It's all low level code issues that can pop up in many places.
It may be that we have a two-phase plan. First make the code patches visible (if the code isn't working...), then make the testing acceptable (to as yet unknown/discussed/decided standards).
git verify-pack -s .git/objects/pack/*.pack &&
git fsck --verbose --strict --full &&
git commit -m msg file &&
git log --stat &&
I'd rather leave this out. It might have made sense for debugging when this patch was developed, but the test case should be optimized for catching, diagnosing and debugging future regressions, nothing more, nothing less.
Depends on the testing criteria.
Maybe we have a full-fledged 'check everything every step' commit immediately followed by a commit that trims it back to 'just detect an end-to-end fail', with a message that states that the debug version is the previous commit?
I regularly investigate test failures in Git's test suite, as I am one of the very few who even looks at our CI builds.
You should try it, too. It's very educative.
I can tell you precisely how I feel about test cases like this one, but you would have to hide younger children first.
In short: no, let's not do this. Let's not make life hard on the developers who will inevitably have to investigate regressions. Let's make it easy instead. The shorter, more concise, and more meaningful we can make the test case, the better.
Remember: regression tests are not so much about finding regressions, as they are about helping regressions to be fixed. That is the criterion you should aim to optimize for. Always.
@@ -370,7 +370,7 @@ static void start_put(struct transfer_request *request)
	/* Set it up */
	git_deflate_init(&stream, zlib_compression_level);
-	size = git_deflate_bound(&stream, len + hdrlen);
+	size = git_deflate_bound(&stream, len + (size_t) hdrlen);
Since len is already of type size_t, this cast is superfluous. The only real difference this would make was if hdrlen was negative. Which cannot be the case because xsnprintf() is explicitly not allowed to return negative values.

So I'd just drop this hunk.
We can't assume the type of len because it (git_deflate_bound) may be a macro, or a function, and it is then subject to the compile flags.
Also, one of my readings of the various 'standards' suggested that it is within implementation-defined behaviour to downcast oversize variables if one (of the computation) was the normal 'right' size (e.g. long).
The whole up/down-cast cascade and dual default function templates thing made me pedantic/explicit!
We can't assume the type of len because it (git_deflate_bound) may be a macro, or a function, and it is then subject to the compile flags.

Yes we can. Because we declared it ourselves, a couple lines above the shown diff context, as size_t len. So we can very much assume that its type is size_t.

it is within the implementation defined definition to downcast oversize variables if one (of the computation) was the normal 'right' size (e.g. long).

Nope. Arithmetic expressions are always evaluated after upcasting narrower data types to larger ones. A compiler that downcasts is buggy.

You probably think about an example like this:

short result;
long a = 1;
int b = 2;
result = a + b;

This is a bit tricky to understand, as the evaluation of the arithmetic expression a + b really upcasts the int to a long (unless they already have the same bit size). Only after that is the result downcast in order to be assigned to the narrower result.

That example has little relationship to the code under discussion, though, except in the case where git_deflate_bound() is defined as a function whose second parameter is declared with a too-narrow datatype. In which case you simply cannot do anything about it, except replace the function by our custom macro (as you did in another patch in this PR).
I agree about the upcast to long. But when long is meant to be the maximum arithmetic calculation, then size_t may be downcast (as per Microsoft's desire that 32-bit code stays unchanged on 64-bit machines, so it only (arithmetically) addresses a 32-bit range, unless one specially changes to full pointer calculation). As I understand it, it's a bit open-ended (which standard is your go-to reference document?).
It may be that the commit message needs to be long to explain the ripple-through consequences, or we make the code informative. It's one of those 'discursive' areas.
We can't assume the type of len

But we can. It is declared as size_t len;. See here:

Line 363 in a4b2bb5: size_t len;
 * On Windows, uInt/uLong are only 32 bits.
 */
extern uLong xcrc32(uLong crc, const unsigned char *buf, size_t bytes);
The crc32() function is declared and defined in zlib. It would be most helpful to mention this in the code comment, along with the hint that zlib.h is already included in cache.h.
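To make the discussion concrete, a hedged sketch of how such a wrapper could feed a size_t-sized buffer through zlib's 32-bit crc32() interface; this is my reading of the intent, not necessarily the patch's exact implementation:

#include <limits.h>
#include <zlib.h>

/* Chunk the buffer so each crc32() call's length fits into uInt,
 * which is only 32 bits wide on LLP64 systems such as Windows. */
uLong xcrc32(uLong crc, const unsigned char *buf, size_t bytes)
{
	while (bytes > UINT_MAX) {
		crc = crc32(crc, buf, UINT_MAX);
		buf += UINT_MAX;
		bytes -= UINT_MAX;
	}
	return crc32(crc, buf, (uInt)bytes);
}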
Hmm, I had thought that we got it from the GNU stuff (i.e. I found its definition in the GNU reference pages, not the zlib pages).
Double-check needed. It's good if you are right (probably are..).
Looks like I was misinformed. The crc32 is part of the zlib library and I had missed that fact.
TBH I did not know either until I looked at which C standard defines the crc32() function, and did not find any.
@@ -352,7 +352,7 @@ static int write_zip_entry(struct archiver_args *args,
	if (!buffer)
		return error(_("cannot read %s"),
			     oid_to_hex(oid));
-	crc = crc32(crc, buffer, size);
+	crc = xcrc32(crc, buffer, size);
I think this makes sense, but the sentence "Zero length initialisations are not converted." is very cryptic.
Any better suggestions? Zero length fits into both 32-bit and 64-bit longs ;-)
I have no idea what you mean by that.
(That sentence makes as much sense to me as "Correct horse battery staple".)
Maybe: "Zero-length initialisations work with both 32-bit and 64-bit uInt len values" ;-) (only slightly less cryptic).
Otherwise it's a lot of noisy changes for no benefit (there are quite a few initialisations..).
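If my reading is right (an assumption), the zero-length initialisations in question are the idiomatic CRC seeds, which are safe with any width of length type and therefore need no conversion:

/* The usual zlib seeding idiom: a length of 0 fits a 32-bit or a
 * 64-bit length type alike, so such call sites can keep plain
 * crc32() instead of an xcrc32() wrapper. */
uLong crc = crc32(0L, Z_NULL, 0);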
unsigned long newpos, newlines;
size_t leading, trailing;
size_t oldpos, oldlines;
size_t newpos, newlines;
I do think that we can, and should, break apart this huge patch. In other words, I don't believe the statement that this is the minimal patch ;-)
minimal patch
Can I just say that the GitHub review interface is really poor here? I have no idea (from the web page) which patch is which, which belongs to which commit, etc.
Assuming this was Torsten's original patch, it did fit on the mailing list as a 'less than sufficient' patch: https://public-inbox.org/git/[email protected]/ (it's how I applied it, from an email on-list ;-).
If the 'minimal' comment is about something else, excuse my mistake.
I have no idea (from the web page) which patch is which, which belongs to which commit etc.
That's an optimization in the web UI, with a very easy way out: click on the file name above the diff, i.e. https://github.com/git-for-windows/git/pull/2179/files/5b60f72b34f213309d5870b7df4b1038914e4fc0..9c2f8de134713f7e04fc2402567c5b0c29605737#diff-3edc96d0eefd86960d342f661171a62c
Assuming this was Torsten's original patch it did fit on the mailing list as a 'less than sufficient' patch
It changes 85 files. Eighty-five! Nobody can review such a patch properly, not @tboegi, not you, not me.
And when I look at the diffstat, I immediately see e.g. that there are plenty of changes in apply.c and plenty in builtin/pack-objects.c. Those two files share preciously little concern. So I am rather certain that it would be easy even to split this at the file boundary.
Which still does not make sense, as it is still not a logically-minimal patch.
to split this at the file boundary.
Sorry, I meant at that file boundary.
And indeed, if I apply only the builtin/pack-objects.c part, for example, it compiles Just Fine. So the commit message is maybe not quite truthful when it says

The smallest step seems to be much bigger than expected.
Or when it says this:
However, when the Git code base is compiled with a compiler that complains that "unsigned long" is different from size_t, we end up in this huge patch, before the code base cleanly compiles.
As I just proved, there is a much smaller patch that already compiles cleanly: just the 70 changed lines of builtin/pack-objects.c. And I bet that we can do more of the same to cut this into nice mouthfuls, without breaking the build.
@PhilipOakley I'd like to focus your attention on the two comments above this one.
@@ -481,7 +481,7 @@ static void *unpack_raw_entry(struct object_entry *obj,
	unsigned char *p;
	size_t size, c;
	off_t base_offset;
-	unsigned shift;
+	size_t shift;
Nicely done, but let's not state that this is for Windows. This is a data type cleanup that benefits 64-bit systems where long is 32 bits wide.
OK
@@ -1057,7 +1057,7 @@ size_t unpack_object_header_buffer(const unsigned char *buf,
	size = c & 15;
	shift = 4;
	while (c & 0x80) {
-		if (len <= used || bitsizeof(long) <= shift) {
+		if (len <= used || bitsizeof(size_t) <= shift) {
This is a good change. The commit message still needs to be changed to answer the question "why?" instead of the question "what?" because the latter leaves the reader puzzled while the former will leave the reader enlightened.
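For future readers, a simplified, self-contained restatement of the decoding loop under discussion (my own paraphrase, with bitsizeof() approximated here rather than taken from git-compat-util.h), showing why the guard must use the accumulator's width:

#include <stddef.h>

#define bitsizeof(x) (8 * sizeof(x))

/* Decode the varint-style size from an object header. With a size_t
 * accumulator, guarding with bitsizeof(long) would reject valid >4Gb
 * sizes on LLP64; shifting past bitsizeof(size_t) would be undefined
 * behaviour. Returns 0 on a truncated or overlong header. */
static size_t decode_size(const unsigned char *buf, size_t len)
{
	size_t used = 0, size, shift;
	unsigned char c;

	if (!len)
		return 0;
	c = buf[used++];
	size = c & 15;
	shift = 4;
	while (c & 0x80) {
		if (len <= used || bitsizeof(size_t) <= shift)
			return 0;
		c = buf[used++];
		size += (size_t)(c & 0x7f) << shift;
		shift += 7;
	}
	return size;
}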
I also wonder whether we should try to pour the learnings from #1848 into a test case: after all, I found a couple of issues with an earlier iteration of these patches, while the "big file" test case seemed to pass already...
I'll just respond to a few of the points that I have immediate recollection of. Many of the casts were about having an abundance of caution regarding the whole 'implementation defined' aspect of this here 'Bug' (i.e. we can't trust anything for the 64 vs 32 bit long issue ;-)
I understand that. Yet I think some of that caution erred on the side of redundancy, and I would rather not spend the time on the mailing list trying to defend unnecessary casts...
@dscho: https://github.com/tboegi/git/tree/tb.190828_convert_size_t_mk_size_t |
Having reviewed many patch series, I much prefer reviewing a dozen relatively short patch series over the course of a dozen weeks to having to review all the changes in one big patch. Remember: there are patches so simple and elegant that there is no obvious place for bugs to hide, and there are patches so gigantic and repetitive that there is no obvious place for bugs to hide (but plenty of non-obvious places). No, I still think we need to do our best job to make this thing easy to review. If you can augment that big patch by a small Coccinelle patch that generates it, that would make it reviewable, of course. But that enormous wall of diff? Not reviewable at all.
@dscho The first size_t commit is not going away. It has always been that big because there are compilers that do not accept less. This is part of the inconsistent behaviours of these implementation-defined behaviours (the type

It may be that someone wants to split it into a few "won't compile on my machine" patches, but I'd rather we stay with what we have. Torsten was compiling for Raspbian (gcc (Raspbian 6.3.0-18+rpi1+deb9u1) 6.3.0 20170516), not GfW anyway.

Martin Koegler's series (mk/use-size-t-in-zlib, still on pu) was started in 2017 as a >4Gb issue (found by searching the archive for

The size_t stuff actually had compile warnings back as far as 2007, see https://public-inbox.org/git/[email protected]/, so at some point maybe we need to bite the bullet and actually do the big change.

... my main problem is the test system, which is something I'm not that familiar with. I now see the tests as being: maybe we need a size-graded test of 1.5Gb, 3.5Gb, 5.5Gb to walk through the two potential barriers at 2Gb and 4Gb.

We also need an easily accessible compiler that has something equivalent to the former
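On the "compiler that complains" point: gcc and clang's -Wconversion warns about exactly this class of narrowing, which could serve as the easily accessible check (my suggestion, not something from the series):

/* Compile with: cc -c -Wconversion demo.c
 * On an LLP64 target such as 64-bit Windows, the implicit conversion
 * below narrows a 64-bit size_t to a 32-bit unsigned long and draws
 * "warning: conversion from 'size_t' ... may change value". */
#include <stddef.h>

unsigned long truncates_on_llp64(size_t n)
{
	return n;
}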
I remember clearly that I was able to split out a part from it, earlier this year, into a self-contained commit. Yes, it is possible to do this incrementally. It absolutely is. You can easily pass
I think I even mentioned this somewhere in this PR or in a related one. And note: I did not even try hard. So I don't buy the argument "this patch is big, it cannot be broken down further". I don't buy that at all. |
Test whether the Git CI system allows, or fails, when LLP64 systems need variable type coercion from size_t back to unsigned long. All the *nix systems should pass, but compile checks for LLP64 systems may fail. This is a crucial information spike for Git files > 4Gb, i.e. Issues: git-lfs git-for-windows#2434, GfW git-for-windows#2179, git-for-windows#1848, git-for-windows#3306. Signed-off-by: Philip Oakley <[email protected]>
Let #3487 serve as a counter-argument: tiny, well-contained changes from @PhilipOakley.

If you are truly interested in pushing this forward, I encourage you to have a look at the end of this PR comment, where I follow the rabbit hole of the call chain of read_object_file(). Granted, that won't be as easily written as a semi-automated search-and-replace. But then, such a semi-automated, huge commit has not a sliver of a chance to get reviewed on the Git mailing list, let alone accepted.
I had a simplistic look at the call tree to see how many times each was referenced.
Following your rabbit hole of the call chain (URLs in link) of read_object_file(): my first wondering was, depth first or breadth first? I noticed there were two identical ones. I may, as a narrow starter, just do
You could try both approaches, and then go with the one that is easier to review.
This patch series should fix the large file limit on Windows where 'long' is only 32 bits.
Hopefully this PR will go through the CI/Test pipeline to check whether all the patches pass the test suite
in a proper incremental manner, plus test the MSVC=1 option.
Signed-off-by: Philip Oakley [email protected]
The series did compile at every step, and rebased cleanly on the latest shears/pu.