Memory corruption when interrupting a computation with BigInt #56545

Open
jmichel7 opened this issue Nov 13, 2024 · 8 comments
Labels
bignums BigInt and BigFloat

Comments

@jmichel7

jmichel7 commented Nov 13, 2024

If you do a big computation with BigInts like

function test_hm(rtype,rank)
  m=[rtype(1)//(n+m) for n in 1:rank, m in 1:rank]
  one(m)==m*inv(m)
end

Then
test_hm(BigInt,100)
takes about 2s, so you can interrupt it with ^C. If you then rerun the same command, you often get a message like

double free or corruption (!prev)

[2342677] signal 6 (-6): Abandon
in expression starting at REPL[5]:1

You may need to do the restart/break a few times to get the error. I have a bigger computation with BigInt where the error is systematic after a Ctrl-C. This is on

Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × 11th Gen Intel(R) Core(TM) i7-11700 @ 2.50GHz
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, rocketlake)
Threads: 1 default, 0 interactive, 1 GC (on 16 virtual cores)
@jmichel7 jmichel7 changed the title Memory corruption when interruption a computation with BigInt Memory corruption when interrupting a computation with BigInt Nov 13, 2024
@giordano giordano added the bignums BigInt and BigFloat label Nov 13, 2024
@giordano
Contributor

Can you please try on nightly? With julia v1.11 I sometimes get julia crashing with

julia> function test_hm(rtype,rank)
         m=[rtype(1)//(n+m) for n in 1:rank, m in 1:rank]
         one(m)==m*inv(m)
       end
test_hm (generic function with 1 method)

julia> test_hm(BigInt,100)
^CERROR: InterruptException:
Stacktrace:
  [1] sub!(z::Rational{BigInt}, x::Rational{BigInt}, y::Rational{BigInt})
    @ Base.GMP.MPQ ./gmp.jl:1013
  [2] -
    @ ./gmp.jl:1056 [inlined]
  [3] generic_trimatdiv!(C::Matrix{Rational{BigInt}}, uploc::Char, isunitc::Char, tfun::typeof(identity), A::Matrix{Rational{BigInt}}, B::Matrix{Rational{BigInt}})
    @ LinearAlgebra ~/.julia/juliaup/julia-1.11.1+0.aarch64.apple.darwin14/share/julia/stdlib/v1.11/LinearAlgebra/src/triangular.jl:1360
  [4] _ldiv!
    @ ~/.julia/juliaup/julia-1.11.1+0.aarch64.apple.darwin14/share/julia/stdlib/v1.11/LinearAlgebra/src/triangular.jl:939 [inlined]
  [5] ldiv!
    @ ~/.julia/juliaup/julia-1.11.1+0.aarch64.apple.darwin14/share/julia/stdlib/v1.11/LinearAlgebra/src/triangular.jl:944 [inlined]
  [6] ldiv!
    @ ~/.julia/juliaup/julia-1.11.1+0.aarch64.apple.darwin14/share/julia/stdlib/v1.11/LinearAlgebra/src/lu.jl:503 [inlined]
  [7] ldiv!(Y::Matrix{Rational{BigInt}}, A::LinearAlgebra.LU{Rational{BigInt}, Matrix{Rational{BigInt}}, Vector{Int64}}, B::Matrix{Rational{BigInt}})
    @ LinearAlgebra ~/.julia/juliaup/julia-1.11.1+0.aarch64.apple.darwin14/share/julia/stdlib/v1.11/LinearAlgebra/src/factorization.jl:176
  [8] inv!
    @ ~/.julia/juliaup/julia-1.11.1+0.aarch64.apple.darwin14/share/julia/stdlib/v1.11/LinearAlgebra/src/lu.jl:564 [inlined]
  [9] inv(A::Matrix{Rational{BigInt}})
    @ LinearAlgebra ~/.julia/juliaup/julia-1.11.1+0.aarch64.apple.darwin14/share/julia/stdlib/v1.11/LinearAlgebra/src/dense.jl:993
 [10] test_hm(rtype::Type{BigInt}, rank::Int64)
    @ Main ./REPL[1]:3

julia> test_hm(BigInt,100)
julia(3918,0x205b07ac0) malloc: *** error for object 0x60000a0a04a0: pointer being freed was not allocated
julia(3918,0x205b07ac0) malloc: *** set a breakpoint in malloc_error_break to debug

[3918] signal 6: Abort trap: 6
in expression starting at REPL[2]:1
__pthread_kill at /usr/lib/system/libsystem_kernel.dylib (unknown line)
Allocations: 13256910 (Pool: 13256788; Big: 122); GC: 26

but with nightly I never have this issue. Your example is hard enough to reproduce, though, that I'm not sure whether I'm just lucky.

@jmichel7
Author

I still have the issue with nightly with my big program; the message is

malloc_consolidate(): unaligned fastbin chunk detected

[2345027] signal 6 (-6): Abandon
in expression starting at REPL[19]:1

Indeed, I can no longer reproduce it with the smaller test_hm above. I am pretty sure the problem is with BigInt, because I can run my big program on some smaller examples with Int128 without hitting the problem. It only occurs when I use BigInt.

@oscardssmith
Member

I don't think this is especially surprising or that there's anything we can do about it. In general, interrupting a program arbitrarily can corrupt memory.

@jmichel7
Author

Being able to interrupt computations that take too long is part of the usability of the REPL. I have only ever hit this problem with BigInts; I would have noticed if it happened in any other circumstance.

@jpsamaroo
Member

The issue here is related to #49541 - basically, as @oscardssmith points out, it is currently not safe to interrupt basically any kind of code, because of how aggressive and non-cooperative our interrupt mechanism is. It's pretty good at stopping all kinds of code from continuing to run and subsequently returning to the REPL (to an extent), but because of how good it is at this, no guarantees can be made about the consistency of the current process' data structures (which may be in an inconsistent state as they were interrupted while working on some operation). Basically, you can interrupt just once, but you're left with a potentially garbage Julia session.
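For reference, Base already has a way to defer Ctrl-C around a block: `Base.disable_sigint`, which is documented for wrapping calls into code that is not interrupt safe. The following is a minimal sketch that reuses `test_hm` from the original report. Whether it avoids the crash reported in this issue is untested, and the variable name `ok` is only illustrative.

```julia
using LinearAlgebra

# Same function as in the original report.
function test_hm(rtype, rank)
    m = [rtype(1)//(n+m) for n in 1:rank, m in 1:rank]
    one(m) == m*inv(m)
end

# Ctrl-C handling is disabled on this task while the do-block runs.
# Assumption: this is shown only as an existing knob, not as a verified fix.
ok = Base.disable_sigint() do
    test_hm(BigInt, 5)
end
```

The trade-off is that the computation cannot be interrupted at all while it is inside the block, which works against the usability point raised earlier in the thread.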

The PR I linked would resolve this, not through magically making the current mechanism better, but by providing an alternative cooperative mechanism for allowing running code to gracefully stop itself and clean up safely when an interrupt has been triggered. Yet, even with that PR, much more work needs to be done - more graceful interrupt logic will need to be added to Julia/Base, maybe at allocation points, the top of loops, function entry, etc. There are many possibilities, with a variety of performance and interrupt latency trade-offs, and it's not at all clear that this will come anytime soon, even if that PR is merged.

In summary: it's a hard problem, with some proposed solutions, but it will always be a battle to make graceful interrupts of arbitrary code reliable.
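To illustrate the idea of a cooperative stop at the top of a loop, here is a minimal sketch. It is not the API of the linked PR: `stop_requested` and `cooperative_sum` are hypothetical names, and the flag is set by hand rather than by a signal handler.

```julia
# Hypothetical sketch: `stop_requested` and `cooperative_sum` are illustrative
# names, not part of Julia/Base or of the linked PR.
const stop_requested = Threads.Atomic{Bool}(false)

function cooperative_sum(n::Integer)
    acc = BigInt(0)
    for i in 1:n
        # Check the flag at the top of the loop, a point where `acc` is
        # in a consistent state and it is safe to stop.
        if stop_requested[]
            return (acc, false)   # partial result, flagged as incomplete
        end
        acc += i
    end
    return (acc, true)
end

# Normal run: the flag is clear, so the loop finishes.
stop_requested[] = false
total, finished = cooperative_sum(10)

# Simulated interrupt request: the loop stops at its first check.
stop_requested[] = true
partial, done = cooperative_sum(10)
```

The cost is the one described above: every check adds overhead, and the code only stops when it reaches the next check, so check placement trades performance against interrupt latency.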

@PallHaraldsson
Contributor

it is currently not safe to interrupt basically any kind of code

Arguably, then, it should drop out of the REPL too (or print a warning). I often DO press Ctrl-C when precompiling, and most often it works, or seems to. If it is actually safe, then please do not kill the REPL. I think in some cases it doesn't work, or could be handled better. Could it be caught there and handled gracefully?

@KristofferC
Member

I think in some cases it doesn't work, or could be handled better. Could it be caught there and handled gracefully?

The post you quote from discusses exactly this.

@PallHaraldsson
Contributor

PallHaraldsson commented Nov 13, 2024

Up to now, I've thought that Ctrl-C is safe for Pkg, which is why I use it, and that if Pkg/precompile can't handle it, that's a bug there (which is what I was referring to with that sentence). Likewise, a power failure shouldn't damage Julia's on-disk structures.

I take it you are confirming that on-disk structures aren't safe when Pkg is aborted, even if I don't use Pkg again in the same session.

Back to the issue here, I confirm a bug with yet another error:

julia> @time m * m3
malloc_consolidate(): invalid chunk size

[2907418] signal 6 (-6): Aborted
in expression starting at REPL[38]:1
^C

Most probably inv alone might also fail, as might any BigInt operation. It's not as if either is conceptually mutable, so I think all BigInt (and likely BigFloat) operations are risky.

[Then the REPL froze, maybe since I pressed CTRL-C again.]

https://forums.raspberrypi.com/viewtopic.php?t=367805

[Not hardware related here, but:] or continuing to use memory after freeing it, which eventually trashes some of libc's metadata.

https://stackoverflow.com/questions/18760999/sample-example-program-to-get-the-malloc-consolidate-error

Running valgrind is extremely slow. Maybe you can try setenv MALLOC_CHECK_ to 1 and run your program to see any diagnosis message first.

https://stackoverflow.com/questions/18153746/what-is-the-difference-between-glibcs-malloc-check-m-check-action-and-mcheck

I'm guessing there's really no way to recover from this after the fact, as with all such C code. This would also fail in C (and in Python and any other language wrapping this?).

For Julia code, i.e. code relying on the GC, it likely shouldn't be an issue, but any mutable structure that was half processed is still in danger.

Wouldn't all immutable memory structures (including those managed by the GC) survive this?
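The "half processed" case for pure Julia code can be sketched as follows. This is a simulation under stated assumptions: `scale_in_place!` and its `fail_at` keyword are hypothetical, and an explicit `throw` stands in for a Ctrl-C arriving midway through an in-place update.

```julia
# Hypothetical sketch: `scale_in_place!` is an illustrative helper; the explicit
# `throw` stands in for a Ctrl-C arriving midway through the update.
function scale_in_place!(v::Vector{Int}, factor::Int; fail_at::Int = 0)
    for i in eachindex(v)
        # Simulated interrupt before element `fail_at` is updated.
        i == fail_at && throw(InterruptException())
        v[i] *= factor
    end
    return v
end

v = [1, 2, 3, 4]
interrupted = try
    scale_in_place!(v, 10; fail_at = 3)
    false
catch err
    err isa InterruptException
end
# `v` is now half-updated: the first two entries are scaled, the rest are not.
```

Here the vector is logically inconsistent but the session is still memory safe. That is different from the crashes reported above, where the allocator's own metadata appears to have been damaged.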
