Use Octavian.jl for large mixed-mode CPU calculations (#125)

Open · wants to merge 2 commits into base: master
Conversation

maleadt (Member) commented Jul 2, 2023

LinearAlgebra is hilariously slow for large mixed-mode multiplications (i.e., element-type combinations not supported by BLAS):

# non mixed-mode
julia> @benchmark mul!(C, A, B) setup=(C=zeros(Float32, 2048, 2048); A=rand(Float32, 2048, 2048); B=rand(Float32, 2048, 2048))
BenchmarkTools.Trial: 289 samples with 1 evaluation.
 Range (min … max):  12.774 ms …  15.729 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     13.110 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   13.218 ms ± 469.316 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▁▁ ▁█▄▄▃▄
  ▃▆██▇██████▅▇▅▃▃▁▁▁▁▂▁▁▁▂▂▃▁▂▂▁▁▃▁▁▃▁▂▁▁▁▂▂▂▁▁▁▁▁▁▂▁▁▁▂▁▁▃▂▂ ▃
  12.8 ms         Histogram: frequency by time         15.4 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

# mixed-mode
julia> @benchmark mul!(C, A, B) setup=(C=zeros(Float32, 2048, 2048); A=rand(Float16, 2048, 2048); B=rand(Float16, 2048, 2048))
BenchmarkTools.Trial: 8 samples with 1 evaluation.
 Range (min … max):  8.342 s …   8.429 s  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     8.361 s              ┊ GC (median):    0.00%
 Time  (mean ± σ):   8.375 s ± 28.960 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁        █ ▁ ▁                    ▁▁                    ▁
  █▁▁▁▁▁▁▁▁█▁█▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁██▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  8.34 s         Histogram: frequency by time        8.43 s <

 Memory estimate: 20.81 KiB, allocs estimate: 3.

Octavian.jl fares quite a bit better:

julia> @benchmark matmul!(C, A, B) setup=(C=zeros(Float32, 2048, 2048); A=rand(Float32, 2048, 2048); B=rand(Float32, 2048, 2048))
BenchmarkTools.Trial: 452 samples with 1 evaluation.
 Range (min … max):  128.814 ms … 132.015 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     129.092 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   129.234 ms ± 412.416 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

      ▂▂ ▂█▂▂
  ▄▄▆██████████▇▅▆▅▃▅▃▄▃▃▄▄▂▂▃▃▃▃▄▂▃▃▃▃▁▃▂▃▃▁▁▁▃▂▂▂▁▂▃▂▃▃▂▁▃▃▂▃ ▃
  129 ms           Histogram: frequency by time          130 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

However, replacing all of our LinearAlgebra.mul! uses with Octavian.matmul! regresses test time. @chriselrod is that expected? I guess there's a significant compilation-time overhead for invoking Octavian.jl with many differently typed and sized inputs?

For now, only use Octavian for large mixed-mode cases, which gets test times back to before #124.
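The dispatch described above can be sketched as follows. This is an illustrative helper, not the PR's actual code; the function name, the "large" threshold, and the mixed-mode test are all placeholders:

```julia
using LinearAlgebra, Octavian

# Hypothetical helper: route large mixed-mode products to Octavian.matmul!
# and everything else to LinearAlgebra.mul!. The size threshold is a
# placeholder, not the value used in the PR.
function mixed_mul!(C::AbstractMatrix, A::AbstractMatrix, B::AbstractMatrix)
    mixed = !(eltype(C) == eltype(A) == eltype(B))  # not handled by BLAS
    large = length(C) >= 2^20                       # "large" is a placeholder
    if mixed && large
        Octavian.matmul!(C, A, B)
    else
        LinearAlgebra.mul!(C, A, B)
    end
    return C
end
```

Keeping the small and non-mixed cases on `mul!` avoids paying Octavian's compilation latency for inputs where BLAS (or the generic fallback) is already fast.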

maleadt (Member, Author) commented Jul 2, 2023

Benchmark results for commit 4dac743 (comparing to 51bf8ee):
No regressions or improvements detected.

maleadt (Member, Author) commented Jul 2, 2023

Interestingly, this only speeds things up on 1.9. I can't imagine Octavian.jl being that much slower on <1.9?

codecov bot commented Jul 2, 2023

Codecov Report

Patch and project coverage are unchanged.

Comparing base (781f1de, 30.27%) to head (4dac743, 30.27%).

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #125   +/-   ##
=======================================
  Coverage   30.27%   30.27%           
=======================================
  Files          11       11           
  Lines         786      786           
=======================================
  Hits          238      238           
  Misses        548      548           


chriselrod commented
For timings, I get

julia> @time using Octavian
  0.217284 seconds (396.12 k allocations: 21.375 MiB, 6.10% gc time, 6.09% compilation time)

julia> @benchmark matmul!(C, A, B) setup=(C=zeros(Float32, 2048, 2048); A=rand(Float32, 2048, 2048); B=rand(Float32, 2048, 2048))
BenchmarkTools.Trial: 13 samples with 1 evaluation.
 Range (min … max):  43.139 ms … 44.684 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     43.791 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   43.750 ms ± 447.341 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁  ▁   ▁   ▁ ▁   ▁       ▁▁█              ▁ ▁              ▁  
  █▁▁█▁▁▁█▁▁▁█▁█▁▁▁█▁▁▁▁▁▁▁███▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  43.1 ms         Histogram: frequency by time         44.7 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark matmul!(C, A, B) setup=(C=zeros(Float32, 2048, 2048); A=rand(Float16, 2048, 2048); B=rand(Float16, 2048, 2048))
BenchmarkTools.Trial: 14 samples with 1 evaluation.
 Range (min … max):  42.711 ms … 43.548 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     43.004 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   43.067 ms ± 267.509 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁      ▁▁  █ █              ▁▁ ▁   ▁         ▁            ▁▁  
  █▁▁▁▁▁▁██▁▁█▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁██▁█▁▁▁█▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁██ ▁
  42.7 ms         Histogram: frequency by time         43.5 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark mul!(C, A, B) setup=(C=zeros(Float32, 2048, 2048); A=rand(Float32, 2048, 2048); B=rand(Float32, 2048, 2048))
BenchmarkTools.Trial: 19 samples with 1 evaluation.
 Range (min … max):  44.262 ms … 54.795 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     45.080 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   47.153 ms ±  3.564 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

     █▂                                                        
  ▅▅▁██▅▁▅▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▁▅▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▅ ▁
  44.3 ms         Histogram: frequency by time        54.8 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> versioninfo()
Julia Version 1.10.0-DEV.1608
Commit 0e8af1c162 (2023-06-30 04:06 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: 8 × 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, tigerlake)
  Threads: 11 on 8 virtual cores
Environment:
  JULIA_PATH = @.
  LD_LIBRARY_PATH = /usr/local/lib/
  JULIA_NUM_THREADS = 8

julia> BLAS.get_config()
LinearAlgebra.BLAS.LBTConfig
Libraries: 
├ [ILP64] libmkl_rt.so
└ [ LP64] libmkl_rt.so

Which, aside from mul!, are much better timings than you report here. My laptop isn't a particularly powerful machine; perhaps you started Julia with only a single thread?

That said, GitHub Actions CI is generally restricted to 1 core, so single-threaded is probably representative. I don't know about Buildkite.

chriselrod commented
Interestingly, this only speeds up 1.9. I can't imagine Octavian.jl being that much slower on <1.9?

I'm surprised the cutoff isn't 1.8, as 1.8 added --code-coverage=user, which made a tremendous difference vs --code-coverage=all for Octavian.

However, replacing all of our LinearAlgebra.mul! uses with Octavian.matmul! regresses test time. @chriselrod is that expected? I guess there's a significant compilation-time overhead for invoking Octavian.jl with many differently typed and sized inputs?

It should not be compiling for differently sized inputs, only for different types.
That said, latency is significant. Without code coverage:

julia> C=zeros(Float32, 2048, 2048); A=rand(Float16, 2048, 2048); B=rand(Float16, 2048, 2048);   

julia> @time using Octavian
  0.205357 seconds (396.14 k allocations: 21.375 MiB, 2.34% gc time, 6.29% compilation time)

julia> @time @eval matmul!(C,A,B);
 10.354272 seconds (25.52 M allocations: 1.312 GiB, 2.72% gc time, 99.67% compilation time)

Code coverage:

julia> @time @eval matmul!(C,A,B);
202.818763 seconds (82.94 M allocations: 3.568 GiB, 0.28% gc time, 34.71% compilation time)

But hopefully only GemmKernels' coverage gets taken with --code-coverage=user?
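Since Octavian specializes per element-type combination rather than per size, one way to amortize that first-call latency (a sketch, assuming the type combinations used in the test suite are known up front) is to warm each combination up once on tiny inputs:

```julia
using Octavian

# Warm up matmul! once per element-type combination used in the tests.
# Subsequent calls with different sizes but the same types reuse the
# already-compiled specialization, so the matrices here can be tiny.
for (TC, TA, TB) in ((Float32, Float16, Float16),
                     (Float32, Float32, Float32))
    C = zeros(TC, 4, 4)
    A = rand(TA, 4, 4)
    B = rand(TB, 4, 4)
    Octavian.matmul!(C, A, B)
end
```

The type combinations listed above are examples; a real warm-up loop would enumerate whatever combinations the test suite actually exercises.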

maleadt (Member, Author) commented Jul 2, 2023

Thanks for the input!

Yes, we're only using a single thread, as we use multiple processes to run multiple tests in parallel.
However, I had not started with OPENBLAS_NUM_THREADS=1, so the comparison to OpenBLAS above was unfair; with that set, the timings are actually much closer to what you report. Still, running the entire GemmKernels.jl test suite with Octavian.jl is much slower than with OpenBLAS. I'll have to look into this more closely.
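For a fair single-threaded comparison, the BLAS thread count can also be pinned from within Julia instead of via the environment (a small sketch; Julia's own thread count is fixed at startup, e.g. with `julia --threads=1`):

```julia
using LinearAlgebra

# Pin the loaded BLAS library (OpenBLAS, MKL, ...) to a single thread,
# so it doesn't get an unfair multi-threaded advantage in benchmarks.
BLAS.set_num_threads(1)
@show BLAS.get_num_threads()
@show Threads.nthreads()  # Julia-level threads, relevant for Octavian
```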

But hopefully only GemmKernels' coverage gets taken with --code-coverage=user?

We're just setting coverage=true with Pkg.test. It doesn't seem like that uses --code-coverage=user (https://github.com/JuliaLang/Pkg.jl/blob/e8197dd0ed8132d4a7619f3657363c8415249c47/src/Operations.jl#L1672-L1681), but I don't think it's the equivalent of =all either?

maleadt (Member, Author) commented Jul 3, 2023

But hopefully only GemmKernels' coverage gets taken with --code-coverage=user?

We're just setting coverage=true with Pkg.test. It doesn't seem like that uses --code-coverage=user (https://github.com/JuliaLang/Pkg.jl/blob/e8197dd0ed8132d4a7619f3657363c8415249c47/src/Operations.jl#L1672-L1681), but I don't think it's the equivalent of =all either?

Disabling coverage on 1.6-1.8 didn't help, so this seems like a different issue.
