
[Bug] Apple Metal/MPS -- TVM/MLC-LLM won't compile from source #2540

Closed · BuildBackBuehler opened this issue Jun 7, 2024 · 8 comments
Labels: bug (Confirmed bugs)

BuildBackBuehler commented Jun 7, 2024

🐛 Bug

To Reproduce

Steps to reproduce the behavior:

I've compiled each of them a few times before. But since I updated and attempted to compile, I've been unable to (except once -- not sure if that was pure luck or a matter of it being a stock, no-features build). For what it's worth, that one success was off a fresh clone of the git repo, whereas dropping the features back to stock after a failed build still fails.

Features that seem to exacerbate the issue: BLAS, MKL, CoreML, Arm Compute Library -- basically anything that'd go through MPS causes this foundational error. (There may also be an error about src/runtime/contrib/ACL/allocator.cc being unable to find arm_compute/runtime/IAllocator.h and arm_compute/core/Types.h, which doesn't make sense, as I've gone out of my way to add the precise directories IAllocator.h lives in to my include flags/CMake config: ACL/arm_compute/core and ACL/arm_compute/runtime.)
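(For reference, TVM's ACL contrib sources spell those includes relative to the ACL root, along these lines -- a sketch, with <ACL_ROOT> standing in for wherever the library is unpacked:

    // e.g. in TVM's Arm Compute Library contrib sources
    #include <arm_compute/core/Types.h>
    #include <arm_compute/runtime/IAllocator.h>

If that's the case here, the include flag needs to point at the directory that contains arm_compute/ itself, i.e. -I<ACL_ROOT>, rather than at ACL/arm_compute/core or ACL/arm_compute/runtime; with only the deeper paths on the search list, the arm_compute/... prefix in the include won't resolve.)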

[ 53%] Building CXX object tvm/CMakeFiles/tvm_objs.dir/src/tir/analysis/verify_ssa.cc.o
/Users/zack/.home/gitrepos/LLMLife/backend/tvm-unity/src/runtime/contrib/mps/conv.mm:36:25: error: 'CopyDataFromTo' is a protected member of 'tvm::runtime::metal::MetalWorkspace'
   36 |   entry_ptr->metal_api->CopyDataFromTo((__bridge void*)mtlbuf, 0, (__bridge void*)temp, 0,
      |                         ^
/Users/zack/.home/gitrepos/LLMLife/backend/tvm-unity/src/runtime/contrib/mps/../../metal/metal_common.h:187:8: note: declared protected here
  187 |   void CopyDataFromTo(const void* from, size_t from_size, void* to, size_t to_size, size_t size,
      |        ^
/Users/zack/.home/gitrepos/LLMLife/backend/tvm-unity/src/runtime/contrib/mps/conv.mm:72:25: error: 'CopyDataFromTo' is a protected member of 'tvm::runtime::metal::MetalWorkspace'
   72 |   entry_ptr->metal_api->CopyDataFromTo((__bridge void*)temp, 0, (__bridge void*)mtlbuf, 0,
      |                         ^
/Users/zack/.home/gitrepos/LLMLife/backend/tvm-unity/src/runtime/contrib/mps/../../metal/metal_common.h:187:8: note: declared protected here
  187 |   void CopyDataFromTo(const void* from, size_t from_size, void* to, size_t to_size, size_t size,
      |        ^
/Users/zack/.home/gitrepos/LLMLife/backend/tvm-unity/src/runtime/contrib/mps/conv.mm:106:53: error: no member named 'GetCommandQueue' in 'tvm::runtime::metal::MetalWorkspace'
  106 |   id<MTLCommandQueue> queue = entry_ptr->metal_api->GetCommandQueue(data->device);
      |                               ~~~~~~~~~~~~~~~~~~~~  ^
/Users/zack/.home/gitrepos/LLMLife/backend/tvm-unity/src/runtime/contrib/mps/conv.mm:115:25: error: 'CopyDataFromTo' is a protected member of 'tvm::runtime::metal::MetalWorkspace'
  115 |   entry_ptr->metal_api->CopyDataFromTo((__bridge void*)bufB, 0, (__bridge void*)tempB, 0,
      |                         ^
/Users/zack/.home/gitrepos/LLMLife/backend/tvm-unity/src/runtime/contrib/mps/../../metal/metal_common.h:187:8: note: declared protected here
  187 |   void CopyDataFromTo(const void* from, size_t from_size, void* to, size_t to_size, size_t size,
      |        ^
4 errors generated.

I originally posted about this issue on the TVM side 6 days ago, but it's inactive: mlc-ai/relax#321. There are more errors and general information there for context if needed.

Environment

  • Platform (e.g. WebGPU/Vulkan/IOS/Android/CUDA): Metal/MPS
  • Operating system (e.g. Ubuntu/Windows/MacOS/...): macOS 14.5
  • Device (e.g. iPhone 12 Pro, PC+RTX 3090, ...): MBP M1 Max
  • How you installed MLC-LLM (conda, source): Source (Github)
  • How you installed TVM-Unity (pip, source): Source (Github)
  • Python version (e.g. 3.10): 3.11.9
  • GPU driver version (if applicable):
  • CUDA/cuDNN version (if applicable):
  • TVM Unity Hash Tag (python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))", applicable if you compile models): [Haven't actually reinstalled TVM as a Python instance, just built it.]
  • Any other relevant information:

Additional context

Sadly, I couldn't find anything on the net about how to fix this error. I figured there'd be a lot of these MTLCommandQueue errors out there, but found nothing concrete.

I've only tried compiling MLC-LLM once or twice, and that was with my TVM (the one that compiled that one time -- well, the .dylibs did, but not a pure 100% instance; I had attempted more builds after). I suppose I'll give compiling MLC with a third-party TVM a try, but I need to manipulate MLC's quantization file so I can import a custom-quantized model of mine.

BuildBackBuehler added the bug (Confirmed bugs) label Jun 7, 2024
tqchen (Contributor) commented Jun 7, 2024

Likely you don't need to turn on Arm Compute Library and MPS, since we generate our own Metal code.

BuildBackBuehler (Author) commented

Hm, funny -- I came back here to comment that I got MLC-LLM to compile without MPS on. The problem is that when I then went to compile a model, I kept getting:

    raise ValueError(f"No target detected from device: {hint}. Please specify explicitly")

And when I included --device metal:

File "/Users/zack/.home/gitrepos/LLMLife/frontend/mlc-llm/python/mlc_llm/support/auto_device.py", line 42, in detect_device
    raise ValueError(f"Device is not found on your local environment: {device_hint}")

I also have device="mps" set as an environment variable, and MTLDevice=1.

tqchen (Contributor) commented Jun 7, 2024

ah, you need to write device="metal"

BuildBackBuehler (Author) commented Jun 7, 2024

> ah, you need to write device="metal"

🤦‍♂️ it's always something so stupid, agh! Thank you

Welp: ValueError: Cannot detect device `metal(0)`. Please make sure the device and its driver is installed properly, and TVM is compiled with the driver. I'm guessing that in the compilation that worked, I must've neglected to even include Metal. Will see what happens when I do another clean build + Metal.

And while I'm here, I neglected to mention the warnings that always serve as a precursor to the 4 "foundational" errors I mentioned.

[ 16%] Building CXX object CMakeFiles/tvm_runtime_objs.dir/src/runtime/workspace_pool.cc.o
/Users/zack/.home/gitrepos/LLMLife/backend/tvm-unity/src/runtime/threading_backend.cc:296:30: warning: zero as null pointer constant [-Wzero-as-null-pointer-constant]
  296 |     SetThreadFullCpuAffinity(CURRENT_THREAD_HANDLE, mode);
      |                              ^
/Users/zack/.home/gitrepos/LLMLife/backend/tvm-unity/src/runtime/threading_backend.cc:51:77: note: expanded from macro 'CURRENT_THREAD_HANDLE'
   51 | #define CURRENT_THREAD_HANDLE (static_cast<std::thread::native_handle_type>(0))
      |                                                                             ^
/Users/zack/.home/gitrepos/LLMLife/backend/tvm-unity/src/runtime/threading_backend.cc:441:25: warning: zero as null pointer constant [-Wzero-as-null-pointer-constant]
  441 |       SetThreadAffinity(CURRENT_THREAD_HANDLE,
      |                         ^
/Users/zack/.home/gitrepos/LLMLife/backend/tvm-unity/src/runtime/threading_backend.cc:51:77: note: expanded from macro 'CURRENT_THREAD_HANDLE'
   51 | #define CURRENT_THREAD_HANDLE (static_cast<std::thread::native_handle_type>(0))
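(These warnings come from the macro expanding 0 into std::thread::native_handle_type, which on macOS is pthread_t, a pointer type. They are warnings rather than errors, but a minimal sketch of a local fix -- my assumption, not an upstream patch -- is to spell the null handle with nullptr:

    // threading_backend.cc sketch; assumes native_handle_type is a pointer
    // type on this platform, as pthread_t is on macOS.
    #define CURRENT_THREAD_HANDLE \
      (static_cast<std::thread::native_handle_type>(nullptr))
)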

Edit: TVM just compiled with MPS off! (!!! -- and with ACL, MKL, and whatever else still enabled.)
I imagine that when I recompile MLC it'll work. Should I leave this open, since it's still...technically an issue? 😂 At least I had been compiling with MPS on for months.

BuildBackBuehler (Author) commented Jun 7, 2024

I have a feeling this is the answer to my prayers.

I had swapped out DMLC-Core and the related files that changed when it was updated, without luck. I then sought to check when those threading/pool files were last modified.

I haven't tried those files yet, but I figure it must be Apple's policies around threading, and this seems to confirm it.

But I dunno -- I just skimmed and saw a couple of apples, though maybe I compared them to a couple of oranges.

These just came up in my search, so:
https://github.com/JuvignyEnsta/IN203_SUPPORT_COURS/blob/master/Examples/thread_extension.cpp
https://github.com/rurban/smhasher/blob/master/Platform.h

And I noticed this change:
mlc-ai/relax@3a42361

BuildBackBuehler (Author) commented Jun 12, 2024

I was able to compile TVM/MLC, but it's producing segmentation faults on weight conversion of my Codestral model. I've also gotten errors compiling a 3-bit OmniQuant Llama model (gen_config worked fine) and when trying to chat with an AQLM 2-bit model I had managed to compile previously. However, I'm not sure the .dylib I'd compiled was legitimate (I used no-quant for gen_config/compilation), so I'd need to double back for another compile anyway.

Seems the root cause is that there's only a

    protected: CopyDataFromTo(vars, etc. etc.)

declared in runtime/metal/metal_common.h -- no public CopyDataFromTo is defined. There's also no GetCommandQueue (used in metal_api.mm and conv.mm, IIRC) defined in metal_common.h.

I remedied those issues. Then it was only a matter of introducing the PublicCopyDataFromTo function and the GetCommandQueue definition in conv.mm and gemm.mm (the contrib/MPS files) and metal_api.mm. Well, that's if the resolution is kosher -- judging from the errors I'm experiencing, it breaks something important. (I'm guessing the data should be transferred in a protected state; I'm thinking there must've been a discrepancy (in conv.mm or metal_api?) or a missing definition (GetCommandQueue) prohibiting protected data transference.)
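(For clarity: protected here is plain C++ access control -- it only restricts which code may call the method at compile time, and says nothing about how the data itself is moved. A minimal standalone sketch of the pattern, reusing the PublicCopyDataFromTo name from above as illustration rather than the actual upstream API:

    // Toy illustration of the access-control error, not TVM code.
    #include <cstddef>

    class Workspace {
     protected:
      void CopyDataFromTo(const void* from, void* to, std::size_t size) { /* ... */ }

     public:
      // A public wrapper lets outside callers (e.g. contrib/mps) reach the
      // protected member, since a call inside the class body is legal.
      void PublicCopyDataFromTo(const void* from, void* to, std::size_t size) {
        CopyDataFromTo(from, to, size);
      }
    };

    // ws->CopyDataFromTo(...)        // error: protected member, like conv.mm hits
    // ws->PublicCopyDataFromTo(...)  // compiles

The GetCommandQueue error is different: the method apparently no longer exists on MetalWorkspace at all, presumably removed or renamed in a Metal runtime refactor that the contrib/mps files weren't updated to follow.)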

Compiling with arguments:
  --config          LlamaConfig(hidden_size=8192, intermediate_size=28672, num_attention_heads=64, num_hidden_layers=80, rms_norm_eps=1e-05, vocab_size=128256, position_embedding_base=500000.0, context_window_size=8192, prefill_chunk_size=2048, num_key_value_heads=8, head_dim=128, tensor_parallel_shards=1, max_batch_size=80, kwargs={})
  --quantization    GroupQuantize(name='AQLM_2bit', kind='group-quant', group_size=16, quantize_dtype='int2', storage_dtype='uint32', model_dtype='float16', linear_weight_layout='NK', quantize_embedding=True, quantize_final_fc=True, num_elem_per_storage=16, num_storage_per_group=1, max_int_value=1)
  --model-type      llama
  --target          {"max_num_threads": 256, "max_shared_memory_per_block": 32768, "max_function_args": 31, "max_threads_per_block": 1024, "thread_warp_size": 32, "keys": ["metal", "gpu"], "host": {"keys": ["cpu"], "mtriple": "5", "tag": "", "kind": "llvm"}, "tag": "", "kind": "metal"}
  --opt             flashinfer=0;cublas_gemm=0;faster_transformer=0;cudagraph=0;cutlass=0;ipc_allreduce_strategy=NONE
  --system-lib-prefix ""
  --output          /Users/zack/.home/local/models/2bitllama/aqlm.dylib
  --overrides       context_window_size=None;sliding_window_size=None;prefill_chunk_size=None;attention_sink_size=None;max_batch_size=None;tensor_parallel_shards=None
[2024-06-12 20:10:05] INFO compile.py:127: Creating model from: LlamaConfig(hidden_size=8192, intermediate_size=28672, num_attention_heads=64, num_hidden_layers=80, rms_norm_eps=1e-05, vocab_size=128256, position_embedding_base=500000.0, context_window_size=8192, prefill_chunk_size=2048, num_key_value_heads=8, head_dim=128, tensor_parallel_shards=1, max_batch_size=80, kwargs={})
[2024-06-12 20:10:05] INFO compile.py:145: Exporting the model to TVM Unity compiler
[2024-06-12 20:10:13] INFO compile.py:151: Running optimizations using TVM Unity
[2024-06-12 20:10:13] INFO compile.py:171: Registering metadata: {'model_type': 'llama', 'quantization': 'AQLM_2bit', 'context_window_size': 8192, 'sliding_window_size': -1, 'attention_sink_size': -1, 'prefill_chunk_size': 2048, 'tensor_parallel_shards': 1, 'kv_state_kind': 'kv_cache', 'max_batch_size': 80}
[2024-06-12 20:10:15] INFO pipeline.py:52: Running TVM Relax graph-level optimizations
[2024-06-12 20:22:53] INFO pipeline.py:52: Lowering to TVM TIR kernels
[2024-06-12 20:23:09] INFO pipeline.py:52: Running TVM TIR-level optimizations
[2024-06-12 20:24:30] INFO pipeline.py:52: Running TVM Dlight low-level optimizations
[2024-06-12 20:24:31] INFO pipeline.py:52: Lowering to VM bytecode
[2024-06-12 20:24:44] INFO estimate_memory_usage.py:58: [Memory usage] Function `alloc_embedding_tensor`: 32.00 MB
[2024-06-12 20:24:44] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_decode`: 23.12 MB
[2024-06-12 20:24:45] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_decode_to_last_hidden_states`: 24.38 MB
[2024-06-12 20:24:45] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_prefill`: 593.25 MB
[2024-06-12 20:24:45] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_prefill_to_last_hidden_states`: 624.00 MB
[2024-06-12 20:24:45] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_select_last_hidden_states`: 1.25 MB
[2024-06-12 20:24:45] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_verify`: 592.00 MB
[2024-06-12 20:24:45] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_verify_to_last_hidden_states`: 624.00 MB
[2024-06-12 20:24:45] INFO estimate_memory_usage.py:58: [Memory usage] Function `create_tir_paged_kv_cache`: 0.00 MB
[2024-06-12 20:24:45] INFO estimate_memory_usage.py:58: [Memory usage] Function `decode`: 0.29 MB
[2024-06-12 20:24:45] INFO estimate_memory_usage.py:58: [Memory usage] Function `decode_to_last_hidden_states`: 0.30 MB
[2024-06-12 20:24:45] INFO estimate_memory_usage.py:58: [Memory usage] Function `embed`: 32.00 MB
[2024-06-12 20:24:45] INFO estimate_memory_usage.py:58: [Memory usage] Function `gather_hidden_states`: 0.00 MB
[2024-06-12 20:24:45] INFO estimate_memory_usage.py:58: [Memory usage] Function `get_logits`: 0.00 MB
[2024-06-12 20:24:46] INFO estimate_memory_usage.py:58: [Memory usage] Function `prefill`: 592.02 MB
[2024-06-12 20:24:46] INFO estimate_memory_usage.py:58: [Memory usage] Function `prefill_to_last_hidden_states`: 624.00 MB
[2024-06-12 20:24:46] INFO estimate_memory_usage.py:58: [Memory usage] Function `scatter_hidden_states`: 0.00 MB
[2024-06-12 20:24:46] INFO estimate_memory_usage.py:58: [Memory usage] Function `softmax_with_temperature`: 0.00 MB
[2024-06-12 20:24:51] INFO pipeline.py:52: Compiling external modules
[2024-06-12 20:24:51] INFO pipeline.py:52: Compilation complete! Exporting to disk
!!!!!!! TVM encountered a Segfault !!!!!!!
Stack trace:

[1]    30875 segmentation fault  mlc_llm compile /Users/zack/.home/local/models/2bitllama --host 5 --device

BuildBackBuehler (Author) commented Jun 13, 2024

[1] 63036 segmentation fault mlc_llm convert_weight /Users/zack/.home/local/models/Uncensored_Llama-70B
/Users/zack/.home/local/mise/installs/python/3.11.9/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '

I'm losing my sanity at this point. My Python/Poetry are ARM64, so that rules that possibility out. I checked because the multiprocessing errors I've seen reported have all been related to that or other user errors.

My last remaining guess -- and I wish I'd just turned it off when I turned off Arm Compute Library -- is BLAS. I'm feeling a bit dumb now, really, because I believe that when the issues started and I posted this in TVM, I noted Apple BLAS was suspect. I think the code is out of date, because it shoots warnings (and, before I messed around with the Metal/MPS code, errors!) about relying on sgemm and dgemm or whatever functions that are deprecated. And I tried everything under the sun to force CMake to incorporate the new Apple BLAS, without change. So I'll be turning that off now, too.
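(One hedged guess about those deprecation warnings: newer macOS SDKs mark the legacy Accelerate BLAS/LAPACK entry points -- sgemm, dgemm, and friends -- as deprecated unless the code opts into Apple's newer interfaces before the framework header is included. If TVM's BLAS contrib code includes Accelerate without the opt-in, that would explain the warnings; whether it applies to this build is an assumption. A sketch of the opt-in:

    // Must be defined before the framework header is included, e.g. via
    // add_compile_definitions(ACCELERATE_NEW_LAPACK=1) in CMake.
    #define ACCELERATE_NEW_LAPACK 1
    // #define ACCELERATE_LAPACK_ILP64 1  // optional: 64-bit integer indices
    #include <Accelerate/Accelerate.h>
)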

I do normally have tons of modules on: AOTExecutor, UMA, BNNS, Threads, RPC, C++ TVM, C++ RPC, Profiler, Graph Executor, CoreML, TCMalloc, MLIR, Pipeline. I think that might be it 😂😅

BuildBackBuehler (Author) commented Jun 14, 2024

The CMake module Modules/OpenMP.cmake should be updated, because there's nothing Apple-friendly in it.

OpenMPcmake.txt (attached)

/Users/zack/.home/gitrepos/LLMLife/backend/tvm/src/relay/backend/contrib/bnns/codegen.cc:93:16: error: call to 'GetRootCall' is ambiguous
   93 |         call = GetRootCall(body, 1, {"nn.conv2d", add_op_type});
      |                ^~~~~~~~~~~

codegen copycc.txt (attached)
I also had to update this to get rid of the error above.
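(For anyone hitting the same ambiguity: a braced initializer list like {"nn.conv2d", add_op_type} can match more than one GetRootCall overload, and spelling the argument type out is the conventional fix. A sketch, not necessarily what the attached file does:

    // Name the overload explicitly instead of passing a braced list.
    call = GetRootCall(body, 1, std::vector<std::string>{"nn.conv2d", add_op_type});
)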

Sadly, stock, with all options off (except Metal), still segfaults on convert_weight + compile 😭

Edit: 🤦 -- turns out the dang exe binary wasn't updating; no wonder nothing was changing. I just got it working with a stock build. Time to piece it back up to the full shebang.

Edit 2: Also, this should be added to metal_device_api.mm:

    case kAvailableGlobalMemory:
      break;

under the existing block:

    ICHECK_LT(index, devices.size()) << "Invalid device id " << index;
    switch (kind) {
      case kMaxThreadsPerBlock: {
        *rv = static_cast<int>([devices[dev.device_id] maxThreadsPerThreadgroup].width);
        break;
      }
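(Presumably the bare case is needed because kAvailableGlobalMemory was added to the DeviceAttrKind enum and the exhaustive switch now complains without it. If an actual value is ever wanted there, MTLDevice does expose recommendedMaxWorkingSetSize and currentAllocatedSize; a rough Objective-C++ sketch, where treating their difference as "available" is my assumption rather than upstream behavior:

    case kAvailableGlobalMemory: {
      id<MTLDevice> device = devices[dev.device_id];
      // Working-set budget minus what this device currently has allocated.
      *rv = static_cast<int64_t>([device recommendedMaxWorkingSetSize] -
                                 [device currentAllocatedSize]);
      break;
    }
)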

Edit 3: Seems I've gotten everything turned on besides MPS -- hopefully that can be fixed sooner rather than later!
