There are large inconsistencies in the results when running a single forward pass of a torchrec model (a dense layer, a sparse layer, a weighted sparse layer, and an over layer) under distributed and non-distributed settings.
Below is the code to reproduce the inconsistency. In the code, I create a model and inputs and quantize the model with dtype = torch.qint8 and output_dtype = torch.qint8. I then run a forward pass with both the distributed model and the non-distributed model. Since the model's weights are copied, I expect their results to match; however, there are large inconsistencies, shown in the logs below. The environment is Python 3.10.14, torch 2.3.0+cu121, torchrec 0.7.0.
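For reference, a minimal sketch of the quantization step described above, assuming the model's sparse layers are torchrec EmbeddingBagCollection modules and following the module-swap pattern used in torchrec's quantization tests; the helper name quantize_sparse and its output_dtype parameter are placeholders, not taken from the actual reproduction script:

```python
import torch
import torch.quantization as quant
from torchrec.modules.embedding_modules import EmbeddingBagCollection
from torchrec.quant.embedding_modules import (
    EmbeddingBagCollection as QuantEmbeddingBagCollection,
)


def quantize_sparse(module: torch.nn.Module, output_dtype=torch.qint8) -> torch.nn.Module:
    # Hypothetical helper: quantize both the embedding weights and the
    # module outputs to qint8, matching dtype/output_dtype in the report.
    qconfig = quant.QConfig(
        activation=quant.PlaceholderObserver.with_args(dtype=output_dtype),
        weight=quant.PlaceholderObserver.with_args(dtype=torch.qint8),
    )
    # Swap every float EmbeddingBagCollection for its quantized counterpart.
    return quant.quantize_dynamic(
        module,
        qconfig_spec={EmbeddingBagCollection: qconfig},
        mapping={EmbeddingBagCollection: QuantEmbeddingBagCollection},
        inplace=False,
    )
```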
Note that this code is an updated version of a reproduction script originally written for torchrec 0.2.0. When running that script under 0.2.0, the sparse layer prints NaN output.
The inconsistencies should be bugs: the distributed model and the non-distributed model have the same parameters and inputs, so a single forward pass should return the same results.
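A minimal sketch of the consistency check that produces the Linf numbers in the logs below; the variable names mirror the per-layer outputs referenced in the traceback (local_* for the sharded model, global_* for the non-distributed model) and are otherwise placeholders:

```python
import torch


def report_linf(name: str, local: torch.Tensor, global_: torch.Tensor) -> None:
    # L-infinity distance between sharded and non-sharded outputs;
    # with identical weights and inputs this should be (close to) zero.
    diff = torch.max(torch.abs(local - global_))
    print(f"Linf {name}: ", diff)


# Hypothetical usage with the per-layer outputs from the reproduction script:
# report_linf("dense", local_dense_r, global_dense_r)
# report_linf("sparse", local_sparse_r.values(), global_sparse_r.values())
# report_linf("sparse weighted",
#             local_sparse_weighted_r.values(), global_sparse_weighted_r.values())
```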
Linf: tensor(nan, device='cuda:0')
Linf dense: tensor(0., device='cuda:0')
Linf sparse: tensor(nan, device='cuda:0')
Traceback (most recent call last):
  File "/root/miniconda3/envs/pt112tr02/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/envs/pt112tr02/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/mnt/util/reproduce_quant_nan_2.py", line 312, in <module>
    main()
  File "/mnt/util/reproduce_quant_nan_2.py", line 303, in main
    main_test_quant(
  File "/mnt/util/reproduce_quant_nan_2.py", line 265, in main_test_quant
    sharding_single_rank_test(
  File "/mnt/util/reproduce_quant_nan_2.py", line 213, in sharding_single_rank_test
    print("Linf sparse weighted: ", torch.max(torch.abs(local_sparse_weighted_r.values() - global_sparse_weighted_r.values())))
RuntimeError: The size of tensor a (1212) must match the size of tensor b (1236) at non-singleton dimension 1