I encountered an error while trying to execute a docstring code example from the file keras_nlp.src.models.gpt2.causal_lm.py, and I have reproduced the example code below:
The error is clear: the -1 value. I've traced the error to the following function from the file keras.src.backend.tensorflow.trainer:
@tf.autograph.experimental.do_not_convert
def one_step_on_iterator(iterator):
    """Runs a single training step given a Dataset iterator."""
    data = next(iterator)
    outputs = self.distribute_strategy.run(
        one_step_on_data, args=(data,)
    )
    outputs = reduce_per_replica(
        outputs,
        self.distribute_strategy,
        reduction="auto",
    )
    return outputs
The line data = next(iterator) computes the labels, so the -1 value is created here. The iterator argument is a TensorFlow OwnedIterator, implemented in the file tensorflow.python.data.ops.iterator_ops; the executed function is reproduced below:
def _next_internal(self):
    autograph_status = autograph_ctx.control_status_ctx().status
    autograph_disabled = autograph_status == autograph_ctx.Status.DISABLED
    if not context.executing_eagerly() and autograph_disabled:
        self._get_next_call_count += 1
        if self._get_next_call_count > GET_NEXT_CALL_ERROR_THRESHOLD:
            raise ValueError(GET_NEXT_CALL_ERROR_MESSAGE)
    if not context.executing_eagerly():
        # TODO(b/169442955): Investigate the need for this colocation constraint.
        with ops.colocate_with(self._iterator_resource):
            ret = gen_dataset_ops.iterator_get_next(
                self._iterator_resource,
                output_types=self._flat_output_types,
                output_shapes=self._flat_output_shapes)
        return structure.from_compatible_tensor_list(self._element_spec, ret)
This executes gen_dataset_ops.iterator_get_next from the file tensorflow.python.data.ops.gen_dataset_ops and, from there, the relevant op, which I didn't trace further since it also leads to C++ execution code.
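One way to check whether the -1 value is already present in what next(iterator) returns, before it reaches the training step, is to scan a few elements pulled from the iterator. The following is only a minimal sketch with plain Python lists standing in for tensors: find_value and scan_iterator are hypothetical helpers, not part of Keras or TensorFlow, and the (features, labels) batch layout and the token values are assumptions made for illustration.

```python
from itertools import islice


def find_value(element, target=-1, path=()):
    """Return the paths inside a nested batch element where `target` occurs."""
    hits = []
    if isinstance(element, dict):
        for key, value in element.items():
            hits.extend(find_value(value, target, path + (key,)))
    elif isinstance(element, (list, tuple)):
        for index, value in enumerate(element):
            hits.extend(find_value(value, target, path + (index,)))
    elif element == target:
        hits.append(path)
    return hits


def scan_iterator(iterator, target=-1, max_batches=3):
    """Pull up to `max_batches` elements and report where `target` occurs."""
    report = {}
    for step, batch in enumerate(islice(iterator, max_batches)):
        hits = find_value(batch, target)
        if hits:
            report[step] = hits
    return report


# Made-up stand-in batches in an assumed (features, labels) layout.
batches = iter([
    ({"token_ids": [[5, 1, 2]], "padding_mask": [[1, 1, 1]]}, [[1, 2, -1]]),
    ({"token_ids": [[5, 3, 4]], "padding_mask": [[1, 1, 1]]}, [[3, 4, 0]]),
])
report = scan_iterator(batches)
print(report)
```

Each reported path reads from the outside in, so a path starting with index 1 would point into the labels under the assumed layout. With real eager tensors, each leaf would first need converting to nested lists (for example via .numpy().tolist()) before being scanned this way.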
Environment
Linux 6.5.0-26-generic #26~22.04.1-Ubuntu
keras - 3.5.0
python - 3.10.12
tensorflow - 2.17.0
kerasNLP - 0.14.4
Additional tensorflow info:
2024-08-19 12:20:02.135293: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-08-19 12:20:02.154198: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-08-19 12:20:02.159831: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-08-19 12:20:02.174579: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-08-19 12:20:03.092334: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-08-19 12:20:04.517556: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:266] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Describe the bug
The following is a comprehensive description of the error, reproduced and debugged using pdb. The trace through keras.src.backend.tensorflow.trainer and tensorflow.python.data.ops.iterator_ops, together with the environment details, is given above.
To Reproduce
Link to a Colab Notebook
Expected behavior
I expected the model to train normally by running the fit() function without any complications and return a History object.
Would you like to help us fix it?