replace NamedTuple with dataclass #1105
Conversation
all data classes should be serializable recursively using this:

```python
from dataclasses import asdict
import json

json.dumps(asdict(your_dataclass))
```

let me know if this satisfies your use case and whether your PR has any other use cases that are not mentioned here
Thanks for looking at this! I'm not too familiar with dataclasses, but it looks like there's an issue with parsing nested dataclasses, for example:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class Bar:
    y: str

@dataclass
class Foo:
    x: str
    bar: Bar

baz = Foo(x="x", bar=Bar(y="y"))
out = json.dumps(asdict(baz))
# '{"x": "x", "bar": {"y": "y"}}'

Foo(**json.loads(out))
# Foo(x='x', bar={'y': 'y'})
```

We can do it manually, but this will break if new keys are added:

```python
obj = json.loads(out)
Foo(x=obj["x"], bar=Bar(**obj["bar"]))
# Foo(x='x', bar=Bar(y='y'))
```
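For what it's worth, the manual rebuild can be generalized with `dataclasses.fields`. The `from_dict` helper below is a hypothetical sketch, not part of faster-whisper; it only handles fields annotated directly with dataclass types (no `Optional`, lists, or string annotations):

```python
# Minimal recursive deserializer sketch: rebuilds nested dataclasses
# from the dicts produced by asdict()/json round-trips.
import json
from dataclasses import dataclass, asdict, fields, is_dataclass

def from_dict(cls, data):
    """Rebuild a dataclass instance, recursing into dataclass-typed fields."""
    kwargs = {}
    for f in fields(cls):
        value = data[f.name]
        # If the field is itself a dataclass and we got a plain dict,
        # recurse to rebuild the nested instance.
        if is_dataclass(f.type) and isinstance(value, dict):
            value = from_dict(f.type, value)
        kwargs[f.name] = value
    return cls(**kwargs)

@dataclass
class Bar:
    y: str

@dataclass
class Foo:
    x: str
    bar: Bar

out = json.dumps(asdict(Foo(x="x", bar=Bar(y="y"))))
roundtripped = from_dict(Foo, json.loads(out))
# roundtripped == Foo(x='x', bar=Bar(y='y'))
```

Unlike the `Foo(**json.loads(out))` approach, this restores `bar` as a `Bar` instance rather than a dict.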
I didn't consider the parsing case, tbh. Is there a use case in your mind that needs converting the JSON back to a dataclass in faster-whisper? As far as I'm concerned, these classes are pretty stable and no keys have been added or removed since a long time ago, so the manual solution might not be so bad after all. This PR solves the serialization problem with minimal code changes and no additional dependencies. Pydantic might be the best solution, but I think it's overkill in this repo and requires a lot of changes that I might not be comfortable with.
Force-pushed from 4fc33d2 to 3e7d1b8
The main use case for us is when working with AWS SageMaker endpoints, for which we have to serialize parameters and deserialize the output of the transcription. For now we're using a helper library which we've pinned to specific commits of faster-whisper.
You can still use:

```python
import pydantic

from faster_whisper.transcribe import Segment

@pydantic.dataclasses.dataclass
class Segment(Segment):
    ...

Segment(
    **{
        "id": "invalid_id",
        "seek": 0,
        "start": 0.0,
        "end": 1.0,
        "text": "transcription",
        "tokens": [1, 2, 3],
        "avg_logprob": 0.1,
        "compression_ratio": 0.1,
        "no_speech_prob": 0.1,
        "words": None,
        "temperature": 1.0,
    }
)
```

When this PR is merged, you can use the wrapper to add pydantic type validation to the existing dataclass instances. These instances can be passed to any function that expects an instance of the original classes. Since the parameters don't contain nested dataclass instances, deserialization should work perfectly, and you can deserialize the transcription results without going through them recursively.
That's a nice solution. Pydantic supports nested stdlib dataclasses, so I think this will also work for us (since maybe the deserialization use case isn't sufficient to have pydantic as an additional dependency). We'll probably still use Pydantic models for parameter passing.
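To illustrate the nested-stdlib-dataclass support mentioned above, here is a hedged sketch using hypothetical `Point`/`Line` classes (not faster-whisper's); wrapping the outer class with `pydantic.dataclasses.dataclass` is enough for pydantic to validate and coerce nested dicts into the plain stdlib dataclass:

```python
# Sketch: pydantic coerces nested dicts into stdlib dataclass instances.
import dataclasses
import pydantic

@dataclasses.dataclass
class Point:          # plain stdlib dataclass, no pydantic decorator
    x: float
    y: float

@pydantic.dataclasses.dataclass
class Line:           # pydantic-wrapped dataclass with nested Point fields
    start: Point
    end: Point

# The nested dicts are validated and converted to Point instances.
line = Line(start={"x": 0.0, "y": 0.0}, end={"x": 1.0, "y": 2.0})
```

This is what makes the subclass-wrapper approach above work for deserializing transcription output without walking the structure by hand.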
This is to allow conversion of `Segment` and `Words` to `dict` or `json` without recursively going through them. Similar to #667 and #1104.
Unlike NamedTuples, dataclasses are not iterable, so they need to be converted to a dict first.
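The difference can be sketched with a hypothetical `Word`-like class (not the faster-whisper one): a NamedTuple yields its values when iterated, while a dataclass instance raises `TypeError` and must go through `dataclasses.asdict`:

```python
# Contrast: NamedTuple iteration vs. dataclass conversion via asdict().
from dataclasses import dataclass, asdict
from typing import NamedTuple

class WordTuple(NamedTuple):
    start: float
    word: str

@dataclass
class WordData:
    start: float
    word: str

values = list(WordTuple(0.0, "hi"))      # NamedTuples iterate over values
mapping = asdict(WordData(0.0, "hi"))    # dataclasses convert via asdict()

try:
    iter(WordData(0.0, "hi"))            # dataclass instances are not iterable
except TypeError:
    not_iterable = True
```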