
replace NamedTuple with dataclass #1105

Merged: 2 commits merged into SYSTRAN:master from recursive_dict on Nov 5, 2024

Conversation

MahmoudAshraf97 (Collaborator) commented Oct 30, 2024

This is to allow converting Segment and Word objects to a dict or JSON without recursively traversing them. Similar to #667 and #1104.

Unlike NamedTuples, dataclasses are not iterable, so they need to be converted to a dict first.
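A minimal sketch of the difference (PointNT and PointDC are hypothetical examples, not classes from this PR):

from typing import NamedTuple
from dataclasses import dataclass, asdict

class PointNT(NamedTuple):
    x: int
    y: int

@dataclass
class PointDC:
    x: int
    y: int

list(PointNT(1, 2))                   # [1, 2] -- a NamedTuple unpacks like a tuple
# list(PointDC(1, 2))                 # TypeError: 'PointDC' object is not iterable
list(asdict(PointDC(1, 2)).values())  # [1, 2] -- convert to a dict first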

MahmoudAshraf97 (Collaborator, Author) commented Oct 30, 2024

@extrange

All dataclasses should be serializable recursively using this:

from dataclasses import asdict
import json

json.dumps(asdict(your_dataclass))  # asdict recurses into nested dataclasses

Let me know if this satisfies your use case, and whether your PR covers any other use cases that are not mentioned here.

extrange commented

Thanks for looking at this!

I'm not too familiar with dataclasses, but it looks like there's an issue with parsing nested dataclasses, for example:

import json
from dataclasses import dataclass, asdict

@dataclass
class Bar:
    y: str

@dataclass
class Foo:
    x: str
    bar: Bar

baz = Foo(x="x", bar=Bar(y="y"))

out = json.dumps(asdict(baz))
# '{"x": "x", "bar": {"y": "y"}}'

The bar field is neither parsed nor type-checked:

Foo(**json.loads(out))
# Foo(x='x', bar={'y': 'y'})

We can do it manually, but this will break if new keys are added:

obj = json.loads(out)
Foo(x=obj["x"], bar=Bar(**obj["bar"]))
# Foo(x='x', bar=Bar(y='y'))
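
One workaround (a rough sketch; from_dict is a hypothetical helper, and it assumes field annotations are actual classes rather than strings) is to rebuild nested dataclasses generically from dataclasses.fields(), dropping unknown keys:

from dataclasses import fields, is_dataclass

def from_dict(cls, data):
    # Reconstruct a dataclass from a plain dict, recursing into nested
    # dataclasses and silently ignoring keys that cls doesn't declare.
    kwargs = {}
    for f in fields(cls):
        if f.name not in data:
            continue
        value = data[f.name]
        if is_dataclass(f.type) and isinstance(value, dict):
            value = from_dict(f.type, value)
        kwargs[f.name] = value
    return cls(**kwargs)

from_dict(Foo, json.loads(out))
# Foo(x='x', bar=Bar(y='y'))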

MahmoudAshraf97 (Collaborator, Author) commented

I didn't consider the parsing case, to be honest. Is there a use case in your mind that needs converting the JSON back into a dataclass in faster-whisper? As far as I'm concerned, these classes are pretty stable and no keys have been added or removed in a long time, so the manual solution might not be so bad after all.

This PR solves the serialization problem with minimal code changes and no additional dependencies. Pydantic might be the best solution, but I think it's overkill for this repo and would require a lot of changes that I might not be comfortable with.

extrange commented

The main use case for us is when working with AWS SageMaker endpoints, for which we have to serialize parameters and deserialize the output of the transcription.

For now we're using a helper library which we've pinned to specific commits of faster-whisper.

MahmoudAshraf97 (Collaborator, Author) commented Oct 31, 2024

You can still use Pydantic validation, at least for type checking:

import pydantic
from faster_whisper.transcribe import Segment

# Wrapping the existing dataclass makes Pydantic validate its fields on construction
@pydantic.dataclasses.dataclass
class Segment(Segment):
    ...

Segment(
    **{
        "id": "invalid_id",
        "seek": 0,
        "start": 0.0,
        "end": 1.0,
        "text": "transcription",
        "tokens": [1, 2, 3],
        "avg_logprob": 0.1,
        "compression_ratio": 0.1,
        "no_speech_prob": 0.1,
        "words": None,
        "temperature": 1.0,
    }
)
ValidationError: 1 validation error for Segment
id
  Input should be a valid integer, unable to parse string as an integer [type=int_parsing, input_value='invalid_id', input_type=str]
    For further information visit https://errors.pydantic.dev/2.9/v/int_parsing

Once this PR is merged, you can use that wrapper to add Pydantic type validation to the existing dataclasses; the wrapped instances can be passed to any function that expects an instance of the original classes.

Since the parameters don't contain nested dataclass instances, deserialization should work perfectly, and you can deserialize the transcription results without going through them recursively.
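
For example (a sketch assuming Pydantic v2; serialized_segments stands in for a JSON string holding a list of segments):

from pydantic import TypeAdapter
from faster_whisper.transcribe import Segment

# Pydantic v2 validates stdlib dataclasses, including nested ones,
# so a whole list of segments can be rebuilt from JSON in one call.
adapter = TypeAdapter(list[Segment])
segments = adapter.validate_json(serialized_segments)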

extrange commented Nov 1, 2024

That's a nice solution. Pydantic supports nested stdlib dataclasses, so I think this will also work for us (and the deserialization use case alone probably isn't enough to justify adding Pydantic as a dependency).

We'll probably still use Pydantic models for parameter passing into .transcribe(), since we need to serialize/deserialize the arguments to/from JSON.
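
Something along these lines (a hedged sketch; TranscribeParams is our hypothetical model, and language/beam_size are just two of .transcribe()'s keyword arguments):

from typing import Optional
from pydantic import BaseModel

class TranscribeParams(BaseModel):
    language: Optional[str] = None
    beam_size: int = 5

payload = TranscribeParams(language="en").model_dump_json()  # serialize for the endpoint
params = TranscribeParams.model_validate_json(payload)       # validate on the other side
# segments, info = model.transcribe(audio, **params.model_dump())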

extrange mentioned this pull request on Nov 1, 2024
MahmoudAshraf97 merged commit 203dddb into SYSTRAN:master on Nov 5, 2024
3 checks passed
MahmoudAshraf97 deleted the recursive_dict branch on Nov 5, 2024 at 13:13