
replace NamedTuple with dataclass #1105

Merged: 2 commits merged into SYSTRAN:master from recursive_dict on Nov 5, 2024

Conversation

MahmoudAshraf97 (Collaborator) commented Oct 30, 2024

This is to allow converting Segment and Word objects to a dict or JSON without recursively traversing them. Similar to #667 and #1104.

Unlike NamedTuples, dataclasses are not iterable, so they need to be converted to a dict first.
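A minimal sketch of the difference (PointNT and PointDC are hypothetical examples, not classes from this PR):

from typing import NamedTuple
from dataclasses import dataclass, asdict

class PointNT(NamedTuple):
    x: int
    y: int

@dataclass
class PointDC:
    x: int
    y: int

list(PointNT(1, 2))                   # [1, 2] -- a NamedTuple unpacks like a tuple
# list(PointDC(1, 2))                 # TypeError: 'PointDC' object is not iterable
list(asdict(PointDC(1, 2)).values())  # [1, 2] -- convert to a dict first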

MahmoudAshraf97 (Collaborator, Author) commented Oct 30, 2024

@extrange

All dataclasses should be serializable recursively using this:

from dataclasses import asdict
import json

json.dumps(asdict(your_dataclass))  # asdict recurses into nested dataclasses

Let me know if this satisfies your use case, and whether your PR covers any other use cases that are not mentioned here.

extrange commented

Thanks for looking at this!

I'm not too familiar with dataclasses, but it looks like there's an issue with parsing nested dataclasses, for example:

import json
from dataclasses import dataclass, asdict

@dataclass
class Bar:
    y: str

@dataclass
class Foo:
    x: str
    bar: Bar

baz = Foo(x="x", bar=Bar(y="y"))

out = json.dumps(asdict(baz))
# '{"x": "x", "bar": {"y": "y"}}'

The bar field is neither parsed nor type-checked:

Foo(**json.loads(out))
# Foo(x='x', bar={'y': 'y'})

We can do it manually, but this will break if new keys are added:

obj = json.loads(out)
Foo(x=obj["x"], bar=Bar(**obj["bar"]))
# Foo(x='x', bar=Bar(y='y'))
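
One workaround (a rough sketch; from_dict is a hypothetical helper, and it assumes field annotations are actual classes rather than strings) is to rebuild nested dataclasses generically from dataclasses.fields(), dropping unknown keys:

from dataclasses import fields, is_dataclass

def from_dict(cls, data):
    # Reconstruct a dataclass from a plain dict, recursing into nested
    # dataclasses and silently ignoring keys that cls doesn't declare.
    kwargs = {}
    for f in fields(cls):
        if f.name not in data:
            continue
        value = data[f.name]
        if is_dataclass(f.type) and isinstance(value, dict):
            value = from_dict(f.type, value)
        kwargs[f.name] = value
    return cls(**kwargs)

from_dict(Foo, json.loads(out))
# Foo(x='x', bar=Bar(y='y'))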

MahmoudAshraf97 (Collaborator, Author) commented

I didn't consider the parsing case, to be honest. Is there a use case in your mind that needs converting the JSON back into a dataclass in faster-whisper? As far as I'm concerned, these classes are pretty stable and no keys have been added or removed in a long time, so the manual solution might not be so bad after all.

This PR solves the serialization problem with minimal code changes and no additional dependencies. Pydantic might be the best solution, but I think it's overkill for this repo and would require a lot of changes that I might not be comfortable with.

extrange commented

The main use case for us is when working with AWS SageMaker endpoints, for which we have to serialize parameters and deserialize the output of the transcription.

For now we're using a helper library which we've pinned to specific commits of faster-whisper.

MahmoudAshraf97 (Collaborator, Author) commented Oct 31, 2024

You can still use Pydantic validation, at least for type checking:

import pydantic
from faster_whisper.transcribe import Segment

# Wrapping the existing dataclass makes Pydantic validate its fields on construction
@pydantic.dataclasses.dataclass
class Segment(Segment):
    ...

Segment(
    **{
        "id": "invalid_id",
        "seek": 0,
        "start": 0.0,
        "end": 1.0,
        "text": "transcription",
        "tokens": [1, 2, 3],
        "avg_logprob": 0.1,
        "compression_ratio": 0.1,
        "no_speech_prob": 0.1,
        "words": None,
        "temperature": 1.0,
    }
)
ValidationError: 1 validation error for Segment
id
  Input should be a valid integer, unable to parse string as an integer [type=int_parsing, input_value='invalid_id', input_type=str]
    For further information visit https://errors.pydantic.dev/2.9/v/int_parsing

Once this PR is merged, you can use that wrapper to add Pydantic type validation to the existing dataclasses; the wrapped instances can be passed to any function that expects an instance of the original classes.

Since the parameters don't contain nested dataclass instances, deserialization should work perfectly, and you can deserialize the transcription results without going through them recursively.
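
For example (a sketch assuming Pydantic v2; serialized_segments stands in for a JSON string holding a list of segments):

from pydantic import TypeAdapter
from faster_whisper.transcribe import Segment

# Pydantic v2 validates stdlib dataclasses, including nested ones,
# so a whole list of segments can be rebuilt from JSON in one call.
adapter = TypeAdapter(list[Segment])
segments = adapter.validate_json(serialized_segments)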

extrange commented Nov 1, 2024

That's a nice solution. Pydantic supports nested stdlib dataclasses, so I think this will also work for us (and the deserialization use case alone probably isn't enough to justify adding Pydantic as a dependency).

We'll probably still use Pydantic models for parameter passing into .transcribe(), since we need to serialize/deserialize the arguments to/from JSON.
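
Something along these lines (a hedged sketch; TranscribeParams is our hypothetical model, and language/beam_size are just two of .transcribe()'s keyword arguments):

from typing import Optional
from pydantic import BaseModel

class TranscribeParams(BaseModel):
    language: Optional[str] = None
    beam_size: int = 5

payload = TranscribeParams(language="en").model_dump_json()  # serialize for the endpoint
params = TranscribeParams.model_validate_json(payload)       # validate on the other side
# segments, info = model.transcribe(audio, **params.model_dump())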

extrange mentioned this pull request on Nov 1, 2024
MahmoudAshraf97 merged commit 203dddb into SYSTRAN:master on Nov 5, 2024
3 checks passed
MahmoudAshraf97 deleted the recursive_dict branch on Nov 5, 2024 at 13:13