
Proposal: Support multilingual transcripts #367

@ryan-lp

Some podcasts are multilingual: each episode might use a different language, or a single episode might switch between several languages.

It is already possible to list multiple languages on the channel of the RSS feed (e.g. <language>en,es</language>). Perhaps there should also be a similar optional tag on each item, defaulting to the channel language, since that would help when each episode is in a different language.

But when a single episode contains multiple languages, we also need a way to tag which text belongs to which language within the transcript.

I am not sure if there is an obvious way to do it in every format, but for JSON, we can add an optional language property to each segment which defaults to the item's language in the RSS feed, as follows:

    {
      "speaker": "Darth Vader",
      "startTime": 0.5,
      "endTime": 0.75,
      "body": "I",
      "language": "en"
    }
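
To illustrate the proposed default, here is a rough sketch of how a consumer might resolve the language of each segment. It assumes the segments sit under a top-level "segments" array, that the item's language has already been read from the feed, and that the second segment (which has no language property) is made up for the example. The helper name segment_language is hypothetical.

```python
import json

def segment_language(segment, item_language):
    """Return the segment's own language, or the item's language if none is set."""
    return segment.get("language") or item_language

# Hypothetical transcript: the first segment is the one from this proposal,
# the second is invented to show the fallback.
transcript = json.loads("""
{
  "segments": [
    {"speaker": "Darth Vader", "startTime": 0.5, "endTime": 0.75, "body": "I", "language": "en"},
    {"speaker": "Darth Vader", "startTime": 0.75, "endTime": 1.2, "body": "soy"}
  ]
}
""")

item_language = "es"  # assumed to come from the item (or channel) in the RSS feed

languages = [segment_language(s, item_language) for s in transcript["segments"]]
print(languages)
```

Because the property is optional and falls back to the item's language, existing single-language transcripts would not need to change.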

For WebVTT, maybe this information could be placed in a comment.
For SRT, maybe this information could be encoded in parentheses or some other type of brackets.
For HTML, maybe this can use the lang attribute.
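
For the HTML case, something like the following might work. This is only a sketch: the helper segment_to_html is hypothetical, and wrapping each segment in a span is an assumption on my part, not something an existing HTML transcript layout requires.

```python
from html import escape

def segment_to_html(segment, item_language):
    """Wrap a segment's text in a span carrying the standard HTML lang attribute."""
    # Fall back to the item's language when the segment has no language of its own.
    lang = segment.get("language") or item_language
    return '<span lang="{}">{}</span>'.format(
        escape(lang, quote=True), escape(segment["body"])
    )

html_fragment = segment_to_html({"body": "I", "language": "en"}, "es")
print(html_fragment)
```

An alternative would be to put lang only on segments whose language differs from the item's language and set the item's language once on an enclosing element.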
