PeerTube/packages/transcription
lutangar ef14cf4a5c
feat(transcription): groundwork
chore: fiddling around some more

chore: add ctranslate2 and timestamped

chore: add performance markers

chore: refactor test

chore: change workflow name

chore: ensure Python3

chore(duration): convert to chai/mocha syntax

chore(transcription): add individual tests for other transcribers

chore(transcription): implement formats test for all implementations

Also compare the results of the other implementations to the reference implementation

chore(transcription): add more test cases with other languages, model sizes and a local model

chore(test): wip ctranslate2 adaptation

chore(transcription): wip transcript file and benchmark

chore(test): clean a bit

chore(test): clean a bit

chore(test): refactor timestamped spec

chore(test): update workflow

chore(test): fix glob expansion with sh

chore(test): extract some hw info

chore(test): fix async tests

chore(benchmark): add model info

feat(transcription): allow use of a local model in timestamped-whisper

feat(transcription): extract run and profiling info into its own value object

feat(transcription): extract run concept into its own class and run more benchmarks

chore(transcription): simplify run object, only a uuid is now needed, and add more benchmark scenarios

docs(transcription): create own package README

docs(transcription): add local model usage

docs(transcription): update README

fix(transcription): use fr video for better comparison

chore(transcription): make openai comparison pass

docs(timestamped): clean

chore(transcription): change transcribers' transcribe method signature

Introduce whisper builtin model.

fix(transcription): activate language detection

Forbid transcript creation without a language.
Add `languageDetection` flag to an engine and some assertions.

Fix an issue in `whisper-ctranslate2`:
https://github.com/Softcatala/whisper-ctranslate2/pull/93

chore(transcription): use PeerTube time helpers instead of custom ones

Update the existing time function to output an integer number of seconds, and add a human-readable ms time formatter along with the beginnings of tests.

chore(transcription): use PeerTube UUID helpers

chore(transcription): enable CER evaluation

Thanks to this recent fix in Jiwer <3
https://github.com/jitsi/jiwer/issues/873

chore(jiwer): create JiWer package

I'm not very happy with the TranscriptFileEvaluator constructor... suggestions?

chore(JiWer): add usage in README

docs(jiwer): update JiWer readme

chore(transcription): use FunMOOC video in fixtures

chore(transcription): add proper english video fixture

chore(transcription): use os tmp directory where relevant

chore(transcription): fix jiwer cli test reference.txt

chore(transcription): move benchmark out of tests

chore(transcription): remove transcription workflow

docs(transcription): add benchmark info

fix(transcription): use ms precision in other transcribers

chore(transcription): simplify most of the tests

chore(transcription): remove slashes when building path with join

chore(transcription): make fromPath method async

chore(transcription): assert path to model is a directory for CTranslate2 transcriber

chore(transcription): ctranslate2 assertion

chore(transcription): ctranslate2 assertion

chore(transcription): add preinstall script for Python dependencies

chore(transcription): add download and unzip utils functions

chore(transcription): add download and unzip utils functions

chore(transcription): download & unzip models fixtures

chore(transcription): zip

chore(transcription): raise download file test timeout

chore(transcription): simplify download file test

chore(transcription): add transcriptions test to CI

chore(transcription): raise test preconditions timeout

chore(transcription): run preinstall scripts before running ci

chore(transcription): create dedicated tmp folder for transcriber tests

chore(transcription): raise timeout some more

chore(transcription): raise timeout some more

chore(transcription): raise timeout some more

chore(transcription): raise timeout some more

chore(transcription): raise timeout some more

chore(transcription): raise timeout some more

chore(transcription): raise timeout some more

chore(transcription): raise timeout some more

chore(transcription): use short video for local model test

chore(transcription): raise timeout some more

chore(transcription): raise timeout some more

chore(transcription): raise timeout some more

chore(transcription): set up verbosity based on NODE_ENV value
2024-06-28 08:43:40 +02:00
src feat(transcription): groundwork 2024-06-28 08:43:40 +02:00
README.md feat(transcription): groundwork 2024-06-28 08:43:40 +02:00
package.json feat(transcription): groundwork 2024-06-28 08:43:40 +02:00
requirements.txt feat(transcription): groundwork 2024-06-28 08:43:40 +02:00
tsconfig.json feat(transcription): groundwork 2024-06-28 08:43:40 +02:00
tsconfig.types.json feat(transcription): groundwork 2024-06-28 08:43:40 +02:00

README.md

Transcription

Video transcription consists of transcribing the audio content of a video to text.

This process might be called Automatic Speech Recognition or Speech to Text in a more general context.

This package provides a common API to multiple transcription backends, currently:

  • openai-whisper CLI
  • faster-whisper (via whisper-ctranslate2 CLI)
  • whisper-timestamped

Potential candidates could be: whisper-cpp, vosk, ...

Requirements

  • Python
  • PIP

And at least one of the following transcription backends (an installation sketch follows these lists):

  • Python :
    • openai-whisper
    • whisper-ctranslate2>=0.4.3
    • whisper-timestamped>=1.15.4

And to run the transcript evaluation tests:

  • Python
    • jiwer>=3.0.4
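
The Python dependencies may be installed with pip, for instance (a sketch mirroring the version pins above; the package also ships a requirements.txt consumed by its preinstall script):

pip install openai-whisper 'whisper-ctranslate2>=0.4.3' 'whisper-timestamped>=1.15.4'

# or, from this package directory:
pip install -r requirements.txt

# and, for the evaluation tests:
pip install 'jiwer>=3.0.4'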

Usage

Create a transcriber manually:

import { OpenaiTranscriber } from '@peertube/peertube-transcription'

(async () => {
  // create a transcriber powered by the OpenAI Whisper CLI
  const transcriber = new OpenaiTranscriber({
    name: 'openai-whisper',
    binary: 'whisper',
    languageDetection: true
  });

  const transcriptFile = await transcriber.transcribe({
    mediaFilePath: './myVideo.mp4',
    format: 'txt'
  });

  console.log(transcriptFile.path);
  console.log(await transcriptFile.read());
})();

Using a local model file:

import { WhisperBuiltinModel } from '@peertube/peertube-transcription/dist'

const transcriptFile = await transcriber.transcribe({
  mediaFilePath: './myVideo.mp4',
  model: await WhisperBuiltinModel.fromPath('./models/large.pt'),
  format: 'txt'
});

You may use the built-in factory if you're happy with the default configuration:

import { transcriberFactory } from '@peertube/peertube-transcription'
transcriberFactory.createFromEngineName('openai-whisper')
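
The factory-created transcriber exposes the same transcribe method as above, e.g. (a sketch, assuming the engine names listed earlier are all accepted):

import { transcriberFactory } from '@peertube/peertube-transcription'

(async () => {
  // pick one of the available backends by name
  const transcriber = transcriberFactory.createFromEngineName('whisper-ctranslate2');

  const transcriptFile = await transcriber.transcribe({
    mediaFilePath: './myVideo.mp4',
    format: 'txt'
  });

  console.log(await transcriptFile.read());
})();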

For further usage examples, see ../tests/src/transcription/whisper/transcriber/openai-transcriber.spec.ts.

Benchmark

A benchmark of the available transcribers may be run with:

npm run benchmark
┌────────────────────────┬───────────────────────┬───────────────────────┬──────────┬────────┬───────────────────────┐
│        (index)         │          WER          │          CER          │ duration │ model  │        engine         │
├────────────────────────┼───────────────────────┼───────────────────────┼──────────┼────────┼───────────────────────┤
│ 5yZGBYqojXe7nuhq1TuHvz │ '28.39506172839506%'  │  '9.62457337883959%'  │  '41s'   │ 'tiny' │   'openai-whisper'    │
│ x6qREJ2AkTU4e5YmvfivQN │ '29.75206611570248%'  │ '10.46195652173913%'  │  '15s'   │ 'tiny' │ 'whisper-ctranslate2' │
│ qbt6BekKMVzxq4KCSLCzt3 │ '31.020408163265305%' │ '10.784982935153584%' │  '20s'   │ 'tiny' │ 'whisper-timestamped' │
└────────────────────────┴───────────────────────┴───────────────────────┴──────────┴────────┴───────────────────────┘

The benchmark may be run with multiple built-in model sizes:

MODELS=tiny,small,large npm run benchmark

Lexicon

  • ONNX: Open Neural Network eXchange. A specification; the ONNX Runtime runs these models.
  • GPT: Generative Pre-trained Transformer
  • LLM: Large Language Model
  • NLP: Natural Language Processing
  • MLP: Multilayer Perceptron
  • ASR: Automatic Speech Recognition
  • WER: Word Error Rate
  • CER: Character Error Rate
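
Both error rates compare a transcript to a reference text, as a ratio of edit operations to reference length:

WER = (S + D + I) / N

where S, D and I are the numbers of substituted, deleted and inserted words and N is the number of words in the reference; CER is the same ratio computed on characters instead of words.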