Example: Video Dubbing
Build a video dubbing Sieve app
In this example, we will explore a more complex use case of Sieve. We’ll create an application that automatically dubs a video of a person speaking into another language! We’ll take an input video, transcribe it, translate that text to another language, generate text-to-speech of that translated text, and have the original video lipsync to the new audio.
By following this, you will learn how to:
- Use the Sieve client package to call multiple existing functions in Python
- Combine these functions to create a customized app that meets our requirements
Introduction
As mentioned above, to create an app that can dub videos, we need several models to work together:
- WhisperX: an audio transcription model
- SeamlessT2T: a text-to-text translation model
- XTTS-V1: a text-to-speech model
- Sieve Video Retalker: an optimized version of video retalker for lipsyncing
Building the app from scratch
Set up folder and Python file
Create a folder and Python file named video_dubbing.py
with the following command:
mkdir video_dubbing
touch video_dubbing/pipeline.py
Set up pipeline
Paste the following code into pipeline.py
. The higher level logic of this code is as follows:
- Extract audio from the video
- Transcribe the audio
- Translate the transcript
- Generate new audio from the translated text
- Combine audio and video with our lipsyncer
import sieve
# Helper functions
def extract_audio(source_video: sieve.File):
import subprocess
audio_path = 'temp.wav'
subprocess.run(["ffmpeg", "-i", source_video.path, audio_path, "-y", "-loglevel", "error"])
return audio_path
def transcript_to_text(transcript):
text = " ".join([segment["text"] for segment in transcript])
return text
@sieve.function(
name="video_dubbing",
python_packages=[
"numpy>=1.19.0",
],
system_packages=["ffmpeg"],
)
def video_dubbing(source_video: sieve.File, language: str):
"""
:param source_video: The video to dub
:param language: The language to dub the video in
:return: The dubbed video
"""
print("extracting audio from video")
source_audio_path = extract_audio(source_video)
print("done extracting")
# check if language is supported
if language not in ["eng", "spa", "fra", "deu", "ita", "por", "pol", "tur", "rus", "nld", "ces", "arb", "cmn"]:
raise Exception("Language not supported")
# use remote Sieve functions
transcriber = sieve.function.get("sieve/speech_transcriber")
translator = sieve.function.get("sieve/seamless_text2text")
tts = sieve.function.get("sieve/tts")
lipsyncer = sieve.function.get("sieve/video_retalking")
print("transcribing audio")
source_audio = sieve.File(path=source_audio_path)
transcript = list(transcriber.run(source_audio))
text = transcript_to_text(transcript)
text_language = "eng"
print("transcription:", text)
print("translating text")
translated_text = translator.run(text, text_language, language)
print("translated text:", translated_text)
print("generating tts audio")
source_audio = sieve.File(path=source_audio_path)
target_audio = tts.run("elevenlabs-voice-cloning", text=translated_text, reference_audio=source_audio, stability=0.5, style=0.5)
print("done generating audio")
print("starting lipsync")
return lipsyncer.run(source_video, target_audio)
if __name__ == "__main__":
video = sieve.File(url="https://storage.googleapis.com/mango-public-models/david.mp4")
dubbed_video = video_dubbing(video, "spa")
print('dubbed video path: ', dubbed_video.path)
Run the pipeline
video_dubbing
directory before running the pipeline.video_dubbing.run
, where video_dubbing
is the local Sieve function, this will deploy the Sieve function and then run a job.python pipeline.py
You should start seeing some logs streaming in. You can also view the status of this job on the Sieve dashboard. After it has completed running, you’ll see a video file path printed to the console, which has been saved to a temporary directory. You can open this video to see the results.
View your dubbed file!
Open your video in your file explorer with the following command:
Or just view the video on the Sieve dashboard!
Was this page helpful?