%%capture
! pip install diffusers==0.4.0 transformers
Creating AI Music Videos with Stable Diffusion
Today, we’ll talk about how we can leverage Stable Diffusion to generate captivating music videos that move to the beat of a song.
The point of this notebook is to learn how this process works.
If you don’t care about the details, you can just go check out the Stable Diffusion Videos repo instead, where all this work is nicely wrapped up for you already.
If you like this notebook:
- consider giving this repo a star ⭐️
- consider following me on Github @nateraw
Examples
Here are some examples of videos you’ll be able to generate using the techniques we describe below:
Background
You probably remember seeing videos like this, where folks were interpolating the latent space of StyleGAN models:
These were soooo mind-blowing at one time 🤯. But they were limited in that you had to train a separate model for each concept you wanted to generate images for. Even when class-conditioned versions came out, they were still limited to the classes the model knew.
Our goal is to generate videos like the above, but with Stable Diffusion. This should let us interpolate between images of anything we want!
Stable Diffusion Inference ELI5
At the most basic level, all you need to know is that the model we’ll be working with takes in 2 inputs:
- A random noise vector (containing samples drawn from a normal distribution)
- A text prompt that will be used to condition the image generation diffusion process
All else remaining constant, if you pass the same noise vector to the model more than once, it will generate the exact same image.
However, if you change that noise vector to a different collection of samples drawn from a normal distribution, you’ll get a different image.
For a more in depth explanation of how diffusion models work, check out this legendary writeup on the Hugging Face blog.
Interpolating the Latent Space ELI5
If we slowly interpolate between two random noise vectors and generate images along the way, we should be able to “dream between” the images generated by the two vectors.
Representing the noise vectors as “Image A” and “Image B”, this is more or less what we’ll do first:
“Interpolating” here basically means mixing between Image A and Image B. At step 0, we have Image A’s noise vector. At the last step, we have Image B’s noise vector.
As we step through, the % mix of Image A’s noise vector decreases, and % mix of Image B’s noise vector increases.
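To make the “% mix” idea concrete, here’s a tiny sketch (not the interpolation we’ll actually use below) of the most naive way to mix two noise vectors at a weight t between 0 and 1:

# Illustration only: a plain weighted average of two noise vectors
# t = 0.0 gives Image A's noise, t = 1.0 gives Image B's noise
noise_t = (1 - t) * noise_a + t * noise_b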
We’ll use Spherical Linear Interpolation (slerp) to generate vectors in between A and B at weights t, as it seems to work best for these random noise vectors.
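For reference, the formula the slerp code below implements is the standard one: with θ the angle between A and B (recovered from their normalized dot product), the interpolated vector at weight t is sin((1 − t)·θ)/sin(θ) · A + sin(t·θ)/sin(θ) · B. When the two vectors are nearly parallel this becomes numerically unstable, which is what the DOT_THRESHOLD check handles by falling back to plain linear interpolation.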
First, let’s define our slerp fn:
import torch
import numpy as np


def slerp(t, v0, v1, DOT_THRESHOLD=0.9995):
    """helper function to spherically interpolate two arrays v0 v1"""

    inputs_are_torch = False
    if not isinstance(v0, np.ndarray):
        inputs_are_torch = True
        input_device = v0.device
        v0 = v0.cpu().numpy()
        v1 = v1.cpu().numpy()

    dot = np.sum(v0 * v1 / (np.linalg.norm(v0) * np.linalg.norm(v1)))
    if np.abs(dot) > DOT_THRESHOLD:
        # vectors are nearly parallel - plain linear interpolation is fine
        v2 = (1 - t) * v0 + t * v1
    else:
        theta_0 = np.arccos(dot)
        sin_theta_0 = np.sin(theta_0)
        theta_t = theta_0 * t
        sin_theta_t = np.sin(theta_t)
        s0 = np.sin(theta_0 - theta_t) / sin_theta_0
        s1 = sin_theta_t / sin_theta_0
        v2 = s0 * v0 + s1 * v1

    if inputs_are_torch:
        v2 = torch.from_numpy(v2).to(input_device)

    return v2
We can then create 2 random noise vectors, a and b, and get the vectors between them using our slerp fn over T timesteps.
# Here we construct noise vectors with shape similar to Stable Diffusion's
height = 512
width = 512
noise_shape = (1, 4, height // 8, width // 8)
noise_a = torch.randn(*noise_shape)
noise_b = torch.randn(*noise_shape)

# Linspace creates our timestep weights, T.
# We use same # of steps as the image above for sake of demonstration
T = torch.linspace(0.0, 1.0, 6)
T
tensor([0.0000, 0.2000, 0.4000, 0.6000, 0.8000, 1.0000])
for t in T:
    noise_t = slerp(float(t), noise_a, noise_b)
    # Later, we'll do something like this at each timestep...
    # img = diffusion(noise=noise_t)

noise_t.shape == noise_a.shape == noise_b.shape
True
We have a small problem here - torch.randn is drawing new random samples every time. To fix this, we can enforce reproducibility by using a torch generator and supplying an int seed.
An important thing to note here is that we have to generate these noise vectors on the correct device.
# Here we construct noise vectors with shape similar to Stable Diffusion's
height = 512
width = 512
noise_shape = (1, 4, height // 8, width // 8)
device = 'cuda' if torch.cuda.is_available() else 'cpu'

seed_a = 42
seed_b = 1234

noise_a = torch.randn(
    noise_shape,
    generator=torch.Generator(device=device).manual_seed(seed_a),
    device=device,
)
noise_b = torch.randn(
    noise_shape,
    generator=torch.Generator(device=device).manual_seed(seed_b),
    device=device,
)
Now whenever we slerp between these, the results will be the same.
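As a quick sanity check (a minimal sketch using the seeded vectors from the cell above), interpolating twice with the same inputs now gives identical results:

# Same seeds + same weight -> identical interpolated noise
noise_mid_1 = slerp(0.5, noise_a, noise_b)
noise_mid_2 = slerp(0.5, noise_a, noise_b)
assert torch.allclose(noise_mid_1, noise_mid_2)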
Implementation with Diffusers
We’ll use Hugging Face’s diffusers to download and interface with Stable Diffusion. As of the time of writing this post, we’re using the Stable Diffusion 1.4 checkpoint and diffusers==0.4.0.
In that codebase, there’s a StableDiffusionPipeline, which is a handy wrapper for inference.
Before we load up the pipeline, we have to make sure we’re authenticated with the Hugging Face Hub…
from huggingface_hub import notebook_login
notebook_login()
Now let’s load the StableDiffusionPipeline…
from diffusers import StableDiffusionPipeline
import torch

pipeline = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    revision='fp16'
).to("cuda")
To generate images with this, we can pass a prompt and watch it go!
pipeline('blueberry spaghetti')['sample'][0]
We can provide a noise vector of our own using the latents kwarg.
im = pipeline('blueberry spaghetti', latents=noise_a)['sample'][0]
im
If we run that again with the same noise, we’ll get the same image back.
im = pipeline('blueberry spaghetti', latents=noise_a)['sample'][0]
im
If we change our noise to noise_b, we’ll get a different image.
im = pipeline('blueberry spaghetti', latents=noise_b)['sample'][0]
im
Interpolating Between Noise Vectors
Now let’s interpolate between images A and B as we described earlier…
height = 512
width = 512
noise_shape = (1, 4, height // 8, width // 8)

seed_a = 42
seed_b = 5432

noise_a = torch.randn(
    noise_shape,
    generator=torch.Generator(device=pipeline.device).manual_seed(seed_a),
    device=pipeline.device,
)
noise_b = torch.randn(
    noise_shape,
    generator=torch.Generator(device=pipeline.device).manual_seed(seed_b),
    device=pipeline.device,
)
# Using same prompt for each image
prompt = 'blueberry spaghetti'

# Steps to interpolate (i.e. number of images to generate)
# If you change the number here, the visualization below will break
num_interpolation_steps = 6

T = torch.linspace(0.0, 1.0, num_interpolation_steps)

images = []
for i, t in enumerate(T):
    print(f"Weight at step {i} - {t:2.2f}")
    noise_t = slerp(float(t), noise_a, noise_b)
    im = pipeline(prompt, latents=noise_t, height=height, width=width)['sample'][0]
    images.append(im)
Let’s visualize the frames we just created…
%matplotlib inline
from matplotlib import pyplot as plt

rows, cols = 2, 3
fig, axes = plt.subplots(nrows=rows, ncols=cols, figsize=(15, 15))

for r in range(rows):
    for c in range(cols):
        i = (r * cols) + c
        axes[r, c].axis("off")
        axes[r, c].imshow(images[i])
        axes[r, c].text(0, 0, f"{float(T[i]):2.2f}", fontsize=24)

plt.subplots_adjust(wspace=.05, hspace=.05)
plt.grid(False)
plt.show()
Cool! Looks like we got an image in between A and B that seems to make sense. Let’s do more steps and see what that looks like…
This time, we’ll save images as we go, and then stitch them back together as a gif with ffmpeg.
from pathlib import Path

output_dir = Path('images')
output_dir.mkdir(exist_ok=True, parents=True)

# Using same prompt for each image
prompt = 'blueberry spaghetti'

# Steps to interpolate (i.e. number of images to generate)
num_interpolation_steps = 10
T = torch.linspace(0.0, 1.0, num_interpolation_steps)

for i, t in enumerate(T):
    noise_t = slerp(float(t), noise_a, noise_b)
    im = pipeline(prompt, latents=noise_t, height=height, width=width)['sample'][0]
    im.save(output_dir / f'frame{i:06d}.png')
Using ffmpeg, we can bring these together as a clip.
Here, we’ll lower the frame rate to 5 frames per second, so we should end up with a 2-second clip containing our 10 generated image frames.
! ffmpeg -f image2 -framerate 5 -i images/frame%06d.png -loop 0 output_1.gif
It should look like this - pretty cool!
Interpolating Noise and Text Embedding Vectors
To ramp it up a notch, lets see if we can interpolate between images with different prompts. To do that, we’ll need to modify the diffusers
StableDiffusionPipeline
to accept text embeddings, since we’ll want to provide intermediate text embeddings.
The main bit we’re adding is this snippet, where we allow for a text_embeddings kwarg that will override the text prompt input.
if text_embeddings is None:
    if isinstance(prompt, str):
        batch_size = 1
    elif isinstance(prompt, list):
        batch_size = len(prompt)
    else:
        raise ValueError(f"`prompt` has to be of type `str` or `list` but is {type(prompt)}")

    # get prompt text embeddings
    text_inputs = self.tokenizer(
        prompt,
        padding="max_length",
        max_length=self.tokenizer.model_max_length,
        return_tensors="pt",
    )
    text_input_ids = text_inputs.input_ids

    if text_input_ids.shape[-1] > self.tokenizer.model_max_length:
        removed_text = self.tokenizer.batch_decode(text_input_ids[:, self.tokenizer.model_max_length :])
        print(
            "The following part of your input was truncated because CLIP can only handle sequences up to"
            f" {self.tokenizer.model_max_length} tokens: {removed_text}"
        )
        text_input_ids = text_input_ids[:, : self.tokenizer.model_max_length]
    text_embeddings = self.text_encoder(text_input_ids.to(self.device))[0]
else:
    batch_size = text_embeddings.shape[0]
Here’s the full code for our StableDiffusionWalkPipeline:
import inspect
from typing import Optional, Union, List, Callable

from diffusers.pipelines.stable_diffusion import StableDiffusionPipelineOutput


class StableDiffusionWalkPipeline(StableDiffusionPipeline):
    @torch.no_grad()
    def __call__(
        self,
        prompt: Optional[Union[str, List[str]]] = None,
        height: int = 512,
        width: int = 512,
        num_inference_steps: int = 50,
        guidance_scale: float = 7.5,
        eta: float = 0.0,
        generator: Optional[torch.Generator] = None,
        latents: Optional[torch.FloatTensor] = None,
        output_type: Optional[str] = "pil",
        return_dict: bool = True,
        callback: Optional[Callable[[int, int, torch.FloatTensor], None]] = None,
        callback_steps: Optional[int] = 1,
        text_embeddings: Optional[torch.FloatTensor] = None,
    ):
        if height % 8 != 0 or width % 8 != 0:
            raise ValueError(f"`height` and `width` have to be divisible by 8 but are {height} and {width}.")

        if (callback_steps is None) or (
            callback_steps is not None and (not isinstance(callback_steps, int) or callback_steps <= 0)
        ):
            raise ValueError(
                f"`callback_steps` has to be a positive integer but is {callback_steps} of type"
                f" {type(callback_steps)}."
            )

        if text_embeddings is None:
            if isinstance(prompt, str):
                batch_size = 1
            elif isinstance(prompt, list):
                batch_size = len(prompt)
            else:
                raise ValueError(f"`prompt` has to be of type `str` or `list` but is {type(prompt)}")

            # get prompt text embeddings
            text_inputs = self.tokenizer(
                prompt,
                padding="max_length",
                max_length=self.tokenizer.model_max_length,
                return_tensors="pt",
            )
            text_input_ids = text_inputs.input_ids

            if text_input_ids.shape[-1] > self.tokenizer.model_max_length:
                removed_text = self.tokenizer.batch_decode(text_input_ids[:, self.tokenizer.model_max_length :])
                print(
                    "The following part of your input was truncated because CLIP can only handle sequences up to"
                    f" {self.tokenizer.model_max_length} tokens: {removed_text}"
                )
                text_input_ids = text_input_ids[:, : self.tokenizer.model_max_length]
            text_embeddings = self.text_encoder(text_input_ids.to(self.device))[0]
        else:
            batch_size = text_embeddings.shape[0]

        # here `guidance_scale` is defined analog to the guidance weight `w` of equation (2)
        # of the Imagen paper: https://arxiv.org/pdf/2205.11487.pdf . `guidance_scale = 1`
        # corresponds to doing no classifier free guidance.
        do_classifier_free_guidance = guidance_scale > 1.0
        # get unconditional embeddings for classifier free guidance
        if do_classifier_free_guidance:
            # HACK - Not setting text_input_ids here when walking, so hard coding to max length of tokenizer
            # TODO - Determine if this is OK to do
            # max_length = text_input_ids.shape[-1]
            max_length = self.tokenizer.model_max_length
            uncond_input = self.tokenizer(
                [""] * batch_size, padding="max_length", max_length=max_length, return_tensors="pt"
            )
            uncond_embeddings = self.text_encoder(uncond_input.input_ids.to(self.device))[0]

            # For classifier free guidance, we need to do two forward passes.
            # Here we concatenate the unconditional and text embeddings into a single batch
            # to avoid doing two forward passes
            text_embeddings = torch.cat([uncond_embeddings, text_embeddings])

        # get the initial random noise unless the user supplied it
        # Unlike in other pipelines, latents need to be generated in the target device
        # for 1-to-1 results reproducibility with the CompVis implementation.
        # However this currently doesn't work in `mps`.
        latents_device = "cpu" if self.device.type == "mps" else self.device
        latents_shape = (batch_size, self.unet.in_channels, height // 8, width // 8)
        if latents is None:
            latents = torch.randn(
                latents_shape,
                generator=generator,
                device=latents_device,
                dtype=text_embeddings.dtype,
            )
        else:
            if latents.shape != latents_shape:
                raise ValueError(f"Unexpected latents shape, got {latents.shape}, expected {latents_shape}")
        latents = latents.to(latents_device)

        # set timesteps
        self.scheduler.set_timesteps(num_inference_steps)

        # Some schedulers like PNDM have timesteps as arrays
        # It's more optimized to move all timesteps to correct device beforehand
        timesteps_tensor = self.scheduler.timesteps.to(self.device)

        # scale the initial noise by the standard deviation required by the scheduler
        latents = latents * self.scheduler.init_noise_sigma

        # prepare extra kwargs for the scheduler step, since not all schedulers have the same signature
        # eta (η) is only used with the DDIMScheduler, it will be ignored for other schedulers.
        # eta corresponds to η in DDIM paper: https://arxiv.org/abs/2010.02502
        # and should be between [0, 1]
        accepts_eta = "eta" in set(inspect.signature(self.scheduler.step).parameters.keys())
        extra_step_kwargs = {}
        if accepts_eta:
            extra_step_kwargs["eta"] = eta

        for i, t in enumerate(self.progress_bar(timesteps_tensor)):
            # expand the latents if we are doing classifier free guidance
            latent_model_input = torch.cat([latents] * 2) if do_classifier_free_guidance else latents
            latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)

            # predict the noise residual
            noise_pred = self.unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample

            # perform guidance
            if do_classifier_free_guidance:
                noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
                noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

            # compute the previous noisy sample x_t -> x_t-1
            latents = self.scheduler.step(noise_pred, t, latents, **extra_step_kwargs).prev_sample

            # call the callback, if provided
            if callback is not None and i % callback_steps == 0:
                callback(i, t, latents)

        latents = 1 / 0.18215 * latents
        image = self.vae.decode(latents).sample

        image = (image / 2 + 0.5).clamp(0, 1)
        image = image.cpu().permute(0, 2, 3, 1).numpy()

        safety_checker_input = self.feature_extractor(self.numpy_to_pil(image), return_tensors="pt").to(self.device)
        image, has_nsfw_concept = self.safety_checker(
            images=image, clip_input=safety_checker_input.pixel_values.to(text_embeddings.dtype)
        )

        if output_type == "pil":
            image = self.numpy_to_pil(image)

        if not return_dict:
            return (image, has_nsfw_concept)

        return StableDiffusionPipelineOutput(images=image, nsfw_content_detected=has_nsfw_concept)
Remove existing pipeline instance before proceeding…
del pipeline
torch.cuda.empty_cache()
Then, initialize the pipeline just as we did before
pipeline = StableDiffusionWalkPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    revision='fp16'
).to("cuda")
Great! Now we’ll interpolate between text embeddings. We use torch.lerp instead of slerp, as that’s what we’ve found to be a bit smoother for text.
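In case torch.lerp is unfamiliar: it’s just an elementwise linear mix, the same kind of weighted average we sketched earlier. A quick illustration:

# torch.lerp(a, b, t) == a + t * (b - a), applied elementwise
a = torch.zeros(3)
b = torch.ones(3)
torch.lerp(a, b, 0.25)  # tensor([0.2500, 0.2500, 0.2500])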
We’ll start by creating two helper functions, embed_text and get_noise, so this repetitive code doesn’t muddy up our logic below.
def embed_text(pipeline, text):
    """takes in text and turns it into text embeddings"""
    text_input = pipeline.tokenizer(
        text,
        padding="max_length",
        max_length=pipeline.tokenizer.model_max_length,
        truncation=True,
        return_tensors="pt",
    )
    with torch.no_grad():
        embed = pipeline.text_encoder(text_input.input_ids.to(pipeline.device))[0]
    return embed


def get_noise(pipeline, seed, height=512, width=512):
    """Takes in random seed and returns corresponding noise vector"""
    return torch.randn(
        (1, pipeline.unet.in_channels, height // 8, width // 8),
        generator=torch.Generator(device=pipeline.device).manual_seed(seed),
        device=pipeline.device,
    )
# Height and width of image are important for noise vector creation
# Values should be divisible by 8 if less than 512
# Values should be divisible by 64 if greater than 512
height, width = 512, 512

# Prompts/random seeds for A and B
prompt_a, prompt_b = 'blueberry spaghetti', 'strawberry spaghetti'
seed_a, seed_b = 42, 1337

# Noise for A and B
noise_a = get_noise(pipeline, seed_a, height=height, width=width)
noise_b = get_noise(pipeline, seed_b, height=height, width=width)

# Text embeddings for A and B
embed_a = embed_text(pipeline, prompt_a)
embed_b = embed_text(pipeline, prompt_b)
from pathlib import Path

output_dir = Path('images_walk_with_text')
output_dir.mkdir(exist_ok=True, parents=True)

# Steps to interpolate (i.e. number of images to generate)
num_interpolation_steps = 10
T = torch.linspace(0.0, 1.0, num_interpolation_steps).to(pipeline.device)

for i, t in enumerate(T):
    noise_t = slerp(float(t), noise_a, noise_b)
    embed_t = torch.lerp(embed_a, embed_b, t)
    im = pipeline(
        text_embeddings=embed_t,
        latents=noise_t,
        height=height,
        width=width
    )['sample'][0]
    im.save(output_dir / f'frame{i:06d}.png')
! ffmpeg -f image2 -framerate 5 -i images_walk_with_text/frame%06d.png -loop 0 output_2.gif
Interpolating to the Beat of a Song
Now we’re talking! But how can we make this video move to the beat of a song?
Above, we were moving between images linearly. What we want to do is:
- move more when the energy of a given audio clip is high (it’s loud)
- move less when the energy is low (it’s quiet).
We can achieve this by manipulating the weights at certain timesteps that we defined above as T. Instead of using torch.linspace, we’ll try to set these values based on some audio.
Here we define a helper function to visualize numpy arrays. We’ll use this to help explain what we’re doing.
from matplotlib import pyplot as plt
import numpy as np


def plot_array(y):
    x = np.arange(y.shape[0])

    # plotting
    plt.title("Line graph")
    plt.xlabel("X axis")
    plt.ylabel("Y axis")
    plt.plot(x, y, color="red")
    return plt.show()
Now let’s load in an audio clip. The one we’re using is the choice example clip from librosa.
It’s a good one because it has drums and bass in it, so it’s similar to a song you might want to use (but doesn’t involve us using copyrighted music in this notebook 😉).
We’ll slice the audio clip so we’re only using the portion from 0:11-0:14.
import librosa
from IPython.display import Audio

n_mels = 512
fps = 12
offset = 11
duration = 3

wav, sr = librosa.load(librosa.example('choice'), offset=offset, duration=duration)
Audio(wav, rate=sr)
Let’s take a look at the plot of the waveform:
plot_array(wav)
After much experimentation, I found that extracting the percussive elements from the song and using those for everything moving forward leads to the best results.
We’ll do this using the librosa.effects.hpss function.
wav_harmonic, wav_percussive = librosa.effects.hpss(wav, margin=(1.0, 5.0))
plot_array(wav_percussive)
As you can see, now the points of percussive impact are more pronounced.
What we’ll do next is:
- Convert that audio to a spectrogram
- Normalize the spectrogram
- Rescale the spectrogram to be (duration * fps) so we have a vector that’s the same length as the number of frames we wish to generate
- Plot the resulting array so we can see what it looks like
# Number of audio samples per frame
frame_duration = int(sr / fps)

# Generate Mel Spectrogram
spec_raw = librosa.feature.melspectrogram(y=wav_percussive, sr=sr, n_mels=n_mels, hop_length=frame_duration)

# Obtain maximum value per time-frame
spec_max = np.amax(spec_raw, axis=0)

# Normalize all values between 0 and 1
spec_norm = (spec_max - np.min(spec_max)) / np.ptp(spec_max)

# Rescale so it's exactly the number of frames we want to generate
# 3 seconds at 12 fps == 36
amplitude_arr = np.resize(spec_norm, int(duration * fps))

plot_array(amplitude_arr)
Finally, we’ll construct T. We could do this in a variety of ways, but the simplest we found was using np.cumsum to gather a cumulative sum of the “energy” in the audio array.
Hat tip to @teddykoker who helped me figure this out.
# Cumulative sum of audio energy
T = np.cumsum(amplitude_arr)

# Normalize values of T against last element
T /= T[-1]

# 0th element not always exactly 0.0. Enforcing that here.
T[0] = 0.0

plot_array(T)
Compare the above T with our previous definition of T…it’s a lot different!
We can see the one above is increasing rapidly at points of high energy, while the one below is simply linear.
plot_array(np.linspace(0.0, 1.0, fps * duration))
Let’s use our newly defined T to generate our music video!
📝 Note - This cell will take a little while, as it has to generate the number of frames you see on the X-axis above (36 frames).
from pathlib import Path

output_dir = Path('images_walk_with_audio')
output_dir.mkdir(exist_ok=True, parents=True)

for i, t in enumerate(T):
    noise_t = slerp(float(t), noise_a, noise_b)
    embed_t = torch.lerp(embed_a, embed_b, float(t))
    im = pipeline(
        text_embeddings=embed_t,
        latents=noise_t,
        height=height,
        width=width
    )['sample'][0]
    im.save(output_dir / f'frame{i:06d}.png')
Let’s stitch together the frames we just made as well as the audio clip that goes with it.
First, we’ll write that audio clip to a new file, audio.wav.
import soundfile as sf

sf.write(output_dir / 'audio.wav', wav, samplerate=sr)
Then, we’ll use some ffmpeg witchcraft to stitch the frames together into an mp4, including the audio clip we just wrote.
! ffmpeg \
-r {fps} \
-i images_walk_with_audio/frame%06d.png \
-i images_walk_with_audio/audio.wav \
-c copy \
-map 0:v:0 \
-map 1:a:0 \
-acodec aac \
-vcodec libx264 \
-pix_fmt yuv420p \
output_walk_with_audio.mp4
Let’s take a look at the video we just made. In Colab, you’ll have to run some code like this. Otherwise, you can just download the file and open it on your computer.
from IPython.display import HTML
from base64 import b64encode


def visualize_video_colab(video_path):
    mp4 = open(video_path, 'rb').read()
    data_url = "data:video/mp4;base64," + b64encode(mp4).decode()
    return HTML("""
    <video width=400 controls>
        <source src="%s" type="video/mp4">
    </video>
    """ % data_url)


visualize_video_colab('output_walk_with_audio.mp4')
Success!! 🔥
Parting Tips
- The quality of the interpolation is better with higher fps. I’ve been using 30 for most of my serious runs, but 60 is probably even better.
- The workflow I tend to use is (see the sketch after this list):
  - Generate images with random seeds, saving them along with the seeds I used.
  - Pick prompt/seed pairs you like, and use those to generate the videos. That way you know at a high level what the video will look like.
- You can mess with the margin kwarg in librosa.effects.hpss to increase/decrease the intensity of the effect we made here.
- There are TONs of ways to do what we did here, this is just one way. Feel free to experiment with creating your own T.
- The NSFW filter can be too liberal at times when generating videos. I tend to remove it when I’m generating videos for myself, but it’s a good idea to keep it in when you’re sharing your videos with others. Please do this responsibly.
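Here’s a minimal sketch of that seed-browsing workflow, reusing the get_noise helper and pipeline from above. The output folder, prompt, and number of candidates are placeholders, tweak them to taste:

import random
from pathlib import Path

candidate_dir = Path('seed_candidates')  # hypothetical output folder
candidate_dir.mkdir(exist_ok=True, parents=True)

prompt = 'blueberry spaghetti'  # any prompt you're considering
for _ in range(8):  # number of candidates is arbitrary
    seed = random.randint(0, 2**32 - 1)
    noise = get_noise(pipeline, seed, height=height, width=width)
    im = pipeline(prompt, latents=noise, height=height, width=width)['sample'][0]
    # keep the seed in the filename so you can reuse the prompt/seed pairs you like
    im.save(candidate_dir / f'{prompt.replace(" ", "_")}_seed{seed}.png')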
Conclusion
Today we saw how to create music videos using Stable Diffusion! If you want a nice interface for doing this, you should check out the Stable Diffusion Videos repo, where all of this is wrapped up nicely for you (Check the README for instructions on how to use it for music videos).
If you found this notebook helpful:
- consider giving this repo a star ⭐️
- consider following me on Github @nateraw
Thanks for reading! Cheers 🍻