What do all recent super-powerful image models like DALLE, Imagen, or Midjourney have in common? Other than their high computing costs, huge training time, and shared hype, they are all based on the same mechanism: diffusion.
Diffusion models recently achieved state-of-the-art results for most image tasks, including text-to-image with DALLE, but also many other image generation-related tasks like image inpainting, style transfer, or image super-resolution. But how do they work? Learn more in the video...

References

►Read the full article: https://www.louisbouchard.ai/latent-diffusion-models/
►Rombach, R., Blattmann, A., Lorenz, D., Esser, P. and Ommer, B., 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10684–10695). https://arxiv.org/pdf/2112.10752.pdf
►Latent Diffusion Code: https://github.com/CompVis/latent-diffusion
►Stable Diffusion Code (text-to-image based on LD): https://github.com/CompVis/stable-diffusion
►Try it yourself: https://huggingface.co/spaces/stabilityai/stable-diffusion
►Web application:
https://stabilityai.us.auth0.com/u/login?state=hKFo2SA4MFJLR1M4cVhJcllLVmlsSV9vcXNYYy11Q25rRkVzZaFur3VuaXZlcnNhbC1sb2dpbqN0aWTZIFRjV2p5dHkzNGQzdkFKZUdyUEprRnhGeFl6ZVdVUDRZo2NpZNkgS3ZZWkpLU2htVW9PalhwY2xRbEtZVXh1Y0FWZXNsSE4
►My Newsletter (A new AI application explained weekly to your emails!): https://www.louisbouchard.ai/newsletter/

Video Transcript

0:00
What do all recent super-powerful image models like DALLE, Imagen, or Midjourney have in common? Other than their high computing costs, huge training time, and shared hype, they are all based on the same mechanism: diffusion. Diffusion models recently achieved state-of-the-art results for most image tasks, including text-to-image with DALLE, but also many other image generation-related tasks like image inpainting, style transfer, or image super-resolution. Still, there are a few downsides: they work sequentially on the whole image, meaning that both the training and inference times are extremely expensive. This is why you need hundreds of GPUs to train such a model and why you wait a few minutes to get your results. It's no surprise that only the biggest companies like Google or OpenAI are releasing those models.
0:47
But what are they? I've covered diffusion models in a couple of videos, which I invite you to check out for a better understanding. They are iterative models that take random noise as input, which can be conditioned with a text or an image, so it's not completely random. The model iteratively learns to remove this noise, learning what parameters it should apply to the noise to end up with a final image. So the basic diffusion models take random noise the size of the image and learn to remove it step by step until we get back to a real image. This is possible because the model has access to the real images during training: it can learn the right parameters by applying noise to the image iteratively until it reaches complete noise and the image is unrecognizable.
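As a concrete illustration of that noising process, here is a minimal sketch of a standard DDPM-style forward step, assuming PyTorch; the schedule values and helper name are my own choices for illustration, not something from the video or the paper.

import torch

# Minimal sketch of the forward (noising) step of a diffusion model.
# The linear beta schedule and step count are standard DDPM-style choices,
# used here only for illustration.
T = 1000                                      # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)         # noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def add_noise(x0, t):
    """Noise a clean image x0 directly to step t:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = torch.randn_like(x0)
    a = alphas_bar[t].sqrt().view(-1, 1, 1, 1)
    b = (1.0 - alphas_bar[t]).sqrt().view(-1, 1, 1, 1)
    return a * x0 + b * eps, eps

# During training the denoising network sees (x_t, t) and learns to predict eps;
# at inference it starts from pure noise and reverses the chain step by step.
x0 = torch.randn(4, 3, 64, 64)                # stand-in for a batch of images
t = torch.randint(0, T, (4,))
xt, eps = add_noise(x0, t)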
1:33
Then, when we are satisfied with the noise we get from all our images, meaning that they look similar and are generated from a similar distribution, we are ready to use our model in reverse: we feed it similar noise and run the process in the reverse order, expecting an image similar to the ones used during training. The main problem here is that you are working directly with the pixels of a large data input like an image. Let's see how we can fix this computation issue while keeping the quality of the results the same, as shown here compared with DALLE. But first, give me a few seconds to introduce you to my friends at Qwak, who are sponsoring this video.
2:11
As you most certainly know, the majority of businesses now report AI and ML adoption in their processes, but complex operations such as model deployment, training, testing, and feature store management seem to stand in the way of progress. ML model deployment is one of the most complex processes. It is such a rigorous process that data scientist teams spend way too much time solving back-end and engineering tasks before being able to push the model into production, something I personally experienced. It also requires very different skill sets, often requiring two different teams working closely together. Fortunately for us, Qwak delivers a fully managed platform that unifies ML engineering and data operations, providing agile infrastructure that enables the continuous productization of ML models at scale. You don't have to learn how to do everything end-to-end anymore, thanks to them. Qwak empowers organizations to deliver machine learning models into production at scale. If you want to speed up your model delivery to production, please take a few minutes and click the first link below to check what they offer, as I'm sure it will be worthwhile. Thanks to anyone taking a look and supporting me and my friends at Qwak.
3:23
How can these powerful diffusion models be made computationally efficient? By transforming them into latent diffusion models. This means that Robin Rombach and his colleagues implemented the diffusion approach we just covered within a compressed image representation instead of the image itself, and then worked to reconstruct the image. So they are no longer working in the pixel space, with regular images. Working in such a compressed space not only allows for more efficient and faster generations, since the data size is much smaller, but also allows working with different modalities. Since they are encoding the inputs, you can feed the model any kind of input, like images or text, and it will learn to encode these inputs into the same subspace that the diffusion model uses to generate an image. So yes, just like the CLIP model, a single model will work with text or images to guide generations.
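To give a rough sense of why that compression matters, here is a small back-of-the-envelope sketch. The 8x downsampling factor and 4 latent channels match the released Stable Diffusion configuration, but treat the exact numbers as illustrative.

# Rough illustration of why working in the latent space is cheaper.
# The 8x downsampling factor and 4 latent channels follow the released
# Stable Diffusion setup; treat them as illustrative rather than definitive.
height, width = 512, 512

pixel_values = 3 * height * width                  # RGB pixel space
latent_values = 4 * (height // 8) * (width // 8)   # compressed latent space

print(pixel_values)                   # 786432
print(latent_values)                  # 16384
print(pixel_values / latent_values)   # 48.0 -> the diffusion model sees ~48x fewer values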
4:16
The overall model will look like this: you take your initial image, here x, and encode it into an information-dense space called the latent space, or z. This is very similar to a GAN, where you use an encoder model to take the image and extract the most relevant information about it in a subspace, which you can see as a downsampling task, reducing its size while keeping as much information as possible. You are now in the latent space with your condensed input. You then do the same thing with your conditioning inputs, either text, images, or anything else, and merge them with your current image representation using attention, which I described in another video. This attention mechanism learns the best way to combine the input and conditioning inputs in this latent space, adding attention, a transformer feature, to diffusion models. These merged inputs are now your initial noise for the diffusion process.
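A minimal sketch of that cross-attention merging step, assuming PyTorch and made-up tensor sizes (they are not the paper's exact dimensions): the image latents act as queries, while the conditioning tokens, for example text embeddings, act as keys and values.

import torch

# Cross-attention sketch: queries come from the image latents,
# keys and values from the conditioning (e.g., text) tokens.
# All sizes here are made up for illustration.
d = 64                                     # shared attention dimension
latents = torch.randn(1, 64 * 64, d)       # flattened latent "pixels"
text_tokens = torch.randn(1, 77, d)        # e.g., 77 text-encoder tokens

to_q = torch.nn.Linear(d, d)
to_k = torch.nn.Linear(d, d)
to_v = torch.nn.Linear(d, d)

q, k, v = to_q(latents), to_k(text_tokens), to_v(text_tokens)
attn = torch.softmax(q @ k.transpose(1, 2) / d ** 0.5, dim=-1)   # (1, 4096, 77)
conditioned = attn @ v                     # latent features now "aware" of the conditioning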
5:11
Then you have the same diffusion model I covered in my Imagen video, but still in this subspace. Finally, you reconstruct the image using a decoder, which you can see as the reverse step of your initial encoder, taking this modified and denoised input in the latent space to construct a final high-resolution image, basically upsampling your results. And voilà! This is how you can use diffusion models for a wide variety of tasks like super-resolution, inpainting, and even text-to-image with the recently open-sourced Stable Diffusion model, through the conditioning process, while being much more efficient and allowing you to run them on your own GPUs instead of requiring hundreds of them. You heard that right.
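Putting the pieces together, the inference loop of a latent diffusion model can be summarized in a short sketch. The names unet and decoder below are placeholders for the paper's denoising network and autoencoder decoder, not a real API, and the update rule is a crude stand-in for a proper sampler such as DDIM or PLMS.

import torch

def generate(prompt_embedding, unet, decoder, steps=50):
    # Start from pure noise in the compressed latent space (placeholder shape).
    latent = torch.randn(1, 4, 64, 64)
    for t in reversed(range(steps)):
        # The denoising network predicts the noise, conditioned on the prompt
        # through cross-attention (see the sketch above).
        noise_pred = unet(latent, t, prompt_embedding)
        latent = latent - noise_pred / steps   # simplified sampler update
    # Decode the cleaned-up latent back into a full-resolution image.
    return decoder(latent)

For image-to-image tasks such as inpainting or super-resolution, the encoder would first map the input image into that same latent space before the denoising loop runs.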
5:56
For all the devs out there wanting to have their own text-to-image and image synthesis model running on their own GPUs, the code is available with pre-trained models; all the links are below. If you do use the model, please share your tests and results, or any feedback you have, with me. I'd love to chat about that. Of course, this was just an overview of the latent diffusion model, and I invite you to read their great paper, linked below, to learn more about the model and the approach. Huge thanks to my friends at Qwak for sponsoring this video, and an even bigger thanks to you for watching the whole video. I will see you next week with another amazing paper!
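If you want to try the released weights from Python rather than through the web demo, one convenient route is Hugging Face's diffusers wrapper. This is only a sketch of that option, not the repository's own scripts, so double-check the diffusers documentation and the model license on the Hub before running it.

# Minimal sketch using Hugging Face's diffusers wrapper around Stable Diffusion.
# Requires `pip install diffusers transformers` and accepting the model license
# on the Hugging Face Hub; check the diffusers docs for the current API.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
).to("cuda")

image = pipe("a photograph of an astronaut riding a horse").images[0]
image.save("astronaut.png")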