
text and control input align #46

Open
hongsukchoi opened this issue Dec 31, 2023 · 3 comments

Comments

@hongsukchoi

Thank you for your great work!

I have a question about the ControlNet extension. It seems the text is spatially aligned with the latent embeddings originally from SD, but how is the spatial alignment between text and geometric control (e.g. scribble) done?

Reading through the code here, I think there is no alignment between the text embeddings and the geometric control embeddings. Am I right?

Thank you!

@lwchen6309
Collaborator

Yes, you're right that there is no explicit alignment in ControlNet. What it does is simply encode the geometric control input into features and add them to the SD intermediate features.
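For reference, a minimal sketch of that feature-addition path (module and variable names here are illustrative, not the actual repo code):

```python
import torch
import torch.nn as nn

class TinyControlAdapter(nn.Module):
    # Hypothetical stand-in for a ControlNet-style branch: it encodes the
    # geometric control image (e.g. a scribble) into feature maps, and the
    # result is simply added to the UNet's intermediate features.
    def __init__(self, in_channels=3, feat_channels=320):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, feat_channels, kernel_size=3, padding=1),
            nn.SiLU(),
            # zero-initialized projection (as in ControlNet) so the branch
            # starts out with no effect on the pretrained SD features
            nn.Conv2d(feat_channels, feat_channels, kernel_size=1),
        )
        nn.init.zeros_(self.encoder[-1].weight)
        nn.init.zeros_(self.encoder[-1].bias)

    def forward(self, unet_hidden, control_image):
        control_feat = self.encoder(control_image)
        # No spatial alignment against the text embeddings happens here;
        # the control features are just summed onto the SD features.
        return unet_hidden + control_feat
```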

@t00350320

t00350320 commented Mar 11, 2024

Hi @lwchen6309,
By the way, I have another question.
Your test code in runner_inpait.py uses
"input_prompt": "A digital painting of a half-frozen lake near mountains under a full moon and aurora. A boat is in the middle of the lake. Highly detailed.",
I have printed the value of the color cross_attention_weight_64 corresponding to token="aurora", which looks like this:

        [0.0000, 0.0000, 0.0000, 0.0000, 0.5000, 0.5000, 0.5000, 0.5000],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.5000, 0.5000, 0.5000, 0.5000],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.5000, 0.5000, 0.5000, 0.5000],
        [0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000, 0.5000],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000]]

So we guess the aurora's cross-attention location will be near the upper right, the same as for token="full moon".
But then why do we also need to put another mask image file, pointing out the moon's real position, into the latent space like this:

latent_model_input = torch.cat([latent_model_input, mask, masked_image_latents], dim=1)

Won't this duplicate the previous color cross_attention_weight?
PTAL!
Thank you!

@lwchen6309
Collaborator

Hi, I think the image_mask is just there to specify the region for inpainting. The object segmentation is still controlled by the cross-attention weights.
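A rough sketch of how the two mechanisms stay separate (assuming the usual SD inpainting input layout; names here are illustrative):

```python
import torch

# --- Mechanism 1: per-token spatial weighting inside cross-attention ---
# attn_scores:   [batch*heads, query_pixels, text_tokens]
# region_weight: [query_pixels], e.g. the 8x8 cross_attention_weight_64 map
#                printed above, flattened, for one token index
def bias_attention_for_token(attn_scores, region_weight, token_idx, scale=1.0):
    attn_scores = attn_scores.clone()
    # Boost the chosen token's score wherever its region weight is positive,
    # which is what steers the object (e.g. "aurora") to that area.
    attn_scores[:, :, token_idx] += scale * region_weight
    return attn_scores.softmax(dim=-1)

# --- Mechanism 2: inpainting mask concatenated onto the UNet input ---
# The mask only tells the inpainting UNet *where* it may repaint;
# it does not decide *which* object goes where.
latents = torch.randn(1, 4, 64, 64)
mask = torch.zeros(1, 1, 64, 64)               # 1 = region to inpaint
masked_image_latents = torch.randn(1, 4, 64, 64)
latent_model_input = torch.cat([latents, mask, masked_image_latents], dim=1)  # 9 channels
```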
