Device and Method for Generating Images
Abstract
A method for generating an image. The method includes: providing a randomly drawn image or representation thereof as input to a sequence of layers of a neural network which includes a cross-attention layer; providing, to the cross-attention layer, a first input, which is either a representation of the input of the sequence determined by the layers preceding the cross-attention layer or the input to the sequence of layers itself, and a second input, which is a text embedding characterizing a description of the image to be generated; determining, by the cross-attention layer, an attention map based on the first and second inputs; optimizing the input provided to the sequence of layers based on a loss function which includes a term characterizing a negative total variation of the attention map; determining an output of the sequence of layers using the optimized input; and determining the image based on the determined output.
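The optimization step can be illustrated with a minimal sketch in plain Python. It is an illustration under stated assumptions, not the claimed implementation: the first input is taken to be the input to the sequence of layers itself, learned query and key projections are omitted, the attention map is a softmax over text tokens of scaled dot products, total variation is computed as the anisotropic sum of absolute differences between neighbouring pixels, and gradients are estimated by finite differences with a backtracking step size. All function names (`attention_maps`, `total_variation`, `loss`, `optimize_latent`) and parameters (`steps`, `lr`, `eps`) are hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def attention_maps(latent, tokens, h, w):
    """latent: h*w feature vectors (queries); tokens: text-embedding vectors (keys).
    Returns maps[t][i][j], the attention weight of pixel (i, j) on token t.
    Assumption: no learned projections, queries and keys are used as given."""
    d = len(tokens[0])
    maps = [[[0.0] * w for _ in range(h)] for _ in tokens]
    for p, q in enumerate(latent):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        for t, a in enumerate(softmax(scores)):
            maps[t][p // w][p % w] = a
    return maps


def total_variation(m):
    """Anisotropic total variation of a 2D map (list of rows)."""
    h, w = len(m), len(m[0])
    tv = 0.0
    for i in range(h):
        for j in range(w):
            if i + 1 < h:
                tv += abs(m[i + 1][j] - m[i][j])
            if j + 1 < w:
                tv += abs(m[i][j + 1] - m[i][j])
    return tv


def loss(latent, tokens, h, w):
    """Negative total variation, summed over the per-token attention maps."""
    return -sum(total_variation(m) for m in attention_maps(latent, tokens, h, w))


def optimize_latent(latent, tokens, h, w, steps=10, lr=0.5, eps=1e-4):
    """Lower the loss by finite-difference gradient descent with backtracking.
    A step is accepted only if it strictly decreases the loss."""
    latent = [list(v) for v in latent]  # work on a copy
    for _ in range(steps):
        base = loss(latent, tokens, h, w)
        grad = [[0.0] * len(v) for v in latent]
        for p, v in enumerate(latent):
            for c in range(len(v)):
                old = v[c]
                v[c] = old + eps
                grad[p][c] = (loss(latent, tokens, h, w) - base) / eps
                v[c] = old  # restore the perturbed entry
        step = lr
        while step > 1e-6:
            trial = [[x - step * g for x, g in zip(v, gv)]
                     for v, gv in zip(latent, grad)]
            if loss(trial, tokens, h, w) < base:
                latent = trial
                break
            step *= 0.5  # shrink the step and retry
    return latent


# Toy data: a randomly drawn 4x4 input with 3 channels and 2 text tokens.
random.seed(0)
H, W, D = 4, 4, 3
z0 = [[random.gauss(0.0, 1.0) for _ in range(D)] for _ in range(H * W)]
text = [[random.gauss(0.0, 1.0) for _ in range(D)] for _ in range(2)]
z_opt = optimize_latent(z0, text, H, W)
print("loss before:", loss(z0, text, H, W))
print("loss after: ", loss(z_opt, text, H, W))
```

Because a step is accepted only when it lowers the loss, the loss after optimization is never higher than before, which corresponds to a total variation of the attention maps that is never lower. In a full pipeline the optimized input would then be passed through the sequence of layers to obtain the output from which the image is determined; that part is not modeled in this sketch.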