Description
Make-A-Scene is a text-to-image method that (i) enables a simple control mechanism complementary to text in the form of a scene, (ii) introduces elements that improve the tokenization process by employing domain-specific knowledge over key image regions (faces and salient objects), and (iii) adapts classifier-free guidance for the transformer use case.