It is a mission, as Robin Rombach, CEO of German artificial intelligence (AI) company Black Forest Labs, calls it. Little could Rombach or Black Forest Labs have realised, as they pulled the trigger, the seismic shift the company's FLUX.1 models would bring to the generative AI space. Or perhaps they did, because this is the work of the same researchers who had previously developed the Stable Diffusion models. The FLUX.1 suite, a collection of three text-to-image models, now competes not just with Stable Diffusion 3, but also OpenAI's popular DALL-E 3, Google's Imagen 3 and Midjourney 6.
Rombach insists the plan is to build high-quality generative deep learning models for images and video, and "make them available to the broadest audience possible." For now, three models make up the suite. FLUX.1 [pro] is the benchmark model for image generation, aimed at creative professionals, a demographic the company hopes to win over. Then there is FLUX.1 [schnell], whose name is a giveaway to its structure and positioning – "schnell" is German for fast. It is smaller than the "pro" model, with a long-term vision of personal use and accessibility on a wider variety of devices. The third is FLUX.1 [dev], which is configured and trained for non-commercial use.
FLUX.1 is available via select platforms and tools for now, including the AI cloud hosting platforms Replicate, fal.ai and Hugging Face, which work on a pay-to-generate model. HT accessed FLUX.1, including the [schnell] model, via NightCafe (credits accumulated over time come in handy), which gave us a chance to compare it with the likes of DALL-E 3 for prompt fidelity as well as realism of generations.
The first thing we checked, considering this is where many an AI image generation model has stumbled in the past, was how human hands are replicated. FLUX.1, in both pro and schnell modes, does more than an acceptable job of generating human hands; if you look closely at the featured image with this article, the skin has a nice texture and even the nails come through with their own unique detailing. DALL-E 3, in its generation, seems to favour angles that draw attention away from complex detailing such as nails and the reflections off them, and it appears overaggressive in smoothing out skin textures, perhaps treating them as imperfections or noise.
That said, the rest of the detailing across prompts tends to be quite even. We tested both FLUX.1 models as well as DALL-E 3 with another prompt, "A girl walking on a serene beach as the sun sets", and all three delivered on the sense of place and the finer details, such as the subtle reflectiveness of wet sand and the sunset sky. The DALL-E 3 generation looked a little more futuristic in its textures, a generation language that some may appreciate.
Black Forest Labs says FLUX.1 [pro], FLUX.1 [schnell] and FLUX.1 [dev] use a "hybrid" architecture that combines diffusion and transformer methods, scaled to 12 billion parameters. "We improve over previous state-of-the-art diffusion models by building on flow matching, a general and conceptually simple method for training generative models, which includes diffusion as a special case. In addition, we increase model performance and improve hardware efficiency by incorporating rotary positional embeddings and parallel attention layers," the company explains. It promises to release more technical details in the coming weeks.
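The flow matching idea the company cites can be illustrated in a few lines: rather than learning to reverse a noising process step by step, the model learns a velocity field that transports a noise sample to a data sample along a straight-line path. The toy sketch below is a generic illustration of that training objective, not Black Forest Labs' implementation – the 1-D "data" distribution and the linear stand-in for the neural network are assumptions made purely for clarity:

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_pair(x1, rng):
    """One flow-matching training tuple: interpolated sample x_t,
    timestep t, and target velocity v = x1 - x0 (straight-line path)."""
    x0 = rng.standard_normal(x1.shape)      # pure-noise endpoint
    t = rng.uniform(size=(len(x1), 1))      # random time in [0, 1]
    xt = (1.0 - t) * x0 + t * x1            # point along the noise-to-data line
    return xt, t, x1 - x0

def features(xt, t):
    # Toy linear feature map standing in for a real neural network.
    return np.concatenate([xt, t, np.ones_like(t)], axis=1)

x1 = rng.standard_normal((256, 1)) * 2.0 + 3.0   # toy 1-D "data" distribution
w = np.zeros((3, 1))                              # weights of the toy model

for _ in range(500):                              # gradient descent on the MSE loss
    xt, t, v = flow_matching_pair(x1, rng)
    f = features(xt, t)
    w -= 0.1 * f.T @ (f @ w - v) / len(x1)        # regress predicted velocity onto v

xt, t, v = flow_matching_pair(x1, rng)
final_loss = float(np.mean((features(xt, t) @ w - v) ** 2))
```

At sampling time, a trained velocity model is integrated from t=0 (noise) to t=1 (data), which is why the method subsumes diffusion as a special case of the path choice.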
At this time, Black Forest Labs hasn't detailed the specifics of how and where the FLUX.1 training data was collected. If that training data does include copyrighted content, something HT cannot confirm for now, it wouldn't be the first AI company to have used it. Microsoft, OpenAI, Stability AI, Meta and Google are all facing legal action, including from artists and content creators, for allegedly using data without consent to train the AI models behind image generators, chatbots and written content.
They aren't done yet. The company has also teased SOTA, a suite of competitive generative text-to-video systems, which arrives sometime in the coming months. An interesting plotline, when it finally unfolds, will be how SOTA stands up against OpenAI's incredibly impressive video generation tool Sora.