FLUX.1-Krea: A New Approach to Natural-Looking AI Image Generation
BFL & Krea Release Major Open-Source Image Model Focused on Ultra-Realistic Details and Eliminating AI Aesthetics
Wow! Black Forest Labs and Krea have jointly open-sourced a new image model: FLUX.1 Krea [dev].
The model focuses on images with a distinctive aesthetic: no "AI look," no blown-out highlights, just natural detail.
Just as important, it is fully compatible with the existing FLUX open-source model ecosystem.
They also released a technical report detailing the model's design and training process, and explaining why "AI aesthetics" emerge in the first place. That analysis is the most valuable part, so let me summarize it here.
Case Examples First
Analyzing "AI Style"
"When a measure becomes a target, it ceases to be a good measure" —Charles Goodhart
There have been plenty of complaints lately about "AI faces" and "AI textures." AI-generated images tend to have an unmistakable look: overly blurred backgrounds, waxy skin, bland compositions, and so on. Together, these issues make up what is now called the "AI style."
People often focus on how "smart" a model is. We frequently see users probing with tricky prompts: Can it make a horse ride an astronaut? Can it fill a wine glass to the brim? Can it render text correctly?
Over the years, various benchmarks have been designed to formalize these questions into concrete metrics, and the research community has made real progress advancing generative models against them.
However, in the pursuit of technical capability and benchmark scores, the messy realism, stylistic diversity, and creative range found in early image models have been left behind.
The goal in training this model was simple: create AI images that don't look AI-generated. Unfortunately, many academic benchmarks and metrics don't align with what users actually want.
In the pre-training phase, metrics like Fréchet Inception Distance (FID) and CLIP Score are useful coarse measures of overall performance, since most generated images at that stage are still incoherent. Beyond pre-training, benchmarks like DPG, GenEval, T2I-CompBench, and GenAI-Bench are widely used in academia and industry.
However, these benchmarks mostly measure prompt adherence: spatial relationships, attribute binding, object counting, and so on.
For aesthetic evaluation, commonly used models include LAION-Aesthetics, PickScore, ImageReward, and HPSv2, but many of these are fine-tuned versions of CLIP, which operates on low-resolution (224×224 pixel) inputs and has relatively few parameters.
For example, the LAION aesthetic model, commonly used to filter for high-quality training images, is heavily biased toward images of women, blurred backgrounds, overly soft textures, and bright scenes. While such models are useful for raising aesthetic-quality scores, relying on them to curate training data bakes these implicit biases into the model's priors.
Better aesthetic scorers based on vision-language models are emerging, but the core problem remains: human preference and aesthetics are highly personal and cannot be reduced to a single number. Improving a model's capabilities without sliding into "AI style" requires careful data curation and thorough calibration of the model's outputs.
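To make the resolution limitation concrete: scorers like LAION-Aesthetics are essentially a small regression head on top of frozen CLIP embeddings. The sketch below is a hypothetical reconstruction of that pattern (the untrained head and the choice of `openai/clip-vit-large-patch14` are assumptions, not the exact published predictor). Note how the processor downsizes every image to 224×224 before scoring, which is precisely why fine texture detail cannot influence the score.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Frozen CLIP backbone; the aesthetic "model" is just a linear head on top.
clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
head = torch.nn.Linear(768, 1)  # hypothetical head; real scorers ship trained weights

def aesthetic_score(path: str) -> float:
    image = Image.open(path)
    # The processor resizes/crops to 224x224 -- fine detail is lost here.
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        emb = clip.get_image_features(**inputs)
        emb = emb / emb.norm(dim=-1, keepdim=True)  # unit-normalized embedding
        return head(emb).item()
```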
The Art of Mode Collapse
Training image generation models, much like training LLMs, can be roughly divided into two phases: pre-training and post-training. Most of a model's aesthetic character is learned during post-training, but before explaining the post-training methods, let's build some intuition for what each phase is for.
Pre-training
The pre-training phase should focus on mode coverage and world understanding. At this stage, the model acquires broad visual knowledge: styles, objects, places, people. The goal is to maximize diversity.
They even argue that pre-trained models should see "poor quality" data, as long as its flaws are accurately reflected in the conditioning inputs (for example, in the captions). Besides telling a model what we want, we usually also want to be able to tell it what we don't want.
Many image generation workflows use negative prompts like "too many fingers, deformed face, blurry, oversaturated" to improve image quality. For a negative prompt to steer the model away from undesirable parts of the data distribution, the model must first have learned what those parts look like. If it has never seen examples of "poor quality" images, negative prompts have nothing to push against.
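As a minimal sketch of why this is (generic classifier-free guidance, not anything FLUX-specific): the negative prompt replaces the usual empty-prompt branch, and the final prediction extrapolates away from it. If the model never learned what the negative concepts look like, the two branches barely differ and the subtraction accomplishes nothing.

```python
import torch

def cfg_with_negative_prompt(model, x_t, t, cond_emb, neg_emb, scale=7.0):
    """One guided denoising step. `model` predicts noise from
    (latents, timestep, text embedding); cond_emb/neg_emb come from the
    text encoder for the positive and negative prompts respectively."""
    eps_neg = model(x_t, t, neg_emb)   # what we steer away from
    eps_pos = model(x_t, t, cond_emb)  # what we steer toward
    # Extrapolate past the negative branch in the direction of the positive one.
    return eps_neg + scale * (eps_pos - eps_neg)
```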
Post-training has the greatest impact on the model's final quality, but it's important to remember that the model's quality ceiling and stylistic diversity come from the pre-trained model.
Post-training
In the post-training phase, the focus shifts to reweighting the distribution and gradually trimming away its undesirable parts. A pre-trained model can output diverse images and understands broad concepts, but without a strong enough bias toward aesthetic outputs, it struggles to produce high-quality images reliably.
This is where mode collapse comes in handy: we begin to bias the model toward parts of the distribution we consider ideal.
Starting from Raw Foundations
To begin post-training, they needed a "raw" model: a highly malleable base with a diverse output distribution that could be reshaped into a model with more opinionated aesthetic tendencies.
Unfortunately, many existing open-weight models have already undergone extensive fine-tuning and post-training. In other words, they are too "baked" to serve as a base.
To focus entirely on aesthetics, Krea partnered with the world-class foundation-model lab Black Forest Labs, who provided flux-dev-raw: a pre-trained, guidance-distilled 12-billion-parameter diffusion transformer.
As a pre-trained base, flux-dev-raw's image quality is far from that of state-of-the-art models. But it is an excellent starting point for three reasons:
- flux-dev-raw contains extensive world knowledge: it already understands common objects, animals, people, camera angles, artistic media, and more.
- Although it is a raw model, it already produces convincing quality: coherent structures, basic compositions, and even legible text.
- It is not "baked": it is an uncontaminated model free of "AI aesthetics," able to generate very diverse images ranging from raw to beautiful.
Post-training Pipeline
The post-training pipeline has two stages: supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF).
In the SFT stage, they hand-picked a set of high-quality images that met their aesthetic bar. For FLUX.1 Krea [dev], they also mixed in high-quality synthetic samples from Krea-1; the synthetic images help stabilize checkpoint performance.
Since flux-dev-raw is a guidance-distilled model, they designed a custom loss to fine-tune it directly on the classifier-free guidance (CFG) distribution. After SFT, output quality improved significantly, but further work was needed to make the model more robust and achieve the desired aesthetic. This is where RLHF comes into play.
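Before turning to RLHF: the report does not publish this custom loss, but since FLUX's distilled models take the guidance scale as an input, one plausible reading is a standard rectified-flow objective queried at the sampling-time guidance scale, so the curated images reshape the guided distribution directly. Everything below (the `guidance` argument, the function signature) is an assumption for illustration, not their actual implementation.

```python
import torch
import torch.nn.functional as F

def sft_loss(model, x0, cond_emb, guidance_scale=3.5):
    """Hypothetical rectified-flow SFT loss for a guidance-distilled model.

    x0: latents of a curated high-quality image batch. The model is queried
    at the guidance scale used during sampling, so the fine-tune shifts the
    guided (CFG) distribution rather than the raw conditional one."""
    noise = torch.randn_like(x0)
    t = torch.rand(x0.shape[0], device=x0.device)
    t_ = t.view(-1, *([1] * (x0.dim() - 1)))
    x_t = (1 - t_) * x0 + t_ * noise   # linear interpolation path
    v_target = noise - x0              # rectified-flow velocity target
    v_pred = model(x_t, t, cond_emb, guidance=guidance_scale)
    return F.mse_loss(v_pred, v_target)
```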
In the RLHF stage, they applied a variant of a preference optimization technique called TPO to further sharpen the model's aesthetics and style.
They used rigorously screened, high-quality internal preference data, and in many cases ran multiple rounds of preference optimization to progressively calibrate the model's outputs.
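The exact TPO variant is not public, so as a stand-in illustration here is the shape of a Diffusion-DPO-style pairwise loss, a well-known preference-optimization objective for diffusion models. The inputs are denoising errors of the trained and frozen reference models on the preferred (`w`) and rejected (`l`) image of each pair; all names here are illustrative.

```python
import torch.nn.functional as F

def preference_loss(err_w, err_l, err_w_ref, err_l_ref, beta=0.1):
    """Diffusion-DPO-style pairwise objective (illustrative stand-in for TPO).

    err_*: per-pair denoising MSE under the trained model and a frozen
    reference. The loss rewards lowering error on preferred images and
    raising it on rejected ones, relative to the reference model."""
    margin = (err_w_ref - err_w) - (err_l_ref - err_l)
    return -F.logsigmoid(beta * margin).mean()
```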
While exploring various post-training techniques, Krea arrived at a few key findings worth sharing.
Quality over Quantity:
Surprisingly little data is needed for good post-training (fewer than 1M images). Volume helps with stability and bias reduction, but data quality matters most. This is consistent with prior reports of strong results from training on small, carefully curated datasets.
Preference labels were collected by annotators with a deep understanding of the current model's limitations, strengths, and weaknesses. In particular, they ensured the preference-annotation interface showed diverse image sets, so that annotations were targeted.
Taking an Opinionated Approach:
There are many open-source preference datasets used as benchmarks for preference fine-tuning techniques. During exploration, these are useful for trying out different methods, but they found that training on them led to some unexpected behaviors.
They believe models fine-tuned on "global" user preferences are suboptimal. For objectives with an objective ground truth, such as text rendering, anatomy, structure, and prompt following, diverse, large-scale data helps. For subjective objectives like aesthetics, however, mixing different people's aesthetic preferences together is almost adversarial.
Based on this intuition, they collected preference data in a deliberately opinionated way, aligned with a specific aesthetic taste and a clear artistic direction. Often, overfitting the model to a few specific styles works better and is easier to manage.
Model Download: https://huggingface.co/black-forest-labs/FLUX.1-Krea-dev
Full Announcement: https://bfl.ai/announcements/flux-1-krea-dev
Original article: https://mp.weixin.qq.com/s/NfQZ0DrL03Tnrjjl_fYElQ
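For reference, a minimal generation sketch using the diffusers library (assuming a recent version with Flux support and sufficient GPU memory; the prompt and sampling parameters are illustrative):

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Krea-dev", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # trade speed for lower VRAM use

image = pipe(
    "a candid photo of a street market at dusk, natural light",
    guidance_scale=4.5,
    num_inference_steps=28,
).images[0]
image.save("flux-krea-sample.png")
```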