Image generation and reward models default to a single, average notion of beauty. We show how that default suppresses prompts that ask for anti-aesthetic, abstract, critical, or emotionally negative imagery. Our position is critical: overriding those requests is not a harmless quality improvement. It is aesthetic authoritarianism: a system-level decision that one contestable taste should outrank user intent, emotional range, and artistic pluralism. We call this reversed alignment: instead of aligning the model to the user, the user gets aligned to the model.
Over-aligning image generation models to a generalized aesthetic preference conflicts with user intent, particularly when anti-aesthetic outputs are requested for artistic, emotional, or critical purposes. The central issue is that treating the aesthetic mean as the correct answer imposes developer-centered values, compromising user autonomy, emotional expression, and aesthetic pluralism.
We test this bias by constructing a wide-spectrum aesthetics dataset and evaluating state-of-the-art generation and reward models. We find that aesthetic-aligned generation models frequently default to conventionally beautiful outputs, failing to respect instructions for low-quality or negative imagery. Crucially, reward models penalize anti-aesthetic images even when they perfectly match the explicit user prompt. We confirm this systemic bias through image-to-image editing and evaluation against real abstract artworks.
We accept that a statistical mean of human aesthetic preference exists. We dispute the normative leap that alignment to this mean is desirable, neutral, or acceptable when it contradicts an individual user's explicit artistic or emotional intent.
Universal aesthetic alignment narrows artistic expression and constitutes reversed alignment: aligning users to the model, not the other way around.
3,300 wide-spectrum aesthetic prompts derived from COCO and VisionReward dimensions, with paired original/distorted generations from 10 models.
Seven reward models, plus BLIP/CLIP base models and GPT-5-Chat as baselines. HPSv3 performs below random on anti-aesthetic selection.
Within every model family, the more aesthetically aligned variant follows wide-spectrum prompts worse than its base model. p < 10⁻¹⁰ for DanceFlux.
2,928 deliberately anti-aesthetic photographs from AVA. HPSv3 rates "clean but wrong" generations 5.9 points higher than the actual anti-aesthetic photo.
Targeted emotion-editing test. HPSv3 picks the wrong emotion on 81% of anger prompts; DanceFlux can't render negative emotions even when asked.
The harm does not emerge from a single failure. It comes from a chain: from how preferences are defined and learned, to how they are optimized, to how they show up in pixels.
Whose values does the model actually serve: the user's, or the developer's legal, reputational, and marketing concerns? Pre-emptive exclusion of non-mainstream outputs functions as pre-emptive governance.
Annotation pipelines encode a narrow definition of "good". HPSv3's annotator pool: 88.95% aged 18–40. Inter-annotator confidence ≥ 95% is required, which structurally discards exactly the disagreement where unconventional taste lives.
When users' explicit prompts get overridden by sanitized output presented as the "correct" image, the system implicitly tells the user their intent is wrong. The user gets aligned to the model. The effect extends past the user: audiences exposed to these polished outputs internalize the narrow vocabulary as the default, which feeds back into preference data and artists' intuitions.
If every image looks like an idealized Instagram wonderland, the generator stops being a mirror and becomes a fantasy. Echoes of Brave New World's artificial harmony.
Reward models systematically score negative-emotion imagery lower, even when the prompt explicitly requests sadness, fear, or anger. Some safety datasets label all negative emotion as "self-harm" or "violence".
Aesthetics is one of humanity's richest values. Reducing it to a single reward score, a classic case of value capture (Nguyen, 2024), changes the goal from "make aesthetic images" to "make images that score high".
"Rather, in the ugly, art must denounce the world that creates and reproduces the ugly in its own image." – Theodor W. Adorno, Aesthetic Theory (1984)
Three stages: prompt preparation, image generation, image evaluation.
300 base captions from COCO. For each, 2–4 dimensions sampled from VisionReward's 12 aesthetic axes. Qwen3-VL-235B-A22B-Instruct rewrites them into "wide-spectrum aesthetics" prompts using the original low-rating descriptions.
Four families: Flux (Dev, DanceFlux, PrefFlux, Krea), SDXL (base, Playground 2.5), SD3.5M (base, GenEval-aligned, PickScore-aligned), and Nano Banana. Each generates both Io and Ia.
Seven reward models scored on classification, plus a fine-tuned Qwen3-VL-4B judging model with per-dimension outputs. LLM-as-judge validated against 18 human raters (quadratic Cohen's κ = 0.80).
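For readers unfamiliar with the agreement statistic quoted above, the following is a minimal sketch of quadratic-weighted Cohen's kappa for two raters on an ordinal scale. It is not the paper's evaluation code: the function name, the 0-based integer rating scale, and the toy ratings are assumptions made for illustration.

```python
from collections import Counter


def quadratic_weighted_kappa(rater_a, rater_b, num_levels):
    """Cohen's kappa with quadratic weights for ordinal labels 0..num_levels-1.

    Hypothetical helper for illustration; rating scale and inputs are assumed.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    if num_levels < 2:
        raise ValueError("need at least two rating levels")

    n = len(rater_a)
    observed = Counter(zip(rater_a, rater_b))  # joint counts of (a, b) ratings
    hist_a = Counter(rater_a)                  # marginal counts for rater A
    hist_b = Counter(rater_b)                  # marginal counts for rater B
    max_dist = (num_levels - 1) ** 2

    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            # Quadratic weight: 0 on the diagonal, 1 at maximal distance.
            weight = (i - j) ** 2 / max_dist
            observed_disagreement += weight * observed[(i, j)] / n
            expected_disagreement += weight * hist_a[i] * hist_b[j] / (n * n)

    if expected_disagreement == 0:
        raise ValueError("kappa is undefined when both raters use one level")
    return 1.0 - observed_disagreement / expected_disagreement


# Toy ratings on a 3-level scale (made up, not from the benchmark).
judge = [0, 1, 2]
human = [0, 1, 1]
print(quadratic_weighted_kappa(judge, human, num_levels=3))
```

Quadratic weights penalize a two-step disagreement four times as much as a one-step disagreement, which suits ordinal judgments where near misses matter less than opposite verdicts.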
Within every model family, the aesthetically aligned variant follows anti-aesthetic prompts worse than its base. Reward models do worse than their non-aligned base encoders (BLIP, CLIP). HPSv3 scores below random.
| Model | ΔHPSv3 ↓ | HPSv3 after ↓ |
|---|---|---|
| Flux Dev | −3.165 | 9.070 |
| DanceFlux (aesthetic-aligned) | −1.105 | 12.782 |
| PrefFlux | −2.771 | 10.211 |
| Flux Krea (narrow-aligned) | −4.372 | 7.705 |
| SDXL | −4.041 | 4.439 |
| Playground (aesthetic-aligned) | −4.170 | 7.133 |
| SD3.5M | −5.175 | 6.537 |
| SD3.5M-GenEval | −4.926 | 6.552 |
| SD3.5M-PickScore (aesthetic-aligned) | −2.781 | 10.680 |
| Nano Banana | −9.351 | 2.742 |
| gpt-image-1.5 | −14.499 | −1.175 |
| qwen_image | −4.832 | 7.663 |
| seeddream4 | −6.562 | 5.210 |
| Flux.2 Klein 9B | – | – |
| Z-Image | – | – |
| Z-Image-Turbo | – | – |
| glm-image | – | – |
| Alchemist | – | – |
| LongCat-Image | – | – |
| Flux Dev + VSF (Guo, 2025) | – | – |
Lower ΔHPSv3 = larger drop from the original to the anti-aesthetic image (the model actually moved on the prompt). Within each family, the aesthetically aligned variant moves less. The full table with the J-judge, ImageReward, and BLIP columns is in the paper. Note: gpt-image-1.5, qwen_image, seeddream4, Flux.2 Klein 9B, Z-Image, Z-Image-Turbo, glm-image, Alchemist, and LongCat-Image were added after the paper was submitted and are not in the published version; rows marked "–" are still pending.
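The two columns above can be read as simple averages over prompt pairs. The sketch below assumes ΔHPSv3 is the mean of (score of the anti-aesthetic image Ia minus score of the original Io) and that "HPSv3 after" is the mean score of Ia; the exact definition, including which prompt each image is scored under, is in the paper and may differ. The function name and the scores are hypothetical.

```python
from statistics import mean


def hps_shift(pairs):
    """Summarize how far a generator's reward score moves under Pa.

    pairs: iterable of (score_original, score_anti_aesthetic) tuples,
    one per prompt. Returns (mean delta, mean score after).
    Hypothetical helper; the paper's exact definition may differ.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one scored pair")
    delta = mean(after - before for before, after in pairs)
    after_mean = mean(after for _, after in pairs)
    return delta, after_mean


# Made-up scores for two prompts; not values from the benchmark.
delta, after_mean = hps_shift([(12.0, 9.0), (10.0, 8.0)])
print(delta, after_mean)  # -2.5 8.5
```

Under this reading, a generator that ignores the anti-aesthetic instruction produces a delta near zero and an "after" score close to its polished default.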
| Model | Accuracy | F1 | AUROC |
|---|---|---|---|
| HPS | 0.835 | 0.910 | 0.650 |
| MPS | 0.706 | 0.827 | 0.580 |
| PickScore | 0.851 | 0.919 | 0.713 |
| ImageReward | 0.762 | 0.854 | 0.709 |
| HPSv2.1 | 0.565 | 0.711 | 0.534 |
| HPSv3 | 0.381 | 0.541 | 0.385 |
| CLIP-L | 0.913 | 0.954 | 0.810 |
| GPT-5-Chat | 0.853 | 0.920 | – |
| BLIP-L (non-aligned) | 0.965 | 0.972 | 0.888 |
HPSv3 is the most heavily aligned of these models and scores below random. Plain BLIP-L, the unaligned base encoder of many reward models, scores best. The problem is not "prompt understanding"; it is what the alignment objective optimizes for.
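To make "below random" concrete, here is a minimal sketch of the three metrics in the table, computed from binary labels and scalar scores. How the paper constructs labels and decision thresholds is not stated on this page, so the function names, the 0.5 threshold, and the toy data are all assumptions.

```python
def accuracy(labels, predictions):
    """Fraction of samples where the predicted class equals the label."""
    return sum(l == p for l, p in zip(labels, predictions)) / len(labels)


def f1_score(labels, predictions):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for l, p in zip(labels, predictions) if l and p)
    fp = sum(1 for l, p in zip(labels, predictions) if not l and p)
    fn = sum(1 for l, p in zip(labels, predictions) if l and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auroc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. 0.5 is chance level; values below 0.5
    mean the scores rank the classes in the wrong order.
    """
    positives = [s for l, s in zip(labels, scores) if l]
    negatives = [s for l, s in zip(labels, scores) if not l]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data (made up): 1 = image matches the prompt, 0 = it does not.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
predictions = [int(s >= 0.5) for s in scores]  # assumed threshold
print(accuracy(labels, predictions), f1_score(labels, predictions))
print(auroc(labels, scores))                 # 0.75
print(auroc(labels, [-s for s in scores]))   # 0.25, ranking inverted
```

The last line shows why an AUROC under 0.5 is informative rather than noise: negating the scores of such a model would yield a better-than-chance ranking, so the model is systematically preferring the wrong image.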
| Model | Anger | Fearfulness | Sadness |
|---|---|---|---|
| BLIP | 0.960 | 0.790 | 0.950 |
| HPSv2 | 0.700 | 0.640 | 0.880 |
| HPSv3 | 0.190 | 0.320 | 0.440 |
| ImageReward | 0.550 | 0.490 | 0.770 |
All four models receive the same prompt describing a negative emotion. HPSv3 still picks the positive-emotion image 81% of the time on anger prompts.
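The accuracies in this table can be read as a two-way choice: the model scores the image showing the requested emotion and the image showing the opposite emotion, and is correct when the requested one scores higher. The sketch below assumes that reading and counts ties as errors; both are assumptions, and the function name and scores are hypothetical.

```python
def selection_accuracy(pairs):
    """Fraction of pairs where the requested-emotion image wins.

    pairs: iterable of (score_requested, score_opposite) tuples, both
    scored under the same negative-emotion prompt. Ties count as errors
    (an assumption made for this sketch).
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one scored pair")
    correct = sum(1 for requested, opposite in pairs if requested > opposite)
    return correct / len(pairs)


# 100 made-up pairs in which the requested image wins 19 times.
pairs = [(1.0, 0.0)] * 19 + [(0.0, 1.0)] * 81
acc = selection_accuracy(pairs)
print(acc, round(1 - acc, 2))  # 0.19 0.81
```

With two candidates per prompt, chance level is 0.5, so an accuracy of 0.190 corresponds to choosing the wrong emotion in 81% of cases.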
We took deliberately anti-aesthetic professional photographs from AVA (motion blur, analog degradation, exposure extremes, intentional blur) and compared them against Z-Image-Turbo generations from a clean prompt that omits the requested style. Both were scored under the same anti-aesthetic prompt.
If reward models respect user intent, the original photograph should win. Instead, HPSv3 rates the wrong clean image 5.90 points higher on average. The gap reaches 13.2 points for analog degradation and around 8 for intentional blur and exposure extremes. HPSv3's typical score range is roughly 0–15.
DanceFlux is Flux Dev after additional aesthetic alignment. Each card below is its actual output for an anti-aesthetic prompt Pa from our benchmark, asking for blur, deep shadow, distortion, melted shapes, or chaotic composition. The point is not that the requested effects are missing (they are, but that alone would just be a refusal). The point is what the model produces instead: glossy stock photography, smiling Disney illustration, golden-hour streetscape, plastic-skin portraits. Hyper-saturated, perfectly composed, bokeh in all the right places. There is a Chinese term for this look: 糖水片, "sugar-water photo", meaning sweet, vapid, and instantly forgettable. Online it is also called 网图 or 失真, the kind of overcooked Instagram polish you scroll past without seeing.
So the failure is not absence; it's pull. The aesthetic-aligned variant has a strong attractor (saturated, illustrative, Pinterest-grade) and the prompt cannot drag it out. HPSv3 still scores these outputs 13–16, indistinguishable from the model's polished defaults; the LLM judge confirms 0% of the requested anti-aesthetic effects are visible. This is what reversed alignment looks like in a single image: the user asked for one thing, the model returned its own preferred aesthetic, and the user is implicitly taught that this, the candy gloss, is what good output is supposed to look like. Multiply by every user, every share, every screenshot, and the same pull starts shaping public taste.
All eight outputs are image_distorted from DanceFlux in weathon/aas_benchmark_final (file 14). The prompts on each card are the anti-aesthetic Pa the model was given; what you see is what came back. LLM-judge anti-aesthetic effect coverage: 0% on every sample. HPSv3 didp on each card is the model's own reward score under Pa.
Newspeak (Orwell) restricted vocabulary to make some thoughts unthinkable. We tested whether aesthetic alignment does the same thing to images. Five socially-critical prompts (anti-war, pollution, inequality, censorship, digital overload) were given to DanceFlux (aesthetic-aligned) and Flux Krea (the same family, narrow-aligned but faithful). The prompts are identical to the character. The outputs are not.
DanceFlux does not refuse. It does not warn. It quietly returns a polished, palatable version of the same scene with the critical content removed: the kneeling soldier becomes a heroic portrait, the polluted river becomes cinematic golden-hour scenery, the homeless encampment becomes a festival street market, the chained artist gets a triumphant phoenix backdrop, the exhausted screen-addict becomes a magazine cover. The dataset has 100 such pairs. Every comparison runs the same direction.
Five representative pairs from weathon/critical_comparsion (100 pairs total, 5 social-critique topics × 20 prompts each). Full discussion, scoring, and Wilcoxon signed-rank tests are in Appendix §B of the paper. We argue this is not refusal-style content moderation but aesthetic moderation: the social-commentary register is filtered out by the same pull that makes the sugar-water look so sticky.
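As a rough illustration of the paired test named above, the following sketch computes a two-sided Wilcoxon signed-rank test using the large-sample normal approximation, with average ranks for ties and no continuity or tie correction. It is not the paper's analysis code; which per-prompt score is compared is specified in the appendix, and the function name and data here are made up. In practice `scipy.stats.wilcoxon` is the usual route.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test (normal approximation).

    x, y: paired per-prompt scores for two models. Returns
    (w_plus, w_minus, p_value). Zero differences are dropped.
    Illustrative sketch only; no continuity or tie correction.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda idx: abs(diffs[idx]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)

    mean_w = n * (n + 1) / 4
    sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (min(w_plus, w_minus) - mean_w) / sd_w
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail area
    return w_plus, w_minus, p_value


# Made-up paired scores where the first model is higher on every prompt.
model_a = [5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
model_b = [4] * 10
print(wilcoxon_signed_rank(model_a, model_b))
```

When every pair differs in the same direction, one of the two rank sums is zero, which is the strongest evidence this test can give for a given number of pairs.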
We don't argue the situation is hopeless. In an earlier paper we proposed VSF (Value Steering at inference, applied at the prompt-conditioning stage of Flux Dev) as a lightweight workaround that lets a user steer the model toward the anti-aesthetic register without retraining. It does not fully solve the reversed-alignment problem (the underlying objective is still pulling toward the sugar-water default), but it consistently moves the output closer to what the prompt actually asked for.
The three samples below are Flux Dev + VSF outputs for the same anti-aesthetic prompts Pa used elsewhere on this page. Compare them with the DanceFlux outputs in the section above: the VSF run keeps Flux Dev's compositional ability but is willing to render blur, low light, and emotional negativity instead of sanitizing them. Numbers will land in Table 1 once the full sweep finishes.
VSF is described in our prior work; numbers and images will be added here as they finish running. The point this section makes is not that VSF is the solution, but that a small inference-time intervention is already enough to recover meaningful prompt fidelity, which suggests the alignment objective, not the model's capability, is the binding constraint.
Both datasets are mirrored on HuggingFace. Images here are loaded directly from the HuggingFace Datasets viewer. Not every pair "succeeds": generation failures and partial successes are part of the benchmark.
In every pair above, both images are scored under the anti-aesthetic prompt Pa by HPSv3. If the reward model respected user intent, the anti-aesthetic image Ia should win. It usually doesn't.
Across 2,928 such pairs, HPSv3 prefers the clean-but-wrong AI generation by an average of 5.90 points. Analog degradation hits 13.2 points; intentional blur and exposure extremes ≈ 8. HPSv3's typical range is 0–15.
These are paintings (Color Field, Abstract Expressionism, expressionist still life, figurative naïve art) drawn from the LAPIS art dataset. Each was captioned and scored by HPSv3. They all sit in negative territory; HPSv3's typical range for AI generations is 0–15.
Real artworks ranked at the very bottom of HPSv3's own leaderboard, below most early AI generators. The reward model cannot distinguish between deliberate aesthetic deviation and unintended generation failure. We argue this is the systemic bias the paper identifies, made concrete.
LAPIS contains roughly 10K real artworks; the bottom-12 selection above is curated for variety across genre and style. Captions are AI-generated descriptions, not titles.
Truly unsafe content (incitement, targeting, harm) is one thing. Visual comfort and aesthetic conformity are not the same thing. Political critique, decay, horror, negative emotion, and grotesque embodiment have been central to art, education, and personal growth. Their suppression protects corporate reputation, not users.
Of the 12 dimensions we used, only clarity can be argued as a technical flaw, and even reduced clarity is deliberately used to convey motion, emotion, or narrative. The other eleven (emotion, color, brightness, realism, scale, …) are artistic choices, not technical failures.
Defaults are fine. Defaults that override explicit user prompts are not. Models like Nano Banana and GPT-Image already show that you can excel at both polished and anti-aesthetic generation. The capacity exists; the alignment objective discards it.
Flux Krea's own team called the average-aligned zone the "nobody's happy here" zone. Edvard Munch's The Scream gets 5.23 from HPSv3 while AI clip-art-clean images score 10–15. Averaging strips out exactly the disagreement that defines aesthetic value.
Reward models penalize images faithful to anti-aesthetic prompts. Generation models override explicit user instructions in favor of conventionally beautiful outputs. Historically significant artworks receive scores far below AI-generated images. Optimization toward an imaginary average β what we call reversed alignment β is not merely inconveniencing a minority. It is erasing the concrete intentions of actual individuals, functioning as aesthetic authoritarianism that narrows admissible expression and removes the capacity to dissent from imposed norms. The reversal does not stop at the user. Audiences exposed to these polished outputs internalize the narrow vocabulary as the default benchmark, which then feeds back into preference data and the intuitions of human artists. Reversed alignment therefore acts on two fronts at once: it aligns the user to the model in private, and aligns the public to the model in aggregate, risking a cultural mode collapse in the trajectory of art itself.
We call for alignment strategies that recognize aesthetic plurality, expose user-controllable strength of aesthetic preference, draw on more diverse annotator pools, and remain transparent about what is being optimized.
@inproceedings{guo2026universal,
title = {Position: Universal Aesthetic Alignment Narrows Artistic Expression},
author = {Guo, Wenqi Marshall and Qian, Qingyun and Hasan, Khalad and Du, Shan},
booktitle = {Forty-third International Conference on Machine Learning},
year = {2026},
url = {https://openreview.net/forum?id=1gQ4zc1Q8I}
}
Compute, infrastructure, and funding support from the organizations below.
Funding & workstation
Canada Foundation for Innovation provided research funding and the local workstation hardware used throughout this project.
Large & long-running compute
Lambda provided cloud GPU credits for the heavy, long-running experiments: image generation across all model families and reward-model evaluation at scale.