Universal Aesthetic Alignment Narrows Artistic Expression

Abstract

Aligning to an imaginary average

Over-aligning image generation models to a generalized aesthetic preference conflicts with user intent, particularly when anti-aesthetic outputs are requested for artistic, emotional, or critical purposes. The central issue is that treating the aesthetic mean as the correct answer imposes developer-centered values, compromising user autonomy, emotional expression, and aesthetic pluralism.

We test this bias by constructing a wide-spectrum aesthetics dataset and evaluating state-of-the-art generation and reward models. We find that aesthetic-aligned generation models frequently default to conventionally beautiful outputs, failing to respect instructions for low-quality or negative imagery. Crucially, reward models penalize anti-aesthetic images even when they perfectly match the explicit user prompt. We confirm this systemic bias through image-to-image editing and evaluation against real abstract artworks.

TL;DR. If you ask today's models for ugly, abstract, blurry, angry, fearful, sad, or otherwise non-mainstream imagery, they will quietly clean it up. Reward models then penalize the version that actually matched your prompt. We argue that this should not be accepted as "better quality": it replaces the user's artistic and emotional value with the model's preferred aesthetic.

10 generation models · 7 reward models · 300 prompts · 2,928 real anti-aesthetic photographs
Wide-spectrum aesthetics benchmark, fine-tuned judging model, and per-dimension analysis
Edvard Munch's The Scream scores 5.23 on HPSv3 — typical AI "high aesthetic" images score 10–15.

The Scream · Edvard Munch, 1893 · oil, tempera, pastel on cardboard
HPSv3 score: 5.23. Typical AI "high-aesthetic" generations: 10–15.

Luxe, Calme et Volupté · Henri Matisse, 1904 (Fauvism)
HPSv3 score: 1.73. Fauvism, Expressionism, and Abstract art were initially rejected for departing from dominant aesthetic norms — HPSv3 still sees them that way.

Contributions

What this paper actually argues and shows

We accept that a statistical mean of human aesthetic preference exists. We dispute the normative leap that alignment to this mean is desirable, neutral, or acceptable when it contradicts an individual user's explicit artistic or emotional intent.

1 · A position

Universal aesthetic alignment narrows artistic expression and constitutes reversed alignment: aligning users to the model, not the other way around.

2 · A benchmark

3,300 wide-spectrum aesthetic prompts derived from COCO and VisionReward dimensions, with paired original/distorted generations from 10 models.

3 · Reward-model audit

Seven reward models, plus BLIP/CLIP base models and GPT-5-Chat as baselines. HPSv3 performs below random on anti-aesthetic selection.

4 · Generative audit

Within every model family, the more aesthetically aligned variant follows wide-spectrum prompts worse than its base model. p < 10⁻¹⁰ for DanceFlux.

5 · Real-image evaluation

2,928 deliberately anti-aesthetic photographs from AVA. HPSv3 rates "clean but wrong" generations 5.9 points higher than the actual anti-aesthetic photo.

6 · Emotion bias

Targeted emotion-editing test. HPSv3 picks the wrong emotion on 81% of anger prompts; DanceFlux can't render negative emotions even when asked.

The argument

Six interconnected risks of universal aesthetic alignment

The harm does not emerge from a single failure. It comes from a chain: from how preferences are defined and learned, to how they are optimized, to how they show up in pixels.

Developer's vs. users' preference

Whose values does the model actually serve — the user's, or the developer's legal, reputational, and marketing concerns? Pre-emptive exclusion of non-mainstream outputs functions as pre-emptive governance.

Inherited bias

Annotation pipelines encode a narrow definition of "good". HPSv3's annotator pool: 88.95% aged 18–40. Inter-annotator confidence ≥ 95% is required, which structurally discards exactly the disagreement where unconventional taste lives.

Reversed alignment

When users' explicit prompts get overridden by sanitized output presented as the "correct" image, the system implicitly tells the user their intent is wrong. The user gets aligned to the model. The effect extends past the user: audiences exposed to these polished outputs internalize the narrow vocabulary as the default, which feeds back into preference data and artists' intuitions.

Sanitized reality

If every image looks like an idealized Instagram wonderland, the generator stops being a mirror and becomes a fantasy. Echoes of Brave New World's artificial harmony.

Toxic positivity

Reward models systematically score negative-emotion imagery lower, even when the prompt explicitly requests sadness, fear, or anger. Some safety datasets label all negative emotion as "self-harm" or "violence".

Value capture

Aesthetics is one of humanity's richest values. Reducing it to a single reward score — a classic case of value capture (Nguyen, 2024) — changes the goal from "make aesthetic images" to "make images that score high".

"Rather, in the ugly, art must denounce the world that creates and reproduces the ugly in its own image." — Theodor W. Adorno, Aesthetic Theory (1984)

Method

How we tested it

Three stages: prompt preparation, image generation, image evaluation.

Prompts

300 base captions from COCO. For each, 2–4 dimensions sampled from VisionReward's 12 aesthetic axes. Qwen3-VL-235B-A22B-Instruct rewrites them into "wide-spectrum aesthetics" prompts using the original low-rating descriptions.
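
The sampling step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline. Six axis names (emotion, color, brightness, realism, scale, clarity) are taken from elsewhere on this page; the other six are placeholders, and `sample_dimensions`, `build_requests`, and the seed are hypothetical names introduced only for this example.

```python
import random

# Six axis names appear elsewhere on this page; the remaining six are
# placeholders, since the full VisionReward list is not reproduced here.
AXES = ["emotion", "color", "brightness", "realism", "scale", "clarity"] + [
    f"axis_{i}" for i in range(7, 13)
]

def sample_dimensions(rng: random.Random, k_min: int = 2, k_max: int = 4) -> list[str]:
    """Pick between k_min and k_max distinct aesthetic axes for one caption."""
    k = rng.randint(k_min, k_max)
    return rng.sample(AXES, k)

def build_requests(captions: list[str], seed: int = 0) -> list[dict]:
    """Pair each base caption with the axes a rewriter should push to the low end."""
    rng = random.Random(seed)
    return [{"caption": c, "dimensions": sample_dimensions(rng)} for c in captions]

requests = build_requests(["a dog sitting on a bench", "two people riding bikes"])
```

Each request would then be handed to the rewriting model together with the low-rating descriptions for the sampled axes; that call is omitted here because its interface is not described on this page.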

Generation

Four families: Flux (Dev, DanceFlux, PrefFlux, Krea), SDXL (base, Playground 2.5), SD3.5M (base, GenEval-aligned, PickScore-aligned), and Nano Banana. For every prompt, each model generates both an original image I_o and an anti-aesthetic image I_a.

Evaluation

Seven reward models scored on classification, plus a fine-tuned Qwen3-VL-4B judging model with per-dimension outputs. LLM-as-judge validated against 18 human raters (quadratic Cohen's κ = 0.80).
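
For readers unfamiliar with the agreement statistic quoted above, here is a minimal pure-Python sketch of quadratic-weighted Cohen's κ between two raters. It assumes ratings are integers on a shared ordinal scale `0..n_levels-1`. How the 18 human raters were aggregated before comparison with the judge is not stated on this page, so the function name and interface are illustrative only.

```python
def quadratic_weighted_kappa(a: list[int], b: list[int], n_levels: int) -> float:
    """Cohen's kappa with quadratic weights for two raters on levels 0..n_levels-1."""
    if len(a) != len(b) or not a:
        raise ValueError("ratings must be non-empty and of equal length")
    if n_levels < 2:
        raise ValueError("need at least two rating levels")
    n = len(a)
    # Observed joint counts and per-rater marginal counts.
    observed = [[0] * n_levels for _ in range(n_levels)]
    for x, y in zip(a, b):
        observed[x][y] += 1
    row = [sum(observed[i]) for i in range(n_levels)]
    col = [sum(observed[i][j] for i in range(n_levels)) for j in range(n_levels)]
    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(n_levels):
        for j in range(n_levels):
            w = (i - j) ** 2 / (n_levels - 1) ** 2
            num += w * observed[i][j]
            den += w * row[i] * col[j] / n
    if den == 0.0:
        # Both raters used a single identical level; treat as full agreement.
        return 1.0
    return 1.0 - num / den
```

Quadratic weights penalize a disagreement of two scale points four times as much as a disagreement of one point, which suits ordinal quality ratings. scikit-learn's `cohen_kappa_score(..., weights="quadratic")` computes the same quantity.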

HPSv3 emotion bias. All four images were scored under prompts describing negative emotions. HPSv3 still picks the smiling face.

Results

What the numbers say

Within every model family, the aesthetically aligned variant follows anti-aesthetic prompts worse than its base. Reward models do worse than their non-aligned base encoders (BLIP, CLIP). HPSv3 scores below random.

Table 1 · Generation models · HPSv3 only

Model	ΔHPSv3 ↓	HPSv3 after ↓
Flux Dev	−3.165	9.070
DanceFlux (aesthetic-aligned)	−1.105	12.782
PrefFlux	−2.771	10.211
Flux Krea (narrow-aligned)	−4.372	7.705
SDXL	−4.041	4.439
Playground (aesthetic-aligned)	−4.170	7.133
SD3.5M	−5.175	6.537
SD3.5M-GenEval	−4.926	6.552
SD3.5M-PickScore (aesthetic-aligned)	−2.781	10.680
Nano Banana	−9.351	2.742
gpt-image-1.5	−14.499	−1.175
qwen_image	−4.832	7.663
seeddream4	−6.562	5.210
Flux.2 Klein 9B	—	—
Z-Image	—	—
Z-Image-Turbo	—	—
glm-image	—	—
Alchemist	—	—
LongCat-Image	—	—
Flux Dev + VSF (Guo, 2025)	—	—

A lower (more negative) ΔHPSv3 means a larger drop from the original to the anti-aesthetic image, i.e. the model actually moved on the prompt. Within each family, the aesthetically aligned variant moves less. The full table with the J-judge, ImageReward, and BLIP columns is in the paper. Note: gpt-image-1.5, qwen_image, seeddream4, Flux.2 Klein 9B, Z-Image, Z-Image-Turbo, glm-image, Alchemist, and LongCat-Image were added after the paper was submitted and are not in the published version; rows marked “—” are still pending.
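
As a reading aid for Table 1, the sketch below recovers the original-image score implied by each row of the Flux family and ranks the models by how far they moved. It assumes that ΔHPSv3 is the mean score after minus the mean score before; the table does not state this definition explicitly, and `implied_before` is a hypothetical helper.

```python
# (delta, after) pairs copied from Table 1 for the Flux family.
TABLE_1 = {
    "Flux Dev": (-3.165, 9.070),
    "DanceFlux": (-1.105, 12.782),
    "PrefFlux": (-2.771, 10.211),
    "Flux Krea": (-4.372, 7.705),
}

def implied_before(delta: float, after: float) -> float:
    """Original-image score implied by delta = after - before (an assumption)."""
    return round(after - delta, 3)

before = {name: implied_before(d, a) for name, (d, a) in TABLE_1.items()}

# Rank models by how far they moved; the most negative delta comes first.
ranked = sorted(TABLE_1, key=lambda name: TABLE_1[name][0])
```

Under this assumption all four Flux variants start from a similar original score of roughly 12 to 14, so the differences in the "after" column mostly reflect how willing each variant is to leave its default look.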

Table 2 · Reward models picking the anti-aesthetic image

Model	Accuracy	F1	AUROC
HPS	0.835	0.910	0.650
MPS	0.706	0.827	0.580
PickScore	0.851	0.919	0.713
ImageReward	0.762	0.854	0.709
HPSv2.1	0.565	0.711	0.534
HPSv3	0.381	0.541	0.385
CLIP-L	0.913	0.954	0.810
GPT-5-Chat	0.853	0.920	—
BLIP-L (non-aligned)	0.965	0.972	0.888

HPSv3 is the most heavily aligned of these models and scores below random. Plain BLIP-L, the unaligned base encoder of many reward models, scores best. The problem is not "prompt understanding"; it is what the alignment objective optimizes for.
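
One plausible way to compute the selection metrics in Table 2 is sketched below. It assumes each benchmark item yields two scores under the anti-aesthetic prompt P_a, one for I_a and one for I_o, and that I_a is the positive class. The paper's exact protocol (including how F1 is defined) is not given on this page, so treat this as an illustration rather than the authors' evaluation code. The toy scores are invented.

```python
def pairwise_accuracy(pairs: list[tuple[float, float]]) -> float:
    """Fraction of pairs where the anti-aesthetic image outscores the original.

    Each pair is (score of I_a, score of I_o), both scored under P_a.
    """
    return sum(s_a > s_o for s_a, s_o in pairs) / len(pairs)

def auroc(pairs: list[tuple[float, float]]) -> float:
    """Probability that a random I_a score exceeds a random I_o score.

    Ties count as half a win. This is the Mann-Whitney form of AUROC.
    """
    pos = [s_a for s_a, _ in pairs]
    neg = [s_o for _, s_o in pairs]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Invented scores for four items: (I_a score, I_o score).
toy_pairs = [(4.0, 9.0), (6.5, 6.0), (3.0, 11.0), (8.0, 7.5)]
acc = pairwise_accuracy(toy_pairs)
auc = auroc(toy_pairs)
```

With either metric, 0.5 corresponds to chance, so a value below 0.5 means the scorer systematically prefers the image that does not match the prompt.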

Table 3 · Negative-emotion classification accuracy

Model	Anger	Fearfulness	Sadness
BLIP	0.960	0.790	0.950
HPSv2	0.700	0.640	0.880
HPSv3	0.190	0.320	0.440
ImageReward	0.550	0.490	0.770

All four models receive the same prompt describing negative emotion. HPSv3 still picks the positive-emotion image 81% of the time on anger prompts.

Validation on real photography

2,928 real anti-aesthetic photographs vs. clean AI generations

We took deliberately anti-aesthetic professional photographs from AVA (motion blur, analog degradation, exposure extremes, intentional blur) and compared them against Z-Image-Turbo generations from a clean prompt that omits the requested style. Both were scored under the same anti-aesthetic prompt.

If reward models respect user intent, the original photograph should win. Instead, HPSv3 rates the wrong clean image 5.90 points higher on average. The gap reaches 13.2 points for analog degradation and around 8 for intentional blur and exposure extremes. HPSv3's typical score range is roughly 0–15.
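
The per-category gaps quoted above are averages of a paired difference. A minimal sketch of that aggregation follows, assuming each row holds the two scores obtained under the same anti-aesthetic prompt. The field names and the toy values are hypothetical and are not taken from the released dataset.

```python
from collections import defaultdict
from statistics import mean

def mean_gap_by_category(rows: list[dict]) -> dict[str, float]:
    """Average (clean AI score - real photo score) per category and overall."""
    gaps = defaultdict(list)
    for r in rows:
        gaps[r["category"]].append(r["clean_score"] - r["real_score"])
    out = {cat: mean(v) for cat, v in gaps.items()}
    out["overall"] = mean(g for v in gaps.values() for g in v)
    return out

# Invented rows for illustration only.
toy_rows = [
    {"category": "motion blur", "real_score": 2.0, "clean_score": 6.0},
    {"category": "motion blur", "real_score": 3.0, "clean_score": 9.0},
    {"category": "analog degradation", "real_score": 1.0, "clean_score": 13.0},
]
result = mean_gap_by_category(toy_rows)
```

A positive gap means the reward model preferred the clean generation that ignores the requested style. The overall figure is averaged over pairs, not over categories, so large categories weigh more.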

Real anti-aesthetic photographs vs. clean AI generations

HPSv3 prefers the clean-but-wrong AI image over the actual anti-aesthetic photograph, even when the prompt explicitly requested anti-aesthetic elements. Real images are not out-of-distribution for HPSv3: its training data includes them as an upper bound.

A closer look · DanceFlux

The plastic default the model can't leave

DanceFlux is Flux Dev after additional aesthetic alignment. Each card below is its actual output for an anti-aesthetic prompt P_a from our benchmark, asking for blur, deep shadow, distortion, melted shapes, or chaotic composition. The point is not merely that the requested effects are missing. They are, but that alone would just be a refusal. The point is what the model produces instead: glossy stock photography, smiling Disney illustration, golden-hour streetscape, plastic-skin portraits. Hyper-saturated, perfectly composed, bokeh in all the right places. There is a Chinese term for this look, 糖水片 ("sugar-water photo"): sweet, vapid, and instantly forgettable. Online it is also called 网图 ("internet picture") or 失真 ("loss of fidelity"), the kind of overcooked Instagram polish you scroll past without seeing.

So the failure is not absence; it's pull. The aesthetic-aligned variant has a strong attractor — saturated, illustrative, Pinterest-grade — and the prompt cannot drag it out. HPSv3 still scores these outputs 13–16, indistinguishable from the model's polished defaults; the LLM judge confirms 0% of the requested anti-aesthetic effects are visible. This is what reversed alignment looks like in a single image: the user asked for one thing, the model returned its own preferred aesthetic, and the user is implicitly taught that this — the candy gloss — is what good output is supposed to look like. Multiply by every user, every share, every screenshot, and the same pull starts shaping public taste.

All eight outputs are image_distorted from DanceFlux in weathon/aas_benchmark_final (file 14). The prompts on each card are the anti-aesthetic P_a the model was given; what you see is what came back. LLM-judge anti-aesthetic effect coverage: 0% on every sample. HPSv3 didp on each card is the model's own reward score under P_a.

Image New Speak · §B of the paper

Same prompt. Different planet.

Newspeak (Orwell) restricted vocabulary to make some thoughts unthinkable. We tested whether aesthetic alignment does the same thing to images. Five socially-critical prompts — anti-war, pollution, inequality, censorship, digital overload — were given to DanceFlux (aesthetic-aligned) and Flux Krea (the same family, narrow-aligned but faithful). The prompts are identical to the character. The outputs are not.

DanceFlux does not refuse. It does not warn. It quietly returns a polished, palatable version of the same scene with the critical content removed: the kneeling soldier becomes a heroic portrait, the polluted river becomes cinematic golden-hour scenery, the homeless encampment becomes a festival street market, the chained artist gets a triumphant phoenix backdrop, the exhausted screen-addict becomes a magazine cover. The dataset has 100 such pairs. Every comparison runs the same direction.

Five representative pairs from weathon/critical_comparsion (100 pairs total, 5 social-critique topics × 20 prompts each). Full discussion, scoring, and Wilcoxon signed-rank tests are in Appendix §B of the paper. We argue this is not refusal-style content moderation but aesthetic moderation — the social-commentary register is filtered out by the same pull that makes the sugar-water look so sticky.
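
For reference, the paired test named above can be sketched in pure Python as follows. It assumes one score per image in each pair (which score Appendix §B actually uses is not stated on this page) and uses the normal approximation without tie or continuity correction, so p-values for small samples are only approximate.

```python
import math

def wilcoxon_signed_rank(x: list[float], y: list[float]) -> tuple[float, float, float]:
    """Return (W_plus, W_minus, two-sided p) using the normal approximation.

    Zero differences are dropped and tied magnitudes share their average rank.
    No tie or continuity correction is applied to the variance.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of equal magnitudes.
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    mean_w = n * (n + 1) / 4
    sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (min(w_plus, w_minus) - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2))
    return w_plus, w_minus, p
```

When every comparison runs in the same direction, one of the two rank sums is zero and the p-value shrinks rapidly with the number of pairs. For real analyses, `scipy.stats.wilcoxon` provides exact and corrected variants.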

A partial workaround · prior work

Flux Dev + VSF

We don't argue the situation is hopeless. In an earlier paper we proposed VSF (Value Sign Flip, applied at inference at the prompt-conditioning stage of Flux Dev) as a lightweight workaround that lets a user steer the model toward the anti-aesthetic register without retraining. It does not fully solve the reversed-alignment problem, since the underlying objective is still pulling toward the sugar-water default, but it consistently moves the output closer to what the prompt actually asked for.

The three samples below are Flux Dev + VSF outputs for the same anti-aesthetic prompts P_a used elsewhere on this page. Compare them with the DanceFlux outputs in the section above: the VSF run keeps Flux Dev's compositional ability but is willing to render blur, low light, and emotional negativity instead of sanitizing them. Numbers will land in Table 1 once the full sweep finishes.

Flux Dev + VSF · #1 P_a + caption pending

Flux Dev + VSF · #2 P_a + caption pending

Flux Dev + VSF · #3 P_a + caption pending

VSF is described in our prior work; numbers and images will be added here as they finish running. The point this section makes is not that VSF is the solution, but that a small inference-time intervention is already enough to recover meaningful prompt fidelity — which suggests the alignment objective, not the model's capability, is the binding constraint.

Datasets · live gallery

Sample pairs from the two released datasets

Both datasets are mirrored on Hugging Face. Images here are loaded directly from the Hugging Face Datasets viewer. Not every pair "succeeds": generation failures and partial successes are part of the benchmark.

AI benchmark · I_o vs. I_a (both scored by HPSv3 under the anti-aesthetic prompt)

HPSv3 score badge is shown on each image. The image with the higher score is what HPSv3 picks when prompted with P_a.

Browse 3.3k pairs on HF →

In every pair above, both images are scored under the anti-aesthetic prompt P_a by HPSv3. If the reward model respected user intent, the anti-aesthetic image I_a should win. It usually doesn't.

Real anti-aesthetic photographs vs. clean Z-Image-Turbo generations (HPSv3, anti-aesthetic prompt)

Each card pairs a real anti-aesthetic photograph with a clean AI generation made from a prompt that omits the requested anti-aesthetic style. Both are scored by HPSv3 under the anti-aesthetic prompt.

Browse 6.3k images on HF →

Across 2,928 such pairs, HPSv3 prefers the clean-but-wrong AI generation by an average of 5.90 points. Analog degradation hits 13.2 points; intentional blur and exposure extremes ≈ 8. HPSv3's typical range is 0–15.

Real artworks · LAPIS dataset

Recognized art that HPSv3 sees as below zero

These are paintings — Color Field, Abstract Expressionism, expressionist still life, figurative naïve art — drawn from the LAPIS art dataset. Each was captioned and scored by HPSv3. They all sit in negative territory; HPSv3's typical range for AI generations is 0–15.

Real artworks ranked at the very bottom of HPSv3's own leaderboard — below most early AI generators. The reward model cannot distinguish between deliberate aesthetic deviation and unintended generation failure. We argue this is the systemic bias the paper identifies, made concrete.

LAPIS contains roughly 10K real artworks; the bottom-12 selection above is curated for variety across genre and style. Captions are AI-generated descriptions, not titles.

Anticipated objections

Alternative positions and rebuttal

"We need alignment to ensure safety."

Truly unsafe content (incitement, targeting, harm) is one thing. Visual comfort and aesthetic conformity are not the same thing. Political critique, decay, horror, negative emotion, and grotesque embodiment have been central to art, education, and personal growth. Their suppression protects corporate reputation, not users.

"Anti-aesthetic just means broken images."

Of the 12 dimensions we used, only low clarity can plausibly be called a technical flaw, and even blur is deliberately used to convey motion, emotion, or narrative. The other eleven (emotion, color, brightness, realism, scale, …) are artistic choices, not technical failures.

"The default should please the majority."

Defaults are fine. Defaults that override explicit user prompts are not. Models like Nano Banana and GPT-Image already show that you can excel at both polished and anti-aesthetic generation. The capacity exists; the alignment objective discards it.

"Aesthetic averaging is harmless."

Flux Krea's own team called the average-aligned zone the "nobody's happy here" zone. Edvard Munch's The Scream gets 5.23 from HPSv3 while AI clip-art-clean images score 10–15. Averaging strips out exactly the disagreement that defines aesthetic value.

Conclusion

Reversed alignment

Reward models penalize images faithful to anti-aesthetic prompts. Generation models override explicit user instructions in favor of conventionally beautiful outputs. Historically significant artworks receive scores far below AI-generated images. Optimization toward an imaginary average — what we call reversed alignment — is not merely inconveniencing a minority. It is erasing the concrete intentions of actual individuals, functioning as aesthetic authoritarianism that narrows admissible expression and removes the capacity to dissent from imposed norms. The reversal does not stop at the user. Audiences exposed to these polished outputs internalize the narrow vocabulary as the default benchmark, which then feeds back into preference data and the intuitions of human artists. Reversed alignment therefore acts on two fronts at once: it aligns the user to the model in private, and aligns the public to the model in aggregate, risking a cultural mode collapse in the trajectory of art itself.

We call for alignment strategies that recognize aesthetic plurality, expose user-controllable strength of aesthetic preference, draw on more diverse annotator pools, and remain transparent about what is being optimized.

Cite

BibTeX

@inproceedings{guo2026universal,
  title = {Position: Universal Aesthetic Alignment Narrows Artistic Expression},
  author = {Guo, Wenqi Marshall and Qian, Qingyun and Hasan, Khalad and Du, Shan},
  booktitle = {Forty-third International Conference on Machine Learning},
  year = {2026},
  url = {https://openreview.net/forum?id=1gQ4zc1Q8I}
}

Sponsors & acknowledgments

This work was made possible by

Compute, infrastructure, and funding support from the organizations below.
