diffusers/scripts/convert_ace_step_to_diffusers.py
Gong Junmin 1a8a17b71b Add ACE-Step pipeline for text-to-music generation (#13095)
* Add ACE-Step pipeline for text-to-music generation

Rebased on origin/main from the original pr-13095 branch (3 commits squashed).

- AceStepDiTModel: Diffusion Transformer with RoPE, GQA, sliding window,
  AdaLN timestep conditioning, and cross-attention.
- AceStepConditionEncoder: fuses text / lyric / timbre into a single
  cross-attention sequence.
- AceStepPipeline: text2music / cover / repaint / extract / lego / complete.
- Conversion script for the original checkpoint layout.
- Docs + tests.

* Fix ACE-Step pipeline audio quality and auto-detect turbo/base/sft variants

The PR's original inference produced low-quality audio on turbo because the
pipeline (a) mangled the SFT prompt format, (b) applied classifier-free guidance
with the wrong unconditional embedding (empty-string encoded vs. the learned
`null_condition_emb`), and (c) hardcoded turbo defaults even when loading a
base/SFT checkpoint.

Changes:

* Converter preserves `null_condition_emb` (stored under the condition encoder)
  and propagates `is_turbo`/`model_version` into the transformer config so the
  pipeline can route per-variant defaults.
* `AceStepConditionEncoder` registers `null_condition_emb` as a learned
  parameter matching the original module.
* Pipeline auto-detects variant via `is_turbo`/`model_version` and picks
  defaults that match `acestep/inference.py`:
    * turbo:  steps=8,  shift=3.0, guidance_scale=1.0 (no CFG)
    * base/SFT: steps=27, shift=1.0, guidance_scale=7.0
* Base/SFT timestep schedule uses the linear+shift transform from
  `acestep/models/base/modeling_acestep_v15_base.py`; turbo still uses the
  hardcoded 8-step `SHIFT_TIMESTEPS` table.
* CFG reuses the learned `null_condition_emb` and batches the
  conditional+unconditional forwards into a single transformer call.
* `SFT_GEN_PROMPT` matches the newline layout in `acestep/constants.py` so the
  text encoder sees the same prompt distribution it was trained on.

DiT parity vs. the original ACE-Step 1.5 turbo DiT is bit-identical
(max_abs=0.0 in fp32 eager/SDPA across 4 seed/shape cases) — see
scripts/dit_parity_test.py.

* Add ACE-Step parity test scripts

Two developer-facing parity harnesses live under scripts/:

* dit_parity_test.py — loads the same converted turbo weights into the
  original AceStepDiTModel and the diffusers AceStepDiTModel, drives
  identical (hidden_states, timestep, timestep_r, encoder_hidden_states,
  context_latents) inputs, and asserts max-abs-diff ≤ 1e-5 in fp32
  eager/SDPA. Currently passes bit-identical (max_abs=0) across four
  shape/seed cases including batched + odd-length paths.

* audio_parity_jieyue.py — full end-to-end audio parity. Given the same
  JSON example, runs both the original ACE-Step 1.5 pipeline and the
  diffusers AceStepPipeline at matched seed/precision (bf16 + FA2 by
  default) and saves side-by-side .wav files for listening verification.
  Supports text2music / cover / repaint × turbo / base / sft via a
  --matrix mode that writes 18 wavs named
  {variant}_{task}_{official,diffusers}.wav.

* Route SFT parity to acestep-v15-sft checkpoint

On jieyue the release tree has a dedicated SFT checkpoint at
checkpoints/acestep-v15-sft with its own modeling_acestep_v15_base.py
shipped under acestep/models/sft/. Point the SFT row of the parity matrix
at that checkpoint / module so we're testing the actual SFT weights, not
the plain base ones.

* audio_parity_jieyue: fix doubled 'acestep-' in cache path; --converted-root flag

Previously the converted-pipeline cache dir was
`/tmp/acestep-<variant>-diffusers` but <variant> already starts with
"acestep-", giving `/tmp/acestep-acestep-v15-turbo-diffusers`. Drop the
prefix.

On jieyue the overlay rootfs (including /tmp) only has a few GB free; a
full turbo conversion needs ~5 GB per variant. Add --converted-root (env
ACESTEP_CONVERTED_ROOT) so the cache can live on vepfs.

* audio_parity_jieyue: two-phase matrix bootstraps cover/repaint from text2music

The ACE-Step release bundle on jieyue doesn't ship sample .wav/.mp3
files, so matrix mode had no default --src-audio and would skip
cover/repaint entirely. Run text2music first for every variant, then
reuse the TURBO official text2music output as the shared source for the
cover/repaint rows. Users can still override with --src-audio.

* audio_parity_jieyue: seed the diffusers generator on the pipeline device

The ORIGINAL ACE-Step pipeline seeds on the execution device
(`torch.Generator(device=device).manual_seed(seed)`), i.e. the CUDA RNG
stream when running on GPU. Previously the parity harness seeded the
diffusers side with a CPU generator, so even though the seed integer
matched, the two sides drew different noise from the outset and the
final outputs were essentially uncorrelated. Use the execution-device
generator on both sides for a fair comparison.

* Fix ACE-Step pipeline: switch to APG guidance + peak normalization

Two issues found after the first jieyue audio parity run:

1. The original base/SFT pipeline uses APG (Adaptive Projected Guidance,
   acestep/models/common/apg_guidance.py) with a stateful momentum
   buffer and norm/projection steps — NOT vanilla CFG. Using vanilla CFG
   produced uncorrelated outputs vs. the reference (pearson ~0.0 on
   20 s samples); this PR ports `_apg_forward` + `_APGMomentumBuffer`
   and plugs them into the denoising loop when `guidance_scale > 1`.
   Momentum is instantiated once per pipeline call (persists across
   denoising steps) to match the reference semantics.

2. The post-VAE "anti-clipping normalization" in this pipeline was
   `audio /= std * 5` with a `std<1 -> std=1` guard. The original
   post-processing in
   acestep/core/generation/handler/generate_music_decode.py is simple
   peak normalization: `if audio.abs().max() > 1: audio /= peak`. The
   std-based proxy both (a) let clips with peak < 1 leak through
   unchanged (over-quiet) and (b) failed to bring clipping peaks to
   exactly 1 in a bunch of base/SFT cases (observed max=1.000, std=0.200
   repeatedly in the first parity run). Switch to peak normalization on
   both sides.

Tested via scripts/audio_parity_jieyue.py on A800; re-run pending to
confirm the base/SFT correlation improvements.

* Fix ACE-Step chunk mask values to match the original pipeline

The DiT receives `context_latents = concat(src_latents, chunk_mask)` on the
channel dim, and was trained with chunk_mask values drawn from the three
sentinels documented in acestep/inference.py:

  2.0 -> model-decided (default for text2music / cover / full-generation)
  1.0 -> keep this latent frame from src_latents (repaint preserved region)
  0.0 -> explicitly repaint this frame (only inside the repaint window)

Previously _build_chunk_mask returned all-1.0 for text2music (and cover /
lego), and an inverted 0/1 mask for repaint (1 inside the window, 0 outside).
Either case puts context_latents out of distribution. Switch text2music /
cover to the 2.0 sentinel and flip the repaint mask so it's 1.0 outside /
0.0 inside. Update the repaint src_latents zero-out to multiply by the new
mask (was `1 - chunk_mask`) so the zero region still lines up with the
repaint window.

* Add direct invoker for ACE-Step generate_music (ground truth)

Our earlier audio_parity_jieyue.py reconstructs the original pipeline by
calling AceStepConditionGenerationModel.generate_audio() directly, which
silently skips a lot of the real handler plumbing (conditioning masks,
silence-latent tiling, cover/repaint pre-processing, etc.). That made the
'official' wavs we saved sound wrong — flat, drone-like, not music.

This new script calls acestep.inference.generate_music end-to-end through
the real AceStepHandler, with LM + CoT explicitly disabled so we still have
a deterministic comparison. Use it to generate the ground-truth 'official'
wav for a given JSON example, then separately run the diffusers pipeline
with the same inputs and diff the two.

* run_official_generate_music: call initialize_service to bind a DiT variant

AceStepHandler() is a shell — you have to call handler.initialize_service(
project_root=..., config_path=..., device=..., use_flash_attention=..., ...)
before generate_music will work. Mirror what cli.py does at the equivalent
spot (around cli.py:1400).

* Fix silence-reference for ACE-Step timbre encoder

The root cause for the flat / drone-like outputs I was seeing (including
in my 'official' reconstruction): when no reference_audio is provided the
pipeline was feeding literal zeros to the timbre encoder. The real
handler feeds a slice of the learned `silence_latent` tensor.

The handler also transposes silence_latent on load (see
acestep/core/generation/handler/init_service_loader.py:214:
  self.silence_latent = torch.load(...).transpose(1, 2)
) converting [1, 64, 15000] -> [1, 15000, 64] so that
`silence_latent[:, :750, :]` yields the expected [1, 750, 64] shape.

Changes:

* Converter: load silence_latent.pt, transpose to [1, T, C], bake into
  the condition_encoder safetensors under key `silence_latent`.
  (Also keeps the raw .pt file at the pipeline root for debugging.)
* AceStepConditionEncoder: register `silence_latent` as a persistent
  buffer so from_pretrained loads it alongside the trained weights.
* Pipeline: when reference_audio is None, slice
  `condition_encoder.silence_latent[:, :timbre_fix_frame, :]` and
  broadcast across the batch instead of zeros. Emits a loud warning
  (and falls back to zeros) if the buffer is all-zero — that means the
  checkpoint was produced by an older converter and should be rebuilt.
* audio_parity_jieyue.py: the reference path now matches the handler's
  silence-latent slicing.

Without this fix, every variant/task combo produced drone-like audio
even when my numeric DiT-forward parity claimed they were identical.

* Fix three more ACE-Step pipeline bugs I found by dumping real inputs

Instrumented the live generate_audio call in the real ACE-Step handler and
observed the exact tensors it sees — my diffusers pipeline was wrong in
three independent ways:

1. src_latents for text2music should be silence_latent tiled to
   latent_length, NOT zeros. The handler fills no-target cases from
   silence_latent_tiled (observed std=0.96). Zeros are OOD for the DiT
   context_latents concat and produce drone-like outputs.

2. chunk_mask values cap at 1.0 (not 2.0). The handler starts with a
   bool tensor (True inside the generate span, False outside); the
   chunk_mask_modes=auto -> 2.0 override does NOT take effect because
   the underlying tensor is bool, so setting entry = 2.0 casts to True.
   After the later .to(dtype) float cast, the DiT sees 1.0/0.0 — exactly
   what I observed in the captured tensor (unique values = [True]).

3. Default shift is 1.0 for ALL variants, including turbo. I was
   defaulting turbo to shift=3.0 which picks a different SHIFT_TIMESTEPS
   table (the 8-step schedule is keyed by shift, not variant).

Also:
* Added _silence_latent_tiled() helper that slices / tiles the learned
  silence_latent (now loaded as a buffer on the condition encoder) to
  the requested latent length.
* Repaint path now substitutes silence_latent (not raw zeros) inside
  the repaint window — matches conditioning_masks.py.
* audio_parity_jieyue.py mirrors the same src/chunk/shift choices on
  its 'original' leg for apples-to-apples parity once the buggy
  reconstruction is removed from the picture.

* Add peak+loudness post-normalization to AceStepPipeline

The real pipeline normalizes audio in two stages (see
acestep/audio_utils.py:72 normalize_audio + generate_music_decode.py):
  1. if peak > 1: audio /= peak  (anti-clip)
  2. audio *= target_amp / peak   (target_amp = 10 ** (-1/20) ~ 0.891)

Step 2 is loudness normalization to -1 dBFS. Without it diffusers outputs
had peak=1.0 vs the real 0.891 — same music content (pearson was ~0.86
already), just 1.12x louder. Add step 2 after the existing anti-clip step.
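
A minimal sketch of the two-stage normalization described above, written against plain Python lists rather than tensors so it runs without torch. The function name `normalize_peak_and_loudness` and the `target_db` parameter are illustrative assumptions, not names taken from `acestep/audio_utils.py`.

```python
def normalize_peak_and_loudness(samples, target_db=-1.0):
    """Anti-clip, then scale so the peak sits at target_db dBFS (illustrative helper)."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silence: nothing to scale
    # Stage 1: anti-clip, applied only when the signal exceeds full scale.
    if peak > 1.0:
        samples = [s / peak for s in samples]
        peak = 1.0
    # Stage 2: loudness normalization; -1 dBFS corresponds to 10 ** (-1 / 20) ~ 0.891.
    target_amp = 10 ** (target_db / 20)
    return [s * target_amp / peak for s in samples]


clipped = normalize_peak_and_loudness([0.5, -2.0, 1.0])
quiet = normalize_peak_and_loudness([0.1, -0.2])
```

Under this reading, both the clipping input and the quiet input end up with the same peak of about 0.891, which is the behaviour the 1.12x loudness gap points to.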

* Match acestep/inference.py inference_steps=8 for ALL variants

GenerationParams.inference_steps default is 8 — turbo AND base/SFT. I had
base/SFT defaulting to 27 here, so every base/SFT parity run was comparing
a 27-step diffusers trajectory against an 8-step real trajectory. Different
number of denoising steps means different audio even at fixed seed.

This likely explains the lower base/SFT correlation in my earlier jieyue
runs (turbo was 0.86, base/SFT were 0.32-0.34). Aligning step counts
should bring base/SFT closer to turbo parity.

* Address PR #13095 review: rename classes + reuse diffusers primitives

Response to dg845's PR comments batch 1+2. DiT parity harness still bit-identical
(max_abs=0 on fp32 / SDPA across 4 shape cases).

Transformer file:
* Rename AceStepDiTModel -> AceStepTransformer1DModel (alias kept).
* Rename AceStepDiTLayer -> AceStepTransformerBlock (alias kept).
* Inherit AttentionMixin + CacheMixin on the DiT model.
* Swap in diffusers.models.normalization.RMSNorm for the hand-rolled
  AceStepRMSNorm (weight-key-compatible).
* Swap the hand-rolled rotary embedding + apply_rotary for diffusers'
  get_1d_rotary_pos_embed + apply_rotary_emb (use_real_unbind_dim=-2 to
  match the cat-half convention ACE-Step inherits from Qwen3).
* Use get_timestep_embedding with flip_sin_to_cos=True — keeps the
  (cos, sin) ordering of the original sinusoidal. State-dict-compatible.
* Drop max_position_embeddings arg from DiT config (RoPE computes freqs
  per call based on seq_len); converter drops it.
* Gradient-checkpoint call now takes just the layer module (matches the
  Flux2 idiom).

Pipeline modeling file (pipelines/ace_step/modeling_ace_step.py):
* Moved _pack_sequences + AceStepEncoderLayer here — they aren't used
  by the DiT, so they shouldn't live in the transformer file.
* AceStepLyricEncoder + AceStepTimbreEncoder set
  _supports_gradient_checkpointing = True and wrap encoder-layer calls
  through the checkpointing func when enabled.
* Use diffusers RMSNorm + the RoPE helper from the transformer file
  (shared single implementation).

Converter (scripts/convert_ace_step_to_diffusers.py):
* model_index.json now carries AceStepTransformer1DModel.
* Drop max_position_embeddings / use_sliding_window from the emitted
  configs.

No numerical regressions: scripts/dit_parity_test.py PASSES with
max_abs=0.0 on fp32/SDPA across short, long, batched, and
padding-path shape variants.

* Address PR #13095 review: pipeline polish + converter HF-hub support

Response to dg845 review comments on the pipeline side. DiT parity still
bit-identical (max_abs=0 across 4 shape cases).

Pipeline (pipelines/ace_step/pipeline_ace_step.py):
* Add `sample_rate` + `latents_per_second` properties sourced from the
  VAE config so the pipeline no longer hardcodes 48000 / 25 / 1920.
  Propagates through prepare_latents, chunk_mask window math, and the
  audio-duration round-trip.
* Add `do_classifier_free_guidance` property (matches LTX2 et al.).
* Add `check_inputs(...)` called from `__call__` before allocating noise.
  Validates prompt type, lyrics type, task_type, step count, guidance
  scale, shift, cfg interval bounds and repaint window ordering.
* Add `callback_on_step_end` + `callback_on_step_end_tensor_inputs` —
  the modern callback form. The legacy `callback` / `callback_steps`
  pair is kept for back-compat. Setting `pipe._interrupt = True` inside
  the callback stops the loop early.
* Expose `encode_audio(audio)` as a public helper that wraps the tiled
  VAE encode + (B, T, D) transpose the pipeline performs internally.

Converter (scripts/convert_ace_step_to_diffusers.py):
* Accept a Hugging Face Hub repo id for `--checkpoint_dir`; resolves it
  via `huggingface_hub.snapshot_download` when the argument isn't a
  local path.

Exports:
* Register `AceStepTransformer1DModel` in the top-level __init__,
  models/__init__, models/transformers/__init__, and dummy_pt_objects so
  `from diffusers import AceStepTransformer1DModel` works and the
  pipeline loader resolves the new class name from model_index.json.

Deferred for a follow-up (commented inline in the PR): full
`Attention + AttnProcessor + dispatch_attention_fn` refactor and
`FlowMatchEulerDiscreteScheduler` migration — both would benefit from a
dedicated parity re-run and review.

* Fix stale ACE-Step 1.0-era docs / class names in the 1.5 integration

Docs and docstrings still carried a mix of 1.0 paper title, non-existent
`ACE-Step/ACE-Step-v1-5-turbo` hub id, `shift=3.0` turbo default, and
the old `AceStepDiTModel` class name. Cleaned up to match the actual
1.5 release:

* pipelines/ace_step.md: correct citation title ("ACE-Step 1.5: Pushing
  the Boundaries of Open-Source Music Generation"), correct repo
  (`ace-step/ACE-Step-1.5`), new variants table with real HF ids
  (`Ace-Step1.5` / `acestep-v15-base` / `acestep-v15-sft`) and their
  per-variant step/CFG defaults, drop the wrong `shift=3.0` tip.
* models/ace_step_transformer.md: page renamed to
  `AceStepTransformer1DModel` with a short 1.5-specific description;
  `AceStepDiTModel` noted as a backwards-compat alias.
* pipeline_ace_step.py: import, docstring, `Args`, and `__init__`
  annotation reference `AceStepTransformer1DModel`; example model id
  now `ACE-Step/Ace-Step1.5`; `_variant_defaults` docstring and the
  `__call__` variant-fallback comment no longer claim `shift=3.0` /
  `27 steps` — real defaults are 8 steps / shift=1.0 across all
  variants, guidance=1.0 (turbo) vs 7.0 (base+sft).

* Address PR #13095 review: VAE tiling on AutoencoderOobleck + Timesteps class

Two more deferred review threads from dg845 addressed:

* Move tiled encode/decode onto AutoencoderOobleck
  (https://github.com/huggingface/diffusers/pull/13095#discussion_r2785513647).
  AutoencoderOobleck now carries `use_tiling` + `tile_sample_min_length` /
  `tile_sample_overlap` / `tile_latent_min_length` / `tile_latent_overlap`
  attributes and private `_tiled_encode` / `_tiled_decode` methods; the
  existing `encode` / `_decode` dispatch to them when tiling is enabled and
  the input exceeds the threshold. `AutoencoderMixin.enable_tiling()` is
  already inherited.

  AceStepPipeline's private `_tiled_encode` / `_tiled_decode` and the
  `use_tiled_decode` `__call__` arg are gone; `__init__` now calls
  `self.vae.enable_tiling()` so the long-audio memory behaviour is preserved
  by default. Users can opt out with `pipe.vae.disable_tiling()`.

  Note: the VAE-side tiling concatenates encoder features (h) and samples
  the posterior once, instead of the old per-tile `.sample()` calls. This
  is the standard diffusers pattern; numerically differs only in the
  structure of the noise across tile boundaries.

* Use the Timesteps nn.Module for the sinusoid
  (https://github.com/huggingface/diffusers/pull/13095#discussion_r2785420234).
  `AceStepTimestepEmbedding` wraps `Timesteps(in_channels, flip_sin_to_cos=
  True, downscale_freq_shift=0)` instead of calling `get_timestep_embedding`
  directly — reviewer asked for the Module form.

* Address PR #13095 review: refactor AceStepAttention to Attention + AttnProcessor

Splits the monolithic AceStepAttention into the diffusers standard
Attention + AttnProcessor layout:
  - AceStepAttention (torch.nn.Module, AttentionModuleMixin) holds the
    to_q/to_k/to_v/to_out projections and norm_q/norm_k RMSNorms.
  - AceStepAttnProcessor2_0 runs the attention dispatch through
    dispatch_attention_fn so users can pick flash / sage / native backends
    via model.set_attention_backend(...) or the attention_backend context
    manager.

GQA (Q has 16 heads / K,V have 8) is preserved by passing enable_gqa=True
to dispatch_attention_fn instead of repeat_interleave; fusion is disabled
(_supports_qkv_fusion = False) because Q and K,V have different output
sizes.

The converter is updated to rename the six attention sub-keys
(q_proj -> to_q, k_proj -> to_k, v_proj -> to_v, o_proj -> to_out.0,
q_norm -> norm_q, k_norm -> norm_k) on both the DiT decoder path and the
condition encoder path, since AceStepLyricEncoder / AceStepTimbreEncoder
share the same AceStepAttention class.

Addresses review comments r2785433213 and r2785450463.

* Address PR #13095 review: migrate to FlowMatchEulerDiscreteScheduler

Replace the hand-rolled flow-matching Euler loop with
`FlowMatchEulerDiscreteScheduler`. ACE-Step still computes its own shifted /
turbo sigma schedule via `_get_timestep_schedule`, but now passes it to
`scheduler.set_timesteps(sigmas=...)` and delegates the ODE step to
`scheduler.step()`. The scheduler is configured with `num_train_timesteps=1`
and `shift=1.0` so `scheduler.timesteps` stays in `[0, 1]` (the convention the
DiT was trained on) and the scheduler doesn't re-shift already-shifted sigmas.

The scheduler's appended terminal `sigma=0` reproduces the old loop's
final-step "project to x0" case exactly: `prev = x + (0 - t_curr) * v`.
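
The terminal-step claim can be checked with a scalar sketch of the Euler update. It assumes the linear flow-matching path `x_t = (1 - t) * x0 + t * noise`, whose exact velocity is `v = noise - x0`; `euler_sample` is a hypothetical helper written for illustration, not the scheduler's own code.

```python
def euler_sample(noise, velocity_fn, sigmas):
    """Integrate dx/dt = v from sigmas[0] down to a terminal sigma of 0."""
    x = noise
    schedule = list(sigmas) + [0.0]  # terminal sigma=0 appended, mirroring the scheduler
    for t_curr, t_next in zip(schedule[:-1], schedule[1:]):
        v = velocity_fn(x, t_curr)
        # On the last step t_next == 0, so this is x + (0 - t_curr) * v.
        x = x + (t_next - t_curr) * v
    return x


x0, noise = 0.25, -1.5
sigmas = [1 - i / 8 for i in range(8)]  # 1.0, 0.875, ..., 0.125
result = euler_sample(noise, lambda x, t: noise - x0, sigmas)
```

With an exact velocity field the loop lands on `x0`, which is why the appended `sigma=0` plays the role of the old "project to x0" branch.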

Parity on jieyue (seed=42, bf16 + flash-attn, turbo text2music, 8 steps):
  waveform Pearson = 0.999999
  spectral Pearson = 1.000000
  max |diff|       = 2.5e-3  (fp32 step-math vs previous bf16 step-math)

fp32 Euler-loop A/B against the hand-rolled path: max |diff| = 3.6e-7.
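
The waveform Pearson figures quoted above are presumably plain correlation coefficients between the two decoded signals. The parity scripts are not reproduced here, so the helper below is an assumed reconstruction using only the standard library.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sample sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return float("nan")  # correlation is undefined for a constant signal
    return cov / math.sqrt(var_a * var_b)


same_shape = pearson([0.1, -0.3, 0.2], [0.2, -0.6, 0.4])  # second signal is 2x louder
inverted = pearson([0.1, -0.3, 0.2], [-0.1, 0.3, -0.2])
```

Because the coefficient is scale-invariant, it cannot detect a pure gain difference between the two sides; that is why the max |diff| figure is reported alongside it.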

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Address PR #13095 review: move DiT tests + drop stale test kwargs

- Move the DiT transformer tests out of the pipeline test file into a new
  tests/models/transformers/test_models_transformer_ace_step.py that follows
  the standard BaseModelTesterConfig + ModelTesterMixin scaffold (matches
  test_models_transformer_longcat_audio_dit.py).
- Drop `max_position_embeddings` from the remaining AceStepDiTModel and
  AceStepConditionEncoder test fixtures — neither constructor accepts that
  argument anymore.
- Drop `use_sliding_window` from the same fixtures — also no longer a
  constructor argument (the actual `sliding_window` int kwarg is kept).
- Wire `FlowMatchEulerDiscreteScheduler(num_train_timesteps=1, shift=1.0)`
  into `get_dummy_components()` now that the pipeline requires it.

Resolves https://github.com/huggingface/diffusers/pull/13095#discussion_r3115653554,
r3115664850, r3115673059, r3115676580, r3115680700.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Address PR #13095 review from dg845 (2026-04-23)

Fixes 5 review threads + style:

1. Converter now builds `AceStepPipeline` in memory and calls
   `save_pretrained`. Previously the hand-written `model_index.json` was
   missing the `scheduler` entry — fresh converter output couldn't be loaded
   by `AceStepPipeline.from_pretrained` (r3127767785). This also makes the
   converter robust to future `__init__` signature changes.

2. `latent_length` uses `math.ceil(...)` instead of `int(...)` so non-integer
   products (e.g. `latents_per_second=2.0, audio_duration=0.4 → 0.8`) round up
   to `1` instead of truncating to `0` and crashing shape checks (r3127790939).

3. Add `_callback_tensor_inputs = ["latents"]` on `AceStepPipeline` so the
   standard diffusers callback tests pick up the right tensor (r3127795954).

4. `AceStepConditionEncoder.silence_latent` no longer hard-codes the channel
   dim to 64. The placeholder buffer now uses the `timbre_hidden_dim`
   constructor argument, so smaller test configs with `timbre_hidden_dim != 64`
   load without shape errors (r3127812932).

5. Revert `self.vae.enable_tiling()` from `AceStepPipeline.__init__`. Users can
   call `pipe.vae.enable_tiling()` themselves for long-form generation; that
   matches the opt-in convention used by the rest of diffusers (r3127777296).

6. `ruff check --fix` + `ruff format` over all ACE-Step sources (the style fix
   dg845 asked for via `@bot /style`).
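
Item 2 in the list above comes down to the difference between truncating and rounding up. A small sketch, with `latent_length` as a hypothetical stand-in for the pipeline's internal computation:

```python
import math


def latent_length(audio_duration, latents_per_second):
    """Number of latent frames needed to cover audio_duration seconds."""
    return math.ceil(audio_duration * latents_per_second)


rounded_up = latent_length(0.4, 2.0)  # product is 0.8, so one frame is allocated
truncated = int(0.4 * 2.0)            # the old int(...) behaviour yields zero frames
```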

Also: converter now accepts sharded `model.safetensors.index.json` layouts
alongside the single-file `model.safetensors`, so the 5B XL turbo variant
converts without a pre-processing step.

Parity on jieyue (seed=42, bf16 + flash-attn, turbo text2music 160s, fresh
converter output loaded via `from_pretrained`):
  waveform Pearson  = 0.999954
  spectral Pearson  = 0.999977
  max |a-b| bf16    = 4.3e-02  (dominated by the VAE tiling default flip)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Address PR #13095 review from yiyixuxu (2026-04-23)

Code-level (22 threads):

1. Delete 3 dev/parity scripts (`scripts/audio_parity_jieyue.py`,
   `scripts/dit_parity_test.py`, `scripts/run_official_generate_music.py`)
   that shouldn't have been committed.
2. Rename `AutoencoderOobleck._encode_one` → `_encode` to match the convention
   used by other diffusers VAEs.
3. Delete the hard-coded `SHIFT_TIMESTEPS` / `VALID_SHIFTS` table in
   `pipeline_ace_step.py`: the per-shift turbo schedules are recovered
   exactly by `linspace(1, 0, N+1)[:-1]` plus the flow-match shift formula
   that the non-turbo branch already uses, so a single code path covers both.
4. Drop the backwards-compat `AceStepDiTModel` / `AceStepDiTLayer` aliases
   and every reference (top-level `__init__`, `models/__init__`,
   `transformers/__init__`, dummy objects, tests, docs toctree, model card).
   `AceStepTransformer1DModel` is the only exported name now.
5. Remove the unused `attention_mask` / `encoder_attention_mask` args from
   `AceStepTransformer1DModel.forward`; the model rebuilds its masks from
   the sequence shape and never consumed them.
6. In the DiT forward and both encoders, pass `None` instead of an all-zero
   `full_attn_mask` / `encoder_4d_mask` to non-sliding attention layers — SDPA
   dispatches to a faster kernel when the mask is None.
7. Inline the shared `_run_encoder_layers` helper directly into
   `AceStepLyricEncoder.forward` / `AceStepTimbreEncoder.forward` so layer
   calls are visible at the forward boundary (diffusers style).
8. Move `is_turbo` / `sample_rate` / `latents_per_second` from `@property`s
   that re-read module configs each call to cached attributes populated in
   `__init__` (Flux2-style), with a default-ACE-Step fallback when
   `self.vae` is offloaded. Drop the now-unused `SAMPLE_RATE = 48000`
   module-level constant and the three property definitions.
9. Warn + coerce `guidance_scale` to 1.0 on turbo (guidance-distilled)
   checkpoints, following `pipeline_flux2_klein`. Prevents over-guided
   audio when users forward their base/sft CFG settings to a turbo pipe.
10. Remove the `logger.warning(...)` paths that triggered on
    `silence_latent` missing/zero — those only fired for author-side
    unconverted checkpoints and tests; end users always load converted
    weights where the buffer is baked in.
11. Drop the redundant `with torch.no_grad():` wrappers inside
    `encode_prompt` — the pipeline's `__call__` runs under `torch.no_grad`
    already.
12. Strip "reviewer comment on PR #13095" attribution comments from three
    docstrings (here and everywhere).
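
Item 3's single code path can be sketched without torch. The shift transform below is the usual flow-matching form, `shift * t / (1 + (shift - 1) * t)`, which is also what `FlowMatchEulerDiscreteScheduler` applies for static shifting. That the former `SHIFT_TIMESTEPS` tables match this form exactly is taken from the commit's claim rather than re-verified here, and `timestep_schedule` is an illustrative name.

```python
def timestep_schedule(num_steps, shift=1.0):
    """Shifted timesteps in (0, 1], highest noise level first."""
    # Equivalent of linspace(1, 0, num_steps + 1)[:-1].
    base = [1.0 - i / num_steps for i in range(num_steps)]
    # Flow-matching shift; shift=1.0 leaves the linear schedule unchanged.
    return [shift * t / (1.0 + (shift - 1.0) * t) for t in base]


linear = timestep_schedule(8)
shifted = timestep_schedule(8, shift=3.0)
```

With `shift=3.0` the midpoint `0.5` maps to `0.75`, so the schedule spends more of its steps at high noise levels.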

Parity on jieyue (seed=42, bf16 + flash-attn, XL turbo 160s text2music):
  waveform Pearson = 0.9747
  spectral Pearson = 0.9895

The shift comes from full-attention layers switching `attn_mask=0_tensor` →
`attn_mask=None`, which dispatches to a different SDPA kernel on bf16. The
two outputs are algebraically equivalent for fp32 eager; on bf16+FA the
delta is dominated by kernel-level ULPs, well within the sampler-noise
band (ear-check on the 160s example confirms no audible regression).

Still open — AudioTokenizer/Detokenizer (deferred) + APG guider follow-up
(dims differ from `diffusers.guiders.adaptive_projected_guidance`, not a
drop-in; worth a separate PR).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Address ACE-Step audio token and APG review

* Fix ACE-Step docs CI

* Address ACE-Step pipeline cleanup review

* Fix ACE-Step flash attention sliding windows

* Add ACE-Step callback properties

* Address ACE-Step final review comments

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: YiYi Xu <yixu310@gmail.com>
Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
2026-04-30 18:30:44 -10:00


# Run this script to convert ACE-Step model weights to a diffusers pipeline.
#
# Usage:
# python scripts/convert_ace_step_to_diffusers.py \
# --checkpoint_dir /path/to/ACE-Step-1.5/checkpoints \
# --dit_config acestep-v15-turbo \
# --output_dir /path/to/output/ACE-Step-v1-5-turbo \
# --dtype bf16
import argparse
import json
import os
import shutil
import torch
from safetensors.torch import load_file
def convert_ace_step_weights(checkpoint_dir, dit_config, output_dir, dtype_str="bf16"):
"""
Convert ACE-Step checkpoint weights into a Diffusers-compatible pipeline layout.
The original ACE-Step model stores all weights in a single `model.safetensors` file
under `checkpoints/<dit_config>/`. This script splits the weights into separate
sub-model directories that can be loaded by `AceStepPipeline.from_pretrained()`.
Expected input layout:
checkpoint_dir/
<dit_config>/ # e.g., acestep-v15-turbo
config.json
model.safetensors
silence_latent.pt
vae/
config.json
diffusion_pytorch_model.safetensors
Qwen3-Embedding-0.6B/
config.json
model.safetensors
tokenizer.json
...
Output layout:
output_dir/
model_index.json
transformer/
config.json
diffusion_pytorch_model.safetensors
condition_encoder/
config.json
diffusion_pytorch_model.safetensors
vae/
config.json
diffusion_pytorch_model.safetensors
text_encoder/
config.json
model.safetensors
...
tokenizer/
tokenizer.json
...
"""
# Support `--checkpoint_dir <repo-id>` by snapshot-downloading it first. A
# local path that happens not to exist still raises the clearer FileNotFoundError
# below, so we only fall through to the Hub if the path is missing AND looks like
# a repo id (namespace/name).
if not os.path.exists(checkpoint_dir) and "/" in checkpoint_dir and not checkpoint_dir.startswith((".", "~", "/")):
try:
from huggingface_hub import snapshot_download
print(f"Downloading `{checkpoint_dir}` from the Hugging Face Hub ...")
checkpoint_dir = snapshot_download(repo_id=checkpoint_dir)
print(f" -> local snapshot at {checkpoint_dir}")
except ImportError as e:
raise ImportError(
"To use a Hugging Face Hub repo id for --checkpoint_dir, install `huggingface_hub`."
) from e
# Resolve paths
dit_dir = os.path.join(checkpoint_dir, dit_config)
vae_dir = os.path.join(checkpoint_dir, "vae")
text_encoder_dir = os.path.join(checkpoint_dir, "Qwen3-Embedding-0.6B")
# The DiT weights ship either as a single `model.safetensors` (the smaller turbo
# variant) or as sharded safetensors keyed by `model.safetensors.index.json`
# (the 5B XL variant). Resolve both layouts to `dit_weight_files` and load below.
single_model_path = os.path.join(dit_dir, "model.safetensors")
sharded_index_path = os.path.join(dit_dir, "model.safetensors.index.json")
config_path = os.path.join(dit_dir, "config.json")
if os.path.exists(single_model_path):
dit_weight_files = [single_model_path]
elif os.path.exists(sharded_index_path):
with open(sharded_index_path) as f:
shard_index = json.load(f)
dit_weight_files = [os.path.join(dit_dir, s) for s in sorted(set(shard_index["weight_map"].values()))]
for p in dit_weight_files:
if not os.path.exists(p):
raise FileNotFoundError(f"sharded DiT weight missing: {p}")
else:
raise FileNotFoundError(
f"DiT weights not found at: {single_model_path} or {sharded_index_path}. "
"Expected either a single `model.safetensors` or a sharded "
"`model.safetensors.index.json` + per-shard files."
)
for path, name in [
(config_path, "config"),
(vae_dir, "VAE"),
(text_encoder_dir, "text encoder"),
]:
if not os.path.exists(path):
raise FileNotFoundError(f"{name} not found at: {path}")
# Select dtype
dtype_map = {"fp32": torch.float32, "fp16": torch.float16, "bf16": torch.bfloat16}
if dtype_str not in dtype_map:
    raise ValueError(f"Unsupported dtype: {dtype_str}. Choose from {list(dtype_map.keys())}")
target_dtype = dtype_map[dtype_str]
# Load original config
with open(config_path) as f:
    original_config = json.load(f)
print(f"Loading DiT weights from {len(dit_weight_files)} file(s) ...")
state_dict = {}
for p in dit_weight_files:
    print(f" loading {os.path.basename(p)}")
    state_dict.update(load_file(p))
print(f" Total keys: {len(state_dict)}")
# =========================================================================
# 1. Split weights by prefix
# =========================================================================
transformer_sd = {}
condition_encoder_sd = {}
audio_tokenizer_sd = {}
audio_token_detokenizer_sd = {}
other_sd = {}
# Rename original ACE-Step attention keys to the diffusers `Attention` +
# `AttnProcessor` convention (`to_q`/`to_k`/`to_v`/`to_out.0`/`norm_q`/`norm_k`).
# Applies uniformly to both the DiT (self-attn and cross-attn) and the
# condition-encoder self-attention, since both use `AceStepAttention`.
_ATTN_KEY_RENAMES = [
    (".q_proj.", ".to_q."),
    (".k_proj.", ".to_k."),
    (".v_proj.", ".to_v."),
    (".o_proj.", ".to_out.0."),
    (".q_norm.", ".norm_q."),
    (".k_norm.", ".norm_k."),
]
def _rename_attn_keys(key: str) -> str:
    for old, new in _ATTN_KEY_RENAMES:
        key = key.replace(old, new)
    return key
for key, value in state_dict.items():
    if key.startswith("decoder."):
        # Strip "decoder." prefix for the transformer
        new_key = key[len("decoder.") :]
        # The original model uses nn.Sequential for proj_in/proj_out:
        # proj_in = Sequential(Lambda, Conv1d, Lambda)
        # proj_out = Sequential(Lambda, ConvTranspose1d, Lambda)
        # Only the Conv1d/ConvTranspose1d (index 1) has parameters.
        # In diffusers, we use standalone Conv1d/ConvTranspose1d named proj_in_conv/proj_out_conv.
        new_key = new_key.replace("proj_in.1.", "proj_in_conv.")
        new_key = new_key.replace("proj_out.1.", "proj_out_conv.")
        new_key = _rename_attn_keys(new_key)
        transformer_sd[new_key] = value.to(target_dtype)
    elif key.startswith("encoder."):
        # Strip "encoder." prefix for the condition encoder
        new_key = key[len("encoder.") :]
        new_key = _rename_attn_keys(new_key)
        condition_encoder_sd[new_key] = value.to(target_dtype)
    elif key == "null_condition_emb":
        # Learned unconditional embedding (used by the base/SFT CFG path).
        # Keep it co-located with the condition encoder since that is where the
        # pipeline pulls unconditional sequences from.
        condition_encoder_sd["null_condition_emb"] = value.to(target_dtype)
    elif key.startswith("tokenizer."):
        new_key = key[len("tokenizer.") :]
        new_key = _rename_attn_keys(new_key)
        audio_tokenizer_sd[new_key] = value.to(target_dtype)
    elif key.startswith("detokenizer."):
        new_key = key[len("detokenizer.") :]
        new_key = _rename_attn_keys(new_key)
        audio_token_detokenizer_sd[new_key] = value.to(target_dtype)
    else:
        other_sd[key] = value.to(target_dtype)
print(f" Transformer keys: {len(transformer_sd)}")
print(f" Condition encoder keys: {len(condition_encoder_sd)}")
print(f" Audio tokenizer keys: {len(audio_tokenizer_sd)}")
print(f" Audio token detokenizer keys: {len(audio_token_detokenizer_sd)}")
print(f" Other keys: {len(other_sd)} ({list(other_sd.keys())[:5]}...)")
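# Illustrative sketch (not part of the original script): the prefix routing and
# key renaming from section 1, restated as a pure function so it can be checked
# on a few keys without loading any weights. The helper and table names are
# hypothetical, as are the sample keys in the usage note; the prefixes and
# rename pairs mirror the ones used in section 1.
_SKETCH_PREFIX_TO_BUCKET = {
    "decoder.": "transformer",
    "encoder.": "condition_encoder",
    "tokenizer.": "audio_tokenizer",
    "detokenizer.": "audio_token_detokenizer",
}
_SKETCH_ATTN_RENAMES = [
    (".q_proj.", ".to_q."),
    (".k_proj.", ".to_k."),
    (".v_proj.", ".to_v."),
    (".o_proj.", ".to_out.0."),
    (".q_norm.", ".norm_q."),
    (".k_norm.", ".norm_k."),
]
def _sketch_route_key(key):
    """Return `(bucket, new_key)` for one original checkpoint key."""
    if key == "null_condition_emb":
        # Stored alongside the condition encoder, name unchanged.
        return "condition_encoder", key
    for prefix, bucket in _SKETCH_PREFIX_TO_BUCKET.items():
        if key.startswith(prefix):
            new_key = key[len(prefix):]
            if bucket == "transformer":
                # Only the DiT has the Sequential-wrapped input/output convs.
                new_key = new_key.replace("proj_in.1.", "proj_in_conv.")
                new_key = new_key.replace("proj_out.1.", "proj_out_conv.")
            for old, new in _SKETCH_ATTN_RENAMES:
                new_key = new_key.replace(old, new)
            return bucket, new_key
    # Anything without a known prefix ends up in the "other" bucket unchanged.
    return "other", key
# Usage: _sketch_route_key("decoder.layers.0.self_attn.q_proj.weight")
# gives ("transformer", "layers.0.self_attn.to_q.weight").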
# =========================================================================
# 2. Build configs for each sub-model
# =========================================================================
# On the 5B XL turbo the condition encoder is narrower than the DiT
# (`encoder_hidden_size=2048` feeding a `hidden_size=2560` DiT). Non-XL
# turbo / base checkpoints don't set this field, so fall back to
# `hidden_size`; the DiT's `condition_embedder` is then a Linear whose input
# and output widths are equal. Similarly, `encoder_intermediate_size` /
# `encoder_num_attention_heads` / `encoder_num_key_value_heads` are set only
# on XL and fall back to the corresponding DiT values otherwise.
encoder_hidden_size = original_config.get("encoder_hidden_size", original_config["hidden_size"])
encoder_intermediate_size = original_config.get("encoder_intermediate_size", original_config["intermediate_size"])
encoder_num_attention_heads = original_config.get(
"encoder_num_attention_heads", original_config["num_attention_heads"]
)
encoder_num_key_value_heads = original_config.get(
"encoder_num_key_value_heads", original_config["num_key_value_heads"]
)
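# Illustrative sketch (not part of the original script): the `dict.get`
# fallback used above, applied to two toy configs. The XL widths (2048 / 2560)
# come from the comment above; the non-XL width of 1024 is a made-up value, and
# the helper name is hypothetical.
def _sketch_encoder_width(config):
    """Use `encoder_hidden_size` when present, else fall back to `hidden_size`."""
    # The default argument is evaluated eagerly, so `hidden_size` must always
    # be present in the config, even when `encoder_hidden_size` is set.
    return config.get("encoder_hidden_size", config["hidden_size"])
# Usage: an XL-style config {"hidden_size": 2560, "encoder_hidden_size": 2048}
# resolves to 2048, while a config with only {"hidden_size": 1024} resolves to 1024.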
# Transformer (DiT) config. `is_turbo` / `model_version` propagate the variant so
# the pipeline can pick the right CFG / shift / step-count defaults at inference.
# Note: `max_position_embeddings` is dropped (RoPE computes freqs on-the-fly per call),
# and `use_sliding_window` is implied by the mix of `layer_types`.
transformer_config = {
"_class_name": "AceStepTransformer1DModel",
"_diffusers_version": "0.33.0.dev0",
"hidden_size": original_config["hidden_size"],
"intermediate_size": original_config["intermediate_size"],
"num_hidden_layers": original_config["num_hidden_layers"],
"num_attention_heads": original_config["num_attention_heads"],
"num_key_value_heads": original_config["num_key_value_heads"],
"head_dim": original_config["head_dim"],
"in_channels": original_config["in_channels"],
"audio_acoustic_hidden_dim": original_config["audio_acoustic_hidden_dim"],
"patch_size": original_config["patch_size"],
"rope_theta": original_config["rope_theta"],
"attention_bias": original_config["attention_bias"],
"attention_dropout": original_config["attention_dropout"],
"rms_norm_eps": original_config["rms_norm_eps"],
"sliding_window": original_config["sliding_window"],
"layer_types": original_config["layer_types"],
"encoder_hidden_size": encoder_hidden_size,
"is_turbo": bool(original_config.get("is_turbo", False)),
"model_version": original_config.get("model_version"),
}
# Condition encoder config
condition_encoder_config = {
"_class_name": "AceStepConditionEncoder",
"_diffusers_version": "0.33.0.dev0",
"hidden_size": encoder_hidden_size,
"intermediate_size": encoder_intermediate_size,
"text_hidden_dim": original_config["text_hidden_dim"],
"timbre_hidden_dim": original_config["timbre_hidden_dim"],
"num_lyric_encoder_hidden_layers": original_config["num_lyric_encoder_hidden_layers"],
"num_timbre_encoder_hidden_layers": original_config["num_timbre_encoder_hidden_layers"],
"num_attention_heads": encoder_num_attention_heads,
"num_key_value_heads": encoder_num_key_value_heads,
"head_dim": original_config["head_dim"],
"rope_theta": original_config["rope_theta"],
"attention_bias": original_config["attention_bias"],
"attention_dropout": original_config["attention_dropout"],
"rms_norm_eps": original_config["rms_norm_eps"],
"sliding_window": original_config["sliding_window"],
}
audio_tokenizer_config = {
"_class_name": "AceStepAudioTokenizer",
"_diffusers_version": "0.33.0.dev0",
"hidden_size": encoder_hidden_size,
"intermediate_size": encoder_intermediate_size,
"audio_acoustic_hidden_dim": original_config["audio_acoustic_hidden_dim"],
"pool_window_size": original_config.get("pool_window_size", 5),
"fsq_dim": original_config.get("fsq_dim", encoder_hidden_size),
"fsq_input_levels": original_config.get("fsq_input_levels", [8, 8, 8, 5, 5, 5]),
"fsq_input_num_quantizers": original_config.get("fsq_input_num_quantizers", 1),
"num_attention_pooler_hidden_layers": original_config.get("num_attention_pooler_hidden_layers", 2),
"num_attention_heads": encoder_num_attention_heads,
"num_key_value_heads": encoder_num_key_value_heads,
"head_dim": original_config["head_dim"],
"rope_theta": original_config["rope_theta"],
"attention_bias": original_config["attention_bias"],
"attention_dropout": original_config["attention_dropout"],
"rms_norm_eps": original_config["rms_norm_eps"],
"sliding_window": original_config["sliding_window"],
"layer_types": original_config["layer_types"][: original_config.get("num_attention_pooler_hidden_layers", 2)],
}
audio_token_detokenizer_config = {
"_class_name": "AceStepAudioTokenDetokenizer",
"_diffusers_version": "0.33.0.dev0",
"hidden_size": encoder_hidden_size,
"intermediate_size": encoder_intermediate_size,
"audio_acoustic_hidden_dim": original_config["audio_acoustic_hidden_dim"],
"pool_window_size": original_config.get("pool_window_size", 5),
"num_attention_pooler_hidden_layers": original_config.get("num_attention_pooler_hidden_layers", 2),
"num_attention_heads": encoder_num_attention_heads,
"num_key_value_heads": encoder_num_key_value_heads,
"head_dim": original_config["head_dim"],
"rope_theta": original_config["rope_theta"],
"attention_bias": original_config["attention_bias"],
"attention_dropout": original_config["attention_dropout"],
"rms_norm_eps": original_config["rms_norm_eps"],
"sliding_window": original_config["sliding_window"],
"layer_types": original_config["layer_types"][: original_config.get("num_attention_pooler_hidden_layers", 2)],
}
# =========================================================================
# 3. Bake silence_latent into the condition_encoder state dict.
#
# The original loader in
# acestep/core/generation/handler/init_service_loader.py:214 does
# self.silence_latent = torch.load(...).transpose(1, 2)
# converting the stored [B, C=64, T=15000] tensor to [B, T, C=64] before any
# downstream slicing. Do the same transpose here and register it as the
# `silence_latent` buffer on AceStepConditionEncoder — the pipeline slices
# `silence_latent[:, :timbre_fix_frame, :]` to build the "silence" input to the
# timbre encoder when no reference audio is supplied. Passing literal zeros
# produces drone-like audio.
silence_latent_src = os.path.join(dit_dir, "silence_latent.pt")
if os.path.exists(silence_latent_src):
    silence_raw = torch.load(silence_latent_src, weights_only=True, map_location="cpu")
    silence_latent = silence_raw.transpose(1, 2).to(target_dtype).contiguous()
    print(f" silence_latent raw shape: {tuple(silence_raw.shape)} -> baked shape: {tuple(silence_latent.shape)}")
    condition_encoder_sd["silence_latent"] = silence_latent
# =========================================================================
# 4. Build the AceStepPipeline in memory and save via `save_pretrained`.
# Assembling the pipeline directly (rather than hand-writing model_index.json)
# ensures the saved repo stays in sync with the `AceStepPipeline.__init__`
# signature — e.g. a future sub-module added to the pipeline can't silently
# drift out of `model_index.json`.
# =========================================================================
from transformers import AutoModel, AutoTokenizer
from diffusers import (
AceStepPipeline,
AceStepTransformer1DModel,
AutoencoderOobleck,
FlowMatchEulerDiscreteScheduler,
)
from diffusers.pipelines.ace_step import (
AceStepAudioTokenDetokenizer,
AceStepAudioTokenizer,
AceStepConditionEncoder,
)
# Drop metadata keys — they're re-populated by `save_pretrained` at save time.
transformer_init_kwargs = {k: v for k, v in transformer_config.items() if not k.startswith("_")}
condition_encoder_init_kwargs = {k: v for k, v in condition_encoder_config.items() if not k.startswith("_")}
audio_tokenizer_init_kwargs = {k: v for k, v in audio_tokenizer_config.items() if not k.startswith("_")}
audio_token_detokenizer_init_kwargs = {
k: v for k, v in audio_token_detokenizer_config.items() if not k.startswith("_")
}
print("\nConstructing transformer ...")
transformer = AceStepTransformer1DModel(**transformer_init_kwargs).to(target_dtype)
transformer.load_state_dict(transformer_sd, strict=True)
print("Constructing condition_encoder ...")
condition_encoder = AceStepConditionEncoder(**condition_encoder_init_kwargs).to(target_dtype)
condition_encoder.load_state_dict(condition_encoder_sd, strict=True)
print("Constructing audio_tokenizer ...")
audio_tokenizer = AceStepAudioTokenizer(**audio_tokenizer_init_kwargs).to(target_dtype)
audio_tokenizer.load_state_dict(audio_tokenizer_sd, strict=True)
print("Constructing audio_token_detokenizer ...")
audio_token_detokenizer = AceStepAudioTokenDetokenizer(**audio_token_detokenizer_init_kwargs).to(target_dtype)
audio_token_detokenizer.load_state_dict(audio_token_detokenizer_sd, strict=True)
print("Loading VAE ...")
vae = AutoencoderOobleck.from_pretrained(vae_dir).to(target_dtype)
print("Loading text encoder ...")
text_encoder = AutoModel.from_pretrained(text_encoder_dir, torch_dtype=target_dtype)
print("Loading tokenizer ...")
tokenizer = AutoTokenizer.from_pretrained(text_encoder_dir)
# ACE-Step drives the DiT with t ∈ [0, 1] and computes its own shifted / turbo
# sigma schedule, which it passes to `scheduler.set_timesteps(sigmas=...)` at
# sampling time. So the scheduler needs `num_train_timesteps=1` (so
# `scheduler.timesteps == sigmas`) and `shift=1.0` (so it doesn't re-shift
# already-shifted sigmas). All other defaults are fine.
scheduler = FlowMatchEulerDiscreteScheduler(num_train_timesteps=1, shift=1.0)
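# Illustrative sketch (not part of the original script): a simplified model of
# why the scheduler above is built with `num_train_timesteps=1` and `shift=1.0`.
# It assumes the scheduler rescales caller-supplied sigmas with the static
# shift formula `shift * s / (1 + (shift - 1) * s)` and reports timesteps as
# `sigma * num_train_timesteps`. This is not the scheduler's actual code, and
# the helper name is hypothetical.
def _sketch_scheduler_view(sigmas, num_train_timesteps=1, shift=1.0):
    """Return `(timesteps, shifted_sigmas)` under the assumptions stated above."""
    shifted = [shift * s / (1 + (shift - 1) * s) for s in sigmas]
    timesteps = [s * num_train_timesteps for s in shifted]
    return timesteps, shifted
# Usage: with the defaults, both returned lists equal the input sigmas, so a
# schedule that was already shifted by the pipeline passes through untouched.
# With `shift=3.0`, an input sigma of 0.5 would instead be moved to 0.75.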
pipe = AceStepPipeline(
vae=vae,
text_encoder=text_encoder,
tokenizer=tokenizer,
transformer=transformer,
condition_encoder=condition_encoder,
scheduler=scheduler,
audio_tokenizer=audio_tokenizer,
audio_token_detokenizer=audio_token_detokenizer,
)
print(f"\nSaving pipeline -> {output_dir}")
pipe.save_pretrained(output_dir, safe_serialization=True, max_shard_size="5GB")
# Keep the raw silence_latent.pt at the pipeline root for debugging — not
# required by `from_pretrained`, but makes it easy to re-derive the buffer
# without re-running the full conversion.
if os.path.exists(silence_latent_src):
    shutil.copy2(silence_latent_src, os.path.join(output_dir, "silence_latent.pt"))
    print(f" kept raw silence_latent copy at {output_dir}/silence_latent.pt")
# Report any keys that were not saved to registered pipeline modules.
if other_sd:
    print(f"\nNote: {len(other_sd)} keys were dropped:")
    for key in sorted(other_sd.keys())[:10]:
        print(f" {key}")
    if len(other_sd) > 10:
        print(f" ... ({len(other_sd) - 10} more)")
print(f"\nConversion complete! Output saved to: {output_dir}")
print("\nTo load the pipeline:")
print(" import torch")
print(" from diffusers import AceStepPipeline")
# Suggest the dtype the weights were actually saved in rather than always bf16.
print(f" pipe = AceStepPipeline.from_pretrained('{output_dir}', torch_dtype={target_dtype})")
print(" pipe = pipe.to('cuda')")
if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Convert ACE-Step model weights to Diffusers pipeline format")
    parser.add_argument(
        "--checkpoint_dir",
        type=str,
        required=True,
        help=(
            "Path to the ACE-Step checkpoints directory (containing vae/, Qwen3-Embedding-0.6B/, "
            "and dit config dirs), or a Hugging Face Hub repo id (requires `huggingface_hub`)"
        ),
    )
    parser.add_argument(
        "--dit_config",
        type=str,
        default="acestep-v15-turbo",
        help="Name of the DiT config directory (default: acestep-v15-turbo)",
    )
    parser.add_argument(
        "--output_dir",
        type=str,
        required=True,
        help="Path to save the converted Diffusers pipeline",
    )
    parser.add_argument(
        "--dtype",
        type=str,
        default="bf16",
        choices=["fp32", "fp16", "bf16"],
        help="Data type for saved weights (default: bf16)",
    )
    args = parser.parse_args()
    convert_ace_step_weights(
        checkpoint_dir=args.checkpoint_dir,
        dit_config=args.dit_config,
        output_dir=args.output_dir,
        dtype_str=args.dtype,
    )