mirror of
https://github.com/huggingface/diffusers.git
synced 2026-05-28 00:39:35 +08:00
* [Pipelines] AnyFlow: scaffold pipelines/anyflow + register all top-level imports
This is the lazy-loader scaffolding only. Body files (pipeline_anyflow.py,
pipeline_anyflow_causal.py, transformer_anyflow.py,
scheduling_flow_map_euler_discrete.py) come in subsequent commits.
* [Schedulers] AnyFlow: add FlowMapEulerDiscreteScheduler
The flow-map scheduler advances samples from timestep t to a caller-provided
target r in a single Euler step, supporting any-step sampling on
flow-map-distilled checkpoints. It is a general-purpose scheduler, not specific
to the AnyFlow checkpoints.
Tests: 12 standalone tests covering instantiation, set_timesteps endpoints,
shift identity/monotonicity, step shape preservation, zero-interval identity,
one-shot sampling, train weight schemes, scale_noise endpoints.
Docs: api/schedulers/flow_map_euler_discrete.md
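The single-step update described in this commit can be illustrated with a small framework-free sketch. It assumes the usual flow-matching convention in which the model predicts a mean velocity u over the interval and the update is x_r = x_t + (r - t) * u. The name `flow_map_euler_step`, the normalized times in [0, 1], and the plain-list samples are illustrative assumptions, not the scheduler's actual code.

```python
from typing import List


def flow_map_euler_step(
    sample: List[float], mean_velocity: List[float], t: float, r: float
) -> List[float]:
    """Advance `sample` from time t to time r with a single Euler step.

    Hypothetical helper: assumes x_r = x_t + (r - t) * u.
    """
    dt = r - t
    return [x + dt * u for x, u in zip(sample, mean_velocity)]


# Toy straight-line path x_t = (1 - t) * data + t * noise, whose velocity
# is constant and equal to (noise - data).
data = [1.0, -2.0, 0.5]
noise = [0.3, 0.7, -1.1]
velocity = [n - d for n, d in zip(noise, data)]

# One-shot sampling: jump from pure noise (t = 1) straight to data (r = 0).
one_shot = flow_map_euler_step(noise, velocity, t=1.0, r=0.0)

# Zero-interval identity: t == r leaves the sample untouched.
unchanged = flow_map_euler_step(noise, velocity, t=0.4, r=0.4)
```

On this toy path the one-shot jump recovers `data` and the zero-length interval returns the input, which mirrors the "one-shot sampling" and "zero-interval identity" properties listed in the tests above.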
* [Models] AnyFlow: add AnyFlowTransformer3DModel
A 3D DiT extending the v0.35.1 Wan2.1 backbone with two config-toggled modules:
* FAR causal blocks (init_far_model=True): block-sparse causal attention via
flex_attention + compressed-frame patch embedding for frame-level
autoregressive generation (Gu et al., 2025, arXiv:2503.19325).
* Dual-timestep flow-map embedding (init_flowmap_model=True): adds a delta
timestep embedder enabling flow-map sampling z_t -> z_r over arbitrary
intervals (AnyFlow).
With both flags off, the model reduces to stock Wan2.1.
The class is intentionally self-contained rather than annotated with
'# Copied from diffusers.models.transformers.transformer_wan' because upstream
Wan has been refactored extensively since v0.35.1 (new WanAttention class,
different processor architecture).
Tests: 9 unit tests covering construction in 3 modes, bidi forward shape and
determinism, return_dict variants, save/load round-trip with and without
init_far_model, gradient checkpointing toggle.
Docs: api/models/anyflow_transformer3d.md
* [Pipelines] AnyFlow: add AnyFlowPipeline and AnyFlowCausalPipeline
* AnyFlowPipeline (pipeline_anyflow.py, ~590 LOC): bidirectional T2V using
flow-map sampling. Loads checkpoints from nvidia/AnyFlow-Wan2.1-T2V-{1.3B,14B}.
* AnyFlowCausalPipeline (pipeline_anyflow_causal.py, ~700 LOC): FAR-based
causal pipeline supporting T2V/I2V/TV2V via task_type kwarg. Loads checkpoints
from nvidia/AnyFlow-FAR-Wan2.1-{1.3B,14B}-Diffusers.
Both pipelines reuse stock WanLoraLoaderMixin, AutoencoderKLWan, UMT5EncoderModel,
and AutoTokenizer from upstream. The transformer is the AnyFlowTransformer3DModel
introduced in the previous commit. The scheduler is FlowMapEulerDiscreteScheduler.
Tests:
* tests/pipelines/anyflow/test_anyflow.py: PipelineTesterMixin fast tests +
slow integration test against nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers.
* tests/pipelines/anyflow/test_anyflow_causal.py: same structure for FAR variant.
Reference slices for slow integration tests are deferred to Phase 7
(Final quality pass) where the user runs them on a real GPU.
* [Docs] AnyFlow: add main pipeline documentation page
Modeled on the Helios pipeline doc (PR #13208). Sections: paper link + abstract,
supported checkpoints table, memory/speed optimization tabs, T2V/I2V/TV2V
examples for both bidirectional and causal variants, autodoc trailers.
* [Auto/Scripts] AnyFlow: register AutoPipelineForText2Video + add conversion script
* Register AnyFlowPipeline in AUTO_TEXT2VIDEO_PIPELINES_MAPPING.
* AnyFlowCausalPipeline is intentionally NOT registered for AutoPipeline because
its task switch (t2v / i2v / tv2v) is too rich for a single auto-resolve key.
* scripts/convert_anyflow_to_diffusers.py: convert .pt training checkpoints
(with 'ema' state dict) into a diffusers save_pretrained layout. Supports all
4 released NVIDIA AnyFlow variants. Replaces the omegaconf-based config in the
upstream repo with argparse to match other diffusers conversion scripts.
* [Quality] AnyFlow: ruff-format + regenerated dummy stubs
* ruff format pass on all 5 source files (long lines + trailing comma fixes)
* check_dummies.py --fix_and_overwrite regenerated:
- dummy_pt_objects.py: AnyFlowTransformer3DModel + FlowMapEulerDiscreteScheduler
- dummy_torch_and_transformers_objects.py: AnyFlowPipeline + AnyFlowCausalPipeline
Local fast tests: 21/21 passed
- 12 scheduler tests (FlowMapEulerDiscreteScheduler)
- 9 transformer tests (AnyFlowTransformer3DModel construction + bidi forward + save/load)
The pipeline fast tests in tests/pipelines/anyflow/ require a local dev install
whose transformers version meets the minimum required by the diffusers main branch.
The reference slices for slow integration tests (real GPU + 1.3B/14B
checkpoints) are intentionally left as TODO stubs to be captured by the user
on a real GPU machine before opening the PR.
* [AnyFlow] address review feedback: bug fixes + DMD wording + EN/ZH tutorials
Critical bug fixes (verified against precision-validation review):
* pipeline_anyflow.py / pipeline_anyflow_causal.py: replace hardcoded
transformer_dtype = torch.bfloat16 with self.transformer.dtype, so
pipe.to("cpu") and PipelineTesterMixin save/load tests do not crash on a
dtype mismatch in the patch_embedding conv3d.
* transformer_anyflow.py: drop the duplicate `base = base = ...` assignment in
_build_causal_mask (was a copy-paste typo carried over from FAR-Dev).
* transformer_anyflow.py: drop unused `q_is_context` / `k_is_context` locals
and the `# noqa: F841` markers that were silencing the dead-store warning.
* transformer_anyflow.py: remove `CacheMixin` from the inheritance list — the
pipeline manages KV cache directly, the mixin's interface is unused.
* transformer_anyflow.py: guard the module-level `torch.compile(flex_attention)`
with try/except so the file imports cleanly on CPU CI / no-Triton machines.
* convert_anyflow_to_diffusers.py: replace ad-hoc print warnings with the
stdlib logger (warning_once-style) and a module-level basicConfig.
Documentation accuracy:
* AnyFlowCausalPipeline class docstring + main pipeline doc + EN/ZH tutorial:
drop the fictitious `task_type` / `image` / `video` arguments and document
the real API: pass `context_sequence={"raw": tensor}` (or `{"latent": ...}`)
to switch between T2V (None) / I2V (1-frame) / TV2V (4n+1-frame) modes.
* Pipeline class docstrings + main doc: explicitly describe AnyFlow's
two-stage LoRA distillation including DMD reverse-divergence supervision
with Flow-Map backward simulation in stage 2 (was previously implicit).
* training_rollout: add detailed docstring explaining its role as the
3-segment Flow-Map backward simulation entry point used during DMD training.
* Long-form tutorial doc `using-diffusers/anyflow.md` (EN, 239 LOC) and
Chinese mirror `docs/source/zh/using-diffusers/anyflow.md` (224 LOC) added
and registered in both `_toctree.yml` files.
Tests:
* Skip `test_attention_slicing_forward_pass` in both pipeline test classes
with a clear rationale (custom attention processor does not support slicing).
* All 21 standalone tests still pass (12 scheduler + 9 transformer).
Quality gates:
* `ruff check` clean across all AnyFlow files.
* `ruff format --check` reports 6 files already formatted.
* `python utils/check_copies.py` reports no diff.
Out of scope for this commit (deferred until reviewer feedback):
* Splitting AnyFlowTransformer3DModel into bidi + causal subclasses
* Unifying _forward_inference / _forward_cache return types
* Migrating model tests from plain unittest to BaseModelTesterConfig + mixins
* HF model card / config.json metadata updates on the nvidia/* repos
(push to Hub manually before opening the PR)
* [AnyFlow] rename Causal->FAR + explicit forward signature + dataclass output
Round 2 of review feedback. Three groups of changes; transformer state-dict
keys, module hierarchy, and tensor flow are unchanged so the H200 bit-exact
validation remains valid.
A. Pipeline rename (mechanical, no behavior change):
* Class: AnyFlowCausalPipeline -> AnyFlowFARPipeline (Causal in diffusers
usually means an attention mask; AnyFlow's variant is FAR autoregressive,
so the FAR name is more specific and matches the paper).
* File: pipeline_anyflow_causal.py -> pipeline_anyflow_far.py (git mv).
* Test file: test_anyflow_causal.py -> test_anyflow_far.py (git mv).
* All references updated in src/, tests/, docs/, scripts/, plus stale
anyflowcausalpipeline anchor links in tutorial markdown.
B. Pipeline test bug fixes (closes 19 fast-test failures reported by
precision-validation reviewer):
* pipeline_anyflow.py / pipeline_anyflow_far.py: __call__ now sets
self._num_timesteps = num_inference_steps before the rollout, so the
PipelineTesterMixin callback tests can read pipe.num_timesteps.
* tests/pipelines/anyflow/test_anyflow_far.py: drop the fictitious
task_type="t2v" kwarg that crashed every causal fast test (the FAR
pipeline selects mode via context_sequence, not a task_type arg).
C. Transformer architecture cleanups (review-driven, no tensor changes):
* Replace forward(*args, **kwargs) dispatcher with an explicit signature
listing every supported kwarg (hidden_states, timestep, r_timestep,
encoder_hidden_states, encoder_hidden_states_image, chunk_partition,
clean_hidden_states, clean_timestep, kv_cache, kv_cache_flag, is_causal,
attention_kwargs, return_dict). Helps IDE / type-checker / torch.compile
tracing.
* Drop SimpleNamespace returns. Add AnyFlowFARTransformerOutput
(BaseOutput dataclass with sample + kv_cache fields) for the two causal
paths that need to also propagate kv_cache (_forward_inference and the
newly return_dict-aware _forward_cache). _forward_train and
_forward_bidirection now consistently return Transformer2DModelOutput.
Pipeline call sites already use return_dict=False with positional
unpacking, so the fix is transparent there.
Out of scope (deferred until canonical-org HF metadata sync):
* Splitting AnyFlowTransformer3DModel into a bidi class plus an
AnyFlowFARTransformer3DModel subclass — touches register_to_config keys
and would require updating model_index.json on every released checkpoint.
* Promoting chunk_partition from register_to_config to a forward-time
argument (same reason).
* Renaming training_rollout to _denoise — would break callers in the
FAR-Dev on-policy trainer that produced the released checkpoints.
Local fast tests: 21/21 still pass (12 scheduler + 9 transformer).
ruff check, ruff format, and check_copies.py are all clean.
* [AnyFlow] wire callback_on_step_end through inference_range + add chunk_partition to FAR fast-test fixture
Two root causes for the 19 remaining PipelineTesterMixin failures, identified
by the H200 reviewer:
1. callback_on_step_end was accepted by __call__ but never invoked. Both
pipelines pass it through to training_rollout (and FAR additionally through
inference()), and inference_range now fires it after scheduler.step in
the standard inference branch:
    if callback_on_step_end is not None:
        callback_kwargs = {k: locals()[k] for k in callback_on_step_end_tensor_inputs}
        callback_outputs = callback_on_step_end(self, i, t, callback_kwargs)
        latents = callback_outputs.pop("latents", latents)
        prompt_embeds = ...
        negative_prompt_embeds = ...
`nonlocal prompt_embeds, negative_prompt_embeds` lets the callback rewrite
the closure-captured embeddings, matching upstream WanPipeline semantics.
The 3-segment grad_timestep training rollout does not invoke the callback;
it is intentionally training-only.
2. tests/pipelines/anyflow/test_anyflow_far.py::get_dummy_components built
the dummy transformer without a `chunk_partition`, leaving it None on the
model config and crashing the pipeline at `sum(self.transformer.config.chunk_partition)`.
Set `chunk_partition=[1, 1, 1]` in the fixture (3 chunks of 1 latent frame
each, matching the test's num_frames=9 -> 3 latent frames).
Local fast tests: 21/21 still pass.
ruff check, ruff format, and check_copies.py are all clean.
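The callback contract from root cause 1 can be exercised without any pipeline machinery. In the sketch below, `run_denoise_loop`, the list-of-floats "latents", the halving stand-in for `scheduler.step`, and the `zero_on_last_step` callback are all hypothetical. Only the call shape `callback_on_step_end(pipe, i, t, callback_kwargs)` and the `pop("latents", latents)` write-back are taken from the snippet in this commit.

```python
from typing import Callable, Dict, List, Optional, Sequence


def run_denoise_loop(
    latents: List[float],
    timesteps: Sequence[float],
    callback_on_step_end: Optional[Callable] = None,
    callback_on_step_end_tensor_inputs: Sequence[str] = ("latents",),
) -> List[float]:
    """Toy denoising loop that fires the callback after every step."""
    for i, t in enumerate(timesteps):
        # Stand-in for scheduler.step(...): just halve the latents.
        latents = [x * 0.5 for x in latents]

        if callback_on_step_end is not None:
            available: Dict[str, List[float]] = {"latents": latents}
            callback_kwargs = {k: available[k] for k in callback_on_step_end_tensor_inputs}
            # The first argument is the pipeline in real code; None is a placeholder.
            callback_outputs = callback_on_step_end(None, i, t, callback_kwargs)
            # The callback may hand back replacement latents.
            latents = callback_outputs.pop("latents", latents)
    return latents


timesteps = [1000.0, 500.0, 250.0]
fired_steps: List[int] = []


def zero_on_last_step(pipe, i, t, callback_kwargs):
    """Record each call and zero the latents on the final step."""
    fired_steps.append(i)
    if i == len(timesteps) - 1:
        callback_kwargs["latents"] = [0.0 for _ in callback_kwargs["latents"]]
    return callback_kwargs


with_callback = run_denoise_loop([4.0, 8.0], timesteps, zero_on_last_step)
without_callback = run_denoise_loop([4.0, 8.0], timesteps)
```

The callback fires once per step, and its returned `latents` replace the loop state, which is the behavior the callback tests in PipelineTesterMixin depend on.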
* [AnyFlow] Phase 2: split transformer + drop chunk_partition from config + rename helpers
Major architectural refactor that aligns the integration with diffusers conventions
ahead of the canonical-org Hub upload. State-dict keys, module hierarchy, and
tensor flow are unchanged so the H200 bit-exact validation remains valid; only
the on-disk transformer/config.json fields move.
Changes:
1. **Sibling transformer classes** replace the flag-driven single class:
* AnyFlowTransformer3DModel — bidirectional only. Drops compressed_patch_size /
full_chunk_limit / init_far_model / init_flowmap_model / chunk_partition
kwargs (always-on for AnyFlow distilled checkpoints).
* AnyFlowFARTransformer3DModel — adds far_patch_embedding + the 3 FAR forward
paths (train / cache-prefill / autoregressive inference).
* AnyFlowTimeTextImageEmbedding (the legacy single-time embedder used only by
the old setup_flowmap_model bootstrap) is removed; both classes now build
AnyFlowDualTimestepTextImageEmbedding directly in __init__.
* setup_flowmap_model / setup_far_model methods are removed; weight warm-start
for far_patch_embedding (trilinear interpolation from patch_embedding) moves
into AnyFlowFARTransformer3DModel.__init__.
2. **chunk_partition** is no longer a model config field. The FAR pipeline owns
the schedule:
* AnyFlowFARPipeline.default_chunk_partition = [1, 3, 3, 3, 3, 3, 3, 2]
matches the released 81-frame NVIDIA checkpoints.
* AnyFlowFARPipeline.__call__ / _denoise_rollout accept a chunk_partition
argument that overrides the default for non-default num_frames.
3. **training_rollout -> _denoise_rollout** rename across both pipelines and all
English / Chinese docs that referenced it. Signals the method is internal to
the pipeline driver, not a public training API.
4. **Conversion script + tests + docs + registries**:
* scripts/convert_anyflow_to_diffusers.py: VARIANTS dict picks the right
transformer class per variant; init_far_model / init_flowmap_model /
chunk_partition kwargs are removed from the from_pretrained call.
* Transformer test file split into AnyFlowTransformer3DModelTest and
AnyFlowFARTransformer3DModelTest classes.
* Pipeline test fixtures use the right class and pass chunk_partition via
get_dummy_inputs (3-frame schedule [1, 1, 1] for the 9-frame test).
* New docs page docs/source/en/api/models/anyflow_far_transformer3d.md;
anyflow_transformer3d.md rewritten for the bidi-only class.
* AnyFlowFARTransformer3DModel registered in src/diffusers/__init__.py,
src/diffusers/models/__init__.py, models/transformers/__init__.py and the
dummy_pt_objects.py stubs.
* docs/source/en/_toctree.yml: new entry for the FAR transformer page.
5. **Cleanups**:
* Pipeline __call__ no longer passes is_causal=False to the bidi forward (the
bidi class doesn't accept it).
* Pipeline class docstrings drop stale references to init_*_model flags.
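The link between `num_frames` and `chunk_partition` in item 2 is plain arithmetic. The sketch below assumes the 4x temporal compression implied by the "4n+1-frame" inputs and the "num_frames=9 -> 3 latent frames" fixture mentioned in earlier commits. `latent_frame_count` and `resolve_chunk_partition` are hypothetical helpers, not pipeline methods.

```python
from typing import List, Optional

# Default schedule quoted in this commit for the released 81-frame checkpoints.
DEFAULT_CHUNK_PARTITION = [1, 3, 3, 3, 3, 3, 3, 2]


def latent_frame_count(num_frames: int, temporal_compression: int = 4) -> int:
    """Map pixel-space frames to latent frames (assumed 4n+1 -> n+1 rule)."""
    if num_frames < 1 or (num_frames - 1) % temporal_compression != 0:
        raise ValueError(f"num_frames must be of the form 4n+1, got {num_frames}")
    return (num_frames - 1) // temporal_compression + 1


def resolve_chunk_partition(
    num_frames: int, chunk_partition: Optional[List[int]] = None
) -> List[int]:
    """Return the partition to use, checking that it covers every latent frame."""
    partition = list(DEFAULT_CHUNK_PARTITION if chunk_partition is None else chunk_partition)
    expected = latent_frame_count(num_frames)
    if sum(partition) != expected:
        raise ValueError(
            f"chunk_partition sums to {sum(partition)} latent frames, "
            f"but num_frames={num_frames} needs {expected}"
        )
    return partition


default_for_81 = resolve_chunk_partition(81)
fixture_for_9 = resolve_chunk_partition(9, [1, 1, 1])
```

Under these assumptions the default partition sums to 21 latent frames, which matches 81 pixel frames, and any other `num_frames` needs an explicit override such as `[1, 1, 1]` for the 9-frame test.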
Local tests: 22/22 pass (12 scheduler + 10 transformer covering both classes).
ruff check / format / check_copies clean.
Hub artifacts (model_index.json, transformer/config.json, scheduler config) need
to be regenerated for the released checkpoints; the HF update guide will be
delivered separately.
* [AnyFlow] Phase 3: convention compliance against .ai/AGENTS.md + .ai/models.md
Hard violations (per official diffusers guidelines):
* drop einops dependency — replace 25+ rearrange() calls with native
permute/reshape/unflatten in transformer + both pipelines
* device-gate torch.float64 — apply_rotary_emb and AnyFlowRotaryPosEmbed now
fall back to float32 / complex64 on MPS / NPU; freqs are lazily rebuilt
per-device via _build_freqs (matches transformer_wan / transformer_flux
pattern)
* migrate attention to dispatch_attention_fn — replace direct
F.scaled_dot_product_attention calls with dispatch_attention_fn (works
with sage / flash / native backends); introduce
AnyFlowAttention(AttentionModuleMixin) with _default_processor_cls /
_available_processors;
rename processors to AnyFlowAttnProcessor / AnyFlowCrossAttnProcessor and
declare _attention_backend / _parallel_config class attrs
* drop dead config fields — qk_norm and added_kv_proj_dim are pruned from
both transformer __init__ signatures and AnyFlowTransformerBlock;
AnyFlowAttention is hardcoded to rms-norm-across-heads (the only scheme
the released checkpoints use) and has no add_k_proj path (T2V only)
* add _repeated_blocks = ["AnyFlowTransformerBlock"] to both transformer
classes for compile_repeated_blocks() support (matches Wan)
* annotate prepare_latents with
`# Copied from diffusers.pipelines.wan.pipeline_wan.WanPipeline.prepare_latents`;
the pipeline-side rearrange to (B, T, C, H, W) layout is moved to the call site
State-dict keys are preserved (legacy Attention had identical to_q / to_k /
to_v / to_out / norm_q / norm_k naming), so existing AnyFlow checkpoints load
bit-exactly into the new AnyFlowAttention class.
The HF Hub config-update guide is updated correspondingly: transformer/
config.json now drops qk_norm and added_kv_proj_dim alongside the previous
init_far_model / init_flowmap_model / chunk_partition removals.
22 fast CPU tests still pass; ruff format / ruff check / check_copies all
clean.
* [AnyFlow] FAR fast-test compat: rope 0-dim guard + flex_attention CPU/head-dim fallbacks + KV-cache dtype + num_timesteps
Phase 3 migrated bidi + cross-attention to dispatch_attention_fn, but the FAR
causal path still calls flex_attention directly, which has hard requirements
(CPU compile, head_dim >= 16) that fail on PipelineTesterMixin's tiny dummy
components. Real ckpts (head_dim=128, CUDA) never hit these branches; bit-exact
numerical equivalence with FAR-Dev is preserved on all 4 released ckpts (forward
0.00e+00, backward kernel-nondet only, ratio 1.000).
Code fixes:
1. AnyFlowRotaryPosEmbed._forward_compressed_frame / _forward_full_frame now
short-circuit to an empty tensor when num_frames / height / width is 0.
PipelineTesterMixin's dummy VAE has scale_factor_spatial=8, so a 16x16 raw
spatial input becomes a 2x2 latent which then floors to 0 against
compressed_patch_size=(1, 4, 4); the original
`freqs[:0].view(0, k, 1, -1)` reshape was ambiguous in that regime.
2. flex_attention dispatch: split the module-load
`torch.compile(flex_attention, dynamic=True)` into `_flex_attention_eager`
(always available) plus `_flex_attention_compiled`, with a tiny wrapper
that picks compiled for CUDA tensors and eager for CPU. Avoids
torch._inductor C++ codegen failures that broke fast tests after
`pipe.to("cpu")`. CUDA performance unchanged (L10 benchmark: 0.0% delta on
bidi 1.3B fwd, 0.0% delta on FAR causal 1.3B fwd).
3. AnyFlowAttnProcessor (FAR causal branch): when head_dim < 16
(flex_attention's hard minimum) zero-pad q/k/v's last dim to 16 and pass
`scale=1/sqrt(original_head_dim)` to flex_attention. Padded value rows
contribute 0, so trimming the output back is mathematically equivalent.
Released ckpts use head_dim=128 so the branch is never taken in production.
4. pipeline_anyflow_far.encode_kv_cache: replace the hardcoded
`latents.to(torch.bfloat16)` with `self.transformer.dtype`. The hardcoded
bf16 crashed conv3d on dummy fp32 components ("Input type (BFloat16) and
bias type (float) should be the same"); real bf16 ckpts are unaffected.
5. pipeline_anyflow_far._denoise_rollout sets
`self._num_timesteps = (len(chunk_partition) - num_context_chunks) * num_inference_steps`
before the chunk loop, so PipelineTesterMixin.test_callback_cfg's
`pipe.num_timesteps`-based assertion matches the actual number of callback
fires (chunks * NFE) instead of the previous hardcoded num_inference_steps.
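The equivalence claimed in fix 3 (zero-padding the head dimension while keeping the original scale) can be checked numerically. The sketch below uses a tiny pure-Python softmax attention rather than flex_attention; `attention` and `pad_last_dim` are hypothetical helpers, and the toy shapes (2 queries, 3 keys, head_dim 4 padded to 16) are chosen only for illustration.

```python
import math
from typing import List

Matrix = List[List[float]]


def attention(q: Matrix, k: Matrix, v: Matrix, scale: float) -> Matrix:
    """Plain softmax(q @ k^T * scale) @ v for a single head."""
    out = []
    for q_row in q:
        scores = [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append(
            [sum(w * v_row[j] for w, v_row in zip(weights, v)) / total for j in range(len(v[0]))]
        )
    return out


def pad_last_dim(x: Matrix, target: int) -> Matrix:
    """Zero-pad every row up to `target` entries."""
    return [row + [0.0] * (target - len(row)) for row in x]


head_dim, padded_dim = 4, 16
q = [[0.1, -0.2, 0.3, 0.4], [0.5, 0.0, -0.1, 0.2]]
k = [[0.2, 0.1, -0.3, 0.0], [-0.4, 0.6, 0.1, 0.3], [0.0, -0.5, 0.2, 0.1]]
v = [[1.0, 2.0, 3.0, 4.0], [-1.0, 0.5, 0.0, 2.0], [0.3, -0.7, 1.1, 0.9]]

# The scale must use the ORIGINAL head_dim, not the padded one.
scale = 1.0 / math.sqrt(head_dim)

reference = attention(q, k, v, scale)
padded = attention(
    pad_last_dim(q, padded_dim), pad_last_dim(k, padded_dim), pad_last_dim(v, padded_dim), scale
)
trimmed = [row[:head_dim] for row in padded]
```

The zero columns add nothing to the q·k dot products, and the padded value columns come out as zeros, so trimming the output back to `head_dim` reproduces the unpadded result. Using `1/sqrt(padded_dim)` instead would change the softmax temperature and break the equivalence.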
Tests:
* test_callback_inputs cannot pass without changing FAR's chunk-wise output
semantics — it zeroes latents on the final step and asserts the *entire*
output buffer is zero, but only the active chunk's slice is overwritten in
a chunk-wise rollout. Marked `@unittest.skip` with a detailed rationale;
callback functionality itself is still covered by test_callback_cfg.
* Full pytest run on tests/pipelines/anyflow/ +
tests/models/transformers/test_models_transformer_anyflow.py +
tests/schedulers/test_scheduler_flow_map_euler_discrete.py: 81 passed,
0 failed, 11 skipped.
Quality gates:
* `ruff check` and `ruff format --check` clean across all AnyFlow files.
* `python utils/check_copies.py` clean.
* `python utils/check_dummies.py` clean.
* [AnyFlow] docs/code: paper-release tidy-up
User-facing alignment with the official HF Hub model card and the day-of-announcement
materials at https://huggingface.co/collections/nvidia/anyflow.
* Fill in the arXiv identifier 2605.13724 (5 paper links + 2 BibTeX entries).
* Rename TV2V → V2V across docs + pipeline_anyflow{,_far}.py so the diffusers
copy uses the same Video-to-Video terminology as the official model card.
* Add the [nvidia/anyflow](https://huggingface.co/collections/nvidia/anyflow)
HF collection link to the three tutorial intros.
* Drop the temporary "guyuchao/* staging" tip from the EN tutorial / API page
/ ZH tutorial — the nvidia/AnyFlow-*-Diffusers repos are now live.
* Wire up NVlabs/AnyFlow (training code) and nvlabs.github.io/AnyFlow (project
page) in place of the prior <github-org> / <project-page-url> placeholders.
* Cite the authors (Yuchao Gu, Guian Fang et al.) and NUS ShowLab × NVIDIA
affiliation in the main tutorial, API pipeline page, and both transformer
model pages; BibTeX uses the standard `and others` to elide the full list
until the next pass.
Working tree, CI gates, and tests after the change:
ruff format --check ✓
ruff check ✓
python utils/check_copies.py ✓
python utils/check_dummies.py ✓
pytest tests/models + tests/schedulers (22 fast) ✓
No production code logic changes — only docstring wording inside pipeline
files (TV2V → V2V).
* [AnyFlow] docs: drop in official BibTeX (full author list)
Replace the placeholder ``@article{gu2026anyflow, author = {Gu, Yuchao and
Fang, Guian and others}, ...}`` block in both the English and Chinese
tutorials with the canonical ``@misc{gu2026anyflowanystepvideodiffusion,
...}`` form from arxiv.org/abs/2605.13724, which lists all seven authors:
Yuchao Gu, Guian Fang, Yuxin Jiang, Weijia Mao, Song Han, Han Cai,
Mike Zheng Shou.
Docs-only.
* [AnyFlow] align with diffusers conventions + drop training-only code
Scheduler
- FlowMapEulerDiscreteScheduler.step now returns a
FlowMapEulerDiscreteSchedulerOutput dataclass (or tuple with return_dict=False)
and uses the conventional positional order (model_output, timestep, sample,
r_timestep).
- Drop training-only helpers: adaptive_weighting, set_train_weight,
get_train_weight, linear_timesteps_weights, and the weight_type config field.
- Add scale_model_input no-op for API parity; raise ValueError on missing
r_timestep.
Transformer
- Remove gate_track debug write inside
AnyFlowDualTimestepTextImageEmbedding.forward_timestep.
- Compile flex_attention lazily on first CUDA call instead of at import time.
- Replace assert with ValueError in build_block_mask.
- Resolve <arxiv-id> placeholders to 2605.13724.
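The lazy-compilation item in the Transformer list above (compile on the first CUDA call rather than at import time) follows a common deferred-initialization pattern. The sketch below is framework-free: `LazyCompiledOp` is a hypothetical wrapper, `fake_compile` stands in for `torch.compile`, and the device is passed as a plain string instead of being read from a tensor.

```python
from typing import Any, Callable, List, Optional


class LazyCompiledOp:
    """Run eagerly on CPU; compile once, on the first CUDA call."""

    def __init__(self, eager_fn: Callable, compile_fn: Callable[[Callable], Callable]):
        self._eager_fn = eager_fn
        self._compile_fn = compile_fn
        self._compiled_fn: Optional[Callable] = None  # filled in lazily

    def __call__(self, device_type: str, *args: Any, **kwargs: Any) -> Any:
        if device_type != "cuda":
            # CPU (and other backends) never trigger compilation.
            return self._eager_fn(*args, **kwargs)
        if self._compiled_fn is None:
            self._compiled_fn = self._compile_fn(self._eager_fn)
        return self._compiled_fn(*args, **kwargs)


compile_calls: List[str] = []


def fake_compile(fn: Callable) -> Callable:
    """Stand-in for a real compiler: records the call and returns fn unchanged."""
    compile_calls.append(fn.__name__)
    return fn


def add(a, b):
    return a + b


op = LazyCompiledOp(add, fake_compile)
cpu_result = op("cpu", 1, 2)
compiles_after_cpu = len(compile_calls)
cuda_results = [op("cuda", 1, 2), op("cuda", 3, 4)]
```

Importing the module costs nothing, CPU-only runs never reach the compiler, and repeated CUDA calls reuse the single compiled function.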
Pipelines (AnyFlowPipeline + AnyFlowFARPipeline)
- Add EXAMPLE_DOC_STRING + @replace_example_docstring and full __call__
docstrings covering every argument.
- Move use_mean_velocity from __init__ to __call__ so save/load round-trips.
- Drop _denoise_rollout's grad_timestep branch (DMD on-policy training rollout),
the inner inference_range closure, and the redundant negative-prompt concat.
- Replace asserts with ValueError; wire show_progress to tqdm; rename inference
-> _inference; remove dead current_timestep property.
- Update scheduler.step call sites to the new signature.
- Trim class docstrings to inference-only language.
Pipeline output
- Add Apache 2.0 license header; switch to relative import.
Auto pipeline / conversion script
- Register AnyFlowFARPipeline in AUTO_IMAGE2VIDEO_PIPELINES_MAPPING and
AUTO_VIDEO2VIDEO_PIPELINES_MAPPING.
- Document the weights_only=False requirement in the conversion script.
Tests
- Scheduler tests use the new step signature and verify the Output dataclass
contract.
- Drop the four obsolete training-weight tests; drop weight_type kwarg from
pipeline test fixtures; remove internal milestone names from TODO comments.
Docs
- Resolve <arxiv-id> in the scheduler docs page.
- Trim DMD / on-policy distillation language in EN/ZH tutorials and the
pipelines page; the paper abstract quote is preserved verbatim.
* [AnyFlow] split FAR causal transformer into transformer_anyflow_far.py
Per @dg845's review on #13745: extract FAR causal modules into a dedicated
sibling file so each transformer variant reads in isolation. Shared submodules
are duplicated via `# Copied from` so `make fix-copies` keeps both in sync.
- `transformer_anyflow.py`: bidi-only. `AnyFlowAttnProcessor` no longer carries
the flex/KV-cache branch (was: dispatch in one branch, bare flex_attention in
the other); `AnyFlowRotaryPosEmbed` drops the compressed-frame helpers and
the `is_causal` arg; `AnyFlowDualTimestepTextImageEmbedding` drops its causal
branch. `AnyFlowTransformerBlock` keeps a single class with a new
`is_causal: bool = False` ctor flag that selects the self-attn processor —
the forward path is identical in both modes, only the processor differs.
- `transformer_anyflow_far.py`: new. Contains `AnyFlowFARTransformerOutput`,
`AnyFlowCausalAttnProcessor` (routed through
`dispatch_attention_fn(backend="flex")` with a clear ValueError when a non-flex
backend is configured; the BlockMask is consumed only by the flex backend in
`_native_flex_attention`), `AnyFlowDualTimestepTextImageEmbeddingCausal`,
`AnyFlowCausalRotaryPosEmbed`, `AnyFlowFARTransformer3DModel`, and
`# Copied from` clones of the shared `AnyFlowAttention`/
`AnyFlowCrossAttnProcessor`/`AnyFlowImageEmbedding`/`AnyFlowTransformerBlock`/
`AnyFlowAttnProcessor` modules.
Verified against the pre-refactor branch on H200 (float32); bidi is bit-exact
and FAR agrees to within fp32 rounding:
- bidi: L2 = 0.000e+00, max|Δ| = 0.000e+00
- FAR : L2 = 4.772e-06, max|Δ| = 3.576e-07
The FAR delta is fp32 accumulation noise from the dispatch path permuting
(B,L,H,D) ↔ (B,H,L,D) around the same `flex_attention` kernel.
Addresses review comments at transformer_anyflow.py:215, :261, :450, :622,
:671, :958.
* [AnyFlow] pipeline cleanup: video_processor, encode_video, inline rollout, kwarg rename
Per @dg845's review on #13745, applied to both bidi `AnyFlowPipeline` and
causal `AnyFlowFARPipeline`:
- Use `self.video_processor.preprocess_video(...)` instead of the manual
`* 2 - 1` normalize.
- Merge `vae_encode` + `encode_latents` + `_normalize_latents` into a single
`encode_video` method, mirroring `WanImageToVideoPipeline.encode_image`'s
flat structure.
- Inline `_denoise_rollout` into `AnyFlowPipeline.__call__`. For the FAR
pipeline, inline both `_denoise_rollout` and `_inference` as a nested loop
(outer over chunks, inner over denoising steps), mirroring
`WanAnimatePipeline.__call__`. `encode_kv_cache` is intentionally kept as a
method — it is one transformer call with a different `kv_cache_flag` mode
(cache-write), and inlining it would interleave two distinct forward
semantics in the same loop body and lose readability.
- Rename `context_sequence` → `video` (pixel-space) + `video_latents`
(pre-encoded), matching `WanVideoToVideoPipeline`. For the FAR pipeline,
the old `{"raw"/"latent"}` dict form is replaced by the two kwargs.
Mutually-exclusive validation raises `ValueError`.
Addresses review comments at pipeline_anyflow.py:358, :372, :393, :473 and
pipeline_anyflow_far.py:395, :489, :675.
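The mutually-exclusive `video` / `video_latents` validation can be sketched as a small helper. `resolve_generation_mode` is a hypothetical function, and the frame-count rule (no context means T2V, one frame means I2V, more frames means V2V) is carried over from the earlier `context_sequence` description. Applying the same rule to pre-encoded latents is an assumption made here for brevity.

```python
from typing import Optional, Sequence


def resolve_generation_mode(
    video: Optional[Sequence] = None,
    video_latents: Optional[Sequence] = None,
) -> str:
    """Validate the two context kwargs and pick a mode from the frame count."""
    if video is not None and video_latents is not None:
        raise ValueError("Pass either `video` or `video_latents`, not both.")

    context = video if video is not None else video_latents
    if context is None:
        return "t2v"  # no conditioning frames: text-to-video
    if len(context) == 1:
        return "i2v"  # a single conditioning frame: image-to-video
    return "v2v"  # several conditioning frames: video-to-video


# Frames are represented by placeholder strings; real inputs would be tensors.
mode_none = resolve_generation_mode()
mode_image = resolve_generation_mode(video=["frame0"])
mode_video = resolve_generation_mode(video_latents=["z0", "z1", "z2"])
```

Raising early on conflicting inputs keeps the later encoding path simple, since at most one of the two kwargs is ever set by the time the context frames are encoded.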
* [AnyFlow] scheduler: N-length timesteps + step defaults r_timestep
Per @dg845's review on #13745:
- `set_timesteps(N)` now produces `N` timesteps backed by an internal
`sigmas[N+1]` linspace, matching
`FlowMatchEulerDiscreteScheduler.set_timesteps`. The final sigma (== 0) is the
implicit r-endpoint of the last step; the pipeline rollouts iterate
`for i, t in enumerate(timesteps)` without the old `[:-1]` slicing.
- `step(r_timestep=None)` now defaults to the next timestep on the schedule
(resolved via fp-tolerant `argmin` over `sigmas[:-1]`), instead of raising.
Any-step sampling is preserved when `r_timestep` is explicit. The raise
stays only for the case where the caller passes a `timestep` value that
isn't on the schedule and provides no `r_timestep` — there's no sensible
default in that case.
- Build sigmas in float64 on CPU then move to the target device, with a
float32 downcast for MPS / NPU (float64 isn't supported on those backends).
Pipeline rollout loops updated to compute
`r = sigmas[i + 1] * num_train_timesteps` for the model's `r_timestep` input and
pass `r_timestep=None` to `scheduler.step` (which resolves it from the schedule
internally).
Addresses review comments at scheduling_flow_map_euler_discrete.py:107 and
:148.
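The N-timesteps / N+1-sigmas layout can be illustrated in plain Python. This is a sketch under stated assumptions, not the scheduler's implementation: it assumes an unshifted linspace from 1 down to 0 and `num_train_timesteps=1000`, and the function names `make_schedule` and `rollout_pairs` are hypothetical.

```python
def make_schedule(num_inference_steps, num_train_timesteps=1000):
    """Build N timesteps backed by N + 1 sigmas (the last sigma is 0).

    Assumption: an unshifted linspace from 1.0 to 0.0; any shift the real
    scheduler applies is deliberately left out of this sketch.
    """
    n = num_inference_steps
    sigmas = [1.0 - i / n for i in range(n + 1)]
    # Only the first N sigmas become timesteps; sigmas[-1] == 0 is the
    # implicit r-endpoint of the final step.
    timesteps = [s * num_train_timesteps for s in sigmas[:-1]]
    return sigmas, timesteps


def rollout_pairs(sigmas, timesteps, num_train_timesteps=1000):
    """Pair each source timestep t with its target r, as the rollout loop does."""
    return [
        (t, sigmas[i + 1] * num_train_timesteps)
        for i, t in enumerate(timesteps)
    ]
```

With `N = 4` the loop visits four `(t, r)` pairs and the last `r` is 0, so no `[:-1]` slicing of `timesteps` is needed.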
* [AnyFlow] tests: regenerate via generate_model_tests.py; split bidi/FAR files
Per @dg845's review on #13745: replaced the hand-rolled transformer tests
with the standard mixin-based suite produced by
`utils/generate_model_tests.py`, and split the FAR causal model tests into
their own file to mirror the transformer file split.
- `tests/models/transformers/test_models_transformer_anyflow.py`: regenerated
bidi suite. Pulls in `ModelTesterMixin`, `MemoryTesterMixin`,
`TrainingTesterMixin`, `AttentionTesterMixin`, `TorchCompileTesterMixin` via
`BaseModelTesterConfig`, with `get_init_dict()` / `get_dummy_inputs()`
filled in for the small bidi config used in CI.
- `tests/models/transformers/test_models_transformer_anyflow_far.py`: new.
Same mixin set (TorchCompile is intentionally skipped — FAR's
`_build_causal_mask` uses `flex_attention.create_block_mask(_compile=False)`
which conflicts with the standard compile tester's assumptions; the bidi
file covers compile, FAR is bit-exact-validated end-to-end on H200 via the
pipeline replay). Also carries an `AnyFlowCausalAttnProcessor` smoke test
that exercises the backend gate (non-flex backends must raise) and asserts
the `AnyFlowFARTransformerOutput` dataclass exposes the expected fields.
Addresses review comments at test_models_transformer_anyflow.py:71 and :128.
* [AnyFlow] docs: update for video / video_latents kwarg rename
Following the pipeline kwarg refactor in e9d50b2, sweep the user-facing docs
to reflect the new API:
- `docs/source/en/api/pipelines/anyflow.md`: T2V / I2V / V2V code examples now
use `video=` instead of `context_sequence={"raw": ...}`. The "Generation
with AnyFlow (FAR Causal)" intro describes the new mutually-exclusive
`video` / `video_latents` selector.
- `docs/source/en/using-diffusers/anyflow.md`: the scenario selector table,
the "Image-to-video and video-to-video" walkthrough, and the closing note
about pre-encoded latents are all updated. `vae_encode` references are
replaced with `encode_video`.
* [AnyFlow] tests: skip FAR training tests on CPU (flex backward); align scheduler tests with N-length timesteps
- TestAnyFlowFARTransformer3DTraining: skip test_training / test_training_with_ema /
test_gradient_checkpointing_equivalence on CPU. FAR causal self-attn uses
torch.nn.attention.flex_attention whose backward kernel is GPU-only.
- test_scheduler_flow_map_euler_discrete: assert timesteps is N-length (not N+1) and
the sigma=0 r-endpoint lives in self.sigmas[-1]; test_step_one_shot_sampling now
exercises r_timestep=None (resolved from sigmas) since N=1 has no timesteps[1].
* [AnyFlow] docs: complete forward() Args: sections for check_forward_call_docstrings
main #13758 added utils/check_forward_call_docstrings.py which requires every signature
arg to appear as its own `name (...):` entry under Args:. Expand the bidi and FAR
transformer forward docstrings to list each parameter individually.
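The rule being enforced (every signature argument gets its own `name (...):` entry under `Args:`) can be approximated with a few lines of standard-library Python. This is only a rough sketch of the idea; it is not the logic of `utils/check_forward_call_docstrings.py`, and the helper name `missing_arg_docs` is hypothetical.

```python
import inspect
import re


def missing_arg_docs(fn):
    """List signature parameters with no `name (...):` entry under Args:.

    Hypothetical approximation of the check; the real utility may parse
    docstrings differently.
    """
    doc = inspect.getdoc(fn) or ""
    # Everything after the first "Args:" marker is treated as the args section.
    _, _, args_section = doc.partition("Args:")
    documented = set(
        re.findall(r"^\s*(\w+) \(.*?\):", args_section, flags=re.MULTILINE)
    )
    params = [p for p in inspect.signature(fn).parameters if p != "self"]
    return [p for p in params if p not in documented]


def forward(self, hidden_states, timestep, r_timestep=None):
    """Toy forward used only to exercise the check.

    Args:
        hidden_states (`torch.Tensor`):
            Input latents.
        timestep (`torch.Tensor`):
            Source timestep.
    """
```

Here `missing_arg_docs(forward)` reports `r_timestep`, the one parameter without its own entry.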
* [AnyFlow] apply 5/21 review suggestions (A: 1-click)
FAR transformer:
- AnyFlowCausalAttnProcessor: default _attention_backend = 'flex' (was None);
remove None from _SUPPORTED_BACKENDS. None previously fell through to SDPA
which silently ignored the BlockMask; failing loudly is the right default.
- dispatch_attention_fn call: read self._attention_backend instead of hardcoded
'flex', so '_native_flex' selection works.
- _build_freqs / _forward_full_frame: add '# Copied from' to bidi RoPE.
Pipelines:
- bidi + FAR docstrings: video shape (B, C, T, H, W) -> (B, T, C, H, W) to
match VideoProcessor.preprocess_video.
- FAR EXAMPLE_DOC_STRING: single-frame I2V tensor wrap uses unsqueeze(1) for the
T axis instead of unsqueeze(2).
- FAR encode_video: drop duplicated @torch.no_grad() decorator.
Tests:
- test_anyflow / test_anyflow_far: lift the test_save_load_optional_components
skip (the test actually passes).
- FAR processor smoke test: assert default backend is 'flex' (was None).
* [AnyFlow] apply 5/21 review suggestions (B: refactors)
Pipelines:
- check_inputs accepts video / video_latents and raises early on:
(a) mutual exclusion (was checked late in __call__);
(b) FAR's (num_frames - 1) % 4 == 0 constraint.
__call__ no longer carries duplicate validation.
- FAR pipeline: drop the show_progress kwarg and replace the single tqdm with
nested progress bars in the LLaDA-2 pattern — outer 'Chunks' (position=0)
and per-chunk inner 'Inference Steps' (position=1, leave=False) — both
picking up DiffusionPipeline._progress_bar_config (so set_progress_bar_config
controls them, including disable=None).
Scheduler:
- step() resolves source and target sigmas by indexing self.sigmas via the new
index_for_timestep(), instead of dividing the input timesteps by
num_train_timesteps. This keeps the math correct for any future schedule
whose timestep/sigma relationship is non-linear. For an off-schedule
r_timestep the code falls back to r / num_train_timesteps, so explicit
any-step sampling outside the schedule still works (and t off-schedule with
r=None still raises a clear ValueError, as before).
Numerical equivalence: for the shipped linspace+shift schedule the two
formulations are bit-identical (verified: max abs diff = 0.0 over an N=8,
shift=5 schedule).
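The sigma-resolution rule described under "Scheduler:" can be sketched as below. This is a simplified illustration under assumptions, not the scheduler's code: the tolerance value, the list-based lookup, and the helper name `resolve_sigmas` are hypothetical, and `index_for_timestep` here is a standalone stand-in for the method of the same name.

```python
def index_for_timestep(timesteps, t, tol=1e-4):
    """Return the index of `t` in `timesteps`, or None if it is off-schedule."""
    if not timesteps:
        return None
    best = min(range(len(timesteps)), key=lambda i: abs(timesteps[i] - t))
    return best if abs(timesteps[best] - t) <= tol else None


def resolve_sigmas(sigmas, timesteps, t, r=None, num_train_timesteps=1000):
    """Resolve (sigma_t, sigma_r) for one step.

    `sigmas` holds one more entry than `timesteps`; the extra entry is the
    terminal 0. On-schedule values are looked up in `sigmas`; off-schedule
    values fall back to dividing by `num_train_timesteps`.
    """
    i = index_for_timestep(timesteps, t)
    if r is None:
        if i is None:
            raise ValueError(
                "`timestep` is off-schedule and no `r_timestep` was given."
            )
        # Default target: the next sigma on the schedule.
        return sigmas[i], sigmas[i + 1]
    sigma_t = sigmas[i] if i is not None else t / num_train_timesteps
    j = index_for_timestep(timesteps, r)
    sigma_r = sigmas[j] if j is not None else r / num_train_timesteps
    return sigma_t, sigma_r
```

Looking sigmas up by index keeps the result correct even if a future schedule stops satisfying `sigma == timestep / num_train_timesteps`; the division is used only when a value is not on the schedule.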
* [AnyFlow] apply Claude bot review (5/21): 8 findings beyond dg845's list
Finding #1 — attention_kwargs plumbing:
Both transformers now decorate forward() with @apply_lora_scale('attention_kwargs')
(matches Wan); pipelines forward attention_kwargs to the transformer + encode_kv_cache,
and the unused parameter is dropped from the inner _forward_train / _forward_cache /
_forward_inference signatures. Pipeline docstrings updated to the standard wording.
Finding #2 — naming:
Rename far_cfg -> layout_cfg in the bidi transformer (the bidi path is not FAR; the
FAR transformer keeps far_cfg, which is accurate there).
Finding #3 — scheduler state machine:
Add _step_index, _begin_index, step_index property, begin_index property,
set_begin_index(), _init_step_index(). step() lazily initializes and advances the
counter so downstream callbacks / composable schedulers can observe rollout progress.
Sigma resolution remains a pure function of (timestep, r_timestep) — calling step()
twice with identical args still returns identical prev_sample (idempotent).
Finding #4 — redundant @torch.no_grad():
Drop the redundant decorators on bidi pipeline's encode_video and FAR pipeline's
encode_kv_cache (callers are already in __call__'s no-grad scope).
Finding #5 — dead code:
Remove the unreachable temb.ndim == 2 else branch from the bidi transformer's
output-norm path (condition_embedder.forward always returns a 3D temb).
Finding #6 — private rename:
forward_far_patchify[_inference] -> _forward_far_patchify[_inference] (only called
internally by _forward_train / _forward_cache / _forward_inference).
Finding #7 — pipeline comment numbering:
Bidi + FAR pipelines renumber steps so the # 4. slot is no longer skipped.
Finding #8 — mask-mod comment numbering:
_build_causal_mask numbered comments now run 1) 2) 3) ... (was 1) 3) 4) ...).
Tests:
- New test_step_index_advances + test_set_begin_index_anchors_step_index in the
scheduler test file exercise the new state machine.
- All existing pipeline / transformer / scheduler tests still pass (85 passed,
85 skipped on CPU).
Bit-exact: 8-step rollout vs the previous formulation, max abs diff = 0.0 (the new
sigma-lookup is byte-identical to t/num_train_timesteps on this schedule).
* [AnyFlow] scheduler: honour off-schedule any-step in _init_step_index; drop dead _resolve_next_timestep
Audit caught two issues in the previous scheduler commit:
1. The new state machine raised in _init_step_index whenever the first timestep
wasn't on the active schedule, contradicting the documented contract that
step() falls back to t/num_train_timesteps for off-schedule any-step
sampling. The fall-back numerics were intact but they were unreachable —
the init check fired first.
Fix: _init_step_index now initializes _step_index to 0 when the timestep is
off-schedule (still a valid observable counter for callbacks). step()'s
sigma resolution is untouched, so on-schedule rollouts stay bit-exact and
off-schedule any-step sampling actually runs again. Regression test:
test_step_off_schedule_anystep_supported.
2. _resolve_next_timestep had no remaining callers after the step() rewrite
inlined the same lookup. Removed (private helper, no external API).
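The step-counter behaviour after this fix can be sketched as a small standalone class. This is a hypothetical illustration, not the scheduler itself: the class name `StepCounter`, the tolerance, and the rule that an explicit begin index takes precedence are assumptions; only the fall-back to 0 for an off-schedule first timestep comes from the commit.

```python
class StepCounter:
    """Toy model of the scheduler's observable step counter."""

    def __init__(self, timesteps):
        self.timesteps = list(timesteps)
        self._step_index = None
        self._begin_index = None

    @property
    def step_index(self):
        return self._step_index

    def set_begin_index(self, begin_index=0):
        self._begin_index = begin_index

    def _init_step_index(self, timestep, tol=1e-4):
        if self._begin_index is not None:
            # Assumption: an explicit begin index takes precedence.
            self._step_index = self._begin_index
            return
        matches = [
            i for i, t in enumerate(self.timesteps) if abs(t - timestep) <= tol
        ]
        # Off-schedule first timestep: start counting from 0 instead of raising.
        self._step_index = matches[0] if matches else 0

    def step(self, timestep):
        # Lazily initialise on the first call, then advance by one.
        if self._step_index is None:
            self._init_step_index(timestep)
        self._step_index += 1
        return self._step_index
```

The counter is bookkeeping only; in this sketch, as in the commit's description, it does not feed into sigma resolution.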
* [AnyFlow] docs: align user guides with video shape + kwarg fixes
- en api/pipelines/anyflow.md: video shape (B, C, T, H, W) -> (B, T, C, H, W);
example tensor wrap uses unsqueeze(0).unsqueeze(1) and permute(0, 3, 1, 2)
to match VideoProcessor.preprocess_video's 5D contract.
- zh using-diffusers/anyflow.md: same shape fixes; also flip the I2V / V2V
examples from the obsolete context_sequence={...} dict to the current
video= / video_latents= kwargs; helper to_video_tensor returns (1, T, C, H, W);
add a note about mutual exclusion.
* [AnyFlow] tests: drop @slow integration test scaffolds for initial PR
.ai/skills/model-integration/SKILL.md is explicit: 'No integration / slow
tests in the initial PR — don't add anything gated on @slow / RUN_SLOW=1
yet.' Our two integration test classes were shape-only assertions with TODOs
for a future numeric reference, so dropping them loses no actual coverage —
the relevant rollouts are covered by H200 bit-exact replay outside the
pytest suite. Can land a follow-up PR after merge with proper numeric
reference slices once the maintainer is comfortable enabling slow tests.
* Apply style fixes
* [AnyFlow] apply 5/22 dg845 review: comment cleanups + custom sigmas/timesteps schedule
dg845 third pass: 7 of 9 comments applied as-is; the 8th (custom sigmas/timesteps support)
is implemented following FlowMatchEulerDiscreteScheduler conventions; the 9th (_build_causal_mask
refactor) is explicitly marked non-blocking and deferred to a follow-up that also
re-enables TorchCompileTesterMixin.
Comment cleanups:
- transformer_anyflow.py:704 temb output-norm comment: drop redundant 'no ndim==2 branch'.
- pipeline_anyflow.py:550 denoise loop comment: '# 6. Denoising loop'.
- pipeline_anyflow_far.py:684 denoise loop comment: '# 8. Denoising loop (outer over
chunks, inner over timesteps).'.
- pipeline_anyflow_far.py:702 drop trailing inline comment on `timesteps = scheduler.timesteps`.
- scheduling_flow_map_euler_discrete.py: clearer wording on the off-schedule `r_timestep`
error.
Custom schedule support:
- FlowMapEulerDiscreteScheduler.set_timesteps gains `sigmas` and `timesteps` kwargs
mirroring FlowMatchEulerDiscreteScheduler. Default behaviour is unchanged
(linspace + shift); the validation + length-N → length-N+1 terminal-0 append are
shared with the default path so on-schedule rollouts stay bit-exact.
- AnyFlowPipeline.__call__ and AnyFlowFARPipeline.__call__ accept `sigmas` and
`timesteps` kwargs, override num_inference_steps from their length, and forward
to set_timesteps (matches LTX2Pipeline pattern).
- New scheduler tests: test_set_timesteps_custom_sigmas and
test_set_timesteps_custom_timesteps cover both override paths.
Dtype skip on save/load:
- TestAnyFlowTransformer3D and TestAnyFlowFARTransformer3D now skip
test_from_save_pretrained_dtype_inference (parametrized over fp16/bf16), mirroring
WanTransformer3DModel's skip: the test's tolerance is too tight to give a
meaningful signal under AnyFlow's flow-map mixed-precision sampling.
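The shared "length-N in, length-N+1 out" handling for default and custom schedules can be sketched in plain Python. This is an illustration under assumptions rather than the scheduler's code: the function name `build_schedule` is hypothetical, the default path uses an unshifted linspace, custom `timesteps` are converted with a plain division, and rejecting both overrides at once is a choice made for this sketch.

```python
def build_schedule(num_inference_steps=None, sigmas=None, timesteps=None,
                   num_train_timesteps=1000):
    """Return (sigmas with terminal 0, timesteps, number of steps)."""
    if sigmas is not None and timesteps is not None:
        # Assumption: this sketch accepts only one override at a time.
        raise ValueError("Pass only one of `sigmas` or `timesteps`.")
    if sigmas is not None:
        working = [float(s) for s in sigmas]
    elif timesteps is not None:
        working = [float(t) / num_train_timesteps for t in timesteps]
    else:
        if num_inference_steps is None:
            raise ValueError("`num_inference_steps` is required by default.")
        n = num_inference_steps
        working = [1.0 - i / n for i in range(n)]
    if not working:
        raise ValueError("The schedule must contain at least one step.")
    # Shared tail: append the terminal sigma so step i can read sigmas[i + 1].
    full_sigmas = working + [0.0]
    out_timesteps = [s * num_train_timesteps for s in working]
    # The step count is derived from the schedule length, as the pipelines do
    # when `sigmas` or `timesteps` is passed.
    return full_sigmas, out_timesteps, len(working)
```

Because all three paths end in the same append step, a custom schedule that reproduces the default values yields the same sigmas as the default path.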
* [AnyFlow] docs: apply hf-doc-builder line wrap (max_len 119)
CI doc-builder style check flagged 3 files with docstring lines >119 chars.
Ran 'doc-builder style src/diffusers docs/source --max_len 119' to autoformat;
content unchanged, line wrapping only.
* [AnyFlow] apply 5/22 follow-up review: new_zeros terminal sigma + cleanup
dg845 blocking suggestion (r3287274209):
- scheduling_flow_map_euler_discrete.py:185 — use `working_sigmas.new_zeros(1)`
instead of `torch.zeros(1, dtype=...)` so the appended terminal sigma inherits
both device and dtype from working_sigmas. The current working_sigmas always
starts on CPU so the device mismatch is latent, but new_zeros is the correct
defensive pattern and matches how the published FAR test fixtures run on CUDA.
Claude bot final-review follow-ups:
- transformer_anyflow_far.py: drop three stale `# step 3: generate attention mask`
comments left over from the original numbered-step structure (bot #6).
- pipeline_anyflow_far.py: annotate `encode_video` with
`# Copied from diffusers.pipelines.anyflow.pipeline_anyflow.AnyFlowPipeline.encode_video`
and align docstring + inline comment so `make fix-copies` keeps them in sync (bot #3).
Skipped (not real / judgment-call):
- bot #2 (private rename of `_forward_far_patchify*`) — already done in 84605d5;
bot was looking at a stale snapshot.
- bot #4 (check_inputs `# Copied from`) — FAR's check_inputs has an extra
`(num_frames - 1) % 4 == 0` constraint that doesn't map onto the bidi version,
so a clean `# Copied from` link would require restructuring. Bot called it a
consistency nit; leaving as-is.
- bot #5 (`encode_kv_cache` → `_encode_kv_cache`) — bot itself flagged this as
judgment-call territory; the helper is a coherent operation that advanced
inference callers may want to invoke directly.
---------
Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>