Guian Fang e39aecff57
Add AnyFlow Any-Step Video Diffusion Pipelines (Bidirectional + FAR Causal) (#13745)
* [Pipelines] AnyFlow: scaffold pipelines/anyflow + register all top-level imports

This is the lazy-loader scaffolding only. Body files (pipeline_anyflow.py,
pipeline_anyflow_causal.py, transformer_anyflow.py,
scheduling_flow_map_euler_discrete.py) come in subsequent commits.

* [Schedulers] AnyFlow: add FlowMapEulerDiscreteScheduler

The flow-map scheduler advances samples from timestep t to caller-provided
target r in a single Euler step, supporting any-step sampling on flow-map-
distilled checkpoints. It is a general-purpose scheduler — not specific to the
AnyFlow checkpoints.
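
As a rough illustration of the step described above, the sketch below assumes
the model output is a velocity and that the update is linear in the interval,
x_r = x_t + (r - t) * v, with t and r given as sigmas in [0, 1]. The function
name `flow_map_euler_step` and the plain-list sample are hypothetical and are
not the scheduler's API.

```python
from typing import List


def flow_map_euler_step(sample: List[float], velocity: List[float], t: float, r: float) -> List[float]:
    """One Euler step from sigma t to sigma r (assumed update: x_r = x_t + (r - t) * v)."""
    if len(sample) != len(velocity):
        raise ValueError("sample and velocity must have the same length")
    return [x + (r - t) * v for x, v in zip(sample, velocity)]


# Toy straight-line path x_t = (1 - t) * data + t * noise, whose velocity is noise - data.
data = [0.5, -1.0]
noise = [2.0, 3.0]
velocity = [n - d for n, d in zip(noise, data)]

# Zero-length interval: the sample comes back unchanged.
same = flow_map_euler_step(noise, velocity, t=1.0, r=1.0)

# One-shot sampling: a single step from t=1 (noise) to r=0 recovers the data on this toy path.
one_shot = flow_map_euler_step(noise, velocity, t=1.0, r=0.0)
```

On this toy path a single step is exact because the velocity is constant; a
distilled model would only approximate that behaviour.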

Tests: 12 standalone tests covering instantiation, set_timesteps endpoints,
shift identity/monotonicity, step shape preservation, zero-interval identity,
one-shot sampling, train weight schemes, scale_noise endpoints.

Docs: api/schedulers/flow_map_euler_discrete.md

* [Models] AnyFlow: add AnyFlowTransformer3DModel

A 3D DiT extending the v0.35.1 Wan2.1 backbone with two config-toggled modules:
* FAR causal blocks (init_far_model=True): block-sparse causal attention via
  flex_attention + compressed-frame patch embedding for frame-level
  autoregressive generation (Gu et al., 2025, arXiv:2503.19325).
* Dual-timestep flow-map embedding (init_flowmap_model=True): adds a delta
  timestep embedder enabling flow-map sampling z_t -> z_r over arbitrary
  intervals (AnyFlow).

With both flags off, the model reduces to stock Wan2.1.

The class is intentionally self-contained rather than annotated with
'# Copied from diffusers.models.transformers.transformer_wan' because upstream
Wan has been refactored extensively since v0.35.1 (new WanAttention class,
different processor architecture).

Tests: 9 unit tests covering construction in 3 modes, bidi forward shape and
determinism, return_dict variants, save/load round-trip with and without
init_far_model, gradient checkpointing toggle.

Docs: api/models/anyflow_transformer3d.md

* [Pipelines] AnyFlow: add AnyFlowPipeline and AnyFlowCausalPipeline

* AnyFlowPipeline (pipeline_anyflow.py, ~590 LOC): bidirectional T2V using
  flow-map sampling. Loads checkpoints from nvidia/AnyFlow-Wan2.1-T2V-{1.3B,14B}.
* AnyFlowCausalPipeline (pipeline_anyflow_causal.py, ~700 LOC): FAR-based
  causal pipeline supporting T2V/I2V/TV2V via task_type kwarg. Loads checkpoints
  from nvidia/AnyFlow-FAR-Wan2.1-{1.3B,14B}-Diffusers.

Both pipelines reuse stock WanLoraLoaderMixin, AutoencoderKLWan, UMT5EncoderModel,
and AutoTokenizer from upstream. The transformer is the AnyFlowTransformer3DModel
introduced in the previous commit. The scheduler is FlowMapEulerDiscreteScheduler.

Tests:
* tests/pipelines/anyflow/test_anyflow.py: PipelineTesterMixin fast tests +
  slow integration test against nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers.
* tests/pipelines/anyflow/test_anyflow_causal.py: same structure for FAR variant.

Reference slices for slow integration tests are deferred to Phase 7
(Final quality pass) where the user runs them on a real GPU.

* [Docs] AnyFlow: add main pipeline documentation page

Modeled on the Helios pipeline doc (PR #13208). Sections: paper link + abstract,
supported checkpoints table, memory/speed optimization tabs, T2V/I2V/TV2V
examples for both bidirectional and causal variants, autodoc trailers.

* [Auto/Scripts] AnyFlow: register AutoPipelineForText2Video + add conversion script

* Register AnyFlowPipeline in AUTO_TEXT2VIDEO_PIPELINES_MAPPING.
* AnyFlowCausalPipeline is intentionally NOT registered for AutoPipeline because
  its task switch (t2v / i2v / tv2v) is too rich for a single auto-resolve key.
* scripts/convert_anyflow_to_diffusers.py: convert .pt training checkpoints
  (with 'ema' state dict) into a diffusers save_pretrained layout. Supports all
  4 released NVIDIA AnyFlow variants. Replaces the omegaconf-based config in the
  upstream repo with argparse to match other diffusers conversion scripts.

* [Quality] AnyFlow: ruff-format + regenerated dummy stubs

* ruff format pass on all 5 source files (long lines + trailing comma fixes)
* check_dummies.py --fix_and_overwrite regenerated:
  - dummy_pt_objects.py: AnyFlowTransformer3DModel + FlowMapEulerDiscreteScheduler
  - dummy_torch_and_transformers_objects.py: AnyFlowPipeline + AnyFlowCausalPipeline

Local fast tests: 21/21 passed
  - 12 scheduler tests (FlowMapEulerDiscreteScheduler)
  - 9 transformer tests (AnyFlowTransformer3DModel construction + bidi forward + save/load)

The pipeline fast tests in tests/pipelines/anyflow/ require a local dev install
whose transformers version meets the minimum supported by diffusers main.
The reference slices for slow integration tests (real GPU + 1.3B/14B
checkpoints) are intentionally left as TODO stubs to be captured by the user
on a real GPU machine before opening the PR.

* [AnyFlow] address review feedback: bug fixes + DMD wording + EN/ZH tutorials

Critical bug fixes (verified against precision-validation review):
* pipeline_anyflow.py / pipeline_anyflow_causal.py: replace hardcoded
  transformer_dtype = torch.bfloat16 with self.transformer.dtype, so
  pipe.to("cpu") and PipelineTesterMixin save/load tests do not crash on a
  dtype mismatch in the patch_embedding conv3d.
* transformer_anyflow.py: drop the duplicate `base = base = ...` assignment in
  _build_causal_mask (was a copy-paste typo carried over from FAR-Dev).
* transformer_anyflow.py: drop unused `q_is_context` / `k_is_context` locals
  and the `# noqa: F841` markers that were silencing the dead-store warning.
* transformer_anyflow.py: remove `CacheMixin` from the inheritance list — the
  pipeline manages KV cache directly, the mixin's interface is unused.
* transformer_anyflow.py: guard the module-level `torch.compile(flex_attention)`
  with try/except so the file imports cleanly on CPU CI / no-Triton machines.
* convert_anyflow_to_diffusers.py: replace ad-hoc print warnings with the
  stdlib logger (warning_once-style) and a module-level basicConfig.

Documentation accuracy:
* AnyFlowCausalPipeline class docstring + main pipeline doc + EN/ZH tutorial:
  drop the fictitious `task_type` / `image` / `video` arguments and document
  the real API: pass `context_sequence={"raw": tensor}` (or `{"latent": ...}`)
  to switch between T2V (None) / I2V (1-frame) / TV2V (4n+1-frame) modes.
* Pipeline class docstrings + main doc: explicitly describe AnyFlow's
  two-stage LoRA distillation including DMD reverse-divergence supervision
  with Flow-Map backward simulation in stage 2 (was previously implicit).
* training_rollout: add detailed docstring explaining its role as the
  3-segment Flow-Map backward simulation entry point used during DMD training.
* Long-form tutorial doc `using-diffusers/anyflow.md` (EN, 239 LOC) and
  Chinese mirror `docs/source/zh/using-diffusers/anyflow.md` (224 LOC) added
  and registered in both `_toctree.yml` files.

Tests:
* Skip `test_attention_slicing_forward_pass` in both pipeline test classes
  with a clear rationale (custom attention processor does not support slicing).
* All 21 standalone tests still pass (12 scheduler + 9 transformer).

Quality gates:
* `ruff check` clean across all AnyFlow files.
* `ruff format --check` reports 6 files already formatted.
* `python utils/check_copies.py` reports no diff.

Out of scope for this commit (deferred until reviewer feedback):
* Splitting AnyFlowTransformer3DModel into bidi + causal subclasses
* Unifying _forward_inference / _forward_cache return types
* Migrating model tests from plain unittest to BaseModelTesterConfig + mixins
* HF model card / config.json metadata updates on the nvidia/* repos
  (push to Hub manually before opening the PR)

* [AnyFlow] rename Causal->FAR + explicit forward signature + dataclass output

Round 2 of review feedback. Three groups of changes; transformer state-dict
keys, module hierarchy, and tensor flow are unchanged so the H200 bit-exact
validation remains valid.

A. Pipeline rename (mechanical, no behavior change):
   * Class: AnyFlowCausalPipeline -> AnyFlowFARPipeline (Causal in diffusers
     usually means an attention mask; AnyFlow's variant is FAR autoregressive,
     so the FAR name is more specific and matches the paper).
   * File: pipeline_anyflow_causal.py -> pipeline_anyflow_far.py (git mv).
   * Test file: test_anyflow_causal.py -> test_anyflow_far.py (git mv).
   * All references updated in src/, tests/, docs/, scripts/, plus stale
     anyflowcausalpipeline anchor links in tutorial markdown.

B. Pipeline test bug fixes (closes 19 fast-test failures reported by
   precision-validation reviewer):
   * pipeline_anyflow.py / pipeline_anyflow_far.py: __call__ now sets
     self._num_timesteps = num_inference_steps before the rollout, so the
     PipelineTesterMixin callback tests can read pipe.num_timesteps.
   * tests/pipelines/anyflow/test_anyflow_far.py: drop the fictitious
     task_type="t2v" kwarg that crashed every causal fast test (the FAR
     pipeline selects mode via context_sequence, not a task_type arg).

C. Transformer architecture cleanups (review-driven, no tensor changes):
   * Replace forward(*args, **kwargs) dispatcher with an explicit signature
     listing every supported kwarg (hidden_states, timestep, r_timestep,
     encoder_hidden_states, encoder_hidden_states_image, chunk_partition,
     clean_hidden_states, clean_timestep, kv_cache, kv_cache_flag, is_causal,
     attention_kwargs, return_dict). Helps IDE / type-checker / torch.compile
     tracing.
   * Drop SimpleNamespace returns. Add AnyFlowFARTransformerOutput
     (BaseOutput dataclass with sample + kv_cache fields) for the two causal
     paths that need to also propagate kv_cache (_forward_inference and the
     newly return_dict-aware _forward_cache). _forward_train and
     _forward_bidirection now consistently return Transformer2DModelOutput.
     Pipeline call sites already use return_dict=False with positional
     unpacking, so the fix is transparent there.

Out of scope (deferred until canonical-org HF metadata sync):
   * Splitting AnyFlowTransformer3DModel into a bidi class plus an
     AnyFlowFARTransformer3DModel subclass — touches register_to_config keys
     and would require updating model_index.json on every released checkpoint.
   * Promoting chunk_partition from register_to_config to a forward-time
     argument (same reason).
   * Renaming training_rollout to _denoise — would break callers in the
     FAR-Dev on-policy trainer that produced the released checkpoints.

Local fast tests: 21/21 still pass (12 scheduler + 9 transformer).
ruff check, ruff format, and check_copies.py are all clean.

* [AnyFlow] wire callback_on_step_end through inference_range + add chunk_partition to FAR fast-test fixture

Two root causes for the 19 remaining PipelineTesterMixin failures, identified
by the H200 reviewer:

1. callback_on_step_end was accepted by __call__ but never invoked. Both
   pipelines pass it through to training_rollout (and FAR additionally through
   inference()), and inference_range now fires it after scheduler.step in
   the standard inference branch:

       if callback_on_step_end is not None:
           callback_kwargs = {k: locals()[k] for k in callback_on_step_end_tensor_inputs}
           callback_outputs = callback_on_step_end(self, i, t, callback_kwargs)
           latents = callback_outputs.pop("latents", latents)
           prompt_embeds = ...
           negative_prompt_embeds = ...

   `nonlocal prompt_embeds, negative_prompt_embeds` lets the callback rewrite
   the closure-captured embeddings, matching upstream WanPipeline semantics.
   The 3-segment grad_timestep training rollout does not invoke the callback;
   it is intentionally training-only.

2. tests/pipelines/anyflow/test_anyflow_far.py::get_dummy_components built
   the dummy transformer without a `chunk_partition`, leaving it None on the
   model config and crashing the pipeline at `sum(self.transformer.config.chunk_partition)`.
   Set `chunk_partition=[1, 1, 1]` in the fixture (3 chunks of 1 latent frame
   each, matching the test's num_frames=9 -> 3 latent frames).
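
The callback wiring from item 1 can be sketched as a toy loop. This is a
minimal sketch, not the pipeline code: `run_loop` is a hypothetical helper,
latents are a single float, and the halving line stands in for the transformer
forward plus `scheduler.step`. It uses an explicit dict instead of `locals()`
to collect the callback inputs.

```python
from typing import Any, Callable, Dict, List, Optional


def run_loop(
    timesteps: List[float],
    latents: float,
    callback_on_step_end: Optional[Callable[[Any, int, float, Dict[str, Any]], Dict[str, Any]]] = None,
    callback_on_step_end_tensor_inputs: Optional[List[str]] = None,
) -> float:
    """Toy denoising loop that fires the callback after every stand-in scheduler step."""
    tensor_inputs = callback_on_step_end_tensor_inputs or ["latents"]
    for i, t in enumerate(timesteps):
        latents = latents * 0.5  # stand-in for transformer forward + scheduler.step
        if callback_on_step_end is not None:
            available = {"latents": latents}
            callback_kwargs = {k: available[k] for k in tensor_inputs}
            callback_outputs = callback_on_step_end(None, i, t, callback_kwargs)
            # The callback may return a replacement; otherwise keep the current value.
            latents = callback_outputs.pop("latents", latents)
    return latents


calls: List[int] = []


def zero_on_last_step(pipe, step_index, timestep, callback_kwargs):
    """Example callback: record each call and zero the latents on the final step."""
    calls.append(step_index)
    if step_index == 2:
        callback_kwargs["latents"] = 0.0
    return callback_kwargs


without_callback = run_loop([1000.0, 500.0, 250.0], latents=8.0)
with_callback = run_loop([1000.0, 500.0, 250.0], latents=8.0, callback_on_step_end=zero_on_last_step)
```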

Local fast tests: 21/21 still pass.
ruff check, ruff format, and check_copies.py are all clean.

* [AnyFlow] Phase 2: split transformer + drop chunk_partition from config + rename helpers

Major architectural refactor that aligns the integration with diffusers conventions
ahead of the canonical-org Hub upload. State-dict keys, module hierarchy, and
tensor flow are unchanged so the H200 bit-exact validation remains valid; only
the on-disk transformer/config.json fields move.

Changes:

1. **Sibling transformer classes** replace the flag-driven single class:
   * AnyFlowTransformer3DModel — bidirectional only. Drops compressed_patch_size /
     full_chunk_limit / init_far_model / init_flowmap_model / chunk_partition
     kwargs (always-on for AnyFlow distilled checkpoints).
   * AnyFlowFARTransformer3DModel — adds far_patch_embedding + the 3 FAR forward
     paths (train / cache-prefill / autoregressive inference).
   * AnyFlowTimeTextImageEmbedding (the legacy single-time embedder used only by
     the old setup_flowmap_model bootstrap) is removed; both classes now build
     AnyFlowDualTimestepTextImageEmbedding directly in __init__.
   * setup_flowmap_model / setup_far_model methods are removed; weight warm-start
     for far_patch_embedding (trilinear interpolation from patch_embedding) moves
     into AnyFlowFARTransformer3DModel.__init__.

2. **chunk_partition** is no longer a model config field. The FAR pipeline owns
   the schedule:
   * AnyFlowFARPipeline.default_chunk_partition = [1, 3, 3, 3, 3, 3, 3, 2]
     matches the released 81-frame NVIDIA checkpoints.
   * AnyFlowFARPipeline.__call__ / _denoise_rollout accept a chunk_partition
     argument that overrides the default for non-default num_frames.

3. **training_rollout -> _denoise_rollout** rename across both pipelines and all
   English / Chinese docs that referenced it. Signals the method is internal to
   the pipeline driver, not a public training API.

4. **Conversion script + tests + docs + registries**:
   * scripts/convert_anyflow_to_diffusers.py: VARIANTS dict picks the right
     transformer class per variant; init_far_model / init_flowmap_model /
     chunk_partition kwargs are removed from the from_pretrained call.
   * Transformer test file split into AnyFlowTransformer3DModelTest and
     AnyFlowFARTransformer3DModelTest classes.
   * Pipeline test fixtures use the right class and pass chunk_partition via
     get_dummy_inputs (3-frame schedule [1, 1, 1] for the 9-frame test).
   * New docs page docs/source/en/api/models/anyflow_far_transformer3d.md;
     anyflow_transformer3d.md rewritten for the bidi-only class.
   * AnyFlowFARTransformer3DModel registered in src/diffusers/__init__.py,
     src/diffusers/models/__init__.py, models/transformers/__init__.py and the
     dummy_pt_objects.py stubs.
   * docs/source/en/_toctree.yml: new entry for the FAR transformer page.

5. **Cleanups**:
   * Pipeline __call__ no longer passes is_causal=False to the bidi forward (the
     bidi class doesn't accept it).
   * Pipeline class docstrings drop stale references to init_*_model flags.

Local tests: 22/22 pass (12 scheduler + 10 transformer covering both classes).
ruff check / format / check_copies clean.

Hub artifacts (model_index.json, transformer/config.json, scheduler config) need
to be regenerated for the released checkpoints; the HF update guide will be
delivered separately.

* [AnyFlow] Phase 3: convention compliance against .ai/AGENTS.md + .ai/models.md

Hard violations (per official diffusers guidelines):

* drop einops dependency — replace 25+ rearrange() calls with native
  permute/reshape/unflatten in transformer + both pipelines
* device-gate torch.float64 — apply_rotary_emb and AnyFlowRotaryPosEmbed now
  fall back to float32 / complex64 on MPS / NPU; freqs are lazily rebuilt
  per-device via _build_freqs (matches transformer_wan / transformer_flux
  pattern)
* migrate attention to dispatch_attention_fn — replace direct
  F.scaled_dot_product_attention calls with dispatch_attention_fn (works
  with sage / flash / native backends); introduce AnyFlowAttention(
  AttentionModuleMixin) with _default_processor_cls / _available_processors;
  rename processors to AnyFlowAttnProcessor / AnyFlowCrossAttnProcessor and
  declare _attention_backend / _parallel_config class attrs
* drop dead config fields — qk_norm and added_kv_proj_dim are pruned from
  both transformer __init__ signatures and AnyFlowTransformerBlock;
  AnyFlowAttention is hardcoded to rms-norm-across-heads (the only scheme
  the released checkpoints use) and has no add_k_proj path (T2V only)
* add _repeated_blocks = ["AnyFlowTransformerBlock"] to both transformer
  classes for compile_repeated_blocks() support (matches Wan)
* annotate prepare_latents with `# Copied from diffusers.pipelines.wan.
  pipeline_wan.WanPipeline.prepare_latents`; the pipeline-side rearrange
  to (B, T, C, H, W) layout is moved to the call site

State-dict keys are preserved (legacy Attention had identical to_q / to_k /
to_v / to_out / norm_q / norm_k naming), so existing AnyFlow checkpoints load
bit-exactly into the new AnyFlowAttention class.

The HF Hub config-update guide is updated correspondingly: transformer/
config.json now drops qk_norm and added_kv_proj_dim alongside the previous
init_far_model / init_flowmap_model / chunk_partition removals.

22 fast CPU tests still pass; ruff format / ruff check / check_copies all
clean.

* [AnyFlow] FAR fast-test compat: rope 0-dim guard + flex_attention CPU/head-dim fallbacks + KV-cache dtype + num_timesteps

Phase 3 migrated bidi + cross-attention to dispatch_attention_fn but the FAR
causal path still calls flex_attention directly, which has hard requirements
(CPU compile, head_dim >= 16) that fail on PipelineTesterMixin's tiny dummy
components. Real ckpts (head_dim=128, CUDA) never hit these branches; bit-exact
numerical equivalence with FAR-Dev preserved on all 4 released ckpts (forward
0.00e+00, backward kernel-nondet only, ratio 1.000).

Code fixes:

1. AnyFlowRotaryPosEmbed._forward_compressed_frame / _forward_full_frame now
   short-circuit to an empty tensor when num_frames / height / width is 0.
   PipelineTesterMixin's dummy VAE has scale_factor_spatial=8, so a 16x16 raw
   spatial input becomes a 2x2 latent which then floors to 0 against
   compressed_patch_size=(1, 4, 4); the original
   `freqs[:0].view(0, k, 1, -1)` reshape was ambiguous in that regime.

2. flex_attention dispatch: split the module-load
   `torch.compile(flex_attention, dynamic=True)` into `_flex_attention_eager`
   (always available) plus `_flex_attention_compiled`, with a tiny wrapper
   that picks compiled for CUDA tensors and eager for CPU. Avoids
   torch._inductor C++ codegen failures that broke fast tests after
   `pipe.to("cpu")`. CUDA performance unchanged (L10 benchmark: 0.0% delta on
   bidi 1.3B fwd, 0.0% delta on FAR causal 1.3B fwd).

3. AnyFlowAttnProcessor (FAR causal branch): when head_dim < 16
   (flex_attention's hard minimum), zero-pad the last dim of q/k/v to 16 and
   pass `scale=1/sqrt(original_head_dim)` to flex_attention. The zero padding
   leaves q·k unchanged and adds only zero value channels, so trimming the
   output back to head_dim is mathematically equivalent.
   Released ckpts use head_dim=128 so the branch is never taken in production.

4. pipeline_anyflow_far.encode_kv_cache: replace the hardcoded
   `latents.to(torch.bfloat16)` with `self.transformer.dtype`. The hardcoded
   bf16 crashed conv3d on dummy fp32 components ("Input type (BFloat16) and
   bias type (float) should be the same"); real bf16 ckpts are unaffected.

5. pipeline_anyflow_far._denoise_rollout sets
   `self._num_timesteps = (len(chunk_partition) - num_context_chunks) * num_inference_steps`
   before the chunk loop, so PipelineTesterMixin.test_callback_cfg's
   `pipe.num_timesteps`-based assertion matches the actual number of callback
   fires (chunks * NFE) instead of the previous hardcoded num_inference_steps.
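
The equivalence claimed in item 3 can be checked on a toy single-head example.
This is a minimal sketch in plain Python, not the processor code: `attention`
is ordinary softmax attention standing in for flex_attention, and
`padded_attention` is a hypothetical helper that pads the last dim to 16 while
keeping the scale tied to the original head_dim.

```python
import math
from typing import List

Matrix = List[List[float]]


def attention(q: Matrix, k: Matrix, v: Matrix, scale: float) -> Matrix:
    """Plain softmax attention for one head: softmax(q @ k^T * scale) @ v."""
    out = []
    for q_row in q:
        scores = [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([sum(w * v_row[j] for w, v_row in zip(weights, v)) / total for j in range(len(v[0]))])
    return out


def pad_last_dim(x: Matrix, target: int) -> Matrix:
    """Append zeros to every row until it has `target` entries."""
    return [row + [0.0] * (target - len(row)) for row in x]


def padded_attention(q: Matrix, k: Matrix, v: Matrix, min_head_dim: int = 16) -> Matrix:
    """Pad q/k/v to min_head_dim, attend with the ORIGINAL scale, then trim the output."""
    head_dim = len(q[0])
    scale = 1.0 / math.sqrt(head_dim)  # computed from the original head_dim, not the padded one
    if head_dim >= min_head_dim:
        return attention(q, k, v, scale)
    out = attention(
        pad_last_dim(q, min_head_dim), pad_last_dim(k, min_head_dim), pad_last_dim(v, min_head_dim), scale
    )
    return [row[:head_dim] for row in out]


q = [[0.1, 0.2, 0.3, 0.4], [1.0, -1.0, 0.5, 0.0]]
k = [[0.5, 0.1, -0.2, 0.3], [0.0, 1.0, 0.0, -1.0], [0.2, 0.2, 0.2, 0.2]]
v = [[1.0, 2.0, 3.0, 4.0], [-1.0, 0.0, 1.0, 2.0], [0.5, 0.5, 0.5, 0.5]]
reference = attention(q, k, v, scale=1.0 / math.sqrt(4))
padded = padded_attention(q, k, v)
```

Passing the scale explicitly matters: a default of 1/sqrt(16) on the padded
tensors would change the softmax weights.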

Tests:

* test_callback_inputs cannot pass without changing FAR's chunk-wise output
  semantics — it zeroes latents on the final step and asserts the *entire*
  output buffer is zero, but only the active chunk's slice is overwritten in
  a chunk-wise rollout. Marked `@unittest.skip` with a detailed rationale;
  callback functionality itself is still covered by test_callback_cfg.
* Full pytest run on tests/pipelines/anyflow/ +
  tests/models/transformers/test_models_transformer_anyflow.py +
  tests/schedulers/test_scheduler_flow_map_euler_discrete.py: 81 passed,
  0 failed, 11 skipped.

Quality gates:

* `ruff check` and `ruff format --check` clean across all AnyFlow files.
* `python utils/check_copies.py` clean.
* `python utils/check_dummies.py` clean.

* [AnyFlow] docs/code: paper-release tidy-up

User-facing alignment with the official HF Hub model card and the day-of-announcement
materials at https://huggingface.co/collections/nvidia/anyflow.

* Fill in the arXiv identifier 2605.13724 (5 paper links + 2 BibTeX entries).
* Rename TV2V → V2V across docs + pipeline_anyflow{,_far}.py so the diffusers
  copy uses the same Video-to-Video terminology as the official model card.
* Add the [nvidia/anyflow](https://huggingface.co/collections/nvidia/anyflow)
  HF collection link to the three tutorial intros.
* Drop the temporary "guyuchao/* staging" tip from the EN tutorial / API page
  / ZH tutorial — the nvidia/AnyFlow-*-Diffusers repos are now live.
* Wire up NVlabs/AnyFlow (training code) and nvlabs.github.io/AnyFlow (project
  page) in place of the prior <github-org> / <project-page-url> placeholders.
* Cite the authors (Yuchao Gu, Guian Fang et al.) and NUS ShowLab × NVIDIA
  affiliation in the main tutorial, API pipeline page, and both transformer
  model pages; BibTeX uses the standard `and others` to elide the full list
  until the next pass.

Working tree, CI gates, and tests after the change:

  ruff format --check                                  ✓
  ruff check                                           ✓
  python utils/check_copies.py                         ✓
  python utils/check_dummies.py                        ✓
  pytest tests/models + tests/schedulers (22 fast)     ✓

No production code logic changes — only docstring wording inside pipeline
files (TV2V → V2V).

* [AnyFlow] docs: drop in official BibTeX (full author list)

Replace the placeholder ``@article{gu2026anyflow, author = {Gu, Yuchao and
Fang, Guian and others}, ...}`` block in both the English and Chinese
tutorials with the canonical ``@misc{gu2026anyflowanystepvideodiffusion,
...}`` form from arxiv.org/abs/2605.13724, which lists all seven authors:
Yuchao Gu, Guian Fang, Yuxin Jiang, Weijia Mao, Song Han, Han Cai,
Mike Zheng Shou.

Docs-only.

* [AnyFlow] align with diffusers conventions + drop training-only code

Scheduler
- FlowMapEulerDiscreteScheduler.step now returns a
  FlowMapEulerDiscreteSchedulerOutput dataclass (or tuple with return_dict=False)
  and uses the conventional positional order (model_output, timestep, sample,
  r_timestep).
- Drop training-only helpers: adaptive_weighting, set_train_weight,
  get_train_weight, linear_timesteps_weights, and the weight_type config field.
- Add scale_model_input no-op for API parity; raise ValueError on missing
  r_timestep.
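
The output contract above follows the usual `return_dict` pattern. The sketch
below is illustrative only: `StepOutput` is a hypothetical stand-in for
FlowMapEulerDiscreteSchedulerOutput, the `prev_sample` field name is an
assumption borrowed from other diffusers schedulers, and timesteps are treated
as sigmas in [0, 1] with a linear Euler update.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass
class StepOutput:
    """Hypothetical stand-in for the scheduler's output dataclass."""

    prev_sample: float


def step(
    model_output: float,
    timestep: float,
    sample: float,
    r_timestep: Optional[float] = None,
    return_dict: bool = True,
) -> Union[StepOutput, Tuple[float]]:
    """Advance `sample` from `timestep` to `r_timestep` (assumed linear Euler update)."""
    if r_timestep is None:
        raise ValueError("r_timestep must be provided")
    prev_sample = sample + (r_timestep - timestep) * model_output
    if not return_dict:
        return (prev_sample,)
    return StepOutput(prev_sample=prev_sample)


as_dataclass = step(2.0, 1.0, 3.0, r_timestep=0.5)
as_tuple = step(2.0, 1.0, 3.0, r_timestep=0.5, return_dict=False)
```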

Transformer
- Remove gate_track debug write inside
  AnyFlowDualTimestepTextImageEmbedding.forward_timestep.
- Compile flex_attention lazily on first CUDA call instead of at import time.
- Replace assert with ValueError in build_block_mask.
- Resolve <arxiv-id> placeholders to 2605.13724.
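
The lazy-compile change can be pictured with the pattern below. It is a
minimal sketch with stand-ins so it runs without torch: `_compile` plays the
role of `torch.compile`, `_eager_attention` the role of the eager kernel, and
the device is a plain string. All names here are hypothetical.

```python
from typing import Callable, Optional

_compiled_fn: Optional[Callable[[float], float]] = None
compile_calls = 0  # counts how often the (stand-in) compiler ran


def _eager_attention(x: float) -> float:
    return x * 2.0  # stand-in for the eager kernel


def _compile(fn: Callable[[float], float]) -> Callable[[float], float]:
    """Stand-in for torch.compile: records the call and returns the function unchanged."""
    global compile_calls
    compile_calls += 1
    return fn


def attention_dispatch(x: float, device_type: str) -> float:
    """Use the eager path off CUDA; compile once, on the first CUDA call."""
    global _compiled_fn
    if device_type != "cuda":
        return _eager_attention(x)
    if _compiled_fn is None:
        _compiled_fn = _compile(_eager_attention)
    return _compiled_fn(x)


calls_at_import = compile_calls
cpu_out = attention_dispatch(1.0, "cpu")
calls_after_cpu = compile_calls
cuda_out = attention_dispatch(1.0, "cuda")
attention_dispatch(1.0, "cuda")
calls_after_cuda = compile_calls
```

Nothing is compiled at import time or on CPU, and repeated CUDA calls reuse
the cached function.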

Pipelines (AnyFlowPipeline + AnyFlowFARPipeline)
- Add EXAMPLE_DOC_STRING + @replace_example_docstring and full __call__
  docstrings covering every argument.
- Move use_mean_velocity from __init__ to __call__ so save/load round-trips.
- Drop _denoise_rollout's grad_timestep branch (DMD on-policy training rollout),
  the inner inference_range closure, and the redundant negative-prompt concat.
- Replace asserts with ValueError; wire show_progress to tqdm; rename inference
  -> _inference; remove dead current_timestep property.
- Update scheduler.step call sites to the new signature.
- Trim class docstrings to inference-only language.

Pipeline output
- Add Apache 2.0 license header; switch to relative import.

Auto pipeline / conversion script
- Register AnyFlowFARPipeline in AUTO_IMAGE2VIDEO_PIPELINES_MAPPING and
  AUTO_VIDEO2VIDEO_PIPELINES_MAPPING.
- Document the weights_only=False requirement in the conversion script.

Tests
- Scheduler tests use the new step signature and verify the Output dataclass
  contract.
- Drop the four obsolete training-weight tests; drop weight_type kwarg from
  pipeline test fixtures; remove internal milestone names from TODO comments.

Docs
- Resolve <arxiv-id> in the scheduler docs page.
- Trim DMD / on-policy distillation language in EN/ZH tutorials and the
  pipelines page; the paper abstract quote is preserved verbatim.

* [AnyFlow] split FAR causal transformer into transformer_anyflow_far.py

Per @dg845's review on #13745: extract FAR causal modules into a dedicated
sibling file so each transformer variant reads in isolation. Shared submodules
are duplicated via `# Copied from` so `make fix-copies` keeps both in sync.

- `transformer_anyflow.py`: bidi-only. `AnyFlowAttnProcessor` no longer carries
  the flex/KV-cache branch (was: dispatch in one branch, bare flex_attention in
  the other); `AnyFlowRotaryPosEmbed` drops the compressed-frame helpers and
  the `is_causal` arg; `AnyFlowDualTimestepTextImageEmbedding` drops its causal
  branch. `AnyFlowTransformerBlock` keeps a single class with a new
  `is_causal: bool = False` ctor flag that selects the self-attn processor —
  the forward path is identical in both modes, only the processor differs.

- `transformer_anyflow_far.py`: new. Contains `AnyFlowFARTransformerOutput`,
  `AnyFlowCausalAttnProcessor` (routed through `dispatch_attention_fn(backend=
  "flex")` with a clear ValueError when a non-flex backend is configured; the
  BlockMask is consumed only by the flex backend in `_native_flex_attention`),
  `AnyFlowDualTimestepTextImageEmbeddingCausal`, `AnyFlowCausalRotaryPosEmbed`,
  `AnyFlowFARTransformer3DModel`, and `# Copied from` clones of the shared
  `AnyFlowAttention`/`AnyFlowCrossAttnProcessor`/`AnyFlowImageEmbedding`/
  `AnyFlowTransformerBlock`/`AnyFlowAttnProcessor` modules.

Verified against the pre-refactor branch on H200 (float32); bidi is bit-exact
and FAR matches to within fp32 rounding:
- bidi:  L2 = 0.000e+00, max|Δ| = 0.000e+00
- FAR :  L2 = 4.772e-06, max|Δ| = 3.576e-07
The FAR delta is fp32 accumulation noise from the dispatch path permuting
(B,L,H,D) ↔ (B,H,L,D) around the same `flex_attention` kernel.

Addresses review comments at transformer_anyflow.py:215, :261, :450, :622,
:671, :958.

* [AnyFlow] pipeline cleanup: video_processor, encode_video, inline rollout, kwarg rename

Per @dg845's review on #13745, applied to both bidi `AnyFlowPipeline` and
causal `AnyFlowFARPipeline`:

- Use `self.video_processor.preprocess_video(...)` instead of the manual
  `* 2 - 1` normalize.
- Merge `vae_encode` + `encode_latents` + `_normalize_latents` into a single
  `encode_video` method, mirroring `WanImageToVideoPipeline.encode_image`'s
  flat structure.
- Inline `_denoise_rollout` into `AnyFlowPipeline.__call__`. For the FAR
  pipeline, inline both `_denoise_rollout` and `_inference` as a nested loop
  (outer over chunks, inner over denoising steps), mirroring
  `WanAnimatePipeline.__call__`. `encode_kv_cache` is intentionally kept as a
  method — it is one transformer call with a different `kv_cache_flag` mode
  (cache-write), and inlining it would interleave two distinct forward
  semantics in the same loop body and lose readability.
- Rename `context_sequence` → `video` (pixel-space) + `video_latents`
  (pre-encoded), matching `WanVideoToVideoPipeline`. For the FAR pipeline,
  the old `{"raw"/"latent"}` dict form is replaced by the two kwargs.
  Mutually-exclusive validation raises `ValueError`.

Addresses review comments at pipeline_anyflow.py:358, :372, :393, :473 and
pipeline_anyflow_far.py:395, :489, :675.

* [AnyFlow] scheduler: N-length timesteps + step defaults r_timestep

Per @dg845's review on #13745:

- `set_timesteps(N)` now produces `N` timesteps backed by an internal
  `sigmas[N+1]` linspace, matching
  `FlowMatchEulerDiscreteScheduler.set_timesteps`. The final sigma (== 0) is
  the implicit r-endpoint of the last step; the pipeline rollouts iterate
  `for i, t in enumerate(timesteps)` without the old `[:-1]` slicing.
- `step(r_timestep=None)` now defaults to the next timestep on the schedule
  (resolved via fp-tolerant `argmin` over `sigmas[:-1]`), instead of raising.
  Any-step sampling is preserved when `r_timestep` is explicit. The raise
  stays only for the case where the caller passes a `timestep` value that
  isn't on the schedule and provides no `r_timestep` — there's no sensible
  default in that case.
- Build sigmas in float64 on CPU then move to the target device, with a
  float32 downcast for MPS / NPU (float64 isn't supported on those backends).

Pipeline rollout loops updated to compute
`r = sigmas[i + 1] * num_train_timesteps` for the model's `r_timestep` input
and pass `r_timestep=None` to `scheduler.step` (which resolves it from the
schedule internally).
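
The relationship between the `N` timesteps and the `N + 1` sigmas can be illustrated with a small plain-Python sketch. It assumes the sigmas run linearly from 1.0 down to 0.0 before the usual flow-matching shift is applied, and that `num_train_timesteps` is 1000; the function names `build_sigmas` and `rollout_pairs` are hypothetical and the real scheduler works on tensors.

```python
def build_sigmas(num_inference_steps, shift=1.0):
    """Return N + 1 sigmas from 1.0 down to 0.0 (assumed layout)."""
    n = num_inference_steps
    if n < 1:
        raise ValueError("num_inference_steps must be >= 1")
    raw = [1.0 - i / n for i in range(n + 1)]
    # Common flow-matching time shift; shift=1.0 leaves the schedule linear.
    return [shift * s / (1.0 + (shift - 1.0) * s) for s in raw]


def rollout_pairs(num_inference_steps, num_train_timesteps=1000, shift=1.0):
    """Return the (t, r) pair fed to the model at every denoising step."""
    sigmas = build_sigmas(num_inference_steps, shift)
    # N timesteps: every sigma except the terminal 0.
    timesteps = [s * num_train_timesteps for s in sigmas[:-1]]
    pairs = []
    for i, t in enumerate(timesteps):
        # The target of step i is the next sigma; the last target is 0.
        r = sigmas[i + 1] * num_train_timesteps
        pairs.append((t, r))
    return pairs


pairs = rollout_pairs(4)
```

Each step's `r` coincides with the next step's `t`, and the last `r` is 0, which is why the loop no longer needs the `[:-1]` slice.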

Addresses review comments at scheduling_flow_map_euler_discrete.py:107 and
:148.

* [AnyFlow] tests: regenerate via generate_model_tests.py; split bidi/FAR files

Per @dg845's review on #13745: replaced the hand-rolled transformer tests
with the standard mixin-based suite produced by
`utils/generate_model_tests.py`, and split the FAR causal model tests into
their own file to mirror the transformer file split.

- `tests/models/transformers/test_models_transformer_anyflow.py`: regenerated
  bidi suite. Pulls in `ModelTesterMixin`, `MemoryTesterMixin`,
  `TrainingTesterMixin`, `AttentionTesterMixin`, `TorchCompileTesterMixin` via
  `BaseModelTesterConfig`, with `get_init_dict()` / `get_dummy_inputs()`
  filled in for the small bidi config used in CI.

- `tests/models/transformers/test_models_transformer_anyflow_far.py`: new.
  Same mixin set (TorchCompile is intentionally skipped — FAR's
  `_build_causal_mask` uses `flex_attention.create_block_mask(_compile=False)`
  which conflicts with the standard compile tester's assumptions; the bidi
  file covers compile, FAR is bit-exact-validated end-to-end on H200 via the
  pipeline replay). Also carries an `AnyFlowCausalAttnProcessor` smoke test
  that exercises the backend gate (non-flex backends must raise) and asserts
  the `AnyFlowFARTransformerOutput` dataclass exposes the expected fields.

Addresses review comments at test_models_transformer_anyflow.py:71 and :128.

* [AnyFlow] docs: update for video / video_latents kwarg rename

Following the pipeline kwarg refactor in e9d50b2, sweep the user-facing docs
to reflect the new API:

- `docs/source/en/api/pipelines/anyflow.md`: T2V / I2V / V2V code examples now
  use `video=` instead of `context_sequence={"raw": ...}`. The "Generation
  with AnyFlow (FAR Causal)" intro describes the new mutually-exclusive
  `video` / `video_latents` selector.

- `docs/source/en/using-diffusers/anyflow.md`: the scenario selector table,
  the "Image-to-video and video-to-video" walkthrough, and the closing note
  about pre-encoded latents are all updated. `vae_encode` references are
  replaced with `encode_video`.

* [AnyFlow] tests: skip FAR training tests on CPU (flex backward); align scheduler tests with N-length timesteps

- TestAnyFlowFARTransformer3DTraining: skip test_training / test_training_with_ema /
  test_gradient_checkpointing_equivalence on CPU. FAR causal self-attn uses
  torch.nn.attention.flex_attention whose backward kernel is GPU-only.
- test_scheduler_flow_map_euler_discrete: assert timesteps is N-length (not N+1) and
  the sigma=0 r-endpoint lives in self.sigmas[-1]; test_step_one_shot_sampling now
  exercises r_timestep=None (resolved from sigmas) since N=1 has no timesteps[1].

* [AnyFlow] docs: complete forward() Args: sections for check_forward_call_docstrings

main #13758 added utils/check_forward_call_docstrings.py which requires every signature
arg to appear as its own `name (...):` entry under Args:. Expand the bidi and FAR
transformer forward docstrings to list each parameter individually.

* [AnyFlow] apply 5/21 review suggestions (A: 1-click)

FAR transformer:
- AnyFlowCausalAttnProcessor: default _attention_backend = 'flex' (was None);
  remove None from _SUPPORTED_BACKENDS. None previously fell through to SDPA
  which silently ignored the BlockMask; failing loudly is the right default.
- dispatch_attention_fn call: read self._attention_backend instead of hardcoded
  'flex', so '_native_flex' selection works.
- _build_freqs / _forward_full_frame: add '# Copied from' to bidi RoPE.
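
The "fail loudly" backend gate described above can be sketched as follows. This is an illustration only: `CausalProcessorSketch` and `resolve_backend` are hypothetical names, and the contents of `_SUPPORTED_BACKENDS` are assumed from the commit text (which confirms only that `'flex'` is the default, that `None` was removed, and that `'_native_flex'` is selectable).

```python
class CausalProcessorSketch:
    """Hypothetical stand-in for the backend check in the causal processor."""

    # Assumed contents; see the lead-in for what the commit text confirms.
    _SUPPORTED_BACKENDS = ("flex", "_native_flex")
    _attention_backend = "flex"

    def resolve_backend(self):
        backend = self._attention_backend
        if backend not in self._SUPPORTED_BACKENDS:
            # A non-flex backend would ignore the BlockMask, so refuse early.
            raise ValueError(
                f"Unsupported attention backend {backend!r}; "
                f"expected one of {self._SUPPORTED_BACKENDS}."
            )
        return backend


processor = CausalProcessorSketch()
backend = processor.resolve_backend()  # the configured value, not a literal
```

Reading the configured attribute instead of a hardcoded string is what lets an alternative flex variant be selected.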

Pipelines:
- bidi + FAR docstrings: video shape (B, C, T, H, W) -> (B, T, C, H, W) to
  match VideoProcessor.preprocess_video.
- FAR EXAMPLE_DOC_STRING: single-frame I2V tensor wrap uses unsqueeze(1) for the
  T axis instead of unsqueeze(2).
- FAR encode_video: drop duplicated @torch.no_grad() decorator.

Tests:
- test_anyflow / test_anyflow_far: lift the test_save_load_optional_components
  skip (the test actually passes).
- FAR processor smoke test: assert default backend is 'flex' (was None).

* [AnyFlow] apply 5/21 review suggestions (B: refactors)

Pipelines:
- check_inputs accepts video / video_latents and raises early on:
    (a) mutual exclusion (was checked late in __call__);
    (b) FAR's (num_frames - 1) % 4 == 0 constraint.
  __call__ no longer carries duplicate validation.
- FAR pipeline: drop the show_progress kwarg and replace the single tqdm with
  nested progress bars in the LLaDA-2 pattern — outer 'Chunks' (position=0)
  and per-chunk inner 'Inference Steps' (position=1, leave=False) — both
  picking up DiffusionPipeline._progress_bar_config (so set_progress_bar_config
  controls them, including disable=None).
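
The two early checks can be sketched as below. This is a hedged illustration, not the pipeline's actual `check_inputs`: the function name `check_video_inputs`, the `far` flag, and the error wording are assumptions; only the mutual exclusion and the `(num_frames - 1) % 4 == 0` rule come from the text above.

```python
def check_video_inputs(video=None, video_latents=None, num_frames=None, far=False):
    """Hypothetical early validation mirroring the two rules listed above."""
    # (a) `video` and `video_latents` are mutually exclusive.
    if video is not None and video_latents is not None:
        raise ValueError("Pass only one of `video` or `video_latents`, not both.")
    # (b) FAR only: the frame count must satisfy (num_frames - 1) % 4 == 0.
    if far and num_frames is not None and (num_frames - 1) % 4 != 0:
        raise ValueError(
            f"`num_frames - 1` must be divisible by 4, got num_frames={num_frames}."
        )


# 81 frames passes the FAR constraint because (81 - 1) % 4 == 0.
check_video_inputs(video="pixels", num_frames=81, far=True)
```

Running both checks before any encoding work means a bad call fails before the VAE or transformer is touched.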

Scheduler:
- step() resolves source and target sigmas by indexing self.sigmas via the new
  index_for_timestep(), instead of dividing the input timesteps by
  num_train_timesteps. This keeps the math correct for any future schedule
  whose timestep/sigma relationship is non-linear. For an off-schedule
  r_timestep the code falls back to r / num_train_timesteps, so explicit
  any-step sampling outside the schedule still works (and t off-schedule with
  r=None still raises a clear ValueError, as before).

Numerical equivalence: for the shipped linspace+shift schedule the two
formulations are bit-identical (verified: max abs diff = 0.0 over an N=8,
shift=5 schedule).
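
The lookup-with-fallback behaviour can be sketched in plain Python. This is an assumption-laden illustration: `lookup_sigma` and `resolve_sigmas` are hypothetical names, the tolerance value is arbitrary, and the real `index_for_timestep()` operates on tensors.

```python
def lookup_sigma(sigmas, t, num_train_timesteps, tol=1e-6):
    """Return (index, sigma) for an on-schedule t, else (None, t / num_train)."""
    scaled = [s * num_train_timesteps for s in sigmas]
    idx = min(range(len(scaled)), key=lambda i: abs(scaled[i] - t))
    if abs(scaled[idx] - t) <= tol:
        return idx, sigmas[idx]
    # Off-schedule: fall back to the linear timestep/sigma relationship.
    return None, t / num_train_timesteps


def resolve_sigmas(sigmas, timestep, r_timestep=None, num_train_timesteps=1000):
    """Return (sigma_t, sigma_r) for one step; `sigmas` has N + 1 entries."""
    src_idx, sigma_t = lookup_sigma(sigmas[:-1], timestep, num_train_timesteps)
    if r_timestep is None:
        if src_idx is None:
            raise ValueError("Off-schedule `timestep` needs an explicit `r_timestep`.")
        # Default target: the next entry on the schedule.
        return sigma_t, sigmas[src_idx + 1]
    _, sigma_r = lookup_sigma(sigmas, r_timestep, num_train_timesteps)
    return sigma_t, sigma_r


schedule = [1.0, 0.75, 0.5, 0.25, 0.0]
sigma_t, sigma_r = resolve_sigmas(schedule, 750.0)
```

Because on-schedule values are read from the stored sigmas, a schedule whose timestep/sigma mapping is not a plain division would still resolve correctly in this sketch.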

* [AnyFlow] apply Claude bot review (5/21): 8 findings beyond dg845's list

Finding #1 — attention_kwargs plumbing:
  Both transformers now decorate forward() with @apply_lora_scale('attention_kwargs')
  (matches Wan); pipelines forward attention_kwargs to the transformer + encode_kv_cache,
  and the unused parameter is dropped from the inner _forward_train / _forward_cache /
  _forward_inference signatures. Pipeline docstrings updated to the standard wording.

Finding #2 — naming:
  Rename far_cfg -> layout_cfg in the bidi transformer (the bidi path is not FAR; the
  FAR transformer keeps far_cfg, which is accurate there).

Finding #3 — scheduler state machine:
  Add _step_index, _begin_index, step_index property, begin_index property,
  set_begin_index(), _init_step_index(). step() lazily initializes and advances the
  counter so downstream callbacks / composable schedulers can observe rollout progress.
  Sigma resolution remains a pure function of (timestep, r_timestep) — calling step()
  twice with identical args still returns identical prev_sample (idempotent).

Finding #4 — redundant @torch.no_grad():
  Drop the redundant decorators on bidi pipeline's encode_video and FAR pipeline's
  encode_kv_cache (callers are already in __call__'s no-grad scope).

Finding #5 — dead code:
  Remove the unreachable temb.ndim == 2 else branch from the bidi transformer's
  output-norm path (condition_embedder.forward always returns a 3D temb).

Finding #6 — private rename:
  forward_far_patchify[_inference] -> _forward_far_patchify[_inference] (only called
  internally by _forward_train / _forward_cache / _forward_inference).

Finding #7 — pipeline comment numbering:
  Bidi + FAR pipelines renumber steps so the # 4. slot is no longer skipped.

Finding #8 — mask-mod comment numbering:
  _build_causal_mask numbered comments now run 1) 2) 3) ... (was 1) 3) 4) ...).

Tests:
  - New test_step_index_advances + test_set_begin_index_anchors_step_index in the
    scheduler test file exercise the new state machine.
  - All existing pipeline / transformer / scheduler tests still pass (85 passed,
    85 skipped on CPU).

Bit-exact: 8-step rollout vs the previous formulation, max abs diff = 0.0 (the new
sigma-lookup is byte-identical to t/num_train_timesteps on this schedule).

* [AnyFlow] scheduler: honour off-schedule any-step in _init_step_index; drop dead _resolve_next_timestep

Audit caught two issues in the previous scheduler commit:

1. The new state machine raised in _init_step_index whenever the first timestep
   wasn't on the active schedule, contradicting the documented contract that
   step() falls back to t/num_train_timesteps for off-schedule any-step
   sampling. The fall-back numerics were intact but they were unreachable —
   the init check fired first.

   Fix: _init_step_index now initializes _step_index to 0 when the timestep is
   off-schedule (still a valid observable counter for callbacks). step()'s
   sigma resolution is untouched, so on-schedule rollouts stay bit-exact and
   off-schedule any-step sampling actually runs again. Regression test:
   test_step_off_schedule_anystep_supported.

2. _resolve_next_timestep had no remaining callers after the step() rewrite
   inlined the same lookup. Removed (private helper, no external API).
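
The step-counter behaviour after this fix can be sketched as follows. `StepCounterSketch` is a hypothetical, simplified stand-in: it models only the counter, under the assumption that an explicit begin index takes priority over the timestep lookup, and it leaves out sigma resolution entirely.

```python
class StepCounterSketch:
    """Hypothetical model of the scheduler's step-index bookkeeping."""

    def __init__(self, timesteps):
        self.timesteps = list(timesteps)
        self._step_index = None
        self._begin_index = None

    @property
    def step_index(self):
        return self._step_index

    @property
    def begin_index(self):
        return self._begin_index

    def set_begin_index(self, begin_index=0):
        self._begin_index = begin_index

    def _init_step_index(self, timestep, tol=1e-6):
        if self._begin_index is not None:
            self._step_index = self._begin_index
            return
        matches = [i for i, t in enumerate(self.timesteps) if abs(t - timestep) <= tol]
        # Off-schedule timesteps start the counter at 0 instead of raising.
        self._step_index = matches[0] if matches else 0

    def step(self, timestep):
        # Lazily initialize on the first call, then advance by one.
        if self._step_index is None:
            self._init_step_index(timestep)
        self._step_index += 1
        return self._step_index


counter = StepCounterSketch([1000.0, 750.0, 500.0, 250.0])
counter.step(900.0)  # off-schedule first step no longer raises
```

The counter is purely observational here, which matches the statement that sigma resolution does not depend on it.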

* [AnyFlow] docs: align user guides with video shape + kwarg fixes

- en api/pipelines/anyflow.md: video shape (B, C, T, H, W) -> (B, T, C, H, W);
  example tensor wrap uses unsqueeze(0).unsqueeze(1) and permute(0, 3, 1, 2)
  to match VideoProcessor.preprocess_video's 5D contract.
- zh using-diffusers/anyflow.md: same shape fixes; also flip the I2V / V2V
  examples from the obsolete context_sequence={...} dict to the current
  video= / video_latents= kwargs; helper to_video_tensor returns (1, T, C, H, W);
  add a note about mutual exclusion.

* [AnyFlow] tests: drop @slow integration test scaffolds for initial PR

.ai/skills/model-integration/SKILL.md is explicit: 'No integration / slow
tests in the initial PR — don't add anything gated on @slow / RUN_SLOW=1
yet.' Our two integration test classes were shape-only assertions with TODOs
for a future numeric reference, so dropping them loses no actual coverage —
the relevant rollouts are covered by H200 bit-exact replay outside the
pytest suite. A follow-up PR with proper numeric reference slices can land
after merge, once the maintainer is comfortable enabling slow tests.

* Apply style fixes

* [AnyFlow] apply 5/22 dg845 review: comment cleanups + custom sigmas/timesteps schedule

dg845 third pass: 7 of 9 comments applied as suggested; the 8th (custom
sigmas/timesteps support) is implemented following
FlowMatchEulerDiscreteScheduler conventions; the 9th (_build_causal_mask
refactor) is explicitly marked non-blocking and deferred to a follow-up that also
re-enables TorchCompileTesterMixin.

Comment cleanups:
- transformer_anyflow.py:704 temb output-norm comment: drop redundant 'no ndim==2 branch'.
- pipeline_anyflow.py:550 denoise loop comment: '# 6. Denoising loop'.
- pipeline_anyflow_far.py:684 denoise loop comment: '# 8. Denoising loop (outer over
  chunks, inner over timesteps).'.
- pipeline_anyflow_far.py:702 drop trailing inline comment on `timesteps = scheduler.timesteps`.
- scheduling_flow_map_euler_discrete.py: clearer wording on the off-schedule `r_timestep`
  error.

Custom schedule support:
- FlowMapEulerDiscreteScheduler.set_timesteps gains `sigmas` and `timesteps` kwargs
  mirroring FlowMatchEulerDiscreteScheduler. Default behaviour is unchanged
  (linspace + shift); the validation + length-N → length-N+1 terminal-0 append are
  shared with the default path so on-schedule rollouts stay bit-exact.
- AnyFlowPipeline.__call__ and AnyFlowFARPipeline.__call__ accept `sigmas` and
  `timesteps` kwargs, override num_inference_steps from their length, and forward
  to set_timesteps (matches LTX2Pipeline pattern).
- New scheduler tests: test_set_timesteps_custom_sigmas and
  test_set_timesteps_custom_timesteps cover both override paths.
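
How a custom schedule might be normalized into the same length-N timesteps plus length-N+1 sigmas can be sketched as below. Everything here is an assumption for illustration: `resolve_schedule` is a hypothetical name, custom timesteps are converted with a plain division by `num_train_timesteps`, the length check is borrowed from common scheduler practice, and the default path omits the shift for brevity.

```python
def resolve_schedule(num_inference_steps=None, sigmas=None, timesteps=None,
                     num_train_timesteps=1000):
    """Return (N, timesteps, sigmas_with_terminal_zero) for any input form."""
    if sigmas is not None and timesteps is not None and len(sigmas) != len(timesteps):
        raise ValueError("`sigmas` and `timesteps` must have the same length.")
    if sigmas is None and timesteps is not None:
        # Assumed conversion for custom timesteps.
        sigmas = [t / num_train_timesteps for t in timesteps]
    if sigmas is None:
        if num_inference_steps is None:
            raise ValueError("Provide `num_inference_steps`, `sigmas`, or `timesteps`.")
        n = num_inference_steps
        sigmas = [1.0 - i / n for i in range(n)]  # default linear path, no shift
    sigmas = list(sigmas)
    if timesteps is None:
        timesteps = [s * num_train_timesteps for s in sigmas]
    # The custom length overrides num_inference_steps; all paths share the
    # terminal-0 append, so every path yields N + 1 sigmas.
    return len(sigmas), list(timesteps), sigmas + [0.0]


n, ts, sg = resolve_schedule(sigmas=[1.0, 0.5])
```

Sharing the final append between the default and custom paths is what keeps the default rollout unchanged in this sketch.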

Dtype skip on save/load:
- TestAnyFlowTransformer3D and TestAnyFlowFARTransformer3D now skip
  test_from_save_pretrained_dtype_inference (parametrized over fp16/bf16), mirroring
  WanTransformer3DModel's skip: the test's tolerances are too tight to give a
  meaningful signal under AnyFlow's flow-map mixed-precision sampling.

* [AnyFlow] docs: apply hf-doc-builder line wrap (max_len 119)

CI doc-builder style check flagged 3 files with docstring lines >119 chars.
Ran 'doc-builder style src/diffusers docs/source --max_len 119' to autoformat;
content unchanged, line wrapping only.

* [AnyFlow] apply 5/22 follow-up review: new_zeros terminal sigma + cleanup

dg845 blocking suggestion (r3287274209):
- scheduling_flow_map_euler_discrete.py:185 — use `working_sigmas.new_zeros(1)`
  instead of `torch.zeros(1, dtype=...)` so the appended terminal sigma inherits
  both device and dtype from working_sigmas. The current working_sigmas always
  starts on CPU so the device mismatch is latent, but new_zeros is the correct
  defensive pattern and matches how the published FAR test fixtures run on CUDA.

Claude bot final-review follow-ups:
- transformer_anyflow_far.py: drop three stale `# step 3: generate attention mask`
  comments left over from the original numbered-step structure (bot #6).
- pipeline_anyflow_far.py: annotate `encode_video` with
  `# Copied from diffusers.pipelines.anyflow.pipeline_anyflow.AnyFlowPipeline.encode_video`
  and align docstring + inline comment so `make fix-copies` keeps them in sync (bot #3).

Skipped (not real / judgment-call):
- bot #2 (private rename of `_forward_far_patchify*`) — already done in 84605d5;
  bot was looking at a stale snapshot.
- bot #4 (check_inputs `# Copied from`) — FAR's check_inputs has an extra
  `(num_frames - 1) % 4 == 0` constraint that doesn't map onto the bidi version,
  so a clean `# Copied from` link would require restructuring. Bot called it a
  consistency nit; leaving as-is.
- bot #5 (`encode_kv_cache` → `_encode_kv_cache`) — bot itself flagged this as
  judgment-call territory; the helper is a coherent operation that advanced
  inference callers may want to invoke directly.

---------

Co-authored-by: dg845 <58458699+dg845@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2026-05-22 03:15:00 -07:00