Merge branch 'main' into group-offloading-pytest
Sayak Paul
2026-05-27 08:22:47 +05:30
committed by GitHub
258 changed files with 8382 additions and 255 deletions

@@ -172,3 +172,5 @@ Boolean gate. If `False` (default), calling that method raises `ValueError`. All
freqs_dtype = torch.float32 if (is_mps or is_npu) else torch.float64
```
See `transformer_flux.py`, `transformer_flux2.py`, `transformer_wan.py`, `unet_2d_condition.py` for reference usages. Never leave an unconditional `torch.float64` in the model.
6. **Using `torch.empty`.** Do not use `torch.empty` to initialize parameters: it returns uninitialized memory. Use `torch.zeros` or `torch.ones` instead.

@@ -60,3 +60,7 @@ When adding a new pipeline (or reviewing one), skim `pipeline_flux.py`, `pipelin
4. **Subclassing an existing pipeline for a variant.** Don't derive one pipeline (e.g. `FluxImg2ImgPipeline`) from an existing pipeline class (e.g. `FluxPipeline`) inside the core `src/` codebase. Each pipeline lives in its own file with its own class, even if it shares 90% of `__call__` with a sibling. The convention across diffusers (flux, sdxl, wan, qwenimage) is to duplicate `__call__` between the img2img / text2img / inpaint variants rather than subclass. Reuse private utilities (shared schedulers, prep functions) but not the pipeline class itself.
5. **Copying a method from another pipeline without `# Copied from`.** When you reuse a method like `encode_prompt`, `prepare_latents`, `check_inputs`, or `_prepare_latent_image_ids` from another pipeline, add a `# Copied from` annotation so `make fix-copies` keeps the two in sync. Forgetting it means future refactors to the source drift away from your copy silently — and reviewers waste time spotting near-identical code that should have been linked. The annotation grammar (decorator placement, rename syntax with `with old->new`, etc.) is implemented in [`utils/check_copies.py`](../utils/check_copies.py) — read it for the exact rules.
6. **Be deliberate about methods on the pipeline.** `__call__` is the user's mental model. The methods on the class are how they navigate it. Diffusers convention (flux, sdxl, wan, qwenimage) is a flat class body of public lifecycle methods (`__init__`, `check_inputs`, `encode_prompt`, `prepare_latents`, `__call__`). Two principles, not strict rules — use judgment:
- **If a method is called from `__call__`, and it's a step in the pipeline lifecycle, make it public.** Each call from `__call__` should correspond to a step a user can identify: either a standard one (`encode_prompt`, `prepare_latents`, `set_timesteps`, …) or a pipeline-specific one (`prepare_src_latents`, `prepare_reference_audio_latents`, …). Don't gate these behind a `_`; they're part of the pipeline's API surface alongside their standard siblings.
- **If a method is only used by another method, make it private (`_foo`) or lift it to a module-level function, and keep the count down.** Before adding one, see if the logic can be absorbed into its caller. Unless you expect the helper to be reused by another method (or another task pipeline), absorbing is usually the better call, especially when the body is small. Avoid a pipeline class littered with private helpers that bury the lifecycle.
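
As a rough illustration of items 5 and 6, the skeleton below shows a flat class body of public lifecycle methods, a `# Copied from` annotation, and one module-level helper. `ToyPipeline`, `_rescale`, and all method bodies are hypothetical placeholders, not real diffusers code, and the annotation path only shows placement; [`utils/check_copies.py`](../utils/check_copies.py) remains the source of truth for the grammar.

```python
def _rescale(values, factor):
    # Module-level helper: used by a single method, so it stays out of the class body.
    return [v * factor for v in values]


class ToyPipeline:
    """Hypothetical pipeline used only to illustrate layout; not a diffusers class."""

    def __init__(self, scale=0.5):
        self.scale = scale

    # Copied from diffusers.pipelines.flux.pipeline_flux.FluxPipeline.check_inputs
    def check_inputs(self, prompt, num_steps):
        # Toy body for illustration; a real copied method must match its source.
        if not isinstance(prompt, str) or not prompt:
            raise ValueError("`prompt` must be a non-empty string.")
        if num_steps < 1:
            raise ValueError("`num_steps` must be at least 1.")

    def encode_prompt(self, prompt):
        # Stand-in "embedding": one float per word.
        return [float(len(word)) for word in prompt.split()]

    def prepare_latents(self, embeds):
        return _rescale(embeds, self.scale)

    def __call__(self, prompt, num_steps=2):
        # Each call below is a lifecycle step a user can name, so each is public.
        self.check_inputs(prompt, num_steps)
        embeds = self.encode_prompt(prompt)
        latents = self.prepare_latents(embeds)
        for _ in range(num_steps):
            latents = [x * 0.5 for x in latents]
        return latents
```

Reading `__call__` top to bottom gives the lifecycle, and the only private name is a helper that no caller of the pipeline needs to know about.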

.github/dependabot.yml (new file, 11 lines)

@@ -0,0 +1,11 @@
version: 2
updates:
- package-ecosystem: "github-actions"
directory: "/"
schedule:
interval: "weekly"
cooldown:
default-days: 7
groups:
actions:
patterns: ["*"]

@@ -45,7 +45,7 @@ jobs:
uv pip install -r benchmarks/requirements.txt
- name: Environment
run: |
python utils/print_env.py
diffusers-cli env
- name: Diffusers Benchmarking
env:
HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}

@@ -156,7 +156,6 @@ jobs:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
PR_NUMBER: ${{ github.event.issue.number || github.event.pull_request.number }}
COMMENT_USER: ${{ github.event.comment.user.login }}
BASE_BRANCH: ${{ github.event.repository.default_branch }}
run: |
set -euo pipefail
@@ -186,11 +185,18 @@ jobs:
exit 0
fi
# For fork PRs, an earlier step redirected `origin` to a local bare
# repo to sandbox claude-code-action. Undo that redirect so our push
# reaches the real base repo. Safe: only Claude's edits within the
# allowed paths are committed below — never the fork's other changes.
git config --unset-all url."file:///tmp/local-origin.git".insteadOf 2>/dev/null || true
PR_INFO=$(gh pr view "$PR_NUMBER" --json headRefName,isCrossRepository)
PR_BRANCH=$(echo "$PR_INFO" | jq -r '.headRefName')
IS_FORK=$(echo "$PR_INFO" | jq -r '.isCrossRepository')
# COMMIT THIS isn't supported on fork PRs: we can't push to the
# fork's branch, and falling back to main almost always conflicts
# once the PR touches files that also moved on main. Bail early —
# Claude's review comment with the suggested diff still stands.
if [[ "$IS_FORK" == "true" ]]; then
post_status " \`COMMIT THIS\` isn't supported on fork PRs. Apply Claude's suggestions manually, or open an issue to track them. See [workflow run]($RUN_URL)."
exit 0
fi
git config user.name "claude[bot]"
git config user.email "41898282+github-actions[bot]@users.noreply.github.com"
@@ -208,8 +214,6 @@ jobs:
exit 1
fi
PR_BRANCH=$(gh pr view "$PR_NUMBER" --json headRefName --jq '.headRefName')
if [[ "$PR_BRANCH" == claude/pr-* ]]; then
# Source PR is already a Claude-opened PR — iterate in place by
# committing and pushing straight to its head branch instead of
@@ -222,9 +226,14 @@ jobs:
exit 0
fi
# Otherwise: commit on the source PR's branch to get a clean SHA,
# then cherry-pick onto a fresh branch cut from the default branch.
# The follow-up PR's diff is therefore exactly Claude's edits vs. main.
# Target the source PR's head branch. The follow-up then applies
# cleanly regardless of how main has diverged, and merging it lands
# Claude's edits onto the PR for the maintainer to fold in.
BASE_BRANCH="$PR_BRANCH"
# Commit on the source PR's branch to get a clean SHA, then
# cherry-pick onto a fresh branch cut from BASE_BRANCH so the
# follow-up PR's diff is exactly Claude's edits vs. BASE_BRANCH.
NEW_BRANCH="claude/pr-${PR_NUMBER}-$(date -u +%Y%m%d-%H%M%S)"
git commit -m "Apply changes from Claude (requested by @${COMMENT_USER} on #${PR_NUMBER})
@@ -248,6 +257,6 @@ jobs:
--title "Apply Claude's changes from #${PR_NUMBER}" \
--body "Automated PR with edits Claude made in response to \`COMMIT THIS\` from @${COMMENT_USER} on [#${PR_NUMBER}](${GITHUB_SERVER_URL}/${GITHUB_REPOSITORY}/pull/${PR_NUMBER}).
Targets \`${BASE_BRANCH}\` — independent of #${PR_NUMBER}. Further \`COMMIT THIS\` requests on *this* PR will commit directly to it.")
Targets \`${BASE_BRANCH}\` (the head branch of #${PR_NUMBER}). Merging this brings Claude's edits into that PR.")
post_status "✅ Opened follow-up PR (into \`${BASE_BRANCH}\`) with Claude's edits: ${NEW_PR_URL}"
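
The branch-selection flow in this workflow can be summarized as a small decision function. This is a Python sketch of the shell logic, not code from the workflow: `plan_commit` and its return shape are hypothetical, and it takes the parsed `gh pr view --json headRefName,isCrossRepository` payload as a plain dict.

```python
import json
from datetime import datetime, timezone


def plan_commit(pr_info, pr_number, now=None):
    """Return (action, base_branch, new_branch) for a COMMIT THIS request."""
    # Fork PRs: the workflow bails out early and posts a status comment.
    if pr_info["isCrossRepository"]:
        return ("unsupported_fork", None, None)

    pr_branch = pr_info["headRefName"]

    # Already a Claude-opened PR: commit and push straight to its head branch.
    if pr_branch.startswith("claude/pr-"):
        return ("push_in_place", pr_branch, None)

    # Otherwise: open a follow-up PR that targets the source PR's head branch.
    now = now or datetime.now(timezone.utc)
    new_branch = f"claude/pr-{pr_number}-{now.strftime('%Y%m%d-%H%M%S')}"
    return ("open_follow_up_pr", pr_branch, new_branch)


info = json.loads('{"headRefName": "fix-typo", "isCrossRepository": false}')
print(plan_commit(info, 123))
```

Targeting the PR's own head branch means the follow-up diff contains only the new edits, however far the default branch has moved.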

@@ -19,6 +19,9 @@ env:
PIPELINE_USAGE_CUTOFF: 0
SLACK_API_TOKEN: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
CONSOLIDATED_REPORT_PATH: consolidated_test_report.md
# Force tokenizers<0.23.0 across every `uv pip install` in this workflow,
# even when transformers@main declares a higher lower-bound.
UV_OVERRIDE: /tmp/uv-overrides.txt
jobs:
setup_torch_cuda_pipeline_matrix:
@@ -74,14 +77,14 @@ jobs:
run: nvidia-smi
- name: Install dependencies
run: |
echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
uv pip install -e ".[quality]"
uv pip uninstall transformers huggingface_hub && UV_PRERELEASE=allow uv pip install -U transformers@git+https://github.com/huggingface/transformers.git
uv pip uninstall tokenizers && uv pip install "tokenizers<=0.23.0"
uv pip uninstall accelerate && uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git
uv pip install pytest-reportlog
- name: Environment
run: |
python utils/print_env.py
diffusers-cli env
- name: Pipeline CUDA Test
env:
HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
@@ -128,14 +131,14 @@ jobs:
- name: Install dependencies
run: |
echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
uv pip install -e ".[quality]"
uv pip uninstall transformers huggingface_hub && UV_PRERELEASE=allow uv pip install -U transformers@git+https://github.com/huggingface/transformers.git
uv pip uninstall tokenizers && uv pip install "tokenizers<=0.23.0"
uv pip install peft@git+https://github.com/huggingface/peft.git
uv pip uninstall accelerate && uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git
uv pip install pytest-reportlog
- name: Environment
run: python utils/print_env.py
run: diffusers-cli env
- name: Run nightly PyTorch CUDA tests for non-pipeline modules
if: ${{ matrix.module != 'examples'}}
@@ -196,12 +199,12 @@ jobs:
nvidia-smi
- name: Install dependencies
run: |
echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
uv pip install -e ".[quality,training]"
uv pip uninstall transformers huggingface_hub && UV_PRERELEASE=allow uv pip install -U transformers@git+https://github.com/huggingface/transformers.git
uv pip uninstall tokenizers && uv pip install "tokenizers<=0.23.0"
- name: Environment
run: |
python utils/print_env.py
diffusers-cli env
- name: Run torch compile tests on GPU
env:
HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
@@ -238,15 +241,15 @@ jobs:
run: nvidia-smi
- name: Install dependencies
run: |
echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
uv pip install -e ".[quality]"
uv pip uninstall transformers huggingface_hub && UV_PRERELEASE=allow uv pip install -U transformers@git+https://github.com/huggingface/transformers.git
uv pip uninstall tokenizers && uv pip install "tokenizers<=0.23.0"
uv pip install peft@git+https://github.com/huggingface/peft.git
uv pip uninstall accelerate && uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git
uv pip install pytest-reportlog
- name: Environment
run: |
python utils/print_env.py
diffusers-cli env
- name: Selected Torch CUDA Test on big GPU
env:
HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
@@ -289,15 +292,15 @@ jobs:
- name: Install dependencies
run: |
echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
uv pip install -e ".[quality]"
uv pip uninstall transformers huggingface_hub && UV_PRERELEASE=allow uv pip install -U transformers@git+https://github.com/huggingface/transformers.git
uv pip uninstall tokenizers && uv pip install "tokenizers<=0.23.0"
uv pip install peft@git+https://github.com/huggingface/peft.git
uv pip uninstall accelerate && uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git
- name: Environment
run: |
python utils/print_env.py
diffusers-cli env
- name: Run PyTorch CUDA tests
env:
@@ -365,6 +368,7 @@ jobs:
run: nvidia-smi
- name: Install dependencies
run: |
echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
uv pip install -e ".[quality]"
uv pip install -U ${{ matrix.config.backend }}
if [ "${{ join(matrix.config.additional_deps, ' ') }}" != "" ]; then
@@ -372,10 +376,9 @@ jobs:
fi
uv pip install pytest-reportlog
uv pip uninstall transformers huggingface_hub && UV_PRERELEASE=allow uv pip install -U transformers@git+https://github.com/huggingface/transformers.git
uv pip uninstall tokenizers && uv pip install "tokenizers<=0.23.0"
- name: Environment
run: |
python utils/print_env.py
diffusers-cli env
- name: ${{ matrix.config.backend }} quantization tests on GPU
env:
HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
@@ -418,14 +421,14 @@ jobs:
run: nvidia-smi
- name: Install dependencies
run: |
echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
uv pip install -e ".[quality]"
uv pip install -U bitsandbytes optimum_quanto
uv pip uninstall transformers huggingface_hub && UV_PRERELEASE=allow uv pip install -U transformers@git+https://github.com/huggingface/transformers.git
uv pip uninstall tokenizers && uv pip install "tokenizers<=0.23.0"
uv pip install pytest-reportlog
- name: Environment
run: |
python utils/print_env.py
diffusers-cli env
- name: Pipeline-level quantization tests on GPU
env:
HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
@@ -541,7 +544,7 @@ jobs:
# - name: Environment
# shell: arch -arch arm64 bash {0}
# run: |
# ${CONDA_RUN} python utils/print_env.py
# ${CONDA_RUN} diffusers-cli env
# - name: Run nightly PyTorch tests on M1 (MPS)
# shell: arch -arch arm64 bash {0}
# env:
@@ -597,7 +600,7 @@ jobs:
# - name: Environment
# shell: arch -arch arm64 bash {0}
# run: |
# ${CONDA_RUN} python utils/print_env.py
# ${CONDA_RUN} diffusers-cli env
# - name: Run nightly PyTorch tests on M1 (MPS)
# shell: arch -arch arm64 bash {0}
# env:
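
The pin written to the `UV_OVERRIDE` file in this workflow is `tokenizers<0.23.0`, while the per-step reinstall lines use `tokenizers<=0.23.0`. The two specifiers differ only at exactly 0.23.0, which the sketch below makes explicit. It is a simplified comparison on plain dotted integers, not a PEP 440 implementation (no pre-releases or local versions), and `parse_version` and `satisfies` are hypothetical helpers.

```python
def parse_version(text):
    """Turn '0.23.0' into (0, 23, 0). Simplified: plain dotted integers only."""
    return tuple(int(part) for part in text.split("."))


def satisfies(version, specifier):
    """Check a version against a single '<' or '<=' specifier such as '<0.23.0'."""
    if specifier.startswith("<="):
        return parse_version(version) <= parse_version(specifier[2:])
    if specifier.startswith("<"):
        return parse_version(version) < parse_version(specifier[1:])
    raise ValueError(f"unsupported specifier: {specifier}")


for candidate in ("0.22.1", "0.23.0"):
    print(candidate, satisfies(candidate, "<0.23.0"), satisfies(candidate, "<=0.23.0"))
```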

@@ -34,6 +34,9 @@ env:
OMP_NUM_THREADS: 4
MKL_NUM_THREADS: 4
PYTEST_TIMEOUT: 60
# Force tokenizers<0.23.0 across every `uv pip install` in this workflow,
# even when transformers@main declares a higher lower-bound.
UV_OVERRIDE: /tmp/uv-overrides.txt
jobs:
check_code_quality:
@@ -73,6 +76,7 @@ jobs:
python utils/check_copies.py
python utils/check_dummies.py
python utils/check_support_list.py
python utils/check_forward_call_docstrings.py
make deps_table_check_updated
- name: Check if failure
if: ${{ failure() }}
@@ -120,14 +124,14 @@ jobs:
- name: Install dependencies
run: |
echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
uv pip install -e ".[quality]"
uv pip uninstall transformers huggingface_hub && UV_PRERELEASE=allow uv pip install -U transformers@git+https://github.com/huggingface/transformers.git
uv pip uninstall tokenizers && uv pip install "tokenizers<=0.23.0"
uv pip uninstall accelerate && uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git --no-deps
- name: Environment
run: |
python utils/print_env.py
diffusers-cli env
- name: Run fast PyTorch Pipeline CPU tests
run: |

@@ -39,7 +39,7 @@ jobs:
uv pip install -e ".[quality]"
- name: Environment
run: |
python utils/print_env.py
diffusers-cli env
echo $(git --version)
- name: Fetch Tests
run: |
@@ -97,7 +97,7 @@ jobs:
- name: Environment
run: |
python utils/print_env.py
diffusers-cli env
- name: Run all selected tests on CPU
run: |
@@ -151,7 +151,7 @@ jobs:
- name: Environment
run: |
python utils/print_env.py
diffusers-cli env
- name: Run Hub tests for models, schedulers, and pipelines on a staging env
if: ${{ matrix.config.framework == 'hub_tests_pytorch' }}

@@ -29,6 +29,9 @@ env:
OMP_NUM_THREADS: 4
MKL_NUM_THREADS: 4
PYTEST_TIMEOUT: 60
# Force tokenizers<0.23.0 across every `uv pip install` in this workflow,
# even when transformers@main declares a higher lower-bound.
UV_OVERRIDE: /tmp/uv-overrides.txt
jobs:
check_code_quality:
@@ -68,6 +71,7 @@ jobs:
python utils/check_copies.py
python utils/check_dummies.py
python utils/check_support_list.py
python utils/check_forward_call_docstrings.py
make deps_table_check_updated
- name: Check if failure
if: ${{ failure() }}
@@ -116,14 +120,14 @@ jobs:
- name: Install dependencies
run: |
echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
uv pip install -e ".[quality]"
uv pip uninstall transformers huggingface_hub && UV_PRERELEASE=allow uv pip install -U transformers@git+https://github.com/huggingface/transformers.git
uv pip uninstall tokenizers && uv pip install "tokenizers<=0.23.0"
uv pip uninstall accelerate && uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git --no-deps
- name: Environment
run: |
python utils/print_env.py
diffusers-cli env
- name: Run fast PyTorch Pipeline CPU tests
if: ${{ matrix.config.framework == 'pytorch_pipelines' }}
@@ -193,13 +197,13 @@ jobs:
- name: Install dependencies
run: |
echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
uv pip install -e ".[quality]"
uv pip uninstall transformers huggingface_hub && UV_PRERELEASE=allow uv pip install -U transformers@git+https://github.com/huggingface/transformers.git
uv pip uninstall tokenizers && uv pip install "tokenizers<=0.23.0"
- name: Environment
run: |
python utils/print_env.py
diffusers-cli env
- name: Run Hub tests for models, schedulers, and pipelines on a staging env
if: ${{ matrix.config.framework == 'hub_tests_pytorch' }}
@@ -244,17 +248,16 @@ jobs:
- name: Install dependencies
run: |
echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
uv pip install -e ".[quality]"
# TODO (sayakpaul, DN6): revisit `--no-deps`
uv pip install -U peft@git+https://github.com/huggingface/peft.git --no-deps
uv pip install -U tokenizers
uv pip uninstall accelerate && uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git --no-deps
uv pip uninstall transformers huggingface_hub && UV_PRERELEASE=allow uv pip install -U transformers@git+https://github.com/huggingface/transformers.git
uv pip uninstall tokenizers && uv pip install "tokenizers<=0.23.0"
- name: Environment
run: |
python utils/print_env.py
diffusers-cli env
- name: Run fast PyTorch LoRA tests with PEFT
run: |

@@ -30,6 +30,9 @@ env:
HF_XET_HIGH_PERFORMANCE: 1
PYTEST_TIMEOUT: 600
PIPELINE_USAGE_CUTOFF: 1000000000 # set high cutoff so that only always-test pipelines run
# Force tokenizers<0.23.0 across every `uv pip install` in this workflow,
# even when transformers@main declares a higher lower-bound.
UV_OVERRIDE: /tmp/uv-overrides.txt
jobs:
check_code_quality:
@@ -69,6 +72,7 @@ jobs:
python utils/check_copies.py
python utils/check_dummies.py
python utils/check_support_list.py
python utils/check_forward_call_docstrings.py
make deps_table_check_updated
- name: Check if failure
if: ${{ failure() }}
@@ -91,10 +95,11 @@ jobs:
fetch-depth: 2
- name: Install dependencies
run: |
echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
uv pip install -e ".[quality]"
- name: Environment
run: |
python utils/print_env.py
diffusers-cli env
- name: Fetch Pipeline Matrix
id: fetch_pipeline_matrix
run: |
@@ -132,14 +137,14 @@ jobs:
nvidia-smi
- name: Install dependencies
run: |
echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
uv pip install -e ".[quality]"
uv pip uninstall accelerate && uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git
uv pip uninstall transformers huggingface_hub && UV_PRERELEASE=allow uv pip install -U transformers@git+https://github.com/huggingface/transformers.git
uv pip uninstall tokenizers && uv pip install "tokenizers<=0.23.0"
- name: Environment
run: |
python utils/print_env.py
diffusers-cli env
- name: Extract tests
id: extract_tests
run: |
@@ -202,15 +207,15 @@ jobs:
- name: Install dependencies
run: |
echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
uv pip install -e ".[quality]"
uv pip install peft@git+https://github.com/huggingface/peft.git
uv pip uninstall accelerate && uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git
uv pip uninstall transformers huggingface_hub && UV_PRERELEASE=allow uv pip install -U transformers@git+https://github.com/huggingface/transformers.git
uv pip uninstall tokenizers && uv pip install "tokenizers<=0.23.0"
- name: Environment
run: |
python utils/print_env.py
diffusers-cli env
- name: Extract tests
id: extract_tests
@@ -267,13 +272,13 @@ jobs:
nvidia-smi
- name: Install dependencies
run: |
echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
uv pip uninstall transformers huggingface_hub && UV_PRERELEASE=allow uv pip install -U transformers@git+https://github.com/huggingface/transformers.git
uv pip uninstall tokenizers && uv pip install "tokenizers<=0.23.0"
uv pip install -e ".[quality,training]"
- name: Environment
run: |
python utils/print_env.py
diffusers-cli env
- name: Run example tests on GPU
env:

@@ -20,6 +20,9 @@ env:
HF_XET_HIGH_PERFORMANCE: 1
PYTEST_TIMEOUT: 600
PIPELINE_USAGE_CUTOFF: 50000
# Force tokenizers<0.23.0 across every `uv pip install` in this workflow,
# even when transformers@main declares a higher lower-bound.
UV_OVERRIDE: /tmp/uv-overrides.txt
jobs:
setup_torch_cuda_pipeline_matrix:
@@ -37,10 +40,11 @@ jobs:
fetch-depth: 2
- name: Install dependencies
run: |
echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
uv pip install -e ".[quality]"
- name: Environment
run: |
python utils/print_env.py
diffusers-cli env
- name: Fetch Pipeline Matrix
id: fetch_pipeline_matrix
run: |
@@ -77,13 +81,13 @@ jobs:
nvidia-smi
- name: Install dependencies
run: |
echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
uv pip install -e ".[quality]"
uv pip uninstall accelerate && uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git
uv pip uninstall transformers huggingface_hub && UV_PRERELEASE=allow uv pip install -U transformers@git+https://github.com/huggingface/transformers.git
uv pip uninstall tokenizers && uv pip install "tokenizers<=0.23.0"
- name: Environment
run: |
python utils/print_env.py
diffusers-cli env
- name: PyTorch CUDA checkpoint tests on Ubuntu
env:
HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
@@ -129,15 +133,15 @@ jobs:
- name: Install dependencies
run: |
echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
uv pip install -e ".[quality]"
uv pip install peft@git+https://github.com/huggingface/peft.git
uv pip uninstall accelerate && uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git
uv pip uninstall transformers huggingface_hub && UV_PRERELEASE=allow uv pip install -U transformers@git+https://github.com/huggingface/transformers.git
uv pip uninstall tokenizers && uv pip install "tokenizers<=0.23.0"
- name: Environment
run: |
python utils/print_env.py
diffusers-cli env
- name: Run PyTorch CUDA tests
env:
@@ -184,12 +188,12 @@ jobs:
nvidia-smi
- name: Install dependencies
run: |
echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
uv pip install -e ".[quality,training]"
uv pip uninstall transformers huggingface_hub && UV_PRERELEASE=allow uv pip install -U transformers@git+https://github.com/huggingface/transformers.git
uv pip uninstall tokenizers && uv pip install "tokenizers<=0.23.0"
- name: Environment
run: |
python utils/print_env.py
diffusers-cli env
- name: Run example tests on GPU
env:
HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
@@ -228,10 +232,11 @@ jobs:
nvidia-smi
- name: Install dependencies
run: |
echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
uv pip install -e ".[quality,training]"
- name: Environment
run: |
python utils/print_env.py
diffusers-cli env
- name: Run example tests on GPU
env:
HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
@@ -268,11 +273,12 @@ jobs:
nvidia-smi
- name: Install dependencies
run: |
echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
uv pip install -e ".[quality,training]"
- name: Environment
run: |
python utils/print_env.py
diffusers-cli env
- name: Run example tests on GPU
env:

@@ -67,7 +67,7 @@ jobs:
- name: Environment
run: |
python utils/print_env.py
diffusers-cli env
- name: Run fast PyTorch CPU tests
if: ${{ matrix.config.framework == 'pytorch' }}

@@ -14,6 +14,9 @@ env:
HF_XET_HIGH_PERFORMANCE: 1
PYTEST_TIMEOUT: 600
RUN_SLOW: no
# Force tokenizers<0.23.0 across every `uv pip install` in this workflow,
# even when transformers@main declares a higher lower-bound.
UV_OVERRIDE: /tmp/uv-overrides.txt
concurrency:
group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
@@ -43,17 +46,17 @@ jobs:
- name: Install dependencies
shell: arch -arch arm64 bash {0}
run: |
echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
${CONDA_RUN} python -m pip install --upgrade pip uv
${CONDA_RUN} python -m uv pip install -e ".[quality]"
${CONDA_RUN} python -m uv pip install torch torchvision torchaudio
${CONDA_RUN} python -m uv pip install accelerate@git+https://github.com/huggingface/accelerate.git
${CONDA_RUN} python -m uv pip install transformers --upgrade
uv pip uninstall tokenizers && uv pip install "tokenizers<=0.23.0"
- name: Environment
shell: arch -arch arm64 bash {0}
run: |
${CONDA_RUN} python utils/print_env.py
${CONDA_RUN} diffusers-cli env
- name: Run fast PyTorch tests on M1 (MPS)
shell: arch -arch arm64 bash {0}

@@ -44,7 +44,6 @@ jobs:
run: |
pip install -U transformers
uv pip uninstall tokenizers && uv pip install "tokenizers<=0.23.0"
python utils/print_env.py
python -c "from diffusers import __version__; print(__version__)"
python -c "from diffusers import DiffusionPipeline; pipe = DiffusionPipeline.from_pretrained('fusing/unet-ldm-dummy-update'); pipe()"
python -c "from diffusers import DiffusionPipeline; pipe = DiffusionPipeline.from_pretrained('hf-internal-testing/tiny-stable-diffusion-pipe', safety_checker=None); pipe('ah suh du')"

@@ -19,6 +19,9 @@ env:
MKL_NUM_THREADS: 8
PYTEST_TIMEOUT: 600
PIPELINE_USAGE_CUTOFF: 50000
# Force tokenizers<0.23.0 across every `uv pip install` in this workflow,
# even when transformers@main declares a higher lower-bound.
UV_OVERRIDE: /tmp/uv-overrides.txt
jobs:
setup_torch_cuda_pipeline_matrix:
@@ -36,12 +39,12 @@ jobs:
fetch-depth: 2
- name: Install dependencies
run: |
echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
uv pip install -e ".[quality]"
uv pip uninstall transformers huggingface_hub && UV_PRERELEASE=allow uv pip install -U transformers@git+https://github.com/huggingface/transformers.git
uv pip uninstall tokenizers && uv pip install "tokenizers<=0.23.0"
- name: Environment
run: |
python utils/print_env.py
diffusers-cli env
- name: Fetch Pipeline Matrix
id: fetch_pipeline_matrix
run: |
@@ -78,13 +81,13 @@ jobs:
nvidia-smi
- name: Install dependencies
run: |
echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
uv pip install -e ".[quality]"
uv pip uninstall accelerate && uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git
uv pip uninstall transformers huggingface_hub && UV_PRERELEASE=allow uv pip install -U transformers@git+https://github.com/huggingface/transformers.git
uv pip uninstall tokenizers && uv pip install "tokenizers<=0.23.0"
- name: Environment
run: |
python utils/print_env.py
diffusers-cli env
- name: Slow PyTorch CUDA checkpoint tests on Ubuntu
env:
HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
@@ -130,15 +133,15 @@ jobs:
- name: Install dependencies
run: |
echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
uv pip install -e ".[quality]"
uv pip install peft@git+https://github.com/huggingface/peft.git
uv pip uninstall accelerate && uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git
uv pip uninstall transformers huggingface_hub && UV_PRERELEASE=allow uv pip install -U transformers@git+https://github.com/huggingface/transformers.git
uv pip uninstall tokenizers && uv pip install "tokenizers<=0.23.0"
- name: Environment
run: |
python utils/print_env.py
diffusers-cli env
- name: Run PyTorch CUDA tests
env:
@@ -182,15 +185,15 @@ jobs:
- name: Install dependencies
run: |
echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
uv pip install -e ".[quality]"
uv pip install peft@git+https://github.com/huggingface/peft.git
uv pip uninstall accelerate && uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git
uv pip uninstall transformers huggingface_hub && UV_PRERELEASE=allow uv pip install -U transformers@git+https://github.com/huggingface/transformers.git
uv pip uninstall tokenizers && uv pip install "tokenizers<=0.23.0"
- name: Environment
run: |
python utils/print_env.py
diffusers-cli env
- name: Run PyTorch CUDA tests
env:
@@ -243,12 +246,12 @@ jobs:
nvidia-smi
- name: Install dependencies
run: |
echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
uv pip install -e ".[quality,training]"
uv pip uninstall transformers huggingface_hub && UV_PRERELEASE=allow uv pip install -U transformers@git+https://github.com/huggingface/transformers.git
uv pip uninstall tokenizers && uv pip install "tokenizers<=0.23.0"
- name: Environment
run: |
python utils/print_env.py
diffusers-cli env
- name: Run torch compile tests on GPU
env:
HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
@@ -287,12 +290,12 @@ jobs:
nvidia-smi
- name: Install dependencies
run: |
echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
uv pip install -e ".[quality,training]"
uv pip uninstall transformers huggingface_hub && UV_PRERELEASE=allow uv pip install -U transformers@git+https://github.com/huggingface/transformers.git
uv pip uninstall tokenizers && uv pip install "tokenizers<=0.23.0"
- name: Environment
run: |
python utils/print_env.py
diffusers-cli env
- name: Run example tests on GPU
env:
HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
@@ -331,13 +334,13 @@ jobs:
- name: Install dependencies
run: |
echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
uv pip install -e ".[quality,training]"
uv pip uninstall transformers huggingface_hub && UV_PRERELEASE=allow uv pip install -U transformers@git+https://github.com/huggingface/transformers.git
uv pip uninstall tokenizers && uv pip install "tokenizers<=0.23.0"
- name: Environment
run: |
python utils/print_env.py
diffusers-cli env
- name: Run example tests on GPU
env:

@@ -36,6 +36,7 @@ repo-consistency:
python utils/check_dummies.py
python utils/check_repo.py
python utils/check_inits.py
python utils/check_forward_call_docstrings.py
# this target runs checks on all files
@@ -74,6 +75,10 @@ fix-copies:
modular-autodoctrings:
python utils/modular_auto_docstring.py
# Verify forward() / __call__() arguments are documented in their docstrings
check-forward-call-docstrings:
python utils/check_forward_call_docstrings.py
# Run tests for the library
test:
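
To give a feel for what a check like `check-forward-call-docstrings` has to do, here is a minimal sketch that reports parameters missing from a docstring. It is not the contents of `utils/check_forward_call_docstrings.py`, which this diff does not show; `undocumented_args` and `ToyModel` are hypothetical, and the word-match rule is a deliberately crude assumption.

```python
import inspect
import re


def undocumented_args(func):
    """Return parameter names of `func` that never appear as a word in its docstring."""
    doc = inspect.getdoc(func) or ""
    missing = []
    for name, param in inspect.signature(func).parameters.items():
        if name in ("self", "cls"):
            continue
        # *args / **kwargs are skipped in this sketch.
        if param.kind in (inspect.Parameter.VAR_POSITIONAL, inspect.Parameter.VAR_KEYWORD):
            continue
        if not re.search(rf"\b{re.escape(name)}\b", doc):
            missing.append(name)
    return missing


class ToyModel:
    def forward(self, hidden_states, timestep, guidance=None):
        """
        Args:
            hidden_states (`list`): Input values.
            timestep (`int`): Current step.
        """
        return hidden_states


print(undocumented_args(ToyModel.forward))
```

A stricter check would parse the `Args:` section rather than search the whole docstring, at the cost of more code.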

@@ -299,6 +299,10 @@
title: AceStepTransformer1DModel
- local: api/models/allegro_transformer3d
title: AllegroTransformer3DModel
- local: api/models/anyflow_far_transformer3d
title: AnyFlowFARTransformer3DModel
- local: api/models/anyflow_transformer3d
title: AnyFlowTransformer3DModel
- local: api/models/aura_flow_transformer2d
title: AuraFlowTransformer2DModel
- local: api/models/transformer_bria_fibo
@@ -631,6 +635,8 @@
- sections:
- local: api/pipelines/allegro
title: Allegro
- local: api/pipelines/anyflow
title: AnyFlow
- local: api/pipelines/chronoedit
title: ChronoEdit
- local: api/pipelines/cogvideox
@@ -706,6 +712,8 @@
title: EulerAncestralDiscreteScheduler
- local: api/schedulers/euler
title: EulerDiscreteScheduler
- local: api/schedulers/flow_map_euler_discrete
title: FlowMapEulerDiscreteScheduler
- local: api/schedulers/flow_match_euler_discrete
title: FlowMatchEulerDiscreteScheduler
- local: api/schedulers/flow_match_heun_discrete

@@ -0,0 +1,45 @@
<!-- Copyright 2026 The AnyFlow Team, NVIDIA Corp., and The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the
License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# AnyFlowFARTransformer3DModel
The causal (FAR) 3D Transformer used by [`AnyFlowFARPipeline`](../pipelines/anyflow#anyflowfarpipeline) —
the FAR variant of [AnyFlow](https://huggingface.co/papers/2605.13724) (Yuchao Gu, Guian Fang et al., NUS
ShowLab × NVIDIA). It extends the v0.35.1 Wan2.1 backbone with three additions:
1. **FAR causal block-mask** via `torch.nn.attention.flex_attention`, supporting frame-level autoregressive
generation as introduced in [FAR (Gu et al., 2025)](https://arxiv.org/abs/2503.19325).
2. **Compressed-frame patch embedding** (`far_patch_embedding`) for context (already-generated) frames,
warm-started from the full-resolution `patch_embedding` at construction time via trilinear interpolation.
3. **Dual-timestep flow-map embedding** (same as
[`AnyFlowTransformer3DModel`](anyflow_transformer3d)): every forward call conditions on both the source
timestep `t` and the target timestep `r`.
The chunk schedule (`chunk_partition`) is **not** baked into the model config. It is a per-call argument to
`forward`, so the same checkpoint handles different `num_frames` configurations without retraining.
```python
from diffusers import AnyFlowFARTransformer3DModel
# Causal AnyFlow checkpoint (FAR):
transformer = AnyFlowFARTransformer3DModel.from_pretrained(
"nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers", subfolder="transformer"
)
```
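
To make the idea of a frame-level causal block mask concrete, here is a plain-Python sketch at frame granularity. It assumes `chunk_partition` is a list of per-chunk frame counts and that a frame may attend to every frame in its own chunk and in earlier chunks. Both are assumptions made for illustration and may not match the model's actual argument format or masking rule. `block_causal_mask` is a hypothetical helper, not part of diffusers, and it does not use `flex_attention`.

```python
def block_causal_mask(chunk_partition):
    """Build a boolean [query][key] mask: True where the query frame may attend to the key frame."""
    # Map each frame index to the index of the chunk it belongs to.
    chunk_of_frame = []
    for chunk_idx, num_frames in enumerate(chunk_partition):
        chunk_of_frame.extend([chunk_idx] * num_frames)

    total = len(chunk_of_frame)
    # A query frame sees keys from its own chunk and from all earlier chunks.
    return [
        [chunk_of_frame[k] <= chunk_of_frame[q] for k in range(total)]
        for q in range(total)
    ]


for row in block_causal_mask([1, 2]):
    print(row)
```

Because the partition is an input rather than a stored constant, the same function covers any split of frames into chunks, which mirrors why the schedule is passed per call.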
## AnyFlowFARTransformer3DModel
[[autodoc]] AnyFlowFARTransformer3DModel
## AnyFlowFARTransformerOutput
[[autodoc]] models.transformers.transformer_anyflow_far.AnyFlowFARTransformerOutput

View File

@@ -0,0 +1,36 @@
<!-- Copyright 2026 The AnyFlow Team, NVIDIA Corp., and The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the
License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# AnyFlowTransformer3DModel
The bidirectional 3D Transformer used by [`AnyFlowPipeline`](../pipelines/anyflow#anyflowpipeline). It is the
v0.35.1 Wan2.1 backbone with one structural change: the timestep embedder is replaced by
``AnyFlowDualTimestepTextImageEmbedding``, so every forward call conditions on both the source timestep
``t`` and the target timestep ``r``. This is the embedding required to learn the flow map
$\Phi_{r\leftarrow t}$ introduced in
[AnyFlow](https://huggingface.co/papers/2605.13724) (Yuchao Gu, Guian Fang et al., NUS ShowLab × NVIDIA).
For frame-level autoregressive (FAR causal) generation, use
[`AnyFlowFARTransformer3DModel`](anyflow_far_transformer3d) instead.
```python
from diffusers import AnyFlowTransformer3DModel
# Bidirectional AnyFlow checkpoint (T2V):
transformer = AnyFlowTransformer3DModel.from_pretrained(
"nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers", subfolder="transformer"
)
```
## AnyFlowTransformer3DModel
[[autodoc]] AnyFlowTransformer3DModel

View File

@@ -26,7 +26,7 @@ ACE-Step 1.5 ships three DiT checkpoints that share the same transformer archite
| Variant | CFG | Default steps | Default `guidance_scale` | Default `shift` | HF repo |
|---------|:---:|:-------------:|:------------------------:|:---------------:|---------|
| `turbo` (guidance-distilled) | off | 8 | ignored | 3.0 | [`ACE-Step/Ace-Step1.5`](https://huggingface.co/ACE-Step/Ace-Step1.5) |
| `turbo` (guidance-distilled) | off | 8 | ignored | 3.0 | [`ACE-Step/acestep-v15-xl-turbo-diffusers`](https://huggingface.co/ACE-Step/acestep-v15-xl-turbo-diffusers) |
| `base` | on | 8 | 7.0 | 3.0 | [`ACE-Step/acestep-v15-base`](https://huggingface.co/ACE-Step/acestep-v15-base) |
| `sft` | on | 8 | 7.0 | 3.0 | [`ACE-Step/acestep-v15-sft`](https://huggingface.co/ACE-Step/acestep-v15-sft) |
@@ -54,7 +54,7 @@ import torch
import soundfile as sf
from diffusers import AceStepPipeline
pipe = AceStepPipeline.from_pretrained("ACE-Step/Ace-Step1.5", torch_dtype=torch.bfloat16)
pipe = AceStepPipeline.from_pretrained("ACE-Step/acestep-v15-xl-turbo-diffusers", torch_dtype=torch.bfloat16)
pipe = pipe.to("cuda")
audio = pipe(

View File

@@ -0,0 +1,218 @@
<!-- Copyright 2026 The AnyFlow Team, NVIDIA Corp., and The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the
License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
<div style="float: right;">
<div class="flex flex-wrap space-x-1">
<a href="https://github.com/huggingface/diffusers/blob/main/src/diffusers/loaders/lora_pipeline.py">
<img alt="LoRA" src="https://img.shields.io/badge/LoRA-supported-green">
</a>
</div>
</div>
# AnyFlow
[AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation](https://huggingface.co/papers/2605.13724) by Yuchao Gu, Guian Fang and collaborators at [NUS ShowLab](https://sites.google.com/view/showlab) in collaboration with NVIDIA.
*Few-step video generation has been significantly advanced by consistency models. However, their performance often degrades in any-step video diffusion models due to the fixed-point formulation. To address this limitation, we present AnyFlow, the first any-step video diffusion distillation framework built on flow maps. Instead of learning only the mapping z_t → z_0, AnyFlow learns transitions z_t → z_r over arbitrary time intervals, enabling a single model to adapt to different inference budgets. We design an improved forward flow map training recipe that fine-tunes pretrained video diffusion models into flow map models, and introduce Flow Map Backward Simulation to enable on-policy distillation for flow map models. Extensive experiments across both bidirectional and causal architectures, at scales ranging from 1.3B to 14B, on text-to-video and image-to-video tasks demonstrate that AnyFlow outperforms consistency-based baselines while preserving high fidelity and flexible sampling under varying step budgets.*
The original training code is at [`NVlabs/AnyFlow`](https://github.com/NVlabs/AnyFlow). The project page is at [nvlabs.github.io/AnyFlow](https://nvlabs.github.io/AnyFlow).
The following AnyFlow checkpoints are supported:
| Checkpoint | Backbone | Description |
|------------|----------|-------------|
| [`nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers`](https://huggingface.co/nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers) | Wan2.1 1.3B | Bidirectional T2V, lightweight |
| [`nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers`](https://huggingface.co/nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers) | Wan2.1 14B | Bidirectional T2V, full quality |
| [`nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers`](https://huggingface.co/nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers) | FAR + Wan2.1 1.3B | Causal T2V / I2V / V2V |
| [`nvidia/AnyFlow-FAR-Wan2.1-14B-Diffusers`](https://huggingface.co/nvidia/AnyFlow-FAR-Wan2.1-14B-Diffusers) | FAR + Wan2.1 14B | Causal T2V / I2V / V2V |
All four are grouped under the [`nvidia/anyflow`](https://huggingface.co/collections/nvidia/anyflow) Hugging Face collection.
> [!TIP]
> Choose `AnyFlowPipeline` for traditional bidirectional text-to-video generation. Choose `AnyFlowFARPipeline` for streaming I2V, video continuation (V2V), or any setup that benefits from frame-by-frame autoregressive sampling.
> [!TIP]
> AnyFlow supports any-step sampling: a single distilled checkpoint can be evaluated at 1, 2, 4, 8, 16... NFE without retraining. Quality scales monotonically with steps in our benchmarks.
### Optimizing Memory and Inference Speed
<hfoptions id="optimization">
<hfoption id="memory">
```py
import torch
from diffusers import AnyFlowPipeline
from diffusers.hooks import apply_group_offloading
pipe = AnyFlowPipeline.from_pretrained(
"nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers", torch_dtype=torch.bfloat16
)
apply_group_offloading(pipe.transformer, onload_device="cuda", offload_type="leaf_level")
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
```
</hfoption>
<hfoption id="inference speed">
```py
import torch
from diffusers import AnyFlowPipeline
pipe = AnyFlowPipeline.from_pretrained(
"nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune-no-cudagraphs")
```
</hfoption>
</hfoptions>
### Generation with AnyFlow (Bidirectional T2V)
<hfoptions id="anyflow-bidi">
<hfoption id="usage">
```py
import torch
from diffusers import AnyFlowPipeline
from diffusers.utils import export_to_video
pipe = AnyFlowPipeline.from_pretrained(
"nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")
prompt = "A red panda eating bamboo in a forest, cinematic lighting"
video = pipe(prompt, num_inference_steps=4, num_frames=33).frames[0]
export_to_video(video, "out.mp4", fps=16)
```
</hfoption>
</hfoptions>
### Generation with AnyFlow (FAR Causal)
The causal pipeline selects between T2V, I2V, and V2V through the ``video`` (or ``video_latents``) argument.
Omit both for plain text-to-video, or pass ``video=<tensor>`` of shape ``(B, T, C, H, W)`` with values in
``[0, 1]`` and ``T = 4n + 1`` to condition on existing frames. Use a single conditioning frame for I2V and a
longer clip for V2V continuation. If you already have pre-encoded latents in the model layout, pass them via
``video_latents=<tensor>`` to skip VAE encoding. ``video`` and ``video_latents`` are mutually exclusive.
> [!IMPORTANT]
> `AnyFlowFARPipeline.default_chunk_partition = [1, 3, 3, 3, 3, 3, 3, 2]` (sum 21) is matched to the
> released checkpoints' canonical 81 raw frames (21 latent frames at the VAE temporal stride of 4). When
> you change `num_frames`, you must also pass a matching `chunk_partition` summing to
> `(num_frames - 1) // 4 + 1`, otherwise the pipeline raises an `AssertionError`.
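
The arithmetic in the note above can be checked before calling the pipeline. The following is a small sketch, not part of the diffusers API: `num_latent_frames` and `check_chunk_partition` are hypothetical helper names, and the `4n + 1` requirement on `num_frames` is an assumption carried over from the constraint stated for the conditioning clip.

```python
def num_latent_frames(num_frames, temporal_stride=4):
    """Latent frame count for `num_frames` raw frames: (num_frames - 1) // stride + 1."""
    if (num_frames - 1) % temporal_stride != 0:
        raise ValueError(f"num_frames must be of the form {temporal_stride}n + 1, got {num_frames}")
    return (num_frames - 1) // temporal_stride + 1


def check_chunk_partition(num_frames, chunk_partition):
    """Return True when the chunk sizes add up to the latent frame count."""
    return sum(chunk_partition) == num_latent_frames(num_frames)


default_partition = [1, 3, 3, 3, 3, 3, 3, 2]
print(num_latent_frames(81))                         # 21
print(check_chunk_partition(81, default_partition))  # True
print(check_chunk_partition(33, default_partition))  # False: 33 raw frames give 9 latent frames
```

Running such a check up front surfaces a mismatched schedule as a readable message instead of an `AssertionError` from inside the pipeline.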
<hfoptions id="anyflow-far">
<hfoption id="t2v">
```py
import torch
from diffusers import AnyFlowFARPipeline
from diffusers.utils import export_to_video
pipe = AnyFlowFARPipeline.from_pretrained(
"nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")
video = pipe(
prompt="A cat surfing a wave, sunset",
num_inference_steps=4,
num_frames=81,
).frames[0]
export_to_video(video, "out.mp4", fps=16)
```
</hfoption>
<hfoption id="i2v">
```py
import numpy as np
import torch
from diffusers import AnyFlowFARPipeline
from diffusers.utils import export_to_video, load_image
pipe = AnyFlowFARPipeline.from_pretrained(
"nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")
# Wrap the conditioning image as a one-frame video tensor: (1, 1, 3, H, W) in [0, 1].
first_frame = load_image("path/to/first_frame.png").resize((832, 480))
arr = np.asarray(first_frame).astype("float32") / 255.0 # (480, 832, 3)
context_tensor = torch.from_numpy(arr).permute(2, 0, 1).unsqueeze(0).unsqueeze(1).to("cuda")
video = pipe(
prompt="a cat walks across a sunlit lawn",
video=context_tensor,
num_inference_steps=4,
num_frames=81,
).frames[0]
export_to_video(video, "out.mp4", fps=16)
```
</hfoption>
<hfoption id="v2v">
```py
import numpy as np
import torch
from diffusers import AnyFlowFARPipeline
from diffusers.utils import export_to_video, load_video
pipe = AnyFlowFARPipeline.from_pretrained(
"nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")
# Context clip — 9 raw frames map to 3 latent frames (9 = 4·2 + 1, 3 = 2 + 1).
context_frames = load_video("path/to/context.mp4")[:9]
arr = np.stack([np.asarray(f.resize((832, 480))) for f in context_frames]).astype("float32") / 255.0
# np.stack gives (T, H, W, C) = (9, 480, 832, 3) → permute to (T, C, H, W) then add batch.
context_tensor = torch.from_numpy(arr).permute(0, 3, 1, 2).unsqueeze(0).to("cuda") # (1, 9, 3, 480, 832)
video = pipe(
prompt="continue the story",
video=context_tensor,
num_inference_steps=4,
num_frames=81,
# Override chunk_partition so the first chunk covers exactly the 3 latent context frames.
chunk_partition=[3, 3, 3, 3, 3, 3, 3],
).frames[0]
export_to_video(video, "out.mp4", fps=16)
```
</hfoption>
</hfoptions>
## Notes
- Classifier-free guidance is fused into the released checkpoints, so inference does not run a second guided forward pass. Keep the default `guidance_scale=1.0` unless your own checkpoint requires otherwise.
- `FlowMapEulerDiscreteScheduler` is general-purpose. You can attach it to any flow-map-distilled checkpoint via `from_pretrained(..., scheduler=FlowMapEulerDiscreteScheduler.from_config(...))`.
- `AnyFlowPipeline` uses [`AnyFlowTransformer3DModel`](../models/anyflow_transformer3d) (bidirectional). `AnyFlowFARPipeline` uses [`AnyFlowFARTransformer3DModel`](../models/anyflow_far_transformer3d), which adds a compressed-frame patch embedding and the FAR causal block-mask.
- LoRA loading is supported via `WanLoraLoaderMixin`, the same mixin used by the upstream Wan pipelines.
- For training recipes (forward flow-map training and on-policy distillation), refer to the original AnyFlow training framework at [`NVlabs/AnyFlow`](https://github.com/NVlabs/AnyFlow); training is out of scope for diffusers.
## AnyFlowPipeline
[[autodoc]] AnyFlowPipeline
- all
- __call__
## AnyFlowFARPipeline
[[autodoc]] AnyFlowFARPipeline
- all
- __call__
## AnyFlowPipelineOutput
[[autodoc]] pipelines.anyflow.pipeline_output.AnyFlowPipelineOutput

View File

@@ -377,7 +377,7 @@ height = 512
random_seed = 42
frame_rate = 24.0
generator = torch.Generator(device).manual_seed(random_seed)
model_path = "dg845/LTX-2.3-Diffusers"
model_path = "diffusers/LTX-2.3-Diffusers"
pipe = LTX2ImageToVideoPipeline.from_pretrained(model_path, torch_dtype=torch.bfloat16)
pipe.enable_sequential_cpu_offload(device=device)
@@ -449,7 +449,7 @@ height = 512
random_seed = 42
frame_rate = 24.0
generator = torch.Generator(device).manual_seed(random_seed)
model_path = "dg845/LTX-2.3-Diffusers"
model_path = "diffusers/LTX-2.3-Diffusers"
pipe = LTX2Pipeline.from_pretrained(model_path, torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload(device=device)

View File

@@ -0,0 +1,28 @@
<!-- Copyright 2026 The AnyFlow Team, NVIDIA Corp., and The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the
License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# FlowMapEulerDiscreteScheduler
`FlowMapEulerDiscreteScheduler` is an Euler-style sampler designed for flow-map-distilled diffusion
models. Flow-map models learn arbitrary-interval transitions $\mathbf{z}_t \to \mathbf{z}_r$ rather than
the fixed $\mathbf{z}_t \to \mathbf{z}_0$ mapping of consistency models. Both endpoints of the step are
caller-provided, which is what enables any-step sampling: a single distilled checkpoint can be evaluated at
1, 2, 4, 8, 16... NFE without retraining.
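As a rough illustration of why caller-provided endpoints allow any-step sampling, the sketch below uses a toy setting in which the flow is a straight line, so the average velocity over any interval is constant. Everything here is an assumption made for illustration: the function names, the sign convention, the uniform time grid (no `shift`), and the oracle standing in for the network. It is not the scheduler's actual implementation.

```python
def flow_map_step(z_t, t, r, average_velocity):
    """Euler-style jump from time t to time r using the average velocity over [r, t]."""
    return z_t + (r - t) * average_velocity(z_t, t, r)


def sample(z_1, num_steps, average_velocity):
    """Walk from t = 1 (noise) to t = 0 (data) in `num_steps` equal intervals."""
    z = z_1
    for i in range(num_steps):
        t = 1.0 - i / num_steps
        r = 1.0 - (i + 1) / num_steps
        z = flow_map_step(z, t, r, average_velocity)
    return z


# Toy oracle for a straight path z_t = (1 - t) * x0 + t * noise: the velocity is constant.
x0, noise = 2.0, -1.0
oracle = lambda z_t, t, r: noise - x0

for nfe in (1, 2, 4, 8):
    print(nfe, round(sample(noise, nfe, oracle), 6))
```

With an exact flow map, every step count lands on the same endpoint `x0`, because consecutive jumps compose into one jump over the whole interval. A learned model only approximates this, which is where the step budget starts to matter.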
The scheduler was introduced in
[AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation](https://huggingface.co/papers/2605.13724)
and ships with the `AnyFlowPipeline` and `AnyFlowFARPipeline` integrations, but it is not
AnyFlow-specific — any flow-map-distilled checkpoint can use it.
## FlowMapEulerDiscreteScheduler
[[autodoc]] FlowMapEulerDiscreteScheduler

View File

@@ -130,6 +130,8 @@
- title: Specific pipeline examples
isExpanded: false
sections:
- local: using-diffusers/anyflow
title: AnyFlow
- local: using-diffusers/consisid
title: ConsisID
- local: using-diffusers/helios

View File

@@ -0,0 +1,253 @@
<!-- Copyright 2026 The AnyFlow Team, NVIDIA Corp., and The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is
distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See
the License for the specific language governing permissions and limitations under the License.
-->
# AnyFlow
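[AnyFlow](https://huggingface.co/papers/2605.13724) is a video diffusion **distillation** framework. It distills
a pretrained Wan2.1 teacher model into a student model that supports *any-step* sampling with a standard Euler
sampler. The same distilled checkpoint can run inference at 1, 2, 4, 8, 16... NFE, and **quality improves
monotonically with the number of steps**. This differs from consistency models, whose quality often drops as NFE
increases.
The core idea is to learn the **flow map** $\Phi_{r\leftarrow t}: \mathbf{z}_t \to \mathbf{z}_r$ (for any
$1 \ge t \ge r \ge 0$) rather than the fixed-endpoint mapping $\mathbf{z}_t \to \mathbf{z}_0$ learned by
consistency models. The composability of flow maps removes re-noising between sampling steps. The on-policy
distillation stage additionally uses **DMD reverse-divergence supervision** + **Flow-Map backward simulation**
(a 3-segment shortcut) to close the exposure-bias gap left by consistency distillation.
AnyFlow was developed by Yuchao Gu, Guian Fang and collaborators at [NUS ShowLab](https://sites.google.com/view/showlab) in collaboration with NVIDIA. The original training code is at [`NVlabs/AnyFlow`](https://github.com/NVlabs/AnyFlow), and the project page is [nvlabs.github.io/AnyFlow](https://nvlabs.github.io/AnyFlow). The 4 released checkpoints are grouped under the [`nvidia/anyflow`](https://huggingface.co/collections/nvidia/anyflow) Hugging Face collection.
This guide covers the practical essentials: how to choose a pipeline, how to use any-step sampling, and how to fit AnyFlow into T2V / I2V / V2V workflows.
## Bidirectional or Causal: choosing a pipeline
AnyFlow ships two pipeline variants. The scheduler and the distillation method are the same; the difference is **how frames are sampled**:
- [`AnyFlowPipeline`](../api/pipelines/anyflow#anyflowpipeline): **bidirectional** T2V. It denoises the whole
video tensor at once with global self-attention. Choose it for **prompt-only input without streaming output**.
- [`AnyFlowFARPipeline`](../api/pipelines/anyflow#anyflowfarpipeline): **causal (FAR)**. It denoises chunk by
chunk with block-sparse causal attention and reuses the KV cache across chunks. Choose it for
**image-to-video (I2V)**, **video continuation (V2V)**, or any scenario that benefits from frame-by-frame
autoregressive sampling. The same model switches between the three task modes through two mutually exclusive
kwargs: `video` (pixel space) or `video_latents` (pre-encoded latents).
A simplified comparison:
| Scenario | Pipeline | Call |
|----------|----------|------|
| Plain text-to-video, maximum quality at a fixed NFE | `AnyFlowPipeline` | `pipe(prompt, ...)` |
| Image-to-video (first frame given) | `AnyFlowFARPipeline` | `pipe(prompt, video=<single-frame tensor>, ...)` |
| Video continuation / V2V | `AnyFlowFARPipeline` | `pipe(prompt, video=<multi-frame tensor>, ...)` |
| Streaming / progressive generation | `AnyFlowFARPipeline` | - |
At high resolution, the bidirectional pipeline is faster per token. The causal pipeline gives up some per-step
speed in exchange for being able to start sampling before all latent frames are allocated, which is especially
useful for very long sequences.
## Loading checkpoints
NVIDIA released 4 AnyFlow checkpoints, one for each pipeline × scale combination: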
```py
import torch
from diffusers import AnyFlowPipeline, AnyFlowFARPipeline
# Bidirectional, lightweight
pipe = AnyFlowPipeline.from_pretrained(
"nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")
# Bidirectional, full quality
pipe = AnyFlowPipeline.from_pretrained(
"nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")
# Causal (FAR), 1.3B
pipe = AnyFlowFARPipeline.from_pretrained(
"nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")
# Causal (FAR), 14B
pipe = AnyFlowFARPipeline.from_pretrained(
"nvidia/AnyFlow-FAR-Wan2.1-14B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")
```
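All four checkpoints share the same [`FlowMapEulerDiscreteScheduler`](../api/schedulers/flow_map_euler_discrete)
(default `shift=5.0`).
## Any-step sampling
AnyFlow's key property is that the same checkpoint needs **no rescheduling**: the larger the NFE, the higher the
quality. Fix the prompt and sweep the step count to see how the model trades latency against fidelity: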
```py
import torch
from diffusers import AnyFlowPipeline
from diffusers.utils import export_to_video
pipe = AnyFlowPipeline.from_pretrained(
"nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")
prompt = "A red panda munching bamboo in a forest, cinematic lighting"
for nfe in [1, 2, 4, 8, 16, 32]:
    # Rebuild the generator on every iteration so that NFE is the only variable across runs.
    generator = torch.Generator("cuda").manual_seed(0)
    video = pipe(prompt, num_inference_steps=nfe, num_frames=33, generator=generator).frames[0]
    export_to_video(video, f"out_nfe{nfe}.mp4", fps=16)
```
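Table 3 and Fig. 1 of the paper show that the VBench Quality of every AnyFlow checkpoint rises monotonically over
the 4 → 32 NFE range, whereas consistency-style baselines (rCM, Self-Forcing) degrade over the same range.
> [!TIP]
> Classifier-free guidance (CFG) is already fused into the weights during training. At inference the pipeline
> does **not** run an extra unconditional forward pass; guidance comes directly from the distilled weights. The
> released checkpoints all work with the default `guidance_scale=1.0`.
## Image-to-video and video continuation
The causal pipeline supports three task modes with the same distilled model, **selected by passing either `video` or `video_latents`**:
- `video`: a pixel-space tensor of shape `(B, T, C, H, W)` with values in `[0, 1]`. The pipeline runs it through
`VideoProcessor` and VAE encoding internally;
- `video_latents`: latents already in the model layout, so VAE encoding is skipped;
- neither passed: plain text-to-video;
- both passed: raises `ValueError` (they are mutually exclusive).
The number of frames in the context tensor must satisfy `T = 4n + 1` to align with the VAE temporal stride.
> [!IMPORTANT]
> The FAR pipeline performs a chunked rollout, so `num_frames` must agree with the chunk schedule. The default
> `chunk_partition=[1, 3, 3, 3, 3, 3, 3, 2]` (sum 21) corresponds to the released checkpoints' standard
> `num_frames=81` (21 = (81 - 1) // 4 + 1). When you change `num_frames`, you **must** explicitly pass a
> matching `chunk_partition` whose sum equals `(num_frames - 1) // 4 + 1`; otherwise the pipeline raises an
> `AssertionError`. For example, `num_frames=33` corresponds to 9 latent frames, for which
> `chunk_partition=[1, 4, 4]` works.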
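
The three partitions quoted in this guide follow one pattern: a first chunk sized to the context, then fixed-size chunks, then a remainder. The sketch below reproduces that pattern under stated assumptions. `build_chunk_partition` is a hypothetical helper, not a diffusers function, and whether the released checkpoints accept every partition it can produce is not something this guide confirms.

```python
def build_chunk_partition(num_frames, context_latents=1, chunk_size=3, temporal_stride=4):
    """Split the latent frames into [context, chunk_size, ..., remainder]."""
    total = (num_frames - 1) // temporal_stride + 1
    if context_latents > total:
        raise ValueError("context is longer than the clip")
    partition = [context_latents]
    remaining = total - context_latents
    while remaining > 0:
        size = min(chunk_size, remaining)
        partition.append(size)
        remaining -= size
    return partition


print(build_chunk_partition(81))                     # [1, 3, 3, 3, 3, 3, 3, 2]
print(build_chunk_partition(81, context_latents=3))  # [3, 3, 3, 3, 3, 3, 3]
print(build_chunk_partition(33, chunk_size=4))       # [1, 4, 4]
```

A context clip of `T = 4n + 1` raw frames contributes `n + 1` latent frames, so a 9-frame clip corresponds to `context_latents=3`.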
```py
import numpy as np
import torch
from diffusers import AnyFlowFARPipeline
from diffusers.utils import export_to_video, load_image, load_video
pipe = AnyFlowFARPipeline.from_pretrained(
"nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")
def to_video_tensor(images, height=480, width=832):
    """Convert a list of PIL images into the (B, T, C, H, W) tensor in [0, 1] expected by the FAR pipeline."""
    frames = np.stack([np.asarray(img.resize((width, height))) for img in images]).astype("float32") / 255.0
    # frames: (T, H, W, C) → (T, C, H, W) → add batch dim → (1, T, C, H, W)
    return torch.from_numpy(frames).permute(0, 3, 1, 2).unsqueeze(0)
# 1) Text-to-video (no context). 81 frames match the default chunk_partition.
video = pipe(prompt="A cat surfing at sunset", num_inference_steps=4, num_frames=81).frames[0]
export_to_video(video, "t2v.mp4", fps=16)
# 2) Image-to-video: a single context frame becomes 1 latent after the VAE,
#    which matches the first entry of the default chunk_partition (`[1, ...]`).
first_frame = load_image("path/to/first_frame.png")
context_tensor = to_video_tensor([first_frame]).to("cuda")  # (1, 1, 3, 480, 832), [0, 1]
video = pipe(
    prompt="A cat walks across a sunlit lawn",
    video=context_tensor,
    num_inference_steps=4,
    num_frames=81,
).frames[0]
export_to_video(video, "i2v.mp4", fps=16)
# 3) Video continuation. 9 raw context frames → 3 latent context frames;
#    override chunk_partition explicitly so the first chunk covers exactly the context.
context_frames = load_video("path/to/context.mp4")[:9]  # 9 = 4·2 + 1
context_tensor = to_video_tensor(context_frames).to("cuda")  # (1, 9, 3, 480, 832)
video = pipe(
    prompt="Continue the story",
    video=context_tensor,
    num_inference_steps=4,
    num_frames=81,
    chunk_partition=[3, 3, 3, 3, 3, 3, 3],  # 7 chunks × 3 = 21 latents; the first chunk is the context
).frames[0]
export_to_video(video, "v2v.mp4", fps=16)
```
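Under the hood, the per-chunk patchify schedule adjusts automatically depending on whether `video` /
`video_latents` is given: plain text-to-video uses kernel 2 (full) and 4 (compressed); with context, the first
chunk switches to kernel 1 so the conditioning frames keep full resolution.
If you already have VAE-encoded latents, pass `video_latents=<tensor>` directly to skip the `vae_encode` step
(mutually exclusive with `video`).
## Memory and inference speed
The 14B AnyFlow models fit on a single 40 GB GPU with group offloading + VAE slicing: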
```py
import torch
from diffusers import AnyFlowPipeline
from diffusers.hooks import apply_group_offloading
pipe = AnyFlowPipeline.from_pretrained(
"nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers", torch_dtype=torch.bfloat16
)
apply_group_offloading(pipe.transformer, onload_device="cuda", offload_type="leaf_level")
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
```
For latency, `torch.compile` works well on the transformer (the heaviest module):
```py
pipe = pipe.to("cuda")
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune-no-cudagraphs")
```
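The compilation overhead amortizes after a few steps. Combined with AnyFlow's low NFE (4-8 steps),
`torch.compile` gives a clear speedup over eager mode on the 14B model.
## LoRA fine-tuning
Both pipelines reuse [`WanLoraLoaderMixin`](../api/loaders/lora), so LoRA adapters trained for the corresponding
Wan2.1 backbone load directly: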
```py
pipe.load_lora_weights("path/or/repo/with/wan_lora")
```
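To **continue on-policy distillation fine-tuning** (training a new LoRA with the same DMD reverse-divergence
recipe as in the paper), refer to the original AnyFlow training framework
[`NVlabs/AnyFlow`](https://github.com/NVlabs/AnyFlow). That training flow is out of scope for diffusers.
## Common pitfalls
- **Always use `guidance_scale=1.0`.** The distilled checkpoints already have CFG fused into the weights. Setting
it `> 1` runs an extra unconditional forward pass, which doubles latency and slightly degrades quality.
- **The bidirectional pipeline does not support streaming.** All `num_frames` are denoised together. Use the causal pipeline if you need to play frames while sampling.
- **The causal pipeline's KV cache assumes a chunk schedule that stays consistent across calls.** Rebuilding the cache midway is not supported by the released models.
- **`num_frames` must respect the VAE temporal stride.** The released checkpoints use values with `(N - 1) % 4 == 0` (e.g. 9, 17, 33, 81).
## Citation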
```bibtex
@misc{gu2026anyflowanystepvideodiffusion,
title={AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation},
author={Yuchao Gu and Guian Fang and Yuxin Jiang and Weijia Mao and Song Han and Han Cai and Mike Zheng Shou},
year={2026},
eprint={2605.13724},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2605.13724},
}
@article{gu2025long,
title={Long-Context Autoregressive Video Modeling with Next-Frame Prediction},
author={Gu, Yuchao and Mao, Weijia and Shou, Mike Zheng},
journal={arXiv preprint arXiv:2503.19325},
year={2025}
}
```

View File

@@ -0,0 +1,152 @@
# Copyright 2026 The AnyFlow Team, NVIDIA Corp., and The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Convert AnyFlow training checkpoints to the diffusers ``save_pretrained`` layout.
The AnyFlow training pipeline emits ``.pt`` files containing an ``ema`` key whose value is a flat state
dict for the transformer. This script:
1. Loads the matching base Wan2.1 pipeline from the Hub (provides VAE, tokenizer, and text encoder).
2. Constructs an ``AnyFlowTransformer3DModel`` (bidirectional) or ``AnyFlowFARTransformer3DModel`` (FAR causal)
with the right config flags for the chosen variant.
3. Loads the ``ema`` weights into the transformer.
4. Wraps everything in an ``AnyFlowPipeline`` (bidirectional) or ``AnyFlowFARPipeline`` (FAR causal).
5. Calls ``pipeline.save_pretrained(output_dir)``.
Example:
```bash
python scripts/convert_anyflow_to_diffusers.py \\
--variant AnyFlow-FAR-Wan2.1-1.3B-Diffusers \\
--ckpt /path/to/anyflow-checkpoint.pt \\
--output-dir /path/to/output/AnyFlow-FAR-Wan2.1-1.3B-Diffusers
```
"""
import argparse
import logging
import os
import torch
from diffusers import (
    AnyFlowFARPipeline,
    AnyFlowFARTransformer3DModel,
    AnyFlowPipeline,
    AnyFlowTransformer3DModel,
    FlowMapEulerDiscreteScheduler,
)
logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
# Per-variant configuration. ``base_model`` is fetched from the Hub to source the matching VAE / text encoder.
VARIANTS = {
    "AnyFlow-FAR-Wan2.1-1.3B-Diffusers": {
        "base_model": "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
        "transformer_cls": AnyFlowFARTransformer3DModel,
        "transformer_kwargs": {"full_chunk_limit": 3, "compressed_patch_size": [1, 4, 4]},
        "pipeline_cls": AnyFlowFARPipeline,
    },
    "AnyFlow-FAR-Wan2.1-14B-Diffusers": {
        "base_model": "Wan-AI/Wan2.1-T2V-14B-Diffusers",
        "transformer_cls": AnyFlowFARTransformer3DModel,
        "transformer_kwargs": {"full_chunk_limit": 3, "compressed_patch_size": [1, 4, 4]},
        "pipeline_cls": AnyFlowFARPipeline,
    },
    "AnyFlow-Wan2.1-T2V-1.3B-Diffusers": {
        "base_model": "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
        "transformer_cls": AnyFlowTransformer3DModel,
        "transformer_kwargs": {},
        "pipeline_cls": AnyFlowPipeline,
    },
    "AnyFlow-Wan2.1-T2V-14B-Diffusers": {
        "base_model": "Wan-AI/Wan2.1-T2V-14B-Diffusers",
        "transformer_cls": AnyFlowTransformer3DModel,
        "transformer_kwargs": {},
        "pipeline_cls": AnyFlowPipeline,
    },
}
def build_pipeline(variant: str, ckpt_path: str):
    if variant not in VARIANTS:
        raise ValueError(f"Unknown variant {variant!r}. Choices: {list(VARIANTS)}.")
    spec = VARIANTS[variant]
    transformer = spec["transformer_cls"].from_pretrained(
        spec["base_model"],
        subfolder="transformer",
        gate_value=0.25,
        deltatime_type="r",
        **spec["transformer_kwargs"],
    )
    # NVlabs/AnyFlow training checkpoints are wrapped Python objects (the `ema` key carries metadata
    # alongside tensors), so the unpickle is required. Only run this script on checkpoints you trust.
    state_dict = torch.load(ckpt_path, map_location="cpu", weights_only=False)["ema"]
    missing, unexpected = transformer.load_state_dict(state_dict, strict=False)
    if unexpected:
        logger.warning(
            "Unexpected keys in state dict (ignored): %s%s",
            unexpected[:5],
            "..." if len(unexpected) > 5 else "",
        )
    if missing:
        logger.warning(
            "Missing keys not loaded from state dict: %s%s",
            missing[:5],
            "..." if len(missing) > 5 else "",
        )
    scheduler = FlowMapEulerDiscreteScheduler(num_train_timesteps=1000, shift=5.0)
    pipeline = spec["pipeline_cls"].from_pretrained(
        spec["base_model"],
        transformer=transformer,
        scheduler=scheduler,
    )
    return pipeline
def main():
    parser = argparse.ArgumentParser(
        description="Convert an AnyFlow training checkpoint into a diffusers pipeline directory."
    )
    parser.add_argument(
        "--variant",
        required=True,
        choices=list(VARIANTS),
        help="Which AnyFlow variant the checkpoint corresponds to.",
    )
    parser.add_argument(
        "--ckpt",
        required=True,
        help="Path to the AnyFlow training checkpoint (a .pt file containing an 'ema' key).",
    )
    parser.add_argument(
        "--output-dir",
        required=True,
        help="Destination directory for pipeline.save_pretrained.",
    )
    args = parser.parse_args()
    os.makedirs(args.output_dir, exist_ok=True)
    pipeline = build_pipeline(args.variant, args.ckpt)
    pipeline.save_pretrained(args.output_dir)
    logger.info("Saved %s pipeline to %s", args.variant, args.output_dir)
if __name__ == "__main__":
    main()

View File

@@ -191,6 +191,8 @@ else:
[
"AceStepTransformer1DModel",
"AllegroTransformer3DModel",
"AnyFlowFARTransformer3DModel",
"AnyFlowTransformer3DModel",
"AsymmetricAutoencoderKL",
"AttentionBackendName",
"AuraFlowTransformer2DModel",
@@ -380,6 +382,7 @@ else:
"EDMEulerScheduler",
"EulerAncestralDiscreteScheduler",
"EulerDiscreteScheduler",
"FlowMapEulerDiscreteScheduler",
"FlowMatchEulerDiscreteScheduler",
"FlowMatchHeunDiscreteScheduler",
"FlowMatchLCMScheduler",
@@ -511,6 +514,8 @@ else:
"AnimateDiffSparseControlNetPipeline",
"AnimateDiffVideoToVideoControlNetPipeline",
"AnimateDiffVideoToVideoPipeline",
"AnyFlowFARPipeline",
"AnyFlowPipeline",
"AudioLDM2Pipeline",
"AudioLDM2ProjectionModel",
"AudioLDM2UNet2DConditionModel",
@@ -1019,6 +1024,8 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
from .models import (
AceStepTransformer1DModel,
AllegroTransformer3DModel,
AnyFlowFARTransformer3DModel,
AnyFlowTransformer3DModel,
AsymmetricAutoencoderKL,
AttentionBackendName,
AuraFlowTransformer2DModel,
@@ -1204,6 +1211,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
EDMEulerScheduler,
EulerAncestralDiscreteScheduler,
EulerDiscreteScheduler,
FlowMapEulerDiscreteScheduler,
FlowMatchEulerDiscreteScheduler,
FlowMatchHeunDiscreteScheduler,
FlowMatchLCMScheduler,
@@ -1316,6 +1324,8 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
AnimateDiffSparseControlNetPipeline,
AnimateDiffVideoToVideoControlNetPipeline,
AnimateDiffVideoToVideoPipeline,
AnyFlowFARPipeline,
AnyFlowPipeline,
AudioLDM2Pipeline,
AudioLDM2ProjectionModel,
AudioLDM2UNet2DConditionModel,

View File

@@ -95,6 +95,8 @@ if is_torch_available():
_import_structure["transformers.t5_film_transformer"] = ["T5FilmDecoder"]
_import_structure["transformers.transformer_2d"] = ["Transformer2DModel"]
_import_structure["transformers.transformer_allegro"] = ["AllegroTransformer3DModel"]
_import_structure["transformers.transformer_anyflow"] = ["AnyFlowTransformer3DModel"]
_import_structure["transformers.transformer_anyflow_far"] = ["AnyFlowFARTransformer3DModel"]
_import_structure["transformers.transformer_bria"] = ["BriaTransformer2DModel"]
_import_structure["transformers.transformer_bria_fibo"] = ["BriaFiboTransformer2DModel"]
_import_structure["transformers.transformer_chroma"] = ["ChromaTransformer2DModel"]
@@ -214,6 +216,8 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
from .transformers import (
AceStepTransformer1DModel,
AllegroTransformer3DModel,
AnyFlowFARTransformer3DModel,
AnyFlowTransformer3DModel,
AuraFlowTransformer2DModel,
BriaFiboTransformer2DModel,
BriaTransformer2DModel,

View File

@@ -269,6 +269,10 @@ class T2IAdapter(ModelMixin, ConfigMixin):
each representing information extracted at a different scale from the input. The length of the list is
determined by the number of downsample blocks in the Adapter, as specified by the `channels` and
`num_res_blocks` parameters during initialization.
Args:
x (`torch.Tensor`):
The input tensor to process through the adapter model.
"""
return self.adapter(x)

View File

@@ -166,6 +166,9 @@ class AsymmetricAutoencoderKL(ModelMixin, AutoencoderMixin, ConfigMixin):
Whether to sample from the posterior.
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`DecoderOutput`] instead of a plain tuple.
generator (`torch.Generator`, *optional*):
A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make sampling
deterministic.
"""
x = sample
posterior = self.encode(x).latent_dist

View File

@@ -706,6 +706,12 @@ class AutoencoderDC(ModelMixin, AutoencoderMixin, ConfigMixin, FromOriginalModel
return DecoderOutput(sample=decoded)
def forward(self, sample: torch.Tensor, return_dict: bool = True) -> torch.Tensor:
r"""
Args:
sample (`torch.Tensor`): Input sample.
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`DecoderOutput`] instead of a plain tuple.
"""
encoded = self.encode(sample, return_dict=False)[0]
decoded = self.decode(encoded, return_dict=False)[0]
if not return_dict:
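
The `return_dict` convention documented in this hunk can be sketched in isolation. This is a minimal illustration and not the library's implementation: `ToyDecoderOutput` and `toy_forward` are hypothetical names, and the "encode/decode" arithmetic is a placeholder for the real networks.

```python
from dataclasses import dataclass


@dataclass
class ToyDecoderOutput:
    # Hypothetical stand-in for an output class such as `DecoderOutput`.
    sample: list


def toy_forward(sample, return_dict=True):
    # Placeholder encode/decode round trip; a real model would run networks here.
    encoded = [2 * v for v in sample]
    decoded = [v / 2 for v in encoded]
    if not return_dict:
        # Plain tuple: callers index [0] to get the decoded values.
        return (decoded,)
    # Structured output: callers read the `.sample` attribute.
    return ToyDecoderOutput(sample=decoded)


as_tuple = toy_forward([1.0, 2.0], return_dict=False)
as_output = toy_forward([1.0, 2.0])
```

Both call styles carry the same decoded values; only the container differs.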

View File

@@ -424,6 +424,9 @@ class AutoencoderKL(
Whether to sample from the posterior.
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`DecoderOutput`] instead of a plain tuple.
generator (`torch.Generator`, *optional*):
A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make sampling
deterministic.
"""
x = sample
posterior = self.encode(x).latent_dist

View File

@@ -1409,6 +1409,17 @@ class AutoencoderKLCogVideoX(ModelMixin, AutoencoderMixin, ConfigMixin, FromOrig
return_dict: bool = True,
generator: torch.Generator | None = None,
) -> torch.Tensor | torch.Tensor:
r"""
Args:
sample (`torch.Tensor`): Input sample.
sample_posterior (`bool`, *optional*, defaults to `False`):
Whether to sample from the posterior.
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`DecoderOutput`] instead of a plain tuple.
generator (`torch.Generator`, *optional*):
A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make sampling
deterministic.
"""
x = sample
posterior = self.encode(x).latent_dist
if sample_posterior:

View File

@@ -1078,6 +1078,17 @@ class AutoencoderKLCosmos(ModelMixin, AutoencoderMixin, ConfigMixin):
return_dict: bool = True,
generator: torch.Generator | None = None,
) -> tuple[torch.Tensor] | DecoderOutput:
r"""
Args:
sample (`torch.Tensor`): Input sample.
sample_posterior (`bool`, *optional*, defaults to `False`):
Whether to sample from the posterior.
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`DecoderOutput`] instead of a plain tuple.
generator (`torch.Generator`, *optional*):
A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make sampling
deterministic.
"""
x = sample
posterior = self.encode(x).latent_dist
if sample_posterior:

View File

@@ -441,6 +441,9 @@ class AutoencoderKLFlux2(
Whether to sample from the posterior.
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`DecoderOutput`] instead of a plain tuple.
generator (`torch.Generator`, *optional*):
A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make sampling
deterministic.
"""
x = sample
posterior = self.encode(x).latent_dist

View File

@@ -1061,6 +1061,9 @@ class AutoencoderKLHunyuanVideo(ModelMixin, AutoencoderMixin, ConfigMixin):
Whether to sample from the posterior.
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`DecoderOutput`] instead of a plain tuple.
generator (`torch.Generator`, *optional*):
A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make sampling
deterministic.
"""
x = sample
posterior = self.encode(x).latent_dist

View File

@@ -674,8 +674,13 @@ class AutoencoderKLHunyuanImage(ModelMixin, AutoencoderMixin, ConfigMixin, FromO
"""
Args:
sample (`torch.Tensor`): Input sample.
sample_posterior (`bool`, *optional*, defaults to `False`):
Whether to sample from the posterior.
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`DecoderOutput`] instead of a plain tuple.
generator (`torch.Generator`, *optional*):
A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make sampling
deterministic.
"""
posterior = self.encode(sample).latent_dist
if sample_posterior:

View File

@@ -908,6 +908,9 @@ class AutoencoderKLHunyuanImageRefiner(ModelMixin, AutoencoderMixin, ConfigMixin
Whether to sample from the posterior.
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`DecoderOutput`] instead of a plain tuple.
generator (`torch.Generator`, *optional*):
A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make sampling
deterministic.
"""
x = sample
posterior = self.encode(x).latent_dist

View File

@@ -941,6 +941,9 @@ class AutoencoderKLHunyuanVideo15(ModelMixin, AutoencoderMixin, ConfigMixin):
Whether to sample from the posterior.
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`DecoderOutput`] instead of a plain tuple.
generator (`torch.Generator`, *optional*):
A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make sampling
deterministic.
"""
x = sample
posterior = self.encode(x).latent_dist

View File

@@ -787,6 +787,9 @@ class AutoencoderKLKVAE(ModelMixin, AutoencoderMixin, ConfigMixin):
Whether to sample from the posterior.
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`DecoderOutput`] instead of a plain tuple.
generator (`torch.Generator`, *optional*):
A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make sampling
deterministic.
"""
x = sample
posterior = self.encode(x).latent_dist

View File

@@ -942,6 +942,17 @@ class AutoencoderKLKVAEVideo(ModelMixin, AutoencoderMixin, ConfigMixin, FromOrig
return_dict: bool = True,
generator: Optional[torch.Generator] = None,
) -> Union[DecoderOutput, torch.Tensor]:
r"""
Args:
sample (`torch.Tensor`): Input sample.
sample_posterior (`bool`, *optional*, defaults to `False`):
Whether to sample from the posterior.
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`DecoderOutput`] instead of a plain tuple.
generator (`torch.Generator`, *optional*):
A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make sampling
deterministic.
"""
x = sample
posterior = self.encode(x).latent_dist
if sample_posterior:

View File

@@ -1522,6 +1522,19 @@ class AutoencoderKLLTXVideo(ModelMixin, AutoencoderMixin, ConfigMixin, FromOrigi
return_dict: bool = True,
generator: torch.Generator | None = None,
) -> torch.Tensor | torch.Tensor:
r"""
Args:
sample (`torch.Tensor`): Input sample.
temb (`torch.Tensor`, *optional*):
Optional timestep embedding tensor used to condition the decoder.
sample_posterior (`bool`, *optional*, defaults to `False`):
Whether to sample from the posterior.
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`DecoderOutput`] instead of a plain tuple.
generator (`torch.Generator`, *optional*):
A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make sampling
deterministic.
"""
x = sample
posterior = self.encode(x).latent_dist
if sample_posterior:

View File

@@ -1542,6 +1542,23 @@ class AutoencoderKLLTX2Video(ModelMixin, AutoencoderMixin, ConfigMixin, FromOrig
return_dict: bool = True,
generator: torch.Generator | None = None,
) -> torch.Tensor | torch.Tensor:
r"""
Args:
sample (`torch.Tensor`): Input sample.
temb (`torch.Tensor`, *optional*):
Optional timestep embedding tensor used to condition the decoder.
sample_posterior (`bool`, *optional*, defaults to `False`):
Whether to sample from the posterior.
encoder_causal (`bool`, *optional*):
Whether the encoder should use causal convolutions. If `None`, falls back to the model default.
decoder_causal (`bool`, *optional*):
Whether the decoder should use causal convolutions. If `None`, falls back to the model default.
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`DecoderOutput`] instead of a plain tuple.
generator (`torch.Generator`, *optional*):
A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make sampling
deterministic.
"""
x = sample
posterior = self.encode(x, causal=encoder_causal).latent_dist
if sample_posterior:

View File

@@ -792,6 +792,17 @@ class AutoencoderKLLTX2Audio(ModelMixin, AutoencoderMixin, ConfigMixin):
return_dict: bool = True,
generator: torch.Generator | None = None,
) -> DecoderOutput | torch.Tensor:
r"""
Args:
sample (`torch.Tensor`): Input sample.
sample_posterior (`bool`, *optional*, defaults to `False`):
Whether to sample from the posterior.
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`DecoderOutput`] instead of a plain tuple.
generator (`torch.Generator`, *optional*):
A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make sampling
deterministic.
"""
posterior = self.encode(sample).latent_dist
if sample_posterior:
z = posterior.sample(generator=generator)
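
To illustrate why the docstring says `generator` makes sampling deterministic, here is a toy sketch using only the standard library. `ToyDiagonalGaussian` and `toy_latent` are hypothetical names, `random.Random` stands in for `torch.Generator`, and falling back to the distribution's mode when `sample_posterior` is `False` is an assumption (the hunk above shows only the sampling branch).

```python
import random


class ToyDiagonalGaussian:
    """Hypothetical stand-in for a diagonal Gaussian posterior."""

    def __init__(self, mean, std):
        self.mean = mean
        self.std = std

    def sample(self, generator=None):
        # Fall back to an unseeded RNG when no generator is supplied.
        rng = generator if generator is not None else random.Random()
        return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(self.mean, self.std)]

    def mode(self):
        # The mode of a Gaussian is its mean.
        return list(self.mean)


def toy_latent(posterior, sample_posterior=False, generator=None):
    if sample_posterior:
        return posterior.sample(generator=generator)
    # Assumption: the non-sampling branch uses the mode.
    return posterior.mode()


posterior = ToyDiagonalGaussian(mean=[0.0, 1.0], std=[1.0, 0.5])
first = toy_latent(posterior, sample_posterior=True, generator=random.Random(0))
second = toy_latent(posterior, sample_posterior=True, generator=random.Random(0))
```

Two generators seeded identically yield the same latent, which is the behavior the `generator` argument is documented to provide.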

View File

@@ -1057,6 +1057,9 @@ class AutoencoderKLMagvit(ModelMixin, AutoencoderMixin, ConfigMixin):
Whether to sample from the posterior.
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`DecoderOutput`] instead of a plain tuple.
generator (`torch.Generator`, *optional*):
A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make sampling
deterministic.
"""
x = sample
posterior = self.encode(x).latent_dist

View File

@@ -1093,6 +1093,17 @@ class AutoencoderKLMochi(ModelMixin, AutoencoderMixin, ConfigMixin):
return_dict: bool = True,
generator: torch.Generator | None = None,
) -> torch.Tensor | torch.Tensor:
r"""
Args:
sample (`torch.Tensor`): Input sample.
sample_posterior (`bool`, *optional*, defaults to `False`):
Whether to sample from the posterior.
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`DecoderOutput`] instead of a plain tuple.
generator (`torch.Generator`, *optional*):
A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make sampling
deterministic.
"""
x = sample
posterior = self.encode(x).latent_dist
if sample_posterior:

View File

@@ -1043,8 +1043,13 @@ class AutoencoderKLQwenImage(ModelMixin, AutoencoderMixin, ConfigMixin, FromOrig
"""
Args:
sample (`torch.Tensor`): Input sample.
sample_posterior (`bool`, *optional*, defaults to `False`):
Whether to sample from the posterior.
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`DecoderOutput`] instead of a plain tuple.
generator (`torch.Generator`, *optional*):
A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make sampling
deterministic.
"""
x = sample
posterior = self.encode(x).latent_dist

View File

@@ -287,6 +287,11 @@ class AutoencoderKLTemporalDecoder(ModelMixin, AttentionMixin, AutoencoderMixin,
Whether to sample from the posterior.
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`DecoderOutput`] instead of a plain tuple.
generator (`torch.Generator`, *optional*):
A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make sampling
deterministic.
num_frames (`int`, *optional*, defaults to 1):
The number of frames to decode per batch.
"""
x = sample
posterior = self.encode(x).latent_dist

View File

@@ -1416,8 +1416,13 @@ class AutoencoderKLWan(ModelMixin, AutoencoderMixin, ConfigMixin, FromOriginalMo
"""
Args:
sample (`torch.Tensor`): Input sample.
sample_posterior (`bool`, *optional*, defaults to `False`):
Whether to sample from the posterior.
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`DecoderOutput`] instead of a plain tuple.
generator (`torch.Generator`, *optional*):
A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make sampling
deterministic.
"""
x = sample
posterior = self.encode(x).latent_dist

View File

@@ -393,6 +393,17 @@ class LongCatAudioDiTVae(ModelMixin, AutoencoderMixin, ConfigMixin):
return_dict: bool = True,
generator: torch.Generator | None = None,
) -> LongCatAudioDiTVaeDecoderOutput | tuple[torch.Tensor]:
r"""
Args:
sample (`torch.Tensor`): Input sample.
sample_posterior (`bool`, *optional*, defaults to `False`):
Whether to sample from the posterior.
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`LongCatAudioDiTVaeDecoderOutput`] instead of a plain tuple.
generator (`torch.Generator`, *optional*):
A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make sampling
deterministic.
"""
latents = self.encode(sample, sample_posterior=sample_posterior, return_dict=True, generator=generator).latents
decoded = self.decode(latents, return_dict=True).sample
if not return_dict:

View File

@@ -528,6 +528,9 @@ class AutoencoderOobleck(ModelMixin, AutoencoderMixin, ConfigMixin):
Whether to sample from the posterior.
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`OobleckDecoderOutput`] instead of a plain tuple.
generator (`torch.Generator`, *optional*):
A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make sampling
deterministic.
"""
x = sample
posterior = self.encode(x).latent_dist

View File

@@ -682,6 +682,15 @@ class AutoencoderRAE(ModelMixin, AttentionMixin, AutoencoderMixin, ConfigMixin):
def forward(
self, sample: torch.Tensor, return_dict: bool = True, generator: torch.Generator | None = None
) -> DecoderOutput | tuple[torch.Tensor]:
r"""
Args:
sample (`torch.Tensor`): Input sample.
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`DecoderOutput`] instead of a plain tuple.
generator (`torch.Generator`, *optional*):
A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make sampling
deterministic.
"""
latents = self.encode(sample, return_dict=False, generator=generator)[0]
decoded = self.decode(latents, return_dict=False)[0]
if not return_dict:

View File

@@ -1440,6 +1440,19 @@ class AutoencoderVidTok(ModelMixin, ConfigMixin):
return_dict: bool = True,
generator: Optional[torch.Generator] = None,
) -> Union[torch.Tensor, DecoderOutput]:
r"""
Args:
sample (`torch.Tensor`): Input sample.
sample_posterior (`bool`, *optional*, defaults to `True`):
Whether to sample from the posterior.
encoder_mode (`bool`, *optional*, defaults to `False`):
If `True`, only run the encoder and return the encoded latent without decoding.
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`DecoderOutput`] instead of a plain tuple.
generator (`torch.Generator`, *optional*):
A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make sampling
deterministic.
"""
x = sample
res = 1 if self.is_causal else 0
if self.is_causal:

View File

@@ -188,8 +188,12 @@ class FluxControlNetModel(ModelMixin, AttentionMixin, ConfigMixin, PeftAdapterMi
from the embeddings of input conditions.
timestep (`torch.LongTensor`):
Used to indicate denoising step.
block_controlnet_hidden_states: (`list` of `torch.Tensor`):
A list of tensors that if specified are added to the residuals of transformer blocks.
img_ids (`torch.Tensor`):
Positional ids for the image tokens.
txt_ids (`torch.Tensor`):
Positional ids for the text tokens.
guidance (`torch.Tensor`, *optional*):
Guidance scale tensor used by guidance-distilled variants of the model.
joint_attention_kwargs (`dict`, *optional*):
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
`self.processor` in
@@ -355,6 +359,35 @@ class FluxMultiControlNetModel(ModelMixin):
joint_attention_kwargs: dict[str, Any] | None = None,
return_dict: bool = True,
) -> FluxControlNetOutput | tuple:
r"""
Args:
hidden_states (`torch.FloatTensor` of shape `(batch_size, channel, height, width)`):
Input `hidden_states`.
controlnet_cond (`list` of `torch.Tensor`):
A list of conditional input tensors, one per ControlNet.
controlnet_mode (`list` of `torch.Tensor`):
A list of mode tensors selecting the control type for each ControlNet.
conditioning_scale (`list` of `float`):
A list of scale factors applied to the ControlNet outputs.
encoder_hidden_states (`torch.FloatTensor` of shape `(batch_size, sequence_len, embed_dims)`):
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
pooled_projections (`torch.FloatTensor` of shape `(batch_size, projection_dim)`):
Embeddings projected from the embeddings of input conditions.
timestep (`torch.LongTensor`):
Used to indicate denoising step.
img_ids (`torch.Tensor`):
Positional ids for the image tokens.
txt_ids (`torch.Tensor`):
Positional ids for the text tokens.
guidance (`torch.Tensor`, *optional*):
Guidance scale tensor used by guidance-distilled variants of the model.
joint_attention_kwargs (`dict`, *optional*):
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
`self.processor` in
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`FluxControlNetOutput`] instead of a plain tuple.
"""
# ControlNet-Union with multiple conditions
# load only one ControlNet to save memory
if len(self.nets) == 1:

View File

@@ -286,6 +286,32 @@ class QwenImageMultiControlNetModel(ModelMixin, ConfigMixin, PeftAdapterMixin, F
joint_attention_kwargs: dict[str, Any] | None = None,
return_dict: bool = True,
) -> QwenImageControlNetOutput | tuple:
r"""
Args:
hidden_states (`torch.FloatTensor`):
Input `hidden_states`.
controlnet_cond (`list` of `torch.Tensor`):
A list of conditional input tensors, one per ControlNet.
conditioning_scale (`list` of `float`):
A list of scale factors applied to the ControlNet outputs.
encoder_hidden_states (`torch.Tensor`, *optional*):
Conditional embeddings (embeddings computed from the input conditions such as prompts).
encoder_hidden_states_mask (`torch.Tensor`, *optional*):
Mask for the encoder hidden states.
timestep (`torch.LongTensor`, *optional*):
Used to indicate denoising step.
img_shapes (`list` of `tuple[int, int, int]`, *optional*):
Per-sample image shapes used to construct positional encodings.
txt_seq_lens (`list` of `int`, *optional*):
Deprecated. The text sequence length is now inferred from `encoder_hidden_states` and
`encoder_hidden_states_mask`.
joint_attention_kwargs (`dict`, *optional*):
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
`self.processor` in
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`QwenImageControlNetOutput`] instead of a plain tuple.
"""
if txt_seq_lens is not None:
deprecate(
"txt_seq_lens",

View File

@@ -130,6 +130,30 @@ class SanaControlNetModel(ModelMixin, AttentionMixin, ConfigMixin, PeftAdapterMi
attention_kwargs: dict[str, Any] | None = None,
return_dict: bool = True,
) -> tuple[torch.Tensor, ...] | Transformer2DModelOutput:
r"""
Args:
hidden_states (`torch.Tensor` of shape `(batch_size, channel, height, width)`):
Input `hidden_states`.
encoder_hidden_states (`torch.Tensor`):
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
timestep (`torch.LongTensor`):
Used to indicate denoising step.
controlnet_cond (`torch.Tensor`):
The conditional input tensor for the ControlNet.
conditioning_scale (`float`, *optional*, defaults to `1.0`):
The scale factor for ControlNet outputs.
encoder_attention_mask (`torch.Tensor`, *optional*):
Attention mask applied to `encoder_hidden_states`.
attention_mask (`torch.Tensor`, *optional*):
Attention mask applied to `hidden_states`.
attention_kwargs (`dict`, *optional*):
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
`self.processor` in
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
tuple.
"""
# ensure attention_mask is a bias, and give it a singleton query_tokens dimension.
# we may have done this conversion already, e.g. if we came here via UNet2DConditionModel#forward.
# we can tell by counting dims; if ndim == 2: it's a mask rather than a bias.
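
The mask-to-bias conversion described in the comments above can be sketched with nested lists. This is a sketch under stated assumptions: `mask_to_bias` is a hypothetical helper, and the `-10000.0` constant is assumed as the "large negative value"; the model's own code may use a different constant or dtype handling.

```python
def mask_to_bias(mask, negative_value=-10000.0):
    """Convert a 2D keep/discard mask into an additive attention bias.

    mask: rows of shape (batch, key_tokens) holding 1 (keep) or 0 (discard).
    Returns nested lists of shape (batch, 1, key_tokens); the middle axis is
    the singleton query_tokens dimension mentioned in the comments.
    """
    return [[[0.0 if keep else negative_value for keep in row]] for row in mask]


bias = mask_to_bias([[1, 0, 1]])
```

Adding this bias to attention scores leaves kept tokens unchanged and pushes discarded tokens toward zero weight after the softmax.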

View File

@@ -402,6 +402,27 @@ class SD3MultiControlNetModel(ModelMixin):
joint_attention_kwargs: dict[str, Any] | None = None,
return_dict: bool = True,
) -> SD3ControlNetOutput | tuple:
r"""
Args:
hidden_states (`torch.Tensor`):
Input `hidden_states`.
controlnet_cond (`list` of `torch.Tensor`):
A list of conditional input tensors, one per ControlNet.
conditioning_scale (`list` of `float`):
A list of scale factors applied to the ControlNet outputs.
pooled_projections (`torch.Tensor`):
Embeddings projected from the embeddings of input conditions.
encoder_hidden_states (`torch.Tensor`, *optional*):
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
timestep (`torch.LongTensor`, *optional*):
Used to indicate denoising step.
joint_attention_kwargs (`dict`, *optional*):
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
`self.processor` in
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`SD3ControlNetOutput`] instead of a plain tuple.
"""
for i, (image, scale, controlnet) in enumerate(zip(controlnet_cond, conditioning_scale, self.nets)):
block_samples = controlnet(
hidden_states=hidden_states,

View File

@@ -558,8 +558,6 @@ class SparseControlNetModel(ModelMixin, AttentionMixin, ConfigMixin, FromOrigina
The conditional input tensor of shape `(batch_size, sequence_length, hidden_size)`.
conditioning_scale (`float`, defaults to `1.0`):
The scale factor for ControlNet outputs.
class_labels (`torch.Tensor`, *optional*, defaults to `None`):
Optional class labels for conditioning. Their embeddings will be summed with the timestep embeddings.
timestep_cond (`torch.Tensor`, *optional*, defaults to `None`):
Additional conditional embeddings for timestep. If provided, the embeddings will be summed with the
timestep_embedding passed through the `self.time_embedding` layer to obtain the final timestep
@@ -568,8 +566,8 @@ class SparseControlNetModel(ModelMixin, AttentionMixin, ConfigMixin, FromOrigina
An attention mask of shape `(batch, key_tokens)` is applied to `encoder_hidden_states`. If `1` the mask
is kept, otherwise if `0` it is discarded. Mask will be converted into a bias, which adds large
negative values to the attention scores corresponding to "discard" tokens.
added_cond_kwargs (`dict`):
Additional conditions for the Stable Diffusion XL UNet.
conditioning_mask (`torch.Tensor`, *optional*, defaults to `None`):
Optional mask indicating which frames in `controlnet_cond` are valid conditioning frames.
cross_attention_kwargs (`dict[str]`, *optional*, defaults to `None`):
A kwargs dictionary that if specified is passed along to the `AttnProcessor`.
guess_mode (`bool`, defaults to `False`):

View File

@@ -661,6 +661,23 @@ class ZImageControlNetModel(ModelMixin, ConfigMixin, PeftAdapterMixin, FromOrigi
patch_size=2,
f_patch_size=1,
):
r"""
Args:
x (`list` of `torch.Tensor`):
A list of input image latents, one tensor per sample in the batch.
t (`torch.Tensor`):
Timestep tensor used to indicate the denoising step.
cap_feats (`list` of `torch.Tensor`):
A list of caption (text) feature tensors, one per sample.
control_context (`list` of `torch.Tensor`):
A list of control conditioning feature tensors, one per sample.
conditioning_scale (`float`, *optional*, defaults to `1.0`):
The scale factor for ControlNet outputs.
patch_size (`int`, *optional*, defaults to `2`):
Spatial patch size used to tokenize the latent.
f_patch_size (`int`, *optional*, defaults to `1`):
Temporal (frame) patch size used to tokenize the latent.
"""
if (
self.t_scale is None
or self.t_embedder is None

View File

@@ -44,6 +44,34 @@ class MultiControlNetModel(ModelMixin):
guess_mode: bool = False,
return_dict: bool = True,
) -> ControlNetOutput | tuple:
r"""
Args:
sample (`torch.Tensor`):
The noisy input tensor.
timestep (`torch.Tensor`, `float`, or `int`):
The current denoising timestep.
encoder_hidden_states (`torch.Tensor`):
The encoder hidden states.
controlnet_cond (`list` of `torch.Tensor`):
A list of conditional input tensors, one per ControlNet.
conditioning_scale (`list` of `float`):
A list of scale factors applied to the ControlNet outputs.
class_labels (`torch.Tensor`, *optional*):
Optional class labels for conditioning.
timestep_cond (`torch.Tensor`, *optional*):
Additional conditional embeddings for timestep.
attention_mask (`torch.Tensor`, *optional*):
Attention mask applied to `encoder_hidden_states`.
added_cond_kwargs (`dict`, *optional*):
Additional conditions for the Stable Diffusion XL UNet.
cross_attention_kwargs (`dict`, *optional*):
A kwargs dictionary that if specified is passed along to the `AttnProcessor`.
guess_mode (`bool`, *optional*, defaults to `False`):
In this mode, the ControlNet encoder tries its best to recognize the input content even if you remove
all prompts.
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`ControlNetOutput`] instead of a plain tuple.
"""
for i, (image, scale, controlnet) in enumerate(zip(controlnet_cond, conditioning_scale, self.nets)):
down_samples, mid_sample = controlnet(
sample=sample,
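
The loop above visits one `(image, scale, controlnet)` triple at a time. A minimal sketch of how per-ControlNet residuals could be merged follows, assuming they are summed element-wise; `combine_controlnet_residuals` and `make_toy_net` are hypothetical names and the toy nets return plain floats in place of tensors.

```python
def combine_controlnet_residuals(controlnet_cond, conditioning_scale, nets):
    down_total, mid_total = None, None
    for image, scale, net in zip(controlnet_cond, conditioning_scale, nets):
        down_samples, mid_sample = net(image, scale)
        if down_total is None:
            # First ControlNet: take its residuals as the starting point.
            down_total, mid_total = list(down_samples), mid_sample
        else:
            # Later ControlNets: add their residuals element-wise.
            down_total = [prev + cur for prev, cur in zip(down_total, down_samples)]
            mid_total = mid_total + mid_sample
    return down_total, mid_total


def make_toy_net(offset):
    # Hypothetical ControlNet: two scaled "down" residuals and one "mid" residual.
    def net(image, scale):
        return [scale * (image + offset), scale * (image + 2 * offset)], scale * image

    return net


nets = [make_toy_net(1.0), make_toy_net(2.0)]
down, mid = combine_controlnet_residuals([1.0, 2.0], [1.0, 0.5], nets)
```

Each conditioning input is paired with its own scale, so lowering one entry of `conditioning_scale` weakens only that ControlNet's contribution.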

View File

@@ -47,6 +47,38 @@ class MultiControlNetUnionModel(ModelMixin):
guess_mode: bool = False,
return_dict: bool = True,
) -> ControlNetOutput | tuple:
r"""
Args:
sample (`torch.Tensor`):
The noisy input tensor.
timestep (`torch.Tensor`, `float`, or `int`):
The current denoising timestep.
encoder_hidden_states (`torch.Tensor`):
The encoder hidden states.
controlnet_cond (`list` of `torch.Tensor`):
A list of conditional input tensors, one per ControlNet.
control_type (`list` of `torch.Tensor`):
A list of control type tensors, one per ControlNet, indicating the active control types.
control_type_idx (`list` of `list` of `int`):
Per-ControlNet list of control type indices corresponding to `controlnet_cond`.
conditioning_scale (`list` of `float`):
A list of scale factors applied to the ControlNet outputs.
class_labels (`torch.Tensor`, *optional*):
Optional class labels for conditioning.
timestep_cond (`torch.Tensor`, *optional*):
Additional conditional embeddings for timestep.
attention_mask (`torch.Tensor`, *optional*):
Attention mask applied to `encoder_hidden_states`.
added_cond_kwargs (`dict`, *optional*):
Additional conditions for the Stable Diffusion XL UNet.
cross_attention_kwargs (`dict`, *optional*):
A kwargs dictionary that if specified is passed along to the `AttnProcessor`.
guess_mode (`bool`, *optional*, defaults to `False`):
In this mode, the ControlNet encoder tries its best to recognize the input content even if you remove
all prompts.
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`ControlNetOutput`] instead of a plain tuple.
"""
down_block_res_samples, mid_block_res_sample = None, None
for i, (image, ctype, ctype_idx, scale, controlnet) in enumerate(
zip(controlnet_cond, control_type, control_type_idx, conditioning_scale, self.nets)

View File

@@ -18,6 +18,8 @@ if is_torch_available():
from .t5_film_transformer import T5FilmDecoder
from .transformer_2d import Transformer2DModel
from .transformer_allegro import AllegroTransformer3DModel
from .transformer_anyflow import AnyFlowTransformer3DModel
from .transformer_anyflow_far import AnyFlowFARTransformer3DModel
from .transformer_bria import BriaTransformer2DModel
from .transformer_bria_fibo import BriaFiboTransformer2DModel
from .transformer_chroma import ChromaTransformer2DModel

View File

@@ -406,6 +406,28 @@ class AuraFlowTransformer2DModel(ModelMixin, AttentionMixin, ConfigMixin, PeftAd
attention_kwargs: dict[str, Any] | None = None,
return_dict: bool = True,
) -> tuple[torch.Tensor] | Transformer2DModelOutput:
"""
The [`AuraFlowTransformer2DModel`] forward method.
Args:
hidden_states (`torch.FloatTensor` of shape `(batch_size, channel, height, width)`):
Input `hidden_states`.
encoder_hidden_states (`torch.FloatTensor` of shape `(batch_size, sequence_len, embed_dims)`):
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
timestep (`torch.LongTensor`):
Used to indicate denoising step.
attention_kwargs (`dict`, *optional*):
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
`self.processor` in
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
tuple.
Returns:
If `return_dict` is True, an [`~models.transformer_2d.Transformer2DModelOutput`] is returned, otherwise a
`tuple` where the first element is the sample tensor.
"""
height, width = hidden_states.shape[-2:]
# Apply patch embedding, timestep embedding, and project the caption embeddings.

View File

@@ -375,6 +375,35 @@ class CogVideoXTransformer3DModel(ModelMixin, AttentionMixin, ConfigMixin, PeftA
attention_kwargs: dict[str, Any] | None = None,
return_dict: bool = True,
) -> tuple[torch.Tensor] | Transformer2DModelOutput:
"""
The [`CogVideoXTransformer3DModel`] forward method.
Args:
hidden_states (`torch.Tensor` of shape `(batch_size, num_frames, channels, height, width)`):
Input `hidden_states`.
encoder_hidden_states (`torch.Tensor` of shape `(batch_size, sequence_len, embed_dims)`):
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
timestep (`torch.LongTensor`):
Used to indicate denoising step.
timestep_cond (`torch.Tensor`, *optional*):
Conditional embeddings for timestep. If provided, the embeddings will be summed with the samples passed
through the `self.time_embedding` layer to obtain the final timestep embeddings.
ofs (`torch.Tensor`, *optional*):
Offset embeddings used in CogVideoX-5b-I2V.
image_rotary_emb (`tuple` of `torch.Tensor`, *optional*):
Pre-computed rotary positional embeddings.
attention_kwargs (`dict`, *optional*):
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
`self.processor` in
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
tuple.
Returns:
If `return_dict` is True, an [`~models.transformer_2d.Transformer2DModelOutput`] is returned, otherwise a
`tuple` where the first element is the sample tensor.
"""
batch_size, num_frames, channels, height, width = hidden_states.shape
# 1. Time embedding

View File

@@ -633,6 +633,37 @@ class ConsisIDTransformer3DModel(ModelMixin, AttentionMixin, ConfigMixin, PeftAd
id_vit_hidden: torch.Tensor | None = None,
return_dict: bool = True,
) -> tuple[torch.Tensor] | Transformer2DModelOutput:
"""
The [`ConsisIDTransformer3DModel`] forward method.
Args:
hidden_states (`torch.Tensor` of shape `(batch_size, num_frames, channels, height, width)`):
Input `hidden_states`.
encoder_hidden_states (`torch.Tensor` of shape `(batch_size, sequence_len, embed_dims)`):
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
timestep (`torch.LongTensor`):
Used to indicate denoising step.
timestep_cond (`torch.Tensor`, *optional*):
Conditional embeddings for timestep. If provided, the embeddings will be summed with the samples passed
through the `self.time_embedding` layer to obtain the final timestep embeddings.
image_rotary_emb (`tuple` of `torch.Tensor`, *optional*):
Pre-computed rotary positional embeddings.
attention_kwargs (`dict`, *optional*):
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
`self.processor` in
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
id_cond (`torch.Tensor`, *optional*):
The face embedding extracted by the local facial extractor used for identity conditioning.
id_vit_hidden (`torch.Tensor`, *optional*):
The ViT hidden states extracted from face images used for identity conditioning.
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
tuple.
Returns:
If `return_dict` is True, an [`~models.transformer_2d.Transformer2DModelOutput`] is returned, otherwise a
`tuple` where the first element is the sample tensor.
"""
# fuse clip and insightface
valid_face_emb = None
if self.is_train_face:

View File

@@ -392,6 +392,8 @@ class HunyuanDiT2DModel(ModelMixin, AttentionMixin, ConfigMixin):
Conditional embedding indicating the style
image_rotary_emb (`torch.Tensor`):
The image rotary embeddings to apply on query and key tensors during attention calculation.
controlnet_block_samples (`list` of `torch.Tensor`, *optional*):
A list of tensors that if specified are added to the residuals of transformer blocks.
return_dict: bool
Whether to return a dictionary.
"""

View File

@@ -176,7 +176,7 @@ class LatteTransformer3DModel(ModelMixin, ConfigMixin, CacheMixin):
The [`LatteTransformer3DModel`] forward method.
Args:
hidden_states shape `(batch size, channel, num_frame, height, width)`:
hidden_states (`torch.Tensor` of shape `(batch size, channel, num_frame, height, width)`):
Input `hidden_states`.
timestep ( `torch.LongTensor`, *optional*):
Used to indicate denoising step. Optional timestep to be applied as an embedding in `AdaLayerNorm`.

View File

@@ -306,6 +306,15 @@ class LuminaNextDiT2DModel(ModelMixin, ConfigMixin):
timestep (torch.Tensor): Tensor of diffusion timesteps of shape (N,).
encoder_hidden_states (torch.Tensor): Tensor of caption features of shape (N, D).
encoder_mask (torch.Tensor): Tensor of caption masks of shape (N, L).
image_rotary_emb (`torch.Tensor`):
Pre-computed rotary positional embeddings.
cross_attention_kwargs (`dict`, *optional*):
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
`self.processor` in
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
tuple.
"""
hidden_states, mask, img_size, image_rotary_emb = self.patch_embedder(hidden_states, image_rotary_emb)
image_rotary_emb = image_rotary_emb.to(hidden_states.device)

View File

@@ -427,6 +427,36 @@ class SanaTransformer2DModel(ModelMixin, AttentionMixin, ConfigMixin, PeftAdapte
controlnet_block_samples: tuple[torch.Tensor] | None = None,
return_dict: bool = True,
) -> tuple[torch.Tensor, ...] | Transformer2DModelOutput:
"""
The [`SanaTransformer2DModel`] forward method.
Args:
hidden_states (`torch.Tensor` of shape `(batch_size, in_channels, height, width)`):
Input `hidden_states`.
encoder_hidden_states (`torch.Tensor` of shape `(batch_size, sequence_len, embed_dims)`):
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
timestep (`torch.LongTensor`):
Used to indicate denoising step.
guidance (`torch.Tensor`, *optional*):
Guidance scale embedding.
encoder_attention_mask (`torch.Tensor`, *optional*):
Cross-attention mask applied to `encoder_hidden_states`.
attention_mask (`torch.Tensor`, *optional*):
Self-attention mask applied to `hidden_states`.
attention_kwargs (`dict`, *optional*):
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
`self.processor` in
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
controlnet_block_samples (`tuple` of `torch.Tensor`, *optional*):
A tuple of tensors that if specified are added to the residuals of transformer blocks.
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
tuple.
Returns:
If `return_dict` is True, an [`~models.transformer_2d.Transformer2DModelOutput`] is returned, otherwise a
`tuple` where the first element is the sample tensor.
"""
# ensure attention_mask is a bias, and give it a singleton query_tokens dimension.
# we may have done this conversion already, e.g. if we came here via UNet2DConditionModel#forward.
# we can tell by counting dims; if ndim == 2: it's a mask rather than a bias.
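Review note: the comment above distinguishes a 2D keep-mask from an additive attention bias. A minimal, dependency-free sketch of that conversion follows, assuming the common convention that `1` means "attend" and `0` means "ignore". The helper name `mask_to_bias` and the `-10000.0` fill value are illustrative assumptions, not taken from this diff.

```python
def mask_to_bias(mask, masked_value=-10000.0):
    """Turn a (batch, key_tokens) keep-mask into a (batch, 1, key_tokens) additive bias.

    `mask_to_bias` is a hypothetical helper; kept positions get a bias of 0 and
    ignored positions get a large negative bias so softmax drives them to ~0.
    """
    bias = []
    for row in mask:
        # The extra list level is the singleton query_tokens dimension.
        bias.append([[(1 - keep) * masked_value for keep in row]])
    return bias


example_mask = [[1, 1, 0], [1, 0, 0]]
example_bias = mask_to_bias(example_mask)
```

Because the bias is added to the attention logits before the softmax, the singleton dimension lets one row of biases broadcast across every query token.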

View File

@@ -90,6 +90,18 @@ class T5FilmDecoder(ModelMixin, ConfigMixin):
return mask.unsqueeze(-3)
def forward(self, encodings_and_masks, decoder_input_tokens, decoder_noise_time):
"""
The [`T5FilmDecoder`] forward method.
Args:
encodings_and_masks (`list` of `tuple` of `torch.Tensor`):
A list of `(encoding, mask)` tuples produced by upstream encoders. The encodings are concatenated and
cross-attended to by the decoder.
decoder_input_tokens (`torch.Tensor` of shape `(batch_size, seq_length, input_dims)`):
Input tokens for the decoder.
decoder_noise_time (`torch.Tensor` of shape `(batch_size,)`):
Diffusion timesteps in `[0, 1)` used to condition the decoder.
"""
batch, _, _ = decoder_input_tokens.shape
assert decoder_noise_time.shape == (batch,)

View File

@@ -312,6 +312,30 @@ class AllegroTransformer3DModel(ModelMixin, ConfigMixin, CacheMixin):
image_rotary_emb: tuple[torch.Tensor, torch.Tensor] | None = None,
return_dict: bool = True,
):
"""
The [`AllegroTransformer3DModel`] forward method.
Args:
hidden_states (`torch.Tensor` of shape `(batch_size, num_channels, num_frames, height, width)`):
Input `hidden_states`.
encoder_hidden_states (`torch.Tensor` of shape `(batch_size, sequence_len, embed_dims)`):
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
timestep (`torch.LongTensor`):
Used to indicate denoising step.
attention_mask (`torch.Tensor`, *optional*):
Self-attention mask applied to `hidden_states`.
encoder_attention_mask (`torch.Tensor`, *optional*):
Cross-attention mask applied to `encoder_hidden_states`.
image_rotary_emb (`tuple` of `torch.Tensor`, *optional*):
Pre-computed rotary positional embeddings.
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
tuple.
Returns:
If `return_dict` is True, an [`~models.transformer_2d.Transformer2DModelOutput`] is returned, otherwise a
`tuple` where the first element is the sample tensor.
"""
batch_size, num_channels, num_frames, height, width = hidden_states.shape
p_t = self.config.patch_size_t
p = self.config.patch_size

View File

@@ -0,0 +1,726 @@
# Copyright 2026 The AnyFlow Team, NVIDIA Corp., and The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# This file derives from the FAR architecture (Gu et al., 2025, arXiv:2503.19325) and adds the
# AnyFlow dual-timestep flow-map embedding (AnyFlowDualTimestepTextImageEmbedding) introduced by
# Yuchao Gu, Guian Fang et al. (arXiv:2605.13724). The base 3D DiT structure is adapted from the
# v0.35.1 Wan2.1 transformer (transformer_wan.py); upstream Wan has since been refactored, so
# this file is intentionally self-contained rather than annotated with `# Copied from`.
import math
from typing import Any, Dict, Optional, Tuple, Union
import torch
import torch.nn as nn
import torch.nn.functional as F
from ...configuration_utils import ConfigMixin, register_to_config
from ...loaders import FromOriginalModelMixin, PeftAdapterMixin
from ...utils import apply_lora_scale, logging
from ..attention import AttentionModuleMixin, FeedForward
from ..attention_dispatch import dispatch_attention_fn
from ..embeddings import PixArtAlphaTextProjection, TimestepEmbedding, Timesteps, get_1d_rotary_pos_embed
from ..modeling_outputs import Transformer2DModelOutput
from ..modeling_utils import ModelMixin
from ..normalization import FP32LayerNorm, RMSNorm
logger = logging.get_logger(__name__) # pylint: disable=invalid-name
def apply_rotary_emb(hidden_states: torch.Tensor, freqs: torch.Tensor):
# MPS / NPU backends do not support complex128 / float64; fall back to float32 on those devices.
is_mps = hidden_states.device.type == "mps"
is_npu = hidden_states.device.type == "npu"
rotary_dtype = torch.float32 if (is_mps or is_npu) else torch.float64
x_rotated = torch.view_as_complex(hidden_states.to(rotary_dtype).unflatten(3, (-1, 2)))
x_out = torch.view_as_real(x_rotated * freqs).flatten(3, 4)
return x_out.type_as(hidden_states)
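Review note: `apply_rotary_emb` views consecutive channel pairs as complex numbers and multiplies them by the precomputed frequencies. The plain-Python sketch below illustrates the same idea on a single vector, assuming each frequency is a unit-modulus value `exp(i * angle)`. `rotate_pairs` is a hypothetical helper, not part of the file.

```python
import cmath
import math


def rotate_pairs(channels, angles):
    """Rotate each consecutive (even, odd) channel pair by the matching angle.

    Mirrors "view the last axis as complex, multiply, view back as real" on a
    flat Python list; `rotate_pairs` is an illustrative name only.
    """
    if len(channels) != 2 * len(angles):
        raise ValueError("need exactly one angle per channel pair")
    rotated = []
    for pair_index, angle in enumerate(angles):
        z = complex(channels[2 * pair_index], channels[2 * pair_index + 1])
        z = z * cmath.exp(1j * angle)  # unit-modulus rotation
        rotated.extend([z.real, z.imag])
    return rotated


rotated = rotate_pairs([1.0, 0.0, 0.0, 2.0], [math.pi / 2, math.pi])
```

Since every factor has modulus 1, the rotation changes the direction of each pair but not its length, which is why rotary embeddings encode position without rescaling the query and key vectors.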
class AnyFlowAttnProcessor:
"""
Bidirectional self-attention processor for AnyFlow. Routes through
:func:`~diffusers.models.attention_dispatch.dispatch_attention_fn` so any SDPA-compatible backend is supported
(SDPA, flash-attn, xformers, flex, …). FAR causal generation lives in
:class:`~diffusers.models.transformers.transformer_anyflow_far.AnyFlowCausalAttnProcessor`.
"""
_attention_backend = None
_parallel_config = None
def __init__(self):
if not hasattr(F, "scaled_dot_product_attention"):
raise ImportError(
"AnyFlowAttnProcessor requires PyTorch 2.0. To use it, please upgrade PyTorch to 2.0 or higher."
)
def __call__(
self,
attn: "AnyFlowAttention",
hidden_states: torch.Tensor,
encoder_hidden_states: Optional[torch.Tensor] = None,
attention_mask: Optional[Any] = None,
rotary_emb: Optional[Dict[str, torch.Tensor]] = None,
) -> torch.Tensor:
if encoder_hidden_states is None:
encoder_hidden_states = hidden_states
query = attn.to_q(hidden_states)
key = attn.to_k(encoder_hidden_states)
value = attn.to_v(encoder_hidden_states)
if attn.norm_q is not None:
query = attn.norm_q(query)
if attn.norm_k is not None:
key = attn.norm_k(key)
# Layout (B, H, L, D) for rotary application; transposed to (B, L, H, D) before dispatch.
query = query.unflatten(2, (attn.heads, -1)).transpose(1, 2)
key = key.unflatten(2, (attn.heads, -1)).transpose(1, 2)
value = value.unflatten(2, (attn.heads, -1)).transpose(1, 2)
if rotary_emb is not None:
query = apply_rotary_emb(query, rotary_emb["query"])
key = apply_rotary_emb(key, rotary_emb["key"])
hidden_states = dispatch_attention_fn(
query.transpose(1, 2),
key.transpose(1, 2),
value.transpose(1, 2),
attn_mask=attention_mask,
dropout_p=0.0,
is_causal=False,
backend=self._attention_backend,
parallel_config=self._parallel_config,
)
hidden_states = hidden_states.flatten(2, 3)
hidden_states = hidden_states.type_as(query)
hidden_states = attn.to_out[0](hidden_states)
hidden_states = attn.to_out[1](hidden_states)
return hidden_states
class AnyFlowCrossAttnProcessor:
"""
Cross-attention processor for AnyFlow. Always uses the dispatched SDPA-compatible backend; no rotary embedding or
KV cache is applied to the text→video cross-attention path.
"""
_attention_backend = None
_parallel_config = None
def __init__(self):
if not hasattr(F, "scaled_dot_product_attention"):
raise ImportError(
"AnyFlowCrossAttnProcessor requires PyTorch 2.0. To use it, please upgrade PyTorch to 2.0 or higher."
)
def __call__(
self,
attn: "AnyFlowAttention",
hidden_states: torch.Tensor,
encoder_hidden_states: Optional[torch.Tensor] = None,
attention_mask: Optional[torch.Tensor] = None,
) -> torch.Tensor:
query = attn.to_q(hidden_states)
key = attn.to_k(encoder_hidden_states)
value = attn.to_v(encoder_hidden_states)
if attn.norm_q is not None:
query = attn.norm_q(query)
if attn.norm_k is not None:
key = attn.norm_k(key)
# (B, L, H, D) layout for dispatch_attention_fn.
query = query.unflatten(2, (attn.heads, -1))
key = key.unflatten(2, (attn.heads, -1))
value = value.unflatten(2, (attn.heads, -1))
hidden_states = dispatch_attention_fn(
query,
key,
value,
attn_mask=attention_mask,
dropout_p=0.0,
is_causal=False,
backend=self._attention_backend,
parallel_config=self._parallel_config,
)
hidden_states = hidden_states.flatten(2, 3)
hidden_states = hidden_states.type_as(query)
hidden_states = attn.to_out[0](hidden_states)
hidden_states = attn.to_out[1](hidden_states)
return hidden_states
class AnyFlowAttention(torch.nn.Module, AttentionModuleMixin):
"""
Attention module used by :class:`AnyFlowTransformerBlock`. Layout matches the legacy
:class:`~diffusers.models.attention_processor.Attention` so existing AnyFlow checkpoints load bit-exactly into this
class.
"""
_default_processor_cls = AnyFlowAttnProcessor
_available_processors = [AnyFlowAttnProcessor, AnyFlowCrossAttnProcessor]
def __init__(
self,
dim: int,
heads: int,
dim_head: int,
eps: float = 1e-6,
processor: Optional[Any] = None,
):
super().__init__()
self.heads = heads
self.inner_dim = heads * dim_head
self.to_q = torch.nn.Linear(dim, self.inner_dim, bias=True)
self.to_k = torch.nn.Linear(dim, self.inner_dim, bias=True)
self.to_v = torch.nn.Linear(dim, self.inner_dim, bias=True)
self.to_out = torch.nn.ModuleList(
[
torch.nn.Linear(self.inner_dim, dim, bias=True),
torch.nn.Dropout(0.0),
]
)
# ``rms_norm_across_heads`` per-axis: normalize Q and K across the entire ``heads * dim_head``
# channel axis. We use diffusers' RMSNorm (rather than ``torch.nn.RMSNorm``) so the numerics
# match the legacy Attention class that produced the released checkpoints.
self.norm_q = RMSNorm(self.inner_dim, eps=eps)
self.norm_k = RMSNorm(self.inner_dim, eps=eps)
self.set_processor(processor if processor is not None else self._default_processor_cls())
def forward(self, hidden_states: torch.Tensor, **kwargs) -> torch.Tensor:
return self.processor(self, hidden_states, **kwargs)
class AnyFlowImageEmbedding(torch.nn.Module):
def __init__(self, in_features: int, out_features: int):
super().__init__()
self.norm1 = FP32LayerNorm(in_features)
self.ff = FeedForward(in_features, out_features, mult=1, activation_fn="gelu")
self.norm2 = FP32LayerNorm(out_features)
def forward(self, encoder_hidden_states_image: torch.Tensor) -> torch.Tensor:
hidden_states = self.norm1(encoder_hidden_states_image)
hidden_states = self.ff(hidden_states)
hidden_states = self.norm2(hidden_states)
return hidden_states
class AnyFlowDualTimestepTextImageEmbedding(nn.Module):
def __init__(
self,
dim: int,
gate_value: float,
deltatime_type: str,
time_freq_dim: int,
time_proj_dim: int,
text_embed_dim: int,
image_embed_dim: Optional[int] = None,
):
super().__init__()
self.timesteps_proj = Timesteps(num_channels=time_freq_dim, flip_sin_to_cos=True, downscale_freq_shift=0)
self.time_embedder = TimestepEmbedding(in_channels=time_freq_dim, time_embed_dim=dim)
self.delta_embedder = TimestepEmbedding(in_channels=time_freq_dim, time_embed_dim=dim)
self.act_fn = nn.SiLU()
self.time_proj = nn.Linear(dim, time_proj_dim)
self.text_embedder = PixArtAlphaTextProjection(text_embed_dim, dim, act_fn="gelu_tanh")
self.image_embedder = None
if image_embed_dim is not None:
self.image_embedder = AnyFlowImageEmbedding(image_embed_dim, dim)
self.register_buffer("delta_emb_gate", torch.tensor([gate_value], dtype=torch.float32), persistent=False)
self.deltatime_type = deltatime_type
def forward_timestep(
self, timestep: torch.Tensor, delta_timestep: torch.Tensor, encoder_hidden_states, token_per_frame
):
batch_size, num_frames = timestep.shape
timestep = timestep.reshape(-1)
delta_timestep = delta_timestep.reshape(-1)
timestep = self.timesteps_proj(timestep)
time_embedder_dtype = next(iter(self.time_embedder.parameters())).dtype
if timestep.dtype != time_embedder_dtype and time_embedder_dtype != torch.int8:
timestep = timestep.to(time_embedder_dtype)
temb = self.time_embedder(timestep).type_as(encoder_hidden_states)
delta_timestep = self.timesteps_proj(delta_timestep)
delta_embedder_dtype = next(iter(self.delta_embedder.parameters())).dtype
if delta_timestep.dtype != delta_embedder_dtype and delta_embedder_dtype != torch.int8:
delta_timestep = delta_timestep.to(delta_embedder_dtype)
delta_emb = self.delta_embedder(delta_timestep).type_as(encoder_hidden_states)
gate = self.delta_emb_gate.to(delta_embedder_dtype)
rt_emb = (1 - gate) * temb + gate * delta_emb
timestep_proj = self.time_proj(self.act_fn(rt_emb))
rt_emb = rt_emb.unflatten(0, (batch_size, num_frames)).repeat_interleave(token_per_frame, dim=1)
timestep_proj = timestep_proj.unflatten(0, (batch_size, num_frames)).repeat_interleave(token_per_frame, dim=1)
return rt_emb, timestep_proj
def forward(
self,
timestep: torch.Tensor,
r_timestep: torch.Tensor,
encoder_hidden_states: torch.Tensor,
encoder_hidden_states_image: Optional[torch.Tensor] = None,
layout_cfg=None,
):
if self.deltatime_type == "r":
delta_timestep = r_timestep
elif self.deltatime_type == "t-r":
delta_timestep = timestep - r_timestep
else:
raise NotImplementedError(f"Unsupported deltatime_type {self.deltatime_type!r}; expected 'r' or 't-r'.")
timestep, timestep_proj = self.forward_timestep(
timestep, delta_timestep, encoder_hidden_states, layout_cfg["full_token_per_frame"]
)
encoder_hidden_states = self.text_embedder(encoder_hidden_states)
if encoder_hidden_states_image is not None:
encoder_hidden_states_image = self.image_embedder(encoder_hidden_states_image)
return timestep, timestep_proj, encoder_hidden_states, encoder_hidden_states_image
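Review note: the dual-timestep embedding above does three things: it picks the delta timestep according to `deltatime_type`, blends the two embeddings with a fixed gate, and repeats the per-frame result once per token. A dependency-free sketch of those three steps, using plain lists in place of tensors, might look like this. All function names here are hypothetical.

```python
def select_delta(t, r, deltatime_type):
    """Pick the quantity fed to the delta embedder."""
    if deltatime_type == "r":
        return r
    if deltatime_type == "t-r":
        return t - r
    raise NotImplementedError(f"unsupported deltatime_type: {deltatime_type!r}")


def mix_embeddings(temb, delta_emb, gate):
    """Convex blend: (1 - gate) * temb + gate * delta_emb, element-wise."""
    return [(1 - gate) * a + gate * b for a, b in zip(temb, delta_emb)]


def repeat_per_token(per_frame, tokens_per_frame):
    """List analogue of repeat_interleave along the frame axis."""
    return [value for value in per_frame for _ in range(tokens_per_frame)]


delta = select_delta(0.75, 0.25, "t-r")
mixed = mix_embeddings([1.0, 2.0], [3.0, 6.0], gate=0.25)
expanded = repeat_per_token(["frame0", "frame1"], tokens_per_frame=2)
```

With `gate=0.25` the source-timestep embedding keeps three quarters of the weight, so the delta embedding acts as a correction rather than a replacement.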
class AnyFlowRotaryPosEmbed(nn.Module):
"""Rotary positional embedding for the bidirectional AnyFlow transformer.
The FAR causal variant lives in :mod:`~diffusers.models.transformers.transformer_anyflow_far` and additionally
handles compressed-frame chunks; this bidi class produces frequencies for the single full-resolution token grid
only.
"""
def __init__(
self,
attention_head_dim: int,
patch_size: Tuple[int, int, int],
max_seq_len: int,
theta: float = 10000.0,
):
super().__init__()
self.attention_head_dim = attention_head_dim
self.patch_size = patch_size
self.max_seq_len = max_seq_len
self.theta = theta
# Frequency table is lazily built per-device in ``_build_freqs``: MPS / NPU don't support
# complex128, so we downcast to complex64 there.
self._freqs_cache: Optional[Tuple[Any, torch.Tensor]] = None
def _build_freqs(self, device: torch.device) -> torch.Tensor:
cache_key = (device.type, str(device))
if self._freqs_cache is not None and self._freqs_cache[0] == cache_key:
return self._freqs_cache[1]
is_mps = device.type == "mps"
is_npu = device.type == "npu"
freqs_dtype = torch.float32 if (is_mps or is_npu) else torch.float64
h_dim = w_dim = 2 * (self.attention_head_dim // 6)
t_dim = self.attention_head_dim - h_dim - w_dim
freqs_list = []
for dim in (t_dim, h_dim, w_dim):
f = get_1d_rotary_pos_embed(
dim,
self.max_seq_len,
self.theta,
use_real=False,
repeat_interleave_real=False,
freqs_dtype=freqs_dtype,
)
freqs_list.append(f.to(device))
freqs = torch.cat(freqs_list, dim=1)
self._freqs_cache = (cache_key, freqs)
return freqs
def _forward_full_frame(self, num_frames, height, width, device) -> torch.Tensor:
ppf, pph, ppw = num_frames, height, width
freqs_full = self._build_freqs(device)
if min(ppf, pph, ppw) <= 0:
freq_channels = self.attention_head_dim // 2
return torch.empty((ppf, pph, ppw, freq_channels), dtype=freqs_full.dtype, device=device)
freqs = freqs_full.split_with_sizes(
[
self.attention_head_dim // 2 - 2 * (self.attention_head_dim // 6),
self.attention_head_dim // 6,
self.attention_head_dim // 6,
],
dim=1,
)
freqs_f = freqs[0][:ppf].view(ppf, 1, 1, -1).expand(ppf, pph, ppw, -1)
freqs_h = freqs[1][:pph].view(1, pph, 1, -1).expand(ppf, pph, ppw, -1)
freqs_w = freqs[2][:ppw].view(1, 1, ppw, -1).expand(ppf, pph, ppw, -1)
freqs = torch.cat([freqs_f, freqs_h, freqs_w], dim=-1)
return freqs
def forward(self, layout_cfg, device):
freqs = self._forward_full_frame(
num_frames=layout_cfg["total_frames"],
height=layout_cfg["full_frame_shape"][0],
width=layout_cfg["full_frame_shape"][1],
device=device,
)
freqs = freqs.flatten(start_dim=0, end_dim=2)
freqs = freqs[None, None, ...]
return {"query": freqs, "key": freqs}
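Review note: the rotary module splits each attention head's channels across the temporal, height, and width axes. The arithmetic below reproduces that split as a sketch, assuming an even `attention_head_dim` and one complex frequency per pair of real channels. `rope_axis_split` is a hypothetical helper for illustration.

```python
def rope_axis_split(attention_head_dim):
    """Return ((t_dim, h_dim, w_dim), (t_complex, h_complex, w_complex)).

    The first tuple counts real channels per axis; the second counts complex
    frequencies, i.e. the sizes used when splitting the frequency table.
    """
    h_dim = w_dim = 2 * (attention_head_dim // 6)
    t_dim = attention_head_dim - h_dim - w_dim
    complex_sizes = (
        attention_head_dim // 2 - 2 * (attention_head_dim // 6),
        attention_head_dim // 6,
        attention_head_dim // 6,
    )
    return (t_dim, h_dim, w_dim), complex_sizes


real_dims, complex_dims = rope_axis_split(128)
```

For the default head dimension of 128 the temporal axis receives the leftover channels, so the three complex sizes still sum to half the head dimension.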
class AnyFlowTransformerBlock(nn.Module):
"""AnyFlow transformer block.
The self-attention processor is chosen at construction by ``is_causal``: the bidirectional transformer passes
``is_causal=False`` (the default), the FAR causal transformer passes ``is_causal=True``. The forward pass is
identical in both modes — only the processor differs, so all causal-specific machinery (BlockMask, KV cache) lives
inside the processor.
"""
def __init__(
self,
dim: int,
ffn_dim: int,
num_heads: int,
cross_attn_norm: bool = False,
eps: float = 1e-6,
is_causal: bool = False,
):
super().__init__()
self.is_causal = is_causal
# 1. Self-attention. The causal processor lives in the FAR sibling module; lazy-import to
# avoid a circular import at module load time.
if is_causal:
from .transformer_anyflow_far import AnyFlowCausalAttnProcessor
self_attn_processor = AnyFlowCausalAttnProcessor()
else:
self_attn_processor = AnyFlowAttnProcessor()
self.norm1 = FP32LayerNorm(dim, eps, elementwise_affine=False)
self.attn1 = AnyFlowAttention(
dim=dim,
heads=num_heads,
dim_head=dim // num_heads,
eps=eps,
processor=self_attn_processor,
)
# 2. Cross-attention
self.attn2 = AnyFlowAttention(
dim=dim,
heads=num_heads,
dim_head=dim // num_heads,
eps=eps,
processor=AnyFlowCrossAttnProcessor(),
)
self.norm2 = FP32LayerNorm(dim, eps, elementwise_affine=True) if cross_attn_norm else nn.Identity()
# 3. Feed-forward
self.ffn = FeedForward(dim, inner_dim=ffn_dim, activation_fn="gelu-approximate")
self.norm3 = FP32LayerNorm(dim, eps, elementwise_affine=False)
self.scale_shift_table = nn.Parameter(torch.randn(1, 6, dim) / dim**0.5)
def forward(
self,
hidden_states: torch.Tensor,
encoder_hidden_states: torch.Tensor,
temb: torch.Tensor,
rotary_emb: Dict[str, torch.Tensor],
attention_mask: Optional[Any],
kv_cache=None,
kv_cache_flag=None,
) -> torch.Tensor:
shift_msa, scale_msa, gate_msa, c_shift_msa, c_scale_msa, c_gate_msa = (
self.scale_shift_table + temb.float()
).chunk(6, dim=2)
shift_msa, scale_msa, gate_msa, c_shift_msa, c_scale_msa, c_gate_msa = (
shift_msa.squeeze(2),
scale_msa.squeeze(2),
gate_msa.squeeze(2),
c_shift_msa.squeeze(2),
c_scale_msa.squeeze(2),
c_gate_msa.squeeze(2),
) # noqa: E501
# 1. Self-attention
norm_hidden_states = (self.norm1(hidden_states.float()) * (1 + scale_msa) + shift_msa).type_as(hidden_states)
attn1_kwargs = {
"hidden_states": norm_hidden_states,
"rotary_emb": rotary_emb,
"attention_mask": attention_mask,
}
# KV cache kwargs are only consumed by the FAR causal processor; the bidi processor
# doesn't accept them, so we forward them only when they're actually populated.
if kv_cache is not None:
attn1_kwargs["kv_cache"] = kv_cache
attn1_kwargs["kv_cache_flag"] = kv_cache_flag
attn_output = self.attn1(**attn1_kwargs)
hidden_states = (hidden_states.float() + attn_output * gate_msa).type_as(hidden_states)
# 2. Cross-attention
norm_hidden_states = self.norm2(hidden_states.float()).type_as(hidden_states)
attn_output = self.attn2(hidden_states=norm_hidden_states, encoder_hidden_states=encoder_hidden_states)
hidden_states = hidden_states + attn_output
# 3. Feed-forward
norm_hidden_states = (self.norm3(hidden_states.float()) * (1 + c_scale_msa) + c_shift_msa).type_as(
hidden_states
)
ff_output = self.ffn(norm_hidden_states)
hidden_states = (hidden_states.float() + ff_output.float() * c_gate_msa).type_as(hidden_states)
return hidden_states
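Review note: the block above follows the usual adaptive-LayerNorm pattern, where normalized activations are modulated by `(1 + scale)` and `shift`, and the sub-layer output is added back through a gate. The sketch below shows the two element-wise formulas on plain lists. `modulate` and `gated_residual` are hypothetical names, and the normalization itself is assumed to have been applied already.

```python
def modulate(normed, scale, shift):
    """Apply normed * (1 + scale) + shift element-wise."""
    return [n * (1.0 + s) + b for n, s, b in zip(normed, scale, shift)]


def gated_residual(residual, update, gate):
    """Apply residual + update * gate element-wise."""
    return [x + u * g for x, u, g in zip(residual, update, gate)]


tokens = [0.5, -1.0, 2.0]
identity = modulate(tokens, scale=[0.0, 0.0, 0.0], shift=[0.0, 0.0, 0.0])
shifted = modulate(tokens, scale=[1.0, 1.0, 1.0], shift=[0.5, 0.5, 0.5])
skipped = gated_residual(tokens, update=[9.0, 9.0, 9.0], gate=[0.0, 0.0, 0.0])
```

A zero scale and shift leave the normalized input unchanged, and a zero gate skips the sub-layer entirely. In the block above only the self-attention and feed-forward paths are gated; the cross-attention output is added without a gate.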
class AnyFlowTransformer3DModel(ModelMixin, ConfigMixin, PeftAdapterMixin, FromOriginalModelMixin):
r"""
Bidirectional 3D Transformer for AnyFlow flow-map sampling.
The architecture is the v0.35.1 Wan2.1 3D DiT backbone with one structural change: the timestep embedder is
replaced by ``AnyFlowDualTimestepTextImageEmbedding`` so that every forward call conditions on both the source
timestep ``t`` and the target timestep ``r``. This is the embedding required to learn the flow map
:math:`\Phi_{r\leftarrow t}` introduced in [AnyFlow](https://huggingface.co/papers/2605.13724) by Yuchao Gu, Guian
Fang et al.
For frame-level autoregressive (FAR causal) generation, use ``AnyFlowFARTransformer3DModel`` instead; that variant
adds the FAR causal block-mask and a compressed-frame patch embedding on top of the same backbone.
Args:
patch_size (`Tuple[int]`, defaults to `(1, 2, 2)`):
3D patch dimensions for video embedding (t_patch, h_patch, w_patch).
num_attention_heads (`int`, defaults to `40`):
Number of attention heads.
attention_head_dim (`int`, defaults to `128`):
The number of channels in each head.
in_channels (`int`, defaults to `16`):
The number of channels in the input latent.
out_channels (`int`, defaults to `16`):
The number of channels in the output latent.
text_dim (`int`, defaults to `4096`):
Input dimension for text embeddings (UMT5).
freq_dim (`int`, defaults to `256`):
Dimension for sinusoidal time embeddings.
ffn_dim (`int`, defaults to `13824`):
Intermediate dimension in feed-forward network.
num_layers (`int`, defaults to `40`):
Number of transformer blocks.
cross_attn_norm (`bool`, defaults to `True`):
Enable cross-attention normalization.
eps (`float`, defaults to `1e-6`):
Epsilon for normalization layers.
image_dim (`Optional[int]`, *optional*, defaults to `None`):
Image embedding dimension for I2V conditioning (`1280` for the original Wan2.1-I2V model).
rope_max_seq_len (`int`, defaults to `1024`):
Maximum sequence length used to precompute rotary position frequencies.
gate_value (`float`, defaults to `0.25`):
Mixing gate between source-timestep and delta-timestep embeddings (the AnyFlow paper's :math:`g` parameter,
fixed at 0.25 in stage-1 distillation).
deltatime_type (`str`, defaults to `'r'`):
Either ``"r"`` (delta is the target timestep) or ``"t-r"`` (delta is the signed interval ``t - r``).
"""
_supports_gradient_checkpointing = True
_skip_layerwise_casting_patterns = ["patch_embedding", "condition_embedder", "norm"]
_no_split_modules = ["AnyFlowTransformerBlock"]
_keep_in_fp32_modules = ["time_embedder", "scale_shift_table", "norm1", "norm2", "norm3"]
_repeated_blocks = ["AnyFlowTransformerBlock"]
@register_to_config
def __init__(
self,
patch_size: Tuple[int, int, int] = (1, 2, 2),
num_attention_heads: int = 40,
attention_head_dim: int = 128,
in_channels: int = 16,
out_channels: int = 16,
text_dim: int = 4096,
freq_dim: int = 256,
ffn_dim: int = 13824,
num_layers: int = 40,
cross_attn_norm: bool = True,
eps: float = 1e-6,
image_dim: Optional[int] = None,
rope_max_seq_len: int = 1024,
gate_value: float = 0.25,
deltatime_type: str = "r",
) -> None:
super().__init__()
inner_dim = num_attention_heads * attention_head_dim
out_channels = out_channels or in_channels
# 1. Patch & position embedding (full-frame only).
self.rope = AnyFlowRotaryPosEmbed(attention_head_dim, patch_size, rope_max_seq_len)
self.patch_embedding = nn.Conv3d(in_channels, inner_dim, kernel_size=patch_size, stride=patch_size)
# 2. Condition embedding (always dual-timestep for AnyFlow distilled checkpoints).
self.condition_embedder = AnyFlowDualTimestepTextImageEmbedding(
dim=inner_dim,
gate_value=gate_value,
deltatime_type=deltatime_type,
time_freq_dim=freq_dim,
time_proj_dim=inner_dim * 6,
text_embed_dim=text_dim,
image_embed_dim=image_dim,
)
# 3. Transformer blocks
self.blocks = nn.ModuleList(
[
AnyFlowTransformerBlock(inner_dim, ffn_dim, num_attention_heads, cross_attn_norm, eps)
for _ in range(num_layers)
]
)
# 4. Output norm & projection
self.norm_out = FP32LayerNorm(inner_dim, eps, elementwise_affine=False)
self.proj_out = nn.Linear(inner_dim, out_channels * math.prod(patch_size))
self.scale_shift_table = nn.Parameter(torch.randn(1, 2, inner_dim) / inner_dim**0.5)
self.gradient_checkpointing = False
def _unpack_latent_sequence(self, latents, num_frames, height, width, patch_size):
batch_size, num_patches, channels = latents.shape
height, width = height // patch_size, width // patch_size
latents = latents.view(
batch_size * num_frames, height, width, patch_size, patch_size, channels // (patch_size * patch_size)
)
latents = latents.permute(0, 5, 1, 3, 2, 4)
latents = latents.reshape(
batch_size, num_frames, channels // (patch_size * patch_size), height * patch_size, width * patch_size
)
return latents
@apply_lora_scale("attention_kwargs")
def forward(
self,
hidden_states: torch.Tensor,
timestep: torch.Tensor,
r_timestep: torch.Tensor,
encoder_hidden_states: torch.Tensor,
encoder_hidden_states_image: Optional[torch.Tensor] = None,
attention_kwargs: Optional[Dict[str, Any]] = None,
return_dict: bool = True,
) -> Union[Transformer2DModelOutput, Tuple]:
"""
Bidirectional flow-map forward pass. ``hidden_states`` is laid out as ``(B, F, C, H, W)`` (per-frame latents).
The input is patchified with the standard ``patch_embedding`` (kernel = stride = ``patch_size``) and denoised
with global bidirectional self-attention over the resulting flat token sequence.
Args:
hidden_states (`torch.Tensor` of shape `(batch_size, num_frames, num_channels, height, width)`):
Input video latents.
timestep (`torch.Tensor`):
Source (noisier) flow-map timestep `t`.
r_timestep (`torch.Tensor`):
Target (cleaner) flow-map timestep `r`; defines the destination of the flow-map step.
encoder_hidden_states (`torch.Tensor` of shape `(batch_size, sequence_len, embed_dims)`):
Text-conditioning embeddings.
encoder_hidden_states_image (`torch.Tensor`, *optional*):
Image-conditioning embeddings; concatenated before the text tokens when provided.
attention_kwargs (`dict`, *optional*):
Kwargs forwarded to the `AttentionProcessor` as defined under `self.processor` in
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
return_dict (`bool`, *optional*, defaults to `True`):
Whether to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain tuple.
Returns:
[`~models.transformer_2d.Transformer2DModelOutput`] if `return_dict` is True, otherwise a `tuple` whose
first element is the predicted velocity tensor.
"""
hidden_states = hidden_states.permute(0, 2, 1, 3, 4)
batch_size, num_channels, num_frames, height, width = hidden_states.shape
full_token_per_frame = (height * width) // (self.config.patch_size[1] * self.config.patch_size[2])
layout_cfg = {
"total_frames": num_frames,
"full_frame_shape": (height // self.config.patch_size[1], width // self.config.patch_size[2]),
"full_token_per_frame": full_token_per_frame,
}
rotary_emb = self.rope(layout_cfg=layout_cfg, device=hidden_states.device)
hidden_states = self.patch_embedding(hidden_states)
hidden_states = hidden_states.flatten(2).transpose(1, 2)
temb, timestep_proj, encoder_hidden_states, encoder_hidden_states_image = self.condition_embedder(
timestep,
r_timestep,
encoder_hidden_states,
encoder_hidden_states_image,
layout_cfg=layout_cfg,
)
timestep_proj = timestep_proj.unflatten(2, (6, -1))
attention_mask = None
if encoder_hidden_states_image is not None:
    encoder_hidden_states = torch.concat([encoder_hidden_states_image, encoder_hidden_states], dim=1)
if torch.is_grad_enabled() and self.gradient_checkpointing:
    for block in self.blocks:
        hidden_states = self._gradient_checkpointing_func(
            block, hidden_states, encoder_hidden_states, timestep_proj, rotary_emb, attention_mask
        )
else:
    for block in self.blocks:
        hidden_states = block(hidden_states, encoder_hidden_states, timestep_proj, rotary_emb, attention_mask)
# Output norm, projection & unpatchify.
# `temb` is always 3D from `condition_embedder.forward()` (broadcast over total tokens).
shift, scale = (self.scale_shift_table.unsqueeze(0) + temb.unsqueeze(2)).chunk(2, dim=2)
shift = shift.squeeze(2)
scale = scale.squeeze(2)
# Move shift/scale to hidden_states' device for multi-GPU accelerate inference.
shift = shift.to(hidden_states.device)
scale = scale.to(hidden_states.device)
hidden_states = (self.norm_out(hidden_states.float()) * (1 + scale) + shift).type_as(hidden_states)
hidden_states = self.proj_out(hidden_states)
output = self._unpack_latent_sequence(
    hidden_states,
    num_frames=layout_cfg["total_frames"],
    height=height,
    width=width,
    patch_size=self.config.patch_size[1],
)
if not return_dict:
    return (output,)
return Transformer2DModelOutput(sample=output)
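
The output step above applies `norm_out(x) * (1 + scale) + shift` to every token. A minimal pure-Python sketch of that modulation on a single token vector follows. It assumes `norm_out` behaves like a layer norm without learned affine parameters, which the hunk does not confirm; `layer_norm` and `modulate` are hypothetical helper names used only for illustration.

```python
import math


def layer_norm(values, eps=1e-6):
    """Normalize a list of floats to zero mean and unit variance (no learned affine)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def modulate(token, shift, scale):
    """Apply norm(x) * (1 + scale) + shift element-wise to one token vector."""
    normed = layer_norm(token)
    return [n * (1.0 + s) + b for n, s, b in zip(normed, scale, shift)]


token = [1.0, 2.0, 3.0, 4.0]
# Zero shift and zero scale leave the normalized token unchanged.
identity = modulate(token, shift=[0.0] * 4, scale=[0.0] * 4)
# scale = 1 doubles the normalized values, shift = 0.5 offsets them.
shifted = modulate(token, shift=[0.5] * 4, scale=[1.0] * 4)
```

Because the modulation uses `1 + scale`, a zero-initialized scale and shift reduce it to a plain normalization, which is presumably why this form is common for conditioning on the timestep embedding.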

File diff suppressed because it is too large

View File

@@ -608,8 +608,16 @@ class BriaTransformer2DModel(ModelMixin, ConfigMixin, PeftAdapterMixin, FromOrig
from the embeddings of input conditions.
timestep (`torch.LongTensor`):
Used to indicate denoising step.
block_controlnet_hidden_states: (`list` of `torch.Tensor`):
img_ids (`torch.Tensor`):
Image position ids used to compute the rotary positional embeddings.
txt_ids (`torch.Tensor`):
Text position ids used to compute the rotary positional embeddings.
guidance (`torch.Tensor`, *optional*):
Guidance scale embedding used for guidance-distilled variants of the model.
controlnet_block_samples (`list` of `torch.Tensor`, *optional*):
A list of tensors that if specified are added to the residuals of transformer blocks.
controlnet_single_block_samples (`list` of `torch.Tensor`, *optional*):
A list of tensors that if specified are added to the residuals of single transformer blocks.
attention_kwargs (`dict`, *optional*):
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
`self.processor` in

View File

@@ -529,10 +529,18 @@ class BriaFiboTransformer2DModel(ModelMixin, ConfigMixin, PeftAdapterMixin, From
Input `hidden_states`.
encoder_hidden_states (`torch.FloatTensor` of shape `(batch_size, sequence_len, embed_dims)`):
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
text_encoder_layers (`list` of `torch.Tensor`):
Per-block text encoder hidden states, one tensor per transformer block.
pooled_projections (`torch.FloatTensor` of shape `(batch_size, projection_dim)`):
Embeddings projected from the embeddings of input conditions.
timestep (`torch.LongTensor`):
Used to indicate denoising step.
img_ids (`torch.Tensor`):
Image position ids used to compute the rotary positional embeddings.
txt_ids (`torch.Tensor`):
Text position ids used to compute the rotary positional embeddings.
guidance (`torch.Tensor`, *optional*):
Guidance scale embedding used for guidance-distilled variants of the model.
joint_attention_kwargs (`dict`, *optional*):
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
`self.processor` in

View File

@@ -498,8 +498,18 @@ class ChromaTransformer2DModel(
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
timestep (`torch.LongTensor`):
Used to indicate denoising step.
block_controlnet_hidden_states: (`list` of `torch.Tensor`):
img_ids (`torch.Tensor`):
Image position ids used to compute the rotary positional embeddings.
txt_ids (`torch.Tensor`):
Text position ids used to compute the rotary positional embeddings.
attention_mask (`torch.Tensor`, *optional*):
Mask applied to `encoder_hidden_states` during attention.
controlnet_block_samples (`list` of `torch.Tensor`, *optional*):
A list of tensors that if specified are added to the residuals of transformer blocks.
controlnet_single_block_samples (`list` of `torch.Tensor`, *optional*):
A list of tensors that if specified are added to the residuals of single transformer blocks.
controlnet_blocks_repeat (`bool`, *optional*, defaults to `False`):
Whether to repeat the controlnet block samples across all transformer blocks.
joint_attention_kwargs (`dict`, *optional*):
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
`self.processor` in

View File

@@ -651,6 +651,30 @@ class ChronoEditTransformer3DModel(
return_dict: bool = True,
attention_kwargs: dict[str, Any] | None = None,
) -> torch.Tensor | dict[str, torch.Tensor]:
"""
The [`ChronoEditTransformer3DModel`] forward method.
Args:
hidden_states (`torch.Tensor` of shape `(batch_size, num_channels, num_frames, height, width)`):
Input `hidden_states`.
timestep (`torch.LongTensor`):
Used to indicate denoising step.
encoder_hidden_states (`torch.Tensor` of shape `(batch_size, sequence_len, embed_dims)`):
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
encoder_hidden_states_image (`torch.Tensor`, *optional*):
Conditional image embeddings for image-conditioned generation.
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
tuple.
attention_kwargs (`dict`, *optional*):
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
`self.processor` in
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
Returns:
If `return_dict` is True, an [`~models.transformer_2d.Transformer2DModelOutput`] is returned, otherwise a
`tuple` where the first element is the sample tensor.
"""
batch_size, num_channels, num_frames, height, width = hidden_states.shape
p_t, p_h, p_w = self.config.patch_size
post_patch_num_frames = num_frames // p_t
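
The patch arithmetic that opens this forward pass determines how many tokens the transformer sees. A small sketch of that calculation is given below; the helper names and the example latent shape are illustrative assumptions, not values taken from the model config.

```python
def post_patch_shape(num_frames, height, width, patch_size):
    """Return the (frames, height, width) grid after patchifying with (p_t, p_h, p_w)."""
    p_t, p_h, p_w = patch_size
    return num_frames // p_t, height // p_h, width // p_w


def sequence_length(num_frames, height, width, patch_size):
    """Number of tokens produced by flattening the post-patch grid."""
    frames, rows, cols = post_patch_shape(num_frames, height, width, patch_size)
    return frames * rows * cols


# Hypothetical latent of 21 frames at 60x104 with a (1, 2, 2) patch size.
grid = post_patch_shape(21, 60, 104, (1, 2, 2))
tokens = sequence_length(21, 60, 104, (1, 2, 2))
```

Integer division silently drops any remainder, so inputs are expected to be divisible by the patch size along each axis.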

View File

@@ -713,6 +713,38 @@ class CogView4Transformer2DModel(ModelMixin, ConfigMixin, PeftAdapterMixin, Cach
attention_mask: torch.Tensor | None = None,
image_rotary_emb: tuple[torch.Tensor, torch.Tensor] | list[tuple[torch.Tensor, torch.Tensor]] | None = None,
) -> tuple[torch.Tensor] | Transformer2DModelOutput:
"""
The [`CogView4Transformer2DModel`] forward method.
Args:
hidden_states (`torch.Tensor` of shape `(batch_size, in_channels, height, width)`):
Input `hidden_states`.
encoder_hidden_states (`torch.Tensor` of shape `(batch_size, sequence_len, embed_dims)`):
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
timestep (`torch.LongTensor`):
Used to indicate denoising step.
original_size (`torch.Tensor`):
Original image size conditioning.
target_size (`torch.Tensor`):
Target image size conditioning.
crop_coords (`torch.Tensor`):
Crop coordinates conditioning.
attention_kwargs (`dict`, *optional*):
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
`self.processor` in
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
tuple.
attention_mask (`torch.Tensor`, *optional*):
Mask applied to attention scores.
image_rotary_emb (`tuple` of `torch.Tensor`, *optional*):
Pre-computed rotary positional embeddings.
Returns:
If `return_dict` is True, an [`~models.transformer_2d.Transformer2DModelOutput`] is returned, otherwise a
`tuple` where the first element is the sample tensor.
"""
batch_size, num_channels, height, width = hidden_states.shape
# 1. RoPE
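
The `return_dict` convention documented above can be illustrated with a toy stand-in. The sketch below is not the diffusers implementation: `ToyModelOutput` and `toy_forward` are hypothetical names that only mirror the pattern of returning either an output object with a `sample` field or a one-element tuple.

```python
from dataclasses import dataclass


@dataclass
class ToyModelOutput:
    """Stand-in for an output class exposing a `sample` field."""

    sample: list


def toy_forward(hidden_states, return_dict=True):
    """Double every value, then wrap the result according to `return_dict`."""
    output = [2.0 * value for value in hidden_states]
    if not return_dict:
        # Plain tuple: the sample is always the first element.
        return (output,)
    return ToyModelOutput(sample=output)


as_output = toy_forward([1.0, 2.0])
as_tuple = toy_forward([1.0, 2.0], return_dict=False)
```

Callers that index `[0]` on the tuple form and callers that read `.sample` on the object form get the same data, which is the property the docstrings describe.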

View File

@@ -697,6 +697,34 @@ class CosmosTransformer3DModel(ModelMixin, ConfigMixin, FromOriginalModelMixin,
padding_mask: torch.Tensor | None = None,
return_dict: bool = True,
) -> tuple[torch.Tensor] | Transformer2DModelOutput:
"""
The [`CosmosTransformer3DModel`] forward method.
Args:
hidden_states (`torch.Tensor` of shape `(batch_size, num_channels, num_frames, height, width)`):
Input `hidden_states`.
timestep (`torch.LongTensor`):
Used to indicate denoising step.
encoder_hidden_states (`torch.Tensor` of shape `(batch_size, sequence_len, embed_dims)`):
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
block_controlnet_hidden_states (`list` of `torch.Tensor`, *optional*):
A list of tensors that if specified are added to the residuals of transformer blocks.
attention_mask (`torch.Tensor`, *optional*):
Mask applied to `encoder_hidden_states` during attention.
fps (`int`, *optional*):
Frames per second of the input video used to compute the rotary positional embeddings.
condition_mask (`torch.Tensor`, *optional*):
Mask channel concatenated to `hidden_states` to indicate the conditioning region.
padding_mask (`torch.Tensor`, *optional*):
Padding mask concatenated to `hidden_states` when `concat_padding_mask` is enabled.
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
tuple.
Returns:
If `return_dict` is True, an [`~models.transformer_2d.Transformer2DModelOutput`] is returned, otherwise a
`tuple` where the first element is the sample tensor.
"""
batch_size, num_channels, num_frames, height, width = hidden_states.shape
# 1. Concatenate padding mask if needed & prepare attention mask

View File

@@ -469,6 +469,33 @@ class EasyAnimateTransformer3DModel(ModelMixin, ConfigMixin):
control_latents: torch.Tensor | None = None,
return_dict: bool = True,
) -> tuple[torch.Tensor] | Transformer2DModelOutput:
"""
The [`EasyAnimateTransformer3DModel`] forward method.
Args:
hidden_states (`torch.Tensor` of shape `(batch_size, channels, num_frames, height, width)`):
Input `hidden_states`.
timestep (`torch.LongTensor`):
Used to indicate denoising step.
timestep_cond (`torch.Tensor`, *optional*):
Conditional embeddings for timestep. If provided, the embeddings will be summed with the samples passed
through the `self.time_embedding` layer to obtain the final timestep embeddings.
encoder_hidden_states (`torch.Tensor`, *optional*):
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
encoder_hidden_states_t5 (`torch.Tensor`, *optional*):
Additional conditional embeddings computed from a T5 text encoder.
inpaint_latents (`torch.Tensor`, *optional*):
Latents concatenated to `hidden_states` for inpainting variants of the model.
control_latents (`torch.Tensor`, *optional*):
Latents concatenated to `hidden_states` for control variants of the model.
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
tuple.
Returns:
If `return_dict` is True, an [`~models.transformer_2d.Transformer2DModelOutput`] is returned, otherwise a
`tuple` where the first element is the sample tensor.
"""
batch_size, channels, video_length, height, width = hidden_states.size()
p = self.config.patch_size
post_patch_height = height // p

View File

@@ -350,6 +350,23 @@ class ErnieImageTransformer2DModel(ModelMixin, ConfigMixin, PeftAdapterMixin):
text_lens: torch.Tensor,
return_dict: bool = True,
):
"""
The [`ErnieImageTransformer2DModel`] forward method.
Args:
hidden_states (`torch.Tensor` of shape `(batch_size, in_channels, height, width)`):
Input `hidden_states`.
timestep (`torch.LongTensor`):
Used to indicate denoising step.
text_bth (`torch.Tensor`):
Conditional text embeddings (embeddings computed from the input conditions such as prompts) to use,
shaped `(batch_size, text_length, embed_dims)`.
text_lens (`torch.Tensor`):
Per-sample text sequence lengths used to build the attention mask.
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
tuple.
"""
device, dtype = hidden_states.device, hidden_states.dtype
B, C, H, W = hidden_states.shape
p, Hp, Wp = self.patch_size, H // self.patch_size, W // self.patch_size
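
The docstring says `text_lens` is used to build the attention mask but the hunk does not show how. One common construction, sketched here as an assumption rather than as the model's actual code, marks each position as valid when it falls before the sample's length; `lengths_to_mask` is a hypothetical helper.

```python
def lengths_to_mask(text_lens, max_len=None):
    """Build a per-sample boolean mask: True for real tokens, False for padding."""
    if max_len is None:
        max_len = max(text_lens)
    return [[position < length for position in range(max_len)] for length in text_lens]


# Two samples padded to 4 positions, with 3 and 1 real tokens respectively.
mask = lengths_to_mask([3, 1], max_len=4)
```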

View File

@@ -662,8 +662,18 @@ class FluxTransformer2DModel(
from the embeddings of input conditions.
timestep (`torch.LongTensor`):
Used to indicate denoising step.
block_controlnet_hidden_states: (`list` of `torch.Tensor`):
img_ids (`torch.Tensor`):
Image position ids used to compute the rotary positional embeddings.
txt_ids (`torch.Tensor`):
Text position ids used to compute the rotary positional embeddings.
guidance (`torch.Tensor`, *optional*):
Guidance scale embedding used for guidance-distilled variants of the model.
controlnet_block_samples (`list` of `torch.Tensor`, *optional*):
A list of tensors that if specified are added to the residuals of transformer blocks.
controlnet_single_block_samples (`list` of `torch.Tensor`, *optional*):
A list of tensors that if specified are added to the residuals of single transformer blocks.
controlnet_blocks_repeat (`bool`, *optional*, defaults to `False`):
Whether to repeat the controlnet block samples across all transformer blocks.
joint_attention_kwargs (`dict`, *optional*):
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
`self.processor` in

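The `controlnet_block_samples` and `controlnet_blocks_repeat` arguments describe how a short list of ControlNet residuals is matched to a longer list of transformer blocks. The sketch below shows one plausible index mapping; it is an assumption for illustration, since the hunk does not include the loop itself, and `controlnet_sample_index` is a hypothetical helper.

```python
import math


def controlnet_sample_index(block_index, num_blocks, num_samples, repeat=False):
    """Pick which ControlNet sample is added to the residual of a given block."""
    if repeat:
        # Cycle through the samples: 0, 1, ..., n - 1, 0, 1, ...
        return block_index % num_samples
    # Otherwise spread the samples over contiguous groups of blocks.
    interval = math.ceil(num_blocks / num_samples)
    return block_index // interval


# Six transformer blocks sharing two ControlNet samples.
spread = [controlnet_sample_index(i, 6, 2) for i in range(6)]
cycled = [controlnet_sample_index(i, 6, 2, repeat=True) for i in range(6)]
```

With `repeat=False` each sample covers a contiguous run of blocks; with `repeat=True` the samples alternate block by block.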
View File

@@ -1201,6 +1201,12 @@ class Flux2Transformer2DModel(
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
timestep (`torch.LongTensor`):
Used to indicate denoising step.
img_ids (`torch.Tensor`):
Image position ids used to compute the rotary positional embeddings.
txt_ids (`torch.Tensor`):
Text position ids used to compute the rotary positional embeddings.
guidance (`torch.Tensor`, *optional*):
Guidance scale embedding used for guidance-distilled variants of the model.
joint_attention_kwargs (`dict`, *optional*):
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
`self.processor` in

View File

@@ -609,6 +609,42 @@ class GlmImageTransformer2DModel(ModelMixin, ConfigMixin, PeftAdapterMixin, Cach
kv_caches: GlmImageKVCache | None = None,
image_rotary_emb: tuple[torch.Tensor, torch.Tensor] | list[tuple[torch.Tensor, torch.Tensor]] | None = None,
) -> tuple[torch.Tensor] | Transformer2DModelOutput:
"""
The [`GlmImageTransformer2DModel`] forward method.
Args:
hidden_states (`torch.Tensor` of shape `(batch_size, in_channels, height, width)`):
Input `hidden_states`.
encoder_hidden_states (`torch.Tensor` of shape `(batch_size, sequence_len, embed_dims)`):
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
prior_token_id (`torch.Tensor`):
Token ids for the prior embedding lookup.
prior_token_drop (`torch.Tensor`):
Boolean mask indicating which prior embeddings should be dropped (zeroed out).
timestep (`torch.LongTensor`):
Used to indicate denoising step.
target_size (`torch.Tensor`):
Target image size conditioning.
crop_coords (`torch.Tensor`):
Crop coordinates conditioning.
attention_kwargs (`dict`, *optional*):
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
`self.processor` in
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
tuple.
attention_mask (`torch.Tensor`, *optional*):
Mask applied to attention scores.
kv_caches (`GlmImageKVCache`, *optional*):
Pre-computed key/value caches used to speed up inference.
image_rotary_emb (`tuple` of `torch.Tensor`, *optional*):
Pre-computed rotary positional embeddings.
Returns:
If `return_dict` is True, an [`~models.transformer_2d.Transformer2DModelOutput`] is returned, otherwise a
`tuple` where the first element is the sample tensor.
"""
batch_size, num_channels, height, width = hidden_states.shape
# 1. RoPE

View File

@@ -671,6 +671,42 @@ class HeliosTransformer3DModel(
return_dict: bool = True,
attention_kwargs: dict[str, Any] | None = None,
) -> torch.Tensor | dict[str, torch.Tensor]:
"""
The [`HeliosTransformer3DModel`] forward method.
Args:
hidden_states (`torch.Tensor` of shape `(batch_size, num_channels, num_frames, height, width)`):
Input `hidden_states`.
timestep (`torch.LongTensor`):
Used to indicate denoising step.
encoder_hidden_states (`torch.Tensor` of shape `(batch_size, sequence_len, embed_dims)`):
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
indices_hidden_states (`torch.Tensor`, *optional*):
Frame indices for `hidden_states` used to compute the rotary positional embeddings.
indices_latents_history_short (`torch.Tensor`, *optional*):
Frame indices for the short history latents.
indices_latents_history_mid (`torch.Tensor`, *optional*):
Frame indices for the mid history latents.
indices_latents_history_long (`torch.Tensor`, *optional*):
Frame indices for the long history latents.
latents_history_short (`torch.Tensor`, *optional*):
Short history latents conditioning.
latents_history_mid (`torch.Tensor`, *optional*):
Mid history latents conditioning.
latents_history_long (`torch.Tensor`, *optional*):
Long history latents conditioning.
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
tuple.
attention_kwargs (`dict`, *optional*):
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
`self.processor` in
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
Returns:
If `return_dict` is True, an [`~models.transformer_2d.Transformer2DModelOutput`] is returned, otherwise a
`tuple` where the first element is the sample tensor.
"""
# 1. Input
batch_size = hidden_states.shape[0]
p_t, p_h, p_w = self.config.patch_size

View File

@@ -788,6 +788,38 @@ class HiDreamImageTransformer2DModel(ModelMixin, ConfigMixin, PeftAdapterMixin,
return_dict: bool = True,
**kwargs,
) -> tuple[torch.Tensor] | Transformer2DModelOutput:
"""
The [`HiDreamImageTransformer2DModel`] forward method.
Args:
hidden_states (`torch.Tensor` of shape `(batch_size, in_channels, height, width)` or `(batch_size, patch_height * patch_width, patch_size * patch_size * channels)`):
Input `hidden_states`.
timesteps (`torch.LongTensor`):
Used to indicate denoising step.
encoder_hidden_states_t5 (`torch.Tensor`):
Conditional embeddings computed from the T5 text encoder.
encoder_hidden_states_llama3 (`torch.Tensor`):
Conditional embeddings computed from the Llama3 text encoder.
pooled_embeds (`torch.Tensor`):
Pooled text embeddings used for additional conditioning.
img_ids (`torch.Tensor`, *optional*):
Image position ids for the patched hidden states.
img_sizes (`list` of `tuple` of `int`, *optional*):
Per-sample patch grid sizes used to unpatchify the output.
hidden_states_masks (`torch.Tensor`, *optional*):
Mask over patched `hidden_states`.
attention_kwargs (`dict`, *optional*):
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
`self.processor` in
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
tuple.
Returns:
If `return_dict` is True, an [`~models.transformer_2d.Transformer2DModelOutput`] is returned, otherwise a
`tuple` where the first element is the sample tensor.
"""
encoder_hidden_states = kwargs.get("encoder_hidden_states", None)
if encoder_hidden_states is not None:

View File

@@ -1003,6 +1003,34 @@ class HunyuanVideoTransformer3DModel(
attention_kwargs: dict[str, Any] | None = None,
return_dict: bool = True,
) -> tuple[torch.Tensor] | Transformer2DModelOutput:
"""
The [`HunyuanVideoTransformer3DModel`] forward method.
Args:
hidden_states (`torch.Tensor` of shape `(batch_size, num_channels, num_frames, height, width)`):
Input `hidden_states`.
timestep (`torch.LongTensor`):
Used to indicate denoising step.
encoder_hidden_states (`torch.Tensor` of shape `(batch_size, sequence_len, embed_dims)`):
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
encoder_attention_mask (`torch.Tensor`):
Mask applied to `encoder_hidden_states` during attention.
pooled_projections (`torch.Tensor` of shape `(batch_size, projection_dim)`):
Embeddings projected from the embeddings of input conditions.
guidance (`torch.Tensor`, *optional*):
Guidance scale embedding used for guidance-distilled variants of the model.
attention_kwargs (`dict`, *optional*):
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
`self.processor` in
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
tuple.
Returns:
If `return_dict` is True, an [`~models.transformer_2d.Transformer2DModelOutput`] is returned, otherwise a
`tuple` where the first element is the sample tensor.
"""
batch_size, num_channels, num_frames, height, width = hidden_states.shape
p, p_t = self.config.patch_size, self.config.patch_size_t
post_patch_num_frames = num_frames // p_t

View File

@@ -634,6 +634,38 @@ class HunyuanVideo15Transformer3DModel(
attention_kwargs: dict[str, Any] | None = None,
return_dict: bool = True,
) -> tuple[torch.Tensor] | Transformer2DModelOutput:
"""
The [`HunyuanVideo15Transformer3DModel`] forward method.
Args:
hidden_states (`torch.Tensor` of shape `(batch_size, num_channels, num_frames, height, width)`):
Input `hidden_states`.
timestep (`torch.LongTensor`):
Used to indicate denoising step.
encoder_hidden_states (`torch.Tensor` of shape `(batch_size, sequence_len, embed_dims)`):
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
encoder_attention_mask (`torch.Tensor`):
Mask applied to `encoder_hidden_states` during attention.
timestep_r (`torch.LongTensor`, *optional*):
Refiner timestep conditioning.
encoder_hidden_states_2 (`torch.Tensor`, *optional*):
Additional conditional embeddings computed from a second text encoder (ByT5).
encoder_attention_mask_2 (`torch.Tensor`, *optional*):
Mask applied to `encoder_hidden_states_2` during attention.
image_embeds (`torch.Tensor`, *optional*):
Image embeddings for image-conditioned generation.
attention_kwargs (`dict`, *optional*):
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
`self.processor` in
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
tuple.
Returns:
If `return_dict` is True, an [`~models.transformer_2d.Transformer2DModelOutput`] is returned, otherwise a
`tuple` where the first element is the sample tensor.
"""
batch_size, num_channels, num_frames, height, width = hidden_states.shape
p_t, p_h, p_w = self.config.patch_size_t, self.config.patch_size, self.config.patch_size
post_patch_num_frames = num_frames // p_t

View File

@@ -218,6 +218,50 @@ class HunyuanVideoFramepackTransformer3DModel(
attention_kwargs: dict[str, Any] | None = None,
return_dict: bool = True,
) -> tuple[torch.Tensor] | Transformer2DModelOutput:
"""
The [`HunyuanVideoFramepackTransformer3DModel`] forward method.
Args:
hidden_states (`torch.Tensor` of shape `(batch_size, num_channels, num_frames, height, width)`):
Input `hidden_states`.
timestep (`torch.LongTensor`):
Used to indicate denoising step.
encoder_hidden_states (`torch.Tensor` of shape `(batch_size, sequence_len, embed_dims)`):
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
encoder_attention_mask (`torch.Tensor`):
Mask applied to `encoder_hidden_states` during attention.
pooled_projections (`torch.Tensor` of shape `(batch_size, projection_dim)`):
Embeddings projected from the embeddings of input conditions.
image_embeds (`torch.Tensor`):
Image embeddings for image-conditioned generation.
indices_latents (`torch.Tensor`):
Frame indices for `hidden_states` used to compute the rotary positional embeddings.
guidance (`torch.Tensor`, *optional*):
Guidance scale embedding used for guidance-distilled variants of the model.
latents_clean (`torch.Tensor`, *optional*):
Clean (denoised) history latents conditioning.
indices_latents_clean (`torch.Tensor`, *optional*):
Frame indices for `latents_clean`.
latents_history_2x (`torch.Tensor`, *optional*):
2x downsampled history latents conditioning.
indices_latents_history_2x (`torch.Tensor`, *optional*):
Frame indices for `latents_history_2x`.
latents_history_4x (`torch.Tensor`, *optional*):
4x downsampled history latents conditioning.
indices_latents_history_4x (`torch.Tensor`, *optional*):
Frame indices for `latents_history_4x`.
attention_kwargs (`dict`, *optional*):
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
`self.processor` in
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
tuple.
Returns:
If `return_dict` is True, an [`~models.transformer_2d.Transformer2DModelOutput`] is returned, otherwise a
`tuple` where the first element is the sample tensor.
"""
batch_size, num_channels, num_frames, height, width = hidden_states.shape
p, p_t = self.config.patch_size, self.config.patch_size_t
post_patch_num_frames = num_frames // p_t

View File

@@ -754,6 +754,38 @@ class HunyuanImageTransformer2DModel(
attention_kwargs: dict[str, Any] | None = None,
return_dict: bool = True,
) -> torch.Tensor | dict[str, torch.Tensor]:
"""
The [`HunyuanImageTransformer2DModel`] forward method.
Args:
hidden_states (`torch.Tensor` of shape `(batch_size, num_channels, height, width)` or `(batch_size, num_channels, num_frames, height, width)`):
Input `hidden_states`.
timestep (`torch.LongTensor`):
Used to indicate denoising step.
encoder_hidden_states (`torch.Tensor` of shape `(batch_size, sequence_len, embed_dims)`):
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
encoder_attention_mask (`torch.Tensor`):
Mask applied to `encoder_hidden_states` during attention.
timestep_r (`torch.LongTensor`, *optional*):
Refiner timestep conditioning.
encoder_hidden_states_2 (`torch.Tensor`, *optional*):
Additional conditional embeddings computed from a second text encoder.
encoder_attention_mask_2 (`torch.Tensor`, *optional*):
Mask applied to `encoder_hidden_states_2` during attention.
guidance (`torch.Tensor`, *optional*):
Guidance scale embedding used for guidance-distilled variants of the model.
attention_kwargs (`dict`, *optional*):
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
`self.processor` in
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
tuple.
Returns:
If `return_dict` is True, an [`~models.transformer_2d.Transformer2DModelOutput`] is returned, otherwise a
`tuple` where the first element is the sample tensor.
"""
if hidden_states.ndim == 4:
    batch_size, channels, height, width = hidden_states.shape
    sizes = (height, width)

View File

@@ -526,6 +526,20 @@ class JoyImageEditTransformer3DModel(ModelMixin, ConfigMixin, AttentionMixin):
encoder_hidden_states: torch.Tensor = None,
return_dict: bool = True,
):
"""
The [`JoyImageEditTransformer3DModel`] forward method.
Args:
hidden_states (`torch.Tensor` of shape `(batch_size, num_channels, num_frames, height, width)` or `(batch_size, num_items, num_channels, num_frames, height, width)`):
Input `hidden_states`.
timestep (`torch.LongTensor`):
Used to indicate denoising step.
encoder_hidden_states (`torch.Tensor`, *optional*):
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
tuple.
"""
# handle multi-item input (b, n, c, t, h, w)
is_multi_item = hidden_states.ndim == 6
num_items = 0

View File

@@ -545,6 +545,25 @@ class LongCatAudioDiTTransformer(ModelMixin, ConfigMixin):
latent_cond: torch.Tensor | None = None,
return_dict: bool = True,
) -> LongCatAudioDiTTransformerOutput | tuple[torch.Tensor]:
"""
The [`LongCatAudioDiTTransformer`] forward method.
Args:
hidden_states (`torch.Tensor` of shape `(batch_size, sequence_length, in_channels)`):
Input `hidden_states`.
encoder_hidden_states (`torch.Tensor` of shape `(batch_size, sequence_len, embed_dims)`):
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
encoder_attention_mask (`torch.BoolTensor`):
Mask applied to `encoder_hidden_states` during attention.
timestep (`torch.LongTensor`):
Used to indicate denoising step.
attention_mask (`torch.BoolTensor`, *optional*):
Mask applied to `hidden_states` during self-attention.
latent_cond (`torch.Tensor`, *optional*):
Latent conditioning concatenated to `hidden_states`.
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`LongCatAudioDiTTransformerOutput`] instead of a plain tuple.
"""
dtype = hidden_states.dtype
encoder_hidden_states = encoder_hidden_states.to(dtype)
timestep = timestep.to(dtype)

View File

@@ -483,8 +483,12 @@ class LongCatImageTransformer2DModel(
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
timestep (`torch.LongTensor`):
Used to indicate denoising step.
block_controlnet_hidden_states: (`list` of `torch.Tensor`):
A list of tensors that if specified are added to the residuals of transformer blocks.
img_ids (`torch.Tensor`):
Image position ids used to compute the rotary positional embeddings.
txt_ids (`torch.Tensor`):
Text position ids used to compute the rotary positional embeddings.
guidance (`torch.Tensor`, *optional*):
Guidance scale embedding used for guidance-distilled variants of the model.
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
tuple.

View File

@@ -506,6 +506,36 @@ class LTXVideoTransformer3DModel(
attention_kwargs: dict[str, Any] | None = None,
return_dict: bool = True,
) -> torch.Tensor:
"""
The [`LTXVideoTransformer3DModel`] forward method.
Args:
hidden_states (`torch.Tensor` of shape `(batch_size, sequence_length, in_channels)`):
Input `hidden_states`.
encoder_hidden_states (`torch.Tensor` of shape `(batch_size, sequence_len, embed_dims)`):
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
timestep (`torch.LongTensor`):
Used to indicate denoising step.
encoder_attention_mask (`torch.Tensor`):
Mask applied to `encoder_hidden_states` during attention.
num_frames (`int`, *optional*):
Number of frames in the video used to compute the rotary positional embeddings.
height (`int`, *optional*):
Height of the latent used to compute the rotary positional embeddings.
width (`int`, *optional*):
Width of the latent used to compute the rotary positional embeddings.
rope_interpolation_scale (`tuple` of `float` or `torch.Tensor`, *optional*):
Interpolation scale used by the rotary positional embeddings.
video_coords (`torch.Tensor`, *optional*):
Pre-computed video coordinates used by the rotary positional embeddings.
attention_kwargs (`dict`, *optional*):
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
`self.processor` in
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
tuple.
"""
image_rotary_emb = self.rope(hidden_states, num_frames, height, width, rope_interpolation_scale, video_coords)
# convert encoder_attention_mask to a bias the same way we do for attention_mask

View File

@@ -465,6 +465,30 @@ class Lumina2Transformer2DModel(ModelMixin, ConfigMixin, PeftAdapterMixin, FromO
attention_kwargs: dict[str, Any] | None = None,
return_dict: bool = True,
) -> torch.Tensor | Transformer2DModelOutput:
"""
The [`Lumina2Transformer2DModel`] forward method.
Args:
hidden_states (`torch.Tensor` of shape `(batch_size, in_channels, height, width)`):
Input `hidden_states`.
timestep (`torch.LongTensor`):
Used to indicate denoising step.
encoder_hidden_states (`torch.Tensor` of shape `(batch_size, sequence_len, embed_dims)`):
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
encoder_attention_mask (`torch.Tensor`):
Mask applied to `encoder_hidden_states` during attention.
attention_kwargs (`dict`, *optional*):
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
`self.processor` in
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
tuple.
Returns:
If `return_dict` is True, an [`~models.transformer_2d.Transformer2DModelOutput`] is returned, otherwise a
`tuple` where the first element is the sample tensor.
"""
# 1. Condition, positional & patch embedding
batch_size, _, height, width = hidden_states.shape
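
Reviewer note: the `return_dict` contract documented in the `Returns:` section above can be illustrated with a framework-free sketch. The names `SampleOutput` and `forward_like` are hypothetical stand-ins introduced here for illustration; they are not part of diffusers, and a plain list stands in for the sample tensor.

```python
from dataclasses import dataclass


@dataclass
class SampleOutput:
    # Hypothetical stand-in for an output class with a single `sample` field.
    sample: list


def forward_like(sample, return_dict=True):
    # Mirror the documented contract: a one-element tuple when
    # return_dict is False, an output object otherwise.
    if not return_dict:
        return (sample,)
    return SampleOutput(sample=sample)


as_object = forward_like([1.0, 2.0])
as_tuple = forward_like([1.0, 2.0], return_dict=False)
```

In both cases the caller reaches the same data, either as `as_object.sample` or as `as_tuple[0]`.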

View File

@@ -414,6 +414,26 @@ class MochiTransformer3DModel(ModelMixin, ConfigMixin, PeftAdapterMixin, FromOri
attention_kwargs: dict[str, Any] | None = None,
return_dict: bool = True,
) -> torch.Tensor:
"""
The [`MochiTransformer3DModel`] forward method.
Args:
hidden_states (`torch.Tensor` of shape `(batch_size, num_channels, num_frames, height, width)`):
Input `hidden_states`.
encoder_hidden_states (`torch.Tensor` of shape `(batch_size, sequence_len, embed_dims)`):
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
timestep (`torch.LongTensor`):
Used to indicate denoising step.
encoder_attention_mask (`torch.Tensor`):
Mask applied to `encoder_hidden_states` during attention.
attention_kwargs (`dict`, *optional*):
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
`self.processor` in
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
tuple.
"""
batch_size, num_channels, num_frames, height, width = hidden_states.shape
p = self.config.patch_size

View File

@@ -415,6 +415,29 @@ class OmniGenTransformer2DModel(ModelMixin, ConfigMixin):
position_ids: torch.Tensor,
return_dict: bool = True,
) -> Transformer2DModelOutput | tuple[torch.Tensor]:
"""
The [`OmniGenTransformer2DModel`] forward method.
Args:
hidden_states (`torch.Tensor` of shape `(batch_size, in_channels, height, width)`):
Input `hidden_states`.
timestep (`torch.LongTensor`):
Used to indicate denoising step.
input_ids (`torch.Tensor`):
Multimodal text token ids used as conditioning.
input_img_latents (`list` of `torch.Tensor`):
List of latents for input images used as conditioning.
input_image_sizes (`dict` of `int` to `list` of `int`):
Mapping from sample index to the positions where input image embeddings should be placed in the
conditioning sequence.
attention_mask (`torch.Tensor`):
Attention mask for the joint multimodal sequence.
position_ids (`torch.Tensor`):
Position ids used to compute the positional embeddings.
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
tuple.
"""
batch_size, num_channels, height, width = hidden_states.shape
p = self.config.patch_size
post_patch_height, post_patch_width = height // p, width // p

View File

@@ -868,6 +868,8 @@ class QwenImageTransformer2DModel(
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
controlnet_block_samples (*optional*):
ControlNet block samples to add to the transformer blocks.
additional_t_cond (`torch.Tensor`, *optional*):
Additional timestep conditioning added to the timestep embedding.
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
tuple.

View File

@@ -583,6 +583,36 @@ class SanaVideoTransformer3DModel(ModelMixin, ConfigMixin, PeftAdapterMixin, Fro
controlnet_block_samples: tuple[torch.Tensor] | None = None,
return_dict: bool = True,
) -> tuple[torch.Tensor, ...] | Transformer2DModelOutput:
"""
The [`SanaVideoTransformer3DModel`] forward method.
Args:
hidden_states (`torch.Tensor` of shape `(batch_size, in_channels, num_frames, height, width)`):
Input `hidden_states`.
encoder_hidden_states (`torch.Tensor` of shape `(batch_size, sequence_len, embed_dims)`):
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
timestep (`torch.LongTensor`):
Used to indicate denoising step.
guidance (`torch.Tensor`, *optional*):
Guidance scale embedding.
encoder_attention_mask (`torch.Tensor`, *optional*):
Cross-attention mask applied to `encoder_hidden_states`.
attention_mask (`torch.Tensor`, *optional*):
Self-attention mask applied to `hidden_states`.
attention_kwargs (`dict`, *optional*):
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
`self.processor` in
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
controlnet_block_samples (`tuple` of `torch.Tensor`, *optional*):
A tuple of tensors that, if specified, are added to the residuals of the transformer blocks.
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
tuple.
Returns:
If `return_dict` is True, an [`~models.transformer_2d.Transformer2DModelOutput`] is returned, otherwise a
`tuple` where the first element is the sample tensor.
"""
# ensure attention_mask is a bias, and give it a singleton query_tokens dimension.
# we may have done this conversion already, e.g. if we came here via UNet2DConditionModel#forward.
# we can tell by counting dims; if ndim == 2: it's a mask rather than a bias.
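
Reviewer note: the mask-to-bias conversion described in the comments above can be sketched without any tensor library. This is a minimal illustration using nested lists; the function name `mask_to_bias` and the constant `-10000.0` are assumptions made for this sketch, not values confirmed by the hunk shown here.

```python
def mask_to_bias(mask, negative=-10000.0):
    """Turn a 2D keep/discard mask into an additive attention bias.

    mask: list of rows, shape (batch, key_tokens), 1 = keep, 0 = discard.
    Returns nested lists of shape (batch, 1, key_tokens); the middle
    singleton axis plays the role of the query_tokens dimension.
    """
    bias = []
    for row in mask:
        # Kept positions add 0.0 to the attention scores; discarded
        # positions add a large negative value so softmax suppresses them.
        bias_row = [0.0 if keep else negative for keep in row]
        bias.append([bias_row])
    return bias


bias = mask_to_bias([[1, 1, 0], [1, 0, 0]])
```

A 2D input is treated as a mask and converted; an input that already has three dimensions would be assumed to be a bias and left untouched, which is the dimension-counting check the comments describe.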

View File

@@ -642,6 +642,34 @@ class SkyReelsV2Transformer3DModel(
return_dict: bool = True,
attention_kwargs: dict[str, Any] | None = None,
) -> torch.Tensor | dict[str, torch.Tensor]:
"""
The [`SkyReelsV2Transformer3DModel`] forward method.
Args:
hidden_states (`torch.Tensor` of shape `(batch_size, num_channels, num_frames, height, width)`):
Input `hidden_states`.
timestep (`torch.LongTensor`):
Used to indicate denoising step.
encoder_hidden_states (`torch.Tensor` of shape `(batch_size, sequence_len, embed_dims)`):
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
encoder_hidden_states_image (`torch.Tensor`, *optional*):
Conditional image embeddings for image-conditioned generation.
enable_diffusion_forcing (`bool`, *optional*, defaults to `False`):
Whether to enable diffusion forcing (per-block causal masking).
fps (`torch.Tensor`, *optional*):
FPS conditioning embedding.
return_dict (`bool`, *optional*, defaults to `True`):
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
tuple.
attention_kwargs (`dict`, *optional*):
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
`self.processor` in
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
Returns:
If `return_dict` is True, an [`~models.transformer_2d.Transformer2DModelOutput`] is returned, otherwise a
`tuple` where the first element is the sample tensor.
"""
batch_size, num_channels, num_frames, height, width = hidden_states.shape
p_t, p_h, p_w = self.config.patch_size
post_patch_num_frames = num_frames // p_t
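
Reviewer note: the post-patch arithmetic that starts on the last line above can be sketched as follows. Only the frame-axis division is visible in this hunk; applying the same floor division to height and width, and the helper name `post_patch_grid`, are assumptions made for illustration.

```python
def post_patch_grid(num_frames, height, width, patch_size):
    """Return the (frames, height, width) grid after patchification."""
    p_t, p_h, p_w = patch_size
    # Floor division per axis, as in `post_patch_num_frames = num_frames // p_t`.
    return num_frames // p_t, height // p_h, width // p_w


grid = post_patch_grid(16, 64, 64, (1, 2, 2))
# Number of tokens the transformer blocks would see for this grid.
sequence_length = grid[0] * grid[1] * grid[2]
```

The product of the three grid sizes gives the token sequence length, which is why the docstring documents `hidden_states` as a 5D tensor while the blocks operate on a flattened sequence.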

Some files were not shown because too many files have changed in this diff