mirror of
https://github.com/huggingface/diffusers.git
synced 2026-05-28 00:39:35 +08:00
Merge branch 'main' into group-offloading-pytest
Some checks failed
Secret Leaks / trufflehog (push) Has been cancelled
@@ -172,3 +172,5 @@ Boolean gate. If `False` (default), calling that method raises `ValueError`. All
freqs_dtype = torch.float32 if (is_mps or is_npu) else torch.float64
```

See `transformer_flux.py`, `transformer_flux2.py`, `transformer_wan.py`, `unet_2d_condition.py` for reference usages. Never leave an unconditional `torch.float64` in the model.

6. **Using `torch.empty`.** Do not use `torch.empty` to initialize parameters, because it returns uninitialized memory. Use `torch.zeros` or `torch.ones` instead.
@@ -60,3 +60,7 @@ When adding a new pipeline (or reviewing one), skim `pipeline_flux.py`, `pipelin
4. **Subclassing an existing pipeline for a variant.** Don't subclass an existing pipeline class (e.g. `FluxPipeline`) to build another (e.g. `FluxImg2ImgPipeline`) inside the core `src/` codebase. Each pipeline lives in its own file with its own class, even if it shares 90% of `__call__` with a sibling. The convention across diffusers (flux, sdxl, wan, qwenimage) is to duplicate `__call__` between the img2img / text2img / inpaint variants rather than subclass. Reuse private utilities (shared schedulers, prep functions) but not the pipeline class itself.

5. **Copying a method from another pipeline without `# Copied from`.** When you reuse a method like `encode_prompt`, `prepare_latents`, `check_inputs`, or `_prepare_latent_image_ids` from another pipeline, add a `# Copied from` annotation so `make fix-copies` keeps the two in sync. Without it, future refactors to the source silently drift away from your copy, and reviewers waste time spotting near-identical code that should have been linked. The annotation grammar (decorator placement, rename syntax with `with old->new`, etc.) is implemented in [`utils/check_copies.py`](../utils/check_copies.py); read it for the exact rules.

6. **Be deliberate about methods on the pipeline.** `__call__` is the user's mental model. The methods on the class are how they navigate it. Diffusers convention (flux, sdxl, wan, qwenimage) is a flat class body of public lifecycle methods (`__init__`, `check_inputs`, `encode_prompt`, `prepare_latents`, `__call__`). Two principles, not strict rules; use judgment:
   - **If a method is called from `__call__`, and it's a step in the pipeline lifecycle, make it public.** Each call from `__call__` should correspond to a step a user can identify: either a standard one (`encode_prompt`, `prepare_latents`, `set_timesteps`, …) or a pipeline-specific one (`prepare_src_latents`, `prepare_reference_audio_latents`, …). Don't gate these behind a `_`; they're part of the pipeline's API surface alongside their standard siblings.
   - **If a method is only used by another method, make it private (`_foo`) or lift it to a module-level function, and keep the count down.** Before adding one, see if the logic can be absorbed into its caller. Unless you expect the helper to be reused by another method (or another task pipeline), absorbing is usually the better call, especially when the body is small. Avoid a pipeline class littered with private helpers that bury the lifecycle.
11
.github/dependabot.yml
vendored
Normal file
@@ -0,0 +1,11 @@
version: 2
updates:
  - package-ecosystem: "github-actions"
    directory: "/"
    schedule:
      interval: "weekly"
    cooldown:
      default-days: 7
    groups:
      actions:
        patterns: ["*"]
2
.github/workflows/benchmark.yml
vendored
@@ -45,7 +45,7 @@ jobs:
     uv pip install -r benchmarks/requirements.txt
 - name: Environment
   run: |
-    python utils/print_env.py
+    diffusers-cli env
 - name: Diffusers Benchmarking
   env:
     HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
33
.github/workflows/claude_review.yml
vendored
@@ -156,7 +156,6 @@ jobs:
   GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
   PR_NUMBER: ${{ github.event.issue.number || github.event.pull_request.number }}
   COMMENT_USER: ${{ github.event.comment.user.login }}
-  BASE_BRANCH: ${{ github.event.repository.default_branch }}
 run: |
   set -euo pipefail
@@ -186,11 +185,18 @@ jobs:
     exit 0
   fi

-  # For fork PRs, an earlier step redirected `origin` to a local bare
-  # repo to sandbox claude-code-action. Undo that redirect so our push
-  # reaches the real base repo. Safe: only Claude's edits within the
-  # allowed paths are committed below — never the fork's other changes.
-  git config --unset-all url."file:///tmp/local-origin.git".insteadOf 2>/dev/null || true
+  PR_INFO=$(gh pr view "$PR_NUMBER" --json headRefName,isCrossRepository)
+  PR_BRANCH=$(echo "$PR_INFO" | jq -r '.headRefName')
+  IS_FORK=$(echo "$PR_INFO" | jq -r '.isCrossRepository')
+
+  # COMMIT THIS isn't supported on fork PRs: we can't push to the
+  # fork's branch, and falling back to main almost always conflicts
+  # once the PR touches files that also moved on main. Bail early —
+  # Claude's review comment with the suggested diff still stands.
+  if [[ "$IS_FORK" == "true" ]]; then
+    post_status "ℹ️ \`COMMIT THIS\` isn't supported on fork PRs. Apply Claude's suggestions manually, or open an issue to track them. See [workflow run]($RUN_URL)."
+    exit 0
+  fi

   git config user.name "claude[bot]"
   git config user.email "41898282+github-actions[bot]@users.noreply.github.com"
@@ -208,8 +214,6 @@ jobs:
     exit 1
   fi

-  PR_BRANCH=$(gh pr view "$PR_NUMBER" --json headRefName --jq '.headRefName')
-
   if [[ "$PR_BRANCH" == claude/pr-* ]]; then
     # Source PR is already a Claude-opened PR — iterate in place by
     # committing and pushing straight to its head branch instead of
@@ -222,9 +226,14 @@ jobs:
     exit 0
   fi

-  # Otherwise: commit on the source PR's branch to get a clean SHA,
-  # then cherry-pick onto a fresh branch cut from the default branch.
-  # The follow-up PR's diff is therefore exactly Claude's edits vs. main.
+  # Target the source PR's head branch. The follow-up then applies
+  # cleanly regardless of how main has diverged, and merging it lands
+  # Claude's edits onto the PR for the maintainer to fold in.
+  BASE_BRANCH="$PR_BRANCH"
+
+  # Commit on the source PR's branch to get a clean SHA, then
+  # cherry-pick onto a fresh branch cut from BASE_BRANCH so the
+  # follow-up PR's diff is exactly Claude's edits vs. BASE_BRANCH.
   NEW_BRANCH="claude/pr-${PR_NUMBER}-$(date -u +%Y%m%d-%H%M%S)"

   git commit -m "Apply changes from Claude (requested by @${COMMENT_USER} on #${PR_NUMBER})
@@ -248,6 +257,6 @@ jobs:
     --title "Apply Claude's changes from #${PR_NUMBER}" \
     --body "Automated PR with edits Claude made in response to \`COMMIT THIS\` from @${COMMENT_USER} on [#${PR_NUMBER}](${GITHUB_SERVER_URL}/${GITHUB_REPOSITORY}/pull/${PR_NUMBER}).

-Targets \`${BASE_BRANCH}\` — independent of #${PR_NUMBER}. Further \`COMMIT THIS\` requests on *this* PR will commit directly to it.")
+Targets \`${BASE_BRANCH}\` (the head branch of #${PR_NUMBER}). Merging this brings Claude's edits into that PR.")

   post_status "✅ Opened follow-up PR (into \`${BASE_BRANCH}\`) with Claude's edits: ${NEW_PR_URL}"
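The routing that the `claude_review.yml` hunks describe (skip fork PRs, commit in place on `claude/pr-*` branches, otherwise open a follow-up PR against the source PR's head branch) can be modelled in a few lines. This is a rough sketch under assumptions, not the workflow's code: `plan_commit` and its return shape are hypothetical, the workflow has other exit paths between these checks that are not modelled, and the glob and timestamp format are copied from the shell lines shown in the diff.

```python
import fnmatch
from datetime import datetime, timezone
from typing import Dict, Optional


def plan_commit(pr_number: int, head_branch: str, is_fork: bool, now: datetime) -> Dict[str, Optional[str]]:
    """Hypothetical model of the branching in the workflow; `now` must be timezone-aware."""
    if is_fork:
        # Fork PRs bail early; the review comment with the suggested diff remains.
        return {"action": "skip", "base": None, "branch": None}
    if fnmatch.fnmatchcase(head_branch, "claude/pr-*"):
        # Already a Claude-opened PR: commit and push straight to its head branch.
        return {"action": "commit_in_place", "base": None, "branch": head_branch}
    # Otherwise open a follow-up PR that targets the source PR's head branch.
    stamp = now.astimezone(timezone.utc).strftime("%Y%m%d-%H%M%S")
    return {
        "action": "open_follow_up",
        "base": head_branch,
        "branch": f"claude/pr-{pr_number}-{stamp}",
    }
```

Because the follow-up branch name itself matches `claude/pr-*`, a later `COMMIT THIS` on that follow-up PR takes the in-place path.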
35
.github/workflows/nightly_tests.yml
vendored
@@ -19,6 +19,9 @@ env:
   PIPELINE_USAGE_CUTOFF: 0
   SLACK_API_TOKEN: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
   CONSOLIDATED_REPORT_PATH: consolidated_test_report.md
+  # Force tokenizers<0.23.0 across every `uv pip install` in this workflow,
+  # even when transformers@main declares a higher lower-bound.
+  UV_OVERRIDE: /tmp/uv-overrides.txt

 jobs:
   setup_torch_cuda_pipeline_matrix:
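The override file pins `tokenizers<0.23.0`, while other install lines in this diff use `tokenizers<=0.23.0`. The two bounds differ only at 0.23.0 itself, which the sketch below illustrates. It is a simplified stand-in for real version-specifier matching: `parse_release` and `allowed` are hypothetical helpers that handle plain dotted releases only (no pre-release, dev, or post segments).

```python
from typing import Tuple


def parse_release(version: str, width: int = 3) -> Tuple[int, ...]:
    # Pad with zeros so "0.23" and "0.23.0" compare as equal.
    parts = [int(part) for part in version.split(".")]
    return tuple(parts + [0] * (width - len(parts)))


def allowed(version: str, op: str, bound: str) -> bool:
    """Return True if `version` satisfies `op bound` for the two operators used in the diff."""
    v, b = parse_release(version), parse_release(bound)
    if op == "<":
        return v < b
    if op == "<=":
        return v <= b
    raise ValueError(f"unsupported operator: {op}")
```

Under the strict bound, `allowed("0.23.0", "<", "0.23.0")` is false, whereas the inclusive bound accepts that release.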
@@ -74,14 +77,14 @@ jobs:
|
||||
run: nvidia-smi
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
|
||||
uv pip install -e ".[quality]"
|
||||
uv pip uninstall transformers huggingface_hub && UV_PRERELEASE=allow uv pip install -U transformers@git+https://github.com/huggingface/transformers.git
|
||||
uv pip uninstall tokenizers && uv pip install "tokenizers<=0.23.0"
|
||||
uv pip uninstall accelerate && uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git
|
||||
uv pip install pytest-reportlog
|
||||
- name: Environment
|
||||
run: |
|
||||
python utils/print_env.py
|
||||
diffusers-cli env
|
||||
- name: Pipeline CUDA Test
|
||||
env:
|
||||
HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
|
||||
@@ -128,14 +131,14 @@ jobs:
 - name: Install dependencies
   run: |
+    echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
     uv pip install -e ".[quality]"
     uv pip uninstall transformers huggingface_hub && UV_PRERELEASE=allow uv pip install -U transformers@git+https://github.com/huggingface/transformers.git
-    uv pip uninstall tokenizers && uv pip install "tokenizers<=0.23.0"
     uv pip install peft@git+https://github.com/huggingface/peft.git
     uv pip uninstall accelerate && uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git
     uv pip install pytest-reportlog
 - name: Environment
-  run: python utils/print_env.py
+  run: diffusers-cli env

 - name: Run nightly PyTorch CUDA tests for non-pipeline modules
   if: ${{ matrix.module != 'examples'}}
@@ -196,12 +199,12 @@ jobs:
|
||||
nvidia-smi
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
|
||||
uv pip install -e ".[quality,training]"
|
||||
uv pip uninstall transformers huggingface_hub && UV_PRERELEASE=allow uv pip install -U transformers@git+https://github.com/huggingface/transformers.git
|
||||
uv pip uninstall tokenizers && uv pip install "tokenizers<=0.23.0"
|
||||
- name: Environment
|
||||
run: |
|
||||
python utils/print_env.py
|
||||
diffusers-cli env
|
||||
- name: Run torch compile tests on GPU
|
||||
env:
|
||||
HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
|
||||
@@ -238,15 +241,15 @@ jobs:
|
||||
run: nvidia-smi
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
|
||||
uv pip install -e ".[quality]"
|
||||
uv pip uninstall transformers huggingface_hub && UV_PRERELEASE=allow uv pip install -U transformers@git+https://github.com/huggingface/transformers.git
|
||||
uv pip uninstall tokenizers && uv pip install "tokenizers<=0.23.0"
|
||||
uv pip install peft@git+https://github.com/huggingface/peft.git
|
||||
uv pip uninstall accelerate && uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git
|
||||
uv pip install pytest-reportlog
|
||||
- name: Environment
|
||||
run: |
|
||||
python utils/print_env.py
|
||||
diffusers-cli env
|
||||
- name: Selected Torch CUDA Test on big GPU
|
||||
env:
|
||||
HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
|
||||
@@ -289,15 +292,15 @@ jobs:
|
||||
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
|
||||
uv pip install -e ".[quality]"
|
||||
uv pip uninstall transformers huggingface_hub && UV_PRERELEASE=allow uv pip install -U transformers@git+https://github.com/huggingface/transformers.git
|
||||
uv pip uninstall tokenizers && uv pip install "tokenizers<=0.23.0"
|
||||
uv pip install peft@git+https://github.com/huggingface/peft.git
|
||||
uv pip uninstall accelerate && uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git
|
||||
|
||||
- name: Environment
|
||||
run: |
|
||||
python utils/print_env.py
|
||||
diffusers-cli env
|
||||
|
||||
- name: Run PyTorch CUDA tests
|
||||
env:
|
||||
@@ -365,6 +368,7 @@ jobs:
|
||||
run: nvidia-smi
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
|
||||
uv pip install -e ".[quality]"
|
||||
uv pip install -U ${{ matrix.config.backend }}
|
||||
if [ "${{ join(matrix.config.additional_deps, ' ') }}" != "" ]; then
|
||||
@@ -372,10 +376,9 @@ jobs:
|
||||
fi
|
||||
uv pip install pytest-reportlog
|
||||
uv pip uninstall transformers huggingface_hub && UV_PRERELEASE=allow uv pip install -U transformers@git+https://github.com/huggingface/transformers.git
|
||||
uv pip uninstall tokenizers && uv pip install "tokenizers<=0.23.0"
|
||||
- name: Environment
|
||||
run: |
|
||||
python utils/print_env.py
|
||||
diffusers-cli env
|
||||
- name: ${{ matrix.config.backend }} quantization tests on GPU
|
||||
env:
|
||||
HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
|
||||
@@ -418,14 +421,14 @@ jobs:
|
||||
run: nvidia-smi
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
|
||||
uv pip install -e ".[quality]"
|
||||
uv pip install -U bitsandbytes optimum_quanto
|
||||
uv pip uninstall transformers huggingface_hub && UV_PRERELEASE=allow uv pip install -U transformers@git+https://github.com/huggingface/transformers.git
|
||||
uv pip uninstall tokenizers && uv pip install "tokenizers<=0.23.0"
|
||||
uv pip install pytest-reportlog
|
||||
- name: Environment
|
||||
run: |
|
||||
python utils/print_env.py
|
||||
diffusers-cli env
|
||||
- name: Pipeline-level quantization tests on GPU
|
||||
env:
|
||||
HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
|
||||
@@ -541,7 +544,7 @@ jobs:
|
||||
# - name: Environment
|
||||
# shell: arch -arch arm64 bash {0}
|
||||
# run: |
|
||||
# ${CONDA_RUN} python utils/print_env.py
|
||||
# ${CONDA_RUN} diffusers-cli env
|
||||
# - name: Run nightly PyTorch tests on M1 (MPS)
|
||||
# shell: arch -arch arm64 bash {0}
|
||||
# env:
|
||||
@@ -597,7 +600,7 @@ jobs:
|
||||
# - name: Environment
|
||||
# shell: arch -arch arm64 bash {0}
|
||||
# run: |
|
||||
# ${CONDA_RUN} python utils/print_env.py
|
||||
# ${CONDA_RUN} diffusers-cli env
|
||||
# - name: Run nightly PyTorch tests on M1 (MPS)
|
||||
# shell: arch -arch arm64 bash {0}
|
||||
# env:
|
||||
|
||||
8
.github/workflows/pr_modular_tests.yml
vendored
@@ -34,6 +34,9 @@ env:
|
||||
OMP_NUM_THREADS: 4
|
||||
MKL_NUM_THREADS: 4
|
||||
PYTEST_TIMEOUT: 60
|
||||
# Force tokenizers<0.23.0 across every `uv pip install` in this workflow,
|
||||
# even when transformers@main declares a higher lower-bound.
|
||||
UV_OVERRIDE: /tmp/uv-overrides.txt
|
||||
|
||||
jobs:
|
||||
check_code_quality:
|
||||
@@ -73,6 +76,7 @@ jobs:
|
||||
python utils/check_copies.py
|
||||
python utils/check_dummies.py
|
||||
python utils/check_support_list.py
|
||||
python utils/check_forward_call_docstrings.py
|
||||
make deps_table_check_updated
|
||||
- name: Check if failure
|
||||
if: ${{ failure() }}
|
||||
@@ -120,14 +124,14 @@ jobs:
|
||||
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
|
||||
uv pip install -e ".[quality]"
|
||||
uv pip uninstall transformers huggingface_hub && UV_PRERELEASE=allow uv pip install -U transformers@git+https://github.com/huggingface/transformers.git
|
||||
uv pip uninstall tokenizers && uv pip install "tokenizers<=0.23.0"
|
||||
uv pip uninstall accelerate && uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git --no-deps
|
||||
|
||||
- name: Environment
|
||||
run: |
|
||||
python utils/print_env.py
|
||||
diffusers-cli env
|
||||
|
||||
- name: Run fast PyTorch Pipeline CPU tests
|
||||
run: |
|
||||
|
||||
6
.github/workflows/pr_test_fetcher.yml
vendored
@@ -39,7 +39,7 @@ jobs:
|
||||
uv pip install -e ".[quality]"
|
||||
- name: Environment
|
||||
run: |
|
||||
python utils/print_env.py
|
||||
diffusers-cli env
|
||||
echo $(git --version)
|
||||
- name: Fetch Tests
|
||||
run: |
|
||||
@@ -97,7 +97,7 @@ jobs:
|
||||
|
||||
- name: Environment
|
||||
run: |
|
||||
python utils/print_env.py
|
||||
diffusers-cli env
|
||||
|
||||
- name: Run all selected tests on CPU
|
||||
run: |
|
||||
@@ -151,7 +151,7 @@ jobs:
|
||||
|
||||
- name: Environment
|
||||
run: |
|
||||
python utils/print_env.py
|
||||
diffusers-cli env
|
||||
|
||||
- name: Run Hub tests for models, schedulers, and pipelines on a staging env
|
||||
if: ${{ matrix.config.framework == 'hub_tests_pytorch' }}
|
||||
|
||||
17
.github/workflows/pr_tests.yml
vendored
@@ -29,6 +29,9 @@ env:
|
||||
OMP_NUM_THREADS: 4
|
||||
MKL_NUM_THREADS: 4
|
||||
PYTEST_TIMEOUT: 60
|
||||
# Force tokenizers<0.23.0 across every `uv pip install` in this workflow,
|
||||
# even when transformers@main declares a higher lower-bound.
|
||||
UV_OVERRIDE: /tmp/uv-overrides.txt
|
||||
|
||||
jobs:
|
||||
check_code_quality:
|
||||
@@ -68,6 +71,7 @@ jobs:
|
||||
python utils/check_copies.py
|
||||
python utils/check_dummies.py
|
||||
python utils/check_support_list.py
|
||||
python utils/check_forward_call_docstrings.py
|
||||
make deps_table_check_updated
|
||||
- name: Check if failure
|
||||
if: ${{ failure() }}
|
||||
@@ -116,14 +120,14 @@ jobs:
|
||||
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
|
||||
uv pip install -e ".[quality]"
|
||||
uv pip uninstall transformers huggingface_hub && UV_PRERELEASE=allow uv pip install -U transformers@git+https://github.com/huggingface/transformers.git
|
||||
uv pip uninstall tokenizers && uv pip install "tokenizers<=0.23.0"
|
||||
uv pip uninstall accelerate && uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git --no-deps
|
||||
|
||||
- name: Environment
|
||||
run: |
|
||||
python utils/print_env.py
|
||||
diffusers-cli env
|
||||
|
||||
- name: Run fast PyTorch Pipeline CPU tests
|
||||
if: ${{ matrix.config.framework == 'pytorch_pipelines' }}
|
||||
@@ -193,13 +197,13 @@ jobs:
|
||||
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
|
||||
uv pip install -e ".[quality]"
|
||||
uv pip uninstall transformers huggingface_hub && UV_PRERELEASE=allow uv pip install -U transformers@git+https://github.com/huggingface/transformers.git
|
||||
uv pip uninstall tokenizers && uv pip install "tokenizers<=0.23.0"
|
||||
|
||||
- name: Environment
|
||||
run: |
|
||||
python utils/print_env.py
|
||||
diffusers-cli env
|
||||
|
||||
- name: Run Hub tests for models, schedulers, and pipelines on a staging env
|
||||
if: ${{ matrix.config.framework == 'hub_tests_pytorch' }}
|
||||
@@ -244,17 +248,16 @@ jobs:
|
||||
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
|
||||
uv pip install -e ".[quality]"
|
||||
# TODO (sayakpaul, DN6): revisit `--no-deps`
|
||||
uv pip install -U peft@git+https://github.com/huggingface/peft.git --no-deps
|
||||
uv pip install -U tokenizers
|
||||
uv pip uninstall accelerate && uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git --no-deps
|
||||
uv pip uninstall transformers huggingface_hub && UV_PRERELEASE=allow uv pip install -U transformers@git+https://github.com/huggingface/transformers.git
|
||||
uv pip uninstall tokenizers && uv pip install "tokenizers<=0.23.0"
|
||||
|
||||
- name: Environment
|
||||
run: |
|
||||
python utils/print_env.py
|
||||
diffusers-cli env
|
||||
|
||||
- name: Run fast PyTorch LoRA tests with PEFT
|
||||
run: |
|
||||
|
||||
19
.github/workflows/pr_tests_gpu.yml
vendored
@@ -30,6 +30,9 @@ env:
|
||||
HF_XET_HIGH_PERFORMANCE: 1
|
||||
PYTEST_TIMEOUT: 600
|
||||
PIPELINE_USAGE_CUTOFF: 1000000000 # set high cutoff so that only always-test pipelines run
|
||||
# Force tokenizers<0.23.0 across every `uv pip install` in this workflow,
|
||||
# even when transformers@main declares a higher lower-bound.
|
||||
UV_OVERRIDE: /tmp/uv-overrides.txt
|
||||
|
||||
jobs:
|
||||
check_code_quality:
|
||||
@@ -69,6 +72,7 @@ jobs:
|
||||
python utils/check_copies.py
|
||||
python utils/check_dummies.py
|
||||
python utils/check_support_list.py
|
||||
python utils/check_forward_call_docstrings.py
|
||||
make deps_table_check_updated
|
||||
- name: Check if failure
|
||||
if: ${{ failure() }}
|
||||
@@ -91,10 +95,11 @@ jobs:
|
||||
fetch-depth: 2
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
|
||||
uv pip install -e ".[quality]"
|
||||
- name: Environment
|
||||
run: |
|
||||
python utils/print_env.py
|
||||
diffusers-cli env
|
||||
- name: Fetch Pipeline Matrix
|
||||
id: fetch_pipeline_matrix
|
||||
run: |
|
||||
@@ -132,14 +137,14 @@ jobs:
|
||||
nvidia-smi
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
|
||||
uv pip install -e ".[quality]"
|
||||
uv pip uninstall accelerate && uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git
|
||||
uv pip uninstall transformers huggingface_hub && UV_PRERELEASE=allow uv pip install -U transformers@git+https://github.com/huggingface/transformers.git
|
||||
uv pip uninstall tokenizers && uv pip install "tokenizers<=0.23.0"
|
||||
|
||||
- name: Environment
|
||||
run: |
|
||||
python utils/print_env.py
|
||||
diffusers-cli env
|
||||
- name: Extract tests
|
||||
id: extract_tests
|
||||
run: |
|
||||
@@ -202,15 +207,15 @@ jobs:
|
||||
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
|
||||
uv pip install -e ".[quality]"
|
||||
uv pip install peft@git+https://github.com/huggingface/peft.git
|
||||
uv pip uninstall accelerate && uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git
|
||||
uv pip uninstall transformers huggingface_hub && UV_PRERELEASE=allow uv pip install -U transformers@git+https://github.com/huggingface/transformers.git
|
||||
uv pip uninstall tokenizers && uv pip install "tokenizers<=0.23.0"
|
||||
|
||||
- name: Environment
|
||||
run: |
|
||||
python utils/print_env.py
|
||||
diffusers-cli env
|
||||
|
||||
- name: Extract tests
|
||||
id: extract_tests
|
||||
@@ -267,13 +272,13 @@ jobs:
|
||||
nvidia-smi
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
|
||||
uv pip uninstall transformers huggingface_hub && UV_PRERELEASE=allow uv pip install -U transformers@git+https://github.com/huggingface/transformers.git
|
||||
uv pip uninstall tokenizers && uv pip install "tokenizers<=0.23.0"
|
||||
uv pip install -e ".[quality,training]"
|
||||
|
||||
- name: Environment
|
||||
run: |
|
||||
python utils/print_env.py
|
||||
diffusers-cli env
|
||||
|
||||
- name: Run example tests on GPU
|
||||
env:
|
||||
|
||||
24
.github/workflows/push_tests.yml
vendored
@@ -20,6 +20,9 @@ env:
|
||||
HF_XET_HIGH_PERFORMANCE: 1
|
||||
PYTEST_TIMEOUT: 600
|
||||
PIPELINE_USAGE_CUTOFF: 50000
|
||||
# Force tokenizers<0.23.0 across every `uv pip install` in this workflow,
|
||||
# even when transformers@main declares a higher lower-bound.
|
||||
UV_OVERRIDE: /tmp/uv-overrides.txt
|
||||
|
||||
jobs:
|
||||
setup_torch_cuda_pipeline_matrix:
|
||||
@@ -37,10 +40,11 @@ jobs:
|
||||
fetch-depth: 2
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
|
||||
uv pip install -e ".[quality]"
|
||||
- name: Environment
|
||||
run: |
|
||||
python utils/print_env.py
|
||||
diffusers-cli env
|
||||
- name: Fetch Pipeline Matrix
|
||||
id: fetch_pipeline_matrix
|
||||
run: |
|
||||
@@ -77,13 +81,13 @@ jobs:
|
||||
nvidia-smi
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
|
||||
uv pip install -e ".[quality]"
|
||||
uv pip uninstall accelerate && uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git
|
||||
uv pip uninstall transformers huggingface_hub && UV_PRERELEASE=allow uv pip install -U transformers@git+https://github.com/huggingface/transformers.git
|
||||
uv pip uninstall tokenizers && uv pip install "tokenizers<=0.23.0"
|
||||
- name: Environment
|
||||
run: |
|
||||
python utils/print_env.py
|
||||
diffusers-cli env
|
||||
- name: PyTorch CUDA checkpoint tests on Ubuntu
|
||||
env:
|
||||
HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
|
||||
@@ -129,15 +133,15 @@ jobs:
|
||||
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
|
||||
uv pip install -e ".[quality]"
|
||||
uv pip install peft@git+https://github.com/huggingface/peft.git
|
||||
uv pip uninstall accelerate && uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git
|
||||
uv pip uninstall transformers huggingface_hub && UV_PRERELEASE=allow uv pip install -U transformers@git+https://github.com/huggingface/transformers.git
|
||||
uv pip uninstall tokenizers && uv pip install "tokenizers<=0.23.0"
|
||||
|
||||
- name: Environment
|
||||
run: |
|
||||
python utils/print_env.py
|
||||
diffusers-cli env
|
||||
|
||||
- name: Run PyTorch CUDA tests
|
||||
env:
|
||||
@@ -184,12 +188,12 @@ jobs:
|
||||
nvidia-smi
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
|
||||
uv pip install -e ".[quality,training]"
|
||||
uv pip uninstall transformers huggingface_hub && UV_PRERELEASE=allow uv pip install -U transformers@git+https://github.com/huggingface/transformers.git
|
||||
uv pip uninstall tokenizers && uv pip install "tokenizers<=0.23.0"
|
||||
- name: Environment
|
||||
run: |
|
||||
python utils/print_env.py
|
||||
diffusers-cli env
|
||||
- name: Run example tests on GPU
|
||||
env:
|
||||
HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
|
||||
@@ -228,10 +232,11 @@ jobs:
|
||||
nvidia-smi
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
|
||||
uv pip install -e ".[quality,training]"
|
||||
- name: Environment
|
||||
run: |
|
||||
python utils/print_env.py
|
||||
diffusers-cli env
|
||||
- name: Run example tests on GPU
|
||||
env:
|
||||
HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
|
||||
@@ -268,11 +273,12 @@ jobs:
|
||||
nvidia-smi
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
|
||||
uv pip install -e ".[quality,training]"
|
||||
|
||||
- name: Environment
|
||||
run: |
|
||||
python utils/print_env.py
|
||||
diffusers-cli env
|
||||
|
||||
- name: Run example tests on GPU
|
||||
env:
|
||||
|
||||
2
.github/workflows/push_tests_fast.yml
vendored
@@ -67,7 +67,7 @@ jobs:
|
||||
|
||||
- name: Environment
|
||||
run: |
|
||||
python utils/print_env.py
|
||||
diffusers-cli env
|
||||
|
||||
- name: Run fast PyTorch CPU tests
|
||||
if: ${{ matrix.config.framework == 'pytorch' }}
|
||||
|
||||
7
.github/workflows/push_tests_mps.yml
vendored
@@ -14,6 +14,9 @@ env:
|
||||
HF_XET_HIGH_PERFORMANCE: 1
|
||||
PYTEST_TIMEOUT: 600
|
||||
RUN_SLOW: no
|
||||
# Force tokenizers<0.23.0 across every `uv pip install` in this workflow,
|
||||
# even when transformers@main declares a higher lower-bound.
|
||||
UV_OVERRIDE: /tmp/uv-overrides.txt
|
||||
|
||||
concurrency:
|
||||
group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
|
||||
@@ -43,17 +46,17 @@ jobs:
|
||||
- name: Install dependencies
|
||||
shell: arch -arch arm64 bash {0}
|
||||
run: |
|
||||
echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
|
||||
${CONDA_RUN} python -m pip install --upgrade pip uv
|
||||
${CONDA_RUN} python -m uv pip install -e ".[quality]"
|
||||
${CONDA_RUN} python -m uv pip install torch torchvision torchaudio
|
||||
${CONDA_RUN} python -m uv pip install accelerate@git+https://github.com/huggingface/accelerate.git
|
||||
${CONDA_RUN} python -m uv pip install transformers --upgrade
|
||||
uv pip uninstall tokenizers && uv pip install "tokenizers<=0.23.0"
|
||||
|
||||
- name: Environment
|
||||
shell: arch -arch arm64 bash {0}
|
||||
run: |
|
||||
${CONDA_RUN} python utils/print_env.py
|
||||
${CONDA_RUN} diffusers-cli env
|
||||
|
||||
- name: Run fast PyTorch tests on M1 (MPS)
|
||||
shell: arch -arch arm64 bash {0}
|
||||
|
||||
1
.github/workflows/pypi_publish.yaml
vendored
@@ -44,7 +44,6 @@ jobs:
|
||||
run: |
|
||||
pip install -U transformers
|
||||
uv pip uninstall tokenizers && uv pip install "tokenizers<=0.23.0"
|
||||
python utils/print_env.py
|
||||
python -c "from diffusers import __version__; print(__version__)"
|
||||
python -c "from diffusers import DiffusionPipeline; pipe = DiffusionPipeline.from_pretrained('fusing/unet-ldm-dummy-update'); pipe()"
|
||||
python -c "from diffusers import DiffusionPipeline; pipe = DiffusionPipeline.from_pretrained('hf-internal-testing/tiny-stable-diffusion-pipe', safety_checker=None); pipe('ah suh du')"
|
||||
|
||||
31
.github/workflows/release_tests_fast.yml
vendored
@@ -19,6 +19,9 @@ env:
|
||||
MKL_NUM_THREADS: 8
|
||||
PYTEST_TIMEOUT: 600
|
||||
PIPELINE_USAGE_CUTOFF: 50000
|
||||
# Force tokenizers<0.23.0 across every `uv pip install` in this workflow,
|
||||
# even when transformers@main declares a higher lower-bound.
|
||||
UV_OVERRIDE: /tmp/uv-overrides.txt
|
||||
|
||||
jobs:
|
||||
setup_torch_cuda_pipeline_matrix:
|
||||
@@ -36,12 +39,12 @@ jobs:
|
||||
fetch-depth: 2
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
|
||||
uv pip install -e ".[quality]"
|
||||
uv pip uninstall transformers huggingface_hub && UV_PRERELEASE=allow uv pip install -U transformers@git+https://github.com/huggingface/transformers.git
|
||||
uv pip uninstall tokenizers && uv pip install "tokenizers<=0.23.0"
|
||||
- name: Environment
|
||||
run: |
|
||||
python utils/print_env.py
|
||||
diffusers-cli env
|
||||
- name: Fetch Pipeline Matrix
|
||||
id: fetch_pipeline_matrix
|
||||
run: |
|
||||
@@ -78,13 +81,13 @@ jobs:
|
||||
nvidia-smi
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
|
||||
uv pip install -e ".[quality]"
|
||||
uv pip uninstall accelerate && uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git
|
||||
uv pip uninstall transformers huggingface_hub && UV_PRERELEASE=allow uv pip install -U transformers@git+https://github.com/huggingface/transformers.git
|
||||
uv pip uninstall tokenizers && uv pip install "tokenizers<=0.23.0"
|
||||
- name: Environment
|
||||
run: |
|
||||
python utils/print_env.py
|
||||
diffusers-cli env
|
||||
- name: Slow PyTorch CUDA checkpoint tests on Ubuntu
|
||||
env:
|
||||
HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
|
||||
@@ -130,15 +133,15 @@ jobs:
|
||||
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
|
||||
uv pip install -e ".[quality]"
|
||||
uv pip install peft@git+https://github.com/huggingface/peft.git
|
||||
uv pip uninstall accelerate && uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git
|
||||
uv pip uninstall transformers huggingface_hub && UV_PRERELEASE=allow uv pip install -U transformers@git+https://github.com/huggingface/transformers.git
|
||||
uv pip uninstall tokenizers && uv pip install "tokenizers<=0.23.0"
|
||||
|
||||
- name: Environment
|
||||
run: |
|
||||
python utils/print_env.py
|
||||
diffusers-cli env
|
||||
|
||||
- name: Run PyTorch CUDA tests
|
||||
env:
|
||||
@@ -182,15 +185,15 @@ jobs:
|
||||
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
|
||||
uv pip install -e ".[quality]"
|
||||
uv pip install peft@git+https://github.com/huggingface/peft.git
|
||||
uv pip uninstall accelerate && uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git
|
||||
uv pip uninstall transformers huggingface_hub && UV_PRERELEASE=allow uv pip install -U transformers@git+https://github.com/huggingface/transformers.git
|
||||
uv pip uninstall tokenizers && uv pip install "tokenizers<=0.23.0"
|
||||
|
||||
- name: Environment
|
||||
run: |
|
||||
python utils/print_env.py
|
||||
diffusers-cli env
|
||||
|
||||
- name: Run PyTorch CUDA tests
|
||||
env:
|
||||
@@ -243,12 +246,12 @@ jobs:
|
||||
nvidia-smi
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
|
||||
uv pip install -e ".[quality,training]"
|
||||
uv pip uninstall transformers huggingface_hub && UV_PRERELEASE=allow uv pip install -U transformers@git+https://github.com/huggingface/transformers.git
|
||||
uv pip uninstall tokenizers && uv pip install "tokenizers<=0.23.0"
|
||||
- name: Environment
|
||||
run: |
|
||||
python utils/print_env.py
|
||||
diffusers-cli env
|
||||
- name: Run torch compile tests on GPU
|
||||
env:
|
||||
HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
|
||||
@@ -287,12 +290,12 @@ jobs:
|
||||
nvidia-smi
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
|
||||
uv pip install -e ".[quality,training]"
|
||||
uv pip uninstall transformers huggingface_hub && UV_PRERELEASE=allow uv pip install -U transformers@git+https://github.com/huggingface/transformers.git
|
||||
uv pip uninstall tokenizers && uv pip install "tokenizers<=0.23.0"
|
||||
- name: Environment
|
||||
run: |
|
||||
python utils/print_env.py
|
||||
diffusers-cli env
|
||||
- name: Run example tests on GPU
|
||||
env:
|
||||
HF_TOKEN: ${{ secrets.DIFFUSERS_HF_HUB_READ_TOKEN }}
|
||||
@@ -331,13 +334,13 @@ jobs:
|
||||
|
||||
- name: Install dependencies
|
||||
run: |
|
||||
echo 'tokenizers<0.23.0' > "$UV_OVERRIDE"
|
||||
uv pip install -e ".[quality,training]"
|
||||
uv pip uninstall transformers huggingface_hub && UV_PRERELEASE=allow uv pip install -U transformers@git+https://github.com/huggingface/transformers.git
|
||||
uv pip uninstall tokenizers && uv pip install "tokenizers<=0.23.0"
|
||||
|
||||
- name: Environment
|
||||
run: |
|
||||
python utils/print_env.py
|
||||
diffusers-cli env
|
||||
|
||||
- name: Run example tests on GPU
|
||||
env:
|
||||
|
||||
5
Makefile
@@ -36,6 +36,7 @@ repo-consistency:
|
||||
python utils/check_dummies.py
|
||||
python utils/check_repo.py
|
||||
python utils/check_inits.py
|
||||
python utils/check_forward_call_docstrings.py
|
||||
|
||||
# this target runs checks on all files
|
||||
|
||||
@@ -74,6 +75,10 @@ fix-copies:
|
||||
modular-autodoctrings:
|
||||
python utils/modular_auto_docstring.py
|
||||
|
||||
# Verify forward() / __call__() arguments are documented in their docstrings
|
||||
check-forward-call-docstrings:
|
||||
python utils/check_forward_call_docstrings.py
|
||||
|
||||
# Run tests for the library
|
||||
|
||||
test:
|
||||
|
||||
@@ -299,6 +299,10 @@
|
||||
title: AceStepTransformer1DModel
|
||||
- local: api/models/allegro_transformer3d
|
||||
title: AllegroTransformer3DModel
|
||||
- local: api/models/anyflow_far_transformer3d
|
||||
title: AnyFlowFARTransformer3DModel
|
||||
- local: api/models/anyflow_transformer3d
|
||||
title: AnyFlowTransformer3DModel
|
||||
- local: api/models/aura_flow_transformer2d
|
||||
title: AuraFlowTransformer2DModel
|
||||
- local: api/models/transformer_bria_fibo
|
||||
@@ -631,6 +635,8 @@
|
||||
- sections:
|
||||
- local: api/pipelines/allegro
|
||||
title: Allegro
|
||||
- local: api/pipelines/anyflow
|
||||
title: AnyFlow
|
||||
- local: api/pipelines/chronoedit
|
||||
title: ChronoEdit
|
||||
- local: api/pipelines/cogvideox
|
||||
@@ -706,6 +712,8 @@
|
||||
title: EulerAncestralDiscreteScheduler
|
||||
- local: api/schedulers/euler
|
||||
title: EulerDiscreteScheduler
|
||||
- local: api/schedulers/flow_map_euler_discrete
|
||||
title: FlowMapEulerDiscreteScheduler
|
||||
- local: api/schedulers/flow_match_euler_discrete
|
||||
title: FlowMatchEulerDiscreteScheduler
|
||||
- local: api/schedulers/flow_match_heun_discrete
|
||||
|
||||
45
docs/source/en/api/models/anyflow_far_transformer3d.md
Normal file
@@ -0,0 +1,45 @@
|
||||
<!-- Copyright 2026 The AnyFlow Team, NVIDIA Corp., and The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the
|
||||
License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an
|
||||
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
# AnyFlowFARTransformer3DModel
|
||||
|
||||
The causal (FAR) 3D Transformer used by [`AnyFlowFARPipeline`](../pipelines/anyflow#anyflowfarpipeline) —
|
||||
the FAR variant of [AnyFlow](https://huggingface.co/papers/2605.13724) (Yuchao Gu, Guian Fang et al., NUS
|
||||
ShowLab × NVIDIA). It extends the v0.35.1 Wan2.1 backbone with three additions:
|
||||
|
||||
1. **FAR causal block-mask** via `torch.nn.attention.flex_attention`, supporting frame-level autoregressive
|
||||
generation as introduced in [FAR (Gu et al., 2025)](https://arxiv.org/abs/2503.19325).
|
||||
2. **Compressed-frame patch embedding** (`far_patch_embedding`) for context (already-generated) frames,
|
||||
warm-started from the full-resolution `patch_embedding` at construction time via trilinear interpolation.
|
||||
3. **Dual-timestep flow-map embedding** (same as
|
||||
[`AnyFlowTransformer3DModel`](anyflow_transformer3d)) — every forward call conditions on both the source
|
||||
timestep ``t`` and the target timestep ``r``.
|
||||
|
||||
The chunk schedule (`chunk_partition`) is **not** baked into the model config. It is a per-call argument to
|
||||
`forward`, so the same checkpoint handles different `num_frames` configurations without retraining.
|
||||
|
||||
```python
|
||||
from diffusers import AnyFlowFARTransformer3DModel
|
||||
|
||||
# Causal AnyFlow checkpoint (FAR):
|
||||
transformer = AnyFlowFARTransformer3DModel.from_pretrained(
|
||||
"nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers", subfolder="transformer"
|
||||
)
|
||||
```
|
||||
|
||||
## AnyFlowFARTransformer3DModel
|
||||
|
||||
[[autodoc]] AnyFlowFARTransformer3DModel
|
||||
|
||||
## AnyFlowFARTransformerOutput
|
||||
|
||||
[[autodoc]] models.transformers.transformer_anyflow_far.AnyFlowFARTransformerOutput
|
||||
36
docs/source/en/api/models/anyflow_transformer3d.md
Normal file
@@ -0,0 +1,36 @@
|
||||
<!-- Copyright 2026 The AnyFlow Team, NVIDIA Corp., and The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the
|
||||
License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an
|
||||
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
# AnyFlowTransformer3DModel
|
||||
|
||||
The bidirectional 3D Transformer used by [`AnyFlowPipeline`](../pipelines/anyflow#anyflowpipeline). It is the
|
||||
v0.35.1 Wan2.1 backbone with one structural change: the timestep embedder is replaced by
|
||||
``AnyFlowDualTimestepTextImageEmbedding``, so every forward call conditions on both the source timestep
|
||||
``t`` and the target timestep ``r``. This is the embedding required to learn the flow map
|
||||
$\Phi_{r\leftarrow t}$ introduced in
|
||||
[AnyFlow](https://huggingface.co/papers/2605.13724) (Yuchao Gu, Guian Fang et al., NUS ShowLab × NVIDIA).
|
||||
|
||||
For frame-level autoregressive (FAR causal) generation, use
|
||||
[`AnyFlowFARTransformer3DModel`](anyflow_far_transformer3d) instead.
|
||||
|
||||
```python
|
||||
from diffusers import AnyFlowTransformer3DModel
|
||||
|
||||
# Bidirectional AnyFlow checkpoint (T2V):
|
||||
transformer = AnyFlowTransformer3DModel.from_pretrained(
|
||||
"nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers", subfolder="transformer"
|
||||
)
|
||||
```
|
||||
|
||||
## AnyFlowTransformer3DModel
|
||||
|
||||
[[autodoc]] AnyFlowTransformer3DModel
|
||||
@@ -26,7 +26,7 @@ ACE-Step 1.5 ships three DiT checkpoints that share the same transformer archite
|
||||
|
||||
| Variant | CFG | Default steps | Default `guidance_scale` | Default `shift` | HF repo |
|
||||
|---------|:---:|:-------------:|:------------------------:|:---------------:|---------|
|
||||
| `turbo` (guidance-distilled) | off | 8 | ignored | 3.0 | [`ACE-Step/Ace-Step1.5`](https://huggingface.co/ACE-Step/Ace-Step1.5) |
|
||||
| `turbo` (guidance-distilled) | off | 8 | ignored | 3.0 | [`ACE-Step/acestep-v15-xl-turbo-diffusers`](https://huggingface.co/ACE-Step/acestep-v15-xl-turbo-diffusers) |
|
||||
| `base` | on | 8 | 7.0 | 3.0 | [`ACE-Step/acestep-v15-base`](https://huggingface.co/ACE-Step/acestep-v15-base) |
|
||||
| `sft` | on | 8 | 7.0 | 3.0 | [`ACE-Step/acestep-v15-sft`](https://huggingface.co/ACE-Step/acestep-v15-sft) |
|
||||
|
||||
@@ -54,7 +54,7 @@ import torch
|
||||
import soundfile as sf
|
||||
from diffusers import AceStepPipeline
|
||||
|
||||
pipe = AceStepPipeline.from_pretrained("ACE-Step/Ace-Step1.5", torch_dtype=torch.bfloat16)
|
||||
pipe = AceStepPipeline.from_pretrained("ACE-Step/acestep-v15-xl-turbo-diffusers", torch_dtype=torch.bfloat16)
|
||||
pipe = pipe.to("cuda")
|
||||
|
||||
audio = pipe(
|
||||
|
||||
218
docs/source/en/api/pipelines/anyflow.md
Normal file
@@ -0,0 +1,218 @@
|
||||
<!-- Copyright 2026 The AnyFlow Team, NVIDIA Corp., and The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the
|
||||
License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an
|
||||
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
<div style="float: right;">
|
||||
<div class="flex flex-wrap space-x-1">
|
||||
<a href="https://github.com/huggingface/diffusers/blob/main/src/diffusers/loaders/lora_pipeline.py">
|
||||
<img alt="LoRA" src="https://img.shields.io/badge/LoRA-supported-green">
|
||||
</a>
|
||||
</div>
|
||||
</div>
|
||||
|
||||
# AnyFlow
|
||||
|
||||
[AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation](https://huggingface.co/papers/2605.13724) by Yuchao Gu, Guian Fang and collaborators at [NUS ShowLab](https://sites.google.com/view/showlab) in collaboration with NVIDIA.
|
||||
|
||||
*Few-step video generation has been significantly advanced by consistency models. However, their performance often degrades in any-step video diffusion models due to the fixed-point formulation. To address this limitation, we present AnyFlow, the first any-step video diffusion distillation framework built on flow maps. Instead of learning only the mapping z_t → z_0, AnyFlow learns transitions z_t → z_r over arbitrary time intervals, enabling a single model to adapt to different inference budgets. We design an improved forward flow map training recipe that fine-tunes pretrained video diffusion models into flow map models, and introduce Flow Map Backward Simulation to enable on-policy distillation for flow map models. Extensive experiments across both bidirectional and causal architectures, at scales ranging from 1.3B to 14B, on text-to-video and image-to-video tasks demonstrate that AnyFlow outperforms consistency-based baselines while preserving high fidelity and flexible sampling under varying step budgets.*
|
||||
|
||||
The original training code is at [`NVlabs/AnyFlow`](https://github.com/NVlabs/AnyFlow). The project page is at [nvlabs.github.io/AnyFlow](https://nvlabs.github.io/AnyFlow).
|
||||
|
||||
The following AnyFlow checkpoints are supported:
|
||||
|
||||
| Checkpoint | Backbone | Description |
|
||||
|------------|----------|-------------|
|
||||
| [`nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers`](https://huggingface.co/nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers) | Wan2.1 1.3B | Bidirectional T2V, lightweight |
|
||||
| [`nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers`](https://huggingface.co/nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers) | Wan2.1 14B | Bidirectional T2V, full quality |
|
||||
| [`nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers`](https://huggingface.co/nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers) | FAR + Wan2.1 1.3B | Causal T2V / I2V / V2V |
|
||||
| [`nvidia/AnyFlow-FAR-Wan2.1-14B-Diffusers`](https://huggingface.co/nvidia/AnyFlow-FAR-Wan2.1-14B-Diffusers) | FAR + Wan2.1 14B | Causal T2V / I2V / V2V |
|
||||
|
||||
All four are grouped under the [`nvidia/anyflow`](https://huggingface.co/collections/nvidia/anyflow) Hugging Face collection.
|
||||
|
||||
> [!TIP]
|
||||
> Choose `AnyFlowPipeline` for traditional bidirectional text-to-video generation. Choose `AnyFlowFARPipeline` for streaming I2V, video continuation (V2V), or any setup that benefits from frame-by-frame autoregressive sampling.
|
||||
|
||||
> [!TIP]
|
||||
> AnyFlow supports any-step sampling: a single distilled checkpoint can be evaluated at 1, 2, 4, 8, 16... NFE without retraining. Quality scales monotonically with steps in our benchmarks.
|
||||
|
||||
### Optimizing Memory and Inference Speed
|
||||
|
||||
<hfoptions id="optimization">
|
||||
<hfoption id="memory">
|
||||
|
||||
```py
|
||||
import torch
|
||||
from diffusers import AnyFlowPipeline
|
||||
from diffusers.hooks import apply_group_offloading
|
||||
|
||||
pipe = AnyFlowPipeline.from_pretrained(
|
||||
"nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers", torch_dtype=torch.bfloat16
|
||||
)
|
||||
apply_group_offloading(pipe.transformer, onload_device="cuda", offload_type="leaf_level")
|
||||
pipe.vae.enable_slicing()
|
||||
pipe.vae.enable_tiling()
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="inference speed">
|
||||
|
||||
```py
|
||||
import torch
|
||||
from diffusers import AnyFlowPipeline
|
||||
|
||||
pipe = AnyFlowPipeline.from_pretrained(
|
||||
"nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers", torch_dtype=torch.bfloat16
|
||||
).to("cuda")
|
||||
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune-no-cudagraphs")
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
### Generation with AnyFlow (Bidirectional T2V)
|
||||
|
||||
<hfoptions id="anyflow-bidi">
|
||||
<hfoption id="usage">
|
||||
|
||||
```py
|
||||
import torch
|
||||
from diffusers import AnyFlowPipeline
|
||||
from diffusers.utils import export_to_video
|
||||
|
||||
pipe = AnyFlowPipeline.from_pretrained(
|
||||
"nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers", torch_dtype=torch.bfloat16
|
||||
).to("cuda")
|
||||
|
||||
prompt = "A red panda eating bamboo in a forest, cinematic lighting"
|
||||
video = pipe(prompt, num_inference_steps=4, num_frames=33).frames[0]
|
||||
export_to_video(video, "out.mp4", fps=16)
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
### Generation with AnyFlow (FAR Causal)
|
||||
|
||||
The causal pipeline selects between T2V / I2V / V2V via the ``video`` (or ``video_latents``) argument:
|
||||
omit both for plain text-to-video, or pass ``video=<tensor>`` of shape ``(B, T, C, H, W)`` in ``[0, 1]``
|
||||
with ``T = 4n + 1`` to condition on existing frames. Use a single conditioning frame for I2V and a longer
|
||||
clip for V2V continuation. If you already have pre-encoded latents in the model layout, pass them via
|
||||
``video_latents=<tensor>`` to skip VAE encoding. ``video`` and ``video_latents`` are mutually exclusive.
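
The `T = 4n + 1` rule above can be checked before calling the pipeline. The following is a minimal sketch, not part of diffusers: the helper names `context_latent_frames` and `describe_mode` are hypothetical, and the temporal stride of 4 is taken from the text. It maps a raw context length to its latent length and labels the task the context implies.

```python
from typing import Optional


def context_latent_frames(num_context_frames: int, temporal_stride: int = 4) -> int:
    """Map a raw context length T = stride * n + 1 to its latent length n + 1."""
    if num_context_frames < 1 or (num_context_frames - 1) % temporal_stride != 0:
        raise ValueError(
            f"context length must have the form {temporal_stride}n + 1, got {num_context_frames}"
        )
    return (num_context_frames - 1) // temporal_stride + 1


def describe_mode(num_context_frames: Optional[int]) -> str:
    """Label the task implied by the context length (None means no context was passed)."""
    if num_context_frames is None:
        return "t2v"
    context_latent_frames(num_context_frames)  # validates T = 4n + 1
    return "i2v" if num_context_frames == 1 else "v2v"
```

For example, `context_latent_frames(9)` gives 3 latent frames, and `describe_mode(1)` labels a single conditioning frame as `"i2v"`.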
|
||||
|
||||
> [!IMPORTANT]
|
||||
> `AnyFlowFARPipeline.default_chunk_partition = [1, 3, 3, 3, 3, 3, 3, 2]` (sum 21) is matched to the
|
||||
> released checkpoints' canonical 81 raw frames (21 latent frames at the VAE temporal stride of 4). When
|
||||
> you change `num_frames`, you must also pass a matching `chunk_partition` summing to
|
||||
> `(num_frames - 1) // 4 + 1`, otherwise the pipeline raises an `AssertionError`.
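
The constraint in the note above is simple arithmetic, so it can be verified up front. This is a minimal sketch under stated assumptions: `latent_frame_count` and `check_chunk_partition` are hypothetical helpers, not diffusers APIs, and they raise `ValueError` rather than the pipeline's `AssertionError`.

```python
from typing import List


def latent_frame_count(num_frames: int, temporal_stride: int = 4) -> int:
    """Latent frames for `num_frames` raw frames: (num_frames - 1) // stride + 1."""
    if num_frames < 1:
        raise ValueError("num_frames must be positive")
    return (num_frames - 1) // temporal_stride + 1


def check_chunk_partition(num_frames: int, chunk_partition: List[int]) -> None:
    """Hypothetical pre-flight check: the partition must sum to the latent frame count."""
    expected = latent_frame_count(num_frames)
    total = sum(chunk_partition)
    if total != expected:
        raise ValueError(
            f"chunk_partition sums to {total}, but num_frames={num_frames} needs {expected} latent frames"
        )


# The default partition matches the canonical 81 raw frames (21 latent frames).
check_chunk_partition(81, [1, 3, 3, 3, 3, 3, 3, 2])
```

A shorter clip needs its own partition: `num_frames=33` gives 9 latent frames, so a partition such as `[1, 4, 4]` passes the check, while the default partition would not.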
|
||||
|
||||
<hfoptions id="anyflow-far">
|
||||
<hfoption id="t2v">
|
||||
|
||||
```py
|
||||
import torch
|
||||
from diffusers import AnyFlowFARPipeline
|
||||
from diffusers.utils import export_to_video
|
||||
|
||||
pipe = AnyFlowFARPipeline.from_pretrained(
|
||||
"nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers", torch_dtype=torch.bfloat16
|
||||
).to("cuda")
|
||||
|
||||
video = pipe(
|
||||
prompt="A cat surfing a wave, sunset",
|
||||
num_inference_steps=4,
|
||||
num_frames=81,
|
||||
).frames[0]
|
||||
export_to_video(video, "out.mp4", fps=16)
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="i2v">
|
||||
|
||||
```py
|
||||
import numpy as np
|
||||
import torch
|
||||
from diffusers import AnyFlowFARPipeline
|
||||
from diffusers.utils import export_to_video, load_image
|
||||
|
||||
pipe = AnyFlowFARPipeline.from_pretrained(
|
||||
"nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers", torch_dtype=torch.bfloat16
|
||||
).to("cuda")
|
||||
|
||||
# Wrap the conditioning image as a one-frame video tensor: (1, 1, 3, H, W) in [0, 1].
|
||||
first_frame = load_image("path/to/first_frame.png").resize((832, 480))
|
||||
arr = np.asarray(first_frame).astype("float32") / 255.0 # (480, 832, 3)
|
||||
context_tensor = torch.from_numpy(arr).permute(2, 0, 1).unsqueeze(0).unsqueeze(1).to("cuda")
|
||||
|
||||
video = pipe(
|
||||
prompt="a cat walks across a sunlit lawn",
|
||||
video=context_tensor,
|
||||
num_inference_steps=4,
|
||||
num_frames=81,
|
||||
).frames[0]
|
||||
export_to_video(video, "out.mp4", fps=16)
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="v2v">
|
||||
|
||||
```py
|
||||
import numpy as np
|
||||
import torch
|
||||
from diffusers import AnyFlowFARPipeline
|
||||
from diffusers.utils import export_to_video, load_video
|
||||
|
||||
pipe = AnyFlowFARPipeline.from_pretrained(
|
||||
"nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers", torch_dtype=torch.bfloat16
|
||||
).to("cuda")
|
||||
|
||||
# Context clip — 9 raw frames map to 3 latent frames (9 = 4·2 + 1, 3 = 2 + 1).
|
||||
context_frames = load_video("path/to/context.mp4")[:9]
|
||||
arr = np.stack([np.asarray(f.resize((832, 480))) for f in context_frames]).astype("float32") / 255.0
|
||||
# np.stack gives (T, H, W, C) = (9, 480, 832, 3) → permute to (T, C, H, W) then add batch.
|
||||
context_tensor = torch.from_numpy(arr).permute(0, 3, 1, 2).unsqueeze(0).to("cuda") # (1, 9, 3, 480, 832)
|
||||
|
||||
video = pipe(
|
||||
prompt="continue the story",
|
||||
video=context_tensor,
|
||||
num_inference_steps=4,
|
||||
num_frames=81,
|
||||
# Override chunk_partition so the first chunk covers exactly the 3 latent context frames.
|
||||
chunk_partition=[3, 3, 3, 3, 3, 3, 3],
|
||||
).frames[0]
|
||||
export_to_video(video, "out.mp4", fps=16)
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## Notes
|
||||
|
||||
- Classifier-free guidance is fused into the released checkpoints, so inference does not run a second guided forward pass. Keep the default `guidance_scale=1.0` unless your own checkpoint requires otherwise.
|
||||
- `FlowMapEulerDiscreteScheduler` is general-purpose. You can attach it to any flow-map-distilled checkpoint via `from_pretrained(..., scheduler=FlowMapEulerDiscreteScheduler.from_config(...))`.
|
||||
- `AnyFlowPipeline` uses [`AnyFlowTransformer3DModel`](../models/anyflow_transformer3d) (bidirectional). `AnyFlowFARPipeline` uses [`AnyFlowFARTransformer3DModel`](../models/anyflow_far_transformer3d), which adds a compressed-frame patch embedding and the FAR causal block-mask.
|
||||
- LoRA loading is supported via `WanLoraLoaderMixin`, the same mixin used by the upstream Wan pipelines.
|
||||
- For training recipes (forward flow-map training and on-policy distillation), refer to the original AnyFlow training framework at [`NVlabs/AnyFlow`](https://github.com/NVlabs/AnyFlow); training is out of scope for diffusers.
|
||||
|
||||
## AnyFlowPipeline
|
||||
|
||||
[[autodoc]] AnyFlowPipeline
|
||||
- all
|
||||
- __call__
|
||||
|
||||
## AnyFlowFARPipeline
|
||||
|
||||
[[autodoc]] AnyFlowFARPipeline
|
||||
- all
|
||||
- __call__
|
||||
|
||||
## AnyFlowPipelineOutput
|
||||
|
||||
[[autodoc]] pipelines.anyflow.pipeline_output.AnyFlowPipelineOutput
|
||||
@@ -377,7 +377,7 @@ height = 512
|
||||
random_seed = 42
|
||||
frame_rate = 24.0
|
||||
generator = torch.Generator(device).manual_seed(random_seed)
|
||||
model_path = "dg845/LTX-2.3-Diffusers"
|
||||
model_path = "diffusers/LTX-2.3-Diffusers"
|
||||
|
||||
pipe = LTX2ImageToVideoPipeline.from_pretrained(model_path, torch_dtype=torch.bfloat16)
|
||||
pipe.enable_sequential_cpu_offload(device=device)
|
||||
@@ -449,7 +449,7 @@ height = 512
|
||||
random_seed = 42
|
||||
frame_rate = 24.0
|
||||
generator = torch.Generator(device).manual_seed(random_seed)
|
||||
model_path = "dg845/LTX-2.3-Diffusers"
|
||||
model_path = "diffusers/LTX-2.3-Diffusers"
|
||||
|
||||
pipe = LTX2Pipeline.from_pretrained(model_path, torch_dtype=torch.bfloat16)
|
||||
pipe.enable_model_cpu_offload(device=device)
|
||||
|
||||
28
docs/source/en/api/schedulers/flow_map_euler_discrete.md
Normal file
@@ -0,0 +1,28 @@
|
||||
<!-- Copyright 2026 The AnyFlow Team, NVIDIA Corp., and The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the
|
||||
License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an
|
||||
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
# FlowMapEulerDiscreteScheduler
|
||||
|
||||
`FlowMapEulerDiscreteScheduler` is an Euler-style sampler designed for flow-map-distilled diffusion
|
||||
models. Flow-map models learn arbitrary-interval transitions $\mathbf{z}_t \to \mathbf{z}_r$ rather than
|
||||
the fixed $\mathbf{z}_t \to \mathbf{z}_0$ mapping of consistency models. Both endpoints of the step are
|
||||
caller-provided, which is what enables any-step sampling: a single distilled checkpoint can be evaluated at
|
||||
1, 2, 4, 8, 16... NFE without retraining.
|
||||
|
||||
The scheduler was introduced in
|
||||
[AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation](https://huggingface.co/papers/2605.13724)
|
||||
and ships with the `AnyFlowPipeline` and `AnyFlowFARPipeline` integrations, but it is not
|
||||
AnyFlow-specific — any flow-map-distilled checkpoint can use it.
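
To illustrate why conditioning on both endpoints enables any-step sampling, here is a toy sketch. It is not the scheduler's implementation. It assumes a scalar ODE $dz/dt = z$ whose exact flow map is known in closed form, and a hypothetical `average_velocity` function stands in for a trained flow-map network $u(\mathbf{z}_t, t, r)$. An Euler-style step with the average velocity over $[r, t]$ lands on the exact $\mathbf{z}_r$ for any step size, while a plain Euler step with the instantaneous velocity does not.

```python
import math


def instantaneous_velocity(z: float, t: float) -> float:
    """Velocity field of the toy ODE dz/dt = z."""
    return z


def average_velocity(z: float, t: float, r: float) -> float:
    """Exact average velocity over [r, t] for the toy ODE (stand-in for a flow-map model)."""
    if t == r:
        return z
    return z * (math.exp(r - t) - 1.0) / (r - t)


def flow_map_step(z: float, t: float, r: float) -> float:
    """Euler-style update using the average velocity between the two endpoints."""
    return z + (r - t) * average_velocity(z, t, r)


def euler_step(z: float, t: float, r: float) -> float:
    """Plain Euler update using only the velocity at the source time t."""
    return z + (r - t) * instantaneous_velocity(z, t)


def sample(step_fn, z_start: float, num_steps: int) -> float:
    """Integrate from t = 1 down to t = 0 on a uniform grid with `num_steps` steps."""
    times = [1.0 - i / num_steps for i in range(num_steps + 1)]
    z = z_start
    for t, r in zip(times[:-1], times[1:]):
        z = step_fn(z, t, r)
    return z


exact = 2.0 * math.exp(-1.0)
one_step = sample(flow_map_step, 2.0, 1)
four_steps = sample(flow_map_step, 2.0, 4)
```

In this toy setting `one_step` and `four_steps` both agree with `exact` up to floating-point error, whereas `sample(euler_step, 2.0, 1)` collapses to `0.0`. A real flow-map model only approximates the average velocity, so sample quality can still depend on the step budget.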
|
||||
|
||||
## FlowMapEulerDiscreteScheduler
|
||||
|
||||
[[autodoc]] FlowMapEulerDiscreteScheduler
|
||||
@@ -130,6 +130,8 @@
|
||||
- title: Specific pipeline examples
|
||||
isExpanded: false
|
||||
sections:
|
||||
- local: using-diffusers/anyflow
|
||||
title: AnyFlow
|
||||
- local: using-diffusers/consisid
|
||||
title: ConsisID
|
||||
- local: using-diffusers/helios
|
||||
|
||||
253
docs/source/zh/using-diffusers/anyflow.md
Normal file
@@ -0,0 +1,253 @@
|
||||
<!-- Copyright 2026 The AnyFlow Team, NVIDIA Corp., and The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance
|
||||
with the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is
|
||||
distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See
|
||||
the License for the specific language governing permissions and limitations under the License.
|
||||
-->
|
||||
|
||||
# AnyFlow
|
||||
|
||||
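[AnyFlow](https://huggingface.co/papers/2605.13724) is a video diffusion **distillation** framework that distills a
pretrained Wan2.1 teacher model into a student model that supports *any-step* sampling under standard Euler
sampling. The same distilled checkpoint can run inference at 1, 2, 4, 8, 16... NFE, and **quality improves
monotonically with the number of steps**. This differs from consistency models, whose quality often drops as NFE
increases.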
|
||||
|
||||
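The core idea is to learn a **flow map** $\Phi_{r\leftarrow t}: \mathbf{z}_t \to \mathbf{z}_r$ (for any $1 \ge t \ge r \ge 0$)
rather than the fixed-endpoint mapping $\mathbf{z}_t \to \mathbf{z}_0$ learned by consistency models. The composability
of flow maps removes re-noising between sampling steps. The on-policy distillation stage additionally uses **DMD
reverse-divergence supervision** plus **Flow-Map backward simulation** (3-segment shortcut) to close the
exposure-bias gap left by consistency distillation.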
|
||||
|
||||
AnyFlow was developed by Yuchao Gu, Guian Fang and collaborators at [NUS ShowLab](https://sites.google.com/view/showlab) in collaboration with NVIDIA. The original training code is at [`NVlabs/AnyFlow`](https://github.com/NVlabs/AnyFlow), and the project page is [nvlabs.github.io/AnyFlow](https://nvlabs.github.io/AnyFlow). The four released checkpoints are grouped under the [`nvidia/anyflow`](https://huggingface.co/collections/nvidia/anyflow) Hugging Face collection.
|
||||
|
||||
This guide covers the practical essentials: how to choose a pipeline, how to use any-step sampling, and how to fit AnyFlow into T2V / I2V / V2V workflows.
|
||||
|
||||
## Bidirectional or Causal: choosing a pipeline
|
||||
|
||||
AnyFlow provides two pipeline variants. They share the same scheduler and distillation method and differ in **how frames are sampled**:
|
||||
|
||||
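- [`AnyFlowPipeline`](../api/pipelines/anyflow#anyflowpipeline): **bidirectional** T2V. It denoises the whole
  video tensor at once with global self-attention. Choose it for **prompt-only input without streaming output**.
- [`AnyFlowFARPipeline`](../api/pipelines/anyflow#anyflowfarpipeline): **causal (FAR)**. It denoises chunk by
  chunk, using block-sparse causal attention and a KV cache reused across chunks. Choose it for
  **image-to-video (I2V)**, **video continuation (V2V)**, or any scenario that benefits from frame-by-frame
  autoregressive sampling. The same model switches between the three task modes via two mutually exclusive
  kwargs: `video` (pixel space) or `video_latents` (pre-encoded latents).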
|
||||
|
||||
Quick comparison:
|
||||
|
||||
| Scenario | Pipeline | Call |
|----------|----------|------|
| Pure text-to-video, maximum quality at a fixed NFE | `AnyFlowPipeline` | `pipe(prompt, ...)` |
| Image-to-video (first frame given) | `AnyFlowFARPipeline` | `pipe(prompt, video=<single-frame tensor>, ...)` |
| Video continuation / V2V | `AnyFlowFARPipeline` | `pipe(prompt, video=<multi-frame tensor>, ...)` |
| Streaming / progressive generation | `AnyFlowFARPipeline` | - |
|
||||
|
||||
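At high resolution, the bidirectional pipeline is faster per token. The causal pipeline gives up a little per-step
speed in exchange for the ability to start sampling before all latent frames are allocated, which is especially
useful for very long sequences.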
|
||||
|
||||
## Loading checkpoints
|
||||
|
||||
NVIDIA has released four AnyFlow checkpoints, one for each combination of pipeline and model scale:
|
||||
|
||||
```py
|
||||
import torch
|
||||
from diffusers import AnyFlowPipeline, AnyFlowFARPipeline
|
||||
|
||||
# Bidirectional, lightweight
|
||||
pipe = AnyFlowPipeline.from_pretrained(
|
||||
"nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers", torch_dtype=torch.bfloat16
|
||||
).to("cuda")
|
||||
|
||||
# Bidirectional, full quality
|
||||
pipe = AnyFlowPipeline.from_pretrained(
|
||||
"nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers", torch_dtype=torch.bfloat16
|
||||
).to("cuda")
|
||||
|
||||
# Causal (FAR), 1.3B
|
||||
pipe = AnyFlowFARPipeline.from_pretrained(
|
||||
"nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers", torch_dtype=torch.bfloat16
|
||||
).to("cuda")
|
||||
|
||||
# Causal (FAR), 14B
|
||||
pipe = AnyFlowFARPipeline.from_pretrained(
|
||||
"nvidia/AnyFlow-FAR-Wan2.1-14B-Diffusers", torch_dtype=torch.bfloat16
|
||||
).to("cuda")
|
||||
```
|
||||
|
||||
All four checkpoints share the same [`FlowMapEulerDiscreteScheduler`](../api/schedulers/flow_map_euler_discrete),
with a default `shift=5.0`.
|
||||
|
||||
## Any-step sampling
|
||||
|
||||
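AnyFlow's key property is that the same checkpoint **needs no rescheduling**, and quality increases with NFE. Fix
the prompt and sweep the step count to see how the model trades latency against fidelity: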
|
||||
|
||||
```py
|
||||
import torch
|
||||
from diffusers import AnyFlowPipeline
|
||||
from diffusers.utils import export_to_video
|
||||
|
||||
pipe = AnyFlowPipeline.from_pretrained(
|
||||
"nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers", torch_dtype=torch.bfloat16
|
||||
).to("cuda")
|
||||
|
||||
prompt = "A red panda eating bamboo in a forest, cinematic lighting"
|
||||
|
||||
for nfe in [1, 2, 4, 8, 16, 32]:
    # Rebuild the generator on every iteration so that NFE is the only variable across runs.
    generator = torch.Generator("cuda").manual_seed(0)
    video = pipe(prompt, num_inference_steps=nfe, num_frames=33, generator=generator).frames[0]
    export_to_video(video, f"out_nfe{nfe}.mp4", fps=16)
|
||||
```
|
||||
|
||||
Tab. 3 and Fig. 1 of the paper show that the VBench Quality of every AnyFlow checkpoint rises monotonically over
the 4 → 32 NFE range, whereas consistency-style baselines (rCM, Self-Forcing) degrade over the same range.
|
||||
|
||||
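> [!TIP]
> Classifier-free guidance (CFG) is already fused into the weights during training. At inference the pipeline does
> **not** run an additional unconditional forward pass; guidance comes directly from the distilled weights. The
> released checkpoints all work with the default `guidance_scale=1.0`.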
|
||||
|
||||
## Image-to-video and video continuation
|
||||
|
||||
The causal pipeline supports three task modes with the same distilled model, **selected via the mutually exclusive `video` / `video_latents` arguments**:
|
||||
|
||||
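- `video`: a pixel-space tensor of shape `(B, T, C, H, W)` in `[0, 1]`; the pipeline runs it through
  `VideoProcessor` and VAE encoding internally;
- `video_latents`: latents already in the model layout; VAE encoding is skipped;
- neither argument: plain text-to-video;
- both arguments: raises `ValueError` (they are mutually exclusive).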
|
||||
|
||||
The number of frames in the context tensor must satisfy `T = 4n + 1`, matching the VAE temporal stride.
|
||||
|
||||
> [!IMPORTANT]
> The FAR pipeline performs a chunked rollout, so `num_frames` must be consistent with the chunk schedule. The
> default `chunk_partition=[1, 3, 3, 3, 3, 3, 3, 2]` (sum 21) corresponds to the released checkpoints' canonical
> `num_frames=81` (21 = (81 - 1) // 4 + 1). When you change `num_frames`, you **must** explicitly pass a matching
> `chunk_partition` whose sum equals `(num_frames - 1) // 4 + 1`; otherwise the pipeline raises an
> `AssertionError`. For example, `num_frames=33` corresponds to 9 latent frames, for which
> `chunk_partition=[1, 4, 4]` works.
|
||||
|
||||
```py
|
||||
import numpy as np
|
||||
import torch
|
||||
from diffusers import AnyFlowFARPipeline
|
||||
from diffusers.utils import export_to_video, load_image, load_video
|
||||
|
||||
pipe = AnyFlowFARPipeline.from_pretrained(
|
||||
"nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers", torch_dtype=torch.bfloat16
|
||||
).to("cuda")
|
||||
|
||||
|
||||
def to_video_tensor(images, height=480, width=832):
    """Convert a list of PIL images into the (B, T, C, H, W) tensor in [0, 1] that the FAR pipeline expects."""
    frames = np.stack([np.asarray(img.resize((width, height))) for img in images]).astype("float32") / 255.0
    # frames: (T, H, W, C) → (T, C, H, W) → add batch dim → (1, T, C, H, W)
    return torch.from_numpy(frames).permute(0, 3, 1, 2).unsqueeze(0)
|
||||
|
||||
|
||||
# 1) Text-to-video (no context). 81 frames matches the default chunk_partition.
video = pipe(prompt="A cat surfing a wave at sunset", num_inference_steps=4, num_frames=81).frames[0]
|
||||
export_to_video(video, "t2v.mp4", fps=16)
|
||||
|
||||
# 2) Image-to-video: a single context frame becomes 1 latent after the VAE, which lines up with the first entry of the default chunk_partition (`[1, ...]`).
|
||||
first_frame = load_image("path/to/first_frame.png")
|
||||
context_tensor = to_video_tensor([first_frame]).to("cuda") # (1, 1, 3, 480, 832), [0, 1]
|
||||
video = pipe(
|
||||
prompt="a cat walks across a sunlit lawn",
|
||||
video=context_tensor,
|
||||
num_inference_steps=4,
|
||||
num_frames=81,
|
||||
).frames[0]
|
||||
export_to_video(video, "i2v.mp4", fps=16)
|
||||
|
||||
# 3) 视频续写。9 帧 raw context → 3 个 latent context;显式覆盖 chunk_partition,让第一块正好覆盖 context。
|
||||
context_frames = load_video("path/to/context.mp4")[:9] # 9 = 4·2 + 1
|
||||
context_tensor = to_video_tensor(context_frames).to("cuda") # (1, 9, 3, 480, 832)
|
||||
video = pipe(
|
||||
prompt="继续这个故事",
|
||||
video=context_tensor,
|
||||
num_inference_steps=4,
|
||||
num_frames=81,
|
||||
chunk_partition=[3, 3, 3, 3, 3, 3, 3], # 7 个 chunk × 3 = 21 latent;首块就是 context
|
||||
).frames[0]
|
||||
export_to_video(video, "v2v.mp4", fps=16)
|
||||
```
|
||||
|
||||
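The underlying patchify chunk schedule adjusts automatically depending on whether `video` / `video_latents` is given: pure text-to-video uses kernel 2 (full) and kernel 4 (compressed); with context, the first chunk switches to kernel 1 so that the conditioning frames keep full resolution.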
|
||||
|
||||
If you already have VAE-encoded latents, pass `video_latents=<tensor>` directly to skip the `vae_encode` step (mutually exclusive with `video`).
|
||||
|
||||
## Memory and inference speed

With group offloading and VAE slicing, the 14B AnyFlow model runs on a single 40 GB GPU:
|
||||
|
||||
```py
import torch
from diffusers import AnyFlowPipeline
from diffusers.hooks import apply_group_offloading

pipe = AnyFlowPipeline.from_pretrained(
    "nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers", torch_dtype=torch.bfloat16
)
apply_group_offloading(pipe.transformer, onload_device="cuda", offload_type="leaf_level")
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
```
|
||||
|
||||
For latency, `torch.compile` works well on the transformer (the heaviest module):
|
||||
|
||||
```py
|
||||
pipe = pipe.to("cuda")
|
||||
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune-no-cudagraphs")
|
||||
```
|
||||
|
||||
The compilation overhead is amortized after a few steps; combined with AnyFlow's low NFE (4-8 steps), `torch.compile` gives a noticeable speedup over eager mode on the 14B model.
|
||||
|
||||
## LoRA fine-tuning

Both pipelines reuse [`WanLoraLoaderMixin`](../api/loaders/lora), so LoRA adapters trained for the corresponding Wan2.1 backbone load directly:
|
||||
|
||||
```py
|
||||
pipe.load_lora_weights("path/or/repo/with/wan_lora")
|
||||
```
|
||||
|
||||
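To **continue on-policy distillation fine-tuning** (training a new LoRA with the same DMD reverse-divergence supervision recipe as in the paper), refer to the original AnyFlow training framework [`NVlabs/AnyFlow`](https://github.com/NVlabs/AnyFlow); that training workflow is outside the scope of diffusers.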
|
||||
|
||||
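## Common pitfalls

- **Always use `guidance_scale=1.0`.** The distilled checkpoints already have CFG baked into the weights. Setting it `> 1` runs an extra unconditional forward pass, which doubles latency and slightly degrades quality.
- **The bidirectional pipeline does not support streaming.** All `num_frames` are denoised together. To play frames while sampling, use the causal pipeline.
- **The causal pipeline's KV cache assumes the chunk schedule stays consistent across calls.** Rebuilding the cache midway is not supported by the released models.
- **`num_frames` must match the VAE temporal stride.** The released checkpoints use values with `(N - 1) % 4 == 0` (e.g. 9, 17, 33, 81).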
|
||||
|
||||
## Citation
|
||||
|
||||
```bibtex
|
||||
@misc{gu2026anyflowanystepvideodiffusion,
|
||||
title={AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation},
|
||||
author={Yuchao Gu and Guian Fang and Yuxin Jiang and Weijia Mao and Song Han and Han Cai and Mike Zheng Shou},
|
||||
year={2026},
|
||||
eprint={2605.13724},
|
||||
archivePrefix={arXiv},
|
||||
primaryClass={cs.CV},
|
||||
url={https://arxiv.org/abs/2605.13724},
|
||||
}
|
||||
|
||||
@article{gu2025long,
|
||||
title={Long-Context Autoregressive Video Modeling with Next-Frame Prediction},
|
||||
author={Gu, Yuchao and Mao, Weijia and Shou, Mike Zheng},
|
||||
journal={arXiv preprint arXiv:2503.19325},
|
||||
year={2025}
|
||||
}
|
||||
```
|
||||
152
scripts/convert_anyflow_to_diffusers.py
Normal file
@@ -0,0 +1,152 @@
|
||||
# Copyright 2026 The AnyFlow Team, NVIDIA Corp., and The HuggingFace Team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
"""Convert AnyFlow training checkpoints to the diffusers ``save_pretrained`` layout.
|
||||
|
||||
The AnyFlow training pipeline emits ``.pt`` files containing an ``ema`` key whose value is a flat state
|
||||
dict for the transformer. This script:
|
||||
|
||||
1. Loads the matching base Wan2.1 pipeline from the Hub (provides VAE, tokenizer, and text encoder).
|
||||
2. Constructs an ``AnyFlowTransformer3DModel`` with the right config flags for the chosen variant.
|
||||
3. Loads the ``ema`` weights into the transformer.
|
||||
4. Wraps everything in an ``AnyFlowPipeline`` (bidirectional) or ``AnyFlowFARPipeline`` (FAR causal).
|
||||
5. Calls ``pipeline.save_pretrained(output_dir)``.
|
||||
|
||||
Example:
|
||||
|
||||
```bash
|
||||
python scripts/convert_anyflow_to_diffusers.py \\
|
||||
--variant AnyFlow-FAR-Wan2.1-1.3B-Diffusers \\
|
||||
--ckpt /path/to/anyflow-checkpoint.pt \\
|
||||
--output-dir /path/to/output/AnyFlow-FAR-Wan2.1-1.3B-Diffusers
|
||||
```
|
||||
"""
|
||||
|
||||
import argparse
import logging
import os

import torch

from diffusers import (
    AnyFlowFARPipeline,
    AnyFlowFARTransformer3DModel,
    AnyFlowPipeline,
    AnyFlowTransformer3DModel,
    FlowMapEulerDiscreteScheduler,
)


logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
|
||||
|
||||
|
||||
# Per-variant configuration. ``base_model`` is fetched from the Hub to source the matching VAE / text encoder.
VARIANTS = {
    "AnyFlow-FAR-Wan2.1-1.3B-Diffusers": {
        "base_model": "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
        "transformer_cls": AnyFlowFARTransformer3DModel,
        "transformer_kwargs": {"full_chunk_limit": 3, "compressed_patch_size": [1, 4, 4]},
        "pipeline_cls": AnyFlowFARPipeline,
    },
    "AnyFlow-FAR-Wan2.1-14B-Diffusers": {
        "base_model": "Wan-AI/Wan2.1-T2V-14B-Diffusers",
        "transformer_cls": AnyFlowFARTransformer3DModel,
        "transformer_kwargs": {"full_chunk_limit": 3, "compressed_patch_size": [1, 4, 4]},
        "pipeline_cls": AnyFlowFARPipeline,
    },
    "AnyFlow-Wan2.1-T2V-1.3B-Diffusers": {
        "base_model": "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
        "transformer_cls": AnyFlowTransformer3DModel,
        "transformer_kwargs": {},
        "pipeline_cls": AnyFlowPipeline,
    },
    "AnyFlow-Wan2.1-T2V-14B-Diffusers": {
        "base_model": "Wan-AI/Wan2.1-T2V-14B-Diffusers",
        "transformer_cls": AnyFlowTransformer3DModel,
        "transformer_kwargs": {},
        "pipeline_cls": AnyFlowPipeline,
    },
}
|
||||
|
||||
|
||||
def build_pipeline(variant: str, ckpt_path: str):
    if variant not in VARIANTS:
        raise ValueError(f"Unknown variant {variant!r}. Choices: {list(VARIANTS)}.")
    spec = VARIANTS[variant]

    transformer = spec["transformer_cls"].from_pretrained(
        spec["base_model"],
        subfolder="transformer",
        gate_value=0.25,
        deltatime_type="r",
        **spec["transformer_kwargs"],
    )
    # NVlabs/AnyFlow training checkpoints are wrapped Python objects (the `ema` key carries metadata
    # alongside tensors), so the unpickle is required. Only run this script on checkpoints you trust.
    state_dict = torch.load(ckpt_path, map_location="cpu", weights_only=False)["ema"]
    missing, unexpected = transformer.load_state_dict(state_dict, strict=False)
    if unexpected:
        logger.warning(
            "Unexpected keys in state dict (ignored): %s%s",
            unexpected[:5],
            "..." if len(unexpected) > 5 else "",
        )
    if missing:
        logger.warning(
            "Missing keys not loaded from state dict: %s%s",
            missing[:5],
            "..." if len(missing) > 5 else "",
        )

    scheduler = FlowMapEulerDiscreteScheduler(num_train_timesteps=1000, shift=5.0)

    pipeline = spec["pipeline_cls"].from_pretrained(
        spec["base_model"],
        transformer=transformer,
        scheduler=scheduler,
    )
    return pipeline
|
||||
|
||||
|
||||
def main():
    parser = argparse.ArgumentParser(
        description="Convert an AnyFlow training checkpoint into a diffusers pipeline directory."
    )
    parser.add_argument(
        "--variant",
        required=True,
        choices=list(VARIANTS),
        help="Which AnyFlow variant the checkpoint corresponds to.",
    )
    parser.add_argument(
        "--ckpt",
        required=True,
        help="Path to the AnyFlow training checkpoint (a .pt file containing an 'ema' key).",
    )
    parser.add_argument(
        "--output-dir",
        required=True,
        help="Destination directory for pipeline.save_pretrained.",
    )
    args = parser.parse_args()

    os.makedirs(args.output_dir, exist_ok=True)
    pipeline = build_pipeline(args.variant, args.ckpt)
    pipeline.save_pretrained(args.output_dir)
    logger.info("Saved %s pipeline to %s", args.variant, args.output_dir)


if __name__ == "__main__":
    main()
|
||||
@@ -191,6 +191,8 @@ else:
|
||||
[
|
||||
"AceStepTransformer1DModel",
|
||||
"AllegroTransformer3DModel",
|
||||
"AnyFlowFARTransformer3DModel",
|
||||
"AnyFlowTransformer3DModel",
|
||||
"AsymmetricAutoencoderKL",
|
||||
"AttentionBackendName",
|
||||
"AuraFlowTransformer2DModel",
|
||||
@@ -380,6 +382,7 @@ else:
|
||||
"EDMEulerScheduler",
|
||||
"EulerAncestralDiscreteScheduler",
|
||||
"EulerDiscreteScheduler",
|
||||
"FlowMapEulerDiscreteScheduler",
|
||||
"FlowMatchEulerDiscreteScheduler",
|
||||
"FlowMatchHeunDiscreteScheduler",
|
||||
"FlowMatchLCMScheduler",
|
||||
@@ -511,6 +514,8 @@ else:
|
||||
"AnimateDiffSparseControlNetPipeline",
|
||||
"AnimateDiffVideoToVideoControlNetPipeline",
|
||||
"AnimateDiffVideoToVideoPipeline",
|
||||
"AnyFlowFARPipeline",
|
||||
"AnyFlowPipeline",
|
||||
"AudioLDM2Pipeline",
|
||||
"AudioLDM2ProjectionModel",
|
||||
"AudioLDM2UNet2DConditionModel",
|
||||
@@ -1019,6 +1024,8 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
|
||||
from .models import (
|
||||
AceStepTransformer1DModel,
|
||||
AllegroTransformer3DModel,
|
||||
AnyFlowFARTransformer3DModel,
|
||||
AnyFlowTransformer3DModel,
|
||||
AsymmetricAutoencoderKL,
|
||||
AttentionBackendName,
|
||||
AuraFlowTransformer2DModel,
|
||||
@@ -1204,6 +1211,7 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
|
||||
EDMEulerScheduler,
|
||||
EulerAncestralDiscreteScheduler,
|
||||
EulerDiscreteScheduler,
|
||||
FlowMapEulerDiscreteScheduler,
|
||||
FlowMatchEulerDiscreteScheduler,
|
||||
FlowMatchHeunDiscreteScheduler,
|
||||
FlowMatchLCMScheduler,
|
||||
@@ -1316,6 +1324,8 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
|
||||
AnimateDiffSparseControlNetPipeline,
|
||||
AnimateDiffVideoToVideoControlNetPipeline,
|
||||
AnimateDiffVideoToVideoPipeline,
|
||||
AnyFlowFARPipeline,
|
||||
AnyFlowPipeline,
|
||||
AudioLDM2Pipeline,
|
||||
AudioLDM2ProjectionModel,
|
||||
AudioLDM2UNet2DConditionModel,
|
||||
|
||||
@@ -95,6 +95,8 @@ if is_torch_available():
|
||||
_import_structure["transformers.t5_film_transformer"] = ["T5FilmDecoder"]
|
||||
_import_structure["transformers.transformer_2d"] = ["Transformer2DModel"]
|
||||
_import_structure["transformers.transformer_allegro"] = ["AllegroTransformer3DModel"]
|
||||
_import_structure["transformers.transformer_anyflow"] = ["AnyFlowTransformer3DModel"]
|
||||
_import_structure["transformers.transformer_anyflow_far"] = ["AnyFlowFARTransformer3DModel"]
|
||||
_import_structure["transformers.transformer_bria"] = ["BriaTransformer2DModel"]
|
||||
_import_structure["transformers.transformer_bria_fibo"] = ["BriaFiboTransformer2DModel"]
|
||||
_import_structure["transformers.transformer_chroma"] = ["ChromaTransformer2DModel"]
|
||||
@@ -214,6 +216,8 @@ if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
|
||||
from .transformers import (
|
||||
AceStepTransformer1DModel,
|
||||
AllegroTransformer3DModel,
|
||||
AnyFlowFARTransformer3DModel,
|
||||
AnyFlowTransformer3DModel,
|
||||
AuraFlowTransformer2DModel,
|
||||
BriaFiboTransformer2DModel,
|
||||
BriaTransformer2DModel,
|
||||
|
||||
@@ -269,6 +269,10 @@ class T2IAdapter(ModelMixin, ConfigMixin):
|
||||
each representing information extracted at a different scale from the input. The length of the list is
|
||||
determined by the number of downsample blocks in the Adapter, as specified by the `channels` and
|
||||
`num_res_blocks` parameters during initialization.
|
||||
|
||||
Args:
|
||||
x (`torch.Tensor`):
|
||||
The input tensor to process through the adapter model.
|
||||
"""
|
||||
return self.adapter(x)
|
||||
|
||||
|
||||
@@ -166,6 +166,9 @@ class AsymmetricAutoencoderKL(ModelMixin, AutoencoderMixin, ConfigMixin):
|
||||
Whether to sample from the posterior.
|
||||
return_dict (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to return a [`DecoderOutput`] instead of a plain tuple.
|
||||
generator (`torch.Generator`, *optional*):
|
||||
A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make sampling
|
||||
deterministic.
|
||||
"""
|
||||
x = sample
|
||||
posterior = self.encode(x).latent_dist
|
||||
|
||||
@@ -706,6 +706,12 @@ class AutoencoderDC(ModelMixin, AutoencoderMixin, ConfigMixin, FromOriginalModel
|
||||
return DecoderOutput(sample=decoded)
|
||||
|
||||
def forward(self, sample: torch.Tensor, return_dict: bool = True) -> torch.Tensor:
|
||||
r"""
|
||||
Args:
|
||||
sample (`torch.Tensor`): Input sample.
|
||||
return_dict (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to return a [`DecoderOutput`] instead of a plain tuple.
|
||||
"""
|
||||
encoded = self.encode(sample, return_dict=False)[0]
|
||||
decoded = self.decode(encoded, return_dict=False)[0]
|
||||
if not return_dict:
|
||||
|
||||
@@ -424,6 +424,9 @@ class AutoencoderKL(
|
||||
Whether to sample from the posterior.
|
||||
return_dict (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to return a [`DecoderOutput`] instead of a plain tuple.
|
||||
generator (`torch.Generator`, *optional*):
|
||||
A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make sampling
|
||||
deterministic.
|
||||
"""
|
||||
x = sample
|
||||
posterior = self.encode(x).latent_dist
|
||||
|
||||
@@ -1409,6 +1409,17 @@ class AutoencoderKLCogVideoX(ModelMixin, AutoencoderMixin, ConfigMixin, FromOrig
|
||||
return_dict: bool = True,
|
||||
generator: torch.Generator | None = None,
|
||||
) -> torch.Tensor | torch.Tensor:
|
||||
r"""
|
||||
Args:
|
||||
sample (`torch.Tensor`): Input sample.
|
||||
sample_posterior (`bool`, *optional*, defaults to `False`):
|
||||
Whether to sample from the posterior.
|
||||
return_dict (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to return a [`DecoderOutput`] instead of a plain tuple.
|
||||
generator (`torch.Generator`, *optional*):
|
||||
A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make sampling
|
||||
deterministic.
|
||||
"""
|
||||
x = sample
|
||||
posterior = self.encode(x).latent_dist
|
||||
if sample_posterior:
|
||||
|
||||
@@ -1078,6 +1078,17 @@ class AutoencoderKLCosmos(ModelMixin, AutoencoderMixin, ConfigMixin):
|
||||
return_dict: bool = True,
|
||||
generator: torch.Generator | None = None,
|
||||
) -> tuple[torch.Tensor] | DecoderOutput:
|
||||
r"""
|
||||
Args:
|
||||
sample (`torch.Tensor`): Input sample.
|
||||
sample_posterior (`bool`, *optional*, defaults to `False`):
|
||||
Whether to sample from the posterior.
|
||||
return_dict (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to return a [`DecoderOutput`] instead of a plain tuple.
|
||||
generator (`torch.Generator`, *optional*):
|
||||
A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make sampling
|
||||
deterministic.
|
||||
"""
|
||||
x = sample
|
||||
posterior = self.encode(x).latent_dist
|
||||
if sample_posterior:
|
||||
|
||||
@@ -441,6 +441,9 @@ class AutoencoderKLFlux2(
|
||||
Whether to sample from the posterior.
|
||||
return_dict (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to return a [`DecoderOutput`] instead of a plain tuple.
|
||||
generator (`torch.Generator`, *optional*):
|
||||
A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make sampling
|
||||
deterministic.
|
||||
"""
|
||||
x = sample
|
||||
posterior = self.encode(x).latent_dist
|
||||
|
||||
@@ -1061,6 +1061,9 @@ class AutoencoderKLHunyuanVideo(ModelMixin, AutoencoderMixin, ConfigMixin):
|
||||
Whether to sample from the posterior.
|
||||
return_dict (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to return a [`DecoderOutput`] instead of a plain tuple.
|
||||
generator (`torch.Generator`, *optional*):
|
||||
A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make sampling
|
||||
deterministic.
|
||||
"""
|
||||
x = sample
|
||||
posterior = self.encode(x).latent_dist
|
||||
|
||||
@@ -674,8 +674,13 @@ class AutoencoderKLHunyuanImage(ModelMixin, AutoencoderMixin, ConfigMixin, FromO
|
||||
"""
|
||||
Args:
|
||||
sample (`torch.Tensor`): Input sample.
|
||||
sample_posterior (`bool`, *optional*, defaults to `False`):
|
||||
Whether to sample from the posterior.
|
||||
return_dict (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to return a [`DecoderOutput`] instead of a plain tuple.
|
||||
generator (`torch.Generator`, *optional*):
|
||||
A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make sampling
|
||||
deterministic.
|
||||
"""
|
||||
posterior = self.encode(sample).latent_dist
|
||||
if sample_posterior:
|
||||
|
||||
@@ -908,6 +908,9 @@ class AutoencoderKLHunyuanImageRefiner(ModelMixin, AutoencoderMixin, ConfigMixin
|
||||
Whether to sample from the posterior.
|
||||
return_dict (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to return a [`DecoderOutput`] instead of a plain tuple.
|
||||
generator (`torch.Generator`, *optional*):
|
||||
A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make sampling
|
||||
deterministic.
|
||||
"""
|
||||
x = sample
|
||||
posterior = self.encode(x).latent_dist
|
||||
|
||||
@@ -941,6 +941,9 @@ class AutoencoderKLHunyuanVideo15(ModelMixin, AutoencoderMixin, ConfigMixin):
|
||||
Whether to sample from the posterior.
|
||||
return_dict (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to return a [`DecoderOutput`] instead of a plain tuple.
|
||||
generator (`torch.Generator`, *optional*):
|
||||
A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make sampling
|
||||
deterministic.
|
||||
"""
|
||||
x = sample
|
||||
posterior = self.encode(x).latent_dist
|
||||
|
||||
@@ -787,6 +787,9 @@ class AutoencoderKLKVAE(ModelMixin, AutoencoderMixin, ConfigMixin):
|
||||
Whether to sample from the posterior.
|
||||
return_dict (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to return a [`DecoderOutput`] instead of a plain tuple.
|
||||
generator (`torch.Generator`, *optional*):
|
||||
A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make sampling
|
||||
deterministic.
|
||||
"""
|
||||
x = sample
|
||||
posterior = self.encode(x).latent_dist
|
||||
|
||||
@@ -942,6 +942,17 @@ class AutoencoderKLKVAEVideo(ModelMixin, AutoencoderMixin, ConfigMixin, FromOrig
|
||||
return_dict: bool = True,
|
||||
generator: Optional[torch.Generator] = None,
|
||||
) -> Union[DecoderOutput, torch.Tensor]:
|
||||
r"""
|
||||
Args:
|
||||
sample (`torch.Tensor`): Input sample.
|
||||
sample_posterior (`bool`, *optional*, defaults to `False`):
|
||||
Whether to sample from the posterior.
|
||||
return_dict (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to return a [`DecoderOutput`] instead of a plain tuple.
|
||||
generator (`torch.Generator`, *optional*):
|
||||
A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make sampling
|
||||
deterministic.
|
||||
"""
|
||||
x = sample
|
||||
posterior = self.encode(x).latent_dist
|
||||
if sample_posterior:
|
||||
|
||||
@@ -1522,6 +1522,19 @@ class AutoencoderKLLTXVideo(ModelMixin, AutoencoderMixin, ConfigMixin, FromOrigi
|
||||
return_dict: bool = True,
|
||||
generator: torch.Generator | None = None,
|
||||
) -> torch.Tensor | torch.Tensor:
|
||||
r"""
|
||||
Args:
|
||||
sample (`torch.Tensor`): Input sample.
|
||||
temb (`torch.Tensor`, *optional*):
|
||||
Optional timestep embedding tensor used to condition the decoder.
|
||||
sample_posterior (`bool`, *optional*, defaults to `False`):
|
||||
Whether to sample from the posterior.
|
||||
return_dict (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to return a [`DecoderOutput`] instead of a plain tuple.
|
||||
generator (`torch.Generator`, *optional*):
|
||||
A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make sampling
|
||||
deterministic.
|
||||
"""
|
||||
x = sample
|
||||
posterior = self.encode(x).latent_dist
|
||||
if sample_posterior:
|
||||
|
||||
@@ -1542,6 +1542,23 @@ class AutoencoderKLLTX2Video(ModelMixin, AutoencoderMixin, ConfigMixin, FromOrig
|
||||
return_dict: bool = True,
|
||||
generator: torch.Generator | None = None,
|
||||
) -> torch.Tensor | torch.Tensor:
|
||||
r"""
|
||||
Args:
|
||||
sample (`torch.Tensor`): Input sample.
|
||||
temb (`torch.Tensor`, *optional*):
|
||||
Optional timestep embedding tensor used to condition the decoder.
|
||||
sample_posterior (`bool`, *optional*, defaults to `False`):
|
||||
Whether to sample from the posterior.
|
||||
encoder_causal (`bool`, *optional*):
|
||||
Whether the encoder should use causal convolutions. If `None`, falls back to the model default.
|
||||
decoder_causal (`bool`, *optional*):
|
||||
Whether the decoder should use causal convolutions. If `None`, falls back to the model default.
|
||||
return_dict (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to return a [`DecoderOutput`] instead of a plain tuple.
|
||||
generator (`torch.Generator`, *optional*):
|
||||
A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make sampling
|
||||
deterministic.
|
||||
"""
|
||||
x = sample
|
||||
posterior = self.encode(x, causal=encoder_causal).latent_dist
|
||||
if sample_posterior:
|
||||
|
||||
@@ -792,6 +792,17 @@ class AutoencoderKLLTX2Audio(ModelMixin, AutoencoderMixin, ConfigMixin):
|
||||
return_dict: bool = True,
|
||||
generator: torch.Generator | None = None,
|
||||
) -> DecoderOutput | torch.Tensor:
|
||||
r"""
|
||||
Args:
|
||||
sample (`torch.Tensor`): Input sample.
|
||||
sample_posterior (`bool`, *optional*, defaults to `False`):
|
||||
Whether to sample from the posterior.
|
||||
return_dict (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to return a [`DecoderOutput`] instead of a plain tuple.
|
||||
generator (`torch.Generator`, *optional*):
|
||||
A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make sampling
|
||||
deterministic.
|
||||
"""
|
||||
posterior = self.encode(sample).latent_dist
|
||||
if sample_posterior:
|
||||
z = posterior.sample(generator=generator)
|
||||
|
||||
@@ -1057,6 +1057,9 @@ class AutoencoderKLMagvit(ModelMixin, AutoencoderMixin, ConfigMixin):
|
||||
Whether to sample from the posterior.
|
||||
return_dict (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to return a [`DecoderOutput`] instead of a plain tuple.
|
||||
generator (`torch.Generator`, *optional*):
|
||||
A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make sampling
|
||||
deterministic.
|
||||
"""
|
||||
x = sample
|
||||
posterior = self.encode(x).latent_dist
|
||||
|
||||
@@ -1093,6 +1093,17 @@ class AutoencoderKLMochi(ModelMixin, AutoencoderMixin, ConfigMixin):
|
||||
return_dict: bool = True,
|
||||
generator: torch.Generator | None = None,
|
||||
) -> torch.Tensor | torch.Tensor:
|
||||
r"""
|
||||
Args:
|
||||
sample (`torch.Tensor`): Input sample.
|
||||
sample_posterior (`bool`, *optional*, defaults to `False`):
|
||||
Whether to sample from the posterior.
|
||||
return_dict (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to return a [`DecoderOutput`] instead of a plain tuple.
|
||||
generator (`torch.Generator`, *optional*):
|
||||
A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make sampling
|
||||
deterministic.
|
||||
"""
|
||||
x = sample
|
||||
posterior = self.encode(x).latent_dist
|
||||
if sample_posterior:
|
||||
|
||||
@@ -1043,8 +1043,13 @@ class AutoencoderKLQwenImage(ModelMixin, AutoencoderMixin, ConfigMixin, FromOrig
|
||||
"""
|
||||
Args:
|
||||
sample (`torch.Tensor`): Input sample.
|
||||
sample_posterior (`bool`, *optional*, defaults to `False`):
|
||||
Whether to sample from the posterior.
|
||||
return_dict (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to return a [`DecoderOutput`] instead of a plain tuple.
|
||||
generator (`torch.Generator`, *optional*):
|
||||
A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make sampling
|
||||
deterministic.
|
||||
"""
|
||||
x = sample
|
||||
posterior = self.encode(x).latent_dist
|
||||
|
||||
@@ -287,6 +287,11 @@ class AutoencoderKLTemporalDecoder(ModelMixin, AttentionMixin, AutoencoderMixin,
|
||||
Whether to sample from the posterior.
|
||||
return_dict (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to return a [`DecoderOutput`] instead of a plain tuple.
|
||||
generator (`torch.Generator`, *optional*):
|
||||
A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make sampling
|
||||
deterministic.
|
||||
num_frames (`int`, *optional*, defaults to 1):
|
||||
The number of frames to decode per batch.
|
||||
"""
|
||||
x = sample
|
||||
posterior = self.encode(x).latent_dist
|
||||
|
||||
@@ -1416,8 +1416,13 @@ class AutoencoderKLWan(ModelMixin, AutoencoderMixin, ConfigMixin, FromOriginalMo
|
||||
"""
|
||||
Args:
|
||||
sample (`torch.Tensor`): Input sample.
|
||||
sample_posterior (`bool`, *optional*, defaults to `False`):
|
||||
Whether to sample from the posterior.
|
||||
return_dict (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to return a [`DecoderOutput`] instead of a plain tuple.
|
||||
generator (`torch.Generator`, *optional*):
|
||||
A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make sampling
|
||||
deterministic.
|
||||
"""
|
||||
x = sample
|
||||
posterior = self.encode(x).latent_dist
|
||||
|
||||
@@ -393,6 +393,17 @@ class LongCatAudioDiTVae(ModelMixin, AutoencoderMixin, ConfigMixin):
|
||||
return_dict: bool = True,
|
||||
generator: torch.Generator | None = None,
|
||||
) -> LongCatAudioDiTVaeDecoderOutput | tuple[torch.Tensor]:
|
||||
r"""
|
||||
Args:
|
||||
sample (`torch.Tensor`): Input sample.
|
||||
sample_posterior (`bool`, *optional*, defaults to `False`):
|
||||
Whether to sample from the posterior.
|
||||
return_dict (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to return a [`LongCatAudioDiTVaeDecoderOutput`] instead of a plain tuple.
|
||||
generator (`torch.Generator`, *optional*):
|
||||
A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make sampling
|
||||
deterministic.
|
||||
"""
|
||||
latents = self.encode(sample, sample_posterior=sample_posterior, return_dict=True, generator=generator).latents
|
||||
decoded = self.decode(latents, return_dict=True).sample
|
||||
if not return_dict:
|
||||
|
||||
@@ -528,6 +528,9 @@ class AutoencoderOobleck(ModelMixin, AutoencoderMixin, ConfigMixin):
|
||||
Whether to sample from the posterior.
|
||||
return_dict (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to return a [`OobleckDecoderOutput`] instead of a plain tuple.
|
||||
generator (`torch.Generator`, *optional*):
|
||||
A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make sampling
|
||||
deterministic.
|
||||
"""
|
||||
x = sample
|
||||
posterior = self.encode(x).latent_dist
|
||||
|
||||
@@ -682,6 +682,15 @@ class AutoencoderRAE(ModelMixin, AttentionMixin, AutoencoderMixin, ConfigMixin):
|
||||
def forward(
|
||||
self, sample: torch.Tensor, return_dict: bool = True, generator: torch.Generator | None = None
|
||||
) -> DecoderOutput | tuple[torch.Tensor]:
|
||||
r"""
|
||||
Args:
|
||||
sample (`torch.Tensor`): Input sample.
|
||||
return_dict (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to return a [`DecoderOutput`] instead of a plain tuple.
|
||||
generator (`torch.Generator`, *optional*):
|
||||
A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make sampling
|
||||
deterministic.
|
||||
"""
|
||||
latents = self.encode(sample, return_dict=False, generator=generator)[0]
|
||||
decoded = self.decode(latents, return_dict=False)[0]
|
||||
if not return_dict:
|
||||
|
||||
@@ -1440,6 +1440,19 @@ class AutoencoderVidTok(ModelMixin, ConfigMixin):
|
||||
return_dict: bool = True,
|
||||
generator: Optional[torch.Generator] = None,
|
||||
) -> Union[torch.Tensor, DecoderOutput]:
|
||||
r"""
|
||||
Args:
|
||||
sample (`torch.Tensor`): Input sample.
|
||||
sample_posterior (`bool`, *optional*, defaults to `True`):
|
||||
Whether to sample from the posterior.
|
||||
encoder_mode (`bool`, *optional*, defaults to `False`):
|
||||
If `True`, only run the encoder and return the encoded latent without decoding.
|
||||
return_dict (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to return a [`DecoderOutput`] instead of a plain tuple.
|
||||
generator (`torch.Generator`, *optional*):
|
||||
A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make sampling
|
||||
deterministic.
|
||||
"""
|
||||
x = sample
|
||||
res = 1 if self.is_causal else 0
|
||||
if self.is_causal:
|
||||
|
||||
@@ -188,8 +188,12 @@ class FluxControlNetModel(ModelMixin, AttentionMixin, ConfigMixin, PeftAdapterMi
|
||||
from the embeddings of input conditions.
|
||||
timestep ( `torch.LongTensor`):
|
||||
Used to indicate denoising step.
|
||||
block_controlnet_hidden_states: (`list` of `torch.Tensor`):
|
||||
A list of tensors that if specified are added to the residuals of transformer blocks.
|
||||
img_ids (`torch.Tensor`):
|
||||
Positional ids for the image tokens.
|
||||
txt_ids (`torch.Tensor`):
|
||||
Positional ids for the text tokens.
|
||||
guidance (`torch.Tensor`, *optional*):
|
||||
Guidance scale tensor used by guidance-distilled variants of the model.
|
||||
joint_attention_kwargs (`dict`, *optional*):
|
||||
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
|
||||
`self.processor` in
|
||||
@@ -355,6 +359,35 @@ class FluxMultiControlNetModel(ModelMixin):
|
||||
joint_attention_kwargs: dict[str, Any] | None = None,
|
||||
return_dict: bool = True,
|
||||
) -> FluxControlNetOutput | tuple:
|
||||
r"""
|
||||
Args:
|
||||
hidden_states (`torch.FloatTensor` of shape `(batch size, channel, height, width)`):
|
||||
Input `hidden_states`.
|
||||
controlnet_cond (`list` of `torch.Tensor`):
|
||||
A list of conditional input tensors, one per ControlNet.
|
||||
controlnet_mode (`list` of `torch.Tensor`):
|
||||
A list of mode tensors selecting the control type for each ControlNet.
|
||||
conditioning_scale (`list` of `float`):
|
||||
A list of scale factors applied to the ControlNet outputs.
|
||||
encoder_hidden_states (`torch.FloatTensor` of shape `(batch size, sequence_len, embed_dims)`):
|
||||
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
|
||||
pooled_projections (`torch.FloatTensor` of shape `(batch_size, projection_dim)`):
|
||||
Embeddings projected from the embeddings of input conditions.
|
||||
timestep (`torch.LongTensor`):
|
||||
Used to indicate denoising step.
|
||||
img_ids (`torch.Tensor`):
|
||||
Positional ids for the image tokens.
|
||||
txt_ids (`torch.Tensor`):
|
||||
Positional ids for the text tokens.
|
||||
guidance (`torch.Tensor`, *optional*):
|
||||
Guidance scale tensor used by guidance-distilled variants of the model.
|
||||
joint_attention_kwargs (`dict`, *optional*):
|
||||
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
|
||||
`self.processor` in
|
||||
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
|
||||
return_dict (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to return a [`FluxControlNetOutput`] instead of a plain tuple.
|
||||
"""
|
||||
# ControlNet-Union with multiple conditions
|
||||
# only load one ControlNet to save memory
|
||||
if len(self.nets) == 1:
|
||||
|
||||
@@ -286,6 +286,32 @@ class QwenImageMultiControlNetModel(ModelMixin, ConfigMixin, PeftAdapterMixin, F
|
||||
joint_attention_kwargs: dict[str, Any] | None = None,
|
||||
return_dict: bool = True,
|
||||
) -> QwenImageControlNetOutput | tuple:
|
||||
r"""
|
||||
Args:
|
||||
hidden_states (`torch.FloatTensor`):
|
||||
Input `hidden_states`.
|
||||
controlnet_cond (`list` of `torch.Tensor`):
|
||||
A list of conditional input tensors, one per ControlNet.
|
||||
conditioning_scale (`list` of `float`):
|
||||
A list of scale factors applied to the ControlNet outputs.
|
||||
encoder_hidden_states (`torch.Tensor`, *optional*):
|
||||
Conditional embeddings (embeddings computed from the input conditions such as prompts).
|
||||
encoder_hidden_states_mask (`torch.Tensor`, *optional*):
|
||||
Mask for the encoder hidden states.
|
||||
timestep (`torch.LongTensor`, *optional*):
|
||||
Used to indicate denoising step.
|
||||
img_shapes (`list` of `tuple[int, int, int]`, *optional*):
|
||||
Per-sample image shapes used to construct positional encodings.
|
||||
txt_seq_lens (`list` of `int`, *optional*):
|
||||
Deprecated. The text sequence length is now inferred from `encoder_hidden_states` and
|
||||
`encoder_hidden_states_mask`.
|
||||
joint_attention_kwargs (`dict`, *optional*):
|
||||
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
|
||||
`self.processor` in
|
||||
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
|
||||
return_dict (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to return a [`QwenImageControlNetOutput`] instead of a plain tuple.
|
||||
"""
|
||||
if txt_seq_lens is not None:
|
||||
deprecate(
|
||||
"txt_seq_lens",
|
||||
|
||||
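The `txt_seq_lens` handling above follows a common deprecation pattern: keep accepting the argument, warn when it is passed, and derive the value from other inputs. Below is a minimal standard-library sketch of that pattern. It assumes a hypothetical `forward_with_deprecated_arg` function, uses `warnings.warn` as a stand-in for the library's `deprecate` helper (whose remaining arguments are cut off in this hunk), and assumes the length is taken as the count of ones in a 0/1 mask.

```python
import warnings


def forward_with_deprecated_arg(encoder_hidden_states_mask, txt_seq_lens=None):
    """Hypothetical stand-in; not the QwenImage forward method."""
    if txt_seq_lens is not None:
        # Stand-in for the library's `deprecate(...)` call.
        warnings.warn(
            "`txt_seq_lens` is deprecated and ignored; the length is inferred from the mask.",
            FutureWarning,
            stacklevel=2,
        )
    # Assumed inference rule: per-sample text length = number of kept (1) mask entries.
    return [sum(row) for row in encoder_hidden_states_mask]


mask = [[1, 1, 1, 0], [1, 1, 0, 0]]
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    lengths = forward_with_deprecated_arg(mask, txt_seq_lens=[4, 4])
```

Callers that still pass the old argument keep working but see a warning, which is the point of deprecating rather than removing the parameter outright.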
@@ -130,6 +130,30 @@ class SanaControlNetModel(ModelMixin, AttentionMixin, ConfigMixin, PeftAdapterMi
|
||||
attention_kwargs: dict[str, Any] | None = None,
|
||||
return_dict: bool = True,
|
||||
) -> tuple[torch.Tensor, ...] | Transformer2DModelOutput:
|
||||
r"""
|
||||
Args:
|
||||
hidden_states (`torch.Tensor` of shape `(batch_size, channel, height, width)`):
|
||||
Input `hidden_states`.
|
||||
encoder_hidden_states (`torch.Tensor`):
|
||||
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
|
||||
timestep (`torch.LongTensor`):
|
||||
Used to indicate denoising step.
|
||||
controlnet_cond (`torch.Tensor`):
|
||||
The conditional input tensor for the ControlNet.
|
||||
conditioning_scale (`float`, *optional*, defaults to `1.0`):
|
||||
The scale factor for ControlNet outputs.
|
||||
encoder_attention_mask (`torch.Tensor`, *optional*):
|
||||
Attention mask applied to `encoder_hidden_states`.
|
||||
attention_mask (`torch.Tensor`, *optional*):
|
||||
Attention mask applied to `hidden_states`.
|
||||
attention_kwargs (`dict`, *optional*):
|
||||
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
|
||||
`self.processor` in
|
||||
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
|
||||
return_dict (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
|
||||
tuple.
|
||||
"""
|
||||
# ensure attention_mask is a bias, and give it a singleton query_tokens dimension.
|
||||
# we may have done this conversion already, e.g. if we came here via UNet2DConditionModel#forward.
|
||||
# we can tell by counting dims; if ndim == 2: it's a mask rather than a bias.
|
||||
|
||||
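The comments above describe turning a 2-D keep/discard mask into an additive attention bias with a singleton query dimension. Below is a dependency-free sketch of that conversion using nested lists in place of tensors. The helper names and the `-10000.0` constant are assumptions for illustration, not taken from this hunk.

```python
import math


def mask_to_bias(mask, large_negative=-10000.0):
    """Convert a (batch, key_tokens) 0/1 mask into a (batch, 1, key_tokens) additive bias."""
    bias = []
    for row in mask:
        # keep (1) -> 0.0, discard (0) -> large negative value
        bias_row = [0.0 if keep else large_negative for keep in row]
        bias.append([bias_row])  # wrap once to add the singleton query_tokens dimension
    return bias


def softmax(scores):
    """Numerically stable softmax over a flat list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


bias = mask_to_bias([[1, 1, 0]])
# Adding the bias before softmax drives the discarded token's weight to (effectively) zero.
weights = softmax([s + b for s, b in zip([1.0, 2.0, 3.0], bias[0][0])])
```

Because the bias is added to the scores rather than multiplied into them, a bias that has already been converted (3-D) must not be converted a second time, which is why the original code checks the number of dimensions first.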
@@ -402,6 +402,27 @@ class SD3MultiControlNetModel(ModelMixin):
|
||||
joint_attention_kwargs: dict[str, Any] | None = None,
|
||||
return_dict: bool = True,
|
||||
) -> SD3ControlNetOutput | tuple:
|
||||
r"""
|
||||
Args:
|
||||
hidden_states (`torch.Tensor`):
|
||||
Input `hidden_states`.
|
||||
controlnet_cond (`list` of `torch.Tensor`):
|
||||
A list of conditional input tensors, one per ControlNet.
|
||||
conditioning_scale (`list` of `float`):
|
||||
A list of scale factors applied to the ControlNet outputs.
|
||||
pooled_projections (`torch.Tensor`):
|
||||
Embeddings projected from the embeddings of input conditions.
|
||||
encoder_hidden_states (`torch.Tensor`, *optional*):
|
||||
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
|
||||
timestep (`torch.LongTensor`, *optional*):
|
||||
Used to indicate denoising step.
|
||||
joint_attention_kwargs (`dict`, *optional*):
|
||||
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
|
||||
`self.processor` in
|
||||
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
|
||||
return_dict (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to return a [`SD3ControlNetOutput`] instead of a plain tuple.
|
||||
"""
|
||||
for i, (image, scale, controlnet) in enumerate(zip(controlnet_cond, conditioning_scale, self.nets)):
|
||||
block_samples = controlnet(
|
||||
hidden_states=hidden_states,
|
||||
|
||||
@@ -558,8 +558,6 @@ class SparseControlNetModel(ModelMixin, AttentionMixin, ConfigMixin, FromOrigina
|
||||
The conditional input tensor of shape `(batch_size, sequence_length, hidden_size)`.
|
||||
conditioning_scale (`float`, defaults to `1.0`):
|
||||
The scale factor for ControlNet outputs.
|
||||
class_labels (`torch.Tensor`, *optional*, defaults to `None`):
|
||||
Optional class labels for conditioning. Their embeddings will be summed with the timestep embeddings.
|
||||
timestep_cond (`torch.Tensor`, *optional*, defaults to `None`):
|
||||
Additional conditional embeddings for timestep. If provided, the embeddings will be summed with the
|
||||
timestep_embedding passed through the `self.time_embedding` layer to obtain the final timestep
|
||||
@@ -568,8 +566,8 @@ class SparseControlNetModel(ModelMixin, AttentionMixin, ConfigMixin, FromOrigina
|
||||
An attention mask of shape `(batch, key_tokens)` is applied to `encoder_hidden_states`. If `1` the mask
|
||||
is kept, otherwise if `0` it is discarded. Mask will be converted into a bias, which adds large
|
||||
negative values to the attention scores corresponding to "discard" tokens.
|
||||
added_cond_kwargs (`dict`):
|
||||
Additional conditions for the Stable Diffusion XL UNet.
|
||||
conditioning_mask (`torch.Tensor`, *optional*, defaults to `None`):
|
||||
Optional mask indicating which frames in `controlnet_cond` are valid conditioning frames.
|
||||
cross_attention_kwargs (`dict[str]`, *optional*, defaults to `None`):
|
||||
A kwargs dictionary that if specified is passed along to the `AttnProcessor`.
|
||||
guess_mode (`bool`, defaults to `False`):
|
||||
|
||||
@@ -661,6 +661,23 @@ class ZImageControlNetModel(ModelMixin, ConfigMixin, PeftAdapterMixin, FromOrigi
|
||||
patch_size=2,
|
||||
f_patch_size=1,
|
||||
):
|
||||
r"""
|
||||
Args:
|
||||
x (`list` of `torch.Tensor`):
|
||||
A list of input image latents, one tensor per sample in the batch.
|
||||
t (`torch.Tensor`):
|
||||
Timestep tensor used to indicate the denoising step.
|
||||
cap_feats (`list` of `torch.Tensor`):
|
||||
A list of caption (text) feature tensors, one per sample.
|
||||
control_context (`list` of `torch.Tensor`):
|
||||
A list of control conditioning feature tensors, one per sample.
|
||||
conditioning_scale (`float`, *optional*, defaults to `1.0`):
|
||||
The scale factor for ControlNet outputs.
|
||||
patch_size (`int`, *optional*, defaults to `2`):
|
||||
Spatial patch size used to tokenize the latent.
|
||||
f_patch_size (`int`, *optional*, defaults to `1`):
|
||||
Temporal (frame) patch size used to tokenize the latent.
|
||||
"""
|
||||
if (
|
||||
self.t_scale is None
|
||||
or self.t_embedder is None
|
||||
|
||||
@@ -44,6 +44,34 @@ class MultiControlNetModel(ModelMixin):
|
||||
guess_mode: bool = False,
|
||||
return_dict: bool = True,
|
||||
) -> ControlNetOutput | tuple:
|
||||
r"""
|
||||
Args:
|
||||
sample (`torch.Tensor`):
|
||||
The noisy input tensor.
|
||||
timestep (`torch.Tensor`, `float`, or `int`):
|
||||
The number of timesteps to denoise an input.
|
||||
encoder_hidden_states (`torch.Tensor`):
|
||||
The encoder hidden states.
|
||||
controlnet_cond (`list` of `torch.Tensor`):
|
||||
A list of conditional input tensors, one per ControlNet.
|
||||
conditioning_scale (`list` of `float`):
|
||||
A list of scale factors applied to the ControlNet outputs.
|
||||
class_labels (`torch.Tensor`, *optional*):
|
||||
Optional class labels for conditioning.
|
||||
timestep_cond (`torch.Tensor`, *optional*):
|
||||
Additional conditional embeddings for timestep.
|
||||
attention_mask (`torch.Tensor`, *optional*):
|
||||
Attention mask applied to `encoder_hidden_states`.
|
||||
added_cond_kwargs (`dict`, *optional*):
|
||||
Additional conditions for the Stable Diffusion XL UNet.
|
||||
cross_attention_kwargs (`dict`, *optional*):
|
||||
A kwargs dictionary that if specified is passed along to the `AttnProcessor`.
|
||||
guess_mode (`bool`, *optional*, defaults to `False`):
|
||||
In this mode, the ControlNet encoder tries its best to recognize the input content even if you remove
|
||||
all prompts.
|
||||
return_dict (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to return a [`ControlNetOutput`] instead of a plain tuple.
|
||||
"""
|
||||
for i, (image, scale, controlnet) in enumerate(zip(controlnet_cond, conditioning_scale, self.nets)):
|
||||
down_samples, mid_sample = controlnet(
|
||||
sample=sample,
|
||||
|
||||
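The loop above runs each ControlNet on its own conditioning input and scale. How the per-net outputs are merged is cut off in this hunk. The sketch below assumes the residuals are summed element-wise across nets, with plain floats and hypothetical callables (`toy_net`, `run_multi_controlnet`) standing in for tensors and models.

```python
def run_multi_controlnet(nets, controlnet_cond, conditioning_scale):
    """Sum per-block residuals over several ControlNet-like callables (assumed merge rule)."""
    down_total, mid_total = None, None
    for i, (image, scale, net) in enumerate(zip(controlnet_cond, conditioning_scale, nets)):
        down_samples, mid_sample = net(image, scale)
        if i == 0:
            # The first net initializes the accumulators.
            down_total, mid_total = list(down_samples), mid_sample
        else:
            # Later nets add their residuals block by block.
            down_total = [prev + curr for prev, curr in zip(down_total, down_samples)]
            mid_total += mid_sample
    return down_total, mid_total


def toy_net(image, scale):
    # Hypothetical stand-in: two "down" residuals and one "mid" residual, all scaled.
    return [image * scale, 2 * image * scale], 3 * image * scale


down, mid = run_multi_controlnet([toy_net, toy_net], [1.0, 2.0], [1.0, 0.5])
```

Note that `zip` stops at the shortest input, so in this sketch a missing scale or conditioning tensor silently drops a net instead of raising an error.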
@@ -47,6 +47,38 @@ class MultiControlNetUnionModel(ModelMixin):
|
||||
guess_mode: bool = False,
|
||||
return_dict: bool = True,
|
||||
) -> ControlNetOutput | tuple:
|
||||
r"""
|
||||
Args:
|
||||
sample (`torch.Tensor`):
|
||||
The noisy input tensor.
|
||||
timestep (`torch.Tensor`, `float`, or `int`):
|
||||
The number of timesteps to denoise an input.
|
||||
encoder_hidden_states (`torch.Tensor`):
|
||||
The encoder hidden states.
|
||||
controlnet_cond (`list` of `torch.Tensor`):
|
||||
A list of conditional input tensors, one per ControlNet.
|
||||
control_type (`list` of `torch.Tensor`):
|
||||
A list of control type tensors, one per ControlNet, indicating the active control types.
|
||||
control_type_idx (`list` of `list` of `int`):
|
||||
Per-ControlNet list of control type indices corresponding to `controlnet_cond`.
|
||||
conditioning_scale (`list` of `float`):
|
||||
A list of scale factors applied to the ControlNet outputs.
|
||||
class_labels (`torch.Tensor`, *optional*):
|
||||
Optional class labels for conditioning.
|
||||
timestep_cond (`torch.Tensor`, *optional*):
|
||||
Additional conditional embeddings for timestep.
|
||||
attention_mask (`torch.Tensor`, *optional*):
|
||||
Attention mask applied to `encoder_hidden_states`.
|
||||
added_cond_kwargs (`dict`, *optional*):
|
||||
Additional conditions for the Stable Diffusion XL UNet.
|
||||
cross_attention_kwargs (`dict`, *optional*):
|
||||
A kwargs dictionary that if specified is passed along to the `AttnProcessor`.
|
||||
guess_mode (`bool`, *optional*, defaults to `False`):
|
||||
In this mode, the ControlNet encoder tries its best to recognize the input content even if you remove
|
||||
all prompts.
|
||||
return_dict (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to return a [`ControlNetOutput`] instead of a plain tuple.
|
||||
"""
|
||||
down_block_res_samples, mid_block_res_sample = None, None
|
||||
for i, (image, ctype, ctype_idx, scale, controlnet) in enumerate(
|
||||
zip(controlnet_cond, control_type, control_type_idx, conditioning_scale, self.nets)
|
||||
|
||||
@@ -18,6 +18,8 @@ if is_torch_available():
|
||||
from .t5_film_transformer import T5FilmDecoder
|
||||
from .transformer_2d import Transformer2DModel
|
||||
from .transformer_allegro import AllegroTransformer3DModel
|
||||
from .transformer_anyflow import AnyFlowTransformer3DModel
|
||||
from .transformer_anyflow_far import AnyFlowFARTransformer3DModel
|
||||
from .transformer_bria import BriaTransformer2DModel
|
||||
from .transformer_bria_fibo import BriaFiboTransformer2DModel
|
||||
from .transformer_chroma import ChromaTransformer2DModel
|
||||
|
||||
@@ -406,6 +406,28 @@ class AuraFlowTransformer2DModel(ModelMixin, AttentionMixin, ConfigMixin, PeftAd
|
||||
attention_kwargs: dict[str, Any] | None = None,
|
||||
return_dict: bool = True,
|
||||
) -> tuple[torch.Tensor] | Transformer2DModelOutput:
|
||||
"""
|
||||
The [`AuraFlowTransformer2DModel`] forward method.
|
||||
|
||||
Args:
|
||||
hidden_states (`torch.FloatTensor` of shape `(batch size, channel, height, width)`):
|
||||
Input `hidden_states`.
|
||||
encoder_hidden_states (`torch.FloatTensor` of shape `(batch size, sequence_len, embed_dims)`):
|
||||
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
|
||||
timestep (`torch.LongTensor`):
|
||||
Used to indicate denoising step.
|
||||
attention_kwargs (`dict`, *optional*):
|
||||
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
|
||||
`self.processor` in
|
||||
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
|
||||
return_dict (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
|
||||
tuple.
|
||||
|
||||
Returns:
|
||||
If `return_dict` is True, an [`~models.transformer_2d.Transformer2DModelOutput`] is returned, otherwise a
|
||||
`tuple` where the first element is the sample tensor.
|
||||
"""
|
||||
height, width = hidden_states.shape[-2:]
|
||||
|
||||
# Apply patch embedding, timestep embedding, and project the caption embeddings.
|
||||
|
||||
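The `return_dict` convention documented in these docstrings can be illustrated with a small sketch. `OutputSketch` and `forward_with_return_dict` are hypothetical names, and the doubling is a placeholder computation; only the return shape mirrors the documented behavior.

```python
from dataclasses import dataclass


@dataclass
class OutputSketch:
    """Hypothetical stand-in for an output class such as Transformer2DModelOutput."""

    sample: list


def forward_with_return_dict(hidden_states, return_dict=True):
    sample = [2 * x for x in hidden_states]  # placeholder computation
    if not return_dict:
        return (sample,)  # plain tuple; the sample is the first element
    return OutputSketch(sample=sample)


as_object = forward_with_return_dict([1, 2])
as_tuple = forward_with_return_dict([1, 2], return_dict=False)
```

Callers that need the tensor regardless of the flag can read `output.sample` in the first case and `output[0]` in the second.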
@@ -375,6 +375,35 @@ class CogVideoXTransformer3DModel(ModelMixin, AttentionMixin, ConfigMixin, PeftA
|
||||
attention_kwargs: dict[str, Any] | None = None,
|
||||
return_dict: bool = True,
|
||||
) -> tuple[torch.Tensor] | Transformer2DModelOutput:
|
||||
"""
|
||||
The [`CogVideoXTransformer3DModel`] forward method.
|
||||
|
||||
Args:
|
||||
hidden_states (`torch.Tensor` of shape `(batch_size, num_frames, channels, height, width)`):
|
||||
Input `hidden_states`.
|
||||
encoder_hidden_states (`torch.Tensor` of shape `(batch_size, sequence_len, embed_dims)`):
|
||||
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
|
||||
timestep (`torch.LongTensor`):
|
||||
Used to indicate denoising step.
|
||||
timestep_cond (`torch.Tensor`, *optional*):
|
||||
Conditional embeddings for timestep. If provided, the embeddings will be summed with the samples passed
|
||||
through the `self.time_embedding` layer to obtain the final timestep embeddings.
|
||||
ofs (`torch.Tensor`, *optional*):
|
||||
Offset embeddings used in CogVideoX-5b-I2V.
|
||||
image_rotary_emb (`tuple` of `torch.Tensor`, *optional*):
|
||||
Pre-computed rotary positional embeddings.
|
||||
attention_kwargs (`dict`, *optional*):
|
||||
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
|
||||
`self.processor` in
|
||||
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
|
||||
return_dict (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
|
||||
tuple.
|
||||
|
||||
Returns:
|
||||
If `return_dict` is True, an [`~models.transformer_2d.Transformer2DModelOutput`] is returned, otherwise a
|
||||
`tuple` where the first element is the sample tensor.
|
||||
"""
|
||||
batch_size, num_frames, channels, height, width = hidden_states.shape
|
||||
|
||||
# 1. Time embedding
|
||||
|
||||
@@ -633,6 +633,37 @@ class ConsisIDTransformer3DModel(ModelMixin, AttentionMixin, ConfigMixin, PeftAd
|
||||
id_vit_hidden: torch.Tensor | None = None,
|
||||
return_dict: bool = True,
|
||||
) -> tuple[torch.Tensor] | Transformer2DModelOutput:
|
||||
"""
|
||||
The [`ConsisIDTransformer3DModel`] forward method.
|
||||
|
||||
Args:
|
||||
hidden_states (`torch.Tensor` of shape `(batch_size, num_frames, channels, height, width)`):
|
||||
Input `hidden_states`.
|
||||
encoder_hidden_states (`torch.Tensor` of shape `(batch_size, sequence_len, embed_dims)`):
|
||||
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
|
||||
timestep (`torch.LongTensor`):
|
||||
Used to indicate denoising step.
|
||||
timestep_cond (`torch.Tensor`, *optional*):
|
||||
Conditional embeddings for timestep. If provided, the embeddings will be summed with the samples passed
|
||||
through the `self.time_embedding` layer to obtain the final timestep embeddings.
|
||||
image_rotary_emb (`tuple` of `torch.Tensor`, *optional*):
|
||||
Pre-computed rotary positional embeddings.
|
||||
attention_kwargs (`dict`, *optional*):
|
||||
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
|
||||
`self.processor` in
|
||||
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
|
||||
id_cond (`torch.Tensor`, *optional*):
|
||||
The face embedding extracted by the local facial extractor used for identity conditioning.
|
||||
id_vit_hidden (`torch.Tensor`, *optional*):
|
||||
The ViT hidden states extracted from face images used for identity conditioning.
|
||||
return_dict (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
|
||||
tuple.
|
||||
|
||||
Returns:
|
||||
If `return_dict` is True, an [`~models.transformer_2d.Transformer2DModelOutput`] is returned, otherwise a
|
||||
`tuple` where the first element is the sample tensor.
|
||||
"""
|
||||
# fuse clip and insightface
|
||||
valid_face_emb = None
|
||||
if self.is_train_face:
|
||||
|
||||
@@ -392,6 +392,8 @@ class HunyuanDiT2DModel(ModelMixin, AttentionMixin, ConfigMixin):
|
||||
Conditional embedding indicating the style.
|
||||
image_rotary_emb (`torch.Tensor`):
|
||||
The image rotary embeddings to apply on query and key tensors during attention calculation.
|
||||
controlnet_block_samples (`list` of `torch.Tensor`, *optional*):
|
||||
A list of tensors that if specified are added to the residuals of transformer blocks.
|
||||
return_dict (`bool`):
|
||||
Whether to return a dictionary.
|
||||
"""
|
||||
|
||||
@@ -176,7 +176,7 @@ class LatteTransformer3DModel(ModelMixin, ConfigMixin, CacheMixin):
|
||||
The [`LatteTransformer3DModel`] forward method.
|
||||
|
||||
Args:
|
||||
- hidden_states shape `(batch size, channel, num_frame, height, width)`:
+ hidden_states (`torch.Tensor` of shape `(batch size, channel, num_frame, height, width)`):
|
||||
Input `hidden_states`.
|
||||
timestep (`torch.LongTensor`, *optional*):
|
||||
Used to indicate denoising step. Optional timestep to be applied as an embedding in `AdaLayerNorm`.
|
||||
|
||||
@@ -306,6 +306,15 @@ class LuminaNextDiT2DModel(ModelMixin, ConfigMixin):
|
||||
timestep (torch.Tensor): Tensor of diffusion timesteps of shape (N,).
|
||||
encoder_hidden_states (torch.Tensor): Tensor of caption features of shape (N, D).
|
||||
encoder_mask (torch.Tensor): Tensor of caption masks of shape (N, L).
|
||||
image_rotary_emb (`torch.Tensor`):
|
||||
Pre-computed rotary positional embeddings.
|
||||
cross_attention_kwargs (`dict`, *optional*):
|
||||
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
|
||||
`self.processor` in
|
||||
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
|
||||
return_dict (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
|
||||
tuple.
|
||||
"""
|
||||
hidden_states, mask, img_size, image_rotary_emb = self.patch_embedder(hidden_states, image_rotary_emb)
|
||||
image_rotary_emb = image_rotary_emb.to(hidden_states.device)
|
||||
|
||||
@@ -427,6 +427,36 @@ class SanaTransformer2DModel(ModelMixin, AttentionMixin, ConfigMixin, PeftAdapte
|
||||
controlnet_block_samples: tuple[torch.Tensor] | None = None,
|
||||
return_dict: bool = True,
|
||||
) -> tuple[torch.Tensor, ...] | Transformer2DModelOutput:
|
||||
"""
|
||||
The [`SanaTransformer2DModel`] forward method.
|
||||
|
||||
Args:
|
||||
hidden_states (`torch.Tensor` of shape `(batch_size, in_channels, height, width)`):
|
||||
Input `hidden_states`.
|
||||
encoder_hidden_states (`torch.Tensor` of shape `(batch_size, sequence_len, embed_dims)`):
|
||||
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
|
||||
timestep (`torch.LongTensor`):
|
||||
Used to indicate denoising step.
|
||||
guidance (`torch.Tensor`, *optional*):
|
||||
Guidance scale embedding.
|
||||
encoder_attention_mask (`torch.Tensor`, *optional*):
|
||||
Cross-attention mask applied to `encoder_hidden_states`.
|
||||
attention_mask (`torch.Tensor`, *optional*):
|
||||
Self-attention mask applied to `hidden_states`.
|
||||
attention_kwargs (`dict`, *optional*):
|
||||
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
|
||||
`self.processor` in
|
||||
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
|
||||
controlnet_block_samples (`tuple` of `torch.Tensor`, *optional*):
|
||||
A tuple of tensors that, if specified, are added to the residuals of transformer blocks.
|
||||
return_dict (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
|
||||
tuple.
|
||||
|
||||
Returns:
|
||||
If `return_dict` is True, an [`~models.transformer_2d.Transformer2DModelOutput`] is returned, otherwise a
|
||||
`tuple` where the first element is the sample tensor.
|
||||
"""
|
||||
# ensure attention_mask is a bias, and give it a singleton query_tokens dimension.
|
||||
# we may have done this conversion already, e.g. if we came here via UNet2DConditionModel#forward.
|
||||
# we can tell by counting dims; if ndim == 2: it's a mask rather than a bias.
|
||||
|
||||
@@ -90,6 +90,18 @@ class T5FilmDecoder(ModelMixin, ConfigMixin):
|
||||
return mask.unsqueeze(-3)
|
||||
|
||||
def forward(self, encodings_and_masks, decoder_input_tokens, decoder_noise_time):
|
||||
"""
|
||||
The [`T5FilmDecoder`] forward method.
|
||||
|
||||
Args:
|
||||
encodings_and_masks (`list` of `tuple` of `torch.Tensor`):
|
||||
A list of `(encoding, mask)` tuples produced by upstream encoders. The encodings are concatenated and
|
||||
cross-attended to by the decoder.
|
||||
decoder_input_tokens (`torch.Tensor` of shape `(batch_size, seq_length, input_dims)`):
|
||||
Input tokens for the decoder.
|
||||
decoder_noise_time (`torch.Tensor` of shape `(batch_size,)`):
|
||||
Diffusion timesteps in `[0, 1)` used to condition the decoder.
|
||||
"""
|
||||
batch, _, _ = decoder_input_tokens.shape
|
||||
assert decoder_noise_time.shape == (batch,)
|
||||
|
||||
|
||||
@@ -312,6 +312,30 @@ class AllegroTransformer3DModel(ModelMixin, ConfigMixin, CacheMixin):
|
||||
image_rotary_emb: tuple[torch.Tensor, torch.Tensor] | None = None,
|
||||
return_dict: bool = True,
|
||||
):
|
||||
"""
|
||||
The [`AllegroTransformer3DModel`] forward method.
|
||||
|
||||
Args:
|
||||
hidden_states (`torch.Tensor` of shape `(batch_size, num_channels, num_frames, height, width)`):
|
||||
Input `hidden_states`.
|
||||
encoder_hidden_states (`torch.Tensor` of shape `(batch_size, sequence_len, embed_dims)`):
|
||||
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
|
||||
timestep (`torch.LongTensor`):
|
||||
Used to indicate denoising step.
|
||||
attention_mask (`torch.Tensor`, *optional*):
|
||||
Self-attention mask applied to `hidden_states`.
|
||||
encoder_attention_mask (`torch.Tensor`, *optional*):
|
||||
Cross-attention mask applied to `encoder_hidden_states`.
|
||||
image_rotary_emb (`tuple` of `torch.Tensor`, *optional*):
|
||||
Pre-computed rotary positional embeddings.
|
||||
return_dict (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
|
||||
tuple.
|
||||
|
||||
Returns:
|
||||
If `return_dict` is True, an [`~models.transformer_2d.Transformer2DModelOutput`] is returned, otherwise a
|
||||
`tuple` where the first element is the sample tensor.
|
||||
"""
|
||||
batch_size, num_channels, num_frames, height, width = hidden_states.shape
|
||||
p_t = self.config.patch_size_t
|
||||
p = self.config.patch_size
|
||||
|
||||
726
src/diffusers/models/transformers/transformer_anyflow.py
Normal file
@@ -0,0 +1,726 @@
|
||||
# Copyright 2026 The AnyFlow Team, NVIDIA Corp., and The HuggingFace Team. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
#
|
||||
# This file derives from the FAR architecture (Gu et al., 2025, arXiv:2503.19325) and adds the
|
||||
# AnyFlow dual-timestep flow-map embedding (AnyFlowDualTimestepTextImageEmbedding) introduced by
|
||||
# Yuchao Gu, Guian Fang et al. (arXiv:2605.13724). The base 3D DiT structure is adapted from the
|
||||
# v0.35.1 Wan2.1 transformer (transformer_wan.py); upstream Wan has since been refactored, so
|
||||
# this file is intentionally self-contained rather than annotated with `# Copied from`.
|
||||
|
||||
import math
|
||||
from typing import Any, Dict, Optional, Tuple, Union
|
||||
|
||||
import torch
|
||||
import torch.nn as nn
|
||||
import torch.nn.functional as F
|
||||
|
||||
from ...configuration_utils import ConfigMixin, register_to_config
|
||||
from ...loaders import FromOriginalModelMixin, PeftAdapterMixin
|
||||
from ...utils import apply_lora_scale, logging
|
||||
from ..attention import AttentionModuleMixin, FeedForward
|
||||
from ..attention_dispatch import dispatch_attention_fn
|
||||
from ..embeddings import PixArtAlphaTextProjection, TimestepEmbedding, Timesteps, get_1d_rotary_pos_embed
|
||||
from ..modeling_outputs import Transformer2DModelOutput
|
||||
from ..modeling_utils import ModelMixin
|
||||
from ..normalization import FP32LayerNorm, RMSNorm
|
||||
|
||||
|
||||
logger = logging.get_logger(__name__) # pylint: disable=invalid-name
|
||||
|
||||
|
||||
def apply_rotary_emb(hidden_states: torch.Tensor, freqs: torch.Tensor):
|
||||
# MPS / NPU backends do not support complex128 / float64; fall back to float32 on those devices.
|
||||
is_mps = hidden_states.device.type == "mps"
|
||||
is_npu = hidden_states.device.type == "npu"
|
||||
rotary_dtype = torch.float32 if (is_mps or is_npu) else torch.float64
|
||||
x_rotated = torch.view_as_complex(hidden_states.to(rotary_dtype).unflatten(3, (-1, 2)))
|
||||
x_out = torch.view_as_real(x_rotated * freqs).flatten(3, 4)
|
||||
return x_out.type_as(hidden_states)
|
||||
|
||||
|
||||
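`apply_rotary_emb` above views consecutive feature pairs as complex numbers and multiplies them by `freqs`. Below is the same idea in a dependency-free sketch, assuming the frequencies are unit-modulus values of the form e^{iθ}. The `rotate_pairs` helper is hypothetical and not part of this file.

```python
import cmath


def rotate_pairs(features, angles):
    """Rotate consecutive (even, odd) feature pairs by the given angles in radians."""
    assert len(features) == 2 * len(angles)
    out = []
    for i, theta in enumerate(angles):
        # View the pair as one complex number, as torch.view_as_complex does.
        z = complex(features[2 * i], features[2 * i + 1])
        z *= cmath.exp(1j * theta)  # unit-modulus rotation
        out.extend([z.real, z.imag])  # back to interleaved real values
    return out


rotated = rotate_pairs([1.0, 0.0, 0.0, 2.0], [cmath.pi / 2, cmath.pi])
```

Each pair keeps its norm under the rotation, so only the phase encodes position. The float64-to-float32 fallback in the original function affects the precision of this multiplication, not its structure.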
class AnyFlowAttnProcessor:
|
||||
"""
|
||||
Bidirectional self-attention processor for AnyFlow. Routes through
|
||||
:func:`~diffusers.models.attention_dispatch.dispatch_attention_fn` so any SDPA-compatible backend is supported
|
||||
(SDPA, flash-attn, xformers, flex, …). FAR causal generation lives in
|
||||
:class:`~diffusers.models.transformers.transformer_anyflow_far.AnyFlowCausalAttnProcessor`.
|
||||
"""
|
||||
|
||||
_attention_backend = None
|
||||
_parallel_config = None
|
||||
|
||||
def __init__(self):
|
||||
if not hasattr(F, "scaled_dot_product_attention"):
|
||||
raise ImportError(
|
||||
"AnyFlowAttnProcessor requires PyTorch 2.0. To use it, please upgrade PyTorch to 2.0 or higher."
|
||||
)
|
||||
|
||||
def __call__(
|
||||
self,
|
||||
attn: "AnyFlowAttention",
|
||||
hidden_states: torch.Tensor,
|
||||
encoder_hidden_states: Optional[torch.Tensor] = None,
|
||||
attention_mask: Optional[Any] = None,
|
||||
rotary_emb: Optional[Dict[str, torch.Tensor]] = None,
|
||||
) -> torch.Tensor:
|
||||
if encoder_hidden_states is None:
|
||||
encoder_hidden_states = hidden_states
|
||||
|
||||
query = attn.to_q(hidden_states)
|
||||
key = attn.to_k(encoder_hidden_states)
|
||||
value = attn.to_v(encoder_hidden_states)
|
||||
|
||||
if attn.norm_q is not None:
|
||||
query = attn.norm_q(query)
|
||||
if attn.norm_k is not None:
|
||||
key = attn.norm_k(key)
|
||||
|
||||
# Layout (B, H, L, D) for rotary application; transposed to (B, L, H, D) before dispatch.
|
||||
query = query.unflatten(2, (attn.heads, -1)).transpose(1, 2)
|
||||
key = key.unflatten(2, (attn.heads, -1)).transpose(1, 2)
|
||||
value = value.unflatten(2, (attn.heads, -1)).transpose(1, 2)
|
||||
|
||||
if rotary_emb is not None:
|
||||
query = apply_rotary_emb(query, rotary_emb["query"])
|
||||
key = apply_rotary_emb(key, rotary_emb["key"])
|
||||
|
||||
hidden_states = dispatch_attention_fn(
|
||||
query.transpose(1, 2),
|
||||
key.transpose(1, 2),
|
||||
value.transpose(1, 2),
|
||||
attn_mask=attention_mask,
|
||||
dropout_p=0.0,
|
||||
is_causal=False,
|
||||
backend=self._attention_backend,
|
||||
parallel_config=self._parallel_config,
|
||||
)
|
||||
hidden_states = hidden_states.flatten(2, 3)
|
||||
hidden_states = hidden_states.type_as(query)
|
||||
hidden_states = attn.to_out[0](hidden_states)
|
||||
hidden_states = attn.to_out[1](hidden_states)
|
||||
return hidden_states


class AnyFlowCrossAttnProcessor:
    """
    Cross-attention processor for AnyFlow. Always uses the dispatched SDPA-compatible backend; no rotary embedding or
    KV cache is applied to the text→video cross-attention path.
    """

    _attention_backend = None
    _parallel_config = None

    def __init__(self):
        if not hasattr(F, "scaled_dot_product_attention"):
            raise ImportError(
                "AnyFlowCrossAttnProcessor requires PyTorch 2.0. To use it, please upgrade PyTorch to 2.0 or higher."
            )

    def __call__(
        self,
        attn: "AnyFlowAttention",
        hidden_states: torch.Tensor,
        encoder_hidden_states: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        query = attn.to_q(hidden_states)
        key = attn.to_k(encoder_hidden_states)
        value = attn.to_v(encoder_hidden_states)

        if attn.norm_q is not None:
            query = attn.norm_q(query)
        if attn.norm_k is not None:
            key = attn.norm_k(key)

        # (B, L, H, D) layout for dispatch_attention_fn.
        query = query.unflatten(2, (attn.heads, -1))
        key = key.unflatten(2, (attn.heads, -1))
        value = value.unflatten(2, (attn.heads, -1))

        hidden_states = dispatch_attention_fn(
            query,
            key,
            value,
            attn_mask=attention_mask,
            dropout_p=0.0,
            is_causal=False,
            backend=self._attention_backend,
            parallel_config=self._parallel_config,
        )
        hidden_states = hidden_states.flatten(2, 3)
        hidden_states = hidden_states.type_as(query)
        hidden_states = attn.to_out[0](hidden_states)
        hidden_states = attn.to_out[1](hidden_states)
        return hidden_states


class AnyFlowAttention(torch.nn.Module, AttentionModuleMixin):
    """
    Attention module used by :class:`AnyFlowTransformerBlock`. Layout matches the legacy
    :class:`~diffusers.models.attention_processor.Attention` so existing AnyFlow checkpoints load bit-exactly into this
    class.
    """

    _default_processor_cls = AnyFlowAttnProcessor
    _available_processors = [AnyFlowAttnProcessor, AnyFlowCrossAttnProcessor]

    def __init__(
        self,
        dim: int,
        heads: int,
        dim_head: int,
        eps: float = 1e-6,
        processor: Optional[Any] = None,
    ):
        super().__init__()
        self.heads = heads
        self.inner_dim = heads * dim_head

        self.to_q = torch.nn.Linear(dim, self.inner_dim, bias=True)
        self.to_k = torch.nn.Linear(dim, self.inner_dim, bias=True)
        self.to_v = torch.nn.Linear(dim, self.inner_dim, bias=True)
        self.to_out = torch.nn.ModuleList(
            [
                torch.nn.Linear(self.inner_dim, dim, bias=True),
                torch.nn.Dropout(0.0),
            ]
        )
        # ``rms_norm_across_heads`` per-axis: normalize Q and K across the entire ``heads * dim_head``
        # channel axis. We use diffusers' RMSNorm (rather than ``torch.nn.RMSNorm``) so the numerics
        # match the legacy Attention class that produced the released checkpoints.
        self.norm_q = RMSNorm(self.inner_dim, eps=eps)
        self.norm_k = RMSNorm(self.inner_dim, eps=eps)

        self.set_processor(processor if processor is not None else self._default_processor_cls())

    def forward(self, hidden_states: torch.Tensor, **kwargs) -> torch.Tensor:
        return self.processor(self, hidden_states, **kwargs)


class AnyFlowImageEmbedding(torch.nn.Module):
    def __init__(self, in_features: int, out_features: int):
        super().__init__()

        self.norm1 = FP32LayerNorm(in_features)
        self.ff = FeedForward(in_features, out_features, mult=1, activation_fn="gelu")
        self.norm2 = FP32LayerNorm(out_features)

    def forward(self, encoder_hidden_states_image: torch.Tensor) -> torch.Tensor:
        hidden_states = self.norm1(encoder_hidden_states_image)
        hidden_states = self.ff(hidden_states)
        hidden_states = self.norm2(hidden_states)
        return hidden_states


class AnyFlowDualTimestepTextImageEmbedding(nn.Module):
    def __init__(
        self,
        dim: int,
        gate_value: float,
        deltatime_type: str,
        time_freq_dim: int,
        time_proj_dim: int,
        text_embed_dim: int,
        image_embed_dim: Optional[int] = None,
    ):
        super().__init__()

        self.timesteps_proj = Timesteps(num_channels=time_freq_dim, flip_sin_to_cos=True, downscale_freq_shift=0)
        self.time_embedder = TimestepEmbedding(in_channels=time_freq_dim, time_embed_dim=dim)
        self.delta_embedder = TimestepEmbedding(in_channels=time_freq_dim, time_embed_dim=dim)
        self.act_fn = nn.SiLU()
        self.time_proj = nn.Linear(dim, time_proj_dim)
        self.text_embedder = PixArtAlphaTextProjection(text_embed_dim, dim, act_fn="gelu_tanh")

        self.image_embedder = None
        if image_embed_dim is not None:
            self.image_embedder = AnyFlowImageEmbedding(image_embed_dim, dim)

        self.register_buffer("delta_emb_gate", torch.tensor([gate_value], dtype=torch.float32), persistent=False)
        self.deltatime_type = deltatime_type

    def forward_timestep(
        self, timestep: torch.Tensor, delta_timestep: torch.Tensor, encoder_hidden_states, token_per_frame
    ):
        batch_size, num_frames = timestep.shape
        timestep = timestep.reshape(-1)
        delta_timestep = delta_timestep.reshape(-1)

        timestep = self.timesteps_proj(timestep)

        time_embedder_dtype = next(iter(self.time_embedder.parameters())).dtype
        if timestep.dtype != time_embedder_dtype and time_embedder_dtype != torch.int8:
            timestep = timestep.to(time_embedder_dtype)
        temb = self.time_embedder(timestep).type_as(encoder_hidden_states)

        delta_timestep = self.timesteps_proj(delta_timestep)

        delta_embedder_dtype = next(iter(self.delta_embedder.parameters())).dtype
        if delta_timestep.dtype != delta_embedder_dtype and delta_embedder_dtype != torch.int8:
            delta_timestep = delta_timestep.to(delta_embedder_dtype)
        delta_emb = self.delta_embedder(delta_timestep).type_as(encoder_hidden_states)

        gate = self.delta_emb_gate.to(delta_embedder_dtype)

        rt_emb = (1 - gate) * temb + gate * delta_emb
        timestep_proj = self.time_proj(self.act_fn(rt_emb))

        rt_emb = rt_emb.unflatten(0, (batch_size, num_frames)).repeat_interleave(token_per_frame, dim=1)
        timestep_proj = timestep_proj.unflatten(0, (batch_size, num_frames)).repeat_interleave(token_per_frame, dim=1)

        return rt_emb, timestep_proj

    def forward(
        self,
        timestep: torch.Tensor,
        r_timestep: torch.Tensor,
        encoder_hidden_states: torch.Tensor,
        encoder_hidden_states_image: Optional[torch.Tensor] = None,
        layout_cfg=None,
    ):
        if self.deltatime_type == "r":
            delta_timestep = r_timestep
        elif self.deltatime_type == "t-r":
            delta_timestep = timestep - r_timestep
        else:
            raise NotImplementedError(
                f"Unsupported deltatime_type: {self.deltatime_type!r}. Expected 'r' or 't-r'."
            )

        timestep, timestep_proj = self.forward_timestep(
            timestep, delta_timestep, encoder_hidden_states, layout_cfg["full_token_per_frame"]
        )

        encoder_hidden_states = self.text_embedder(encoder_hidden_states)
        if encoder_hidden_states_image is not None:
            encoder_hidden_states_image = self.image_embedder(encoder_hidden_states_image)

        return timestep, timestep_proj, encoder_hidden_states, encoder_hidden_states_image
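The dual-timestep conditioning in `AnyFlowDualTimestepTextImageEmbedding` reduces to two small pieces of arithmetic: picking the second conditioning value from the `(t, r)` pair, and blending the two embeddings with a fixed gate. The sketch below reproduces that arithmetic in plain Python on toy lists so it can be checked by hand. The helper names `resolve_delta` and `mix_embeddings` and the toy values are illustrative assumptions, not part of the module; the real code operates on tensors produced by `TimestepEmbedding`.

```python
def resolve_delta(t: float, r: float, deltatime_type: str) -> float:
    """Pick the second conditioning value from the (t, r) pair."""
    if deltatime_type == "r":
        return r
    if deltatime_type == "t-r":
        return t - r
    raise NotImplementedError(f"unsupported deltatime_type: {deltatime_type!r}")


def mix_embeddings(temb, delta_emb, gate):
    """Element-wise blend: (1 - gate) * temb + gate * delta_emb."""
    return [(1 - gate) * a + gate * b for a, b in zip(temb, delta_emb)]


# Hypothetical toy embeddings standing in for the TimestepEmbedding outputs.
temb = [1.0, 2.0, 4.0]
delta_emb = [5.0, 6.0, 8.0]
mixed = mix_embeddings(temb, delta_emb, gate=0.25)
```

With `gate=0.25`, the default `gate_value` of the model config, three quarters of the blended embedding comes from the source-timestep embedding and one quarter from the delta embedding.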


class AnyFlowRotaryPosEmbed(nn.Module):
    """Rotary positional embedding for the bidirectional AnyFlow transformer.

    The FAR causal variant lives in :mod:`~diffusers.models.transformers.transformer_anyflow_far` and additionally
    handles compressed-frame chunks; this bidi class produces frequencies for the single full-resolution token grid
    only.
    """

    def __init__(
        self,
        attention_head_dim: int,
        patch_size: Tuple[int, int, int],
        max_seq_len: int,
        theta: float = 10000.0,
    ):
        super().__init__()

        self.attention_head_dim = attention_head_dim
        self.patch_size = patch_size
        self.max_seq_len = max_seq_len
        self.theta = theta

        # Frequency table is lazily built per-device in ``_build_freqs``: MPS / NPU don't support
        # complex128, so we downcast to complex64 there.
        self._freqs_cache: Optional[Tuple[Any, torch.Tensor]] = None

    def _build_freqs(self, device: torch.device) -> torch.Tensor:
        cache_key = (device.type, str(device))
        if self._freqs_cache is not None and self._freqs_cache[0] == cache_key:
            return self._freqs_cache[1]

        is_mps = device.type == "mps"
        is_npu = device.type == "npu"
        freqs_dtype = torch.float32 if (is_mps or is_npu) else torch.float64

        h_dim = w_dim = 2 * (self.attention_head_dim // 6)
        t_dim = self.attention_head_dim - h_dim - w_dim

        freqs_list = []
        for dim in (t_dim, h_dim, w_dim):
            f = get_1d_rotary_pos_embed(
                dim,
                self.max_seq_len,
                self.theta,
                use_real=False,
                repeat_interleave_real=False,
                freqs_dtype=freqs_dtype,
            )
            freqs_list.append(f.to(device))
        freqs = torch.cat(freqs_list, dim=1)
        self._freqs_cache = (cache_key, freqs)
        return freqs

    def _forward_full_frame(self, num_frames, height, width, device) -> torch.Tensor:
        ppf, pph, ppw = num_frames, height, width

        freqs_full = self._build_freqs(device)
        if min(ppf, pph, ppw) <= 0:
            freq_channels = self.attention_head_dim // 2
            return torch.empty((ppf, pph, ppw, freq_channels), dtype=freqs_full.dtype, device=device)

        freqs = freqs_full.split_with_sizes(
            [
                self.attention_head_dim // 2 - 2 * (self.attention_head_dim // 6),
                self.attention_head_dim // 6,
                self.attention_head_dim // 6,
            ],
            dim=1,
        )

        freqs_f = freqs[0][:ppf].view(ppf, 1, 1, -1).expand(ppf, pph, ppw, -1)
        freqs_h = freqs[1][:pph].view(1, pph, 1, -1).expand(ppf, pph, ppw, -1)
        freqs_w = freqs[2][:ppw].view(1, 1, ppw, -1).expand(ppf, pph, ppw, -1)
        freqs = torch.cat([freqs_f, freqs_h, freqs_w], dim=-1)
        return freqs

    def forward(self, layout_cfg, device):
        freqs = self._forward_full_frame(
            num_frames=layout_cfg["total_frames"],
            height=layout_cfg["full_frame_shape"][0],
            width=layout_cfg["full_frame_shape"][1],
            device=device,
        )
        freqs = freqs.flatten(start_dim=0, end_dim=2)
        freqs = freqs[None, None, ...]
        return {"query": freqs, "key": freqs}
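The rotary table splits each head's channels across the temporal, height, and width axes in two places: once in real-valued channels when the table is built, and once in complex-valued channels (half as many) when it is sliced per axis. The following plain-Python sketch checks that the two splits agree for a given head size. The function names `rope_axis_dims` and `rope_complex_split` are hypothetical helpers written for this illustration; they copy the integer arithmetic from `_build_freqs` and `_forward_full_frame` and assume an even `attention_head_dim`.

```python
def rope_axis_dims(attention_head_dim: int):
    """Real-valued channel counts (t_dim, h_dim, w_dim) assigned to each axis."""
    h_dim = w_dim = 2 * (attention_head_dim // 6)
    t_dim = attention_head_dim - h_dim - w_dim
    return t_dim, h_dim, w_dim


def rope_complex_split(attention_head_dim: int):
    """Complex-frequency channel counts used to split the concatenated table."""
    half = attention_head_dim // 2
    sixth = attention_head_dim // 6
    return [half - 2 * sixth, sixth, sixth]


# 128 is the default attention_head_dim of the model config.
t_dim, h_dim, w_dim = rope_axis_dims(128)
split = rope_complex_split(128)
```

For a head size of 128 the temporal axis receives the two leftover channels, so the split is 44 / 42 / 42 real channels, which corresponds to 22 / 21 / 21 complex frequencies.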


class AnyFlowTransformerBlock(nn.Module):
    """AnyFlow transformer block.

    The self-attention processor is chosen at construction by ``is_causal``: the bidirectional transformer passes
    ``is_causal=False`` (the default), the FAR causal transformer passes ``is_causal=True``. The forward pass is
    identical in both modes; only the processor differs, so all causal-specific machinery (BlockMask, KV cache) lives
    inside the processor.
    """

    def __init__(
        self,
        dim: int,
        ffn_dim: int,
        num_heads: int,
        cross_attn_norm: bool = False,
        eps: float = 1e-6,
        is_causal: bool = False,
    ):
        super().__init__()

        self.is_causal = is_causal

        # 1. Self-attention. The causal processor lives in the FAR sibling module; lazy-import to
        # avoid a circular import at module load time.
        if is_causal:
            from .transformer_anyflow_far import AnyFlowCausalAttnProcessor

            self_attn_processor = AnyFlowCausalAttnProcessor()
        else:
            self_attn_processor = AnyFlowAttnProcessor()

        self.norm1 = FP32LayerNorm(dim, eps, elementwise_affine=False)
        self.attn1 = AnyFlowAttention(
            dim=dim,
            heads=num_heads,
            dim_head=dim // num_heads,
            eps=eps,
            processor=self_attn_processor,
        )

        # 2. Cross-attention
        self.attn2 = AnyFlowAttention(
            dim=dim,
            heads=num_heads,
            dim_head=dim // num_heads,
            eps=eps,
            processor=AnyFlowCrossAttnProcessor(),
        )
        self.norm2 = FP32LayerNorm(dim, eps, elementwise_affine=True) if cross_attn_norm else nn.Identity()

        # 3. Feed-forward
        self.ffn = FeedForward(dim, inner_dim=ffn_dim, activation_fn="gelu-approximate")
        self.norm3 = FP32LayerNorm(dim, eps, elementwise_affine=False)

        self.scale_shift_table = nn.Parameter(torch.randn(1, 6, dim) / dim**0.5)

    def forward(
        self,
        hidden_states: torch.Tensor,
        encoder_hidden_states: torch.Tensor,
        temb: torch.Tensor,
        rotary_emb: torch.Tensor,
        attention_mask: torch.Tensor,
        kv_cache=None,
        kv_cache_flag=None,
    ) -> torch.Tensor:
        shift_msa, scale_msa, gate_msa, c_shift_msa, c_scale_msa, c_gate_msa = (
            self.scale_shift_table + temb.float()
        ).chunk(6, dim=2)
        shift_msa, scale_msa, gate_msa, c_shift_msa, c_scale_msa, c_gate_msa = (
            shift_msa.squeeze(2),
            scale_msa.squeeze(2),
            gate_msa.squeeze(2),
            c_shift_msa.squeeze(2),
            c_scale_msa.squeeze(2),
            c_gate_msa.squeeze(2),
        )  # noqa: E501

        # 1. Self-attention
        norm_hidden_states = (self.norm1(hidden_states.float()) * (1 + scale_msa) + shift_msa).type_as(hidden_states)
        attn1_kwargs = {
            "hidden_states": norm_hidden_states,
            "rotary_emb": rotary_emb,
            "attention_mask": attention_mask,
        }
        # KV cache kwargs are only consumed by the FAR causal processor; the bidi processor
        # doesn't accept them, so we forward them only when they're actually populated.
        if kv_cache is not None:
            attn1_kwargs["kv_cache"] = kv_cache
            attn1_kwargs["kv_cache_flag"] = kv_cache_flag
        attn_output = self.attn1(**attn1_kwargs)
        hidden_states = (hidden_states.float() + attn_output * gate_msa).type_as(hidden_states)

        # 2. Cross-attention
        norm_hidden_states = self.norm2(hidden_states.float()).type_as(hidden_states)
        attn_output = self.attn2(hidden_states=norm_hidden_states, encoder_hidden_states=encoder_hidden_states)
        hidden_states = hidden_states + attn_output

        # 3. Feed-forward
        norm_hidden_states = (self.norm3(hidden_states.float()) * (1 + c_scale_msa) + c_shift_msa).type_as(
            hidden_states
        )
        ff_output = self.ffn(norm_hidden_states)
        hidden_states = (hidden_states.float() + ff_output.float() * c_gate_msa).type_as(hidden_states)

        return hidden_states
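The block applies the same modulation pattern to its self-attention and feed-forward branches: scale and shift the normalized activations, run the branch, then add the gated branch output back to the residual stream. The cross-attention branch is added without a gate. A minimal per-element sketch in plain Python follows, using toy lists in place of tensors; `modulate` and `gated_residual` are hypothetical helper names introduced only for this illustration.

```python
def modulate(normed, scale, shift):
    """Apply normed * (1 + scale) + shift element-wise."""
    return [n * (1 + s) + b for n, s, b in zip(normed, scale, shift)]


def gated_residual(residual, branch_out, gate):
    """Add the gated branch output to the residual stream element-wise."""
    return [x + o * g for x, o, g in zip(residual, branch_out, gate)]


# Toy values: two channels of one token.
modulated = modulate([1.0, -2.0], scale=[0.5, 0.0], shift=[0.25, 1.0])
updated = gated_residual([1.0, 1.0], branch_out=[2.0, 4.0], gate=[0.5, 0.0])
```

A gate of zero leaves the residual untouched, which is why the second channel of `updated` keeps its input value.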


class AnyFlowTransformer3DModel(ModelMixin, ConfigMixin, PeftAdapterMixin, FromOriginalModelMixin):
    r"""
    Bidirectional 3D Transformer for AnyFlow flow-map sampling.

    The architecture is the v0.35.1 Wan2.1 3D DiT backbone with one structural change: the timestep embedder is
    replaced by ``AnyFlowDualTimestepTextImageEmbedding`` so that every forward call conditions on both the source
    timestep ``t`` and the target timestep ``r``. This is the embedding required to learn the flow map
    :math:`\Phi_{r\leftarrow t}` introduced in [AnyFlow](https://huggingface.co/papers/2605.13724) by Yuchao Gu, Guian
    Fang et al.

    For frame-level autoregressive (FAR causal) generation, use ``AnyFlowFARTransformer3DModel`` instead; that variant
    adds the FAR causal block-mask and a compressed-frame patch embedding on top of the same backbone.

    Args:
        patch_size (`Tuple[int, int, int]`, defaults to `(1, 2, 2)`):
            3D patch dimensions for video embedding (t_patch, h_patch, w_patch).
        num_attention_heads (`int`, defaults to `40`):
            Number of attention heads.
        attention_head_dim (`int`, defaults to `128`):
            The number of channels in each head.
        in_channels (`int`, defaults to `16`):
            The number of channels in the input latent.
        out_channels (`int`, defaults to `16`):
            The number of channels in the output latent.
        text_dim (`int`, defaults to `4096`):
            Input dimension for text embeddings (UMT5).
        freq_dim (`int`, defaults to `256`):
            Dimension for sinusoidal time embeddings.
        ffn_dim (`int`, defaults to `13824`):
            Intermediate dimension in feed-forward network.
        num_layers (`int`, defaults to `40`):
            Number of transformer blocks.
        cross_attn_norm (`bool`, defaults to `True`):
            Enable cross-attention normalization.
        eps (`float`, defaults to `1e-6`):
            Epsilon for normalization layers.
        image_dim (`Optional[int]`, *optional*, defaults to `None`):
            Image embedding dimension for I2V conditioning (`1280` for the original Wan2.1-I2V model).
        rope_max_seq_len (`int`, defaults to `1024`):
            Maximum sequence length used to precompute rotary position frequencies.
        gate_value (`float`, defaults to `0.25`):
            Mixing gate between source-timestep and delta-timestep embeddings (the AnyFlow paper's :math:`g` parameter,
            fixed at 0.25 in stage-1 distillation).
        deltatime_type (`str`, defaults to `'r'`):
            Either ``"r"`` (delta is the target timestep) or ``"t-r"`` (delta is the interval ``t - r``).
    """

    _supports_gradient_checkpointing = True
    _skip_layerwise_casting_patterns = ["patch_embedding", "condition_embedder", "norm"]
    _no_split_modules = ["AnyFlowTransformerBlock"]
    _keep_in_fp32_modules = ["time_embedder", "scale_shift_table", "norm1", "norm2", "norm3"]
    _repeated_blocks = ["AnyFlowTransformerBlock"]

    @register_to_config
    def __init__(
        self,
        patch_size: Tuple[int, int, int] = (1, 2, 2),
        num_attention_heads: int = 40,
        attention_head_dim: int = 128,
        in_channels: int = 16,
        out_channels: int = 16,
        text_dim: int = 4096,
        freq_dim: int = 256,
        ffn_dim: int = 13824,
        num_layers: int = 40,
        cross_attn_norm: bool = True,
        eps: float = 1e-6,
        image_dim: Optional[int] = None,
        rope_max_seq_len: int = 1024,
        gate_value: float = 0.25,
        deltatime_type: str = "r",
    ) -> None:
        super().__init__()

        inner_dim = num_attention_heads * attention_head_dim
        out_channels = out_channels or in_channels

        # 1. Patch & position embedding (full-frame only).
        self.rope = AnyFlowRotaryPosEmbed(attention_head_dim, patch_size, rope_max_seq_len)
        self.patch_embedding = nn.Conv3d(in_channels, inner_dim, kernel_size=patch_size, stride=patch_size)

        # 2. Condition embedding (always dual-timestep for AnyFlow distilled checkpoints).
        self.condition_embedder = AnyFlowDualTimestepTextImageEmbedding(
            dim=inner_dim,
            gate_value=gate_value,
            deltatime_type=deltatime_type,
            time_freq_dim=freq_dim,
            time_proj_dim=inner_dim * 6,
            text_embed_dim=text_dim,
            image_embed_dim=image_dim,
        )

        # 3. Transformer blocks
        self.blocks = nn.ModuleList(
            [
                AnyFlowTransformerBlock(inner_dim, ffn_dim, num_attention_heads, cross_attn_norm, eps)
                for _ in range(num_layers)
            ]
        )

        # 4. Output norm & projection
        self.norm_out = FP32LayerNorm(inner_dim, eps, elementwise_affine=False)
        self.proj_out = nn.Linear(inner_dim, out_channels * math.prod(patch_size))
        self.scale_shift_table = nn.Parameter(torch.randn(1, 2, inner_dim) / inner_dim**0.5)

        self.gradient_checkpointing = False

    def _unpack_latent_sequence(self, latents, num_frames, height, width, patch_size):
        batch_size, num_patches, channels = latents.shape
        height, width = height // patch_size, width // patch_size

        latents = latents.view(
            batch_size * num_frames, height, width, patch_size, patch_size, channels // (patch_size * patch_size)
        )
        latents = latents.permute(0, 5, 1, 3, 2, 4)
        latents = latents.reshape(
            batch_size, num_frames, channels // (patch_size * patch_size), height * patch_size, width * patch_size
        )
        return latents

    @apply_lora_scale("attention_kwargs")
    def forward(
        self,
        hidden_states: torch.Tensor,
        timestep: torch.Tensor,
        r_timestep: torch.Tensor,
        encoder_hidden_states: torch.Tensor,
        encoder_hidden_states_image: Optional[torch.Tensor] = None,
        attention_kwargs: Optional[Dict[str, Any]] = None,
        return_dict: bool = True,
    ) -> Union[Transformer2DModelOutput, Tuple]:
        """
        Bidirectional flow-map forward pass. ``hidden_states`` is laid out as ``(B, F, C, H, W)`` (per-frame latents).
        The input is patchified with the standard ``patch_embedding`` (kernel = stride = ``patch_size``) and denoised
        with global bidirectional self-attention over the resulting flat token sequence.

        Args:
            hidden_states (`torch.Tensor` of shape `(batch_size, num_frames, num_channels, height, width)`):
                Input video latents.
            timestep (`torch.Tensor`):
                Source (noisier) flow-map timestep `t`.
            r_timestep (`torch.Tensor`):
                Target (cleaner) flow-map timestep `r`; defines the destination of the flow-map step.
            encoder_hidden_states (`torch.Tensor` of shape `(batch_size, sequence_len, embed_dims)`):
                Text-conditioning embeddings.
            encoder_hidden_states_image (`torch.Tensor`, *optional*):
                Image-conditioning embeddings; concatenated before the text tokens when provided.
            attention_kwargs (`dict`, *optional*):
                Kwargs forwarded to the `AttentionProcessor` as defined under `self.processor` in
                [diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
            return_dict (`bool`, *optional*, defaults to `True`):
                Whether to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain tuple.

        Returns:
            [`~models.transformer_2d.Transformer2DModelOutput`] if `return_dict` is True, otherwise a `tuple` whose
            first element is the predicted velocity tensor.
        """
        hidden_states = hidden_states.permute(0, 2, 1, 3, 4)
        batch_size, num_channels, num_frames, height, width = hidden_states.shape

        full_token_per_frame = (height * width) // (self.config.patch_size[1] * self.config.patch_size[2])

        layout_cfg = {
            "total_frames": num_frames,
            "full_frame_shape": (height // self.config.patch_size[1], width // self.config.patch_size[2]),
            "full_token_per_frame": full_token_per_frame,
        }

        rotary_emb = self.rope(layout_cfg=layout_cfg, device=hidden_states.device)

        hidden_states = self.patch_embedding(hidden_states)
        hidden_states = hidden_states.flatten(2).transpose(1, 2)

        temb, timestep_proj, encoder_hidden_states, encoder_hidden_states_image = self.condition_embedder(
            timestep,
            r_timestep,
            encoder_hidden_states,
            encoder_hidden_states_image,
            layout_cfg=layout_cfg,
        )
        timestep_proj = timestep_proj.unflatten(2, (6, -1))

        attention_mask = None

        if encoder_hidden_states_image is not None:
            encoder_hidden_states = torch.concat([encoder_hidden_states_image, encoder_hidden_states], dim=1)

        if torch.is_grad_enabled() and self.gradient_checkpointing:
            for block in self.blocks:
                hidden_states = self._gradient_checkpointing_func(
                    block, hidden_states, encoder_hidden_states, timestep_proj, rotary_emb, attention_mask
                )
        else:
            for block in self.blocks:
                hidden_states = block(hidden_states, encoder_hidden_states, timestep_proj, rotary_emb, attention_mask)

        # Output norm, projection & unpatchify.
        # `temb` is always 3D from `condition_embedder.forward()` (broadcast over total tokens).
        shift, scale = (self.scale_shift_table.unsqueeze(0) + temb.unsqueeze(2)).chunk(2, dim=2)
        shift = shift.squeeze(2)
        scale = scale.squeeze(2)

        # Move shift/scale to hidden_states' device for multi-GPU accelerate inference.
        shift = shift.to(hidden_states.device)
        scale = scale.to(hidden_states.device)

        hidden_states = (self.norm_out(hidden_states.float()) * (1 + scale) + shift).type_as(hidden_states)
        hidden_states = self.proj_out(hidden_states)

        output = self._unpack_latent_sequence(
            hidden_states,
            num_frames=layout_cfg["total_frames"],
            height=height,
            width=width,
            patch_size=self.config.patch_size[1],
        )

        if not return_dict:
            return (output,)

        return Transformer2DModelOutput(sample=output)
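Two bookkeeping steps in the forward pass are easy to get wrong when adapting this model: the token layout handed to the rotary and condition embedders, and the index mapping that `_unpack_latent_sequence` applies when turning projected tokens back into latent pixels. The sketch below restates both in plain Python. The names `build_layout` and `unpack_index` and the latent size are assumptions made for illustration, and the sketch assumes a temporal patch size of 1 and a square spatial patch, which matches how the forward pass passes only `patch_size[1]` to the unpack step.

```python
def build_layout(num_frames, height, width, patch_size=(1, 2, 2)):
    """Mirror the layout_cfg bookkeeping for a latent with F frames of H x W."""
    _, p_h, p_w = patch_size
    return {
        "total_frames": num_frames,
        "full_frame_shape": (height // p_h, width // p_w),
        "full_token_per_frame": (height * width) // (p_h * p_w),
    }


def unpack_index(token_idx, channel_idx, height, width, patch, out_channels):
    """Map (token, projected channel) to (frame, channel, row, col) in the output latent."""
    h, w = height // patch, width // patch
    frame, rest = divmod(token_idx, h * w)          # tokens are ordered frame-major
    h_idx, w_idx = divmod(rest, w)                  # then row-major inside a frame
    patch_pos, channel = divmod(channel_idx, out_channels)
    p_row, p_col = divmod(patch_pos, patch)         # position inside the patch
    return frame, channel, h_idx * patch + p_row, w_idx * patch + p_col


# Hypothetical latent size: 21 frames of 60 x 104.
layout = build_layout(num_frames=21, height=60, width=104)
total_tokens = layout["total_frames"] * layout["full_token_per_frame"]
```

The projected channel axis is ordered as (patch row, patch column, output channel), so consecutive projected channels belong to the same pixel until `out_channels` is exhausted.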
1507
src/diffusers/models/transformers/transformer_anyflow_far.py
Normal file
File diff suppressed because it is too large
@@ -608,8 +608,16 @@ class BriaTransformer2DModel(ModelMixin, ConfigMixin, PeftAdapterMixin, FromOrig
                from the embeddings of input conditions.
            timestep ( `torch.LongTensor`):
                Used to indicate denoising step.
            block_controlnet_hidden_states: (`list` of `torch.Tensor`):
            img_ids (`torch.Tensor`):
                Image position ids used to compute the rotary positional embeddings.
            txt_ids (`torch.Tensor`):
                Text position ids used to compute the rotary positional embeddings.
            guidance (`torch.Tensor`, *optional*):
                Guidance scale embedding used for guidance-distilled variants of the model.
            controlnet_block_samples (`list` of `torch.Tensor`, *optional*):
                A list of tensors that if specified are added to the residuals of transformer blocks.
            controlnet_single_block_samples (`list` of `torch.Tensor`, *optional*):
                A list of tensors that if specified are added to the residuals of single transformer blocks.
            attention_kwargs (`dict`, *optional*):
                A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
                `self.processor` in

@@ -529,10 +529,18 @@ class BriaFiboTransformer2DModel(ModelMixin, ConfigMixin, PeftAdapterMixin, From
                Input `hidden_states`.
            encoder_hidden_states (`torch.FloatTensor` of shape `(batch size, sequence_len, embed_dims)`):
                Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
            text_encoder_layers (`list` of `torch.Tensor`):
                Per-block text encoder hidden states, one tensor per transformer block.
            pooled_projections (`torch.FloatTensor` of shape `(batch_size, projection_dim)`): Embeddings projected
                from the embeddings of input conditions.
            timestep ( `torch.LongTensor`):
                Used to indicate denoising step.
            img_ids (`torch.Tensor`):
                Image position ids used to compute the rotary positional embeddings.
            txt_ids (`torch.Tensor`):
                Text position ids used to compute the rotary positional embeddings.
            guidance (`torch.Tensor`, *optional*):
                Guidance scale embedding used for guidance-distilled variants of the model.
            joint_attention_kwargs (`dict`, *optional*):
                A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
                `self.processor` in

@@ -498,8 +498,18 @@ class ChromaTransformer2DModel(
                Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
            timestep ( `torch.LongTensor`):
                Used to indicate denoising step.
            block_controlnet_hidden_states: (`list` of `torch.Tensor`):
            img_ids (`torch.Tensor`):
                Image position ids used to compute the rotary positional embeddings.
            txt_ids (`torch.Tensor`):
                Text position ids used to compute the rotary positional embeddings.
            attention_mask (`torch.Tensor`, *optional*):
                Mask applied to `encoder_hidden_states` during attention.
            controlnet_block_samples (`list` of `torch.Tensor`, *optional*):
                A list of tensors that if specified are added to the residuals of transformer blocks.
            controlnet_single_block_samples (`list` of `torch.Tensor`, *optional*):
                A list of tensors that if specified are added to the residuals of single transformer blocks.
            controlnet_blocks_repeat (`bool`, *optional*, defaults to `False`):
                Whether to repeat the controlnet block samples across all transformer blocks.
            joint_attention_kwargs (`dict`, *optional*):
                A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
                `self.processor` in

@@ -651,6 +651,30 @@ class ChronoEditTransformer3DModel(
        return_dict: bool = True,
        attention_kwargs: dict[str, Any] | None = None,
    ) -> torch.Tensor | dict[str, torch.Tensor]:
        """
        The [`ChronoEditTransformer3DModel`] forward method.

        Args:
            hidden_states (`torch.Tensor` of shape `(batch_size, num_channels, num_frames, height, width)`):
                Input `hidden_states`.
            timestep (`torch.LongTensor`):
                Used to indicate denoising step.
            encoder_hidden_states (`torch.Tensor` of shape `(batch_size, sequence_len, embed_dims)`):
                Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
            encoder_hidden_states_image (`torch.Tensor`, *optional*):
                Conditional image embeddings for image-conditioned generation.
            return_dict (`bool`, *optional*, defaults to `True`):
                Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
                tuple.
            attention_kwargs (`dict`, *optional*):
                A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
                `self.processor` in
                [diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).

        Returns:
            If `return_dict` is True, an [`~models.transformer_2d.Transformer2DModelOutput`] is returned, otherwise a
            `tuple` where the first element is the sample tensor.
        """
        batch_size, num_channels, num_frames, height, width = hidden_states.shape
        p_t, p_h, p_w = self.config.patch_size
        post_patch_num_frames = num_frames // p_t

@@ -713,6 +713,38 @@ class CogView4Transformer2DModel(ModelMixin, ConfigMixin, PeftAdapterMixin, Cach
|
||||
attention_mask: torch.Tensor | None = None,
|
||||
image_rotary_emb: tuple[torch.Tensor, torch.Tensor] | list[tuple[torch.Tensor, torch.Tensor]] | None = None,
|
||||
) -> tuple[torch.Tensor] | Transformer2DModelOutput:
|
||||
"""
|
||||
The [`CogView4Transformer2DModel`] forward method.
|
||||
|
||||
Args:
|
||||
hidden_states (`torch.Tensor` of shape `(batch_size, in_channels, height, width)`):
|
||||
Input `hidden_states`.
|
||||
encoder_hidden_states (`torch.Tensor` of shape `(batch_size, sequence_len, embed_dims)`):
|
||||
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
|
||||
timestep (`torch.LongTensor`):
|
||||
Used to indicate denoising step.
|
||||
original_size (`torch.Tensor`):
|
||||
Original image size conditioning.
|
||||
target_size (`torch.Tensor`):
|
||||
Target image size conditioning.
|
||||
crop_coords (`torch.Tensor`):
|
||||
Crop coordinates conditioning.
|
||||
attention_kwargs (`dict`, *optional*):
|
||||
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
|
||||
`self.processor` in
|
||||
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
|
||||
return_dict (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
|
||||
tuple.
|
||||
attention_mask (`torch.Tensor`, *optional*):
|
||||
Mask applied to attention scores.
|
||||
image_rotary_emb (`tuple` of `torch.Tensor`, *optional*):
|
||||
Pre-computed rotary positional embeddings.
|
||||
|
||||
Returns:
|
||||
If `return_dict` is True, an [`~models.transformer_2d.Transformer2DModelOutput`] is returned, otherwise a
|
||||
`tuple` where the first element is the sample tensor.
|
||||
"""
|
||||
batch_size, num_channels, height, width = hidden_states.shape
|
||||
|
||||
# 1. RoPE
|
||||
|
||||
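The `return_dict` entry and the `Returns:` section in this docstring describe a convention shared by the other forward methods in this diff: an output object by default, or a plain tuple whose first element is the sample. The sketch below illustrates that convention with plain Python lists. `SketchOutput` and `forward_sketch` are hypothetical stand-ins, not the real `Transformer2DModelOutput` or any model's `forward`.

```python
from dataclasses import dataclass


@dataclass
class SketchOutput:
    """Hypothetical stand-in for `Transformer2DModelOutput`."""

    sample: list


def forward_sketch(hidden_states, return_dict=True):
    # Placeholder computation; a real model would run its transformer blocks here.
    sample = [2 * x for x in hidden_states]
    if not return_dict:
        # Plain tuple whose first element is the sample.
        return (sample,)
    return SketchOutput(sample=sample)


as_output = forward_sketch([1.0, 2.0])
as_tuple = forward_sketch([1.0, 2.0], return_dict=False)
# Both call styles expose the same sample.
print(as_output.sample == as_tuple[0])
```

Callers that index the result positionally would pass `return_dict=False`, and callers that prefer attribute access would keep the default.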
@@ -697,6 +697,34 @@ class CosmosTransformer3DModel(ModelMixin, ConfigMixin, FromOriginalModelMixin,
|
||||
padding_mask: torch.Tensor | None = None,
|
||||
return_dict: bool = True,
|
||||
) -> tuple[torch.Tensor] | Transformer2DModelOutput:
|
||||
"""
|
||||
The [`CosmosTransformer3DModel`] forward method.
|
||||
|
||||
Args:
|
||||
hidden_states (`torch.Tensor` of shape `(batch_size, num_channels, num_frames, height, width)`):
|
||||
Input `hidden_states`.
|
||||
timestep (`torch.LongTensor`):
|
||||
Used to indicate denoising step.
|
||||
encoder_hidden_states (`torch.Tensor` of shape `(batch_size, sequence_len, embed_dims)`):
|
||||
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
|
||||
block_controlnet_hidden_states (`list` of `torch.Tensor`, *optional*):
|
||||
A list of tensors that if specified are added to the residuals of transformer blocks.
|
||||
attention_mask (`torch.Tensor`, *optional*):
|
||||
Mask applied to `encoder_hidden_states` during attention.
|
||||
fps (`int`, *optional*):
|
||||
Frames per second of the input video used to compute the rotary positional embeddings.
|
||||
condition_mask (`torch.Tensor`, *optional*):
|
||||
Mask channel concatenated to `hidden_states` to indicate the conditioning region.
|
||||
padding_mask (`torch.Tensor`, *optional*):
|
||||
Padding mask concatenated to `hidden_states` when `concat_padding_mask` is enabled.
|
||||
return_dict (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
|
||||
tuple.
|
||||
|
||||
Returns:
|
||||
If `return_dict` is True, an [`~models.transformer_2d.Transformer2DModelOutput`] is returned, otherwise a
|
||||
`tuple` where the first element is the sample tensor.
|
||||
"""
|
||||
batch_size, num_channels, num_frames, height, width = hidden_states.shape
|
||||
|
||||
# 1. Concatenate padding mask if needed & prepare attention mask
|
||||
|
||||
@@ -469,6 +469,33 @@ class EasyAnimateTransformer3DModel(ModelMixin, ConfigMixin):
|
||||
control_latents: torch.Tensor | None = None,
|
||||
return_dict: bool = True,
|
||||
) -> tuple[torch.Tensor] | Transformer2DModelOutput:
|
||||
"""
|
||||
The [`EasyAnimateTransformer3DModel`] forward method.
|
||||
|
||||
Args:
|
||||
hidden_states (`torch.Tensor` of shape `(batch_size, channels, num_frames, height, width)`):
|
||||
Input `hidden_states`.
|
||||
timestep (`torch.LongTensor`):
|
||||
Used to indicate denoising step.
|
||||
timestep_cond (`torch.Tensor`, *optional*):
|
||||
Conditional embeddings for timestep. If provided, the embeddings will be summed with the samples passed
|
||||
through the `self.time_embedding` layer to obtain the final timestep embeddings.
|
||||
encoder_hidden_states (`torch.Tensor`, *optional*):
|
||||
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
|
||||
encoder_hidden_states_t5 (`torch.Tensor`, *optional*):
|
||||
Additional conditional embeddings computed from a T5 text encoder.
|
||||
inpaint_latents (`torch.Tensor`, *optional*):
|
||||
Latents concatenated to `hidden_states` for inpainting variants of the model.
|
||||
control_latents (`torch.Tensor`, *optional*):
|
||||
Latents concatenated to `hidden_states` for control variants of the model.
|
||||
return_dict (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
|
||||
tuple.
|
||||
|
||||
Returns:
|
||||
If `return_dict` is True, an [`~models.transformer_2d.Transformer2DModelOutput`] is returned, otherwise a
|
||||
`tuple` where the first element is the sample tensor.
|
||||
"""
|
||||
batch_size, channels, video_length, height, width = hidden_states.size()
|
||||
p = self.config.patch_size
|
||||
post_patch_height = height // p
|
||||
|
||||
@@ -350,6 +350,23 @@ class ErnieImageTransformer2DModel(ModelMixin, ConfigMixin, PeftAdapterMixin):
|
||||
text_lens: torch.Tensor,
|
||||
return_dict: bool = True,
|
||||
):
|
||||
"""
|
||||
The [`ErnieImageTransformer2DModel`] forward method.
|
||||
|
||||
Args:
|
||||
hidden_states (`torch.Tensor` of shape `(batch_size, in_channels, height, width)`):
|
||||
Input `hidden_states`.
|
||||
timestep (`torch.LongTensor`):
|
||||
Used to indicate denoising step.
|
||||
text_bth (`torch.Tensor`):
|
||||
Conditional text embeddings (embeddings computed from the input conditions such as prompts) to use,
|
||||
shaped `(batch_size, text_length, embed_dims)`.
|
||||
text_lens (`torch.Tensor`):
|
||||
Per-sample text sequence lengths used to build the attention mask.
|
||||
return_dict (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
|
||||
tuple.
|
||||
"""
|
||||
device, dtype = hidden_states.device, hidden_states.dtype
|
||||
B, C, H, W = hidden_states.shape
|
||||
p, Hp, Wp = self.patch_size, H // self.patch_size, W // self.patch_size
|
||||
|
||||
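The `text_lens` entry says the per-sample lengths are used to build the attention mask, but the hunk does not show how. One common way to turn lengths into a padding mask is sketched below in plain Python. The helper `lengths_to_mask` is hypothetical, and the convention that `True` marks a real token is an assumption rather than something the diff confirms.

```python
def lengths_to_mask(text_lens, max_len):
    """Build a boolean padding mask from per-sample lengths.

    Hypothetical helper; True marks a real token and False marks padding.
    """
    return [[pos < n for pos in range(max_len)] for n in text_lens]


# Two samples padded to length 4: the first has 3 real tokens, the second has 1.
mask = lengths_to_mask([3, 1], max_len=4)
for row in mask:
    print(row)
```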
@@ -662,8 +662,18 @@ class FluxTransformer2DModel(
|
||||
from the embeddings of input conditions.
|
||||
timestep (`torch.LongTensor`):
|
||||
Used to indicate denoising step.
|
||||
block_controlnet_hidden_states: (`list` of `torch.Tensor`):
|
||||
img_ids (`torch.Tensor`):
|
||||
Image position ids used to compute the rotary positional embeddings.
|
||||
txt_ids (`torch.Tensor`):
|
||||
Text position ids used to compute the rotary positional embeddings.
|
||||
guidance (`torch.Tensor`, *optional*):
|
||||
Guidance scale embedding used for guidance-distilled variants of the model.
|
||||
controlnet_block_samples (`list` of `torch.Tensor`, *optional*):
|
||||
A list of tensors that if specified are added to the residuals of transformer blocks.
|
||||
controlnet_single_block_samples (`list` of `torch.Tensor`, *optional*):
|
||||
A list of tensors that if specified are added to the residuals of single transformer blocks.
|
||||
controlnet_blocks_repeat (`bool`, *optional*, defaults to `False`):
|
||||
Whether to repeat the controlnet block samples across all transformer blocks.
|
||||
joint_attention_kwargs (`dict`, *optional*):
|
||||
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
|
||||
`self.processor` in
|
||||
|
||||
@@ -1201,6 +1201,12 @@ class Flux2Transformer2DModel(
|
||||
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
|
||||
timestep (`torch.LongTensor`):
|
||||
Used to indicate denoising step.
|
||||
img_ids (`torch.Tensor`):
|
||||
Image position ids used to compute the rotary positional embeddings.
|
||||
txt_ids (`torch.Tensor`):
|
||||
Text position ids used to compute the rotary positional embeddings.
|
||||
guidance (`torch.Tensor`, *optional*):
|
||||
Guidance scale embedding used for guidance-distilled variants of the model.
|
||||
joint_attention_kwargs (`dict`, *optional*):
|
||||
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
|
||||
`self.processor` in
|
||||
|
||||
@@ -609,6 +609,42 @@ class GlmImageTransformer2DModel(ModelMixin, ConfigMixin, PeftAdapterMixin, Cach
|
||||
kv_caches: GlmImageKVCache | None = None,
|
||||
image_rotary_emb: tuple[torch.Tensor, torch.Tensor] | list[tuple[torch.Tensor, torch.Tensor]] | None = None,
|
||||
) -> tuple[torch.Tensor] | Transformer2DModelOutput:
|
||||
"""
|
||||
The [`GlmImageTransformer2DModel`] forward method.
|
||||
|
||||
Args:
|
||||
hidden_states (`torch.Tensor` of shape `(batch_size, in_channels, height, width)`):
|
||||
Input `hidden_states`.
|
||||
encoder_hidden_states (`torch.Tensor` of shape `(batch_size, sequence_len, embed_dims)`):
|
||||
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
|
||||
prior_token_id (`torch.Tensor`):
|
||||
Token ids for the prior embedding lookup.
|
||||
prior_token_drop (`torch.Tensor`):
|
||||
Boolean mask indicating which prior embeddings should be dropped (zeroed out).
|
||||
timestep (`torch.LongTensor`):
|
||||
Used to indicate denoising step.
|
||||
target_size (`torch.Tensor`):
|
||||
Target image size conditioning.
|
||||
crop_coords (`torch.Tensor`):
|
||||
Crop coordinates conditioning.
|
||||
attention_kwargs (`dict`, *optional*):
|
||||
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
|
||||
`self.processor` in
|
||||
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
|
||||
return_dict (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
|
||||
tuple.
|
||||
attention_mask (`torch.Tensor`, *optional*):
|
||||
Mask applied to attention scores.
|
||||
kv_caches (`GlmImageKVCache`, *optional*):
|
||||
Pre-computed key/value caches used to speed up inference.
|
||||
image_rotary_emb (`tuple` of `torch.Tensor`, *optional*):
|
||||
Pre-computed rotary positional embeddings.
|
||||
|
||||
Returns:
|
||||
If `return_dict` is True, an [`~models.transformer_2d.Transformer2DModelOutput`] is returned, otherwise a
|
||||
`tuple` where the first element is the sample tensor.
|
||||
"""
|
||||
batch_size, num_channels, height, width = hidden_states.shape
|
||||
|
||||
# 1. RoPE
|
||||
|
||||
@@ -671,6 +671,42 @@ class HeliosTransformer3DModel(
|
||||
return_dict: bool = True,
|
||||
attention_kwargs: dict[str, Any] | None = None,
|
||||
) -> torch.Tensor | dict[str, torch.Tensor]:
|
||||
"""
|
||||
The [`HeliosTransformer3DModel`] forward method.
|
||||
|
||||
Args:
|
||||
hidden_states (`torch.Tensor` of shape `(batch_size, num_channels, num_frames, height, width)`):
|
||||
Input `hidden_states`.
|
||||
timestep (`torch.LongTensor`):
|
||||
Used to indicate denoising step.
|
||||
encoder_hidden_states (`torch.Tensor` of shape `(batch_size, sequence_len, embed_dims)`):
|
||||
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
|
||||
indices_hidden_states (`torch.Tensor`, *optional*):
|
||||
Frame indices for `hidden_states` used to compute the rotary positional embeddings.
|
||||
indices_latents_history_short (`torch.Tensor`, *optional*):
|
||||
Frame indices for the short history latents.
|
||||
indices_latents_history_mid (`torch.Tensor`, *optional*):
|
||||
Frame indices for the mid history latents.
|
||||
indices_latents_history_long (`torch.Tensor`, *optional*):
|
||||
Frame indices for the long history latents.
|
||||
latents_history_short (`torch.Tensor`, *optional*):
|
||||
Short history latents conditioning.
|
||||
latents_history_mid (`torch.Tensor`, *optional*):
|
||||
Mid history latents conditioning.
|
||||
latents_history_long (`torch.Tensor`, *optional*):
|
||||
Long history latents conditioning.
|
||||
return_dict (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
|
||||
tuple.
|
||||
attention_kwargs (`dict`, *optional*):
|
||||
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
|
||||
`self.processor` in
|
||||
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
|
||||
|
||||
Returns:
|
||||
If `return_dict` is True, an [`~models.transformer_2d.Transformer2DModelOutput`] is returned, otherwise a
|
||||
`tuple` where the first element is the sample tensor.
|
||||
"""
|
||||
# 1. Input
|
||||
batch_size = hidden_states.shape[0]
|
||||
p_t, p_h, p_w = self.config.patch_size
|
||||
|
||||
@@ -788,6 +788,38 @@ class HiDreamImageTransformer2DModel(ModelMixin, ConfigMixin, PeftAdapterMixin,
|
||||
return_dict: bool = True,
|
||||
**kwargs,
|
||||
) -> tuple[torch.Tensor] | Transformer2DModelOutput:
|
||||
"""
|
||||
The [`HiDreamImageTransformer2DModel`] forward method.
|
||||
|
||||
Args:
|
||||
hidden_states (`torch.Tensor` of shape `(batch_size, in_channels, height, width)` or `(batch_size, patch_height * patch_width, patch_size * patch_size * channels)`):
|
||||
Input `hidden_states`.
|
||||
timesteps (`torch.LongTensor`):
|
||||
Used to indicate denoising step.
|
||||
encoder_hidden_states_t5 (`torch.Tensor`):
|
||||
Conditional embeddings computed from the T5 text encoder.
|
||||
encoder_hidden_states_llama3 (`torch.Tensor`):
|
||||
Conditional embeddings computed from the Llama3 text encoder.
|
||||
pooled_embeds (`torch.Tensor`):
|
||||
Pooled text embeddings used for additional conditioning.
|
||||
img_ids (`torch.Tensor`, *optional*):
|
||||
Image position ids for the patched hidden states.
|
||||
img_sizes (`list` of `tuple` of `int`, *optional*):
|
||||
Per-sample patch grid sizes used to unpatchify the output.
|
||||
hidden_states_masks (`torch.Tensor`, *optional*):
|
||||
Mask over patched `hidden_states`.
|
||||
attention_kwargs (`dict`, *optional*):
|
||||
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
|
||||
`self.processor` in
|
||||
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
|
||||
return_dict (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
|
||||
tuple.
|
||||
|
||||
Returns:
|
||||
If `return_dict` is True, an [`~models.transformer_2d.Transformer2DModelOutput`] is returned, otherwise a
|
||||
`tuple` where the first element is the sample tensor.
|
||||
"""
|
||||
encoder_hidden_states = kwargs.get("encoder_hidden_states", None)
|
||||
|
||||
if encoder_hidden_states is not None:
|
||||
|
||||
@@ -1003,6 +1003,34 @@ class HunyuanVideoTransformer3DModel(
|
||||
attention_kwargs: dict[str, Any] | None = None,
|
||||
return_dict: bool = True,
|
||||
) -> tuple[torch.Tensor] | Transformer2DModelOutput:
|
||||
"""
|
||||
The [`HunyuanVideoTransformer3DModel`] forward method.
|
||||
|
||||
Args:
|
||||
hidden_states (`torch.Tensor` of shape `(batch_size, num_channels, num_frames, height, width)`):
|
||||
Input `hidden_states`.
|
||||
timestep (`torch.LongTensor`):
|
||||
Used to indicate denoising step.
|
||||
encoder_hidden_states (`torch.Tensor` of shape `(batch_size, sequence_len, embed_dims)`):
|
||||
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
|
||||
encoder_attention_mask (`torch.Tensor`):
|
||||
Mask applied to `encoder_hidden_states` during attention.
|
||||
pooled_projections (`torch.Tensor` of shape `(batch_size, projection_dim)`):
|
||||
Embeddings projected from the embeddings of input conditions.
|
||||
guidance (`torch.Tensor`, *optional*):
|
||||
Guidance scale embedding used for guidance-distilled variants of the model.
|
||||
attention_kwargs (`dict`, *optional*):
|
||||
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
|
||||
`self.processor` in
|
||||
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
|
||||
return_dict (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
|
||||
tuple.
|
||||
|
||||
Returns:
|
||||
If `return_dict` is True, an [`~models.transformer_2d.Transformer2DModelOutput`] is returned, otherwise a
|
||||
`tuple` where the first element is the sample tensor.
|
||||
"""
|
||||
batch_size, num_channels, num_frames, height, width = hidden_states.shape
|
||||
p, p_t = self.config.patch_size, self.config.patch_size_t
|
||||
post_patch_num_frames = num_frames // p_t
|
||||
|
||||
@@ -634,6 +634,38 @@ class HunyuanVideo15Transformer3DModel(
|
||||
attention_kwargs: dict[str, Any] | None = None,
|
||||
return_dict: bool = True,
|
||||
) -> tuple[torch.Tensor] | Transformer2DModelOutput:
|
||||
"""
|
||||
The [`HunyuanVideo15Transformer3DModel`] forward method.
|
||||
|
||||
Args:
|
||||
hidden_states (`torch.Tensor` of shape `(batch_size, num_channels, num_frames, height, width)`):
|
||||
Input `hidden_states`.
|
||||
timestep (`torch.LongTensor`):
|
||||
Used to indicate denoising step.
|
||||
encoder_hidden_states (`torch.Tensor` of shape `(batch_size, sequence_len, embed_dims)`):
|
||||
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
|
||||
encoder_attention_mask (`torch.Tensor`):
|
||||
Mask applied to `encoder_hidden_states` during attention.
|
||||
timestep_r (`torch.LongTensor`, *optional*):
|
||||
Refiner timestep conditioning.
|
||||
encoder_hidden_states_2 (`torch.Tensor`, *optional*):
|
||||
Additional conditional embeddings computed from a second text encoder (ByT5).
|
||||
encoder_attention_mask_2 (`torch.Tensor`, *optional*):
|
||||
Mask applied to `encoder_hidden_states_2` during attention.
|
||||
image_embeds (`torch.Tensor`, *optional*):
|
||||
Image embeddings for image-conditioned generation.
|
||||
attention_kwargs (`dict`, *optional*):
|
||||
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
|
||||
`self.processor` in
|
||||
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
|
||||
return_dict (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
|
||||
tuple.
|
||||
|
||||
Returns:
|
||||
If `return_dict` is True, an [`~models.transformer_2d.Transformer2DModelOutput`] is returned, otherwise a
|
||||
`tuple` where the first element is the sample tensor.
|
||||
"""
|
||||
batch_size, num_channels, num_frames, height, width = hidden_states.shape
|
||||
p_t, p_h, p_w = self.config.patch_size_t, self.config.patch_size, self.config.patch_size
|
||||
post_patch_num_frames = num_frames // p_t
|
||||
|
||||
@@ -218,6 +218,50 @@ class HunyuanVideoFramepackTransformer3DModel(
|
||||
attention_kwargs: dict[str, Any] | None = None,
|
||||
return_dict: bool = True,
|
||||
) -> tuple[torch.Tensor] | Transformer2DModelOutput:
|
||||
"""
|
||||
The [`HunyuanVideoFramepackTransformer3DModel`] forward method.
|
||||
|
||||
Args:
|
||||
hidden_states (`torch.Tensor` of shape `(batch_size, num_channels, num_frames, height, width)`):
|
||||
Input `hidden_states`.
|
||||
timestep (`torch.LongTensor`):
|
||||
Used to indicate denoising step.
|
||||
encoder_hidden_states (`torch.Tensor` of shape `(batch_size, sequence_len, embed_dims)`):
|
||||
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
|
||||
encoder_attention_mask (`torch.Tensor`):
|
||||
Mask applied to `encoder_hidden_states` during attention.
|
||||
pooled_projections (`torch.Tensor` of shape `(batch_size, projection_dim)`):
|
||||
Embeddings projected from the embeddings of input conditions.
|
||||
image_embeds (`torch.Tensor`):
|
||||
Image embeddings for image-conditioned generation.
|
||||
indices_latents (`torch.Tensor`):
|
||||
Frame indices for `hidden_states` used to compute the rotary positional embeddings.
|
||||
guidance (`torch.Tensor`, *optional*):
|
||||
Guidance scale embedding used for guidance-distilled variants of the model.
|
||||
latents_clean (`torch.Tensor`, *optional*):
|
||||
Clean (denoised) history latents conditioning.
|
||||
indices_latents_clean (`torch.Tensor`, *optional*):
|
||||
Frame indices for `latents_clean`.
|
||||
latents_history_2x (`torch.Tensor`, *optional*):
|
||||
2x downsampled history latents conditioning.
|
||||
indices_latents_history_2x (`torch.Tensor`, *optional*):
|
||||
Frame indices for `latents_history_2x`.
|
||||
latents_history_4x (`torch.Tensor`, *optional*):
|
||||
4x downsampled history latents conditioning.
|
||||
indices_latents_history_4x (`torch.Tensor`, *optional*):
|
||||
Frame indices for `latents_history_4x`.
|
||||
attention_kwargs (`dict`, *optional*):
|
||||
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
|
||||
`self.processor` in
|
||||
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
|
||||
return_dict (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
|
||||
tuple.
|
||||
|
||||
Returns:
|
||||
If `return_dict` is True, an [`~models.transformer_2d.Transformer2DModelOutput`] is returned, otherwise a
|
||||
`tuple` where the first element is the sample tensor.
|
||||
"""
|
||||
batch_size, num_channels, num_frames, height, width = hidden_states.shape
|
||||
p, p_t = self.config.patch_size, self.config.patch_size_t
|
||||
post_patch_num_frames = num_frames // p_t
|
||||
|
||||
@@ -754,6 +754,38 @@ class HunyuanImageTransformer2DModel(
|
||||
attention_kwargs: dict[str, Any] | None = None,
|
||||
return_dict: bool = True,
|
||||
) -> torch.Tensor | dict[str, torch.Tensor]:
|
||||
"""
|
||||
The [`HunyuanImageTransformer2DModel`] forward method.
|
||||
|
||||
Args:
|
||||
hidden_states (`torch.Tensor` of shape `(batch_size, num_channels, height, width)` or `(batch_size, num_channels, num_frames, height, width)`):
|
||||
Input `hidden_states`.
|
||||
timestep (`torch.LongTensor`):
|
||||
Used to indicate denoising step.
|
||||
encoder_hidden_states (`torch.Tensor` of shape `(batch_size, sequence_len, embed_dims)`):
|
||||
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
|
||||
encoder_attention_mask (`torch.Tensor`):
|
||||
Mask applied to `encoder_hidden_states` during attention.
|
||||
timestep_r (`torch.LongTensor`, *optional*):
|
||||
Refiner timestep conditioning.
|
||||
encoder_hidden_states_2 (`torch.Tensor`, *optional*):
|
||||
Additional conditional embeddings computed from a second text encoder.
|
||||
encoder_attention_mask_2 (`torch.Tensor`, *optional*):
|
||||
Mask applied to `encoder_hidden_states_2` during attention.
|
||||
guidance (`torch.Tensor`, *optional*):
|
||||
Guidance scale embedding used for guidance-distilled variants of the model.
|
||||
attention_kwargs (`dict`, *optional*):
|
||||
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
|
||||
`self.processor` in
|
||||
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
|
||||
return_dict (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
|
||||
tuple.
|
||||
|
||||
Returns:
|
||||
If `return_dict` is True, an [`~models.transformer_2d.Transformer2DModelOutput`] is returned, otherwise a
|
||||
`tuple` where the first element is the sample tensor.
|
||||
"""
|
||||
if hidden_states.ndim == 4:
|
||||
batch_size, channels, height, width = hidden_states.shape
|
||||
sizes = (height, width)
|
||||
|
||||
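The docstring above allows `hidden_states` to be either 4-D (image) or 5-D (video), and the context lines show only the start of the 4-D branch. A minimal sketch of that kind of rank dispatch follows, working on shape tuples instead of tensors. The 5-D branch and the helper name `spatial_sizes` are assumptions, since the hunk is cut off before showing them.

```python
def spatial_sizes(shape):
    """Return the non-batch, non-channel sizes for a 4-D or 5-D input shape.

    Hypothetical helper for illustration only.
    """
    if len(shape) == 4:
        # (batch_size, channels, height, width), as in the hunk above.
        _, _, height, width = shape
        return (height, width)
    if len(shape) == 5:
        # (batch_size, channels, num_frames, height, width); assumed branch.
        _, _, num_frames, height, width = shape
        return (num_frames, height, width)
    raise ValueError(f"expected a 4-D or 5-D shape, got {len(shape)} dimensions")


print(spatial_sizes((1, 64, 32, 32)))
print(spatial_sizes((1, 64, 8, 32, 32)))
```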
@@ -526,6 +526,20 @@ class JoyImageEditTransformer3DModel(ModelMixin, ConfigMixin, AttentionMixin):
|
||||
encoder_hidden_states: torch.Tensor = None,
|
||||
return_dict: bool = True,
|
||||
):
|
||||
"""
|
||||
The [`JoyImageEditTransformer3DModel`] forward method.
|
||||
|
||||
Args:
|
||||
hidden_states (`torch.Tensor` of shape `(batch_size, num_channels, num_frames, height, width)` or `(batch_size, num_items, num_channels, num_frames, height, width)`):
|
||||
Input `hidden_states`.
|
||||
timestep (`torch.LongTensor`):
|
||||
Used to indicate denoising step.
|
||||
encoder_hidden_states (`torch.Tensor`, *optional*):
|
||||
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
|
||||
return_dict (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
|
||||
tuple.
|
||||
"""
|
||||
# handle multi-item input (b, n, c, t, h, w)
|
||||
is_multi_item = hidden_states.ndim == 6
|
||||
num_items = 0
|
||||
|
||||
@@ -545,6 +545,25 @@ class LongCatAudioDiTTransformer(ModelMixin, ConfigMixin):
|
||||
latent_cond: torch.Tensor | None = None,
|
||||
return_dict: bool = True,
|
||||
) -> LongCatAudioDiTTransformerOutput | tuple[torch.Tensor]:
|
||||
"""
|
||||
The [`LongCatAudioDiTTransformer`] forward method.
|
||||
|
||||
Args:
|
||||
hidden_states (`torch.Tensor` of shape `(batch_size, sequence_length, in_channels)`):
|
||||
Input `hidden_states`.
|
||||
encoder_hidden_states (`torch.Tensor` of shape `(batch_size, sequence_len, embed_dims)`):
|
||||
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
|
||||
encoder_attention_mask (`torch.BoolTensor`):
|
||||
Mask applied to `encoder_hidden_states` during attention.
|
||||
timestep (`torch.LongTensor`):
|
||||
Used to indicate denoising step.
|
||||
attention_mask (`torch.BoolTensor`, *optional*):
|
||||
Mask applied to `hidden_states` during self-attention.
|
||||
latent_cond (`torch.Tensor`, *optional*):
|
||||
Latent conditioning concatenated to `hidden_states`.
|
||||
return_dict (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to return a [`LongCatAudioDiTTransformerOutput`] instead of a plain tuple.
|
||||
"""
|
||||
dtype = hidden_states.dtype
|
||||
encoder_hidden_states = encoder_hidden_states.to(dtype)
|
||||
timestep = timestep.to(dtype)
|
||||
|
||||
@@ -483,8 +483,12 @@ class LongCatImageTransformer2DModel(
|
||||
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
|
||||
timestep (`torch.LongTensor`):
|
||||
Used to indicate denoising step.
|
||||
block_controlnet_hidden_states: (`list` of `torch.Tensor`):
|
||||
A list of tensors that if specified are added to the residuals of transformer blocks.
|
||||
img_ids (`torch.Tensor`):
|
||||
Image position ids used to compute the rotary positional embeddings.
|
||||
txt_ids (`torch.Tensor`):
|
||||
Text position ids used to compute the rotary positional embeddings.
|
||||
guidance (`torch.Tensor`, *optional*):
|
||||
Guidance scale embedding used for guidance-distilled variants of the model.
|
||||
return_dict (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
|
||||
tuple.
|
||||
|
||||
@@ -506,6 +506,36 @@ class LTXVideoTransformer3DModel(
|
||||
attention_kwargs: dict[str, Any] | None = None,
|
||||
return_dict: bool = True,
|
||||
) -> torch.Tensor:
|
||||
"""
|
||||
The [`LTXVideoTransformer3DModel`] forward method.
|
||||
|
||||
Args:
|
||||
hidden_states (`torch.Tensor` of shape `(batch_size, sequence_length, in_channels)`):
|
||||
Input `hidden_states`.
|
||||
encoder_hidden_states (`torch.Tensor` of shape `(batch_size, sequence_len, embed_dims)`):
|
||||
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
|
||||
timestep (`torch.LongTensor`):
|
||||
Used to indicate denoising step.
|
||||
encoder_attention_mask (`torch.Tensor`):
|
||||
Mask applied to `encoder_hidden_states` during attention.
|
||||
num_frames (`int`, *optional*):
|
||||
Number of frames in the video used to compute the rotary positional embeddings.
|
||||
height (`int`, *optional*):
|
||||
Height of the latent used to compute the rotary positional embeddings.
|
||||
width (`int`, *optional*):
|
||||
Width of the latent used to compute the rotary positional embeddings.
|
||||
rope_interpolation_scale (`tuple` of `float` or `torch.Tensor`, *optional*):
|
||||
Interpolation scale used by the rotary positional embeddings.
|
||||
video_coords (`torch.Tensor`, *optional*):
|
||||
Pre-computed video coordinates used by the rotary positional embeddings.
|
||||
attention_kwargs (`dict`, *optional*):
|
||||
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
|
||||
`self.processor` in
|
||||
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
|
||||
return_dict (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
|
||||
tuple.
|
||||
"""
|
||||
image_rotary_emb = self.rope(hidden_states, num_frames, height, width, rope_interpolation_scale, video_coords)
|
||||
|
||||
# convert encoder_attention_mask to a bias the same way we do for attention_mask
|
||||
|
||||
@@ -465,6 +465,30 @@ class Lumina2Transformer2DModel(ModelMixin, ConfigMixin, PeftAdapterMixin, FromO
|
||||
attention_kwargs: dict[str, Any] | None = None,
|
||||
return_dict: bool = True,
|
||||
) -> torch.Tensor | Transformer2DModelOutput:
|
||||
"""
|
||||
The [`Lumina2Transformer2DModel`] forward method.
|
||||
|
||||
Args:
|
||||
hidden_states (`torch.Tensor` of shape `(batch_size, in_channels, height, width)`):
|
||||
Input `hidden_states`.
|
||||
timestep (`torch.LongTensor`):
|
||||
Used to indicate denoising step.
|
||||
encoder_hidden_states (`torch.Tensor` of shape `(batch_size, sequence_len, embed_dims)`):
|
||||
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
|
||||
encoder_attention_mask (`torch.Tensor`):
|
||||
Mask applied to `encoder_hidden_states` during attention.
|
||||
attention_kwargs (`dict`, *optional*):
|
||||
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
|
||||
`self.processor` in
|
||||
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
|
||||
return_dict (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
|
||||
tuple.
|
||||
|
||||
Returns:
|
||||
If `return_dict` is True, an [`~models.transformer_2d.Transformer2DModelOutput`] is returned, otherwise a
|
||||
`tuple` where the first element is the sample tensor.
|
||||
"""
|
||||
# 1. Condition, positional & patch embedding
|
||||
batch_size, _, height, width = hidden_states.shape
|
||||
|
||||
|
||||
@@ -414,6 +414,26 @@ class MochiTransformer3DModel(ModelMixin, ConfigMixin, PeftAdapterMixin, FromOri
|
||||
attention_kwargs: dict[str, Any] | None = None,
|
||||
return_dict: bool = True,
|
||||
) -> torch.Tensor:
|
||||
"""
|
||||
The [`MochiTransformer3DModel`] forward method.
|
||||
|
||||
Args:
|
||||
hidden_states (`torch.Tensor` of shape `(batch_size, num_channels, num_frames, height, width)`):
|
||||
Input `hidden_states`.
|
||||
encoder_hidden_states (`torch.Tensor` of shape `(batch_size, sequence_len, embed_dims)`):
|
||||
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
|
||||
timestep (`torch.LongTensor`):
|
||||
Used to indicate denoising step.
|
||||
encoder_attention_mask (`torch.Tensor`):
|
||||
Mask applied to `encoder_hidden_states` during attention.
|
||||
attention_kwargs (`dict`, *optional*):
|
||||
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
|
||||
`self.processor` in
|
||||
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
|
||||
return_dict (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
|
||||
tuple.
|
||||
"""
|
||||
batch_size, num_channels, num_frames, height, width = hidden_states.shape
|
||||
p = self.config.patch_size
|
||||
|
||||
|
||||
@@ -415,6 +415,29 @@ class OmniGenTransformer2DModel(ModelMixin, ConfigMixin):
|
||||
position_ids: torch.Tensor,
|
||||
return_dict: bool = True,
|
||||
) -> Transformer2DModelOutput | tuple[torch.Tensor]:
|
||||
"""
|
||||
The [`OmniGenTransformer2DModel`] forward method.
|
||||
|
||||
Args:
|
||||
hidden_states (`torch.Tensor` of shape `(batch_size, in_channels, height, width)`):
|
||||
Input `hidden_states`.
|
||||
timestep (`torch.LongTensor`):
|
||||
Used to indicate denoising step.
|
||||
input_ids (`torch.Tensor`):
|
||||
Multimodal text token ids used as conditioning.
|
||||
input_img_latents (`list` of `torch.Tensor`):
|
||||
List of latents for input images used as conditioning.
|
||||
input_image_sizes (`dict` of `int` to `list` of `int`):
|
||||
Mapping from sample index to the positions where input image embeddings should be placed in the
|
||||
conditioning sequence.
|
||||
attention_mask (`torch.Tensor`):
|
||||
Attention mask for the joint multimodal sequence.
|
||||
position_ids (`torch.Tensor`):
|
||||
Position ids used to compute the positional embeddings.
|
||||
return_dict (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
|
||||
tuple.
|
||||
"""
|
||||
batch_size, num_channels, height, width = hidden_states.shape
|
||||
p = self.config.patch_size
|
||||
post_patch_height, post_patch_width = height // p, width // p
|
||||
|
||||
@@ -868,6 +868,8 @@ class QwenImageTransformer2DModel(
|
||||
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
|
||||
controlnet_block_samples (*optional*):
|
||||
ControlNet block samples to add to the transformer blocks.
|
||||
additional_t_cond (`torch.Tensor`, *optional*):
|
||||
Additional timestep conditioning added to the timestep embedding.
|
||||
return_dict (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
|
||||
tuple.
|
||||
|
||||
@@ -583,6 +583,36 @@ class SanaVideoTransformer3DModel(ModelMixin, ConfigMixin, PeftAdapterMixin, Fro
|
||||
controlnet_block_samples: tuple[torch.Tensor] | None = None,
|
||||
return_dict: bool = True,
|
||||
) -> tuple[torch.Tensor, ...] | Transformer2DModelOutput:
|
||||
"""
|
||||
The [`SanaVideoTransformer3DModel`] forward method.
|
||||
|
||||
Args:
|
||||
hidden_states (`torch.Tensor` of shape `(batch_size, in_channels, num_frames, height, width)`):
|
||||
Input `hidden_states`.
|
||||
encoder_hidden_states (`torch.Tensor` of shape `(batch_size, sequence_len, embed_dims)`):
|
||||
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
|
||||
timestep (`torch.LongTensor`):
|
||||
Used to indicate denoising step.
|
||||
guidance (`torch.Tensor`, *optional*):
|
||||
Guidance scale embedding.
|
||||
encoder_attention_mask (`torch.Tensor`, *optional*):
|
||||
Cross-attention mask applied to `encoder_hidden_states`.
|
||||
attention_mask (`torch.Tensor`, *optional*):
|
||||
Self-attention mask applied to `hidden_states`.
|
||||
attention_kwargs (`dict`, *optional*):
|
||||
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
|
||||
`self.processor` in
|
||||
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
|
||||
controlnet_block_samples (`tuple` of `torch.Tensor`, *optional*):
|
||||
A tuple of tensors that, if specified, are added to the residuals of the transformer blocks.
|
||||
return_dict (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
|
||||
tuple.
|
||||
|
||||
Returns:
|
||||
If `return_dict` is `True`, an [`~models.transformer_2d.Transformer2DModelOutput`] is returned, otherwise a
|
||||
`tuple` where the first element is the sample tensor.
|
||||
"""
|
||||
# ensure attention_mask is a bias, and give it a singleton query_tokens dimension.
|
||||
# we may have done this conversion already, e.g. if we came here via UNet2DConditionModel#forward.
|
||||
# we can tell by counting dims; if ndim == 2: it's a mask rather than a bias.
|
||||
|
||||
@@ -642,6 +642,34 @@ class SkyReelsV2Transformer3DModel(
|
||||
return_dict: bool = True,
|
||||
attention_kwargs: dict[str, Any] | None = None,
|
||||
) -> torch.Tensor | dict[str, torch.Tensor]:
|
||||
"""
|
||||
The [`SkyReelsV2Transformer3DModel`] forward method.
|
||||
|
||||
Args:
|
||||
hidden_states (`torch.Tensor` of shape `(batch_size, num_channels, num_frames, height, width)`):
|
||||
Input `hidden_states`.
|
||||
timestep (`torch.LongTensor`):
|
||||
Used to indicate denoising step.
|
||||
encoder_hidden_states (`torch.Tensor` of shape `(batch_size, sequence_len, embed_dims)`):
|
||||
Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
|
||||
encoder_hidden_states_image (`torch.Tensor`, *optional*):
|
||||
Conditional image embeddings for image-conditioned generation.
|
||||
enable_diffusion_forcing (`bool`, *optional*, defaults to `False`):
|
||||
Whether to enable diffusion forcing (per-block causal masking).
|
||||
fps (`torch.Tensor`, *optional*):
|
||||
FPS conditioning embedding.
|
||||
return_dict (`bool`, *optional*, defaults to `True`):
|
||||
Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
|
||||
tuple.
|
||||
attention_kwargs (`dict`, *optional*):
|
||||
A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under
|
||||
`self.processor` in
|
||||
[diffusers.models.attention_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
|
||||
|
||||
Returns:
|
||||
If `return_dict` is `True`, an [`~models.transformer_2d.Transformer2DModelOutput`] is returned, otherwise a
|
||||
`tuple` where the first element is the sample tensor.
|
||||
"""
|
||||
batch_size, num_channels, num_frames, height, width = hidden_states.shape
|
||||
p_t, p_h, p_w = self.config.patch_size
|
||||
post_patch_num_frames = num_frames // p_t
|
||||
|
||||
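The docstrings in the hunks above all describe the same `return_dict` convention: an output object when `True`, otherwise a plain tuple whose first element is the sample. A minimal sketch of that pattern follows. It assumes a hypothetical `ToyOutput` dataclass and `toy_forward` function in place of the real `Transformer2DModelOutput` and model `forward` methods, and uses plain lists instead of tensors so it runs without `torch`.

```python
from dataclasses import dataclass


@dataclass
class ToyOutput:
    # Hypothetical stand-in for an output class; not the diffusers class.
    sample: list


def toy_forward(hidden_states, return_dict=True):
    # Placeholder computation standing in for the transformer blocks.
    sample = [2 * x for x in hidden_states]

    if not return_dict:
        # Plain tuple: the first element is the sample.
        return (sample,)
    return ToyOutput(sample=sample)


out = toy_forward([1, 2, 3])                     # attribute access: out.sample
tup = toy_forward([1, 2, 3], return_dict=False)  # positional access: tup[0]
```

Callers that index the result with `[0]` should pass `return_dict=False`; callers that read `.sample` should keep the default.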