Skip to content

fix: no-mmap safetensors loader for >4GB models on Windows ROCm/UMA#14587

Open
HD-LV wants to merge 3 commits into
Comfy-Org:masterfrom
HD-LV:master
Open

fix: no-mmap safetensors loader for >4GB models on Windows ROCm/UMA#14587
HD-LV wants to merge 3 commits into
Comfy-Org:masterfrom
HD-LV:master

Conversation

@HD-LV

@HD-LV HD-LV commented Jun 22, 2026

Copy link
Copy Markdown

Problem

On Windows with AMD ROCm using a UMA (Unified Memory Architecture) GPU (e.g. AMD Radeon 8050S / Strix Halo, gfx1151), loading safetensors files larger than ~4 GB causes an access violation crash.

Root cause: ROCm initializes by reserving ~14 GB of Windows virtual address space for the GPU. This exhausts the contiguous address space needed for mmap, so any file above ~4 GB fails to memory-map.

SD 1.5 checkpoints (~3.97 GB) are below the threshold and load fine. SDXL/RealVisXL (~6.46 GB) reliably crash.

Fix

In comfy/utils.py, when a .safetensors file exceeds 4 GB and CUDA/ROCm is active, route it through a new _load_safetensors_no_mmap() function that reads tensors sequentially via open() + seek() + read() instead of mmap. All other files continue using the existing mmap path — no behavior change for unaffected users.

Tested on

  • Device: AMD Radeon 8050S (gfx1151 / Strix Halo), 14.37 GB UMA VRAM
  • OS: Windows 11
  • ROCm: 6.5 (scottt/rocm-TheRock gfx1151 wheels)
  • PyTorch: 2.7.0a0
  • SD 1.5 (3.97 GB): unaffected, still uses mmap path ✅
  • RealVisXL V4.0 (6.46 GB): loads successfully, 768×1024 portrait ~5 it/s ✅

HD-LV added 2 commits June 22, 2026 18:52
- torch 2.7.0a0 + ROCm 6.5 via scottt/rocm-TheRock gfx1151 wheels
- numpy pinned to 1.26.4 for wheel compatibility
- SD1.5 512x512 20 steps ~5 it/s confirmed stable
- Saved workflow: sd15_test_rocm_workflow.json
- AMD Radeon 8050S, 14.37 GB UMA VRAM correctly detected
Root cause: Strix Halo UMA ROCm init reserves ~14 GB of Windows virtual
address space for GPU. This prevents safetensors from mmap-ing files
larger than ~4 GB (SDXL fp16 ~6.5 GB), causing access violations.
SD1.5 (3.97 GB) is below the threshold and unaffected.

Fix in comfy/utils.py:
- Add _LARGE_FILE_MMAP_THRESHOLD = 4_000_000_000
- Add _load_safetensors_no_mmap(): reads tensors via open()+seek()+read()
  instead of mmap, then clones each tensor for independent ownership
- In load_torch_file(): route files >4 GB with CUDA active through
  _load_safetensors_no_mmap() automatically

Tested: RealVisXL_V4.0.safetensors (6.46 GB) loads and generates
768x1024 portrait images at ~5 it/s on AMD Radeon 8050S (gfx1151).
SD1.5 baseline unaffected (still uses original mmap path).
@coderabbitai

coderabbitai Bot commented Jun 22, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: b4f2f59d-32f5-4c7e-8f34-7709e2c50033

📥 Commits

Reviewing files that changed from the base of the PR and between e912b91 and 0df0b0d.

📒 Files selected for processing (3)
  • baselines/system_info.txt
  • baselines/system_info_realvisxl.txt
  • comfy/utils.py
✅ Files skipped from review due to trivial changes (2)
  • baselines/system_info_realvisxl.txt
  • baselines/system_info.txt
🚧 Files skipped from review as they are similar to previous changes (1)
  • comfy/utils.py

📝 Walkthrough

Walkthrough

A new no-mmap fallback is introduced in comfy/utils.py for loading large safetensors files on Windows+CUDA systems. A constant _LARGE_FILE_MMAP_THRESHOLD (4 GB) and helper _load_safetensors_no_mmap are added; the helper parses the safetensors header and reads each tensor sequentially via seek+read, returning cloned PyTorch tensors. load_torch_file routes to this path when the file exceeds the threshold, aimdo is disabled, and CUDA is available. Alongside the code change, four baseline files are added: a pinned ROCm package environment list, two system-info snapshots (one for SD1.5, one for RealVisXL no-mmap), and a ComfyUI workflow JSON for SD1.5 ROCm testing.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely describes the main change: a no-mmap safetensors loader specifically targeting large models on Windows ROCm/UMA systems.
Description check ✅ Passed The description comprehensively explains the problem (mmap failures with large files on Windows ROCm), root cause (virtual address space exhaustion), solution (sequential read fallback), and testing validation across different model sizes.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@baselines/system_info_realvisxl.txt`:
- Around line 25-27: Replace the hardcoded personal account path in the cd
command with a generic environment variable or placeholder. In the line
containing cd C:\Users\LvHHu\ComfyUI, replace the specific username path LvHHu
with either %USERPROFILE%\ComfyUI (to use the Windows environment variable) or
<COMFYUI_ROOT> (to use a generic placeholder), making the instructions reusable
across different machines and preventing exposure of personal identifiers.

In `@baselines/system_info.txt`:
- Around line 31-33: The baseline documentation at lines 31-33 contains a
personal machine username in the file path C:\Users\LvHHu\ComfyUI, which exposes
private information and reduces portability across different machines. Replace
the hardcoded username LvHHu with a neutral environment variable placeholder
such as %USERPROFILE% so the instructions become C:\Users\%USERPROFILE%\ComfyUI,
or alternatively use a generic placeholder like <COMFYUI_ROOT> to make the
baseline documentation portable and suitable for general use.

In `@comfy/utils.py`:
- Around line 125-126: The _load_safetensors_no_mmap function currently ignores
the device parameter and always returns CPU tensors, which breaks API
consistency with the existing safetensors loading path. Modify the
_load_safetensors_no_mmap function to accept a device parameter and move the
loaded tensors to the specified device before returning them, ensuring it
matches the device handling behavior of the standard safetensors loading path.
Apply the same fix to any other locations mentioned (lines 161-163) where the
device semantics need to be preserved.
- Around line 158-161: The condition on line 158 that checks
torch.cuda.is_available() applies the slower no-mmap loading path too broadly to
all CUDA environments with files greater than 4 GB. Instead of using
torch.cuda.is_available(), scope this fallback to specifically target the
affected platform by checking for Windows (using sys.platform or platform
module) and optionally detecting ROCm specifically, so that only the systems
with the UMA GPU initialization issue use the _load_safetensors_no_mmap fallback
instead of unnecessarily adding load-time and RAM overhead to all other CUDA
systems.
- Around line 139-144: The condition `if raw:` in this block only checks if any
bytes were read, but does not verify that the complete expected amount was read.
A partial read returning fewer bytes than `end - start` would pass this check
and create a corrupted tensor, while a failed read would silently create an
empty tensor. Replace the `if raw:` check with a condition that validates the
length of `raw` equals the expected length `end - start`. If the read is
incomplete or returns the wrong number of bytes, treat it as corruption and
raise an error or handle it appropriately rather than silently creating an empty
tensor with `torch.empty(shape, dtype=dtype)`.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 7fc4afed-789f-403f-a32e-f88f34c4b2e6

📥 Commits

Reviewing files that changed from the base of the PR and between 6978a46 and e912b91.

📒 Files selected for processing (5)
  • baselines/environment_rocm_working.txt
  • baselines/system_info.txt
  • baselines/system_info_realvisxl.txt
  • baselines/workflows/sd15_test_rocm_workflow.json
  • comfy/utils.py

Comment thread baselines/system_info_realvisxl.txt Outdated
Comment thread baselines/system_info.txt Outdated
Comment thread comfy/utils.py Outdated
Comment thread comfy/utils.py Outdated
Comment thread comfy/utils.py Outdated
- utils.py: add device param to _load_safetensors_no_mmap, move tensors
  to target device instead of always returning CPU tensors
- utils.py: validate read length == expected bytes; raise RuntimeError
  on partial/corrupt reads instead of silently creating empty tensors
- utils.py: scope no-mmap fallback to sys.platform == win32 to avoid
  unnecessary overhead on Linux/Mac CUDA systems; add sys import
- baselines: replace hardcoded LvHHu username with %USERPROFILE% in
  startup commands for portability

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants