fix: no-mmap safetensors loader for >4GB models on Windows ROCm/UMA#14587
fix: no-mmap safetensors loader for >4GB models on Windows ROCm/UMA#14587HD-LV wants to merge 3 commits into
Conversation
- torch 2.7.0a0 + ROCm 6.5 via scottt/rocm-TheRock gfx1151 wheels - numpy pinned to 1.26.4 for wheel compatibility - SD1.5 512x512 20 steps ~5 it/s confirmed stable - Saved workflow: sd15_test_rocm_workflow.json - AMD Radeon 8050S, 14.37 GB UMA VRAM correctly detected
Root cause: Strix Halo UMA ROCm init reserves ~14 GB of Windows virtual address space for GPU. This prevents safetensors from mmap-ing files larger than ~4 GB (SDXL fp16 ~6.5 GB), causing access violations. SD1.5 (3.97 GB) is below the threshold and unaffected. Fix in comfy/utils.py: - Add _LARGE_FILE_MMAP_THRESHOLD = 4_000_000_000 - Add _load_safetensors_no_mmap(): reads tensors via open()+seek()+read() instead of mmap, then clones each tensor for independent ownership - In load_torch_file(): route files >4 GB with CUDA active through _load_safetensors_no_mmap() automatically Tested: RealVisXL_V4.0.safetensors (6.46 GB) loads and generates 768x1024 portrait images at ~5 it/s on AMD Radeon 8050S (gfx1151). SD1.5 baseline unaffected (still uses original mmap path).
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (3)
✅ Files skipped from review due to trivial changes (2)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughA new no-mmap fallback is introduced in 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@baselines/system_info_realvisxl.txt`:
- Around line 25-27: Replace the hardcoded personal account path in the cd
command with a generic environment variable or placeholder. In the line
containing cd C:\Users\LvHHu\ComfyUI, replace the specific username path LvHHu
with either %USERPROFILE%\ComfyUI (to use the Windows environment variable) or
<COMFYUI_ROOT> (to use a generic placeholder), making the instructions reusable
across different machines and preventing exposure of personal identifiers.
In `@baselines/system_info.txt`:
- Around line 31-33: The baseline documentation at lines 31-33 contains a
personal machine username in the file path C:\Users\LvHHu\ComfyUI, which exposes
private information and reduces portability across different machines. Replace
the hardcoded username LvHHu with a neutral environment variable placeholder
such as %USERPROFILE% so the instructions become C:\Users\%USERPROFILE%\ComfyUI,
or alternatively use a generic placeholder like <COMFYUI_ROOT> to make the
baseline documentation portable and suitable for general use.
In `@comfy/utils.py`:
- Around line 125-126: The _load_safetensors_no_mmap function currently ignores
the device parameter and always returns CPU tensors, which breaks API
consistency with the existing safetensors loading path. Modify the
_load_safetensors_no_mmap function to accept a device parameter and move the
loaded tensors to the specified device before returning them, ensuring it
matches the device handling behavior of the standard safetensors loading path.
Apply the same fix to any other locations mentioned (lines 161-163) where the
device semantics need to be preserved.
- Around line 158-161: The condition on line 158 that checks
torch.cuda.is_available() applies the slower no-mmap loading path too broadly to
all CUDA environments with files greater than 4 GB. Instead of using
torch.cuda.is_available(), scope this fallback to specifically target the
affected platform by checking for Windows (using sys.platform or platform
module) and optionally detecting ROCm specifically, so that only the systems
with the UMA GPU initialization issue use the _load_safetensors_no_mmap fallback
instead of unnecessarily adding load-time and RAM overhead to all other CUDA
systems.
- Around line 139-144: The condition `if raw:` in this block only checks if any
bytes were read, but does not verify that the complete expected amount was read.
A partial read returning fewer bytes than `end - start` would pass this check
and create a corrupted tensor, while a failed read would silently create an
empty tensor. Replace the `if raw:` check with a condition that validates the
length of `raw` equals the expected length `end - start`. If the read is
incomplete or returns the wrong number of bytes, treat it as corruption and
raise an error or handle it appropriately rather than silently creating an empty
tensor with `torch.empty(shape, dtype=dtype)`.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 7fc4afed-789f-403f-a32e-f88f34c4b2e6
📒 Files selected for processing (5)
baselines/environment_rocm_working.txtbaselines/system_info.txtbaselines/system_info_realvisxl.txtbaselines/workflows/sd15_test_rocm_workflow.jsoncomfy/utils.py
- utils.py: add device param to _load_safetensors_no_mmap, move tensors to target device instead of always returning CPU tensors - utils.py: validate read length == expected bytes; raise RuntimeError on partial/corrupt reads instead of silently creating empty tensors - utils.py: scope no-mmap fallback to sys.platform == win32 to avoid unnecessary overhead on Linux/Mac CUDA systems; add sys import - baselines: replace hardcoded LvHHu username with %USERPROFILE% in startup commands for portability Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Problem
On Windows with AMD ROCm using a UMA (Unified Memory Architecture) GPU (e.g. AMD Radeon 8050S / Strix Halo, gfx1151), loading safetensors files larger than ~4 GB causes an access violation crash.
Root cause: ROCm initializes by reserving ~14 GB of Windows virtual address space for the GPU. This exhausts the contiguous address space needed for
mmap, so any file above ~4 GB fails to memory-map.SD 1.5 checkpoints (~3.97 GB) are below the threshold and load fine. SDXL/RealVisXL (~6.46 GB) reliably crash.
Fix
In
comfy/utils.py, when a.safetensorsfile exceeds 4 GB and CUDA/ROCm is active, route it through a new_load_safetensors_no_mmap()function that reads tensors sequentially viaopen() + seek() + read()instead of mmap. All other files continue using the existing mmap path — no behavior change for unaffected users.Tested on