This is a write-up of what finally made ExecuTorch 1.0.0 LLMs run fast on macOS with the XNNPACK backend, after days of chasing `Error::OperatorMissing` and build issues.
## TL;DR

- Exported with `export_llm` using XNNPACK + TorchAO 8da4w
- Built and included `kernels_torchao.xcframework` (TorchAO ops) in the app
- Ensured ExecuTorch's static operator registrations run at runtime by linking the static libs with `-Wl,-force_load,<libpath>` (one per lib)
- Avoided duplicate linking by not also adding those `.xcframework`s in the "Link Binary With Libraries" build phase
- Removed `-all_load` (it caused huge numbers of duplicate symbols) and fixed malformed linker flags
- Built via the Xcode workspace (Pods) instead of the standalone project
## Symptoms

- Loading an XNNPACK PTE failed with `Error::OperatorMissing (20)`
- The PTE contained `llama::custom_sdpa.out` and `llama::update_cache.out`, but the runtime registry didn't have them
- The portable model also failed prefill with `Error::NotSupported (16)` before the fixes
- Adding `-all_load` "fixed" the registrations but exploded into thousands of duplicate symbols (Skia, etc.)
## Root cause

ExecuTorch's custom operators (e.g., the LLM `llama::*` ops and the TorchAO ops) are registered via static initializers created by `EXECUTORCH_LIBRARY(...)`. When linking static libraries inside `.xcframework`s on Apple platforms, those initializers may not be pulled in unless you force the linker to load the object files that contain them.

If you simply add the `.xcframework`s and no code references the contained symbols, the linker can drop them. The registrations then never execute, and you get `OperatorMissing` at runtime.
## The fix (linker + layout)

- Choose a single, consistent way to link
  - Do NOT both:
    - add the ExecuTorch `.xcframework`s to the target's "Link Binary With Libraries" phase, and
    - also `-force_load` their inner static `.a` libs in Other Linker Flags.
  - Pick one. We picked `-force_load` for the inner `.a` files and removed the `.xcframework` items from "Link Binary With Libraries" to avoid duplicates.
- Force-load the specific static libraries that contain registrations

For our XNNPACK + LLM + TorchAO build, we needed to force-load at least these libs (adjust the path roots if yours differ):

```
-Wl,-force_load,$(SRCROOT)/ExecuTorchFrameworks/executorch.xcframework/macos-arm64/libexecutorch_macos.a
-Wl,-force_load,$(SRCROOT)/ExecuTorchFrameworks/executorch_llm.xcframework/macos-arm64/libexecutorch_llm_macos.a
-Wl,-force_load,$(SRCROOT)/ExecuTorchFrameworks/kernels_llm.xcframework/macos-arm64/libkernels_llm_macos.a
-Wl,-force_load,$(SRCROOT)/ExecuTorchFrameworks/kernels_optimized.xcframework/macos-arm64/libkernels_optimized_macos.a
-Wl,-force_load,$(SRCROOT)/ExecuTorchFrameworks/kernels_quantized.xcframework/macos-arm64/libkernels_quantized_macos.a
-Wl,-force_load,$(SRCROOT)/ExecuTorchFrameworks/kernels_torchao.xcframework/macos-arm64/libkernels_torchao_macos.a
-Wl,-force_load,$(SRCROOT)/ExecuTorchFrameworks/backend_xnnpack.xcframework/macos-arm64/libbackend_xnnpack_macos.a
-Wl,-force_load,$(SRCROOT)/ExecuTorchFrameworks/threadpool.xcframework/macos-arm64/libthreadpool_macos.a
```
Notes:

- Use the single-token `-Wl,-force_load,<path>` form in Xcode's "Other Linker Flags". It's much less error-prone than manually sprinkling `-Xlinker` tokens.
- Do not use `-all_load`: it can pull in everything and cause thousands of duplicate symbols with other deps (e.g., Skia).
- Add header search paths (if you import ObjC headers from ExecuTorch)

```
$(SRCROOT)/ExecuTorchFrameworks/executorch.xcframework/macos-arm64/Headers
$(SRCROOT)/ExecuTorchFrameworks/executorch_llm.xcframework/macos-arm64/Headers
```
- Build with the workspace
  - Use `AlienTavernMobile.xcworkspace` (or your workspace) so CocoaPods products resolve. Building only the `.xcodeproj` can mislead with unrelated missing-lib errors (e.g., DoubleConversion) and isn't how the app is normally linked.
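Typing eight long force-load paths by hand invites the malformed-flag pitfalls described later. If you script any of your project configuration, a small hypothetical helper (not part of ExecuTorch; the path layout matches our `ExecuTorchFrameworks/` tree, so adjust it for yours) can emit the single-token form consistently:

```python
# Hypothetical helper: generate single-token -Wl,-force_load entries for
# Xcode's OTHER_LDFLAGS from a list of ExecuTorch library basenames.
# Assumes the layout <root>/<name>.xcframework/macos-arm64/lib<name>_macos.a,
# which matches the flag list shown above; adjust for your tree.

def force_load_flags(lib_names, root="$(SRCROOT)/ExecuTorchFrameworks"):
    """Return one '-Wl,-force_load,<path>' token per library.

    Emitting each flag as a single token avoids the stray '-Xlinker'
    problem where Xcode treats a lone token as an input path.
    """
    return [
        f"-Wl,-force_load,{root}/{name}.xcframework/macos-arm64/lib{name}_macos.a"
        for name in lib_names
    ]
```

For example, `force_load_flags(["executorch", "kernels_torchao"])` yields the two corresponding entries, ready to paste into Other Linker Flags.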
## TorchAO: include kernels_torchao

When exporting with TorchAO 8da4w, the graph includes `torchao::quantize_affine`, `torchao::dequantize_affine`, `torchao::choose_qparams`, etc. Those kernels are not in `kernels_quantized`; you need `kernels_torchao` too.
How we built the Apple frameworks (including TorchAO):

```
cd <executorch_checkout>
./scripts/build_apple_frameworks.sh --Release --torchao
```

Then we copied the XCFramework(s) into the app's tree (e.g., `ExecuTorchFrameworks/`) and added the `-force_load` entry for `kernels_torchao` as shown above.
## Verifying the operators

Two good checks that caught the issue:

- Inspect the PTE to list the operators in the plan (Python, optional):
  - Convert the PTE flatbuffer to JSON and inspect the operator table
  - Cross-check whether those ops should be provided by the kernels you linked (e.g., TorchAO, LLM custom ops)
- Inspect the runtime registry at startup (C++):
  - Log whether the registry contains the ops you need (e.g., `llama::custom_sdpa.out`, `llama::update_cache.out`)
  - If they're missing, your registrations didn't run: a linking/initializer issue
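As a sketch of the first check: assuming the PTE flatbuffer has already been converted to JSON (e.g., with `flatc` and ExecuTorch's program schema), a short script can list the operator table and diff it against the ops you believe your linked libs register. The field names (`execution_plan`, `operators`, `name`, `overload`) are our reading of the program schema; verify them against your schema version.

```python
# Sketch: list operators from a PTE converted to JSON, then flag any ops
# that none of your linked kernel libraries are expected to provide.
# Field names follow ExecuTorch's program schema as we understood it
# (execution_plan -> operators -> name/overload); double-check per version.

def ops_in_program(program: dict) -> set:
    """Collect fully qualified op names ('ns::op.overload') from all plans."""
    ops = set()
    for plan in program.get("execution_plan", []):
        for op in plan.get("operators", []):
            name = op["name"]
            overload = op.get("overload", "")
            ops.add(f"{name}.{overload}" if overload else name)
    return ops

def missing_ops(program: dict, provided: set) -> set:
    """Ops the PTE needs that you don't expect any linked lib to register."""
    return ops_in_program(program) - provided
```

In our case, running this over the JSON dump (loaded with `json.load`) and a `provided` set built from the kernel libs we linked immediately surfaced the `llama::*` and `torchao::*` ops that had no runtime registration.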
## Common pitfalls we hit

- Malformed `OTHER_LDFLAGS`: a stray `-Xlinker` token on its own makes Xcode treat it as an input path (`…/mobile/macos/-Xlinker` not found)
- Using `-all_load`: caused 6k+ duplicate symbols via transitive static libs (Skia, etc.)
- Adding `.xcframework`s to the Frameworks phase AND also `-force_load`ing their libs: double linking and duplicate symbols
- Forgetting to include `kernels_torchao` when exporting with TorchAO 8da4w
- Building the `.xcodeproj` instead of the workspace: spurious CocoaPods missing-lib warnings
## Minimal troubleshooting checklist

- Confirm the PTE contains only ops you know are provided by the set of ExecuTorch libs you're linking
- At runtime, log `registry_has_op_function("llama::custom_sdpa.out")` and `registry_has_op_function("llama::update_cache.out")`
- If they're missing, fix linking so the static initializers run:
  - Remove `.xcframework`s from "Link Binary With Libraries" for the ExecuTorch libs you will `-force_load`
  - Add `-Wl,-force_load,<path>` for each required `.a` inside those `.xcframework`s
  - Remove `-all_load` and fix any stray `-Xlinker` tokens
- Ensure `kernels_torchao` is included when using TorchAO quantization
- Build the workspace, not the project
## Why this works

The `EXECUTORCH_LIBRARY(namespace, name, fn)` macro emits a static initializer that registers the kernel when its object file is linked and loaded. If the linker never pulls in the object containing the initializer, the registration never runs. `-force_load` guarantees the linker brings in those objects even if nothing directly references them, so the registrations execute.
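The mechanism itself is C++-specific, but the failure mode can be illustrated with a registry analogue (illustrative Python only, not the ExecuTorch API): registration is a side effect of a unit being loaded, so lookup fails whenever that unit is never pulled in.

```python
# Analogue of EXECUTORCH_LIBRARY-style registration (illustrative only).
# In C++, constructing a static object at load time inserts the kernel
# into a global registry; if the linker drops the object file, that
# constructor never runs and later lookups fail with OperatorMissing.

_REGISTRY = {}

class _RegisterKernel:
    """Stand-in for the static object EXECUTORCH_LIBRARY(...) emits."""
    def __init__(self, qualified_name, fn):
        _REGISTRY[qualified_name] = fn  # side effect at "load" time

def registry_has_op_function(name: str) -> bool:
    """Analogue of the runtime registry query used in the checklist."""
    return name in _REGISTRY

# This line plays the role of the static initializer: it only runs if
# this "object file" is actually linked into the final binary.
_sdpa_registration = _RegisterKernel("llama::custom_sdpa.out", lambda *args: None)
```

If the last line is never executed (the linker dropped the object), `registry_has_op_function("llama::custom_sdpa.out")` returns false, which is exactly the state we observed before adding `-force_load`.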
## Suggested docs improvements

- Provide a "linker recipe" for Apple static linking that includes:
  - The list of `.a` libs to include for common scenarios (portable, XNNPACK, LLM, TorchAO)
  - A warning against mixing the Frameworks phase with `-force_load` for the same `.xcframework`
  - Use of `-Wl,-force_load,<path>` and avoiding `-all_load`
- Add an explicit note that TorchAO exports require `kernels_torchao` at runtime
- Include a small runtime snippet to validate op registration (e.g., `registry_has_op_function`)
- Include a "PTE inspection" tip for debugging `OperatorMissing` (flatbuffer → JSON → list ops)
## References

- ExecuTorch 1.0.0
- XNNPACK backend
- TorchAO 8da4w export path via `export_llm`
- GitHub issue: https://github.com/pytorch/executorch/issues/14809
- Video of it running: https://www.youtube.com/shorts/KHXUIlop-1w