ExecuTorch on macOS with XNNPACK: from OperatorMissing to fast inference

This is a write‑up of what it finally took to get ExecuTorch 1.0.0 LLMs running fast on macOS with the XNNPACK backend, after days of chasing Error::OperatorMissing and build issues.

TL;DR

  • Exported with export_llm using XNNPACK + TorchAO 8da4w
  • Built and included kernels_torchao.xcframework (TorchAO ops) in the app
  • Ensured ExecuTorch’s static operator registrations run at runtime by linking the static libs with -Wl,-force_load,<libpath> (per lib)
  • Avoided duplicate linking by not also adding those .xcframeworks in the “Link Binary With Libraries” build phase
  • Removed -all_load (caused huge duplicate symbols) and fixed malformed linker flags
  • Built via the Xcode workspace (Pods) instead of the standalone project

Symptoms

  • Loading an XNNPACK PTE failed with: Error::OperatorMissing (20)
  • The PTE contained llama::custom_sdpa.out and llama::update_cache.out, but the runtime registry didn’t have them
  • Portable model also failed prefill with Error::NotSupported (16) before fixes
  • Adding -all_load “fixed” registrations but exploded into thousands of duplicate symbols (Skia, etc.)

Root cause

ExecuTorch’s custom operators (e.g. LLM llama::* ops, TorchAO ops) are registered via static initializers created by EXECUTORCH_LIBRARY(...). When linking static libraries inside .xcframeworks on Apple, those initializers may not be pulled in unless you force the linker to load the objects that contain them.

If you simply add the .xcframeworks and no code references the contained symbols, the linker can drop them — the registrations don’t execute — and you get OperatorMissing at runtime.

The fix (linker + layout)

  1. Choose a single, consistent way to link
  • Do NOT both:
    • add the ExecuTorch .xcframeworks to the target’s “Link Binary With Libraries” phase, and
    • also -force_load their inner static .a libs in Other Linker Flags.
  • Pick one. We picked -force_load for the inner .a files and removed the .xcframework items from “Link Binary With Libraries” to avoid duplicates.
  2. Force‑load the specific static libraries that contain registrations

For our XNNPACK + LLM + TorchAO build, we needed to force‑load at least these libs (adjust path roots if different):

-Wl,-force_load,$(SRCROOT)/ExecuTorchFrameworks/executorch.xcframework/macos-arm64/libexecutorch_macos.a
-Wl,-force_load,$(SRCROOT)/ExecuTorchFrameworks/executorch_llm.xcframework/macos-arm64/libexecutorch_llm_macos.a
-Wl,-force_load,$(SRCROOT)/ExecuTorchFrameworks/kernels_llm.xcframework/macos-arm64/libkernels_llm_macos.a
-Wl,-force_load,$(SRCROOT)/ExecuTorchFrameworks/kernels_optimized.xcframework/macos-arm64/libkernels_optimized_macos.a
-Wl,-force_load,$(SRCROOT)/ExecuTorchFrameworks/kernels_quantized.xcframework/macos-arm64/libkernels_quantized_macos.a
-Wl,-force_load,$(SRCROOT)/ExecuTorchFrameworks/kernels_torchao.xcframework/macos-arm64/libkernels_torchao_macos.a
-Wl,-force_load,$(SRCROOT)/ExecuTorchFrameworks/backend_xnnpack.xcframework/macos-arm64/libbackend_xnnpack_macos.a
-Wl,-force_load,$(SRCROOT)/ExecuTorchFrameworks/threadpool.xcframework/macos-arm64/libthreadpool_macos.a

Notes:

  • Use the single-token -Wl,-force_load,<path> form in Xcode’s “Other Linker Flags”. It’s much less error‑prone than manually sprinkling -Xlinker tokens.
  • Do not use -all_load — it can pull in everything and cause thousands of duplicate symbols with other deps (e.g., Skia).
  3. Add header search paths (if you import ObjC headers from ExecuTorch)
$(SRCROOT)/ExecuTorchFrameworks/executorch.xcframework/macos-arm64/Headers
$(SRCROOT)/ExecuTorchFrameworks/executorch_llm.xcframework/macos-arm64/Headers
  4. Build with the workspace
  • Use AlienTavernMobile.xcworkspace (or your workspace) so CocoaPods products resolve. Building only the .xcodeproj can surface misleading errors about unrelated missing libs (e.g., DoubleConversion) and isn’t how the app is normally linked.
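Pulling the build settings from the steps above together, the result looks roughly like this illustrative .xcconfig sketch (paths and library names are from our layout and are assumptions — match them to your actual .xcframework tree):

```
// Illustrative .xcconfig sketch — adapt paths to your project layout.
// Force-load the static libs whose object files contain operator
// registrations (one -Wl,-force_load,<path> entry per lib):
OTHER_LDFLAGS = $(inherited) \
    -Wl,-force_load,$(SRCROOT)/ExecuTorchFrameworks/executorch.xcframework/macos-arm64/libexecutorch_macos.a \
    -Wl,-force_load,$(SRCROOT)/ExecuTorchFrameworks/kernels_llm.xcframework/macos-arm64/libkernels_llm_macos.a \
    -Wl,-force_load,$(SRCROOT)/ExecuTorchFrameworks/kernels_torchao.xcframework/macos-arm64/libkernels_torchao_macos.a

// Headers for the ObjC/C++ APIs, if you import them directly:
HEADER_SEARCH_PATHS = $(inherited) \
    $(SRCROOT)/ExecuTorchFrameworks/executorch.xcframework/macos-arm64/Headers \
    $(SRCROOT)/ExecuTorchFrameworks/executorch_llm.xcframework/macos-arm64/Headers
```

Remember that any library listed here under OTHER_LDFLAGS must not also appear in the target’s “Link Binary With Libraries” phase.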

TorchAO: include kernels_torchao

When exporting with TorchAO 8da4w, the graph includes torchao::quantize_affine/dequantize_affine/choose_qparams etc. Those kernels are not in kernels_quantized — you need kernels_torchao too.

How we built the Apple frameworks (including TorchAO):

cd <executorch_checkout>
./scripts/build_apple_frameworks.sh --Release --torchao

Then we copied the XCFramework(s) into the app’s tree (e.g., ExecuTorchFrameworks/) and added the -force_load entry for kernels_torchao as shown above.

Verifying the operators

Two good checks that caught the issue:

  1. Inspect the PTE to list operators in the plan (Python, optional):
  • Convert PTE flatbuffer to JSON and inspect the operator table
  • Cross‑check whether those ops should be provided by kernels you linked (e.g., TorchAO, LLM custom ops)
  2. Inspect the runtime registry at startup (C++):
  • Log whether the registry contains the ops you need (e.g., llama::custom_sdpa.out, llama::update_cache.out)
  • If missing, your registrations didn’t run — a linking/initializer issue
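For check 1, a full flatbuffer-to-JSON round trip isn’t strictly required as a first pass: operator names like llama::custom_sdpa.out are stored as plain strings inside the PTE, so a byte-level scan already surfaces candidates. A minimal sketch (the regex and this heuristic approach are my own, not an ExecuTorch tool):

```python
import re

def scan_pte_for_ops(pte_path: str) -> list[str]:
    """Heuristically list namespaced operator names (e.g. llama::custom_sdpa.out)
    that appear as raw strings inside a .pte file."""
    with open(pte_path, "rb") as f:
        data = f.read()
    # Matches "namespace::op_name.overload"-style tokens in the raw bytes.
    pattern = re.compile(rb"[A-Za-z_][A-Za-z0-9_]*::[A-Za-z0-9_.]+")
    return sorted({m.group().decode("ascii") for m in pattern.finditer(data)})
```

Anything this prints that isn’t covered by the libs you force-load is a likely OperatorMissing candidate; for an authoritative list, fall back to the flatbuffer → JSON route.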

Common pitfalls we hit

  • Malformed OTHER_LDFLAGS: a stray -Xlinker token on its own makes Xcode treat it as an input path (…/mobile/macos/-Xlinker not found)
  • Using -all_load: caused 6k+ duplicate symbols via transitive static libs (Skia, etc.)
  • Adding .xcframeworks to the Frameworks phase AND also -force_loading their libs: double‑linking and duplicate symbols
  • Forgetting to include kernels_torchao when exporting with TorchAO 8da4w
  • Building the .xcodeproj instead of the workspace: spurious CocoaPods missing‑lib warnings

Minimal troubleshooting checklist

  1. Confirm the PTE contains only ops you know are provided by the set of ExecuTorch libs you’re linking
  2. At runtime, log registry_has_op_function("llama::custom_sdpa.out") and registry_has_op_function("llama::update_cache.out")
  3. If missing, fix linking so static initializers run:
    • Remove .xcframeworks from “Link Binary With Libraries” for the ExecuTorch libs you will -force_load
    • Add -Wl,-force_load,<path> for each required .a inside those .xcframeworks
    • Remove -all_load and fix any stray -Xlinker tokens
  4. Ensure kernels_torchao is included when using TorchAO quantization
  5. Build the workspace, not the project

Why this works

The EXECUTORCH_LIBRARY(namespace, name, fn) macros emit static initializers that register kernels when their object files are linked and loaded. If the linker never pulls in the object containing the initializer, the registration never runs. -force_load guarantees the linker brings in those objects even if nothing directly references them, and the registrations execute.

Suggested docs improvements

  • Provide a “linker recipe” for Apple static linking that includes:
    • The list of .a libs to include for common scenarios (portable, XNNPACK, LLM, TorchAO)
    • A warning against mixing the Frameworks phase with -force_load for the same .xcframework
    • Use of -Wl,-force_load,<path> and avoiding -all_load
  • Add an explicit note that TorchAO exports require kernels_torchao at runtime
  • Include a small runtime snippet to validate op registration (e.g., registry_has_op_function)
  • Include a “PTE inspection” tip for debugging OperatorMissing (flatbuffer → JSON → list ops)

References

  • ExecuTorch 1.0.0
  • XNNPACK backend
  • TorchAO 8da4w export path via export_llm
  • GitHub issue: https://github.com/pytorch/executorch/issues/14809

Video of it running: https://www.youtube.com/shorts/KHXUIlop-1w
