You can distribute an LLM across TPU, GPU, and CPU by assigning compute-heavy layers to accelerators and offloading less compute-intensive or memory-heavy components to the CPU via a device map.
Below is a minimal sketch using Hugging Face Accelerate. The model name, checkpoint path, and layer split are illustrative assumptions; the sketch maps layers across one GPU and the CPU, and the same pattern extends to other accelerators (TPUs normally require a PyTorch/XLA setup):
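
```python
import torch
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM

# GPT-2 stands in for a larger LLM; the checkpoint path is an illustrative
# assumption and should point at weights saved with save_pretrained for this
# architecture (a single weights file or a sharded checkpoint folder).
model_name = "gpt2"
checkpoint_path = "/path/to/gpt2-checkpoint"

# Build the model skeleton on the meta device, without allocating real weights.
config = AutoConfig.from_pretrained(model_name)
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# Fall back to CPU if no accelerator is present.
gpu = 0 if torch.cuda.is_available() else "cpu"

# Manual device_map: compute-heavy transformer blocks go to the accelerator,
# while embeddings and the final layers stay on the CPU. Module names follow
# the GPT-2 architecture; adjust them for other models.
device_map = {
    "transformer.wte": "cpu",
    "transformer.wpe": "cpu",
    "transformer.drop": "cpu",
    **{f"transformer.h.{i}": gpu for i in range(0, 8)},     # hot layers on the accelerator
    **{f"transformer.h.{i}": "cpu" for i in range(8, 12)},  # remaining layers offloaded
    "transformer.ln_f": "cpu",
    "lm_head": "cpu",
}

# Load only the checkpoint chunks each device needs and dispatch every
# sub-module to its assigned location.
model = load_checkpoint_and_dispatch(
    model,
    checkpoint=checkpoint_path,
    device_map=device_map,
    no_split_module_classes=["GPT2Block"],
)

print(model.hf_device_map)
```
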

The code above relies on the following key points:

- A manual `device_map` assigns each model component to a specific device.
- `load_checkpoint_and_dispatch` loads only the checkpoint chunks each device needs, so no single device has to hold the full model.
- Accelerate attaches dispatch hooks so the model runs transparently across the heterogeneous devices.
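
Once dispatched, the model behaves like a regular `transformers` model. Here is a brief usage sketch, assuming the dispatched `model` and `model_name` from above:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer("Device mapping lets one model span several devices.", return_tensors="pt")

# Inputs can stay on the CPU: the dispatch hooks move activations to each
# module's assigned device (and back) as the forward pass crosses boundaries.
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```
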
Hence, cross-device mapping enables scalable, cost-efficient LLM deployment across whatever hardware tiers are available.