We are aware of the current limitations in the API:

- `infer_auto_device_map()` (or `device_map="auto"` in `load_checkpoint_and_dispatch()`) tries to maximize the GPU and CPU RAM it sees available when you execute it. While PyTorch is very good at managing GPU RAM efficiently (and giving it back when not needed), the same is not entirely true of Python and CPU RAM, so an automatically computed device map may claim more CPU RAM than is actually safe to use. This will be fixed in further development. In the meantime you can cap what the automatic mapping is allowed to use, as sketched below.
- While this could theoretically work on just one CPU with potential disk offload, you need at least one GPU to run this API.
- To be the most efficient, make sure your device map puts the parameters on the GPUs in a sequential manner (e.g. don't put one of the first weights on GPU 0, then weights on GPU 1, and the last weight back on GPU 0) to avoid making many transfers of data between the GPUs. See the hand-written map sketched after this list.
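
As a workaround for the CPU RAM issue, you can pass explicit per-device limits through the `max_memory` argument of `infer_auto_device_map()` so the computed map stays below what Python reports as free. A minimal sketch, assuming a hypothetical checkpoint name and illustrative memory limits:

```python
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

# Build the model skeleton on the meta device, without allocating real weights.
checkpoint = "bigscience/bloom-3b"  # hypothetical checkpoint, for illustration only
config = AutoConfig.from_pretrained(checkpoint)
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# Cap each device explicitly so the computed map does not try to
# maximize all the RAM it sees; the limits here are illustrative.
device_map = infer_auto_device_map(
    model,
    max_memory={0: "10GiB", "cpu": "20GiB"},
)
print(device_map)
```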
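
To illustrate the sequential layout, here is a hand-written device map for a model split across two GPUs. The module names (`transformer.h.N` and so on) are hypothetical placeholders; the keys you need depend on your model's actual structure:

```python
from accelerate import load_checkpoint_and_dispatch

# Sequential layout: the first blocks live on GPU 0, the remaining ones on
# GPU 1, and nothing jumps back to GPU 0, so activations cross the
# GPU boundary only once per forward pass.
device_map = {
    "transformer.word_embeddings": 0,
    "transformer.h.0": 0,
    "transformer.h.1": 0,
    "transformer.h.2": 1,
    "transformer.h.3": 1,
    "transformer.ln_f": 1,
    "lm_head": 1,
}

model = load_checkpoint_and_dispatch(
    model,                            # the meta-initialized model from above
    checkpoint="path/to/checkpoint",  # placeholder path
    device_map=device_map,
)
```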