FSDP · Fully Sharded Data Parallel
A training method that shards model parameters, gradients and optimizer state across GPUs to fit larger models.
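As a rough illustration of why sharding helps, the sketch below estimates how much memory the training state takes per GPU with and without sharding. The function name `per_gpu_state_bytes` and the byte sizes are assumptions made for this example (fp32 parameters and gradients, plus an Adam-style optimizer keeping two fp32 values per parameter). It ignores activations, communication buffers, and parameters that are temporarily gathered during computation, so real usage will be higher.

```python
def per_gpu_state_bytes(num_params, num_gpus, sharded=True,
                        param_bytes=4, grad_bytes=4, optim_bytes=8):
    """Estimate bytes of training state held by one GPU.

    Assumed defaults (not from the glossary entry): fp32 parameters (4 B),
    fp32 gradients (4 B), and an Adam-style optimizer with two fp32
    values per parameter (8 B).
    """
    total = num_params * (param_bytes + grad_bytes + optim_bytes)
    if not sharded:
        # Plain data parallelism: every GPU keeps a full replica.
        return total
    # Sharded: each GPU keeps roughly 1/num_gpus of the state (rounded up).
    return -(-total // num_gpus)


if __name__ == "__main__":
    n = 1_000_000_000  # hypothetical 1B-parameter model
    print(per_gpu_state_bytes(n, 8, sharded=False) / 1e9, "GB replicated")
    print(per_gpu_state_bytes(n, 8, sharded=True) / 1e9, "GB sharded")
```

Under these assumptions the per-GPU state shrinks roughly in proportion to the number of GPUs, which is what lets a larger model fit.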