MegaTrain: Full Precision Training of 100B+ Parameter LLMs on a Single GPU


--- Comments ---

- 1aurent29: sounds very similar to https://docs.pytorch.org/docs/stable/distributed.fsdp.fully_shard.html#torch.distributed.fsdp.CPUOffloadPolicy

i wonder how much this could be replicated using only this pytorch primitive
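For reference, a minimal sketch of the primitive linked above: FSDP2's `fully_shard` with `CPUOffloadPolicy` keeps sharded parameters, gradients, and optimizer state in (optionally pinned) host memory and copies each wrapped module's parameters to the GPU only around its forward/backward, and it also works with a single-rank process group. The toy model and launch setup here are illustrative, not from the article:

```python
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.fsdp import fully_shard, CPUOffloadPolicy

# Requires a process group even on one GPU,
# e.g. launched via: torchrun --nproc_per_node=1 this_script.py
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())

# Illustrative toy model; apply fully_shard per submodule so each one's
# parameters are streamed to the device independently, then to the root.
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
for layer in model:
    fully_shard(layer, offload_policy=CPUOffloadPolicy(pin_memory=True))
fully_shard(model, offload_policy=CPUOffloadPolicy(pin_memory=True))

# Optimizer state then also lives on the CPU alongside the parameters.
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
```

How closely this matches MegaTrain's behavior (prefetch depth, overlap of transfers with compute) is exactly the open question in the comment.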

- internetguy: > MegaTrain stores parameters and optimizer states in host memory (CPU memory) and treats GPUs as transient compute engines. For each layer, we stream parameters in and compute gradients out, minimizing persistent device state

  This is pretty awesome. The only compute I have at home is an RTX 3080 with 10 GB of VRAM, so I struggle with training larger models (>40M–50M params). I get OOM errors and have to optimize a lot.

  I have a lot more CPU RAM in my PC, and this would likely increase the size of models I can train locally.
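The scheme in the quoted passage can be sketched in plain PyTorch: parameters and gradients live in host memory, and each layer visits the device only transiently, with the backward pass recomputing one layer at a time. This is an illustration under assumptions, not MegaTrain's actual implementation; all names are made up here:

```python
import torch
import torch.nn as nn

# Falls back to CPU when no GPU is present, so the logic is testable anywhere.
device = "cuda" if torch.cuda.is_available() else "cpu"

def streamed_train_step(layers, x, target, loss_fn):
    # Phase 1: forward, one layer resident on the device at a time;
    # activations are stashed back on the host for the backward phase.
    acts = [x]
    with torch.no_grad():
        h = x.to(device)
        for layer in layers:
            layer.to(device)        # stream parameters in
            h = layer(h)
            acts.append(h.cpu())    # stash activation on the host
            layer.to("cpu")         # release the device copy

    # Loss and its gradient w.r.t. the final activation.
    y = acts[-1].to(device).requires_grad_()
    loss = loss_fn(y, target.to(device))
    (grad_out,) = torch.autograd.grad(loss, y)

    # Phase 2: backward in reverse, recomputing each layer's forward so
    # only one layer's autograd graph ever lives on the device; computed
    # parameter gradients stream back out to host memory.
    for i in range(len(layers) - 1, -1, -1):
        layer = layers[i].to(device)
        inp = acts[i].to(device).requires_grad_()
        out = layer(inp)
        params = list(layer.parameters())
        grads = torch.autograd.grad(out, [inp] + params, grad_out)
        grad_out = grads[0]         # gradient flowing to the previous layer
        layer.to("cpu")             # move params home before attaching grads
        for p, g in zip(params, grads[1:]):
            p.grad = g.cpu()        # gradient lives on the host
    return loss.detach().cpu()
```

A CPU-resident optimizer (e.g. `torch.optim.SGD(layers.parameters(), lr=...)`) can then step entirely in host memory; the recompute-in-backward choice trades extra FLOPs for never holding more than one layer's graph on the device.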

- WithinReason: I was wondering how well this would work :) You can definitely push this further, the question is: how well can the gradients and updates compress?
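A toy experiment for the compression question: naively quantize a gradient tensor to int8 with a per-tensor scale before the device-to-host copy, shrinking transfer volume 4x versus fp32. This sketch is not from the article, and real schemes (error feedback, 1-bit optimizers) are considerably more involved:

```python
import torch

def quantize_int8(t: torch.Tensor):
    # Symmetric per-tensor quantization; clamp guards against all-zero input.
    scale = t.abs().max().clamp(min=1e-12) / 127.0
    q = (t / scale).round().clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor):
    return q.to(torch.float32) * scale

grad = torch.randn(1_000_000)           # stand-in for a gradient tensor
q, scale = quantize_int8(grad)
approx = dequantize_int8(q, scale)
rel_err = (grad - approx).norm() / grad.norm()
# int8 payload is 4x smaller than fp32; for Gaussian data this naive
# scheme lands around 1% relative error, which is the kind of trade-off
# the comment is asking about.
```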

- olliepro: This would likely only get used for small finetuning jobs. It’s too slow for the scale of pretraining.

- l1n: Seems similar to DeepSpeed.
