d4l3k 15 hours ago

Hey, nice to see this here!

I'm the primary author so happy to answer any questions you might have!

  • bwfan123 19 minutes ago

    Why isn't there more investment in semi-synchronous training? Is it that the convergence is iffy? Also, it would be great to refactor this code into a typed language, so it is easier to reason about and maintain.

zxexz 9 hours ago

This is awesome, can’t wait to try out these techniques. At least a week a year of my time for the past few years has gone towards recovering from a fault crashing a training run. Sometimes it's environment-related, sometimes shared storage, sometimes just a slightly faulty IB cable.

bjt12345 14 hours ago

This is severely underrated work. Why aren't there more mid-sized companies helping with this? Ultra Ethernet just got released.

  • foobiekr 9 hours ago

    Ultra Ethernet will do almost nothing. It’s a rubber-stamped version of Broadcom’s design, and Marvell/Cisco/etc. will just add it to their ASICs. It remains to be seen whether Spectrum-X or ConnectX will. If not, none of it matters.

    These chips are $30m-$100m projects a pop. After the embarrassingly brutal failure of Barefoot nobody is going to do ASICs.

anonymousDan 6 hours ago

What kind of failures are you typically concerned with here?

timzaman 15 hours ago

300 L40s? What's this, 1998?

  • d4l3k 14 hours ago

    Hey Tim, how's it going?

    Interested in lending PyTorch some compute? :)

    torchft can handle much larger scales, but for a public multi-day demonstration run this is what we had available. The point of this blog post was to demonstrate the correctness of the quorum algorithm and recovery with a stock PyTorch stack, not so much peak FLOPS.

    Stay tuned though -- planning on doing some much larger demos on B200s!