Profiling in PyTorch (Part 1): A Beginner's Guide to torch.profiler

#1 Jun 01, 2026 (edited Jun 16, 2026)

Alright team, finally diving into the world of PyTorch profiling. If you're trying to squeeze every last drop of performance out of your models, you *have* to stop guessing and start seeing what's actually happening under the hood. Part 1 of this guide on `torch.profiler` is a solid starting point, but let's be real: the traces look like a mess of colored blocks at first.

The core takeaway here is that profiling isn't magic; it's just a structured way to ask "why?" about your slow code. The article starts with the simplest thing: a matrix multiplication and an addition. The genius move is using those simple traces to break down the whole pipeline, from the Python call all the way down to the actual CUDA kernel execution. Seeing where the CPU lane and the GPU lane disagree on timing is where the real insights pop up, since CUDA kernels are launched asynchronously and the CPU side can be done well before the GPU is.
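For anyone who wants to poke at this before reading the article, here's a minimal sketch of profiling a matmul plus an add. It's my own toy version, not the article's code: the function name `profile_matmul_add`, the `matmul_add` label, and the 256x256 size are all made up for illustration. It falls back to CPU-only if there's no GPU.

```python
import torch
from torch.profiler import profile, record_function, ProfilerActivity


def profile_matmul_add(n=256):
    """Profile `a @ b + bias` once (hypothetical helper, sizes are arbitrary)."""
    activities = [ProfilerActivity.CPU]
    device = "cpu"
    if torch.cuda.is_available():
        # Also record the GPU lane when a CUDA device is present.
        activities.append(ProfilerActivity.CUDA)
        device = "cuda"

    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    bias = torch.randn(n, device=device)

    with profile(activities=activities) as prof:
        # record_function adds a named region so the ops are easy to find.
        with record_function("matmul_add"):
            out = a @ b + bias
        if device == "cuda":
            # Kernel launches are asynchronous; wait so the GPU work
            # finishes inside the profiled region.
            torch.cuda.synchronize()
    return out, prof


out, prof = profile_matmul_add()

# Per-operator summary, sorted by total CPU time.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))

# To get the colored-blocks view, uncomment and open the file in
# Perfetto or chrome://tracing:
# prof.export_chrome_trace("matmul_add_trace.json")
```

The table is the quick way to see which ops dominate; the exported trace is where you actually see the CPU and GPU lanes side by side.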

The mention of `torch.compile` and how it changes the kernel execution is super interesting. It hints at the real optimization battleground: figuring out whether the up-front cost of compiling pays for itself in faster runs afterwards.
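One rough way to put a number on that trade-off is to time the first call (which includes compilation) separately from the later calls, for both the eager and the compiled version. This is a sketch under my own assumptions, not anything from the article: `time_calls`, `compare_eager_vs_compiled`, and `toy_workload` are hypothetical names, and wall-clock timing like this is noisy, so treat the output as a ballpark.

```python
import time

import torch


def _sync():
    # GPU work is asynchronous, so wait for it before reading the clock.
    if torch.cuda.is_available():
        torch.cuda.synchronize()


def time_calls(fn, x, iters=20):
    """Return (first_call_seconds, mean_seconds_per_later_call)."""
    _sync()
    start = time.perf_counter()
    fn(x)
    _sync()
    first = time.perf_counter() - start

    start = time.perf_counter()
    for _ in range(iters):
        fn(x)
    _sync()
    steady = (time.perf_counter() - start) / iters
    return first, steady


def compare_eager_vs_compiled(fn, x, compile_fn=torch.compile, iters=20):
    """Estimate how many calls it takes for compilation to pay off."""
    eager_first, eager_steady = time_calls(fn, x, iters)
    compiled_first, compiled_steady = time_calls(compile_fn(fn), x, iters)

    # Approximate one-time cost: first compiled call minus a normal one.
    overhead = max(0.0, compiled_first - compiled_steady)
    saving = eager_steady - compiled_steady  # per-call gain, may be <= 0
    break_even = overhead / saving if saving > 0 else float("inf")
    return {
        "eager_steady_s": eager_steady,
        "compiled_steady_s": compiled_steady,
        "compile_overhead_s": overhead,
        "break_even_calls": break_even,
    }


def toy_workload(x):
    # Hypothetical stand-in for a model: a matmul plus elementwise ops.
    return torch.nn.functional.gelu(x @ x.T) + x.sum()
```

Calling `compare_eager_vs_compiled(toy_workload, torch.randn(512, 512))` gives a dict with the steady-state times and an estimated break-even call count; `inf` means the compiled version was not faster in that run. For the "why", you still want the profiler trace of both versions, since the timing only tells you *whether* it's faster.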

My take? The biggest hurdle, as the article points out, is the expectation that you can immediately read the profiler output. You can't, and at first it feels like a chore. The series approach of starting simple and building up to LLMs is exactly what you need. Don't just run the profiler; interrogate it. That's how you move from "it's slow" to "here's exactly *why* it's slow and *how* to fix it." Definitely worth setting aside time to actually open those traces.

Source: https://huggingface.co/blog/torch-profiler