What I have been reading: What is an ML compiler?
1 What is an ML compiler?
It all started with a big failure. I had bought a Coral TPU with the intention of building webcam software that modifies the user’s appearance, something similar to https://avatarify.ai/.
That turned out not to be possible, as the Coral TPU only supports a subset of TensorFlow Lite operations. A clear write-up of those findings and learnings is probably a project for later.
Along the way I came across Pete Warden’s article, and a few other resources, and I wanted to collect my thoughts and notes here.
This is not a deep dive into implementation details but rather an attempt to connect the dots on what ML compilers are, why they’re challenging, and where the ecosystem is headed.
2 ML compilers aim to optimize model execution
In frameworks like TensorFlow or PyTorch, ML models are represented as graphs—directed acyclic graphs (DAGs) of computations. These frameworks typically interpret the graph at runtime, similar to how Python interprets code line by line.
An ML compiler takes this model graph and optimizes it for performance and/or portability. Instead of executing the model exactly as defined, the compiler transforms the graph into a more efficient form or into a representation that can run on a wider range of devices (CPUs, GPUs, TPUs, edge hardware, etc.).
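As a small illustration, here is a minimal sketch assuming PyTorch 2.x: the same function can be run eagerly, with each operation dispatched one by one, or handed to torch.compile, which captures the graph and rewrites it before execution. The function and shapes are made up for the example.

```python
import torch

def layer(x, w, b):
    # one "layer" of the graph: matmul -> add -> relu
    return torch.relu(x @ w + b)

x, w, b = torch.randn(8, 16), torch.randn(16, 16), torch.randn(16)

eager_out = layer(x, w, b)               # interpreted: ops dispatched at runtime
compiled_layer = torch.compile(layer)    # graph is captured and optimized
compiled_out = compiled_layer(x, w, b)   # first call triggers compilation

print(torch.allclose(eager_out, compiled_out, atol=1e-5))
```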
For example, XLA (Accelerated Linear Algebra), TensorFlow’s compiler, takes the layers of a graph and converts them into HLOs (High-Level Operations). These HLOs form an intermediate representation (IR) that XLA can analyze and optimize before generating code for the target device.
The “high” in High-Level Operation refers either to the level of abstraction or to the fact that it sits at the top of XLA’s compilation pipeline.
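To see what that IR looks like, here is a minimal sketch using JAX, which also lowers to XLA; the tiny model and shapes are invented for the example, and depending on the JAX version the printed text may be StableHLO rather than classic HLO.

```python
import jax
import jax.numpy as jnp

def model(x, w, b):
    return jax.nn.relu(x @ w + b)

x = jnp.ones((1, 4))
w = jnp.ones((4, 4))
b = jnp.ones((4,))

# lower() runs the front half of the pipeline and stops before backend
# code generation, so the intermediate representation can be inspected.
lowered = jax.jit(model).lower(x, w, b)
print(lowered.as_text())
```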
3 Not the same as your standard compiler
For traditional software engineers, a compiler usually means something straightforward:
- Take a text file (e.g., C++ source code)
- Turn it into a binary executable
- Run it directly on the target platform
ML compilers, on the other hand, often don’t produce a final executable. Instead, they transform the model into another intermediate representation. In many cases, the “compiled” model is not ready to execute on its own. It requires further processing before running.
This can make the term “compiler” a bit misleading. It’s less like GCC and more like a pipeline of transformations where performance tuning, graph simplification, and device-specific optimizations happen in stages.
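For instance, a converted TensorFlow Lite model is a FlatBuffer, not an executable: it still needs an interpreter (and, for the Coral TPU, an extra Edge TPU compilation step plus a delegate) to actually run. A minimal sketch, with model.tflite as a placeholder path:

```python
import numpy as np
import tensorflow as tf

# Load the "compiled" model; the interpreter supplies the actual kernels.
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Feed a dummy tensor matching the model's expected input.
x = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], x)
interpreter.invoke()
y = interpreter.get_tensor(output_details[0]["index"])
```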
4 Early stage of standardization
Machine learning is still in its wild-west era: everybody defines their own functions, operators, and layers, and reuse is rare.
Unlike C++, which has around 60 keywords and roughly 105 STL algorithms, ML models have no common “vocabulary” of operations.
Some symptoms:
- Even small 1% performance gains can lead teams to define new custom operators.
- These operators may improve benchmark scores but hurt portability.
- When it comes time to deploy models across devices, you quickly discover that many operations aren’t supported on certain hardware.
What’s missing is a meta-language for layers—a standard abstraction layer that frameworks, compilers, and hardware vendors could agree on. Without this, interoperability remains painful.
5 Digging Deeper: High-Level IRs as a Key Abstraction
A great explanation of this comes from Udit Agarwal’s article.
Unlike traditional compilers, where intermediate representations (IRs) sit close to the hardware, “high-level IRs are hardware-agnostic and provide a much-needed abstraction.”
These IRs provide:
- A unified representation of the model
- A way to perform graph-level optimizations
- An abstraction layer that allows targeting multiple backends
Because ML models are represented as DAGs, the IR captures both the operations (nodes) and the data dependencies (edges). These DAGs can be symbolic (fully defined before execution, like in TensorFlow 1.x) or imperative (built on the fly, like in PyTorch).
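As a hand-rolled illustration (not any framework’s real data structure), a graph IR for y = relu(x @ w + b) might look like this:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    op: str                                     # "input", "matmul", "add", "relu", ...
    inputs: list = field(default_factory=list)  # edges to producer nodes

# Build the DAG for relu(x @ w + b)
x, w, b = Node("input"), Node("input"), Node("input")
out = Node("relu", [Node("add", [Node("matmul", [x, w]), b])])

# A compiler pass is just a walk over this structure, e.g. printing it
# or looking for patterns such as fusable chains of element-wise ops.
def walk(node, depth=0):
    print("  " * depth + node.op)
    for producer in node.inputs:
        walk(producer, depth + 1)

walk(out)
```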
6 Graph Optimization: Making Models Faster
Once a model is converted into a graph, ML compilers apply a range of optimization techniques:
- Operator fusion – Combine multiple layers into a single kernel
- Constant folding – Precompute values where possible
- Memory optimizations – Reuse buffers and reduce allocations
- Quantization – Use lower-precision arithmetic where safe
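As a toy illustration of the first two techniques (not any real compiler’s API), the sketch below folds a constant subexpression ahead of time and fuses the element-wise multiply, add, and ReLU into a single pass over the data:

```python
import numpy as np

# Unoptimized: the constant (scale * 2.0) is recomputed on every call and the
# three element-wise ops each make a separate pass over the data.
def forward(x, scale, bias):
    t1 = x * (scale * 2.0)
    t2 = t1 + bias
    return np.maximum(t2, 0.0)

# "Compiled" version: constant folding happens once, and multiply/add/relu
# are fused into a single expression (a single kernel on real hardware).
def compile_forward(scale, bias):
    folded_scale = scale * 2.0                           # constant folding
    def fused(x):
        return np.maximum(x * folded_scale + bias, 0.0)  # operator fusion
    return fused

x = np.random.randn(4, 4)
fused = compile_forward(scale=0.5, bias=1.0)
print(np.allclose(forward(x, 0.5, 1.0), fused(x)))
```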
These transformations can significantly improve performance on specialized hardware. If you want a deeper dive, check out another of Agarwal’s articles.