How Fast Is MLX? A Comprehensive Benchmark on 8 Apple Silicon Chips and 4 CUDA GPUs

A benchmark of the main operations and layers on MLX, PyTorch MPS and CUDA GPUs.

Image by author: Example of benchmark on the softmax operation

In less than two months since its first release, MLX, the latest creation of Apple’s ML research team, has already made significant strides in the ML community. It is remarkable how quickly the new framework has garnered attention, as evidenced by over 12k stars on GitHub and a growing community of over 500 members on Hugging Face 🤗.

In a previous article, we demonstrated how MLX performs in training a simple Graph Convolutional Network (GCN), benchmarking it against various devices including CPU, PyTorch’s MPS, and CUDA GPUs. The results were enlightening and showed the potential of MLX in running models efficiently.

In this exploration, we delve deeper, setting out to benchmark multiple key operations commonly leveraged in neural networks.

In our benchmark, each operation is evaluated across a variety of experiments with varying input shapes and sizes. We ran the experiments sequentially, multiple times each, in separate processes to ensure stable and reliable runtime measurements.
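To make the methodology concrete, here is a minimal sketch of what such a timing loop can look like in MLX. The function name, warm-up count, and iteration count are illustrative assumptions rather than the benchmark’s actual harness; the key detail is that MLX evaluates lazily, so `mx.eval()` must force the computation before the timer stops.

```python
import time
import mlx.core as mx

# Minimal sketch of a timing loop (illustrative, not the benchmark's
# actual code). MLX builds a lazy graph, so mx.eval() is required to
# force the computation before reading the clock.
def time_mlx_op(op, *args, warmup=5, iters=100):
    for _ in range(warmup):
        mx.eval(op(*args))          # warm-up runs, excluded from timing
    start = time.perf_counter()
    for _ in range(iters):
        mx.eval(op(*args))          # timed runs
    return (time.perf_counter() - start) / iters

# Example: average runtime of softmax on a 1024x1024 input
x = mx.random.uniform(shape=(1024, 1024))
avg = time_mlx_op(lambda a: mx.softmax(a, axis=-1), x)
print(f"softmax: {avg * 1e3:.3f} ms")
```

On the PyTorch MPS side, the analogous pitfall is asynchronous execution: a call to `torch.mps.synchronize()` is needed before stopping the timer, playing the same role that `mx.eval()` does here.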

In the spirit of open collaboration, we’ve made the benchmark code open source and easy to run, so contributors can easily add their own benchmarks for their device and configuration.

Note: many thanks to all contributors, without whom this benchmark wouldn’t cover as many chips.

We successfully ran this benchmark across 8 different Apple Silicon chips and 4 high-end CUDA GPUs:

Apple Silicon: M1, M1 Pro, M2, M2 Pro, M2 Max, M2 Ultra, M3 Pro, M3 Max

CUDA GPU: RTX 4090 16GB (Laptop), Tesla V100 32GB (NVLink), Tesla V100 32GB (PCIe), A100 80GB (PCIe)