giancarlostoro 2 hours ago

Hoping the author can answer; I'm still learning about how this all works. My understanding is that inference is "using the model," so to speak. How is this faster than established inference engines, specifically on Mac? Are models generic enough that if you built an inference engine focused on, e.g., AMD GPUs or even Intel GPUs, it would achieve reasonable performance? I always assumed that because Nvidia is king of AI you had to suck it up, or is it just that most of the inference engines in use are married to Nvidia?

I would love to understand how universal these models can become.

smpanaro 5 hours ago

In practice, how often do the models use the ANE? It sounds like you're optimizing for speed, which in my experience always favors the GPU.

  • AlekseiSavin 4 hours ago

    You're right, modern edge devices are powerful enough to run small models, so the real bottleneck for a forward pass is usually memory bandwidth, which defines the upper theoretical limit for inference speed. Right now, we've figured out how to run computations in a granular way on specific processing units, but we expect the real benefits to come later, when we add support for VLMs and advanced speculative decoding, where you process more than one token at a time.
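
    As a rough illustration of that bandwidth ceiling, here's a back-of-the-envelope sketch (the numbers are assumptions for illustration: ~100 GB/s of unified-memory bandwidth, every weight read once per generated token, KV-cache traffic and compute ignored):

        // Roofline estimate: decoding is memory-bound, so the ceiling on
        // tokens/sec is roughly bandwidth / bytes-read-per-token.
        fn max_tokens_per_sec(params: f64, bytes_per_param: f64, bandwidth_gb_s: f64) -> f64 {
            let bytes_per_token = params * bytes_per_param; // weights read once per token
            (bandwidth_gb_s * 1e9) / bytes_per_token
        }

        fn main() {
            let bandwidth = 100.0; // GB/s, assumed M2-class unified memory
            println!("0.6B bf16: ~{:.0} tok/s", max_tokens_per_sec(0.6e9, 2.0, bandwidth));
            println!("8B q4:    ~{:.0} tok/s", max_tokens_per_sec(8e9, 0.5, bandwidth));
        }

    Speculative decoding raises that ceiling because several candidate tokens get verified against a single pass over the weights.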

    • J_Shelby_J 4 hours ago

      VLMs = very large models?

      • mmorse1217 4 hours ago

        Probably vision language models.

greggh 5 hours ago

"trymirai", every time I hear the word Mirai I think of the large IOT DDoS botnet. Maybe it's just me though.

  • fnord77 an hour ago

    I think of the goofy Toyota fuel-cell car. I think a grand total of about 6 have been sold (leased) in California.

rnxrx 5 hours ago

I'm curious why the performance gains mentioned were so substantial for Qwen vs. Llama.

  • AlekseiSavin 4 hours ago

    It looks like llama.cpp has some performance issues with bf16.

skybrian 4 hours ago

What are the units on the benchmark results? I’m guessing higher is better?

nodesocket an hour ago

I just spun up an AWS EC2 g6.xlarge instance to do some LLM work. The GPU is an NVIDIA L4 24GB and it costs $0.8048 per hour. I'm starting to think about switching to an Apple mac2-m2.metal instance at $0.878 per hour. The big question is that the Mac instance only has 24GB of unified memory.
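
For a rough sense of what fits in 24GB, here's a minimal sketch (the bytes-per-parameter figures are assumed quantization sizes, and KV cache, activations, and OS overhead are approximated by a flat 20% headroom):

    // Rough check of which model sizes fit in a given memory budget.
    fn fits(params_b: f64, bytes_per_param: f64, mem_gb: f64) -> bool {
        // keep ~20% headroom for KV cache, activations, and the OS
        params_b * 1e9 * bytes_per_param <= mem_gb * 1e9 * 0.8
    }

    fn main() {
        let mem = 24.0; // GB: both the L4 and the mac2-m2.metal
        for (name, params_b, bpp) in [
            ("8B fp16", 8.0, 2.0),
            ("14B q8 ", 14.0, 1.0),
            ("32B q4 ", 32.0, 0.5),
            ("70B q4 ", 70.0, 0.5),
        ] {
            println!("{name}: fits = {}", fits(params_b, bpp, mem));
        }
    }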

sharifulin 6 hours ago

Wow! Sounds super interesting

TheMagicHorsey 6 hours ago

Amazing!

How was your experience using Rust on this project? I'm considering a project in an adjacent space and I'm trying to decide between Rust, C, and Zig. Rust seems a bit burdensome in its complexity compared to C and Zig; it reminds me of C++ (although it's not as bad). I find it difficult to walk through and understand a complicated Rust repository. I don't have that problem with C and Zig, for the most part.

But I'm wondering if I just need to invest more time in Rust. How was your learning curve with the language?

  • adastra22 5 hours ago

    You are confusing familiarity with intrinsic complexity. I had 20 years of experience with C/C++ before switching to Rust a few years ago. After the initial hurdle, it is way easier and very simple to follow.

dcreater 4 hours ago

Somewhat faster on small models. Requires a new format.

Not sure what the goal is for this project. Not seeing how this presents enough benefit to get adopted by the community.

  • worldsavior 3 hours ago

    It's utilizing the Apple ANE and probably other optimizations provided by Apple's frameworks. Not sure if llama.cpp uses them, but if it doesn't, the benchmark on GitHub says it all.

  • koakuma-chan 4 hours ago

    "Written in Rust" is a big one for me.

zdw 4 hours ago

How does this bench compare to MLX?

  • jasonjmcghee 4 hours ago

    I use MLX in LM Studio and it doesn't have whatever issues llama.cpp is showing here.

    Qwen3-0.6B at 5 t/s doesn't make any sense. Something is clearly wrong for that specific model.

ewuhic 5 hours ago

>faster than llama cpp in all of the use cases

What's your deliberate, well-thought-out roadmap for achieving adoption similar to llama.cpp's?

  • pants2 5 hours ago

    Probably getting acquired by Apple :)

slavasmirnov 6 hours ago

That's exactly what we are looking for, so we don't waste money on APIs. I wonder how significant the trade-offs are.

cwlcwlcwlingg 5 hours ago

Wondering why they used Rust rather than C++.

  • outworlder 2 hours ago

    Why use C++ for greenfield projects?

  • giancarlostoro 2 hours ago

    ...or D? Or Go? Or Java? C#? Zig? They chose what they were most comfortable with. Rust is fine; it's clearly not for everyone, but those who use it produce high-quality software. I would argue the same for Go, without all the unnecessary mental overhead of C or C++.

  • bee_rider 4 hours ago

    I wonder why they didn’t use Fortran.