Show HN: We wrote a book on LLM system evals with a bear and fox

11 points by iamwil 8 months ago

Hey all!

@sridatta and I wrote a book/zine called Forest Friends on system evals for LLM-driven apps. It's a bit more whimsical and a bit more visual than the usual technical guide, and very much inspired by the meme of LLMs being a shoggoth polished into a smiley face with RLHF.

LLM system evals are important as companies move past the flashy AI demos to reliable production apps. System evals keep coming up as the answer for what you "should do", but they're not exactly a standard part of the software engineering toolkit.

So we pulled from @sridatta's seven years as a research engineer at Google, plus a ton of best practices from around the web, and we wrote a zine that could take you from zero to eval in a way that’s fun to read.

To be clear, model evals and system evals are two different things. The former compare different models against each other; the latter measure how well your system is serving your customers' queries. When you create a system eval, you're essentially defining what "good" looks like for your system. Lots of people use "vibes-based evals" (LGTM@K). That's a good place to start and will get you further than you think. But at some point, as you get more users and more diversity in queries, you need system evals. To quote:

Garry Tan says "Don’t rawdog your prompts! Write evals!" https://x.com/garrytan/status/1842210665550983409

Swyx says "Production AI Engineering starts with Evals" https://x.com/latentspacepod/status/1844870676202783126
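If "defining what good looks like" sounds abstract, here's a rough sketch of the smallest system eval I can think of, in Python. To be clear, this is not lifted from the zine: `answer_query`, the canned answers, the two checks, and the 50-word limit are all made up for illustration, and a real app would call its actual LLM pipeline instead of a lookup table.

```python
from typing import Callable, List

# Hypothetical stand-in for an LLM-driven app. Replace with a real call.
def answer_query(query: str) -> str:
    canned = {
        "What is your refund policy?": "Refunds are available within 30 days of purchase.",
        "How do I reset my password?": "Click 'Forgot password' on the login page.",
    }
    return canned.get(query, "")  # unknown queries get an empty answer

# Each check encodes one piece of what "good" means for this system.
# Both criteria are illustrative assumptions, not recommendations.
def is_non_empty(response: str) -> bool:
    return bool(response.strip())

def is_concise(response: str) -> bool:
    return len(response.split()) <= 50

CHECKS: List[Callable[[str], bool]] = [is_non_empty, is_concise]

def run_eval(
    queries: List[str],
    system: Callable[[str], str],
    checks: List[Callable[[str], bool]],
) -> float:
    """Return the fraction of queries whose response passes every check."""
    if not queries:
        return 0.0
    passed = 0
    for query in queries:
        response = system(query)  # call the system once per query
        if all(check(response) for check in checks):
            passed += 1
    return passed / len(queries)

queries = [
    "What is your refund policy?",
    "How do I reset my password?",
    "Do you ship internationally?",  # not covered by the canned answers
]

pass_rate = run_eval(queries, answer_query, CHECKS)
print(f"pass rate: {pass_rate:.2f}")
```

The point is only that the list of checks is where you write down what "good" means, and the pass rate is a number you can watch as you change prompts or models.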

How did we end up doing a zine? Coming off collaborating on the Technium Podcast, we wanted to tackle a topic where we had deep expertise while also exercising our entrepreneurial muscles. We came up with writing a zine, inspired by Julia Evans' Wizard Zine and Sailor Mercury's Bubble Zine. Originally, we were shooting for 30 pages, but ended up with 60 pages.

This was also an experiment in image generation for a product. Initially, I created illustrations by hand. Midway, I decided to switch to using Midjourney to make our deadline. I needed to generate scenes with consistent characters in a specific style set in a specific architecture, and it turned out to be hard.

Initially, I was spending 8 to 10 hours a day generating images. Eventually, I got better at predicting what would work, and the time to generate a suitable image dropped to 1.5 hours. Rest assured, however, all the text is human-generated and hand-edited.

The issue has been well received so far. Here are some quotes from early readers:

"Thanks for this resource! It provides a comprehensive introduction to LLM evaluation systems that rings true to my daily work as an AI engineer—all in less than an hour of reading and with minimal jargon. I’ll be recommending this to my team."

"I was a fan of The Poignant Guide to Ruby many years ago, so it’s great to see a playfulness brought to the world of LLMs. I’m building an evals platform that makes it as easy as possible for any developer to get started with evals. This edition has been great to make sure we get the basics and terminology right."

"Here's an engaging intro to evals by @sridatta and @iamwil. They've clearly put a lot of care and effort into it, where the content is well organized with plenty of illustrations throughout. Across 60 pages, they explain model vs. system evals, vibe checks and property-based tests, designing eval criteria, aligning LLM evaluators, how to measure alignment via various metrics, how to analyze evals to improve our system, and more. Now I can just direct folks to [the zine] instead of having to write it myself haha"

The zine is available now. There's a preview if you want to check the vibe. https://forestfriends.tech/assets/preview.pdf

I'd love to hear any feedback on the first issue, or what other topics you'd like to see tackled in later issues. If you have questions about the process of making the zine, I'd be happy to answer those also.

Here's the link: https://forestfriends.tech

Here's where to buy: https://issue1.forestfriends.tech/