How Weights & Biases increased developer focus, reduced overhead, and added several in-demand languages
TL;DR: Weights & Biases faced substantial demand from their customers to create more SDKs for Weave, their LLM evaluation tool. The third-party libraries for their platform had quality issues, and maintaining more SDKs in-house was unsustainable for their team of eight. After a month-long structured assessment comparing Stainless and Speakeasy, they chose Stainless for superior ergonomics, more compact bundle sizes, and better reliability.
Weights & Biases is an AI platform for training and fine-tuning models. Their newest product, Weave, helps developers monitor, debug, and improve their AI applications. Weave initially launched with a Python SDK, but the team was quickly flooded with requests to support more languages. They needed a more efficient way to produce idiomatic client libraries than writing each one in-house.
Since Weave's server API was built with FastAPI, they already had an OpenAPI spec that could be used to generate SDKs for TypeScript, Go, Java, and C#. This led the team to explore SDK generators as a solution.
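For readers unfamiliar with that workflow: FastAPI derives the OpenAPI document automatically from route and model definitions and serves it at /openapi.json. The sketch below is illustrative only; the endpoint and model names are hypothetical, not Weave's actual API.

```python
from typing import Any

from fastapi import FastAPI
from pydantic import BaseModel

# Hypothetical service for illustration; Weave's real API surface differs.
app = FastAPI(title="Trace Server")


class CallRecord(BaseModel):
    id: str
    # Open-ended JSON payloads like dict[str, Any] are common in tracing APIs,
    # and they are exactly the kind of type an SDK generator must handle well.
    attributes: dict[str, Any]


@app.post("/calls")
def create_call(call: CallRecord) -> CallRecord:
    """Record a call; FastAPI derives the request/response schema from CallRecord."""
    return call


# FastAPI serves the generated spec at /openapi.json (or via app.openapi());
# that document is what an SDK generator consumes.
```

Fetching /openapi.json from a running app (for example via `uvicorn`) yields the document that generators like Stainless take as input.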
Andrew Truong, staff machine learning engineer and the DRI (directly responsible individual) for this project, established several critical requirements based on their existing SDK challenges:
Rock-solid reliability - SDKs must update automatically without breaking customer integrations.
Architectural flexibility - Generated SDKs need to mount as submodules under their existing package structure (see the sketch after this list).
FastAPI compatibility - Complex OpenAPI specs, including types like `dict[str, Any]`, should generate correctly.
Superior bundle size - Minimal package size is critical for browser-based application deployments.
Idiomatic code generation - SDKs need to feel native and natural in every supported language.
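On the architectural-flexibility point, mounting a generated SDK as a submodule usually comes down to re-exporting the generated package under the existing namespace. A minimal sketch, assuming a hypothetical generated package called `weave_server_sdk` with a `Client` class (the real package and class names may differ):

```python
# weave/trace_server/__init__.py  (hypothetical path inside the existing package)
#
# Re-export the generated client so users keep importing from the familiar
# `weave` namespace instead of a separately named generated package.
from weave_server_sdk import Client as TraceServerClient  # hypothetical generated package

__all__ = ["TraceServerClient"]
```

Callers would then write `from weave.trace_server import TraceServerClient` and never need to know which generator produced the underlying code.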
The Bake-Off
The month-long structured evaluation: Stainless vs. Speakeasy.
For about a month, Andrew conducted a thorough head-to-head comparison between Stainless and Speakeasy across 11 technical criteria. The evaluation included multiple discovery calls, technical deep-dives, pairing sessions, and hands-on testing that compared the code generated by both platforms.
The table below is the raw rubric that Andrew used for his evaluation. The notes marked [Stainless Notes] are the only additions we’ve made.
Criteria | Stainless | Speakeasy |
---|---|---|
Customer support | Good — 1 AE + customer engineer equivalent assigned, and a few more engineers in the channel. Initially responses felt a bit slow, but over time responses were better. SMLE in particular was extremely responsive and went out of his way to fix issues and advocate for us. [Stainless Notes] - Our founder Alex hopped on a call and provided design support to the customer team | Excellent — Cofounder + 1 AE + 1 customer engineer equivalent assigned. We posted A LOT, and most were followed up quickly. One of the engineers (owner of?) the SDK generation process DM'd me to hop on a call and sort things out which was a nice touch! |
Ergonomics | Excellent — basically looks and feels like the OpenAI SDK, which will be familiar for all of our users. | Good — most controls are in the right place. |
Confidence that it works (testing, linting, etc.) | Good — covers the basics. They test endpoints against a mock server, and they have strict linting configs (in python, they use pyright and mypy strict, noxfile, etc — very similar to our setup). | OK — They claim to have more coverage, but I think it's a paid add-on. The repos I generated didn't have any tests. A few generations resulted in SDKs that failed their own linters (mypy) — I'm surprised the generate command succeeded. I feel less confident that I can trust the output because of this. |
Performance | Excellent — They use the same things under the hood, so perf is about the same. | Excellent — They use the same things under the hood, so perf is about the same. Supposedly the react hooks have a lot of optimizations, but I didn't get to measuring them. |
Error handling | Excellent — specific errors are raised and helpful messages are passed through. Clients can dispatch on error types | Excellent — specific errors are raised and helpful messages are passed through. Clients can dispatch on error types |
Repo/packaging | Excellent — Comes with a readme, examples, and utility scripts. I could quickly grok how to rebuild the package from scratch. For python, it uses They have integrations to publish to pypi/npm. Python wheel: 121kb Typescript bundle: 9.95 kB (minified, brotli, empty esbuild project) — notably does not have react hooks | Good — more spartan, but seems to do the job. The only script included is to generate a README for publishing (I found this to be strange). For python, they're using They have integrations to publish to pypi/npm. Python wheel: 259kb Typescript bundle: 31.34kb (minified, brotli, empty esbuild project) Speakeasy says they'll be working on improving package sizes by end of March. |
Generated React Hooks | Not available — They said we should roll our own, or they can add this feature if we were adamant. They seemed confident that it would be easy to implement ourselves using their TS SDK as a base. | Good — Uses the popular TanStack Query package, which is like Apollo for REST. There is some manual work required to specify which hooks are queries, but once set up it seems reasonable and nice to use. The main benefit here is the extra standardization we get from auto-generated hooks, and a side effect is that they use TanStack Query by default which is probably more consistent and performant than the hand-rolled queries we have today. |
"Don't always need it, but nice when you do" | Good — endpoints have convenient methods like | OK — response format needs to be configured ahead of time and is locked in per generated client. Speakeasy says they will be able to match Stainless here by end of March. |
Forward-compatibility and "no crash mode" | Stainless does some work to allow (and ignore) extra keys coming back from the server. This prevents crashes when the client is out of date with the server. It's a nice touch. | ??? — Not sure if they do this |
Doc gen | Good ($) — paid offer has a nice OK (free) — Otherwise, docs are similar to what we have right now. Docs don't support calling the API on the page, but they are adding it. They expose a docs yaml file similar to Speakeasy that we can run through an open source generator for free. | OK (free) — Nothing special. Seems like the same thing we generate for free. Their pricing includes 3 docs seats "for free", so this technically has no cost for us. They also have markdown docs generated in the repo so you can learn about the API directly in the repo. I'm not sure why this is useful, but they felt it was important to call out. |
Generation process | OK — everything needs to be done in the UI, which is not always intuitive. Saving the config in the online editor triggers SDK generation, but sometimes it will say "no new changes saved" and do nothing. In my case, I caused a merge conflict that froze their pipeline and I needed a Stainless engineer to fix it for me. Not great! [Stainless Notes] - We ultimately released a CLI for this customer (within 2 days of this request) and are releasing a CLI GA in the next 1-2 months. Our current sprint is dedicated to UI improvements | Good — CLI is mostly intuitive, and editing yaml files offline is a breeze. Most settings are guessable or well-documented on their site, so you can customize generation quickly. |
Off-boarding | OK — Code is licensed to us in perpetuity. Much of the config happens in the proprietary Stainless format that we would have to re-implement if we moved. | OK — Code is licensed to us in perpetuity. Most of the configs were vanilla OpenAPI without many x-speakeasy-xyz extensions, so they could potentially be carried to another vendor, though some migration work would still be required. |
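The forward-compatibility row deserves a brief illustration: tolerating unknown response keys means an older client keeps working after the server starts returning new fields. Here is a minimal sketch of the idea in Pydantic terms (illustrative only, not Stainless's actual implementation; the model and field names are hypothetical):

```python
from pydantic import BaseModel, ConfigDict


class CallSummary(BaseModel):
    # Ignore (rather than reject) keys the client doesn't know about yet,
    # so a newer server response doesn't crash an older client.
    model_config = ConfigDict(extra="ignore")

    id: str
    status: str


# A newer server might add "latency_ms"; the older model simply drops it.
summary = CallSummary.model_validate(
    {"id": "call_123", "status": "success", "latency_ms": 42}
)
print(summary)  # id='call_123' status='success'
```

Without that tolerance, a single field added on the server side could break every out-of-date client in the wild.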
Why Weights & Biases chose Stainless
Several factors made Stainless the clear choice for Andrew’s team. The bundle size and performance advantages are critical for Weights & Biases’ browser-deployed TypeScript application. Beyond that, Stainless’ superior ergonomics, reliability, built-in tests, architectural flexibility, and FastAPI compatibility sealed the decision.
Implementation and Business Impact
"Stainless saves us the trouble of keeping multiple SDKs in sync and delivers what we need and in a package our customers expect"
Following Andrew's evaluation, Weights & Biases selected Stainless and began implementation planning. While the team still needs to add custom functionality like batch processing on top of the generated SDKs, Stainless handles the foundational multi-language SDK generation, allowing the team to focus more on core product development.
Stainless is also working closely with the Weights & Biases team on a phased rollout of the new SDKs: starting with TypeScript and Go, then Java and React hooks, with C# support to follow.
Business outcomes
Engineering velocity - Andrew’s team can focus more on core products instead of manual SDK maintenance across multiple languages.
Market expansion - Weights & Biases can now serve developers in the Go, Java, and C# ecosystems who were previously unable to integrate natively with the platform.
Automated maintenance - API changes from FastAPI automatically propagate to all language SDKs without manual intervention.
Architectural integration - Generated SDKs integrate seamlessly with their existing package structure through custom submodule mounting.