How Weights & Biases increased developer focus, reduced overhead, and added several in-demand languages

TL;DR: Weights & Biases faced substantial demand from their customers to create more SDKs for Weave, their LLM evaluation tool. The third-party libraries for their platform had quality issues, and maintaining more SDKs in-house was unsustainable for their team of eight. After a month-long structured assessment comparing Stainless and Speakeasy, they chose Stainless for superior ergonomics, more compact bundle sizes, and better reliability.

Reliability is the top priority. It needs to work every time, with clear logging and contextual errors, and this was an important criterion for evaluation.

Andrew Truong

Staff Machine Learning Engineer - Weights & Biases

Weights & Biases is an AI platform for training and fine-tuning models. Their newest product, Weave, helps developers monitor, debug, and improve their AI applications. Weave initially launched with a Python SDK, but the team was quickly flooded with requests to support more languages. They needed a more efficient way to produce idiomatic client libraries than writing each one in-house.

Since Weave's server API was built with FastAPI, they already had an OpenAPI spec that could be used to generate SDKs for TypeScript, Go, Java, and C#. This led the team to explore SDK generators as a solution.
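
This is the piece FastAPI provides out of the box: the OpenAPI document is derived from route signatures and Pydantic models and served alongside the API. Below is a minimal sketch of that setup; the endpoint and model names are illustrative, not Weave's actual trace-server API.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="trace-server (illustrative)")

# Hypothetical request/response models; Weave's real schemas are richer.
class CallStartRequest(BaseModel):
    project_id: str
    op_name: str

class CallStartResponse(BaseModel):
    id: str

@app.post("/call/start")
def call_start(req: CallStartRequest) -> CallStartResponse:
    return CallStartResponse(id="call-123")

# FastAPI derives the OpenAPI document from these definitions and serves it
# at /openapi.json; app.openapi() returns the same document as a dict, which
# is exactly what an SDK generator consumes.
```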

Andrew Truong, staff machine learning engineer and the DRI for this project, established several critical requirements based on their existing SDK challenges:

  1. Rock-solid reliability - SDKs must update automatically without breaking customer integrations.

  2. Architectural flexibility - Generated SDKs need to mount as submodules under their existing package structure.

  3. FastAPI compatibility - Complex OpenAPI specs, including types like dict[str, Any], should generate correctly (see the sketch after this list).


  4. Superior bundle size - Minimal package size is critical for browser-based application deployments.

  5. Idiomatic code generation - SDKs need to feel native and natural in every supported language.
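
To make the FastAPI compatibility point concrete: open-ended fields such as dict[str, Any] land in the OpenAPI spec as generic objects (additionalProperties) rather than named fields, and a generator has to map them to something idiomatic in each target language (e.g. Record<string, unknown> in TypeScript, map[string]any in Go) without breaking codegen. A hedged sketch follows, with request and response shapes invented for illustration rather than taken from Weave's actual calls_query contract.

```python
from typing import Any

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class CallsQueryRequest(BaseModel):
    project_id: str
    # Open-ended filter: serialized as a generic object in the OpenAPI spec,
    # so the generated client has to type it loosely (e.g. Dict[str, Any] in
    # Python, Record<string, unknown> in TypeScript).
    filter: dict[str, Any] | None = None

class Call(BaseModel):
    id: str
    attributes: dict[str, Any]

@app.post("/calls/query")
def calls_query(req: CallsQueryRequest) -> list[Call]:
    # Returning an empty list keeps the sketch runnable; the real endpoint
    # streams JSONL results.
    return []
```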

The Bake-Off

The month-long structured evaluation: Stainless vs. Speakeasy.

For about a month, Andrew conducted a thorough head-to-head comparison of Stainless and Speakeasy across 11 technical criteria. The evaluation included multiple discovery calls, technical deep-dives, pairing sessions, and hands-on testing that compared the code each platform generated.

The table below is the raw rubric that Andrew used for his evaluation. The lines marked [Stainless Notes] are the only additions we've made.

Customer support

Stainless: Good — 1 AE + customer engineer equivalent assigned, and a few more engineers in the channel. Initially responses felt a bit slow, but over time responses were better. SMLE in particular was extremely responsive and went out of his way to fix issues and advocate for us.

[Stainless Notes] - Our founder Alex hopped on a call and provided design support to the customer team.

Speakeasy: Excellent — Cofounder + 1 AE + 1 customer engineer equivalent assigned. We posted A LOT, and most were followed up quickly. One of the engineers (owner of?) the SDK generation process DM'd me to hop on a call and sort things out, which was a nice touch!

Ergonomics

Stainless: Excellent — basically looks and feels like the OpenAI SDK, which will be familiar for all of our users.

  1. Names are what you expect, and controls feel like they're in the right place. There's less configuration required to get the behaviors we want.

  2. One weird thing is that they expect auth to be provided even for unauthenticated endpoints (they just discard the auth; Speakeasy does what you expect — you don't need to provide auth).

  3. Supports SSE and JSONL streaming. JSONL is interesting because that's what our calls_query endpoint currently does, so it's less work for us.

Speakeasy: Good — most controls are in the right place.

  1. Some naming is specific to OpenAPI, so the names feel a little out of place, e.g. base_url is called server_url. This might seem like a nit, but our env vars are called XYZ_BASE_URL, so it could cause confusion. Speakeasy suggests configuring server variables.

  2. Some settings are not exposed in a way that cmd+. could find, e.g. BackoffStrategy.

  3. Per-request hooks are unique to Speakeasy and are convenient. We can implement a basic write-ahead-log feature to save requests and then push them back up later. In Stainless, you have to pass in a custom HTTP client.

  4. Supports SSE, but we don't use this atm. It could be useful as we think about moving evals/scoring to the server. Speakeasy says they can add JSONL support by end of March.

Confidence that it works (testing, linting, etc.)

Stainless: Good — covers the basics. They test endpoints against a mock server, and they have strict linting configs (in Python, they use pyright and mypy strict, a noxfile, etc. — very similar to our setup).

Speakeasy: OK — They claim to have more coverage, but I think it's a paid add-on. The repos I generated didn't have any tests. A few generations resulted in SDKs that failed their own linters (mypy) — I'm surprised the generate command succeeded. I feel less confident that I can trust the output because of this.

Performance

Stainless: Excellent — They use the same things under the hood, so perf is about the same.

Speakeasy: Excellent — They use the same things under the hood, so perf is about the same. Supposedly the React hooks have a lot of optimizations, but I didn't get to measuring them.

Error handling

Stainless: Excellent — specific errors are raised and helpful messages are passed through. Clients can dispatch on error types.

Speakeasy: Excellent — specific errors are raised and helpful messages are passed through. Clients can dispatch on error types.

Repo/packaging

Stainless: Excellent — Comes with a readme, examples, and utility scripts. I could quickly grok how to rebuild the package from scratch. For Python, it uses uv, which is what we use. They said they are working on a feature landing Feb 10 that will support mounting the generated package at /trace_server_bindings.

They have integrations to publish to PyPI/npm.

Python wheel: 121 kB

TypeScript bundle: 9.95 kB (minified, brotli, empty esbuild project) — notably does not have React hooks

Speakeasy: Good — more spartan, but seems to do the job. The only script included is to generate a README for publishing (I found this to be strange). For Python, they're using Poetry, which is fine but different than our current uv. If we want to mount the generated package at /trace_server_bindings, I think they will need to switch to uv. Speakeasy says they can move to uv by end of March.

They have integrations to publish to PyPI/npm.

Python wheel: 259 kB

TypeScript bundle: 31.34 kB (minified, brotli, empty esbuild project)

Speakeasy says they'll be working on improving package sizes by end of March.

Generated React Hooks

Stainless: Not available — They said we should roll our own, or they can add this feature if we were adamant. They seemed confident that it would be easy to implement ourselves using their TS SDK as a base.

Speakeasy: Good — Uses the popular TanStack Query package, which is like Apollo for REST. There is some manual work required to specify which hooks are queries, but once set up it seems reasonable and nice to use. The main benefit here is the extra standardization we get from auto-generated hooks, and a side effect is that they use TanStack Query by default, which is probably more consistent and performant than the hand-rolled queries we have today.

"Don't always need it, but nice when you do"

Stainless: Good — endpoints have convenient methods like .with_raw_response which can be used to change the response format if you need, e.g., the raw httpx response.

Speakeasy: OK — response format needs to be configured ahead of time and is locked in per generated client. Speakeasy says they will be able to match Stainless here by end of March.

Forward-compatibility and "no crash mode"

Stainless: They do some work to allow (and ignore) extra keys coming back from the server. This prevents crashes when the client is out of date with the server. It's a nice touch.

Speakeasy: ??? — Not sure if they do this.

Doc gen

Stainless: Good ($) — paid offer has a nice cmd+K search that is very fast and helpful (see Cloudflare). Our docs here (more sparse, but can be improved). It's expensive, but cmd+K is a nice touch.

OK (free) — Otherwise, docs are similar to what we have right now. Docs don't support calling the API on the page, but they are adding it. They expose a docs yaml file similar to Speakeasy that we can run through an open source generator for free.

Speakeasy: OK (free) — Nothing special. Seems like the same thing we generate for free. Their pricing includes 3 docs seats "for free", so this technically has no cost for us.

They also have markdown docs generated in the repo so you can learn about the API directly in the repo. I'm not sure why this is useful, but they felt it was important to call out.

Generation process

Stainless: OK — everything needs to be done in the UI, which is not always intuitive. Saving the config in the online editor triggers SDK generation, but sometimes it will say "no new changes saved" and do nothing. In my case, I caused a merge conflict that froze their pipeline and I needed a Stainless engineer to fix it for me. Not great!

[Stainless Notes] - We ultimately released a CLI for this customer (within 2 days of this request) and are releasing a CLI GA in the next 1-2 months. Our current sprint is dedicated to UI improvements.

Speakeasy: Good — CLI is mostly intuitive, and editing yaml files offline is a breeze. Most settings are guessable or well-documented on their site, so you can customize generation quickly.

Off-boarding

Stainless: OK — Code is licensed to us in perpetuity. Much of the config happens in the proprietary Stainless format that we would have to re-implement if we moved.

Speakeasy: OK — Code is licensed to us in perpetuity. Most of the configs were vanilla OpenAPI without many x-speakeasy-xyz extensions, so they could potentially be carried to another vendor, though some migration work would still be required.

Why Weights & Biases chose Stainless

Several factors made Stainless the clear choice for Andrew's team. The compact bundle size is critical for Weights & Biases' browser-deployed TypeScript application. Beyond that, Stainless's superior ergonomics, reliability, built-in tests, architectural flexibility, and FastAPI compatibility sealed the decision.

Implementation and Business Impact

"Stainless saves us the trouble of keeping multiple SDKs in sync and delivers what we need and in a package our customers expect"

Following Andrew's evaluation, Weights & Biases selected Stainless and began implementation planning. While the team still needs to add custom functionality like batch processing on top of the generated SDKs, Stainless handles the foundational multi-language SDK generation, allowing the team to focus more on core product development.
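
That split of responsibilities is worth making concrete: the generated bindings handle transport, types, and retries, while a thin hand-written layer adds product-specific behavior such as batching. Below is a hedged sketch of that layering; GeneratedTraceClient and its method are hypothetical stand-ins for the real Stainless-generated bindings, not Weave's actual API.

```python
import threading
from typing import Any

# Hypothetical stand-in for the Stainless-generated client; the real class
# and method names come from the generated trace_server_bindings package.
class GeneratedTraceClient:
    def create_calls(self, calls: list[dict[str, Any]]) -> None:
        """Pretend network call that uploads a batch of call records."""
        ...

class BatchingTraceClient:
    """Hand-written layer on top of the generated SDK: buffers call records
    and flushes them in one request once the buffer reaches batch_size."""

    def __init__(self, client: GeneratedTraceClient, batch_size: int = 100) -> None:
        self._client = client
        self._batch_size = batch_size
        self._buffer: list[dict[str, Any]] = []
        self._lock = threading.Lock()

    def log_call(self, record: dict[str, Any]) -> None:
        with self._lock:
            self._buffer.append(record)
            if len(self._buffer) >= self._batch_size:
                self._flush_locked()

    def flush(self) -> None:
        with self._lock:
            if self._buffer:
                self._flush_locked()

    def _flush_locked(self) -> None:
        # Caller holds the lock; hand the whole batch to the generated client.
        self._client.create_calls(self._buffer)
        self._buffer = []
```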

Stainless is also working closely with the Weights & Biases team to execute a phased rollout of the new SDKs: starting with TypeScript and Go, then Java with React hooks, and finally C#.

Business outcomes

  • Engineering velocity - Andrew’s team can focus more on core products instead of manual SDK maintenance across multiple languages.

  • Market expansion - Weights & Biases can now serve developers in Go, Java, and C# ecosystems who were previously unable to integrate with the platform.

  • Automated maintenance - API changes from FastAPI automatically propagate to all language SDKs without manual intervention.

  • Architectural integration - Generated SDKs integrate seamlessly with their existing package structure through custom submodule mounting.

The North Star for us is simple: whether you're new to the SDK or you know it inside and out, you should feel genuinely happy using it.

Andrew Truong

Staff Machine Learning Engineer - Weights & Biases
