---
title: "Nomadic Infrastructure Design for AI Workloads"
date: 2024-11-12
redirect_to: "https://tigrisdata.com/blog/nomadic-compute/"
hero:
  ai: "Flux [dev] by Black Forest Labs"
  file: "_yj_eBqjMOIe0Bv-oQxoy"
  prompt: "A nomadic server hunts for GPUs, powered by Taco Bell"
---

Taco Bell is a miracle of food preparation. They manage to have a menu of dozens
of items that all boil down to permutations of the same few basic ingredients:
meat, cheese, beans, vegetables, bread, and sauces. Those basic fundamentals are
combined in new and interesting ways to give you the Crunchwrap, the Chalupa,
the Doritos Locos Tacos, and more. Just add hot water and they’re ready to eat.

Even though the results are exciting, the ingredients for them are not. They’re
all really simple things. The best-designed production systems I’ve ever used
take the same basic idea: build exciting things out of boring components that
are well understood across all facets of the industry (e.g. S3, Postgres, HTTP,
JSON, YAML). That frees your pitch deck up to aim at disrupting the
industry-disrupting industry.

A bunch of companies want to sell you inference time for your AI workloads, or
to sell you the results of running that inference for you, but nobody really
tells you how to build this yourself. That’s the special Mexican Pizza sauce
that you can’t replicate at home no matter how much you want to.

Today, we’ll cover how you, a random nerd that likes reading architectural
articles, should design a production-ready AI system so that you can maximize
effectiveness per dollar, reduce dependency lock-in, and separate concerns down
to their cores. Buckle up, it’s gonna be a ride.

<Conv name="Mara" mood="hacker">
  The industry uses like a billion different terms for “unit of compute that has
  access to a network connection and the ability to store things for some amount
  of time” that all conflict in mutually incompatible ways. When you read
  “workload”, you should think about some program that has access to some
  network and some amount of storage through some means, running somewhere,
  probably in a container.
</Conv>

## The fundamentals of any workload

At the core, any workload (computer games, iPadOS apps, REST APIs, Kubernetes,
$5 Hetzner VPSen, etc.) is a combination of three basic factors:

- Compute, or the part that executes code and does math
- Network, or the part that lets you dial and accept sockets
- Storage, or the part that remembers things for next time

In reality, these things will overlap a little (compute has storage in the form
of RAM, some network cards run their own Linux kernel, and storage is frequently
accessed over the network), but that still very cleanly maps to the basic things
that you’re billed for in the cloud:

- Gigabyte-core-seconds of compute
- Gigabytes egressed over the network
- Gigabytes stored in persistent storage

And of course, there’s a huge money premium for any of this being involved in AI
anything because people will pay. However, let’s take a look at that second
basic thing you’re billed for a bit closer:

> - Gigabytes egressed over the network

Note that it’s _egress_ out of your compute, not _ingress_ to your compute.
Providers generally want to make it easy for you to put your data into their
platform and harder to get the data back out. This is usually combined with your
storage layer, which can make it annoying and expensive to deal with data that
is bigger than your local disk. Your local disk is frequently way too small to
store everything, so you have to make compromises.

What if your storage layer didn’t charge you per gigabyte of data you fetched
out of it? What classes of problems would that allow you to solve that were
previously too expensive to execute on?

If you put your storage in a service that is low-latency, close to your servers,
and has no egress fees, then it can actually be cheaper to pull things from
object storage just-in-time to use them than it is to store them persistently.

### Storage that is left idle is more expensive than compute time

In serverless (Lambda) scenarios, most of the time your application is turned
off. This is good. This is what you want. You want it to turn on when it’s
needed, and turn back off when it’s not. When you do a setup like this, you also
usually assume that the time it takes to do a cold start of the service is fast
enough that the user doesn’t mind.

Let’s say that your AI app requires 16 gigabytes of local disk space for your
Docker image with the inference engine and the downloaded model weights. In some
clouds (such as Vast.ai), this can cost you upwards of $4-10 per month to have
the data sitting there doing nothing, even if the actual compute time is as low
as $0.99 per hour. If you’re using Flux [dev] (12 billion parameters, 25 GB of
weight bytes) and those weights take 5 minutes to download, this means that you
are only spending about $0.08 of compute time waiting for the weights to
download. If you’re only doing
inference in bulk scenarios where latency doesn’t matter as much, then it can be
much, much cheaper to dynamically mint new instances, download the model weights
from object storage, do all of the inference you need, and then slay those
instances off when you’re done.
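
To put numbers on that, here’s a back-of-the-envelope sketch using the figures
above. The one-cold-start-per-day count is my assumption, not a law of nature;
plug in your own traffic:

```go
package main

import "fmt"

func main() {
	const (
		gpuDollarsPerHour = 0.99 // on-demand GPU rental from above
		downloadMinutes   = 5.0  // time to pull the 25 GB of weights from object storage
		idleDiskPerMonth  = 7.00 // midpoint of the $4-10/month quoted above
		coldStartsPerDay  = 1.0  // assumption: one spin-up per day
	)

	perColdStart := gpuDollarsPerHour * downloadMinutes / 60
	perMonth := perColdStart * coldStartsPerDay * 30

	fmt.Printf("each cold start burns about $%.2f of compute\n", perColdStart) // ~$0.08
	fmt.Printf("about $%.2f/month spent re-downloading weights\n", perMonth)   // ~$2.48
	fmt.Printf("versus about $%.2f/month keeping the weights parked on local disk\n", idleDiskPerMonth)
}
```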

Most of the time, any production workload’s request rate is going to follow a
sinusoidal curve where there’s peak usage for about 8 hours in the middle of the
day and things will fall off overnight as everyone goes to bed. If you spin up
AI inference servers on demand following this curve, this means that the first
person of the day to use an AI feature could have it take a bit longer for the
server to get its coffee, but it’ll be hot’n’ready for the next user when they
use that feature.

You can even cheat further with optional features such that the first user
doesn’t actually see them, but it triggers the AI inference backend to wake up
for the next request.
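
Here’s a hand-wavy sketch of that trick: serve the page without the optional AI
widget, but nudge the backend awake in the background. The `warmRequests`
channel and the empty warm-up body are placeholders for whatever actually wakes
your inference layer up (like the scale-to-zero proxy later in this post):

```go
package main

import (
	"context"
	"fmt"
	"net/http"
)

// warmRequests is drained by a background goroutine that wakes the GPU backend.
var warmRequests = make(chan struct{}, 1)

// handleFeature ships the page without the optional AI widget, but asks for the
// backend to be warmed so it's hot'n'ready for the next visitor.
func handleFeature(w http.ResponseWriter, r *http.Request) {
	select {
	case warmRequests <- struct{}{}: // non-blocking: at most one pending warm-up
	default:
	}
	fmt.Fprintln(w, "<html>the page, minus the optional AI bits</html>")
}

// warmBackend would poke the inference backend so the orchestrator mints an
// instance before anyone actually needs it.
func warmBackend(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return
		case <-warmRequests:
			// TODO: hit the scale-to-zero proxy / health endpoint here
		}
	}
}

func main() {
	go warmBackend(context.Background())
	http.HandleFunc("/feature", handleFeature)
	http.ListenAndServe(":8080", nil)
}
```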

### It may not be your money, but the amounts add up

When you set up cloud compute, it’s really easy to fall prey to the siren song
of the seemingly bottomless budget of the corporate card. At a certain point, we
all need to build a sustainable business as the AI hype wears off and the free
tier ends. However, thanks to the idea of Taco Bell infrastructure design, you
can reduce the risk of lock-in and increase flexibility between providers so you
can lower your burn rate.

In many platforms, data ingress is free. Data _egress_ is where they get you.
It’s such a problem for businesses that the
[EU has had to step in and tell providers that people need an easy way out](https://commission.europa.eu/news/data-act-enters-force-what-it-means-you-2024-01-11_en).
Every gigabyte of data you put into those platforms is another $0.05 that it’ll
cost to move away should you need to.

This doesn’t sound like an issue, because the CTO’s negotiating dream is that
they’ll be able to play the “we’re gonna move our stuff elsewhere” card and
instantly win a discount and get a fantastic deal that will enable future growth
or whatever.

This is a nice dream.

In reality, the sales representative has a number in big red letters in front of
them. This number is the amount of money it would cost for you to move your 3
petabytes of data off of their cloud. You both know you’re stuck with each
other, and you’ll happily take an additional measly 5% discount on top of the
10% discount you negotiated last year. We all know that the actual cost of
running the service is 15% of even that cost, but the capitalism machine has to eat
somehow, right?

## On the nature of dependencies

Let’s be real, dependencies aren’t fundamentally bad things to have. All of us
have a hard dependency on the Internet, amd64 CPUs, water, and storage.
Everything’s a tradeoff. The potentially harmful part comes in when your
dependency locks you in so you can’t switch away easily.

This is normally pretty bad with traditional compute setups, but can be extra
insidious with AI workloads. AI workloads make cloud companies staggering
amounts of money, so they want to make sure that you keep your AI workloads on
their servers as much as possible so they can extract as much revenue out of you
as possible. Combine this with the big red number disadvantage in negotiations,
and you can find yourself backed into a corner.

### Strategic dependency choice

This is why picking your dependencies is such a huge thing to consider. There’s
a lot to be said about choosing dependencies to minimize vendor lock-in, and
that’s where the Taco Bell infrastructure philosophy comes in:

- Trigger compute with HTTP requests that use well-defined schemata.
- Find your target using DNS.
- Store things you want to keep in Postgres or object storage.
- Fetch things out of storage when you need them.
- Mint new workers when there is work to be done.
- Slay those workers off when they’re not needed anymore.

If you follow these rules, you can easily make your compute nomadic between
services. Capitalize on things like Kubernetes (the universal API for cloud
compute, as much as I hate that it won), and you make the underlying clouds an
implementation detail that can be swapped out as you find better strategic
partnerships that can offer you more than a measly 5% discount.

Just add water.
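
As a taste of what the first rule looks like in practice: the trigger is a
plain HTTP request with a well-defined schema, and the response holds a
reference into the storage plane rather than the bytes themselves. These field
names are made up for illustration:

```go
// GenerateRequest is the well-defined schema that triggers compute.
type GenerateRequest struct {
	Prompt string `json:"prompt"`
	Steps  int    `json:"steps"`
}

// GenerateResponse doesn't carry the image; it carries a reference (an object
// storage key) that callers fetch out of storage when they need the bytes.
type GenerateResponse struct {
	Key string `json:"key"`
}
```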

### How AI models become dependencies

There's an extra evil way that AI models can become production-critical
dependencies. Most of the time when you implement an application that uses an AI
model, you end up encoding "workarounds" for the model into the prompts you use.
This happens because AI models are fundamentally unpredictable and unreliable
tools that sometimes give you the output you want. As a result though, changing
out models _sounds_ like it's something that should be easy. You _just_ change
out the model and then you can take advantage of better accuracy, new features
like tool use, or JSON schema prompting, right?

In many cases, changing out a model will result in a service that superficially
looks and functions the same. You give it a meeting transcript, it tells you
what the action items are. The problem comes in with the subtle nuances of the
je ne sais quoi of the experience. Even subtle differences like
[the current date being in the month of December](https://arstechnica.com/information-technology/2023/12/is-chatgpt-becoming-lazier-because-its-december-people-run-tests-to-find-out/)
can _drastically_ change the quality of output. A
[recent paper from Apple](https://arxiv.org/pdf/2410.05229) concluded that
adding superficial details that wouldn't throw off a human can severely impact
the performance of large language models. Heck, they even struggle with or fall
prey to fairly trivial questions that humans find easy, such as:

- How many r's are in the word "strawberry"?
- What's heavier: 2 pounds of bricks, one pound of heavy strawberries, or three
  pounds of air?

If changing the placement of a comma in a prompt can cause such huge impacts to
the user experience, what would changing the model do? What happens when you’re
forced to change the model because the provider is deprecating it so they can
run newer models that don’t do the job as well as the one you use today? This is a
really evil kind of dependency that you can only get when you rely on
cloud-hosted models. By controlling the weights and inference setups for your
machines, you have a better chance of being able to dictate the future of your
product and control all parts of the stack as much as possible.

## How it’s made prod-ready

Like I said earlier, the three basic needs of any workload are compute, network,
and storage. Production architectures usually have three basic planes to support
them:

- The compute plane, which is almost certainly going to be either Docker or
  Kubernetes somehow.
- The network plane, which will be a Virtual Private Cloud (VPC) or overlay
  network that knits clusters together.
- The storage plane, which is usually the annoying exercise left to the reader,
  leading you to make yet another case for either using NFS or sparkly NFS like
  Longhorn.

Storage is the sticky bit; it hasn’t really changed since the beginning. You
either use a POSIX-compatible key-value store or an S3-compatible key-value
store. Both are used in practically the same ways that the framers intended in
the late ’80s and 2009, respectively. You chuck bytes into the system with a
name, and you get the bytes back when you give the name.

Storage is the really important part of your workloads. Your phone would not be
as useful if it didn’t remember your list of text messages when you rebooted it.
Many applications also (reasonably) assume that storage always works, is fast
enough that it’s not an issue, and is durable enough that they don’t have to
manually make backups.

What about latency? Human reaction time is about 250 milliseconds on average. It
takes about 250 milliseconds for a TCP session to be established between Berlin
and us-east-1. If you move your compute between providers, is your storage plane
also going to move data around to compensate?

If your storage plane doesn’t have egress costs and stores your data close to
where it’s used, this eliminates a lot of local storage complexity, at the cost
of additional compute time spent waiting to pull things and the network
throughput for them to arrive. Somehow compute is cheaper than storage in anno
dominium two-thousand twenty-four. No, I don’t get how that happened either.

### Pass-by-reference semantics for the cloud

Part of the secret for how people make these production platforms is that they
cheat: they avoid passing values around as much as possible. Instead, they pass
a reference to that value in the storage plane. When you upload an image to the
ChatGPT API
to see if it’s a picture of a horse, you do a file upload call and then an
inference call with the ID of that upload. This makes it easier to sling bytes
around and overall makes things a lot more efficient at the design level. This
is a lot like pass-by-reference semantics in programming languages like Java or
a pointer to a value in Go.
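
Here’s a minimal sketch of that pattern against an S3-compatible storage plane.
The bucket name, the inference endpoint, and the request fields are all made
up, and it assumes the usual AWS SDK credentials are in the environment:

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"log"
	"net/http"
	"os"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
	ctx := context.Background()

	awsCfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := s3.NewFromConfig(awsCfg)

	img, err := os.ReadFile("horse.jpg")
	if err != nil {
		log.Fatal(err)
	}

	// Pass one: upload the value once.
	key := "uploads/horse.jpg"
	if _, err := client.PutObject(ctx, &s3.PutObjectInput{
		Bucket: aws.String("my-inference-inputs"), // made-up bucket
		Key:    aws.String(key),
		Body:   bytes.NewReader(img),
	}); err != nil {
		log.Fatal(err)
	}

	// Pass two: every later call only moves the reference, not the bytes.
	body, err := json.Marshal(map[string]string{
		"image_key": key,
		"question":  "is this a horse?",
	})
	if err != nil {
		log.Fatal(err)
	}
	resp, err := http.Post("https://inference.example.com/v1/classify", "application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
}
```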

### The big queue

The other big secret is that there’s a layer on top of all of the compute: an
orchestrator with a queue.

This is the rest of the owl that nobody talks about. Just having compute,
network, and storage is not good enough; there needs to be a layer on top that
spreads the load between workers, intelligently minting and slaying them off as
reality demands.
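
A toy sketch of that layer, with a buffered channel standing in for the queue.
The one-second tick, the cap of 20 workers, and all of the names are arbitrary;
a real version would mint a GPU instance per worker (like the `mintInstance`
code below) and slay it when the queue stays empty:

```go
package orchestrator

import (
	"context"
	"log/slog"
	"time"
)

// Job is a unit of work; callers read the answer from Result.
type Job struct {
	Prompt string
	Result chan []byte
}

// Orchestrator is the layer on top: a queue plus logic to mint workers.
type Orchestrator struct {
	queue chan Job
}

func New(depth int) *Orchestrator {
	return &Orchestrator{queue: make(chan Job, depth)}
}

// Submit puts work on the queue.
func (o *Orchestrator) Submit(j Job) { o.queue <- j }

// Run checks the backlog once a second and mints a worker when it grows,
// up to an arbitrary cap of 20.
func (o *Orchestrator) Run(ctx context.Context) {
	t := time.NewTicker(time.Second)
	defer t.Stop()

	workers := 0
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			if len(o.queue) > 0 && workers < 20 {
				workers++
				go o.work(ctx, workers)
			}
		}
	}
}

// work drains jobs; a real worker would proxy them to its GPU instance.
func (o *Orchestrator) work(ctx context.Context, id int) {
	for {
		select {
		case <-ctx.Done():
			return
		case j := <-o.queue:
			slog.Info("handling job", "worker", id, "prompt", j.Prompt)
			j.Result <- []byte("TODO: actually do inference")
		}
	}
}
```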

## Okay but where’s the code?

Yeah, yeah, I get it, you want to see this live and in action. I don’t have an
example totally ready yet, but in lieu of drawing the owl right now, I can tell
you what you’d need in order to make it a reality on the cheap.

Let’s imagine that this is all done in one app; let’s call it orodayagzou (cf.
[Ôrödyagzou](https://www.youtube.com/watch?v=uuYmkZ-Aomo), Ithkuil for
“synesthesia”). This app is both an HTTP API and an orchestrator. It manages a
pool of worker nodes that do the actual AI inferencing.

So let’s say a user submits a request asking for a picture of a horse. That’ll
come in to the right HTTP route and it has logic like this:

```go
type ScaleToZeroProxy struct {
	cfg         Config
	ready       bool
	endpointURL string
	instanceID  int
	lock        sync.RWMutex
	lastUsed    time.Time
}

func (s *ScaleToZeroProxy) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	s.lock.RLock()
	ready := s.ready
	s.lock.RUnlock()

	if !ready {
		// TODO: implement instance creation
	}

	// Grab the endpoint under the read lock, then release it before proxying
	// so the write lock below can't deadlock against it.
	s.lock.RLock()
	u, err := url.Parse(s.endpointURL)
	s.lock.RUnlock()
	if err != nil {
		panic(err)
	}

	// The reverse proxy preserves the request's path and query, so the target
	// only needs the scheme, host, and port of the GPU instance.
	next := httputil.NewSingleHostReverseProxy(u)
	next.ServeHTTP(w, r)

	s.lock.Lock()
	s.lastUsed = time.Now()
	s.lock.Unlock()
}
```

This is a simple little HTTP proxy in Go. It has an endpoint URL and an instance
ID in memory, some logic to check if the instance is “ready”, and if it’s not,
logic to create it. Let’s mint an instance using the [Vast.ai](https://vast.ai)
CLI. First, some configuration:

```go
const (
	diskNeeded       = 36
	dockerImage      = "reg.xeiaso.net/runner/sdxl-tigris:latest"
	httpPort         = 5000
	modelBucketName  = "ciphanubakfu" // lojban: test-number-bag
	modelPath        = "glides/ponyxl"
	onStartCommand   = "python -m cog.server.http"
	publicBucketName = "xe-flux"

	searchCaveats = `verified=False cuda_max_good>=12.1 gpu_ram>=12 num_gpus=1 inet_down>=450`

	// assume awsAccessKeyID, awsSecretAccessKey, awsRegion, and awsEndpointURLS3 exist
)

type Config struct {
	diskNeeded     int // gigabytes
	dockerImage    string
	environment    map[string]string
	httpPort       int
	onStartCommand string
}
```
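
One thing the code below leans on that isn’t shown is `s.cfg.FormatEnvString()`.
A plausible version, assuming vast.ai’s `--env` flag wants a Docker-ish string
of `-p` port mappings and `-e KEY=VALUE` pairs, might look like:

```go
func (c Config) FormatEnvString() string {
	var sb strings.Builder

	// expose the inference server's HTTP port
	fmt.Fprintf(&sb, "-p %d:%d", c.httpPort, c.httpPort)

	// then dump every environment variable as -e KEY=VALUE
	for k, v := range c.environment {
		fmt.Fprintf(&sb, " -e %s=%s", k, v)
	}

	return sb.String()
}
```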

Then we can search for potential machines with some terrible wrappers around the
CLI:

```go
func runJSON[T any](ctx context.Context, args ...any) (T, error) {
	return trivial.andThusAnExerciseForTheReader[T](ctx, args)
}

func (s *ScaleToZeroProxy) mintInstance(ctx context.Context) error {
	s.lock.Lock()
	defer s.lock.Unlock()
	candidates, err := runJSON[[]vastai.SearchResponse](
		ctx,
		"vastai", "search", "offers",
		searchCaveats,
		"-o", "dph+", // sort by price (dollars per hour) increasing, cheapest option is first
		"--raw",      // output JSON
	)
	if err != nil {
		return fmt.Errorf("can't search for instances: %w", err)
	}

	if len(candidates) == 0 {
		return fmt.Errorf("no offers matched the search caveats")
	}

	// grab the cheapest option
	candidate := candidates[0]

	contractID := candidate.AskContractID
	slog.Info("found candidate instance",
		"contractID", contractID,
		"gpuName", candidate.GPUName,
		"cost", candidate.Search.TotalHour,
	)
	// ...
}
```

Then you can try to create it:

```go
func (s *ScaleToZeroProxy) mintInstance(ctx context.Context) error {
	// ...
	instanceData, err := runJSON[vastai.NewInstance](
		ctx,
		"vastai", "create", "instance",
		contractID,
		"--image", s.cfg.dockerImage,
		// dump ports and envvars into format vast.ai wants
		"--env", s.cfg.FormatEnvString(),
		"--disk", s.cfg.diskNeeded,
		"--onstart-cmd", s.cfg.onStartCommand,
		"--raw",
	)
	if err != nil {
		return fmt.Errorf("can't create new instance: %w", err)
	}

	slog.Info("created new instance", "instanceID", instanceData.NewContract)
	s.instanceID = instanceData.NewContract
	// ...
```

Then collect the endpoint URL:

```go
func (s *ScaleToZeroProxy) mintInstance(ctx context.Context) error {
	// ...
	instance, err := runJSON[vastai.Instance](
		ctx,
		"vastai", "show", "instance",
		instanceData.NewContract,
		"--raw",
	)
	if err != nil {
		return fmt.Errorf("can't show instance %d: %w", instanceData.NewContract, err)
	}

	s.endpointURL = fmt.Sprintf(
		"http://%s:%d",
		instance.PublicIPAddr,
		instance.Ports[fmt.Sprintf("%d/tcp", s.cfg.httpPort)][0].HostPort,
	)

	return nil
}
```

And then finally wire it up and have it test if the instance is ready somehow:

```go
func (s *ScaleToZeroProxy) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	// ...

	if !ready {
		if err := s.mintInstance(r.Context()); err != nil {
			slog.Error("can't mint new instance", "err", err)
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}

		t := time.NewTicker(5 * time.Second)
		defer t.Stop()
		for range t.C {
			if ok := s.testReady(r.Context()); ok {
				break
			}
		}
	}

	// ...
```
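
`testReady` isn’t shown above; a minimal version might just poke the inference
container’s health route (something like Cog’s `/health-check`) and treat any
200 as ready:

```go
func (s *ScaleToZeroProxy) testReady(ctx context.Context) bool {
	s.lock.RLock()
	endpoint := s.endpointURL
	s.lock.RUnlock()

	if endpoint == "" {
		return false
	}

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, endpoint+"/health-check", nil)
	if err != nil {
		return false
	}

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return false
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return false
	}

	s.lock.Lock()
	s.ready = true
	s.lock.Unlock()
	return true
}
```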

Then the rest of the logic will run through, the request will be passed to the
GPU instance and then a response will be fired. All that’s left is to slay the
instances off when they’re unused for about 5 minutes:

```go
func (s *ScaleToZeroProxy) maybeSlayLoop(ctx context.Context) {
	t := time.NewTicker(5 * time.Minute)
	defer t.Stop()

	for {
		select {
		case <-t.C:
			s.lock.RLock()
			lastUsed := s.lastUsed
			s.lock.RUnlock()

			if lastUsed.Add(5 * time.Minute).Before(time.Now()) {
				if err := s.slay(ctx); err != nil {
					slog.Error("can't slay instance", "err", err)
				}
			}
		case <-ctx.Done():
			return
		}
	}
}
```

Et voila! Run `maybeSlayLoop` in the background and implement the `slay()`
method to use the `vastai destroy instance` command, then you have yourself
nomadic compute that makes and destroys itself on demand to the lowest bidder.
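
For completeness, here’s roughly what that `slay` method could look like,
reusing the same `runJSON` wrapper from earlier (the `vastai.DestroyResponse`
type is made up):

```go
func (s *ScaleToZeroProxy) slay(ctx context.Context) error {
	s.lock.Lock()
	defer s.lock.Unlock()

	// tear down the rental so we stop paying for its compute and disk
	if _, err := runJSON[vastai.DestroyResponse](
		ctx,
		"vastai", "destroy", "instance",
		s.instanceID,
		"--raw",
	); err != nil {
		return fmt.Errorf("can't destroy instance %d: %w", s.instanceID, err)
	}

	s.instanceID = 0
	s.endpointURL = ""
	s.ready = false
	return nil
}
```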

Of course, any production-ready implementation would have limits like “don’t
have more than 20 workers” and segment things into multiple work queues. This is
all really hypothetical right now; I wish I had something you could just
`kubectl apply` and use today, but I don’t.

I’m going to be working on this on my Friday streams
[on Twitch](https://twitch.tv/princessxen) until it’s done. I’m going to
implement it from an empty folder and then work on making it a Kubernetes
operator to run any task you want. It’s going to involve generative AI, API
reverse engineering, eternal torment, and hopefully not getting banned from the
providers I’m going to be using. It should be a blast!

## Conclusion

Every workload involves compute, network, and storage on top of production’s
compute plane, network plane, and storage plane. Design your production clusters
to take advantage of very well-understood fundamentals like HTTP, queues, and
object storage so that you can reduce your dependencies to the bare minimum.
Make your app an orchestrator of vast amounts of cheap compute so you don’t need
to pay for compute or storage that nobody is using while everyone is asleep.

This basic pattern is applicable to just about anything on any platform, not
just AI and not just with Tigris. We hope that by publishing this architectural
design, you’ll take it to heart when building your production workloads of the
future so that we can all use the cloud responsibly. Certain parts of the
economics of this pattern do work best when egress is free (or basically free),
though.

We’re excited about building the best possible storage layer based on the
lessons learned building the storage layer Uber uses to service millions of
rides per month. If you try us and disagree, that’s fine, we won’t nickel and
dime you on the way out because we don’t charge egress costs.

When all of these concerns are made easier, all that’s left for you is to draw
the rest of the owl and get out there disrupting industries.