author    Xe Iaso <me@xeiaso.net>    2024-11-12 10:04:12 -0500
committer Xe Iaso <me@xeiaso.net>    2024-11-12 10:04:12 -0500
commit    2f5122adae02b0da8cd31156e847b8d906d0eb0b (patch)
tree      a457155ca1422f6ada3cc2e289c44838bda1e326
parent    7f5d2dc5340e5117d76645b0552252850c26d62c (diff)
blog/2024: add nomadic compute linkpost
Signed-off-by: Xe Iaso <me@xeiaso.net>
-rw-r--r-- lume/src/blog/2024/tigris-nomadic-compute.mdx | 529
1 file changed, 529 insertions(+), 0 deletions(-)
diff --git a/lume/src/blog/2024/tigris-nomadic-compute.mdx b/lume/src/blog/2024/tigris-nomadic-compute.mdx
new file mode 100644
index 0000000..c73d8ba
--- /dev/null
+++ b/lume/src/blog/2024/tigris-nomadic-compute.mdx
@@ -0,0 +1,529 @@
+---
+title: "Nomadic Infrastructure Design for AI Workloads"
+date: 2024-11-12
+redirect_to: "https://tigrisdata.com/blog/nomadic-compute/"
+hero:
+ ai: "Flux [dev] by Black Forest Labs"
+ file: "_yj_eBqjMOIe0Bv-oQxoy"
+ prompt: "A nomadic server hunts for GPUs, powered by Taco Bell"
+---
+
+Taco Bell is a miracle of food preparation. They manage to have a menu of
+dozens of items that all boil down to permutations of the same handful of basic
+ingredients: meat, cheese, beans, vegetables, bread, and sauces. Those
+fundamentals are combined in new and interesting ways to give you the
+crunchwrap, the chalupa, the doritos locos tacos, and more. Just add hot water
+and they’re ready to eat.
+
+Even though the results are exciting, the ingredients for them are not. They’re
+all really simple things. The best-designed production systems I’ve ever used
+take the same basic idea: build exciting things out of boring components that
+are well understood across all facets of the industry (e.g. S3, Postgres, HTTP,
+JSON, YAML). Those boring components add up to the exciting product your pitch
+deck promises will disrupt the industry.
+
+A bunch of companies want to sell you inference time for your AI workloads, or
+want to run the inference for you and sell you the results, but nobody really
+tells you how to build this yourself. That’s the special Mexican Pizza sauce
+that you can’t replicate at home no matter how much you want to.
+
+Today, we’ll cover how you, a random nerd that likes reading architectural
+articles, should design a production-ready AI system so that you can maximize
+effectiveness per dollar, reduce dependency lock-in, and separate concerns down
+to their cores. Buckle up, it’s gonna be a ride.
+
+<Conv name="Mara" mood="hacker">
+ The industry uses like a billion different terms for “unit of compute that has
+ access to a network connection and the ability to store things for some amount
+ of time” that all conflict in mutually incompatible ways. When you read
+ “workload”, you should think of some program that has access to some network
+ and some amount of storage through some means, running somewhere, probably in
+ a container.
+</Conv>
+
+## The fundamentals of any workload
+
+At the core, any workload (computer games, iPadOS apps, REST APIs, Kubernetes,
+$5 Hetzner VPSen, etc.) is a combination of three basic factors:
+
+- Compute, or the part that executes code and does math
+- Network, or the part that lets you dial and accept sockets
+- Storage, or the part that remembers things for next time
+
+In reality, these things overlap a little (compute has storage in the form of
+RAM, some network cards run their own Linux kernel, and storage is frequently
+accessed over the network), but they still map very cleanly onto the basic
+things that you’re billed for in the cloud:
+
+- Gigabyte-core-seconds of compute
+- Gigabytes egressed over the network
+- Gigabytes stored in persistent storage
+
+And of course, there’s a huge money premium for any of this being involved in AI
+anything because people will pay. However, let’s take a closer look at that
+second basic thing you’re billed for:
+
+> - Gigabytes egressed over the network
+
+Note that it’s _egress_ out of your compute, not _ingress_ to your compute.
+Providers generally want to make it easy for you to put your data into their
+platform and harder to get it back out. This is usually combined with your
+storage layer, which can make it annoying and expensive to deal with data that
+is bigger than your local disk. Your local disk is frequently way too small to
+store everything, so you have to make compromises.
+
+What if your storage layer didn’t charge you per gigabyte of data you fetched
+out of it? What classes of problems would that allow you to solve that were
+previously too expensive to execute on?
+
+If you put your storage in a service that is low-latency, close to your servers,
+and has no egress fees, then it can actually be cheaper to pull things from
+object storage just-in-time to use them than it is to store them persistently.
+
+### Storage that is left idle is more expensive than compute time
+
+In serverless (Lambda) scenarios, most of the time your application is turned
+off. This is good. This is what you want. You want it to turn on when it’s
+needed, and turn back off when it’s not. When you do a setup like this, you also
+usually assume that a cold start of the service is fast enough that the user
+doesn’t mind.
+
+Let’s say that your AI app needs a few dozen gigabytes of local disk space for
+your Docker image with the inference engine and the downloaded model weights. In
+some clouds (such as Vast.ai), having that data sit there doing nothing can cost
+you upwards of $4-10 per month, even if the actual compute time is as low as
+$0.99 per hour. If you’re using Flux [dev] (12 billion parameters, 25 GB of
+weight bytes) and those weights take 5 minutes to download, you’re only spending
+about $0.08 of compute time waiting for the download. If you’re only doing
+inference in bulk scenarios where latency doesn’t matter as much, it can be
+much, much cheaper to dynamically mint new instances, download the model weights
+from object storage, do all of the inference you need, and then slay those
+instances off when you’re done.
+
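+To put numbers on that, here’s a back-of-the-envelope sketch using the figures
+above. The prices are illustrative placeholders, not quotes from any provider:
+
+```go
+package main
+
+import "fmt"
+
+func main() {
+	const (
+		idleDiskPerMonth = 7.00 // USD/month, middle of the $4-10 range above
+		computePerHour   = 0.99 // USD/hour, the GPU rate above
+		downloadMinutes  = 5.0  // minutes spent pulling weights per cold start
+		batchesPerMonth  = 30.0 // one bulk inference run per day
+	)
+
+	// compute time burned waiting for the weights, per run and per month
+	perRun := computePerHour * downloadMinutes / 60.0
+	perMonth := perRun * batchesPerMonth
+
+	fmt.Printf("download wait per run:   $%.2f\n", perRun)   // ≈ $0.08
+	fmt.Printf("download wait per month: $%.2f\n", perMonth) // ≈ $2.48
+	fmt.Printf("idle disk per month:     $%.2f\n", idleDiskPerMonth)
+}
+```
+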
+Most of the time, any production workload’s request rate is going to follow a
+sinusoidal curve where there’s peak usage for about 8 hours in the middle of the
+day and things will fall off overnight as everyone goes to bed. If you spin up
+AI inference servers on demand following this curve, this means that the first
+person of the day to use an AI feature could have it take a bit longer for the
+server to get its coffee, but it’ll be hot’n’ready for the next user when they
+use that feature.
+
+You can even cheat further with optional features: the first user doesn’t
+actually see them, but their request triggers the AI inference backend to wake
+up for the next one.
+
+### It may not be your money, but the amounts add up
+
+When you set up cloud compute, it’s really easy to fall prey to the siren song
+of the seemingly bottomless budget of the corporate card. At a certain point, we
+all need to build a sustainable business as the AI hype wears off and the free
+tier ends. However, thanks to the idea of Taco Bell infrastructure design, you
+can reduce the risk of lock-in and increase flexibility between providers so you
+can lower your burn rate.
+
+On many platforms, data ingress is free. Data _egress_ is where they get you.
+It’s such a problem for businesses that the
+[EU has had to step in and tell providers that people need an easy way out](https://commission.europa.eu/news/data-act-enters-force-what-it-means-you-2024-01-11_en).
+Every gigabyte of data you put into those platforms is another $0.05 that it’ll
+cost to move away should you need to.
+
+This doesn’t sound like an issue, because the CTO’s negotiating dream is that
+they’ll be able to play the “we’re gonna move our stuff elsewhere” card,
+instantly win a discount, and get a fantastic deal that will enable future
+growth or whatever.
+
+This is a nice dream.
+
+In reality, the sales representative has a number in big red letters in front of
+them. That number is how much it would cost you to move your 3 petabytes of data
+off of their cloud (at $0.05 per gigabyte, that’s about $150,000 in egress fees
+alone). You both know you’re stuck with each other, and you’ll happily take an
+additional measly 5% discount on top of the 10% discount you negotiated last
+year. We all know that the actual cost of running the service is 15% of even
+that cost, but the capitalism machine has to eat somehow, right?
+
+## On the nature of dependencies
+
+Let’s be real, dependencies aren’t fundamentally bad things to have. All of us
+have a hard dependency on the Internet, amd64 CPUs, water, and storage.
+Everything’s a tradeoff. The potentially harmful part comes in when your
+dependency locks you in so you can’t switch away easily.
+
+This is normally pretty bad with traditional compute setups, but it can be
+extra insidious with AI workloads. AI workloads make cloud companies staggering
+amounts of money, so they want to keep those workloads on their servers as much
+as possible and extract as much revenue from you as they can. Combine this with
+the big red number disadvantage in negotiations, and you can find yourself
+backed into a corner.
+
+### Strategic dependency choice
+
+This is why picking your dependencies is such a huge thing to consider. There’s
+a lot to be said about choosing dependencies to minimize vendor lock-in, and
+that’s where the Taco Bell infrastructure philosophy comes in:
+
+- Trigger compute with HTTP requests that use well-defined schemata.
+- Find your target using DNS.
+- Store things you want to keep in Postgres or object storage.
+- Fetch things out of storage when you need them.
+- Mint new workers when there is work to be done.
+- Slay those workers off when they’re not needed anymore.
+
+If you follow these rules, you can easily make your compute nomadic between
+services. Capitalize on things like Kubernetes (the universal API for cloud
+compute, as much as I hate that it won), and you make the underlying clouds an
+implementation detail that can be swapped out as you find better strategic
+partnerships that can offer you more than a measly 5% discount.
+
+Just add water.
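+
+As a concrete example of the first rule, here’s a minimal sketch of what a
+well-defined schema for the image generation workload later in this post could
+look like. The type and field names are hypothetical, not a published API:
+
+```go
+// GenerateRequest is what a client POSTs to the orchestrator over HTTP.
+type GenerateRequest struct {
+	Prompt string `json:"prompt"`          // e.g. "a photo of a horse"
+	Steps  int    `json:"steps,omitempty"` // optional sampler steps
+}
+
+// GenerateResponse points at the result in object storage instead of
+// embedding the image bytes in the response body.
+type GenerateResponse struct {
+	OutputKey string `json:"output_key"` // object key in the public bucket
+}
+```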
+
+### How AI models become dependencies
+
+There's an extra evil way that AI models can become production-critical
+dependencies. Most of the time when you implement an application that uses an AI
+model, you end up encoding "workarounds" for the model into the prompts you use.
+This happens because AI models are fundamentally unpredictable and unreliable
+tools that sometimes give you the output you want. Even so, changing out models
+_sounds_ like it should be easy. You _just_ swap the model and then you can take
+advantage of better accuracy, new features like tool use, or JSON schema
+prompting, right?
+
+In many cases, changing out a model will result in a service that superficially
+looks and functions the same. You give it a meeting transcript, it tells you
+what the action items are. The problem comes in with the subtle nuances of the
+je ne sais quoi of the experience. Even subtle differences like
+[the current date being in the month of December](https://arstechnica.com/information-technology/2023/12/is-chatgpt-becoming-lazier-because-its-december-people-run-tests-to-find-out/)
+can _drastically_ change the quality of output. A
+[recent paper from Apple](https://arxiv.org/pdf/2410.05229) concluded that
+adding superficial details that wouldn't throw off a human can severely impact
+the performance of large language models. Heck, they even struggle with or fall
+prey to fairly trivial questions that humans find easy, such as:
+
+- How many r's are in the word "strawberry"?
+- What's heavier: 2 pounds of bricks, one pound of heavy strawberries, or three
+ pounds of air?
+
+If changing the placement of a comma in a prompt can have such a huge impact on
+the user experience, what would changing the model do? What happens when you’re
+forced to change the model because the provider is deprecating it so they can
+run newer models that don’t do the job as well as the one you currently use?
+This is a really evil kind of dependency that you can only get when you rely on
+cloud-hosted models. By controlling the weights and inference setup for your
+machines, you have a better chance of being able to dictate the future of your
+product and control as much of the stack as possible.
+
+## How it’s made prod-ready
+
+Like I said earlier, the three basic needs of any workload are compute, network,
+and storage. Production architectures usually have three basic planes to support
+them:
+
+- The compute plane, which is almost certainly going to be either Docker or
+ Kubernetes somehow.
+- The network plane, which will be a Virtual Private Cloud (VPC) or overlay
+ network that knits clusters together.
+- The storage plane, which is usually the annoying exercise left to the reader,
+ leading you to make yet another case for either using NFS or sparkly NFS like
+ Longhorn.
+
+Storage is the sticky bit; it hasn’t really changed since the beginning. You
+either use a POSIX-compatible key-value store or an S3-compatible key-value
+store. Both are used in practically the same ways the framers intended back in
+the late 80s and in 2006 respectively. You chuck bytes into the system with a
+name, and you get the bytes back when you give the name.
+
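+In code, that contract is about as small as it sounds. Here’s a minimal sketch
+with the AWS SDK for Go v2; the bucket and key are the made-up names from later
+in this post, and the client picks up credentials, the region, and (in recent
+SDK releases) the S3 endpoint from the usual `AWS_*` environment variables:
+
+```go
+package main
+
+import (
+	"context"
+	"io"
+	"log"
+	"os"
+	"strings"
+
+	"github.com/aws/aws-sdk-go-v2/aws"
+	"github.com/aws/aws-sdk-go-v2/config"
+	"github.com/aws/aws-sdk-go-v2/service/s3"
+)
+
+func main() {
+	ctx := context.Background()
+
+	cfg, err := config.LoadDefaultConfig(ctx)
+	if err != nil {
+		log.Fatal(err)
+	}
+	client := s3.NewFromConfig(cfg)
+
+	// chuck bytes into the system with a name...
+	if _, err := client.PutObject(ctx, &s3.PutObjectInput{
+		Bucket: aws.String("ciphanubakfu"),
+		Key:    aws.String("glides/ponyxl/model.safetensors"),
+		Body:   strings.NewReader("pretend this is 25 GB of model weights"),
+	}); err != nil {
+		log.Fatal(err)
+	}
+
+	// ...and get the bytes back when you give the name
+	obj, err := client.GetObject(ctx, &s3.GetObjectInput{
+		Bucket: aws.String("ciphanubakfu"),
+		Key:    aws.String("glides/ponyxl/model.safetensors"),
+	})
+	if err != nil {
+		log.Fatal(err)
+	}
+	defer obj.Body.Close()
+	if _, err := io.Copy(os.Stdout, obj.Body); err != nil {
+		log.Fatal(err)
+	}
+}
+```
+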
+Storage is the really important part of your workloads. Your phone would not be
+as useful if it didn’t remember your list of text messages when you rebooted it.
+Many applications also (reasonably) assume that storage always works, is fast
+enough that it’s not an issue, and is durable enough that they don’t have to
+manually make backups.
+
+What about latency? Human reaction time is about 250 milliseconds on average. It
+takes about 250 milliseconds for a TCP session to be established between Berlin
+and us-east-1. If you move your compute between providers, is your storage plane
+also going to move data around to compensate?
+
+If your storage plane doesn’t have egress costs and stores your data close to
+where it’s used, this eliminates a lot of local storage complexity, at the cost
+of additional compute time spent waiting to pull things and the network
+throughput for them to arrive. Somehow compute is cheaper than storage in anno
+Domini two-thousand twenty-four. No, I don’t get how that happened either.
+
+### Pass-by-reference semantics for the cloud
+
+Part of the secret of how people make these production platforms is that they
+cheat: they avoid passing values around as much as possible. Instead, they pass
+a reference to that value in the storage plane. When you upload an image to the
+ChatGPT API to see if it’s a picture of a horse, you do a file upload call and
+then an inference call with the ID of that upload. This makes it easier to sling
+bytes around and overall makes things a lot more efficient at the design level.
+It’s a lot like pass-by-reference semantics in languages like Java, or passing a
+pointer to a value in Go.
+
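+A hedged sketch of what that two-call pattern looks like from the client’s side,
+leaning only on the standard library; the endpoint paths and field names here
+are invented for illustration, not the real ChatGPT API:
+
+```go
+// uploadThenInfer does one call to upload the value and a second call that
+// only carries a reference (the file ID) instead of the bytes themselves.
+func uploadThenInfer(ctx context.Context, c *http.Client, img io.Reader) (string, error) {
+	// call 1: upload the value, get back a reference
+	req, err := http.NewRequestWithContext(ctx, http.MethodPost, "https://api.example.com/v1/files", img)
+	if err != nil {
+		return "", err
+	}
+	resp, err := c.Do(req)
+	if err != nil {
+		return "", err
+	}
+	defer resp.Body.Close()
+
+	var file struct {
+		ID string `json:"id"`
+	}
+	if err := json.NewDecoder(resp.Body).Decode(&file); err != nil {
+		return "", err
+	}
+
+	// call 2: pass the reference, not the bytes
+	body, err := json.Marshal(map[string]string{
+		"file_id": file.ID,
+		"prompt":  "is this a picture of a horse?",
+	})
+	if err != nil {
+		return "", err
+	}
+	req, err = http.NewRequestWithContext(ctx, http.MethodPost, "https://api.example.com/v1/inference", bytes.NewReader(body))
+	if err != nil {
+		return "", err
+	}
+	req.Header.Set("Content-Type", "application/json")
+	resp, err = c.Do(req)
+	if err != nil {
+		return "", err
+	}
+	defer resp.Body.Close()
+
+	answer, err := io.ReadAll(resp.Body)
+	return string(answer), err
+}
+```
+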
+### The big queue
+
+The other big secret is that there’s a layer on top of all of the compute: an
+orchestrator with a queue.
+
+This is the rest of the owl that nobody talks about. Just having compute,
+network, and storage is not good enough; there needs to be a layer on top that
+spreads the load between workers, intelligently minting and slaying them off as
+reality demands.
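+
+Stripped to its bones (and with all the real-world limits and error handling
+left out), that layer is roughly a queue plus a scaling loop. `Job`, `mint`, and
+`slay` here are hypothetical stand-ins, not the API of any real system:
+
+```go
+// Job is a unit of work. The payloads live in object storage; the queue
+// only carries references to them.
+type Job struct {
+	InputKey  string // where the worker fetches its input
+	OutputKey string // where the worker writes its result
+}
+
+// Orchestrator watches the queue depth and mints or slays workers to match.
+type Orchestrator struct {
+	queue   chan Job
+	workers int
+	mint    func(ctx context.Context) error // e.g. create a GPU instance that pulls Jobs
+	slay    func(ctx context.Context) error // e.g. destroy it again
+}
+
+func (o *Orchestrator) run(ctx context.Context) {
+	t := time.NewTicker(30 * time.Second)
+	defer t.Stop()
+
+	for {
+		select {
+		case <-t.C:
+			switch {
+			case len(o.queue) > 0 && o.workers == 0:
+				// backlog and nobody to work on it: mint a worker
+				if err := o.mint(ctx); err == nil {
+					o.workers++
+				}
+			case len(o.queue) == 0 && o.workers > 0:
+				// queue drained: stop paying for an idle GPU
+				if err := o.slay(ctx); err == nil {
+					o.workers--
+				}
+			}
+		case <-ctx.Done():
+			return
+		}
+	}
+}
+```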
+
+## Okay but where’s the code?
+
+Yeah, yeah, I get it, you want to see this live and in action. I don’t have an
+example totally ready yet, but in lieu of drawing the owl right now, I can tell
+you what you’d need in order to make it a reality on the cheap.
+
+Let’s imagine that this is all done in one app, let’s call it orodayagzou (c.f.
+[Ôrödyagzou](https://www.youtube.com/watch?v=uuYmkZ-Aomo), Ithkuil for
+“synesthesia”). This app is both an HTTP API and an orchestrator. It manages a
+pool of worker nodes that do the actual AI inferencing.
+
+So let’s say a user submits a request asking for a picture of a horse. That’ll
+come in to the right HTTP route, which has logic like this:
+
+```go
+type ScaleToZeroProxy struct {
+	cfg         Config
+	ready       bool
+	endpointURL string
+	instanceID  int
+	lock        sync.RWMutex
+	lastUsed    time.Time
+}
+
+func (s *ScaleToZeroProxy) ServeHTTP(w http.ResponseWriter, r *http.Request) {
+	s.lock.RLock()
+	ready := s.ready
+	s.lock.RUnlock()
+
+	if !ready {
+		// TODO: implement instance creation
+	}
+
+	s.lock.RLock()
+	u, err := url.Parse(s.endpointURL)
+	s.lock.RUnlock()
+	if err != nil {
+		panic(err)
+	}
+
+	u.Path = r.URL.Path
+	u.RawQuery = r.URL.RawQuery
+
+	next := httputil.NewSingleHostReverseProxy(u)
+	next.ServeHTTP(w, r)
+
+	// remember when this instance last did useful work so the slay loop
+	// can tell when it has been idle long enough
+	s.lock.Lock()
+	s.lastUsed = time.Now()
+	s.lock.Unlock()
+}
+```
+
+This is a simple little HTTP proxy in Go. It keeps an endpoint URL and an
+instance ID in memory, along with some logic to check if the instance is “ready”
+and to create it if it’s not. Let’s mint an instance using the
+[Vast.ai](http://Vast.ai) CLI. First, some configuration:
+
+```go
+const (
+	diskNeeded       = 36
+	dockerImage      = "reg.xeiaso.net/runner/sdxl-tigris:latest"
+	httpPort         = 5000
+	modelBucketName  = "ciphanubakfu" // lojban: test-number-bag
+	modelPath        = "glides/ponyxl"
+	onStartCommand   = "python -m cog.server.http"
+	publicBucketName = "xe-flux"
+
+	searchCaveats = `verified=False cuda_max_good>=12.1 gpu_ram>=12 num_gpus=1 inet_down>=450`
+
+	// assume awsAccessKeyID, awsSecretAccessKey, awsRegion, and awsEndpointURLS3 exist
+)
+
+type Config struct {
+	diskNeeded     int // gigabytes
+	dockerImage    string
+	environment    map[string]string
+	httpPort       int
+	onStartCommand string
+}
+```
+
+Then we can search for potential machines with some terrible wrappers around
+the CLI:
+
+```go
+func runJSON[T any](ctx context.Context, args ...any) (T, error) {
+	return trivial.andThusAnExerciseForTheReader[T](ctx, args)
+}
+
+func (s *ScaleToZeroProxy) mintInstance(ctx context.Context) error {
+	s.lock.Lock()
+	defer s.lock.Unlock()
+	candidates, err := runJSON[[]vastai.SearchResponse](
+		ctx,
+		"vastai", "search", "offers",
+		searchCaveats,
+		"-o", "dph+", // sort by price (dollars per hour) increasing, cheapest option is first
+		"--raw", // output JSON
+	)
+	if err != nil {
+		return fmt.Errorf("can't search for instances: %w", err)
+	}
+
+	// grab the cheapest option
+	candidate := candidates[0]
+
+	contractID := candidate.AskContractID
+	slog.Info("found candidate instance",
+		"contractID", contractID,
+		"gpuName", candidate.GPUName,
+		"cost", candidate.Search.TotalHour,
+	)
+	// ...
+}
+```
+
+Then you can try to create it:
+
+```go
+func (s *ScaleToZeroProxy) mintInstance(ctx context.Context) error {
+	// ...
+	instanceData, err := runJSON[vastai.NewInstance](
+		ctx,
+		"vastai", "create", "instance",
+		contractID,
+		"--image", s.cfg.dockerImage,
+		// dump ports and envvars into format vast.ai wants
+		"--env", s.cfg.FormatEnvString(),
+		"--disk", s.cfg.diskNeeded,
+		"--onstart-cmd", s.cfg.onStartCommand,
+		"--raw",
+	)
+	if err != nil {
+		return fmt.Errorf("can't create new instance: %w", err)
+	}
+
+	slog.Info("created new instance", "instanceID", instanceData.NewContract)
+	s.instanceID = instanceData.NewContract
+	// ...
+```
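+
+The `--env` flag above is fed by a `FormatEnvString` helper that never gets
+shown. Here’s a guess at what it could look like, assuming vast.ai accepts
+docker-style `-e KEY=VALUE` and `-p` port options in that one string; check the
+CLI docs before trusting this:
+
+```go
+// FormatEnvString flattens cfg.environment into the single string handed to
+// the vast.ai CLI's --env flag. The exact format here is an assumption.
+func (c Config) FormatEnvString() string {
+	var sb strings.Builder
+	for k, v := range c.environment {
+		fmt.Fprintf(&sb, "-e %s=%s ", k, v)
+	}
+	// expose the inference server's HTTP port too
+	fmt.Fprintf(&sb, "-p %d:%d", c.httpPort, c.httpPort)
+	return sb.String()
+}
+```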
+
+Then collect the endpoint URL:
+
+```go
+func (s *ScaleToZeroProxy) mintInstance(ctx context.Context) error {
+	// ...
+	instance, err := runJSON[vastai.Instance](
+		ctx,
+		"vastai", "show", "instance",
+		instanceData.NewContract,
+		"--raw",
+	)
+	if err != nil {
+		return fmt.Errorf("can't show instance %d: %w", instanceData.NewContract, err)
+	}
+
+	s.endpointURL = fmt.Sprintf(
+		"http://%s:%d",
+		instance.PublicIPAddr,
+		instance.Ports[fmt.Sprintf("%d/tcp", s.cfg.httpPort)][0].HostPort,
+	)
+
+	return nil
+}
+```
+
+And then finally wire it up and have it test if the instance is ready somehow:
+
+```go
+func (s *ScaleToZeroProxy) ServeHTTP(w http.ResponseWriter, r *http.Request) {
+	// ...
+
+	if !ready {
+		if err := s.mintInstance(r.Context()); err != nil {
+			slog.Error("can't mint new instance", "err", err)
+			http.Error(w, err.Error(), http.StatusInternalServerError)
+			return
+		}
+
+		t := time.NewTicker(5 * time.Second)
+		defer t.Stop()
+		for range t.C {
+			if ok := s.testReady(r.Context()); ok {
+				break
+			}
+		}
+	}
+
+	// ...
+```
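+
+`testReady` is also left as an exercise above. A minimal version just pokes the
+instance over HTTP until something answers; the `/health-check` path here is
+what cog’s HTTP server exposes, so swap it for whatever your inference engine
+actually serves:
+
+```go
+func (s *ScaleToZeroProxy) testReady(ctx context.Context) bool {
+	s.lock.RLock()
+	endpoint := s.endpointURL
+	s.lock.RUnlock()
+
+	if endpoint == "" {
+		return false
+	}
+
+	req, err := http.NewRequestWithContext(ctx, http.MethodGet, endpoint+"/health-check", nil)
+	if err != nil {
+		return false
+	}
+	resp, err := http.DefaultClient.Do(req)
+	if err != nil {
+		return false
+	}
+	defer resp.Body.Close()
+
+	if resp.StatusCode != http.StatusOK {
+		return false
+	}
+
+	// mark the proxy as ready so future requests skip the minting path
+	s.lock.Lock()
+	s.ready = true
+	s.lock.Unlock()
+	return true
+}
+```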
+
+Then the rest of the logic runs through: the request gets passed to the GPU
+instance and a response gets fired back. All that’s left is to slay the
+instances off when they’ve been unused for about 5 minutes:
+
+```go
+func (s *ScaleToZeroProxy) maybeSlayLoop(ctx context.Context) {
+	t := time.NewTicker(5 * time.Minute)
+	defer t.Stop()
+
+	for {
+		select {
+		case <-t.C:
+			s.lock.RLock()
+			lastUsed := s.lastUsed
+			s.lock.RUnlock()
+
+			if lastUsed.Add(5 * time.Minute).Before(time.Now()) {
+				if err := s.slay(ctx); err != nil {
+					slog.Error("can't slay instance", "err", err)
+				}
+			}
+		case <-ctx.Done():
+			return
+		}
+	}
+}
+```
+
+Et voilà! Run `maybeSlayLoop` in the background and implement the `slay()`
+method with the `vastai destroy instance` command, and you have yourself nomadic
+compute that mints and destroys itself on demand, always chasing the lowest
+bidder.
+
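+For completeness, here’s a sketch of what that `slay` method might look like
+using the same `runJSON` wrapper as before; the output type is a guess, since
+all we care about is whether the command succeeded:
+
+```go
+func (s *ScaleToZeroProxy) slay(ctx context.Context) error {
+	s.lock.Lock()
+	defer s.lock.Unlock()
+
+	if s.instanceID == 0 {
+		return nil // nothing to slay
+	}
+
+	// vastai destroy instance <id>, via the same terrible CLI wrapper
+	if _, err := runJSON[map[string]any](
+		ctx,
+		"vastai", "destroy", "instance",
+		s.instanceID,
+		"--raw",
+	); err != nil {
+		return fmt.Errorf("can't destroy instance %d: %w", s.instanceID, err)
+	}
+
+	slog.Info("slayed instance", "instanceID", s.instanceID)
+	s.instanceID = 0
+	s.ready = false
+	s.endpointURL = ""
+
+	return nil
+}
+```
+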
+Of course, any production-ready implementation would have limits like “don’t
+have more than 20 workers” and would segment things into multiple work queues.
+This is all really hypothetical right now; I wish I had something you could
+`kubectl apply` and use today, but I don’t.
+
+I’m going to be working on this on my Friday streams
+[on Twitch](https://twitch.tv/princessxen) until it’s done. I’m going to
+implement it from an empty folder and then work on making it a Kubernetes
+operator to run any task you want. It’s going to involve generative AI, API
+reverse engineering, eternal torment, and hopefully not getting banned from the
+providers I’m going to be using. It should be a blast!
+
+## Conclusion
+
+Every workload involves compute, network, and storage on top of production’s
+compute plane, network plane, and storage plane. Design your production clusters
+to take advantage of very well-understood fundamentals like HTTP, queues, and
+object storage so that you can reduce your dependencies to the bare minimum.
+Make your app an orchestrator of vast amounts of cheap compute so you don’t need
+to pay for compute or storage that nobody is using while everyone is asleep.
+
+This basic pattern is applicable to just about anything on any platform, not
+just AI and not just with Tigris. We hope that by publishing this architectural
+design, you’ll take it to heart when building your production workloads of the
+future so that we can all use the cloud responsibly. Certain parts of the
+economics of this pattern do work best when egress is free (or basically free),
+though.
+
+We’re excited about building the best possible storage layer based on the
+lessons learned building the storage layer Uber uses to service millions of
+rides per month. If you try us and disagree, that’s fine; we won’t nickel and
+dime you on the way out, because we don’t charge egress fees.
+
+When all of these concerns are made easier, all that’s left for you is to draw
+the rest of the owl and get out there disrupting industries.