diff options
| author | Xe Iaso <me@xeiaso.net> | 2023-06-24 23:21:42 -0400 |
|---|---|---|
| committer | Xe Iaso <me@xeiaso.net> | 2023-06-24 23:21:52 -0400 |
| commit | 1f4fc577ffa7a2310505b0a4cd2b09ae53392028 (patch) | |
| tree | 04240b6344f25d616c0cdd99a9e8504b2162b22f | |
| parent | 08ede42837052c0517319884018361ef9d4718e7 (diff) | |
| download | xesite-1f4fc577ffa7a2310505b0a4cd2b09ae53392028.tar.xz xesite-1f4fc577ffa7a2310505b0a4cd2b09ae53392028.zip | |
blog: nosleep
Signed-off-by: Xe Iaso <me@xeiaso.net>
| -rw-r--r-- | blog/nosleep.markdown | 195 |
1 files changed, 195 insertions, 0 deletions
diff --git a/blog/nosleep.markdown b/blog/nosleep.markdown new file mode 100644 index 0000000..1fa7799 --- /dev/null +++ b/blog/nosleep.markdown @@ -0,0 +1,195 @@ +--- +title: Time is not a synchronization primitive +date: 2023-06-25 +--- + +Programming is so complicated. I know this is an example of the +nostalgia paradox in action, but it easily feels like everything has +gotten so much more complicated over the course of my career. One of +the biggest things that is really complicated is the fact that working +with other people is always super complicated. + +One of the axioms you end up working with is "assume best intent". +This has sometimes been used as a dog-whistle to defend pathological +behavior; but really there is a good idea at the core of this: +everyone is really trying to do the best that they can given their +limited time and energy and it's usually better to start from the +position of "the system that allowed this failure to happen is the +thing that must be fixed". + +However, we work with other people and this can result in things that +can troll you on accident. One of the biggest sources of friction is +when people end up creating tests that can fail for no reason. To make +this even more fun, this will end up breaking people's trust in CI +systems. This lack of trust trains people that it's okay for CI to +fail because sometimes it's not your fault. This leads to hacks like +the flaky attribute on python where it will ignore test failures. Or +even worse, it trains people to merge broken code to main because +they're trained that sometimes CI just fails but everything is okay. + +Today I want to talk about one of the most common ways that I see +things fall apart. This has caused tests, production-load-bearing bash +scripts, and normal application code to be unresponsive at best and +randomly break at worst. It's when people use time as a +synchronization mechanism. + +## Time as an effect + +<xeblog-conv name="Aoi" mood="wut">What do you mean by that? That +sounds mathy as all heck.</xeblog-conv> + +I think that the best way to explain this is to start with a flaky +test that I wrote years ago and break it down to explain why things +are flaky and what I mean by a "synchronization mechanism". Consider +this Go test: + +```go +func TestListener(t *testing.T) { + ctx, cancel := context.WithCancel(context.Background()) + defer cancel() + + go func() { + lis, err := net.Listen("tcp", ":1337") + if err != nil { + t.Error(err) + return + } + defer lis.Close() + for { + select { + case <- ctx.Done(): + return + default: + } + conn, err := lis.Accept() + if err != nil { + t.Error(err) + return + } + + // do something with conn + }() + + time.Sleep(150*time.Millisecond) + conn, err := net.Dial("tcp", "127.0.0.1:1337") + if err != nil { + t.Error(err) + return + } + // do something with conn +} +``` + +This code starts a new goroutine that opens a network listener on port +1337 and then waits for it to be active before connecting to it. Most +of the time, this will work out okay. However there's a huge problem +lurking at the core of this: This test will take a minimum of 150 +milliseconds to run no matter what. If the logic of starting a test +server is lifted into a helper function then every time you create a +test server from any downstream test function, you spend that +additional 150 milliseconds. + +Additionally, the TCP listener is probably ready near instantly, but +also if you run multiple tests in parallel then they'll all fight for +that one port and then everything will fail randomly. + +This is what I mean by "synchronization primitive". The idea here is +that by having the main test goroutine wait for the other one to be +ready, we are using the effect of time passing (and the Go runtime +scheduling/executing that other goroutine) as a way to make sure that +the server is ready for the client to connect. When you are +synchronizing the state of two goroutines (the client being ready to +connect and the server being ready for connections), you generally +want to use something that synchronizes that state, such as a channel +or even by eliminating the need to synchronize things at all. + +Consider this version of that test: + +```go +func TestListener(t *testing.T) { + ctx, cancel := context.WithCancel(context.Background()) + defer cancel() + + lis, err := net.Listen("tcp", ":0") + if err != nil { + t.Error(err) + return + } + + go func() { + defer lis.Close() + for { + select { + case <- ctx.Done(): + return + default: + } + conn, err := lis.Accept() + if err != nil { + t.Error(err) + return + } + + // do something with conn + }() + + conn, err := net.Dial(lis.Addr().Network(), lis.Addr().String()) + if err != nil { + t.Error(err) + return + } + // do something with conn +} +``` + +Not only have we gotten rid of that time.Sleep call, we also made it +support having multiple instances of the server in parallel! This code +is ultimately much more robust than the old test ever was and will +easily scale for your needs. If your tests took a total of 600 ms to +run each, cutting out that one 150 ms sleep removes 25% of the wait! + +<xeblog-conv name="Aoi" mood="wut">I see, I see. I'm still not sure +what you're getting at though. If time isn't a reliable way to +synchronize things, why do people use it?</xeblog-conv> +<xeblog-conv name="Cadey" mood="coffee">Well, the basic idea is that +time is an _effect_, not a cause. When you are trying to make the state of multiple +concurrent/parallel tasks synchronized to the same state, you can +imagine time as incidental to the actions, not an inherent cause of +them. Consider what happens when you leave wet bread out in the open +for a while: it gets moldy. Time passing didn't cause the mold to +appear on the bread, the bread being in the open and wet caused it to +be a decent substrate for mold to grow on top of. Time is incidental +to the mold developing, not causal. It is the same way with computer +programs. Waiting one second does not make the service ready. The +service being ready makes it ready. The worst part is that waiting a +second or two will usually work well enough taht you don't have to +care.</xeblog-conv> +<xeblog-conv name="Aoi" mood="grin">I see, thanks!</xeblog-conv> + +## Putting it into practice + +So let's put this into practice and make this kind of behavior more +difficult to cause. Let's add a roadblock for trying to use +`time.Sleep` in tests by using the `nosleep` linter. `nosleep` is a Go +linter that checks for the presence of `time.Sleep` in your test code +and fails your code if it finds it. That's it. That's the whole tool. + +You can run it against your Go code by installing it with `go install`: + +``` +go install within.website/x/linters/cmd/nosleep@latest +``` + +And then you can run it with the `nosleep` command: + +``` +nosleep ./... +``` + +I do recognize that sometimes you actually do need to use time as a +synchronization method because god is dead and you have no other +option. If this does genuinely happen, you can use the magic command +`//nosleep:bypass here's a very good reason`. If you don't put a +reason there, the magic comment won't work. + +Let me know how it works for you! Add it to your CI config if you dare. |
