aboutsummaryrefslogtreecommitdiff
diff options
context:
space:
mode:
authorXe Iaso <me@xeiaso.net>2023-06-24 23:21:42 -0400
committerXe Iaso <me@xeiaso.net>2023-06-24 23:21:52 -0400
commit1f4fc577ffa7a2310505b0a4cd2b09ae53392028 (patch)
tree04240b6344f25d616c0cdd99a9e8504b2162b22f
parent08ede42837052c0517319884018361ef9d4718e7 (diff)
downloadxesite-1f4fc577ffa7a2310505b0a4cd2b09ae53392028.tar.xz
xesite-1f4fc577ffa7a2310505b0a4cd2b09ae53392028.zip
blog: nosleep
Signed-off-by: Xe Iaso <me@xeiaso.net>
-rw-r--r--blog/nosleep.markdown195
1 files changed, 195 insertions, 0 deletions
diff --git a/blog/nosleep.markdown b/blog/nosleep.markdown
new file mode 100644
index 0000000..1fa7799
--- /dev/null
+++ b/blog/nosleep.markdown
@@ -0,0 +1,195 @@
+---
+title: Time is not a synchronization primitive
+date: 2023-06-25
+---
+
+Programming is so complicated. I know this is an example of the
+nostalgia paradox in action, but it easily feels like everything has
+gotten so much more complicated over the course of my career. One of
+the biggest things that is really complicated is the fact that working
+with other people is always super complicated.
+
+One of the axioms you end up working with is "assume best intent".
+This has sometimes been used as a dog-whistle to defend pathological
+behavior; but really there is a good idea at the core of this:
+everyone is really trying to do the best that they can given their
+limited time and energy and it's usually better to start from the
+position of "the system that allowed this failure to happen is the
+thing that must be fixed".
+
+However, we work with other people and this can result in things that
+can troll you on accident. One of the biggest sources of friction is
+when people end up creating tests that can fail for no reason. To make
+this even more fun, this will end up breaking people's trust in CI
+systems. This lack of trust trains people that it's okay for CI to
+fail because sometimes it's not your fault. This leads to hacks like
+the flaky attribute on python where it will ignore test failures. Or
+even worse, it trains people to merge broken code to main because
+they're trained that sometimes CI just fails but everything is okay.
+
+Today I want to talk about one of the most common ways that I see
+things fall apart. This has caused tests, production-load-bearing bash
+scripts, and normal application code to be unresponsive at best and
+randomly break at worst. It's when people use time as a
+synchronization mechanism.
+
+## Time as an effect
+
+<xeblog-conv name="Aoi" mood="wut">What do you mean by that? That
+sounds mathy as all heck.</xeblog-conv>
+
+I think that the best way to explain this is to start with a flaky
+test that I wrote years ago and break it down to explain why things
+are flaky and what I mean by a "synchronization mechanism". Consider
+this Go test:
+
+```go
+func TestListener(t *testing.T) {
+ ctx, cancel := context.WithCancel(context.Background())
+ defer cancel()
+
+ go func() {
+ lis, err := net.Listen("tcp", ":1337")
+ if err != nil {
+ t.Error(err)
+ return
+ }
+ defer lis.Close()
+ for {
+ select {
+ case <- ctx.Done():
+ return
+ default:
+ }
+ conn, err := lis.Accept()
+ if err != nil {
+ t.Error(err)
+ return
+ }
+
+ // do something with conn
+ }()
+
+ time.Sleep(150*time.Millisecond)
+ conn, err := net.Dial("tcp", "127.0.0.1:1337")
+ if err != nil {
+ t.Error(err)
+ return
+ }
+ // do something with conn
+}
+```
+
+This code starts a new goroutine that opens a network listener on port
+1337 and then waits for it to be active before connecting to it. Most
+of the time, this will work out okay. However there's a huge problem
+lurking at the core of this: This test will take a minimum of 150
+milliseconds to run no matter what. If the logic of starting a test
+server is lifted into a helper function then every time you create a
+test server from any downstream test function, you spend that
+additional 150 milliseconds.
+
+Additionally, the TCP listener is probably ready near instantly, but
+also if you run multiple tests in parallel then they'll all fight for
+that one port and then everything will fail randomly.
+
+This is what I mean by "synchronization primitive". The idea here is
+that by having the main test goroutine wait for the other one to be
+ready, we are using the effect of time passing (and the Go runtime
+scheduling/executing that other goroutine) as a way to make sure that
+the server is ready for the client to connect. When you are
+synchronizing the state of two goroutines (the client being ready to
+connect and the server being ready for connections), you generally
+want to use something that synchronizes that state, such as a channel
+or even by eliminating the need to synchronize things at all.
+
+Consider this version of that test:
+
+```go
+func TestListener(t *testing.T) {
+ ctx, cancel := context.WithCancel(context.Background())
+ defer cancel()
+
+ lis, err := net.Listen("tcp", ":0")
+ if err != nil {
+ t.Error(err)
+ return
+ }
+
+ go func() {
+ defer lis.Close()
+ for {
+ select {
+ case <- ctx.Done():
+ return
+ default:
+ }
+ conn, err := lis.Accept()
+ if err != nil {
+ t.Error(err)
+ return
+ }
+
+ // do something with conn
+ }()
+
+ conn, err := net.Dial(lis.Addr().Network(), lis.Addr().String())
+ if err != nil {
+ t.Error(err)
+ return
+ }
+ // do something with conn
+}
+```
+
+Not only have we gotten rid of that time.Sleep call, we also made it
+support having multiple instances of the server in parallel! This code
+is ultimately much more robust than the old test ever was and will
+easily scale for your needs. If your tests took a total of 600 ms to
+run each, cutting out that one 150 ms sleep removes 25% of the wait!
+
+<xeblog-conv name="Aoi" mood="wut">I see, I see. I'm still not sure
+what you're getting at though. If time isn't a reliable way to
+synchronize things, why do people use it?</xeblog-conv>
+<xeblog-conv name="Cadey" mood="coffee">Well, the basic idea is that
+time is an _effect_, not a cause. When you are trying to make the state of multiple
+concurrent/parallel tasks synchronized to the same state, you can
+imagine time as incidental to the actions, not an inherent cause of
+them. Consider what happens when you leave wet bread out in the open
+for a while: it gets moldy. Time passing didn't cause the mold to
+appear on the bread, the bread being in the open and wet caused it to
+be a decent substrate for mold to grow on top of. Time is incidental
+to the mold developing, not causal. It is the same way with computer
+programs. Waiting one second does not make the service ready. The
+service being ready makes it ready. The worst part is that waiting a
+second or two will usually work well enough taht you don't have to
+care.</xeblog-conv>
+<xeblog-conv name="Aoi" mood="grin">I see, thanks!</xeblog-conv>
+
+## Putting it into practice
+
+So let's put this into practice and make this kind of behavior more
+difficult to cause. Let's add a roadblock for trying to use
+`time.Sleep` in tests by using the `nosleep` linter. `nosleep` is a Go
+linter that checks for the presence of `time.Sleep` in your test code
+and fails your code if it finds it. That's it. That's the whole tool.
+
+You can run it against your Go code by installing it with `go install`:
+
+```
+go install within.website/x/linters/cmd/nosleep@latest
+```
+
+And then you can run it with the `nosleep` command:
+
+```
+nosleep ./...
+```
+
+I do recognize that sometimes you actually do need to use time as a
+synchronization method because god is dead and you have no other
+option. If this does genuinely happen, you can use the magic command
+`//nosleep:bypass here's a very good reason`. If you don't put a
+reason there, the magic comment won't work.
+
+Let me know how it works for you! Add it to your CI config if you dare.