Diffstat (limited to 'blog/nosleep.markdown')
-rw-r--r-- blog/nosleep.markdown 195
1 files changed, 0 insertions, 195 deletions
diff --git a/blog/nosleep.markdown b/blog/nosleep.markdown
deleted file mode 100644
index 829092a..0000000
--- a/blog/nosleep.markdown
+++ /dev/null
@@ -1,195 +0,0 @@
----
-title: Time is not a synchronization primitive
-date: 2023-06-25
----
-
-Programming is complicated. I know this is the nostalgia paradox in
-action, but it really does feel like everything has gotten more
-complicated over the course of my career. One of the biggest sources
-of that complexity has nothing to do with computers: working with
-other people is always complicated.
-
-One of the axioms you end up working with is "assume best intent".
-This has sometimes been used as a dog-whistle to defend pathological
-behavior; but really there is a good idea at the core of this:
-everyone is really trying to do the best that they can given their
-limited time and energy and it's usually better to start from the
-position of "the system that allowed this failure to happen is the
-thing that must be fixed".
-
-However, we work with other people and this can result in things that
-troll you by accident. One of the biggest sources of friction is when
-people write tests that can fail for no reason. To make this even more
-fun, flaky tests break people's trust in CI systems. That lack of
-trust teaches people that it's okay for CI to fail because sometimes
-it's not your fault. This leads to hacks like the `flaky` marker in
-Python, which papers over failures by retrying the test. Or even
-worse, it trains people to merge broken code to main because sometimes
-CI just fails but everything is okay.
-
-Today I want to talk about one of the most common ways that I see
-things fall apart. This has caused tests, production-load-bearing bash
-scripts, and normal application code to be unresponsive at best and
-randomly break at worst. It's when people use time as a
-synchronization mechanism.
-
-## Time as an effect
-
-<xeblog-conv name="Aoi" mood="wut">What do you mean by that? That
-sounds mathy as all heck.</xeblog-conv>
-
-I think that the best way to explain this is to start with a flaky
-test that I wrote years ago and break it down to explain why things
-are flaky and what I mean by a "synchronization mechanism". Consider
-this Go test:
-
-```go
-func TestListener(t *testing.T) {
-	ctx, cancel := context.WithCancel(context.Background())
-	defer cancel()
-
-	go func() {
-		lis, err := net.Listen("tcp", ":1337")
-		if err != nil {
-			t.Error(err)
-			return
-		}
-		defer lis.Close()
-		for {
-			select {
-			case <-ctx.Done():
-				return
-			default:
-			}
-			conn, err := lis.Accept()
-			if err != nil {
-				t.Error(err)
-				return
-			}
-			conn.Close() // placeholder: do something with conn
-		}
-	}()
-
-	time.Sleep(150 * time.Millisecond) // hope the listener is up by now
-	conn, err := net.Dial("tcp", "127.0.0.1:1337")
-	if err != nil {
-		t.Error(err)
-		return
-	}
-	defer conn.Close() // placeholder: do something with conn
-}
-```
-
-This code starts a new goroutine that opens a network listener on port
-1337, sleeps for 150 milliseconds in the hope that the listener is up,
-and then connects to it. Most of the time, this will work out okay,
-but on a loaded machine the goroutine may not have run yet when the
-sleep ends, and the dial fails. The sleep also means this test takes a
-minimum of 150 milliseconds no matter what. If the server setup is
-lifted into a helper function, every downstream test that creates a
-test server pays that additional 150 milliseconds.
-
-Additionally, the TCP listener is probably ready near instantly, so
-most of that wait is wasted. And if you run multiple tests in parallel,
-they'll all fight for that one port and fail randomly.
-
-This is what I mean by "synchronization primitive". The idea here is
-that by having the main test goroutine wait for the other one to be
-ready, we are using the effect of time passing (and the Go runtime
-scheduling/executing that other goroutine) as a way to make sure that
-the server is ready for the client to connect. When you are
-synchronizing the state of two goroutines (the client being ready to
-connect and the server being ready for connections), you generally
-want to use something that actually synchronizes that state, such as
-a channel, or to eliminate the need to synchronize things at all.
-
-Consider this version of that test:
-
-```go
-func TestListener(t *testing.T) {
-	ctx, cancel := context.WithCancel(context.Background())
-	defer cancel()
-
-	lis, err := net.Listen("tcp", ":0") // port 0: the OS picks a free port
-	if err != nil {
-		t.Error(err)
-		return
-	}
-
-	go func() {
-		defer lis.Close()
-		for {
-			select {
-			case <-ctx.Done():
-				return
-			default:
-			}
-			conn, err := lis.Accept()
-			if err != nil {
-				t.Error(err)
-				return
-			}
-			conn.Close() // placeholder: do something with conn
-		}
-	}()
-
-	conn, err := net.Dial(lis.Addr().Network(), lis.Addr().String())
-	if err != nil {
-		t.Error(err)
-		return
-	}
-	defer conn.Close() // placeholder: do something with conn
-}
-```
-
-The listener now exists before the goroutine starts, so the dial cannot
-race it, and port `:0` lets the OS pick a free port so parallel tests
-never collide. We got rid of the `time.Sleep` call and made the test
-far more robust. If each of your tests took 600 ms in total, cutting
-that one 150 ms sleep removes 25% of the wait!
-
-<xeblog-conv name="Aoi" mood="wut">I see, I see. I'm still not sure
-what you're getting at though. If time isn't a reliable way to
-synchronize things, why do people use it?</xeblog-conv>
-<xeblog-conv name="Cadey" mood="coffee">Well, the basic idea is that
-time is an _effect_, not a cause. When you are trying to bring
-multiple concurrent/parallel tasks to the same state, you can imagine
-time as incidental to the actions, not an inherent cause of them.
-Consider what happens when you leave wet bread out in the open for a
-while: it gets moldy. Time passing didn't cause the mold to appear on
-the bread; the bread being in the open and wet made it a decent
-substrate for mold to grow on. Time is incidental to the mold
-developing, not causal. It is the same way with computer programs.
-Waiting one second does not make the service ready. The service being
-ready makes it ready. People reach for it anyway because waiting a
-second or two usually works well enough that you don't have to
-care, and that is the worst part.</xeblog-conv>
-<xeblog-conv name="Aoi" mood="grin">I see, thanks!</xeblog-conv>
-
-## Putting it into practice
-
-So let's put this into practice and make this kind of behavior more
-difficult to cause. Let's add a roadblock for trying to use
-`time.Sleep` in tests by using the `nosleep` linter. `nosleep` is a Go
-linter that checks for the presence of `time.Sleep` in your test code
-and fails the run if it finds any. That's it. That's the whole tool.
-
-You can run it against your Go code by installing it with `go install`:
-
-```
-go install within.website/x/linters/cmd/nosleep@latest
-```
-
-And then you can run it with the `nosleep` command:
-
-```
-nosleep ./...
-```
-
-I do recognize that sometimes you actually do need to use time as a
-synchronization method because god is dead and you have no other
-option. If this does genuinely happen, you can use the magic comment
-`//nosleep:bypass here's a very good reason`. If you don't put a
-reason there, the magic comment won't work.
-
-Let me know how it works for you! Add it to your CI config if you dare.
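
The post names a channel as one way to synchronize the two goroutines, but its fixed test takes the other route and removes the need to synchronize. For cases where the listener really does have to be created inside the goroutine, a minimal sketch of the channel approach might look like the following. The names `listenResult`, `startEchoServer`, and `roundTrip` are hypothetical and do not appear in the original post; the echo behavior is only there to give the connection something to do.

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"strings"
)

// listenResult is a hypothetical type carrying either the bound address
// or the error from net.Listen back to the caller.
type listenResult struct {
	addr net.Addr
	err  error
}

// startEchoServer (hypothetical helper) listens inside the goroutine,
// like the flaky test did, but reports the outcome on a channel instead
// of leaving the caller to guess how long to sleep.
func startEchoServer() <-chan listenResult {
	ready := make(chan listenResult, 1)
	go func() {
		lis, err := net.Listen("tcp", "127.0.0.1:0")
		if err != nil {
			ready <- listenResult{err: err}
			return
		}
		defer lis.Close()
		// Signal readiness only after Listen has succeeded.
		ready <- listenResult{addr: lis.Addr()}

		conn, err := lis.Accept()
		if err != nil {
			return
		}
		defer conn.Close()
		// Echo a single line back to the client, then shut down.
		line, err := bufio.NewReader(conn).ReadString('\n')
		if err != nil {
			return
		}
		fmt.Fprint(conn, line)
	}()
	return ready
}

// roundTrip blocks on the ready channel rather than sleeping, then sends
// msg to the server and returns the echoed reply without the newline.
func roundTrip(msg string) (string, error) {
	res := <-startEchoServer()
	if res.err != nil {
		return "", res.err
	}
	conn, err := net.Dial(res.addr.Network(), res.addr.String())
	if err != nil {
		return "", err
	}
	defer conn.Close()
	if _, err := fmt.Fprintln(conn, msg); err != nil {
		return "", err
	}
	reply, err := bufio.NewReader(conn).ReadString('\n')
	if err != nil {
		return "", err
	}
	return strings.TrimSuffix(reply, "\n"), nil
}
```

Calling `roundTrip("ping")` from a test or a `main` function should return `ping`. The client waits exactly as long as the listener takes to come up and no longer, and a failed `net.Listen` surfaces as an error instead of a mysterious refused connection.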