---
title: "Amazon's AI crawler is making my git server unstable"
desc: "Please, just stop."
date: 2025-01-17
---

EDIT(2025-03-26 14:27 UTC):

Anubis has since become [its own full-fledged project with a documentation site](https://anubis.techaro.lol). If this helps you, please donate to my "not having to do my day job fund" [on Patreon](https://patreon.com/cadey).

---

EDIT(2025-01-18 23:50 UTC):

I wrote a little proxy that does a proof-of-work check before allowing requests to my Gitea server. It's called Anubis and I'll be writing a blog post about it soon.

For now, check it out at [https://git.xeserv.us/](https://git.xeserv.us/). It's a little rough around the edges, but it works well enough.
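
If you're curious what the check looks like, here's a rough sketch of the general idea in Go. This is not the actual Anubis source and the names are made up: the proxy hands the client a random challenge, the client grinds out a nonce until the SHA-256 hash of the challenge plus nonce has enough leading zero bits, and only then does the request get passed through to Gitea.

```go
// Minimal sketch of a proof-of-work gate, assuming a SHA-256 based scheme.
// Illustrative only; this is not the actual Anubis source.
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"math/bits"
)

// difficulty is how many leading zero bits the client has to find.
const difficulty = 16

// newChallenge generates a random challenge the proxy hands to the client.
func newChallenge() string {
	buf := make([]byte, 16)
	if _, err := rand.Read(buf); err != nil {
		panic(err)
	}
	return hex.EncodeToString(buf)
}

// leadingZeroBits counts leading zero bits in a SHA-256 digest.
func leadingZeroBits(sum [32]byte) int {
	total := 0
	for _, b := range sum {
		if b == 0 {
			total += 8
			continue
		}
		total += bits.LeadingZeros8(b)
		break
	}
	return total
}

// verify is what the proxy would run before forwarding a request upstream.
func verify(challenge string, nonce uint64) bool {
	sum := sha256.Sum256([]byte(fmt.Sprintf("%s:%d", challenge, nonce)))
	return leadingZeroBits(sum) >= difficulty
}

// solve is what the client-side JavaScript does, in spirit: brute-force a nonce.
func solve(challenge string) uint64 {
	for nonce := uint64(0); ; nonce++ {
		if verify(challenge, nonce) {
			return nonce
		}
	}
}

func main() {
	challenge := newChallenge()
	nonce := solve(challenge)
	fmt.Printf("challenge %s solved with nonce %d\n", challenge, nonce)
}
```

The point of the scheme is asymmetry: solving the puzzle once is no big deal for a human loading a page, but it adds up fast for a bot hammering thousands of URLs.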

---

EDIT(2025-01-18 19:00 UTC):

I give up. I moved the Gitea server back behind my VPN. I'm working on a proof-of-work reverse proxy to protect my server from bots in the future. I'll have it back up soon.

---

EDIT(2025-01-17 17:50 UTC):

I added this snippet to the ingress config:

```yaml
nginx.ingress.kubernetes.io/configuration-snippet: |
  if ($http_user_agent ~* "(Amazon)") {
    return 418;
  }
```

The bots are still hammering away from a different IP every time. About 10% of the requests do not have the AmazonBot user agent. I'm at a loss for what to do next.

I hate the future.

---

Hi all. This is a different kind of post. This is not informative. This is a cry for help.

To whoever runs AmazonBot, please add `git.xeserv.us` to your list of blocked domains. If you know anyone at Amazon, please forward this to them and ask them to forward it to the AmazonBot team.

Should you want to crawl my git server for some reason, please reach out to me so we can arrange payment for hardware upgrades commensurate with your egregious resource usage.

I don't want to have to close off my Gitea server to the public, but I will if I have to. It's futile to block AI crawler bots because they lie, change their user agent, use residential IP addresses as proxies, and more. I just want the requests to stop.

I have already configured the `robots.txt` file to block all bots:

```
User-agent: *
Disallow: /
```

What else do I need to do?