| author | Xe Iaso <me@xeiaso.net> | 2025-04-26 10:01:15 -0400 |
|---|---|---|
| committer | GitHub <noreply@github.com> | 2025-04-26 14:01:15 +0000 |
| commit | ef52550e70a44c318dca8a406750589f67fac0eb (patch) | |
| tree | d00e403f38960ab9f86c8dadf41907099ecad391 /data | |
| parent | c669b47b570d222a9a902705adeff8fb26c989c4 (diff) | |
fix(config): remove trailing newlines in regexes (#373) v1.17.0-beta3 1.17.0-beta2
Closes #372
Fun YAML fact of the day:
What is the difference between how these two expressions are parsed?
```yaml
foo: >
  bar
```
```yaml
foo: >-
  bar
```
The difference is invisible in YAML, but when you evaluate them to JSON
it is obvious:
```json
{
  "foo": "bar\n"
}
```
```json
{
  "foo": "bar"
}
```
User-Agent strings, URL path values, and HTTP headers _do_ end in
newlines in HTTP/1.1 wire form, but that newline is usually stripped
before the server actually handles it. Also HTTP/2 is a thing and does
not terminate header values with newlines.
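To make the failure mode concrete, here is a minimal Go sketch of what the stray newline does to a match. The `matches` helper and the User-Agent value are made up for illustration and are not taken from the Anubis source. The trailing newline attaches only to the last alternative, so `Mozilla|Opera\n` means "`Mozilla`, or `Opera` followed by a newline".

```go
package main

import (
	"fmt"
	"regexp"
)

// matches reports whether pattern matches anywhere in the User-Agent value.
// Hypothetical helper for this example only.
func matches(pattern, userAgent string) bool {
	return regexp.MustCompile(pattern).MatchString(userAgent)
}

func main() {
	// Header value as the server sees it: the newline is already stripped.
	ua := "Opera/9.80 (Windows NT 6.1)"

	// What `>` produces: the last alternative now requires a literal newline.
	fmt.Println(matches("Mozilla|Opera\n", ua))

	// What `>-` produces: the pattern the author meant to write.
	fmt.Println(matches("Mozilla|Opera", ua))
}
```

The first call reports no match and the second reports a match. A value containing `Mozilla` still matches either pattern, which is part of why this kind of mistake can go unnoticed for a while.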
This change makes Anubis more aggressively detect mistaken uses of the
yaml `>` operator and nudges the user into using the yaml `>-` operator
which does not append the trailing newline.
I had honestly forgotten about this YAML behavior because it wasn't
relevant for so long. Oops! Glad I released a beta.
Whenever you get into this state, Anubis will throw a config parsing
error and then give you a message hinting at the folly of your ways.
```
config.Bot: regular expression ends with newline (try >- instead of > in yaml)
```
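A check like this can be sketched in a few lines of Go. This is only an illustration of the idea, not the code in this change: `validateRegex` and `errRegexEndsWithNewline` are hypothetical names, and only the error text is copied from the message above.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
	"strings"
)

// errRegexEndsWithNewline carries the message quoted above.
// The variable name is an assumption made for this sketch.
var errRegexEndsWithNewline = errors.New(
	"config.Bot: regular expression ends with newline (try >- instead of > in yaml)")

// validateRegex is a hypothetical validator: it rejects expressions that end
// in a newline before checking that the expression compiles at all.
func validateRegex(expr string) error {
	if strings.HasSuffix(expr, "\n") {
		return errRegexEndsWithNewline
	}
	if _, err := regexp.Compile(expr); err != nil {
		return fmt.Errorf("invalid regular expression: %w", err)
	}
	return nil
}

func main() {
	fmt.Println(validateRegex("Mozilla|Opera\n")) // value folded with `>`
	fmt.Println(validateRegex("Mozilla|Opera"))   // value folded with `>-`
}
```

Checking the suffix before compiling matters here, because a pattern with a trailing newline is still a valid regular expression and would otherwise pass silently.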
Big thanks to https://yaml-multiline.info, this helped me realize my
folly instantly.
@aiverson, this is official permission to say "told you so".
Signed-off-by: Xe Iaso <me@xeiaso.net>
Diffstat (limited to 'data')
| mode | file | lines changed |
|---|---|---|
| -rw-r--r-- | data/botPolicies.json | 2 |
| -rw-r--r-- | data/botPolicies.yaml | 2 |
| -rw-r--r-- | data/bots/ai-robots-txt.yaml | 2 |
3 files changed, 3 insertions, 3 deletions
diff --git a/data/botPolicies.json b/data/botPolicies.json
index dad04e8..160bbf0 100644
--- a/data/botPolicies.json
+++ b/data/botPolicies.json
@@ -41,7 +41,7 @@
     },
     {
       "name": "generic-browser",
-      "user_agent_regex": "Mozilla|Opera\n",
+      "user_agent_regex": "Mozilla|Opera",
       "action": "CHALLENGE"
     }
   ],
diff --git a/data/botPolicies.yaml b/data/botPolicies.yaml
index 585be15..51af499 100644
--- a/data/botPolicies.yaml
+++ b/data/botPolicies.yaml
@@ -43,7 +43,7 @@ bots:
   # Generic catchall rule
   - name: generic-browser
-    user_agent_regex: >
+    user_agent_regex: >-
       Mozilla|Opera
     action: CHALLENGE
diff --git a/data/bots/ai-robots-txt.yaml b/data/bots/ai-robots-txt.yaml
index 19cbe93..ef2790c 100644
--- a/data/bots/ai-robots-txt.yaml
+++ b/data/bots/ai-robots-txt.yaml
@@ -1,4 +1,4 @@
 - name: "ai-robots-txt"
-  user_agent_regex: >
+  user_agent_regex: >-
     AI2Bot|Ai2Bot-Dolma|Amazonbot|anthropic-ai|Applebot|Applebot-Extended|Brightbot 1.0|Bytespider|CCBot|ChatGPT-User|Claude-Web|ClaudeBot|cohere-ai|cohere-training-data-crawler|Crawlspace|Diffbot|DuckAssistBot|FacebookBot|FriendlyCrawler|Google-Extended|GoogleOther|GoogleOther-Image|GoogleOther-Video|GPTBot|iaskspider/2.0|ICC-Crawler|ImagesiftBot|img2dataset|ISSCyberRiskCrawler|Kangaroo Bot|Meta-ExternalAgent|Meta-ExternalFetcher|OAI-SearchBot|omgili|omgilibot|PanguBot|Perplexity-User|PerplexityBot|PetalBot|Scrapy|SemrushBot-OCOB|SemrushBot-SWA|Sidetrade indexer bot|Timpibot|VelenPublicWebCrawler|Webzio-Extended|YouBot
   action: DENY
\ No newline at end of file
