author Xe Iaso <me@xeiaso.net> 2024-09-13 08:05:15 -0400
committer Xe Iaso <me@xeiaso.net> 2024-09-13 08:07:21 -0400
commit c0e86a6e9f594191be7b8912dd4e1f91e796bd75
tree f7cf46cdba3d85583be4599d465fe92959f1501f
parent 3c14b9fc2c2539290bcb9e8ab0cd097f5c3d5e4a
blog: fix the strawberry problem
Signed-off-by: Xe Iaso <me@xeiaso.net>
-rw-r--r-- lume/src/blog/2024/strawberry.mdx | 153
1 file changed, 153 insertions, 0 deletions
diff --git a/lume/src/blog/2024/strawberry.mdx b/lume/src/blog/2024/strawberry.mdx
new file mode 100644
index 0000000..05b672c
--- /dev/null
+++ b/lume/src/blog/2024/strawberry.mdx
@@ -0,0 +1,153 @@
+---
+title: "I fixed the strawberry problem because OpenAI couldn't"
+date: 2024-09-13
+desc: "Remember kids: real winners cheat"
+hero:
+ ai: "Photo by Xe Iaso, EOS R6 mark ii, RF 50mm f/1.8 STM @ f/1.8"
+ file: cerise
+ prompt: "A photograph of a bundle of cherries with a very shallow depth of field with leaves and a blue sky"
+---
+
+Recently OpenAI announced their model codenamed "strawberry" as [OpenAI o1](https://openai.com/index/learning-to-reason-with-llms/). One of the main things that has been hyped up for months about this model is its ability to solve the "strawberry problem". This is one of those rare problems that's trivial for a human to get right, but almost impossible for large language models to solve in their current forms. Surely, with this being the _main focus_ of their model, OpenAI would add a "fast path" or something to the training data to make _that exact thing_ work as expected, right?
+
+<center>
+ <blockquote
+ class="bluesky-embed"
+ data-bluesky-uri="at://did:plc:qc6xzgctorfsm35w6i3vdebx/app.bsky.feed.post/3l3ycebgjoz2v"
+ data-bluesky-cid="bafyreibesw33z6owbqzx6zxj2gm5kwggkemmfl3cwxjcou7yomjifv64ny"
+ >
+ <p lang="en">
+ I am excited to reveal the incredible power of OpenAI&#x27;s new
+ &quot;Strawberry&quot; model (known as &quot;o1&quot;). This technology is
+ the future
+ <br />
+ <br />
+ <a href="https://bsky.app/profile/did:plc:qc6xzgctorfsm35w6i3vdebx/post/3l3ycebgjoz2v?ref_src=embed">
+ [image or embed]
+ </a>
+ </p>
+ &mdash; Ed Zitron (<a href="https://bsky.app/profile/did:plc:qc6xzgctorfsm35w6i3vdebx?ref_src=embed">
+ @zitron.bsky.social
+ </a>) <a href="https://bsky.app/profile/did:plc:qc6xzgctorfsm35w6i3vdebx/post/3l3ycebgjoz2v?ref_src=embed">September 12, 2024 at 4:37 PM</a>
+ </blockquote>
+ <script
+ async
+ src="https://embed.bsky.app/static/embed.js"
+ charset="utf-8"
+ ></script>
+</center>
+
+No. It did not. Of course it didn't. Why would they do that?
+
+In my quest to build a technological solution that allows me to retire, I have solved the strawberry problem. I am able to do this on unmodified commodity hardware with models as small as 8 billion parameters. I call the resulting model `strawberry-mimi`. I am unable to upload it to HuggingFace because my Git small filesystem is malfunctioning, but I have created a private inference API so that people can validate this for themselves. Should you want access, please [contact me](/contact/); it can be arranged.
+
+## Mimi: a next-generation life assistant
+
+<Picture
+ path="blog/2024/strawberry/mimi-strawberry"
+  desc="a brown-haired anime woman with a pixie cut, cat ears, and a mouth full of one strawberry, sitting in front of the Space Needle in Seattle -- image generated by Flux 1 [dev] by Black Forest Labs and then somehow horrifically mangled in the upload process (I think my MacBook went to sleep?) but in a way that I think looks cool"
+/>
+
+Mimi is Techaro's [initial implementation of a life assistant](/blog/2024/the-layoff/). Mimi is designed to be infinitely adaptable to doing anything, as long as you can ram it into the woeful limitations of the chat interaction model. The real power comes from Mimi's ability to use tools. Tools are function signatures that are signals to the runtime to do something and return the result into the context window. Mimi is a Discord bot, and if you are a [subscriber on Patreon](https://patreon.com/cadey), you have access to Mimi in [#mimi](https://discord.com/channels/1191183827591241828/1266740925137289287).
+
+Right now Mimi has two tools implemented:
+
+- `code_interpreter`: run some Python code and return the stdout/stderr into the context window
+- `draw_image`: fabricate a prompt for Flux dev and then send the image to the Discord channel when it's done
+
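+For the curious, here's roughly what a tool signature looks like on the wire, in the common JSON function-calling style. This is an illustration of the shape only; Mimi's actual definitions live in the bot's Go code and may look different:
+
+```python
+# Hypothetical tool definition in the common JSON function-calling
+# style. The names and fields here are illustrative; Mimi's real
+# schema may differ.
+code_interpreter_tool = {
+    "type": "function",
+    "function": {
+        "name": "code_interpreter",
+        "description": "Run Python code and return its stdout/stderr.",
+        "parameters": {
+            "type": "object",
+            "properties": {
+                "code": {
+                    "type": "string",
+                    "description": "Python source to execute",
+                },
+            },
+            "required": ["code"],
+        },
+    },
+}
+```
+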
+But out of the gate, Mimi fails at the strawberry problem:
+
+<Conv name="Aoi" mood="wut">
+ How many r's are in the word strawberry?
+</Conv>
+<Conv name="Mimi" mood="happy" aiModel="Hermes 3 70B">
+ There are two r's in the word strawberry!
+</Conv>
+
+### Why is this a problem?
+
+Now, for those of you who don't have a solid understanding of how all this large langle mangle bullshit works, I'm pretty sure you're asking something like this:
+
+<Conv name="Aoi" mood="wut">
+ If large language models are so "smart" and stuff, why can't they count the
+ letters in a word?
+</Conv>
+
+As humans, we understand words as a sequence of letters that are then turned into syllables and then understood as words. Let's take the word "strawberry" as an example. Here's all the letters of that word separated out:
+
+> s t r a w b e r r y
+
+From here, it's easy for us to see that there are three r's in the word strawberry. Large language models don't see "strawberry" like this. They see it like this:
+
+> str aw berry
+
+Actually, it's a bit worse: models don't see the human-readable text. I just put it like that so that you could visualize it better. Models only see token ID numbers:
+
+> [ 496, 675, 15717 ]
+
+Imagine if every word you saw were broken up like this, but without you actually knowing what any of those parts meant. Combine that with the inability to do basic math, and you're pretty much screwed! Of course models can't figure out how many r's are in the word strawberry. It's frankly incredible that there are use cases where they shine at all.
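+
+You can see this splitting for yourself with OpenAI's `tiktoken` library. The exact chunks and ID numbers depend on which tokenizer a given model uses, so treat this as an illustration rather than exactly what Mimi's model sees:
+
+```python
+# pip install tiktoken
+import tiktoken
+
+# GPT-4's tokenizer; other models use different vocabularies,
+# so the splits and IDs will differ.
+enc = tiktoken.get_encoding("cl100k_base")
+
+ids = enc.encode("strawberry")
+print(ids)  # a short list of token IDs, not letters
+print([enc.decode_single_token_bytes(i) for i in ids])  # the chunks
+```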
+
+## Solving strawberry
+
+One of the things that large language models are really good at is translating vague descriptions of problems into programming languages. For some reason, they are really, really good at Python (I suspect that's because Python is over-represented in their training data). This is when I had a really dumb idea:
+
+<Conv name="Cadey" mood="coffee">
+ What if I just asked the model to translate the problem into Python?
+</Conv>
+
+I added this to Mimi's system prompt:
+
+> If you are asked to count the number of letters in a word, do it by writing a Python program and running it with code_interpreter.
+
+To my amazement and horror, it worked:
+
+<Conv name="Aoi" mood="wut">
+ How many r's are in the word strawberry?
+</Conv>
+
+<Conv name="Mimi" mood="think" aiModel="Hermes 3 70b (internal)">
+Call `code_interpreter`:
+
+```python
+word = "strawberry"
+num_r = word.count('r')
+print(num_r)
+```
+
+</Conv>
+
+<Conv name="Mimi" mood="happy" aiModel="Hermes 3 70b (external)">
+ there are 3 r's in strawberry! did you know that? it's pretty cool isn't it??
+</Conv>
+
+This technique is called "codeflection".
+
+I've tried this with other words, and Mimi happily generates the Python code for it every time. It has a success rate of 100%, beating giant models such as GPT-4, Gemini 1.5 Pro, and Llama 3.1 405B.
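+
+If you want to see the trick in miniature: the runtime takes the Python the model emits, runs it, and pastes the output back into the context window. Here's a bare-bones stand-in for that loop (Mimi actually does the execution inside a WASM sandbox, which I'll get to in a bit; this toy version has no isolation whatsoever):
+
+```python
+import contextlib
+import io
+
+def run_code_interpreter(source: str) -> str:
+    """Execute model-generated Python and capture its stdout so it
+    can be fed back into the context window. No sandboxing here, so
+    don't run untrusted code with this."""
+    buf = io.StringIO()
+    with contextlib.redirect_stdout(buf):
+        exec(source, {})
+    return buf.getvalue()
+
+# The code Mimi generated for the strawberry question:
+generated = '''
+word = "strawberry"
+num_r = word.count('r')
+print(num_r)
+'''
+
+print(run_code_interpreter(generated).strip())  # 3
+```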
+
+<div className="max-w-lg mx-auto">
+
+| Model | Score |
+| :------------------------------ | :---- |
+| gpt4-32k | 66% |
+| gpt4o | 66% |
+| gpt4o-mini | 66% |
+| Gemini 1.5 Flash | 66% |
+| Gemini 1.5 Pro | 66% |
+| Gemini 1.5 Pro (August variant) | 67% |
+| Llama 3.1 405B | 66% |
+| Reflection 70B | 65% |
+| `strawberry-mimi` (8B) | 100% |
+
+</div>
+
+The data proves that `strawberry-mimi` is a state-of-the-art model, and thus worthy of several million dollars of funding so that this codeflection technique can be applied more generally across problems.
+
+### WASM
+
+Mimi runs everything in [Wazero](https://wazero.io) using some [particularly inspired Go code](https://github.com/Xe/x/blob/8f5901e8db2de662915994df0a98c6cd72ee4774/llm/codeinterpreter/python/python.go). I run all of Mimi's generated Python code in WebAssembly just to be on the safe side. Having things in WebAssembly means I can limit RAM, CPU, and internet access so that the bot doesn't take out my homelab's Kubernetes cluster doing something stupid like calculating 100 Fibonacci numbers. It also means that I have total control over what the bot sees in its "filesystem" from userspace, without needing any magic kernel flags. This makes Talos Linux happy and decreases the likelihood of Mimi causing an XK-class "end-of-the-world" scenario.
+
+## Next steps
+
+I'm frankly tired of this problem being a thing. I'm working on a dataset with an entry for every letter of every word in `/usr/share/dict/words` so that it can be added to the training sets of language models. When I have this implemented as a proof of concept on top of Qwen 2 0.5B or SmolLM, I plan to write about it on the blog.
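+
+The generator for that dataset is about as unglamorous as you'd expect. Here's a sketch of the shape I have in mind; the prompt/completion JSONL format here is a placeholder, and the real dataset may end up formatted differently:
+
+```python
+# Sketch of the letter-counting dataset generator. The JSONL
+# prompt/completion format is a placeholder, not final.
+import json
+import string
+
+with open("/usr/share/dict/words") as f:
+    words = [w.strip() for w in f if w.strip().isalpha()]
+
+with open("lettercount.jsonl", "w") as out:
+    for word in words:
+        for letter in string.ascii_lowercase:
+            n = word.lower().count(letter)
+            out.write(json.dumps({
+                "prompt": f"How many {letter}'s are in the word {word}?",
+                "completion": f"There are {n} {letter}'s in the word {word}.",
+            }) + "\n")
+```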
+
+If you want to invest in my revolutionary, paradigm-shifting, and FIRE-enabling projects such as Mimi, `strawberry-mimi`, or the LetterCountEval benchmark, please [donate on Patreon](https://patreon.com/cadey) or send your term sheets to [investment@techaro.lol](mailto:investment@techaro.lol). We may not be sure of the utility of money after artificial general intelligence, but right now, holy cow, GPU time is expensive. Your contributions will enable future creations.