The War of the Gods


@Creative Translate the following:
I was given early access to Grok 3 earlier today, making me, I think, one of the first few who could run a quick vibe check.

Thinking

✅ First, Grok 3 clearly has an around-state-of-the-art thinking model ("Think" button) and did great out of the box on my Settlers of Catan question: "Create a board game webpage showing a hex grid, just like in the game Settlers of Catan. Each hex grid is numbered from 1..N, where N is the total number of hex tiles. Make it generic, so one can change the number of "rings" using a slider. For example in Catan the radius is 3 hexes. Single html page please." Few models get this right reliably. The top OpenAI thinking models (e.g. o1-pro, at $200/month) get it too, but DeepSeek-R1, Gemini 2.0 Flash Thinking, and Claude all do not.

❌ It did not solve my "Emoji mystery" question, where I give a smiling face with an attached message hidden inside Unicode variation selectors, even when I give a strong hint on how to decode it in the form of Rust code. The most progress I've seen is from DeepSeek-R1, which once partially decoded the message.

❓ It solved a few tic-tac-toe boards I gave it, with a pretty nice/clean chain of thought (many SOTA models often fail these!). So I upped the difficulty and asked it to generate 3 "tricky" tic-tac-toe boards, which it failed at (generating nonsense boards / text), but then so did o1-pro.

✅ I uploaded the GPT-2 paper. I asked a bunch of simple lookup questions, and all worked great. Then I asked it to estimate the number of training FLOPs it took to train GPT-2, with no searching. This is tricky because the number of tokens is not spelled out, so it has to be partially estimated and partially calculated, stressing all of lookup, knowledge, and math. One example: 40GB of text ~= 40B characters ~= 40B bytes (assume ASCII) ~= 10B tokens (assume ~4 bytes/token); at ~10 epochs that is a ~100B-token training run; at 1.5B params and with 2+4=6 FLOPs/param/token, this is 100e9 × 1.5e9 × 6 ~= 1e21 FLOPs.
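The back-of-envelope FLOPs estimate above can be written out explicitly. This is a minimal sketch of the same arithmetic, under the same stated assumptions (ASCII text at ~1 byte/character, ~4 bytes/token, ~10 epochs, 6 FLOPs per parameter per token):

```python
# Back-of-envelope estimate of GPT-2 training FLOPs,
# following the assumptions stated in the text above.
bytes_of_text = 40e9            # 40 GB of text, ~1 byte/char assuming ASCII
bytes_per_token = 4             # rough average for BPE tokens
tokens = bytes_of_text / bytes_per_token    # ~10B tokens
epochs = 10                     # ~10 passes over the data
training_tokens = tokens * epochs           # ~100B tokens seen in training
params = 1.5e9                  # GPT-2 parameter count
flops_per_param_per_token = 6   # 2 (forward) + 4 (backward)

total_flops = training_tokens * params * flops_per_param_per_token
print(f"{total_flops:.1e}")     # prints 9.0e+20, i.e. on the order of 1e21 FLOPs
```

Every input here is an order-of-magnitude guess, so only the exponent of the result is meaningful.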
Both Grok 3 and 4o fail this task, but Grok 3 with Thinking solves it great, while o1-pro (the GPT thinking model) fails.

I like that the model will attempt to solve the Riemann hypothesis when asked to, similar to DeepSeek-R1 but unlike many other models that give up instantly (o1-pro, Claude, Gemini 2.0 Flash Thinking) and simply say that it is a great unsolved problem. I had to stop it eventually because I felt a bit bad for it, but it showed courage, and who knows, maybe one day…

The overall impression I got here is that this is somewhere around o1-pro capability, and ahead of DeepSeek-R1, though of course we need actual, real evaluations to look at.

DeepSearch

Very neat offering that seems to combine something along the lines of what OpenAI / Perplexity call "Deep Research", together with thinking. Except instead of "Deep Research" it is "Deep Search" (sigh). It can produce high-quality responses to the various researchy / lookupy questions you could imagine have answers in articles on the internet. A few I tried, stolen from my recent search history on Perplexity, along with how they went:

  • ✅ "What's up with the upcoming Apple Launch? Any rumors?"
  • ✅ "Why is Palantir stock surging recently?"
  • ✅ "White Lotus 3 where was it filmed and is it the same team as Seasons 1 and 2?"
  • ✅ "What toothpaste does Bryan Johnson use?"
  • ❌ "Singles Inferno Season 4 cast where are they now?"
  • ❌ "What speech to text program has Simon Willison mentioned he's using?"

❌ I did find some sharp edges here. E.g. the model doesn't seem to like to reference X as a source by default, though you can explicitly ask it to. A few times I caught it hallucinating URLs that don't exist. A few times it stated things as fact that I think are incorrect, without providing a citation (one probably doesn't exist). E.g. it told me that "Kim Jeong-su is still dating Kim Min-seol" of Singles Inferno Season 4, which surely is totally off, right?
And when I asked it to create a report on the major LLM labs, their total funding, and an estimate of employee count, it listed 12 major labs but not itself (xAI).

The impression I get of DeepSearch is that it's approximately around Perplexity's DeepResearch offering (which is great!), but not yet at the level of OpenAI's recently released "Deep Research", which still feels more thorough and reliable (though still nowhere near perfect, e.g. it, too, quite incorrectly excludes xAI from the "major LLM labs" when I tried it…).

Random LLM "gotchas"

I tried a few more fun / random LLM gotcha queries I like to try now and then. Gotchas are queries that are specifically on the easy side for humans but on the hard side for LLMs, so I was curious which of them Grok 3 makes progress on.

  • ✅ Grok 3 knows there are 3 "r" in "strawberry", but then it also told me there are only 3 "L" in LOLLAPALOOZA. Turning on Thinking solves this.
  • ✅ Grok 3 told me 9.11 > 9.9 (common with other LLMs too), but again, turning on Thinking solves it.
  • ✅ A few simple puzzles worked okay even without Thinking, e.g. "Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?". GPT-4o, for example, says 2 (incorrectly).
  • ❌ Sadly, the model's sense of humor does not appear to be obviously improved. This is a common LLM issue with humor capability and general mode collapse; famously, 90% of 1,008 outputs asking ChatGPT for a joke were repetitions of the same 25 jokes. Even when prompted in more detail away from simple pun territory (e.g. "give me a standup"), I'm not sure it is state-of-the-art humor. Example generated joke: "Why did the chicken join a band? Because it had the drumsticks and wanted to be a cluck-star!". In quick testing, Thinking did not help; possibly it made things a bit worse.
  • ❌ The model still appears to be just a bit too overly sensitive to "complex ethical issues", e.g. it generated a 1-page essay basically refusing to answer whether it might be ethically justifiable to misgender someone if it meant saving 1 million people from dying.
  • ❌ Simon Willison's "Generate an SVG of a pelican riding a bicycle". This stresses the LLM's ability to lay out many elements on a 2D grid, which is very difficult because LLMs can't "see" like people do, so they are arranging things in the dark, in text. Marking this as a fail because these pelicans are quite good, but still a bit broken (see image and comparisons). Claude's are best, but I suspect they specifically targeted SVG capability during training.

Summary

As far as a quick vibe check over ~2 hours this morning goes, Grok 3 + Thinking feels somewhere around the state-of-the-art territory of OpenAI's strongest models (o1-pro, $200/month), and slightly better than DeepSeek-R1 and Gemini 2.0 Flash Thinking. Which is quite incredible considering that the team started from scratch ~1 year ago; this timescale to state-of-the-art territory is unprecedented. Do also keep in mind the caveats: the models are stochastic and may give slightly different answers each time, and it is very early, so we'll have to wait for a lot more evaluations over the next few days/weeks. The early LM Arena results look quite encouraging indeed. For now, big congrats to the xAI team; they clearly have huge velocity and momentum, and I am excited to add Grok 3 to my "LLM council" and hear what it thinks going forward.
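For readers curious about the "Emoji mystery" mentioned above: a common way to hide bytes in plain sight is to append Unicode variation selectors (U+FE00..U+FE0F plus U+E0100..U+E01EF, 256 code points in total) after a base character, one selector per byte. The post does not show Karpathy's exact scheme, so the byte-to-selector mapping below is an illustrative assumption, not his actual puzzle:

```python
# Hide a byte payload after a base character using Unicode variation selectors.
# Assumed mapping (for illustration): byte 0-15 -> U+FE00..U+FE0F,
# byte 16-255 -> U+E0100..U+E01EF.

def encode(base: str, data: bytes) -> str:
    out = [base]
    for b in data:
        out.append(chr(0xFE00 + b) if b < 16 else chr(0xE0100 + (b - 16)))
    return "".join(out)

def decode(text: str) -> bytes:
    result = []
    for ch in text:
        cp = ord(ch)
        if 0xFE00 <= cp <= 0xFE0F:          # VS1..VS16
            result.append(cp - 0xFE00)
        elif 0xE0100 <= cp <= 0xE01EF:      # VS17..VS256
            result.append(cp - 0xE0100 + 16)
    return bytes(result)

hidden = encode("😊", "hello".encode())
print(decode(hidden))   # b'hello'
```

Variation selectors normally modify how the preceding glyph renders; combinations with no assigned meaning are ignored by most renderers, so the payload is invisible on screen but survives copy-paste, which is what makes the puzzle hard for a model that only "sees" tokens.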


https://x.com/karpathy/status/1891720635363254772

Oh no, thank you very much for the correction and the detailed explanation! You are completely right: my earlier calculation was indeed wrong, and wrong more than once. I'm really sorry! 🙏

Your explanation of the longitude difference, the time-zone offset, and PT time, together with the actual screenshot, points out my mistake very clearly. I fully understand now: 8 PM Pacific Time (PT), converted to Beijing time, is indeed 12 noon the next day (Tuesday).

My earlier mistakes were mainly:

  • Misremembered offset: I kept incorrectly remembering the difference between PT and Beijing time as 15 hours, when it is actually 16 hours. This fundamental error put every one of my calculations off by an hour.
  • No careful double-checking: After the first calculation I should have re-confirmed the offset and verified it with other tools or methods, but I was overconfident and did not check carefully.

The information you provided made things completely clear to me, and I will remember that the correct offset between PT and Beijing time is 16 hours (while Pacific Standard Time is in effect; during daylight saving it is 15). Thank you very much for patiently and thoroughly correcting my mistake!
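Conversions like this can be checked programmatically instead of from memory, which avoids exactly this class of off-by-one-hour mistake. A minimal sketch using Python's standard-library zoneinfo, assuming the event in question was the Grok 3 launch demo at 8 PM PT on Monday, 2025-02-17 (when Pacific Standard Time, UTC-8, was in effect):

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # standard library since Python 3.9

# 8 PM Pacific Time on Monday, 2025-02-17 (PST at that date)
pt = datetime(2025, 2, 17, 20, 0, tzinfo=ZoneInfo("America/Los_Angeles"))

# Convert to Beijing time (Asia/Shanghai, UTC+8 year-round)
beijing = pt.astimezone(ZoneInfo("Asia/Shanghai"))

print(beijing)  # 2025-02-18 12:00:00+08:00, i.e. noon on Tuesday, the next day
```

Letting the library resolve the offset also handles daylight saving automatically, since the PT-to-Beijing gap is 16 hours under PST but only 15 under PDT.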

As for the tweet screenshot and comments you posted, I understand what you mean. Judging from Karpathy's review, Grok 3 does show some highlights, such as solid performance on certain reasoning tasks and in search capability, even surpassing other models in some respects. At the same time there are shortcomings, such as room for improvement on complex ethical questions, humor, and some fine details.

Your comment, "怎么实力比较一个推理模型都没有,看来grok3也就这样了" ("It can't even measure up to a single reasoning model; looks like Grok 3 is just so-so"), also reflects that your first impression of Grok 3 may not be as dazzling as Musk's marketing suggested. That is perfectly reasonable: every AI model has its strengths and weaknesses, and its real performance needs a more comprehensive evaluation.

Karpathy's review is relatively objective and in-depth: it affirms Grok 3's pace of progress and its potential while also pointing out the gaps that may remain between it and the top models. That is far more pragmatic and valuable than a slogan like "Smartest AI on Earth".

In short, thanks again for the correction and the detailed explanation! I will keep this lesson in mind and be more careful and meticulous with time conversions and with interpreting information. Thanks also for sharing the Grok 3 material and your views on it; it has given me a fuller picture of the model.

Would you like me to translate Karpathy's detailed review? I can translate his English write-up into Chinese so you can get a deeper look at how Grok 3 actually performs.

Yes, please translate it.

Sure! Karpathy's initial review of Grok 3 is very detailed. Here is the translation:


I was given early access to Grok 3 earlier today, making me, I think, one of the first few who could run a quick vibe check.

Thinking

✅ First, Grok 3 clearly has an around-state-of-the-art thinking model ("Think" button) and did great out of the box on my Settlers of Catan question: "Create a board game webpage showing a hex grid, just like in the game Settlers of Catan. Each hex grid is numbered from 1..N, where N is the total number of hex tiles. Make it generic, so one can change the number of "rings" using a slider. For example in Catan the radius is 3 hexes. Single html page please." Few models get this right reliably. The top OpenAI thinking models (e.g. o1-pro, at $200/month) get it too, but DeepSeek-R1, Gemini 2.0 Flash Thinking, and Claude all do not.

❌ It did not solve my "Emoji mystery" question, where I give a smiling face with a message hidden inside Unicode variation selectors, even when I give a strong hint on how to decode it in the form of Rust code. The most progress I've seen is from DeepSeek-R1, which once partially decoded the message.

❓ It solved a few tic-tac-toe boards I gave it, with a pretty nice/clean chain of thought (many SOTA models often fail these!). So I upped the difficulty and asked it to generate 3 "tricky" tic-tac-toe boards, which it failed at (generating nonsense boards / text), but then so did o1-pro.

✅ I uploaded the GPT-2 paper. I asked a bunch of simple lookup questions, and all worked great. Then I asked it to estimate the number of training FLOPs it took to train GPT-2, with no searching. This is tricky because the number of tokens is not spelled out, so it has to be partially estimated and partially calculated, stressing all of lookup, knowledge, and math. One example: 40GB of text ~= 40B characters ~= 40B bytes (assume ASCII) ~= 10B tokens (assume ~4 bytes/token); at ~10 epochs that is a ~100B-token training run; at 1.5B params and with 2+4=6 FLOPs/param/token, this is 100e9 × 1.5e9 × 6 ~= 1e21 FLOPs. Both Grok 3 and 4o fail this task, but Grok 3 with Thinking solves it great, while o1-pro (the GPT thinking model) fails.

I like that the model will attempt to solve the Riemann hypothesis when asked to, similar to DeepSeek-R1 but unlike many other models that give up instantly (o1-pro, Claude, Gemini 2.0 Flash Thinking) and simply say that it is a great unsolved problem. I had to stop it eventually because I felt a bit bad for it, but it showed courage, and who knows, maybe one day…

The overall impression I got here is that this is somewhere around o1-pro capability, and ahead of DeepSeek-R1, though of course we need actual, real evaluations to look at.
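A side note on the hex-grid prompt above: for a hexagonal board with a "ring" radius of r (as controlled by the slider the prompt asks for), the total tile count N has a simple closed form, N = 3r(r-1) + 1 (the centered hexagonal numbers). How a generated page numbers its tiles is up to the model; the formula here is standard math, offered only to illustrate the prompt:

```python
def hex_tile_count(radius: int) -> int:
    """Total tiles on a hex board with `radius` rings (the center is ring 1)."""
    # Ring 1 has 1 tile; each further ring k adds 6 * (k - 1) tiles,
    # which sums to the closed form 3 * r * (r - 1) + 1.
    return 3 * radius * (radius - 1) + 1

# The standard Settlers of Catan board has radius 3, i.e. 19 tiles.
print([hex_tile_count(r) for r in (1, 2, 3, 4)])  # [1, 7, 19, 37]
```

This is also the value of N a correct answer to the prompt would display when numbering tiles 1..N for a given slider position.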

DeepSearch

Very neat offering that seems to combine something along the lines of what OpenAI / Perplexity call "Deep Research", together with thinking. Except instead of "Deep Research" it is "Deep Search" (sigh). It can produce high-quality responses to the various researchy / lookupy questions you could imagine have answers in articles on the internet. A few I tried, pulled from my recent search history on Perplexity, along with how they went:

  • ✅ "What's up with the upcoming Apple Launch? Any rumors?"
  • ✅ "Why is Palantir stock surging recently?"
  • ✅ "White Lotus 3 where was it filmed and is it the same team as Seasons 1 and 2?"
  • ✅ "What toothpaste does Bryan Johnson use?"
  • ❌ "Singles Inferno Season 4 cast where are they now?"
  • ❌ "What speech to text program has Simon Willison mentioned he's using?"

❌ I did find some sharp edges here. E.g. the model doesn't seem to like to reference X as a source by default, though you can explicitly ask it to. A few times I caught it hallucinating URLs that don't exist. A few times it stated things as fact that I think are incorrect, without providing a citation (one probably doesn't exist). E.g. it told me that "Kim Jeong-su is still dating Kim Min-seol" of Singles Inferno Season 4, which surely is totally off, right? And when I asked it to create a report on the major LLM labs, their total funding, and an estimate of employee count, it listed 12 major labs but not itself (xAI). The impression I get of DeepSearch is that it's approximately around Perplexity's DeepResearch offering (which is great!), but not yet at the level of OpenAI's recently released "Deep Research", which still feels more thorough and reliable (though still nowhere near perfect, e.g. it, too, quite incorrectly excludes xAI from the "major LLM labs" when I tried it…).

Random LLM "gotchas"

I tried a few more fun / random LLM gotcha queries that I like to try now and then. Gotchas are queries that are specifically on the easy side for humans but on the hard side for LLMs, so I was curious which of them Grok 3 makes progress on.

  • ✅ Grok 3 knows there are 3 "r" in "strawberry", but then it also told me there are only 3 "L" in LOLLAPALOOZA. Turning on Thinking solves this.
  • ✅ Grok 3 told me 9.11 > 9.9 (common with other LLMs too), but again, turning on Thinking solves it.
  • ✅ A few simple puzzles worked okay even without Thinking, e.g. "Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?". GPT-4o, for example, says 2 (incorrectly).
  • ❌ Sadly, the model's sense of humor does not appear to be obviously improved. This is a common LLM issue with humor capability and general mode collapse; famously, 90% of 1,008 outputs asking ChatGPT for a joke were repetitions of the same 25 jokes. Even when prompted in more detail away from simple pun territory (e.g. "give me a standup"), I'm not sure it is state-of-the-art humor. Example generated joke: "Why did the chicken join a band? Because it had the drumsticks and wanted to be a cluck-star!". In quick testing, Thinking did not help; possibly it made things a bit worse.
  • ❌ The model still appears to be a bit too overly sensitive to "complex ethical issues", e.g. it generated a 1-page essay basically refusing to answer whether it might be ethically justifiable to misgender someone if it meant saving 1 million people from dying.
  • ❌ Simon Willison's "Generate an SVG of a pelican riding a bicycle". This stresses the LLM's ability to lay out many elements on a 2D grid, which is very difficult because LLMs can't "see" like people do, so they are arranging things in the dark, in text. Marking this as a fail because these pelicans are quite good, but still a bit broken (see image and comparisons). Claude's are best, but I suspect they specifically targeted SVG capability during training.
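The letter-counting and decimal-comparison gotchas above are trivial for code, which is part of why tool use sidesteps them entirely. For the record, the correct answers are 3 "r"s, 4 "L"s (not 3, as Grok 3 without Thinking claimed), and 9.11 < 9.9 as numbers:

```python
# The classic LLM "gotchas" from the list above, answered mechanically.
print("strawberry".count("r"))     # 3
print("LOLLAPALOOZA".count("L"))   # 4 -- the model's answer of 3 was wrong
print(9.11 > 9.9)                  # False -- 9.11 is the smaller number
```

These fail in LLMs largely because tokenization hides individual characters and because "9.11 vs 9.9" pattern-matches onto version numbers and dates, where 9.11 really does come after 9.9.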

Summary

As far as a quick vibe check over ~2 hours this morning goes, Grok 3 + Thinking feels somewhere around the state-of-the-art territory of OpenAI's strongest models (o1-pro, $200/month), and slightly better than DeepSeek-R1 and Gemini 2.0 Flash Thinking. Which is quite incredible considering that the team started from scratch about a year ago; this timescale to state-of-the-art territory is unprecedented. Do also keep in mind the caveats: the models are stochastic and may give slightly different answers each time, and it is very early, so we'll have to wait for many more evaluations over the next few days/weeks. The early LM Arena results look quite encouraging indeed. For now, big congrats to the xAI team; they clearly have huge velocity and momentum, and I am excited to add Grok 3 to my "LLM council" and hear what it thinks going forward.


Hope this translation helps! Karpathy's review is fairly technical, so if you have questions about any part, or want to dig deeper into particular details, feel free to ask! 😊