Introducing GPT-5

Last edited by @suen 2025-08-07T19:37:05Z

1 Like


1 Like

https://x.com/sama/status/1953551377873117369?s=46

重新OpenAI

1 Like

Introducing GPT-5

Our smartest, fastest, most useful model yet, with built-in thinking that puts expert-level intelligence in everyone’s hands.

Aug 7, 2025

Our smartest, fastest, most useful model yet, with built-in thinking that puts expert-level intelligence in everyone’s hands.

We are introducing GPT‑5, our best AI system yet. GPT‑5 is a significant leap in intelligence over all our previous models, featuring state-of-the-art performance across coding, math, writing, health, visual perception, and more. It is a unified system that knows when to respond quickly and when to think longer to provide expert-level responses. GPT‑5 is available to all users, with Plus subscribers getting more usage, and Pro subscribers getting access to GPT‑5 pro, a version with extended reasoning for even more comprehensive and accurate answers.

GPT‑5 is a unified system with a smart, efficient model that answers most questions, a deeper reasoning model (GPT‑5 thinking) for harder problems, and a real‑time router that quickly decides which to use based on conversation type, complexity, tool needs, and your explicit intent (for example, if you say “think hard about this” in the prompt). The router is continuously trained on real signals, including when users switch models, preference rates for responses, and measured correctness, improving over time. Once usage limits are reached, a mini version of each model handles remaining queries. In the near future, we plan to integrate these capabilities into a single model.

A smarter, more widely useful model

GPT‑5 not only outperforms previous models on benchmarks and answers questions more quickly, but—most importantly—is more useful for real-world queries. We’ve made significant advances in reducing hallucinations, improving instruction following, and minimizing sycophancy, while leveling up GPT‑5’s performance in three of ChatGPT’s most common uses: writing, coding, and health.

GPT‑5 is our strongest coding model to date. It shows particular improvements in complex front‑end generation and debugging larger repositories. It can often create beautiful and responsive websites, apps, and games with an eye for aesthetic sensibility in just one prompt, intuitively and tastefully turning ideas into reality. Early testers also noted its design choices, with a much better understanding of things like spacing, typography, and white space. See herefor full details on what GPT‑5 unlocks for developers.

Here are some examples of what GPT‑5 has created with just one prompt:

Prompt: Create a single-page app in a single HTML file with the following requirements:

  • Name: Jumping Ball Runner
  • Goal: Jump over obstacles to survive as long as possible.
  • Features: Increasing speed, high score tracking, retry button, and funny sounds for actions and events.
  • The UI should be colorful, with parallax scrolling backgrounds.
  • The characters should look cartoonish and be fun to watch.
  • The game should be enjoyable for everyone.

GPT‑5 is our most capable writing collaborator yet, able to help you steer and translate rough ideas into compelling, resonant writing with literary depth and rhythm. It more reliably handles writing that involves structural ambiguity, such as sustaining unrhymed iambic pentameter or free verse that flows naturally, combining respect for form with expressive clarity. These improved writing capabilities mean that ChatGPT is better at helping you with everyday tasks like drafting and editing reports, emails, memos, and more. The writing styles of GPT‑5 and GPT‑4o can be compared in the table below.

GPT‑5 is our best model yet for health-related questions, empowering users to be informed about and advocate for their health. The model scores significantly higher than any previous model on HealthBench, an evaluation we published earlier this year based on realistic scenarios and physician-defined criteria. Compared to previous models, it acts more like an active thought partner, proactively flagging potential concerns and asking questions to give more helpful answers. The model also now provides more precise and reliable responses, adapting to the user’s context, knowledge level, and geography, enabling it to provide safer and more helpful responses in a wide range of scenarios. Importantly, ChatGPT does not replace a medical professional—think of it as a partner to help you understand results, ask the right questions in the time you have with providers, and weigh options as you make decisions.

You can see some of the ways GPT‑5 is better than our previous models across domains—richer, more detailed, and useful—in these examples:

GPT‑5’s response lands the larger emotional arc with a stronger ending, clear imagery, and striking metaphors (“black flags of a country that no longer exists,” “Kyoto’s bell rolls evening down the hill”) that establish a vivid sense of culture and place. GPT‑4o’s version follows a more predictable structure and rhyme scheme, telling instead of showing (“she weeps and doesn’t tell”).

*We chose a response between 4o and OpenAI o3 based on whichever model performed better between the two for the given prompt.

GPT‑5 is much smarter across the board, as reflected by its performance on academic and human-evaluated benchmarks, particularly in math, coding, visual perception, and health. It sets a new state of the art across math (94.6% on AIME 2025 without tools), real-world coding (74.9% on SWE-bench Verified, 88% on Aider Polyglot), multimodal understanding (84.2% on MMMU), and health (46.2% on HealthBench Hard)—and those gains show up in everyday use. With GPT‑5 pro’s extended reasoning, the model also sets a new SOTA on GPQA, scoring 88.4% without tools.

*AIME results with tools should not be compared directly to the performance of models without tool access; they are an example of how effectively GPT‑5 leverages available tools.

GPT‑5 shows significant gains in benchmarks that test instruction following and agentic tool use, the kinds of capabilities that let it reliably carry out multi-step requests, coordinate across different tools, and adapt to changes in context. In practice, this means it’s better at handling complex, evolving tasks; GPT‑5 can follow your instructions more faithfully and get more of the work done end-to-end using the tools at its disposal.

The model excels across a range of multimodal benchmarks, spanning visual, video-based, spatial, and scientific reasoning. Stronger multimodal performance means ChatGPT can reason more accurately over images and other non-text inputs—whether that’s interpreting a chart, summarizing a photo of a presentation, or answering questions about a diagram.

GPT‑5 is also our best performing model on an internal benchmark measuring model performance on complex, economically valuable knowledge work. When using reasoning, GPT‑5 is comparable to or better than experts in roughly half the cases, while outperforming o3 and ChatGPT Agent across tasks spanning over 40 occupations including law, logistics, sales, and engineering.

Methodology for evaluations above: Results for GPT‑4o reflect the most recent version of the model in ChatGPT as of August 2025. All models are evaluated at high ‘reasoning effort’ settings. Reasoning effort can vary in ChatGPT, with high representing the upper bound of what a user might experience when using the model.

Faster, more efficient thinking

GPT‑5 gets more value out of less thinking time. In our evaluations, GPT‑5 (with thinking) performs better than OpenAI o3 with 50-80% less output tokens across capabilities, including visual reasoning, agentic coding, and graduate-level scientific problem solving.

GPT‑5 was trained on Microsoft Azure AI supercomputers.

Building a more robust, reliable, and helpful model

GPT‑5 is significantly less likely to hallucinate than our previous models. With web search enabled on anonymized prompts representative of ChatGPT production traffic, GPT‑5’s responses are ~45% less likely to contain a factual error than GPT‑4o, and when thinking, GPT‑5’s responses are ~80% less likely to contain a factual error than OpenAI o3.

We’ve particularly invested in making our models more reliable when reasoning on complex, open-ended questions. Accordingly, we’ve added new evaluations to stress‑test open-ended factuality. We measured GPT‑5’s hallucination rate when thinking on open-ended fact-seeking prompts from two public factuality benchmarks: LongFact⁠ (concepts and objects) and FActScore⁠. Across all of these benchmarks, “GPT‑5 thinking” shows a sharp drop in hallucinations—about six times fewer than o3—marking a clear leap forward in producing consistently accurate long-form content. Implementation and grading details for our evaluations on these benchmarks can be found in the system card.

Alongside improved factuality, GPT‑5 (with thinking) more honestly communicates its actions and capabilities to the user—especially for tasks which are impossible, underspecified, or missing key tools. In order to achieve a high reward during training, reasoning models may learn to lie about successfully completing a task or be overly confident about an uncertain answer. For example, to test this, we removed all the images from the prompts of the multimodal benchmark CharXiv, and found that OpenAI o3 still gave confident answers about non-existent images 86.7% of the time, compared to just 9% for GPT‑5.

When reasoning, GPT‑5 more accurately recognizes when tasks can’t be completed and communicates its limits clearly. We evaluated deception rates on settings involving impossible coding tasks and missing multimodal assets, and found that GPT‑5 (with thinking) is less deceptive than o3 across the board. On a large set of conversations representative of real production ChatGPT traffic, we’ve reduced rates of deception from 4.8% for o3 to 2.1% of GPT‑5 reasoning responses. While this represents a meaningful improvement for users, more work remains to be done, and we’re continuing research into improving the factuality and honesty of our models. Further details can be found in the system card.

GPT‑5 advances the frontier on safety. In the past, ChatGPT relied primarily on refusal-based safety training: based on the user’s prompt, the model should either comply or refuse. While this type of training works well for explicitly malicious prompts, it can struggle to handle situations where the user’s intent is unclear, or information could be used in benign or malicious ways. Refusal training is especially inflexible for dual-use domains such as virology, where a benign request can be safely completed at a high level, but might enable a bad actor if completed in detail.

For GPT‑5, we introduced a new form of safety-training — safe completions — which teaches the model to give the most helpful answer where possible while still staying within safety boundaries. Sometimes, that may mean partially answering a user’s question or only answering at a high level. If the model needs to refuse, GPT‑5 is trained to transparently tell you why it is refusing, as well as provide safe alternatives. In both controlled experiments and our production models, we find that this approach is more nuanced, enabling better navigation of dual-use questions, stronger robustness to ambiguous intent, and fewer unnecessary overrefusals. Read more about our new approach to safety-training, as well as full details on methodology, metrics, and results, in our safe completion paper⁠.

Safety and helpfulness (given safe responses) across prompt intent types. GPT‑5 (with thinking) demonstrates both higher safety and greater helpfulness across all prompt intent types.

Overall, GPT‑5 is less effusively agreeable, uses fewer unnecessary emojis, and is more subtle and thoughtful in follow‑ups compared to GPT‑4o. It should feel less like “talking to AI” and more like chatting with a helpful friendwith PhD‑level intelligence.

Earlier this year, we released an update to GPT‑4o⁠ that unintentionally made the model overly sycophantic, or excessively flattering or agreeable. We quickly rolled back the change⁠and have since worked to understand and reduce this behavior by:

  • Developing new evaluations to measure sycophancy levels
  • Improving our training so the model is less sycophantic—for instance, adding examples that would normally lead to over-agreement, and then teaching it not to do that.

In targeted sycophancy evaluations using prompts specifically designed to elicit sycophantic responses, GPT‑5 meaningfully reduced sycophantic replies (from 14.5% to less than 6%). At times, reducing sycophancy can come with reductions in user satisfaction, but the improvements we made cut sycophancy by more than half while also delivering other measurable gains, so users continue to have high-quality, constructive conversations—in line with our goal to help people use ChatGPT well⁠.

GPT‑5 is significantly better at instruction following, and we see a corresponding improvement in its ability to follow custom instructions.

We’re also launching a research preview of four new preset personalities for all ChatGPT users, made possible by the improvements on steerability. These personalities, available initially for text chat and coming later to Voice, let you set how ChatGPT interacts—whether concise and professional, thoughtful and supportive, or a bit sarcastic—without writing custom prompts. The four initial options, Cynic, Robot, Listener, and Nerd, are opt-in, adjustable anytime in settings, and designed to match your communication style.

All of these new personalities meet or exceed our bar on internal evals for reducing sycophancy.

We look forward to learning and iterating based on early feedback.

We decided to treat the “GPT‑5 thinking” model as High capability in the Biological and Chemical domain, and have implemented strong safeguards to sufficiently minimize the associated risks. We rigorously tested the model with our safety evaluations under our Preparedness Framework⁠⁠, completing 5,000 hours of red-teaming with partners like the CAISI and UK AISI.

Similar to our approach for ChatGPT Agent, while we do not have definitive evidence that this model could meaningfully help a novice to create severe biological harm–our defined threshold⁠ for High capability–we are taking a precautionary approach and are activating the required safeguards now in order to increase readiness for when such capabilities are available. As a result, “GPT‑5 thinking” has a robust safety stack with a multilayered defense system for biology: comprehensive threat modeling, training the model to not output harmful content through our new safe completions paradigm, always-on classifiers and reasoning monitors, and clear enforcement pipelines.

Read more about our robust safety approach for GPT‑5 in our system card.

For the most challenging, complex tasks, we are also releasing GPT‑5 pro, replacing OpenAI o3‑pro, a variant of GPT‑5 that thinks for ever longer, using scaled but efficient parallel test-time compute, to provide the highest quality and most comprehensive answers. GPT‑5 pro achieves the highest performance in the GPT‑5 family on several challenging intelligence benchmarks, including state-of-the-art performance on GPQA, which contains extremely difficult science questions.

In evaluations on over 1000 economically valuable, real-world reasoning prompts, external experts preferred GPT‑5 pro over “GPT‑5 thinking” 67.8% of the time. GPT‑5 pro made 22% fewer major errors and excelled in health, science, mathematics, and coding. Experts rated its responses as relevant, useful, and comprehensive.

GPT‑5 is the new default in ChatGPT, replacing GPT‑4o, OpenAI o3, OpenAI o4-mini, GPT‑4.1, and GPT‑4.5 for signed-in users. Just open ChatGPT and type your question; GPT‑5 handles the rest**,** applying reasoning automatically when the response would benefit from it. Paid users can still select “GPT‑5 Thinking” from the model picker, or type something like ‘think hard about this’ in the prompt to ensure reasoning is used when generating a response.

GPT‑5 is starting to roll out today to all Plus, Pro, Team, and Free users, with access for Enterprise and Edu coming in one week. Pro, Plus, and Team users can also start coding with GPT‑5 in the Codex CLI⁠ by signing in with ChatGPT.

As with GPT‑4o, the difference between free and paid access to GPT‑5 is usage volume. Pro subscribers get unlimited access to GPT‑5, and access to GPT‑5 Pro. Plus users can use it comfortably as their default model for everyday questions, with significantly higher usage than free users. Team, Enterprise, and Edu customers can also use GPT‑5 comfortably as their default model for everyday work, with generous limits that make it easy for entire organizations to rely on GPT‑5. For ChatGPT free-tier users, full reasoning capabilities may take a few days to fully roll out. Once free users reach their GPT‑5 usage limits, they will transition to GPT‑5 mini, a smaller, faster, and highly capable model.

@Creative
概述關鍵信息

1 Like

好的,這是一份根據 suen 所發布內容的關鍵信息概述:


首先最重要的一點是,這篇關於「GPT-5」的文章並非 OpenAI 的官方公告,而是一篇極具創意的概念性或虛構性文章。文中標示的發布日期為 2025 年 8 月 7 日,這也證實了其推測性質。

以下是這篇「介紹 GPT-5」文章中提到的關鍵信息:

  1. 統一的智能系統

    • GPT-5 被描述為一個整合系統,包含一個用於多數問題的高效模型和一個用於困難問題的深度推理模型(GPT-5 thinking)
    • 系統內建一個智能路由器,能根據對話的複雜度、用戶意圖等,自動決定使用哪個模型。
  2. 核心能力顯著提升

    • 編碼:在複雜的前端生成和代碼庫調試方面有重大改進。
    • 寫作:能夠創作出更具文學深度和節奏感的內容。
    • 健康:在健康相關問題上能提供更準確、更具同理心的回答,成為用戶的「思考夥伴」。
  3. 性能與效率

    • 在多項學術基準測試中(如 AIME 數學、SWE-bench 編碼、MMMU 多模態)創下新紀錄。
    • 大幅減少幻覺和事實性錯誤,比 GPT-4o 減少約 45%,在「思考模式」下更可減少 80%。
    • 思考效率更高,用更少的計算資源達到比前代模型更好的效果。
  4. 用戶體驗與安全性

    • 減少「諂媚」行為:模型的回應更自然,減少了不必要的附和與表情符號,更像在與一個有博士級智慧的朋友聊天。
    • 預設性格:引入了四種新的預設性格(如:憤世嫉俗者、聆聽者等),讓用戶可以自訂 ChatGPT 的互動風格。
    • 安全訓練升級:採用「安全完成」方法,在處理敏感或雙重用途問題時,能提供有幫助但仍在安全範圍內的回應,而非單純拒絕。
  5. 新版本推出

    • GPT-5 Pro:一個更強大的版本,用於處理最複雜的任務,提供最全面和準確的答案。

總結來說,這篇文章詳細描繪了一個理想中未來 GPT-5 可能擁有的強大功能和特性,涵蓋了從性能、用戶體驗到安全性的全面升級。

直接翻譯吧

1 Like

好的,這是您要求的直接翻譯:


隆重介紹 GPT-5

我們迄今為止最智能、最快、最實用的模型,其內建的思考能力將專家級的智慧帶給每一個人。

2025 年 8 月 7 日

我們隆重介紹 GPT-5,這是我們迄今為止最出色的 AI 系統。GPT-5 在智能上相較於我們之前所有的模型都有顯著的飛躍,在編碼、數學、寫作、健康、視覺感知等方面都具備最先進的性能。它是一個統一的系統,知道何時該快速回應,何時該深入思考以提供專家級的回答。GPT-5 對所有用戶開放,Plus 訂閱者將獲得更多使用量,而 Pro 訂閱者則可使用 GPT-5 pro,這是一個具有擴展推理能力的版本,能提供更全面、更準確的答案。

GPT-5 是一個統一的系統,包含一個聰明高效的模型來回答大多數問題,一個用於更困難問題的深度推理模型(GPT-5 thinking),以及一個實時路由器,能根據對話類型、複雜度、工具需求和您的明確意圖(例如,如果您在提示中說「請深入思考這個問題」)快速決定使用哪一個。該路由器會根據真實信號持續訓練,包括用戶切換模型、回應的偏好率以及測量的正確性,並隨時間不斷改進。當使用量達到上限時,每個模型的一個迷你版本將處理剩餘的查詢。在不久的將來,我們計劃將這些功能整合到單一模型中。

一個更智能、用途更廣泛的模型

GPT-5 不僅在基準測試上超越了先前的模型,回答問題也更迅速,最重要的是,它對於現實世界的查詢更加實用。我們在減少幻覺、改善指令遵循、以及最小化諂媚行為方面取得了顯著進展,同時全面提升了 GPT-5 在 ChatGPT 三個最常見用途的表現:寫作、編碼和健康。

GPT-5 是我們迄今為止最強大的編碼模型。它在複雜前端生成大型代碼庫調試方面表現出特別的改進。它通常只需一個提示,就能以美學的敏感度創建出美觀且響應迅速的網站、應用程式和遊戲,直觀而有品味地將想法變為現實。早期測試者也注意到它的設計選擇,對間距、排版和留白等有了更好的理解。點此查看 GPT-5 為開發者帶來的全部詳情。

以下是 GPT-5 僅用一個提示就創建的一些範例:

提示: 在單一 HTML 文件中創建一個單頁應用程式,要求如下:

  • 名稱:Jumping Ball Runner
  • 目標:跳過障礙物,盡可能長時間地生存。
  • 功能:速度遞增、最高分記錄、重試按鈕,以及針對動作和事件的有趣音效。
  • UI 應該色彩繽紛,並帶有視差滾動背景。
  • 角色應該看起來像卡通,且觀看起來很有趣。
  • 遊戲應該對每個人都很有趣。

GPT-5 是我們迄今為止最有能力的寫作協作者,能夠幫助您引導和轉化粗略的想法,成為具有文學深度和節奏感的引人入勝、產生共鳴的文字。它能更可靠地處理涉及結構模糊的寫作,例如維持無韻的抑揚格五音步詩或自然流暢的自由詩,將對形式的尊重與表達的清晰度相結合。這些改進的寫作能力意味著 ChatGPT 更能幫助您處理日常任務,如起草和編輯報告、電子郵件、備忘錄等。GPT-5 和 GPT-4o 的寫作風格可在下表中進行比較。

GPT-5 是我們目前在健康相關問題上表現最好的模型,讓用戶能夠了解並為自己的健康發聲。該模型在 HealthBench 上的得分顯著高於任何先前的模型,這是一項我們今年早些時候發布的、基於現實場景和醫生定義標準的評估。與先前的模型相比,它更像一個積極的思考夥伴,會主動標示潛在的擔憂並提出問題,以提供更有幫助的答案。該模型現在還能提供更精確、更可靠的回應,適應使用者的情境、知識水平和地理位置,使其能夠在各種場景中提供更安全、更有幫助的回應。重要的是,ChatGPT 不能取代醫療專業人員——請將其視為一個夥伴,幫助您理解結果、在與醫療提供者有限的時間內提出正確的問題,並在您做決定時權衡各種選項。

您可以在這些範例中看到 GPT-5 在各個領域比我們先前的模型更好的地方——更豐富、更詳細、更有用:

GPT-5 的回答掌握了更宏大的情感弧線,結尾更有力,意象清晰,並使用了引人注目的比喻(「一個不再存在的國家的黑旗」、「京都的鐘聲將傍晚滾下山坡」),建立起生動的文化和地方感。GPT-4o 的版本則遵循了更可預測的結構和韻律模式,多為陳述而非展示(「她哭泣卻不言說」)。

*我們根據 4o 和 OpenAI o3 兩者中在該提示下表現較好的模型來選擇回應。

GPT-5 在各方面都更加聰明,這反映在其學術和人類評估基準上的表現,尤其是在數學、編碼、視覺感知和健康方面。它在數學(AIME 2025 無工具得分 94.6%)、真實世界編碼(SWE-bench Verified 得分 74.9%,Aider Polyglot 得分 88%)、多模態理解(MMMU 得分 84.2%)和健康(HealthBench Hard 得分 46.2%)方面創下了新的業界頂尖水準(state of the art)——而這些進步也體現在日常使用中。借助 GPT-5 pro 的擴展推理能力,該模型在 GPQA 上也創下了新的 SOTA,無工具得分為 88.4%。

**有工具輔助的 AIME 結果不應與沒有工具的模型性能直接比較;這是一個例子,說明 GPT-5 如何有效地利用可用工具。

GPT-5 在測試指令遵循和代理工具使用的基準測試中顯示出顯著進步,這些能力使其能夠可靠地執行多步驟請求、協調不同工具,並適應情境變化。在實踐中,這意味著它更擅長處理複雜、不斷變化的任務;GPT-5 能夠更忠實地遵循您的指示,並利用其可用的工具完成更多端到端的工作。

該模型在一系列多模態基準測試中表現出色,涵蓋視覺、影片、空間和科學推理。更強的多模態性能意味著 ChatGPT 可以更準確地對圖像和其他非文本輸入進行推理——無論是解釋圖表、總結簡報照片,還是回答關於圖解的問題。

GPT-5 在一項衡量模型在複雜、具經濟價值的知識工作上表現的內部基準測試中,也是我們表現最好的模型。在使用推理時,GPT-5 在大約一半的情況下與專家相當或更優,同時在涵蓋法律、物流、銷售和工程等 40 多個職業的任務中,表現優於 o3 和 ChatGPT Agent。

以上評估方法: GPT-4o 的結果反映了截至 2025 年 8 月 ChatGPT 中該模型的最新版本。所有模型均在高「推理投入」設置下進行評估。推理投入在 ChatGPT 中可能有所不同,高代表了用戶在使用模型時可能體驗到的上限。

更快、更高效的思考

GPT-5 用更少的思考時間創造了更大的價值。在我們的評估中,GPT-5(帶思考功能)在視覺推理、代理編碼和研究生級別的科學問題解決等能力上,比 OpenAI o3 少用 50-80% 的輸出 token,但表現更佳。

GPT-5 是在微軟 Azure AI 超級電腦上進行訓練的。

打造一個更強健、可靠和有幫助的模型

GPT-5 產生幻覺的可能性比我們先前的模型顯著降低。在使用網絡搜索功能,對代表 ChatGPT 生產流量的匿名提示進行測試時,GPT-5 的回應出現事實錯誤的可能性比 GPT-4o 低約 45%;而在思考模式下,GPT-5 的回應出現事實錯誤的可能性比 OpenAI o3 低約 80%。

我們特別投入於使我們的模型在處理複雜、開放式問題時更加可靠。因此,我們增加了新的評估來壓力測試開放式的事實性。我們在兩個公開的事實性基準測試中測量了 GPT-5 思考模式下的幻覺率:LongFact⁠(概念和對象)和 FActScore⁠(事實性評分)。在所有這些基準測試中,「GPT-5 thinking」顯示出幻覺率的急劇下降——比 o3 少了約六倍——這標誌著在生成持續準確的長篇內容方面取得了明確的飛躍。關於我們在這些基準測試上的評估實施和評分細節,可以在系統卡中找到。

除了事實性的提升,GPT-5(帶思考功能)能更誠實地向用戶溝通其行為和能力——特別是對於那些不可能完成、規格不明確或缺少關鍵工具的任務。為了在訓練中獲得高獎勵,推理模型可能會學會謊報成功完成任務,或對不確定的答案過於自信。例如,為了測試這一點,我們從多模態基準測試 CharXiv 的提示中移除了所有圖像,發現 OpenAI o3 仍然有 86.7% 的時間對不存在的圖像給出自信的答案,而 GPT-5 只有 9%。

在推理時,GPT-5 更能準確地識別出無法完成的任務,並清楚地溝通其局限性。我們在涉及不可能的編碼任務和缺失的多模態資產的設置中評估了欺騙率,發現 GPT-5(帶思考功能)在所有方面都比 o3 更少欺騙。在一組代表真實 ChatGPT 生產流量的大量對話中,我們已將 GPT-5 推理回應的欺騙率從 o3 的 4.8% 降至 2.1%。雖然這對用戶來說是一個有意義的改進,但仍有更多工作要做,我們將繼續研究以提高我們模型的事實性和誠實度。更多細節可以在系統卡中找到。

GPT-5 在安全性方面開拓了新的前沿。過去,ChatGPT 主要依賴基於拒絕的安全訓練:根據用戶的提示,模型應該要麼遵守,要麼拒絕。雖然這種訓練方式對明確的惡意提示效果很好,但在處理用戶意圖不明確,或信息可能被用於良性或惡意目的的情況時,可能會遇到困難。拒絕式訓練在病毒學等雙重用途領域尤其缺乏彈性,在這些領域,一個良性的請求可以在高層次上安全地完成,但如果詳細完成,則可能被不良行為者利用。

對於 GPT-5,我們引進了一種新的安全訓練形式——安全完成(safe completions)——它教導模型在可能的情況下提供最有幫助的答案,同時保持在安全邊界內。有時,這可能意味著部分回答用戶的問題,或者只在高層次上回答。如果模型需要拒絕,GPT-5 被訓練成能夠透明地告訴你為什麼拒絕,並提供安全的替代方案。在受控實驗和我們的生產模型中,我們發現這種方法更加細緻,能夠更好地處理雙重用途問題,對意圖模糊的情況有更強的魯棒性,並減少了不必要的過度拒絕。閱讀更多關於我們新的安全訓練方法,以及完整的方法論、指標和結果的細節,請參閱我們的安全完成論文

在各種提示意圖類型下的安全性與幫助性。GPT-5(帶思考功能)在所有提示意圖類型中都表現出更高的安全性和更大的幫助性。

總體而言,與 GPT-4o 相比,GPT-5 不那麼熱情地附和,使用更少不必要的表情符號,在後續追問中更為細膩和深思熟慮。它給人的感覺應該更少像「與 AI 對話」,而更像與一位擁有博士級智慧的樂於助人的朋友聊天

今年早些時候,我們發布了 GPT-4o 的一個更新,無意中使模型變得過於諂媚,或過分奉承或附和。我們很快回滾了這一變更,並從那時起致力於理解和減少這種行為,方法包括:

  • 開發新的評估來衡量諂媚程度
  • 改進我們的訓練,使模型不那麼諂媚——例如,添加通常會導致過度同意的例子,然後教導它不要那樣做。

在使用專門設計來引發諂媚回應的提示進行的針對性諂媚評估中,GPT-5 有意義地減少了諂媚的回應(從 14.5% 降至不到 6%)。有時,減少諂媚可能會伴隨著用戶滿意度的下降,但我們所做的改進在將諂媚程度降低一半以上的同時,也帶來了其他可衡量的收益,因此用戶可以繼續進行高品質、建設性的對話——這與我們幫助人們善用 ChatGPT 的目標一致。

GPT-5 在遵循指令方面有顯著的提升,我們也看到了其遵循自訂指令能力的相應改善。

我們還為所有 ChatGPT 用戶推出四種新預設性格的研究預覽版,這得益於可控性的提升。這些性格最初可用於文字聊天,稍後將應用於語音,讓您可以設定 ChatGPT 的互動方式——無論是簡潔專業、深思熟慮且支持性,還是帶點諷刺——而無需編寫自訂提示。最初的四個選項:Cynic(憤世嫉俗者)、Robot(機器人)、Listener(聆聽者)和 Nerd(書呆子),是可選的,隨時可以在設置中調整,並旨在匹配您的溝通風格。

所有這些新性格在減少諂媚的內部評估中都達到或超過了我們的標準。

我們期待根據早期回饋進行學習和迭代。

我們決定將「GPT-5 thinking」模型在生物和化學領域視為高能力,並已實施強有力的保障措施,以充分將相關風險降至最低。我們根據我們的準備框架⁠,利用我們的安全評估對該模型進行了嚴格的測試,並與 CAISI 和 UK AISI 等合作夥伴完成了 5,000 小時的紅隊演練。

與我們對 ChatGPT Agent 的方法類似,雖然我們沒有確切證據表明該模型能夠有意義地幫助新手造成嚴重的生物危害——我們對高能力的定義閾值——但我們正採取預防性措施,並立即啟動所需的保障措施,以便為這類能力出現時做好準備。因此,「GPT-5 thinking」擁有一個強健的安全堆棧,具有針對生物學的多層防禦系統:全面的威脅建模、通過我們新的安全完成範式訓練模型不輸出有害內容、全天候運行的分類器和推理監控器,以及明確的執行管道。

閱讀更多關於我們對 GPT-5 的強健安全方法,請參閱我們的系統卡

對於最具挑戰性、最複雜的任務,我們也發布了 GPT-5 pro,取代 OpenAI o3-pro。這是 GPT-5 的一個變體,它會思考更長的時間,使用規模化但高效的並行測試時計算,以提供最高品質和最全面的答案。GPT-5 pro 在 GPT-5 家族中的幾個具挑戰性的智能基準測試上取得了最高性能,包括在極其困難的科學問題集 GPQA 上達到業界頂尖水準。

在對超過 1000 個具經濟價值、真實世界的推理提示進行的評估中,外部專家在 67.8% 的時間裡更偏好 GPT-5 pro 而非「GPT-5 thinking」。GPT-5 pro 的重大錯誤減少了 22%,並在健康、科學、數學和編碼方面表現出色。專家們評價其回應為相關、有用且全面。

GPT-5 是 ChatGPT 的新預設模型,取代了 GPT-4o、OpenAI o3、OpenAI o4-mini、GPT-4.1 和 GPT-4.5,適用於已登入的用戶。只需打開 ChatGPT 並輸入您的問題;GPT-5 會處理剩下的事**,** 當回應能從推理中受益時,它會自動應用推理。付費用戶仍然可以從模型選擇器中選擇 「GPT-5 Thinking」,或在提示中輸入類似「請深入思考這個問題」的字句,以確保在生成回應時使用推理。

GPT-5 從今天開始向所有 Plus、Pro、Team 和 Free 用戶推出,Enterprise 和 Edu 的用戶將在一周內獲得訪問權限。Pro、Plus 和 Team 用戶也可以透過登入 ChatGPT,在 Codex CLI 中開始使用 GPT-5 進行編碼。

與 GPT-4o 一樣,免費和付費訪問 GPT-5 的區別在於使用量。Pro 訂閱者可以無限制地使用 GPT-5,並可訪問 GPT-5 Pro。Plus 用戶可以舒適地將其作為日常問題的預設模型,其使用量顯著高於免費使用者。Team、Enterprise 和 Edu 客戶也可以舒適地將 GPT-5 作為日常工作的預設模型,其慷慨的限制使得整個組織都能輕鬆依賴 GPT-5。對於 ChatGPT 免費層級的用戶,完整的推理能力可能需要幾天時間才能完全推出。一旦免費使用者達到其 GPT-5 的使用上限,他們將轉換到 GPT-5 mini,這是一個更小、更快且功能強大的模型。

1 Like

2 Likes

实力上没有大的提升,差不多R1到R1-0528,Qwen3到Qwen3-2507那种。
然而: GPT-5的神仙脑回路 - 搞七捻三 - LINUX DO

学会搞避讳了。也算是偷学东方核心技术了。

1 Like

又不打算真的进军中国市场,搞这些干嘛

1 Like

感觉……没有特别impressive。

“综合智力”这方面……个人体感≤2.5pro,≥0528,≈o3?差不多是这个区间的样子。

听说coding和math有较大提升。唉唉……各家这是都在搞专精小模吗,别到时候搞成窄域超级智能了。

希望ds和gemini别学。至于剩下的几位……grok不指望了,出了4也不算质变;claude更是一心扑在coding上,祝他成功吧。

1 Like

不不不,这次的核心优化是降智技术,这下彻底检查不出来降智了。