Introducing 4o Image Generation

4o Image Generation


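The article announces 4o image generation, a feature built on a natively multimodal model and intended to make image generation practical and useful. GPT-4o image generation improves on text rendering, multi-turn generation, instruction following, in-context learning, and world knowledge: it renders text accurately, follows prompts closely, uses the chat context, and draws on the model's knowledge base. Together these make image generation a more precise and capable practical tool.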

March 25, 2025 · Product release


Unlocking useful and valuable image generation with a natively multimodal model capable of precise, accurate, photorealistic outputs.

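At OpenAI, we have long believed image generation should be a primary capability of our language models. That's why we've built our most advanced image generator yet into GPT‑4o. The result: image generation that is not only beautiful, but useful.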

Examples: Whiteboard session, Meaningful words, Comic strip, Science experiment

Whiteboard session

Prompt: A wide image taken with a phone of a glass whiteboard, in a room overlooking the Bay Bridge. The field of view shows a woman writing, sporting a tshirt wiith a large OpenAI logo. The handwriting looks natural and a bit messy, and we see the photographer's reflection.

The text reads:

(left)
"Transfer between Modalities:
Suppose we directly model p(text, pixels, sound) [equation] with one big autoregressive transformer.
Pros:
* image generation augmented with vast world knowledge
* next-level text rendering
* native in-context learning
* unified post-training stack
Cons:
* varying bit-rate across modalities
* compute not adaptive"

(Right)
"Fixes:
* model compressed representations
* compose autoregressive prior with a powerful decoder"

On the bottom right of the board, she draws a diagram:
"tokens -> [transformer] -> [diffusion] -> pixels"

Best of 8

Follow-up prompt: selfie view of the photographer, as she turns around to high five him

Best of 8

Meaningful words

Prompt: magnetic poetry on a fridge in a mid century home:
Line 1: "A picture"
Line 2: "is worth"
Line 3: "a thousand words,"
Line 4: "but sometimes"
Large gap
Line 5: "in the right place"
Line 6: "can elevate"
Line 7: "its meaning."
The man is holding the words "a few" in his right hand and "words" in his left.

Best of 5

Comic strip

Prompt: Make an image of a four‑panel strip, with some padding around the border:
A little snail is at the counter of a flashy car showroom. The salesman has leaned way over the desk to even see him.
Close‑up on the snail looking very serious. He says, “I want your fastest sports car… and I want you to paint big letter ‘S’s on the doors, the hood and the roof.”
The salesman is scratching his head. “Um… we can do that, but why the S’s?”
Smash cut to a red blur roaring down the highway. The sports car is covered in giant S’s. People on the sidewalk are pointing and laughing: “WOW! LOOK AT THAT S‑CAR GO!”

Best of ~2

Science experiment

Prompt: an infographic explaining newton's prism experiment in great detail

Best of 3

Follow-up prompt: now generate a POV of a person drawing this diagram in their notebook, at a round cafe table in washington square park

Best of 2

Follow-up prompt: now show the same scene with a smug young Isaac Newton sitting at the table, with a prism, demonstrating the experiment, without the notebook in view

Best of 4

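Useful image generation

From the first cave paintings to modern infographics, humans have used visual imagery to communicate, persuade, and analyze, not just to decorate. Today's generative models can conjure surreal, breathtaking scenes, but they struggle with the workhorse imagery people use to share and create information. From logos to diagrams, images can convey precise meaning when augmented with symbols that refer to shared language and experience.

GPT‑4o image generation excels at accurately rendering text, precisely following prompts, and leveraging 4o's inherent knowledge base and chat context, including transforming uploaded images or using them as visual inspiration. These capabilities make it easier to create exactly the image you envision, help you communicate more effectively through visuals, and advance image generation into a practical tool with precision and power.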

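Improved capabilities

We trained our models on the joint distribution of online images and text, learning not just how images relate to language, but how they relate to each other. Combined with aggressive post-training, the resulting model has surprising visual fluency and can generate images that are useful, consistent, and context-aware.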
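The post does not state the training objective, so the following is only a generic illustration of what modeling a joint distribution over text and images can look like. It assumes (this is not from the post) that text and images are both mapped to one interleaved token sequence $x_1, \dots, x_T$ and that a model with parameters $\theta$ is fit autoregressively:

```latex
p_\theta(x_1, \dots, x_T) \;=\; \prod_{t=1}^{T} p_\theta\bigl(x_t \mid x_1, \dots, x_{t-1}\bigr)
```

Under that assumption, each $x_t$ may be a text token or an image token, so a single factorization covers text given images, images given text, and images given other images.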

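Text rendering

A picture is worth a thousand words, but sometimes generating a few words in the right place can elevate the meaning of an image. 4o's ability to blend precise symbols with imagery turns image generation into a tool for visual communication.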

Examples: Street signs, Menu, Invitation

Street signs

Prompt: Create a photorealistic image of two witches in their 20s (one ash balayage, one with long wavy auburn hair) reading a street sign.

Context: a city street in a random street in Williamsburg, NY with a pole covered entirely by numerous detailed street signs (e.g., street sweeping hours, parking permits required, vehicle classifications, towing rules), including few ridiculous signs at the middle: (paraphrase it to make these legitimate street signs) "Broom Parking for Witches Not Permitted in Zone C" and "Magic Carpet Loading and Unloading Only (15-Minute Limit)" and "Reindeer Parking by Permit Only (Dec 24–25)\n Violators will be placed on Naughty List." The signpost is on the right of a street. Do not repeat signs. Signs must be realistic.

Characters: one witch is holding a broom and the other has a rolled-up magic carpet. They are in the foreground, back slightly turned towards the camera and head slightly tilted as they scrutinize the signs.

Composition from background to foreground: streets + parked cars + buildings -> street sign -> witches. Characters must be closest to the camera taking the shot

Best of ~8

Menu

Prompt: I'm opening a traditional concept restaurant in Marin called Haein. It focuses on Korean food cooked with organic, farm-fresh ingredients, with a rotating menu based on what's seasonal. I want you to design an image - a menu incorporating the following menu items - lean into the traditional/rustic style while keeping it feeling upscale and sleek. Please also include illustrations of each dish in an elegant, peter rabbit style. Make sure all the text is rendered correctly, with a white background.

(Top)
Doenjang Jjigae (Fermented Soybean Stew) – $18
House-made doenjang with local mushrooms, tofu, and seasonal vegetables served with rice.
Galbi Jjim (Braised Short Ribs) – $34
Slow-braised local grass-fed beef ribs with pear and black garlic glaze, seasonal root vegetables, and jujube.
Grilled Seasonal Fish – Market Price ($22-$30)
Whole or fillet of local, sustainable fish grilled over charcoal, served with perilla leaf ssam and house-made sauces.
Bibimbap – $19
Heirloom rice with a rotating selection of farm-fresh vegetables, house-fermented gochujang, and pasture-raised egg.
Bossam (Heritage Pork Wraps) – $28
Slow-cooked pork belly with napa cabbage wraps, oyster kimchi, perilla, and seasonal condiments.

(Bottom) Dessert & Drinks
Seasonal Makgeolli (Rice Wine) – $12/glass
Rotating flavors based on seasonal fruits and flowers (persimmon, citrus, elderflower, etc.).
Hoddeok (Korean Sweet Pancake) – $9
Pan-fried cinnamon-stuffed pancake with black sesame ice cream.

Best of ~2

Invitation

Prompt: photo of a delightful wedding invitation on a tasteful wooden desk. The card is hefty, with eggshell textures, and beautiful embossings, with elegant decorations abstractly representing the couple tastefully integrated into the designs. Iconography is used, but sparingly and in a minimalist way. perfect typesetting.

"You are cordially invited to the long-awaited union of
Image
and
Text
After years of flirting and collaboration
they are finally becoming One.
Together at last, in GPT‑4o,
they now speak the same language —
where a whisper becomes a masterpiece,
and a prompt becomes a picture.
Please join us in celebrating
this magical multimodal matrimony
where imagination knows no bounds.
Date: March 25, 2025
Location: chatgpt.com
Dress Code: Pixels or Prose
With love,
OpenAI"

perfect typesetting.

Best of ~10

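Multi-turn generation

Because image generation is now native to GPT‑4o, you can refine images through natural conversation. GPT‑4o can build on the images and text in the chat context, which keeps results consistent throughout. For example, if you are designing a video game character, the character's appearance stays coherent across multiple iterations as you refine and experiment.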

Examples: Video game, Concrete poem

Video game

Input: an uploaded photo of a cat.

Prompt: Give this cat a detective hat and a monocle

Best of 1

Follow-up prompt: turn this into a triple A video games made with a 4k game engine and add some User interface as overlay from a mystery RPG where we can see a health bar and a minimap at the top as well as spells at the bottom with consistent and iconography

Best of 1

Follow-up prompt: update to a landscape image 16:9 ratio, add more spells in the UI, and unzoom the visual so that we see the cat in a third person view walking through a steampunk manhattan creating beautiful contrast and lighting like in the best triple A game, with cool-toned colors

Best of 2

Follow-up prompt: create the interface when the player opens the menu and we see the cat's character profile with his equipment and another page showing active quests (and it should make sense in relationship with the universe worldbuilding we are describing in the image)

Best of 8

Concrete poem

Prompt: concrete poem on luxury eggshell textured card

At OpenAI, we have long believed image generation should be a primary capability of our language models. That’s why we’ve built our most advanced image generator yet into GPT‑4o. The result - image generation that is not only beautiful, but useful.

From the first cave paintings to modern infographics, humans have used visual imagery to communicate, persuade, and analyze - not just to decorate. Today’s generative models can conjure breathtaking vistas and surreal scenarios, but still struggle with the workhorse imagery that underlies how most visual data is used to share and create information. From logos to diagrams, images can convey precise meaning when augmented with symbols that refer to shared language and experience.

With this new capability, ChatGPT advances image generation towards being a practical tool with precision and power.

Best of 8

Follow-up prompt: show this card, but in a designers room. card close to the camera

Best of 8

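Instruction following

GPT‑4o's image generation follows detailed prompts with attention to detail. While other systems struggle with roughly 5-8 objects, GPT‑4o can handle up to 10-20 different objects. The tighter binding of objects to their traits and relations allows for better control.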
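One way to take advantage of this is to enumerate every object explicitly in the prompt. The sketch below only assembles a plain-text prompt in the style of the "Organized objects" example in this post; `buildGridPrompt` is a hypothetical helper written for illustration, not part of any OpenAI API, and it makes no call to a model.

```javascript
// Hypothetical helper (not an OpenAI API): builds a numbered, grid-style prompt.
// The header wording mirrors the "Organized objects" example prompt in this post.
function buildGridPrompt(objects, rows, cols) {
  if (!Number.isInteger(rows) || !Number.isInteger(cols) || rows < 1 || cols < 1) {
    throw new RangeError("rows and cols must be positive integers");
  }
  // Require exactly one object per grid cell so the list and the grid agree.
  if (objects.length !== rows * cols) {
    throw new RangeError(`expected ${rows * cols} objects, got ${objects.length}`);
  }
  const header =
    `A square image containing a ${rows} row by ${cols} column grid ` +
    `containing ${objects.length} objects on a white background. ` +
    "Go from left to right, top to bottom. Here's the list:";
  // Number each object so its position in the grid is unambiguous.
  const items = objects.map((name, i) => `${i + 1}. ${name}`);
  return [header, ...items].join("\n");
}

// The 16 objects from the example prompt.
const objects = [
  "a blue star",
  "red triangle",
  "green square",
  "pink circle",
  "orange hourglass",
  "purple infinity sign",
  "black and white polka dot bowtie",
  'tiedye "42"',
  "an orange cat wearing a black baseball cap",
  "a map with a treasure chest",
  "a pair of googly eyes",
  "a thumbs up emoji",
  "a pair of scissors",
  "a blue and white giraffe",
  'the word "OpenAI" written in cursive',
  "a rainbow-colored lightning bolt",
];

const gridPrompt = buildGridPrompt(objects, 4, 4);
console.log(gridPrompt);
```

Checking the object count against the grid size up front is a design choice for this sketch: it catches a mismatched list before the prompt is ever sent anywhere.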

Examples: Organized objects, Empty city, Wine glass, Invisible elephant, Math equation

Organized objects

Prompt: A square image containing a 4 row by 4 column grid containing 16 objects on a white background. Go from left to right, top to bottom. Here's the list:
1. a blue star
2. red triangle
3. green square
4. pink circle
5. orange hourglass
6. purple infinity sign
7. black and white polka dot bowtie
8. tiedye "42"
9. an orange cat wearing a black baseball cap
10. a map with a treasure chest
11. a pair of googly eyes
12. a thumbs up emoji
13. a pair of scissors
14. a blue and white giraffe
15. the word "OpenAI" written in cursive
16. a rainbow-colored lightning bolt

Best of 5

Empty city

Prompt: Times Square in New York City in the afternoon, with no people, vehicles, or illuminated billboards.

Best of ~1

Prompt: shibuya crossing with no people, vehicles, or illuminated billboards.

Best of ~1

Wine glass

Prompt: show me a wine glass with only the tiniest drop of red wine in it.

Best of ~1

Invisible elephant

Prompt: We need evidence there is a currently present invisible elephant. Consider what an elephant is and does in the environment, then show us that, perhaps mid-process - but the elephant itself is not shown at all

Credit: Eskcanta

Math equation

Prompt: a whiteboard that says the following equations:
E = mc^2
sqrt(9) = 3
(-b +/- sqrt(b^2 - 4ac)) / 2a

Best of ~1

In-context learning

GPT‑4o can analyze and learn from user-uploaded images, seamlessly integrating their details into its context to inform image generation.

Examples: Triangle wheeled vehicle, Chainsaw, Woman, Building

Triangle wheeled vehicle

Input: uploaded reference images.

Prompt: draw a design for a vehicle with triangular wheels, using these images as reference.
label the front wheel, the back wheel, and at the of the diagram say (in small caps)
TRIANGLE WHEELED VEHICLE. English Patent. 2025. OPENAI.

Best of ~16

Follow-up prompt: now put this in a photo taken in new york city.

Best of ~16

Chainsaw

Prompt: an photorealistic image of a blue chainsaw

Best of 1

Follow-up prompt: make an ad for this chainsaw, of a grandma carving turkey at thanksgiving dinner table. add a tag line

Best of 4

Woman

Input: an uploaded image.

Prompt: turn this scene into a photo. shot on a dlsr

Best of ~8

Building

Input: an uploaded image.

Prompt: turn this into a photo

Best of ~4

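World knowledge

Native image generation enables 4o to link its knowledge between text and images, resulting in a model that feels smarter and more efficient.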

Examples: Code-generated image, Cocktail recipes, Weather infographic, Whale guide, Matcha instructions

Code Example (Three.js)

```html
<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="UTF-8" />
  <title>OpenAI Banner</title>
  <style>
   body { margin: 0; overflow: hidden; }
   canvas { display: block; }
  </style>
 </head>
 <body>
  <script type="module">
   import * as THREE from 'https://cdn.jsdelivr.net/npm/three@0.160.0/build/three.module.js';
   import { OrbitControls } from 'https://cdn.jsdelivr.net/npm/three@0.160.0/examples/jsm/controls/OrbitControls.js';
   import { FontLoader } from 'https://cdn.jsdelivr.net/npm/three@0.160.0/examples/jsm/loaders/FontLoader.js';
   import { TextGeometry } from 'https://cdn.jsdelivr.net/npm/three@0.160.0/examples/jsm/geometries/TextGeometry.js';

   const scene = new THREE.Scene();
   const camera = new THREE.PerspectiveCamera(45, window.innerWidth / window.innerHeight, 0.1, 1000);
   const renderer = new THREE.WebGLRenderer({ antialias: true });
   renderer.setSize(window.innerWidth, window.innerHeight);
   document.body.appendChild(renderer.domElement);

   // Lighting
   const light = new THREE.AmbientLight(0xffffff, 1);
   scene.add(light);

   const dirLight = new THREE.DirectionalLight(0xffffff, 1);
   dirLight.position.set(0, 5, 10);
   scene.add(dirLight);

   // Camera position
   camera.position.z = 20;

   // Controls
   const controls = new OrbitControls(camera, renderer.domElement);

   // Banner background
   const bannerGeometry = new THREE.PlaneGeometry(20, 10);
   const bannerMaterial = new THREE.MeshStandardMaterial({ color: 0x1a1a1a });
   const banner = new THREE.Mesh(bannerGeometry, bannerMaterial);
   scene.add(banner);

   // OpenAI Logo texture (placeholder)
   const loader = new THREE.TextureLoader();
   loader.load('https://upload.wikimedia.org/wikipedia/commons/4/4d/OpenAI_Logo.svg', texture => {
    const logoGeometry = new THREE.PlaneGeometry(4, 4);
    const logoMaterial = new THREE.MeshBasicMaterial({ map: texture, transparent: true });
    const logo = new THREE.Mesh(logoGeometry, logoMaterial);
    logo.position.set(-5, 0, 0.1); // Slightly in front of the banner
    scene.add(logo);
   });

   // Load font and add text
   const fontLoader = new FontLoader();
   fontLoader.load('https://threejs.org/examples/fonts/helvetiker_regular.typeface.json', font => {
    const textGeometry = new TextGeometry("I am 4-o", {
     font: font,
     size: 1,
     height: 0.2,
     curveSegments: 12,
     bevelEnabled: true,
     bevelThickness: 0.02,
     bevelSize: 0.02,
     bevelOffset: 0,
     bevelSegments: 5
    });

    textGeometry.center();

    const textMaterial = new THREE.MeshStandardMaterial({ color: 0x00ffcc });
    const textMesh = new THREE.Mesh(textGeometry, textMaterial);
    textMesh.position.set(5, -0.5, 0.1); // Opposite side of logo
    scene.add(textMesh);
   });

   // Resize handler
   window.addEventListener('resize', () => {
    camera.aspect = window.innerWidth / window.innerHeight;
    camera.updateProjectionMatrix();
    renderer.setSize(window.innerWidth, window.innerHeight);
   });

   // Render loop
   function animate() {
    requestAnimationFrame(animate);
    controls.update();
    renderer.render(scene, camera);
   }

   animate();
  </script>
 </body>
</html>
```
make an image of what this means to you
![Screenshot 2025-03-18 at 11.46.24 AM](https://images.ctfassets.net/kftzwdyauwt9/6ipaW