Advanced AI Coding Workflows: GPT 5.2, Multi-Agent Pipelines, and Red Teams
GPT 5.2 vs Claude vs Codex: The Verdict
After a year of daily use with $200/month GPT Pro and $100/month Claude Pro subscriptions, here's the breakdown:
GPT 5.2 High Reasoning — Best for actual coding. Measures twice, cuts once. Keeps the bigger picture in mind so changes don't break other parts of your app. Less hallucination. More pessimistic personality but that's actually good for code quality.
Opus — Fast because of parallel tool calling. But it's like working with an enthusiastic intern who makes a mess. Files everywhere, random test scripts, doesn't follow instructions exactly, hallucinates more. Performance drops hard on larger codebases due to its limited context window.
Codex CLI — OpenAI's equivalent to Claude Code. Not to be confused with Codex Cloud (terrible branding). The only model I use for writing code is GPT 5.2 in high reasoning mode through Codex CLI.
The only time Claude beats GPT 5.2 is maybe UI design. For accuracy, code quality, and instruction following — GPT 5.2 wins by a mile.
Environment Setup
Three apps. That's it:
- Safari — Came full circle back from Arc and Zed
- Warp — Terminal with AI built in
- Codex App — Good stepping stone from Cursor if you're not CLI-comfortable
Everything runs from Git worktrees. This lets you run multiple agents in parallel on different branches. Each feature gets an isolated environment. Nothing gets committed to main until the feature is fully developed.
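As a minimal, self-contained sketch (branch and directory names are made up), setting up isolated worktrees for two parallel agents looks like this:

```shell
# Create a throwaway repo so the example is self-contained.
git init -q demo-repo
cd demo-repo
git -c user.email=demo@example.com -c user.name=demo \
    commit -q --allow-empty -m "init"

# One worktree per feature branch; each agent works in its own directory.
git worktree add -b feature-login ../wt-login
git worktree add -b feature-search ../wt-search

git worktree list   # shows the main checkout plus the two worktrees
```

Each directory is a full working copy on its own branch, so two agents can edit files at the same time without stepping on each other.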
git worktree add -b feature-branch new-directory

PRD Workflow: Spec-Driven Development
The only part of the process I'm actually involved in is creating the Product Requirements Document. Everything after that is automated.
The PRD agent:
- Runs 10 rounds of questions to get 90%+ clarity on what I want built
- Asks about expected features, behavior, how it'll be used
- Makes 3 suggestions per question with a recommended option
- I reply like "1A, 2B, 3C, 4 no do this instead"
- Translates everything into GitHub issues
This upfront clarity is critical. Without it, the AI misinterprets what you want, heads in the wrong direction, changes code you then have to clean up, and wastes tokens and time. Over-specify. Give more detail than you think you need.
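Once the answers are locked in, the last step of the PRD flow is mechanical. A hedged sketch using the gh CLI (issue titles and bodies are made up; DRY_RUN just prints the commands instead of executing them, since actually creating issues needs an authenticated gh):

```shell
# Turn finalized PRD items into GitHub issues with the gh CLI.
# DRY_RUN=1 prints each command instead of running it.
DRY_RUN=1

create_issue() {
  if [ -n "$DRY_RUN" ]; then
    echo "gh issue create --title \"$1\" --body \"$2\""
  else
    gh issue create --title "$1" --body "$2"
  fi
}

# Hypothetical issues distilled from the PRD Q&A rounds.
create_issue "Add login rate limiting" "Acceptance: lockout after 5 failed attempts"
create_issue "Cache search results" "Acceptance: repeat queries answered from cache"
```

Putting the acceptance criteria in the issue body matters: the QA agent later reads them straight off the issue.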
The Dev → QA → Merge Pipeline
Once issues are created, I run one command: IMPL 1237 (a shortcut I built; 1237 is the GitHub issue number). Here's what happens automatically:
1. Dev Agent — Creates a worktree for the branch, implements the feature based on the GitHub issue, says "done" when finished.
2. QA Agent (separate context window — this is key) — Fresh agent, no history of what happened. Reads the issue requirements and acceptance criteria. Runs end-to-end tests in real environments. Checks code efficiency, no bloat, no over-engineering. Must prove it works with actual output.
Why separate agents? If you ask the dev agent to QA its own work, it's biased. It'll say everything's fine when it isn't.
3. Iteration Loop — QA fails → feedback goes to dev agent → Dev fixes → new QA agent spins up → Repeat until QA passes.
4. Merge Agent — Runs AI code reviews (Cubic.dev, CodeRabbit), runs linting (Ruff), updates documentation if needed, merges to main, closes the issue.
With Codex's sub-agent capability, this entire pipeline runs with one command.
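The control flow of that pipeline can be sketched as a plain loop. Everything below is a stand-in (echo instead of real agent invocations, and QA is simulated to pass on the second round); the point is the structure: a fresh QA pass after every dev attempt, and merge only once QA signs off.

```shell
# Stand-ins for the real agent invocations (Codex CLI calls in practice).
dev_agent()   { echo "dev: implementing issue #$1 (attempt $2)"; }
qa_agent()    { echo "qa: fresh context, checking acceptance criteria"; }
merge_agent() { echo "merge: lint, AI review, merge to main, close #$1"; }

issue=1237
attempt=0
qa_passed=1                     # nonzero means QA has not passed yet
while [ "$qa_passed" -ne 0 ] && [ "$attempt" -lt 5 ]; do
  attempt=$((attempt + 1))
  dev_agent "$issue" "$attempt"
  qa_agent                      # a brand-new QA agent each round
  if [ "$attempt" -ge 2 ]; then # simulated: QA passes on round two
    qa_passed=0
  fi
done
merge_agent "$issue"
```

The attempt cap is a safety valve: if QA keeps failing, you want the loop to stop and surface the issue rather than burn tokens forever.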
Running Agents in Parallel
I can run 40-50 agents simultaneously without CPU issues — as long as they're not launching sub-agents.
With sub-agents (the full IMPL pipeline), CPU spikes. On an M4 MacBook Pro, 5-10 implementation agents with sub-agents push CPU to 100%.
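The fan-out itself is just background jobs, one per worktree. A sketch (run_agent is a stand-in for the real CLI call, and the branch names are made up):

```shell
# Launch one agent per feature branch in the background, then wait for all.
run_agent() { echo "agent on $1: done"; }

for branch in feature-login feature-search feature-export; do
  run_agent "$branch" &   # each background job is one agent running in parallel
done
wait                      # block until every background agent finishes
echo "all agents finished"
```

Because each agent lives in its own worktree, the jobs never contend for the same files; the only shared bottleneck is CPU, which is why sub-agent-heavy runs hit the ceiling first.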
Red Team: Attack Your Own Code
A red team is an independent group hired to attack your system. I set this up with AI agents.
The prompt: "You are external consultants. Your job is NOT to agree with things. Play devil's advocate. Attack our system for bugs, logic flaws, over-engineering, security vulnerabilities."
I use extra-high reasoning for this. The project manager agent writes 10 isolated GitHub issues for the red team to investigate. Each investigation agent gets a fresh worktree, runs real-world tests, writes a full report as a GitHub comment.
For anything involving real money (I'm building an AI-driven hedge fund), this hardening process is non-negotiable.
Beyond Dev: AI Departments
Most people use these tools only for development. But you can set up entire departments:
- Quant Researcher — Pulls from a vector database of 176,000 learnings. Designs new backtests.
- Backtesting Engineer — Runs backtests with specific setup instructions.
- Forensic Data Analyst — Analyzes backtest results day-by-day. Runs a 13-stage audit using scikit-learn and rule mining.
OpenClaw for Personal Workflows
I'm not using OpenClaw for dev work — that's all Warp and Codex. But for personal assistant duties it handles morning and evening reviews via Telegram voice notes, weekly/monthly/quarterly reviews, and habit nudges and reminders, all connected to my Notion system.
Quick Setup Tips
- Use Codex CLI directly — Not through Cursor. Direct OpenAI products give 2x usage limits right now.
- Git worktrees — Essential for parallel agent work.
- Separate agents for separate jobs — Fresh context windows prevent bias.
- Over-specify requirements — The PRD phase is where you add value.
- Red team regularly — Especially for anything involving real money.
Now is the best time to be a nerd. More innovation arrives every couple of months than 99% of humans saw in their entire lifetimes.