
From Skill to API to Product

Forge didn’t start as a product. It started as a command I typed when I wasn’t sure something was going to work.

/analyze

That’s it. A skill file in my CLI workflow. I’d be building a landing page, or reviewing a client’s dashboard, or staring at a mockup that felt off but I couldn’t articulate why. So I’d type /analyze and let it run. A few seconds later: a scored report. What’s working, what isn’t, and why, grounded in the five perception layers I’d refined over 15 years of design audits.

I never planned to extract it. I just kept using it. Hundreds of times.

How I Actually Use It

The formal name is Perception-First Design analysis. In practice, I say “PFD this.”

It runs on anything. A URL. A screenshot. A mockup file. A directory of HTML and CSS. Raw copy. A plan. An impasse. It doesn’t care about the input format. It cares about the question: will this make sense to a real person, and will it convert?

The output is a scored report across five cognitive layers:

- L0 Foundation: cognitive load. Can the brain even process this without overload?
- L1 First Impression: the 50-millisecond judgment. Pre-attentive pattern recognition and initial trust.
- L2 Processing Fluency: typography, spacing, and color relationships. Does it feel effortless?
- L3 Perception Bias: anchoring, framing, social proof. Working for you or against you?
- L4 Decision Architecture: is the path to conversion clear, or is there friction?

Each layer gets a score. Each finding gets a priority. Each prescription is specific enough to act on.
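To make the shape of that report concrete, here is a minimal sketch of how scores, findings, and prescriptions might hang together as a data structure. The names and the averaging rule are illustrative assumptions, not Forge’s actual schema or weighting.

```python
from dataclasses import dataclass, field

# Hypothetical shapes for a PFD-style report. Field names and the
# simple average below are assumptions for illustration only.

@dataclass
class Finding:
    layer: str         # e.g. "L2"
    priority: str      # "high" | "medium" | "low"
    prescription: str  # specific enough to act on

@dataclass
class Report:
    layer_scores: dict[str, int]  # per-layer score, 0-100
    findings: list[Finding] = field(default_factory=list)

    @property
    def overall(self) -> int:
        # Plain average across layers; the real weighting is not public.
        return round(sum(self.layer_scores.values()) / len(self.layer_scores))

report = Report(
    layer_scores={"L0": 70, "L1": 44, "L2": 38, "L3": 60, "L4": 55},
    findings=[Finding("L2", "high", "Regularize spacing rhythm in the feature grid")],
)
print(report.overall)  # prints 53
```

One overall number plus per-layer scores is what makes the loop in the next section measurable: you can see which layer moved after a fix, not just the aggregate.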

But the part that changed how I work isn’t the first pass. It’s the second one.

The Convergence Loop

One pass gives you a score and a list of findings. Useful. But the real value comes from running it again after you fix things.

First pass: 47. Twenty-two findings across four layers. The CTA competes with the nav. The font-size ramp breaks on mobile. The pricing section has three equal-weight options with no anchor.

I fix the findings. Then I type /analyze again.

Second pass: 68. Fourteen findings. The CTA issue is resolved. New finding: the hero image is fighting the headline for first-fixation. Didn’t notice that before because the CTA problem was louder.

Fix. Run again.

Third pass: 82. Seven findings, all medium or low priority. The score is climbing because each pass peels back a layer. Fixing the biggest violations reveals subtler ones underneath. The confidence increases because the analysis is looking at a cleaner surface each time.

Fourth pass: 91. Three findings, all low. The design is converging. Not perfect. Converging. The gap between passes is narrowing, which is the signal that you’re approaching the ceiling for this particular design.

That loop (score, fix, re-score) is the thing I couldn’t stop using. Not because I trust a number blindly, but because each pass forced me to look at the design through a specific cognitive lens, and the score gave me a way to tell whether the changes actually moved the needle or just felt like they did.
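The loop above can be sketched as code. Everything here is a stand-in: `analyze` is a toy model whose score climbs as issues are fixed, and the stopping rule ("the gap between passes is narrowing") is the one assumption carried over from the text.

```python
# Sketch of the score-fix-rescore convergence loop. `analyze` and
# `apply_fixes` are toy stand-ins, not the real PFD analysis.

def analyze(state):
    # Toy model: score rises toward a ceiling as open issues shrink.
    ceiling = 95
    return ceiling - 3 * len(state["open_issues"]), list(state["open_issues"])

def apply_fixes(state, findings):
    # Fix the loudest findings first; fixing the big ones is what
    # exposes the subtler ones on the next pass.
    for f in findings[:3]:
        state["open_issues"].discard(f)
    return state

def converge(state, min_gain=3, max_passes=10):
    scores = []
    for _ in range(max_passes):
        score, findings = analyze(state)
        scores.append(score)
        # Stop when the gap between passes narrows below min_gain:
        # the signal that this design is approaching its ceiling.
        if len(scores) >= 2 and scores[-1] - scores[-2] < min_gain:
            break
        state = apply_fixes(state, findings)
    return scores

passes = converge({"open_issues": set(range(16))})
print(passes)  # [47, 56, 65, 74, 83, 92, 95, 95]
```

The interesting part is the exit condition: the loop doesn’t stop at a target score, it stops when re-scoring stops paying off.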

Where It Shows Up

I use it at every decision point where the question is “will this work?”

Plans. Before I build, I run the plan through PFD. Not the code. The plan itself. Does the information architecture make sense? Is the hierarchy going to create the right first impression? Will the user path feel obvious or will people have to think about where to click? If the plan scores low, building it is a waste. Fix the plan first.

Impasses. I’m staring at a layout and something feels wrong. I can’t tell what. I run it. The report says L2 is at 38. The spacing rhythm in the feature grid is irregular, creating a processing fluency violation. That’s the thing I couldn’t name. Now I can fix it.

QA checks. Before anything ships, it gets a PFD pass. This is the “did I miss something?” moment. Client dashboards, landing pages, book chapter layouts, pricing tables. If it’s going to be seen by someone, it gets scored. I’ve caught things on the QA pass that I’d been looking at for hours without seeing.

Client work. When a client asks “why should we change this?” the report is the answer. Not “because I think so.” Because the L1 first-impression score is 44 and here are the three findings that explain why their bounce rate is what it is. Specific, scored, grounded in published cognitive science. It changed the conversation from opinion to measurement.

From a Skill File to an API

The skill file was a markdown document. A set of instructions that told the AI how to think like a PFD analyst. It encoded 15 years of design audits into a repeatable process: what to look at, in what order, how to score it, and how to write prescriptions that someone could actually act on.

It worked inside my terminal. It couldn’t work anywhere else.

The extraction started in early March. I pulled the methodology out of the skill file and into a Cloudflare Worker. Same five-layer analysis, same scoring, same prescription format. But now it was an API. Call it from anywhere. Send a URL, get a report back.

The API went live at pfd-api.aurochs.agency. Same day, I pointed my CLI skill at the API instead of running it locally. The experience was identical. The difference: now anyone with an API key could run the same analysis I’d been running for months.
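For a sense of what "call it from anywhere" looks like, here is a hypothetical request sketch against that host. The endpoint path, auth scheme, and payload fields are my assumptions, not a documented contract; only the hostname comes from the text.

```python
# Hypothetical call shape for the PFD API. The /analyze path, bearer
# auth, and {"url": ...} payload are assumptions for illustration.
import json
import urllib.request

def build_analyze_request(url_to_scan: str, api_key: str) -> urllib.request.Request:
    endpoint = "https://pfd-api.aurochs.agency/analyze"  # assumed path
    headers = {
        "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        "Content-Type": "application/json",
    }
    body = json.dumps({"url": url_to_scan}).encode()
    return urllib.request.Request(endpoint, data=body, headers=headers, method="POST")

# Actually sending it needs a real key, so it is left commented:
# with urllib.request.urlopen(build_analyze_request("https://example.com", key)) as r:
#     report = json.load(r)  # expected: layer scores, findings, prescriptions

req = build_analyze_request("https://example.com", "demo-key")
print(req.full_url, req.get_method())
```

Whatever the real schema is, the point stands: one POST with a URL in, one scored report out, callable from any workflow that can make an HTTP request.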

From API to Product

An API is useful if you already have a workflow to plug it into. Most people don’t. They need a place to go, paste a URL, and see a report.

That’s Forge.

Web app with auth, session management, and the convergence loop built into the UI. Scan a URL. See the scored report. Fix things. Scan again. Watch the score climb. Same workflow I’ve been running in my terminal, except now it’s a product anyone can use.

Three tiers:

- Solo (free): one analysis a week, three layers.
- Professional ($39/month): unlimited analyses, all five layers, the convergence loop, API access, MCP integration for Claude Code and ChatGPT.
- Agency ($129/month): everything in Professional, plus white-label reports and multi-client vaults.

The free tier exists because I want people to feel the first pass. One scored report is usually enough to know whether this is useful to you. If the findings make you nod, you’ll want the convergence loop. If they don’t, it’s not for you. No pitch required.

What It’s Not

It’s not a replacement for user testing. Scores reflect heuristic analysis of perceptual factors. They’re informed by 55 research citations and 15 years of applied audits, but they’re not eye-tracking data. Pair it with real user testing for definitive product decisions.

It’s not going to redesign your site for you. It tells you what’s wrong and why. The prescriptions are specific enough to act on, but acting on them is your job.

It’s not vibes. Every finding maps to a perception layer. Every score has a methodology behind it. When the report says L2 is at 38, it means the processing fluency of your layout is creating measurable friction against how the human visual system processes information. That’s not an opinion. That’s a diagnostic.

Why I’m Sharing It

The honest answer: I built it for myself and it made my work measurably better. Clients got clearer reports. Designs converged faster. I stopped second-guessing layouts because I had a scoring framework that caught things I missed.

Keeping it locked in my terminal would be the easy move: charge for the work it enables and let people find their own tools. But the methodology, Perception-First Design, is already public. The book chapters explaining the cognitive science are already published. The tool is the methodology applied. If I believe the methodology should be shared, the tool should be too.

I use it every day. On every project. On client work, on my own products, on this article. If the question “will this work, and will this make sense to a real person?” matters to you, that’s what Forge answers.