Coding agents can produce working UI fast. The harder problem is of a different kind. Agents can copy your product's style, match its patterns, and try to follow its conventions. What they cannot do is understand why those patterns exist. Code shows agents what shipped, not why one component, phrase, or interaction became your standard. That reasoning lives in design reviews, PR comments, Slack threads, and with the people who were in the room. For an agent, context that isn't in the codebase doesn't exist.
Vercel is an agent-native team. We treat accepted product decisions like code, keeping them in the repository, reviewing changes against them, and making them available to every agent working there.
The way we do this is through product-design. It's a system with three parts:
Any team can build the same structure around their own standards.
The skill lives inside the repository alongside the code it governs. Here's a simplified view of its structure:
SKILL.md resolves the request mode first: shape, implement, review, copy, or harden. This keeps audits from becoming edits and copy passes from expanding into redesigns. It skips backend-only work, telemetry, console errors, generated files, and tests with no shipped UI impact.
The skill routes to canonical sources instead of duplicating them. Component APIs, design-system rules, accessibility criteria, and interaction guidance stay with their owners.
Routing is specific to both task and surface. Material changes load product-judgment and interface-quality first. Copy, component, layout, interaction, accessibility, and resilience work each route to focused references. A modal loads destructive-action patterns and canonical verbs. A settings form loads labels, validation, progressive disclosure, and accessible-name guidance.
You can use this simplified structure as a starting point and replace the paths and standards with your own:
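As a minimal sketch of that routing, a lookup from surface to focused references might look like the following. The surface names and file paths are hypothetical placeholders, not the files in Vercel's repository.

```typescript
// Hypothetical surfaces and reference paths; replace them with your own.
type Surface = "modal" | "settings-form";

// References loaded first for any material change.
const BASE_REFERENCES: readonly string[] = [
  "references/product-judgment.md",
  "references/interface-quality.md",
];

// Focused references for each surface.
const SURFACE_REFERENCES: Record<Surface, readonly string[]> = {
  modal: [
    "references/destructive-actions.md",
    "references/canonical-verbs.md",
  ],
  "settings-form": [
    "references/labels.md",
    "references/validation.md",
    "references/progressive-disclosure.md",
    "references/accessible-names.md",
  ],
};

// Returns the ordered list of references to load for a surface.
function referencesFor(surface: Surface): string[] {
  return [...BASE_REFERENCES, ...SURFACE_REFERENCES[surface]];
}
```

Keeping the table as data rather than prose makes it easy to review a routing change the same way as any other code change.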
Routing is only part of what makes the skill useful. The other part is how findings stay traceable once the skill produces them.
Copy rules have stable IDs and point to their canonical sources:
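One way to model this, as a sketch only: each rule carries an ID and a pointer to its canonical source, and each finding cites a rule ID so it can be traced back. The ID format, field names, and path below are assumptions for illustration.

```typescript
// All IDs, field names, and paths here are hypothetical examples.
interface CopyRule {
  id: string;        // stable ID that does not change once published
  statement: string; // the observable decision
  source: string;    // canonical source the rule points to
}

interface Finding {
  ruleId: string;
  message: string;
}

const COPY_RULES: CopyRule[] = [
  {
    id: "COPY-001",
    statement: "Destructive actions use Verb + Noun",
    source: "references/destructive-actions.md",
  },
];

// Resolves a finding to the rule it cites, or undefined if the ID is unknown.
function traceFinding(finding: Finding, rules: CopyRule[]): CopyRule | undefined {
  return rules.find((rule) => rule.id === finding.ruleId);
}
```

A finding that resolves to undefined cites a rule that does not exist, which is worth flagging before the finding reaches a reviewer.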
When Vercel Agent proposes a patch, it validates the change in a secure Vercel Sandbox with the repository's builds, tests, and linters before posting the suggestion.
We prefer deterministic checks when a linter can enforce a rule reliably. Linters are fast and cheap to run, so developers and coding agents get feedback while they work instead of waiting for a later review.
Code can count two or three static options, so a linter can recommend radio buttons. Naming the right object and consequence for a destructive action requires product context, so the skill handles it.
Examples in the codebase include rules that:
Each rule explains why the pattern is a problem and suggests a concrete fix. Some rules autofix safe migrations, such as replacing deprecated Tailwind utility names.
Accepted decisions can take several forms:
The lint rule below shows how one product guideline is encoded as a deterministic check:
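A minimal sketch of such a check, written in the shape of an ESLint rule object (`meta` plus `create`) but with hand-rolled node types so it stays self-contained. It assumes native `select` and `option` JSX elements and a made-up rule name and message; it is an illustration, not Vercel's actual rule.

```typescript
// Minimal stand-ins for the JSX AST nodes an ESLint rule receives (assumed shapes).
interface JsxChild {
  type: string; // e.g. "JSXElement", "JSXText", "JSXExpressionContainer"
  openingElement?: { name: { name: string } };
}

interface JsxElement extends JsxChild {
  type: "JSXElement";
  openingElement: { name: { name: string } };
  children: JsxChild[];
}

interface RuleContext {
  report(descriptor: { node: JsxElement; message: string }): void;
}

// Hypothetical rule: recommend radio buttons for two or three static options.
const preferRadioForFewOptions = {
  meta: {
    type: "suggestion",
    docs: {
      description:
        "Recommend radio buttons instead of a select for two or three static options",
    },
  },
  create(context: RuleContext) {
    return {
      JSXElement(node: JsxElement): void {
        if (node.openingElement.name.name !== "select") return;

        // Drop text nodes, which in practice are whitespace between elements.
        const children = node.children.filter((c) => c.type !== "JSXText");

        // Only literal <option> elements count as static; a mapped list is skipped.
        const allStatic = children.every(
          (c) =>
            c.type === "JSXElement" &&
            c.openingElement !== undefined &&
            c.openingElement.name.name === "option",
        );
        if (!allStatic) return;

        if (children.length === 2 || children.length === 3) {
          context.report({
            node,
            message:
              "Use radio buttons for two or three static options so every choice stays visible.",
          });
        }
      },
    };
  },
};
```

The rule only counts what it can see statically. A select whose options come from a `.map()` call is left alone, because the linter cannot know how many options it will render.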
Each of these catches a class of mistake automatically, freeing code review for the decisions that actually require judgment.
Lint rules are deterministic, but agent behavior can vary, so we test the skill on interfaces it has not seen before.
An agent edits a before state, then a judge checks the results against a rubric.
Evals come from shipped examples documented in the skill. Holdouts hide their expected edits, testing whether the guidance generalizes. We also run fixtures without the skill to measure whether it changed the agent's behavior.
We score rule correctness separately from similarity to the shipped result. Shipped code can contain a flaw that the agent should improve instead of reproduce.
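To make the separation concrete, here is a sketch of how results could be recorded and compared. The field names and the 0 to 1 score range are assumptions; the source does not describe the real scoring format.

```typescript
// Hypothetical result shape; both scores are assumed to be in the range 0 to 1.
interface FixtureResult {
  fixture: string;
  withSkill: boolean;        // false for baseline runs without the skill
  ruleCorrectness: number;   // judged against the rubric
  shippedSimilarity: number; // closeness to the shipped result, kept separate
}

function mean(values: number[]): number {
  if (values.length === 0) return 0;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Average rule-correctness difference between runs with and without the skill.
// Similarity to shipped code is deliberately not part of this number.
function skillLift(results: FixtureResult[]): number {
  const withSkill = results.filter((r) => r.withSkill).map((r) => r.ruleCorrectness);
  const baseline = results.filter((r) => !r.withSkill).map((r) => r.ruleCorrectness);
  return mean(withSkill) - mean(baseline);
}
```

Because the two scores are stored side by side, a run can score high on rule correctness and low on similarity, which is the expected outcome when the agent fixes a flaw that shipped.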
Product standards change as components, names, workflows, and failure states change, and every update needs evidence and human review.
Our weekly evidence-intake workflow collects design feedback that may improve product-design. It searches Slack conversations and preserves links to Figma files, pull requests, review comments, and previews as evidence. When evidence is incomplete, it records the code or commit needed for verification.
The workflow separates collection from judgment:
Every candidate links to its source and remains pending. A comment from an experienced reviewer can raise its priority, but every candidate still needs evidence.
Automation ends with the review packet. A human decides whether a candidate becomes agent guidance, a lint rule, an example, an eval, or no change. Accepted changes go into the narrowest relevant file and pass the relevant checks before merging.
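The hand-off could be sketched like this. The types are hypothetical and cover only candidates and follow-up requests; rejected topics and coverage gaps are left out for brevity.

```typescript
// Hypothetical shapes for the evidence-intake output.
interface Candidate {
  topic: string;
  sources: string[]; // links to Slack, Figma, pull requests, or previews
  status: "pending"; // automation never promotes a candidate
}

interface ReviewPacket {
  candidates: Candidate[];
  followUps: string[]; // topics whose evidence is incomplete
}

// Splits collected topics into reviewable candidates and follow-up requests.
function buildPacket(collected: { topic: string; sources: string[] }[]): ReviewPacket {
  const packet: ReviewPacket = { candidates: [], followUps: [] };
  for (const item of collected) {
    if (item.sources.length === 0) {
      packet.followUps.push(item.topic);
    } else {
      packet.candidates.push({ topic: item.topic, sources: item.sources, status: "pending" });
    }
  }
  return packet;
}
```

The `status` field has a single allowed value on purpose: nothing in the automated path can mark a candidate as accepted.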
Our setup reflects Vercel's product, components, and review history, but other teams can adapt the structure to their own standards.
Choose one product surface where the same review comments keep appearing: destructive actions, error states, settings forms, empty states, or navigation. Collect examples from shipped code and real reviews, and write down the decision, why it matters, exceptions, and the source.
Avoid starting with broad adjectives like clear, polished, or intuitive. Agents need observable decisions. "Destructive actions use Verb + Noun" is usable. "Buttons should be clear" is not.
Fill in the fields specific to your surface before expanding to others.
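As a sketch of what those fields might be, the record below mirrors the four things to write down for each decision (the decision, why it matters, exceptions, and the source). The field names and the completeness check are assumptions.

```typescript
// Field names are an assumption based on the four items to record per decision.
interface DecisionRecord {
  surface: string;
  decision: string;
  rationale: string;    // why it matters
  exceptions: string[]; // may legitimately be empty
  source: string;       // link to the shipped code or review
}

// Returns the names of required text fields that are still blank.
function missingFields(record: DecisionRecord): string[] {
  const required: (keyof DecisionRecord)[] = ["surface", "decision", "rationale", "source"];
  return required.filter((key) => {
    const value = record[key];
    return typeof value === "string" && value.trim() === "";
  });
}
```

A check like this can run before a record is merged, so a decision without a rationale or source never becomes guidance.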
Tell agents when to load the skill in persistent repository instructions, and define the files and surfaces it covers along with the areas it must skip. In separate Next.js evals, agents failed to invoke an available skill in 56% of cases. Test the trigger separately from the guidance, because failing to load the skill and failing to follow a rule are different problems.
Ask the agent to report which surfaces and references it loaded, then verify that its findings cite those sources.
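That verification step can be a simple cross-check. The sketch below assumes a hypothetical report shape in which the agent lists the references it loaded and each finding names the source it relies on.

```typescript
// Hypothetical report shape an agent could be asked to return.
interface AgentReport {
  loadedReferences: string[];
  findings: { message: string; source: string }[];
}

// Returns the messages of findings whose cited source was never loaded.
function unsupportedFindings(report: AgentReport): string[] {
  return report.findings
    .filter((finding) => report.loadedReferences.indexOf(finding.source) === -1)
    .map((finding) => finding.message);
}
```

An empty result means every finding traces to a loaded reference. A non-empty result points to findings the agent produced without the guidance it claims to have used.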
Use a short entry point to identify the surface and load focused references. Organize the details around surfaces and decisions reviewers already discuss: forms, modals, navigation, product vocabulary, workflow states, and cross-surface patterns.
Give rules stable IDs and link them to examples and sources. Record shipped examples with both useful decisions and known flaws, and keep missing guidance visible in a coverage-gap list.
A coverage-gap list makes missing guidance explicit.
If a linter can identify a problem reliably, enforce the rule there. Use agent guidance when the decision needs product or codebase context. Keep new standards, policy choices, and unresolved product decisions with people.
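The triage above can be written down as a small decision function. This is a sketch with made-up trait names; real decisions will need more nuance than two booleans.

```typescript
type Placement = "linter" | "agent-guidance" | "human-decision";

// Hypothetical traits used to decide where a rule belongs.
interface RuleTraits {
  settled: boolean;            // the team has already accepted this decision
  reliablyDetectable: boolean; // static analysis can catch it without many exceptions
}

// Unsettled decisions stay with people; settled ones go to the linter only
// when code can check them reliably, otherwise to agent guidance.
function placeRule(traits: RuleTraits): Placement {
  if (!traits.settled) return "human-decision";
  return traits.reliablyDetectable ? "linter" : "agent-guidance";
}
```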
Build training fixtures from documented examples and holdouts from interfaces whose expected edits do not appear in the skill. Test retrieval and application separately, because whether the agent loaded the skill and whether it followed the rule are different questions.
If a rule cannot stay reliable without many exceptions, move it back to agent guidance.
Review new evidence regularly, but require human approval before changing the guidance or checks. Keep a decision log that records what changed, why, and which source supported it. Treat new rules as product changes, reviewing and testing each one, and removing those that stop helping.
Start with one surface and the decisions your team already repeats. Put those decisions where code is written and reviewed, and keep people responsible for what becomes a standard.
The hardest part is picking the first surface. Every team has decisions worth encoding. The question is whether they live in someone's head or somewhere agents can find them. If you build something using this pattern or have questions about how we set it up, let us know.
An agent skill that gives coding agents the context behind decisions that require product or codebase judgment.
Linters that enforce clear rules automatically.
A review loop that gathers evidence from Slack, Figma, and GitHub, then prepares guideline updates for review.
A collector gathers messages, links, and nearby context without proposing rules.
A separate judge groups the evidence, verifies sources, and records open questions.
The job creates a review packet with candidates, rejected topics, follow-up requests, and coverage gaps.
Inside the product-design skill
Use linters for faster feedback
How we test the guidance with evals
Keep the guidance current
How to build product-design into your codebase
Build your own
The repository AGENTS.md tells coding agents when to load the skill. The skill-local AGENTS.md defines load order, validation, and governance. SKILL.md owns the runtime workflow.
references/ stores product-judgment, interface-quality, resilience, copy, canonical product names, interaction patterns, and surface-specific decisions.
exemplars/ documents decisions worth repeating from shipped pull requests, along with mistakes to avoid. coverage-gaps.md lists areas where we do not have a standard yet.
copywriting-eval/ tests copy and interface-language behavior. It does not evaluate the broader product-design workflow.
Prevent nested modals, which break focus management, keyboard navigation, and layering.
Recommend radio buttons instead of a select for two or three static options, so every choice stays visible.
Require accessible names for icon buttons and form controls, and reject custom focus rings that bypass shared focus tokens.
Prevent className from overriding a design-system component's color, radius, or shadow while still allowing layout classes.
Require Modal.Body so long content scrolls correctly and headers and footers can remain sticky.
Replace raw shadows with theme-aware Material classes and reject borders that duplicate a Material's built-in treatment.
Flag arbitrary spacing that falls off the 4px grid and suggest a standard utility when one exists.
Human-readable guidance next to the relevant Geist component, such as Checkbox best practices.
Agent guidance in the product-design skill.
A lint rule when code can check it reliably.