Skills, Not Chatbots: Building an Internal AI Plugin System for an Engineering Team

Most engineering organizations adopting AI right now have landed in the same place. The team installs Cursor or Claude Code, the senior engineers love it, the junior engineers love it differently, and after six months you have a productivity story that nobody can quantify and a quality story that nobody wants to talk about.

The teams I have seen actually move the needle did something else. They stopped treating AI as a chatbot that engineers consult, and started treating it as a structured part of the workflow. Specifically, they built a skill library: a collection of named, scoped, auditable procedures that the AI executes on behalf of the engineer, with hard gates between each step.

This is the story of one of those skill libraries. The client is a B2B SaaS company with roughly thirty engineers. Their codebase is a Rails monolith on the backend, four mobile and web clients on the frontend, and a half-dozen specialized services around the edges. Their tickets come in through Jira and merge through GitHub.

After a year of building skills against that stack, the team had moved from “we use AI assistants” to “AI is a structured part of how we ship.” Routine tickets stopped going through senior-engineer escalation. PR-time regressions declined visibly (we tracked them on a single dashboard). Throughput per engineer on routine ticket types improved in a way the team could feel, though nobody pretended to have a single clean number for it. The wins were specific rather than headline-grabbing.

Here is what the skill library looks like, what survived, what we threw out, and the one architectural decision that matters more than all the others.

The architectural decision that matters more than all the others

Before the skill catalog, the prompt engineering, or the orchestration: every AI-generated change goes through an independent automated review before it can be opened as a PR. Not a linter pass. An actual review by a second AI agent that did not write the code, has not seen the chain of thought, and is instructed to be ruthless.

If that review fails, the code does not get to the human reviewer. The failures are surfaced to the original agent, which fixes them and re-runs the review. Only when the review passes does the PR get opened.
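The loop described above can be sketched in a few lines. This is a minimal illustration, not the team's implementation: `review` and `fix` are hypothetical callables standing in for the review subagent and the authoring agent, and the `max_attempts` cap is an assumption, since nothing here says what happens when the loop never converges.

```ruby
# Sketch of the gate loop. `review` and `fix` are hypothetical callables
# standing in for the review subagent and the authoring agent.
GateResult = Struct.new(:passed, :attempts, keyword_init: true)

def run_gate(review:, fix:, max_attempts: 3)
  1.upto(max_attempts) do |attempt|
    findings = review.call            # independent review of the current diff
    return GateResult.new(passed: true, attempts: attempt) if findings.empty?
    fix.call(findings)                # surface the failures to the original agent
  end
  # Cap reached: stop and escalate instead of opening the PR.
  # (Assumption: the cap and the escalation are illustrative.)
  GateResult.new(passed: false, attempts: max_attempts)
end

# Usage: a fake review that fails once, then passes.
pending = [["missing regression spec"], []]
result = run_gate(
  review: -> { pending.first },
  fix:    ->(_findings) { pending.shift }
)
open_pr = result.passed   # only a passing gate may open the PR
```

The property that matters is that nothing outside `run_gate` can open the PR: the only path to `open_pr` being true runs through an empty findings list.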

This is the hard gate. Without it, the rest of the skill library is a way of generating mediocre code at scale. With it, the skill library is a way of generating reviewed code at scale.

I will come back to this. It is the core insight. The catalog and the orchestration are interesting, but they are downstream of the gate.

The skill taxonomy

Skills are named workflows. A skill receives input (usually a ticket key or a path), runs a procedure, and produces output (a draft PR, a research note, a status update). Skills can invoke other skills.

The catalog we settled on, after culling roughly twice as many candidates:

workspace: Bootstrap and configure the developer environment. /workspace:setup, /workspace:reset-db, /workspace:setup-secrets. Replaces the README that nobody reads.

learn: Structured research and synthesis. /learn:plan-research, /learn:collect-research, /learn:synthesize-research, /learn:plan-experiment, /learn:writeup-experiment-results, /learn:journal. Forces the messy “go figure out X” task into a discrete artifact.

work: Time-boxed work tracking. /work:start-card, /work:start-pomodoro, /work:finish-pomodoro, /work:finish-card, /work:standup, /work:handoff. Wraps the day’s work into auditable units.

ticket (the spine): The end-to-end implementation pipeline. ticket-open (set up workspace, fetch Jira, gather context), ticket-engineer (analysis and design), ticket-write-plan (implementation plan), ticket-create-pr (push branch, open PR, transition Jira), ticket-add-testing-instructions, ticket-retro. Every routine ticket runs through this chain.

pr-review: The hard gate. Spawned as a subagent against any diff (including diffs from ticket runs). Produces a structured review and a response artifact. Discussed below.

use-jira / use-github / use-slack / use-newrelic / use-testing: Convention bundles. Loaded by other skills on demand to keep the main context lean. Each one teaches the agent the team’s specific norms for that surface.

ops-hotfix: An expedited path for production fixes. Skips some pipeline steps, mandates others (a regression spec), and routes the PR differently.

A skill is plain text plus a header. We distribute them through Claude Code’s plugin marketplace mechanism, which means registering the library is one command (individual plugins are then installed from it):

/plugin marketplace add Your-Org/skills

Skills are versioned in git like any other code. They get reviewed at PR time the same way. They emit telemetry when invoked. They are not magic.
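To make "plain text plus a header" concrete, here is a small sketch that splits a skill file into its header and body. It assumes the header is YAML front matter delimited by `---` lines; the field names and the sample text are illustrative, not copied from the team's library.

```ruby
require "yaml"

# Hypothetical skill file: YAML front matter, then plain-text instructions.
SKILL_TEXT = <<~SKILL
  ---
  name: ticket-open
  description: Set up workspace, fetch Jira, gather context
  ---
  1. Fetch the ticket.
  2. Create the branch.
SKILL

# Split a skill file into its header (a Hash) and its body (a String).
def parse_skill(text)
  match = text.match(/\A---\n(.*?)\n---\n(.*)\z/m)
  raise ArgumentError, "skill is missing a header" unless match
  [YAML.safe_load(match[1]), match[2]]
end

header, body = parse_skill(SKILL_TEXT)
```

Because the file is ordinary text, a diff of a skill change reads like any other diff at PR time.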

The ticket pipeline (end to end)

The pipeline that runs when an engineer says “work on TICKET-1234”:

ticket-open       → set up workspace, fetch Jira, gather code context
       |
ticket-engineer   → write engineering analysis artifact
       |
ticket-write-plan → write implementation plan artifact
       |
implementation    → write code and tests, commit locally
       |
   *** HARD GATE — DO NOT PROCEED UNTIL SELF-REVIEW IS COMPLETE ***
   pr-review      → spawn independent review subagent
       |            review artifact and response artifact MUST exist
   fix findings   → address all findings, update response artifact
       |
 ticket-create-pr → push, open draft PR, transition Jira
       |
post-PR babysit   → watch automated review (Greptile etc), watch CI

The whole thing runs without human input. Artifacts are produced at every stage, written to ~/Desktop/tickets/TICKET-1234/. The human reviews at the end.

The artifacts are not decoration. They are how the pipeline catches itself when it goes wrong. If the engineering analysis turns out to be wrong, you can read the artifact, see the moment of the wrong turn, and fix the skill. If the plan turns out to omit a critical concern, that omission is documented and you can update the skill template.

This is the part most “AI for engineering teams” pitches skip. The pipeline is observable because the artifacts are durable. The artifacts are durable because we treat them as deliverables, not exhaust.
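One way to enforce "artifacts are deliverables" is to have the runner refuse to continue when a stage leaves nothing behind. The sketch below assumes that approach. The stage names come from the pipeline above, but the artifact filenames and the `runner` callable are hypothetical, and it writes to a temporary directory rather than the real ticket folder.

```ruby
require "fileutils"
require "tmpdir"

# Stage names follow the pipeline above; the artifact filenames are
# hypothetical, chosen only for illustration.
STAGES = {
  "ticket-open"       => "context.md",
  "ticket-engineer"   => "engineering-analysis.md",
  "ticket-write-plan" => "implementation-plan.md"
}.freeze

# Run each stage in order and refuse to continue if a stage did not leave
# its artifact behind. `runner` is a stand-in for invoking the real skill.
def run_pipeline(ticket_dir, runner)
  FileUtils.mkdir_p(ticket_dir)
  STAGES.map do |stage, artifact|
    path = File.join(ticket_dir, artifact)
    runner.call(stage, path)
    raise "#{stage} produced no artifact at #{path}" unless File.exist?(path)
    path
  end
end

# Usage with a fake runner that just writes a placeholder file.
fake_runner = ->(stage, path) { File.write(path, "# #{stage}\n") }
artifacts = Dir.mktmpdir do |root|
  run_pipeline(File.join(root, "TICKET-1234"), fake_runner).map { |p| File.basename(p) }
end
```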

The hard gate: automated PR review before the PR opens

Every PR generated by the pipeline goes through an independent review before it is opened. The review runs as a subagent, in a fresh context, against the local diff vs the target branch.

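# Conceptual sketch: the actual implementation is a skill in markdown,
# but the shape of what gets spawned looks like this. Agent.spawn is a
# stand-in for the subagent launcher, not a real library call.
require "date"

# Example values; the real skill receives these from the pipeline.
ticket_key = "TICKET-1234"       # ticket whose branch is under review
repo_path  = Dir.pwd             # local checkout holding that branch
date       = Date.today.iso8601  # stamps the artifact filenames

Agent.spawn(
  description: "Self-review #{ticket_key}",
  prompt: <<~PROMPT,
    Run /pr-review against the current branch's diff vs dev in #{repo_path}.
    Write the review to ~/Desktop/pr-reviews/#{ticket_key}-review-#{date}.md
    and a response template to ~/Desktop/pr-reviews/#{ticket_key}-response-#{date}.md.

    Apply the full pr-review skill: phases, agentic checks, autonomous dispositions.
    This is a self-review before PR creation. Be ruthless.
  PROMPT
)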

The review covers:

Code correctness: does the diff do what the plan said it would do, and does it do it right?
Test coverage: are all conditional branches covered? Are bug fixes accompanied by a regression spec that fails on revert?
Security: nil guards, authorization scope, HTML escaping, encryption-key purpose-isolation, rescue handling on non-critical external API calls.
Data integrity: do update_all and update_columns paths update all denormalized columns the callback would have written? Are empty-state cases explicitly cleared?
N+1 prevention: do preload chains cover all code paths, including feature-flag branches?
Locale parity: keys added to one locale must be added to all locales.
Style and idiom: framework idioms enforced.
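Several of these checks are mechanical enough to show directly. As one example, locale parity reduces to comparing flattened key sets. The sketch below assumes locale files have already been parsed into nested hashes; the helper names and the sample data are illustrative.

```ruby
# Flatten a nested translation hash into dotted keys, e.g. "errors.blank".
def flatten_keys(tree, prefix = nil)
  tree.flat_map do |key, value|
    full = [prefix, key].compact.join(".")
    value.is_a?(Hash) ? flatten_keys(value, full) : [full]
  end
end

# Return, per locale, the keys that exist in some other locale but not here.
def missing_locale_keys(locales)
  all_keys = locales.values.flat_map { |tree| flatten_keys(tree) }.uniq
  locales.each_with_object({}) do |(locale, tree), gaps|
    missing = all_keys - flatten_keys(tree)
    gaps[locale] = missing unless missing.empty?
  end
end

# Usage: the inline hashes stand in for parsed locale files.
locales = {
  "en" => { "errors" => { "blank" => "can't be blank", "taken" => "is taken" } },
  "fr" => { "errors" => { "blank" => "obligatoire" } }
}
gaps = missing_locale_keys(locales)
```

Here `gaps` names `errors.taken` as missing from `fr`, which is the kind of finding the reviewer would raise as Changes Requested.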

Each finding gets a disposition: Changes Requested, Optional, or Acknowledged-False-Positive. The pipeline cannot proceed past the gate until every Changes Requested finding has been addressed in code, every Optional finding has been considered, and every false positive has been documented with a reason.
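The gate condition itself is a small predicate over the findings. This sketch uses the three dispositions named above; the `Finding` fields (`addressed`, `considered`, `reason`) are assumptions about how a response artifact might be represented, not the team's schema.

```ruby
# Dispositions are the three named in the text; the Finding fields are assumptions.
Finding = Struct.new(:disposition, :addressed, :considered, :reason, keyword_init: true)

# A finding is resolved when the obligation attached to its disposition is met.
def resolved?(finding)
  case finding.disposition
  when :changes_requested           then finding.addressed == true
  when :optional                    then finding.considered == true
  when :acknowledged_false_positive then !finding.reason.to_s.strip.empty?
  else false # unknown dispositions never pass the gate
  end
end

# The gate opens only when every finding is resolved.
def gate_open?(findings)
  findings.all? { |finding| resolved?(finding) }
end

findings = [
  Finding.new(disposition: :changes_requested, addressed: true),
  Finding.new(disposition: :optional, considered: true),
  Finding.new(disposition: :acknowledged_false_positive, reason: "")
]
blocked = !gate_open?(findings) # an undocumented false positive blocks the PR
findings.last.reason = "guard already exists upstream in the controller"
```

After the reason is filled in, `gate_open?(findings)` returns true and the pipeline may proceed.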

The response artifact is the receipt. It is checked in if the team wants traceability, or kept local if not. The point is that it exists.

The two failure modes the hard gate is designed to prevent:

Skipping review for small changes. It is always tempting. “It is a one-liner, I do not need to review it.” The hard gate is non-negotiable so this temptation never wins. Even for one-line changes, the review runs and produces an artifact. If the change is genuinely trivial the artifact is two paragraphs.
Self-review-in-the-same-context. Asking a model to review code it just wrote is theater. The model is biased toward its own output. The review must run in a fresh subagent with no shared context. This is why we spawn a separate agent rather than asking the same agent to “review your work.”

When this gate is in place, the rest of the pipeline becomes safe. When it is not, the pipeline is a way to generate plausible code faster than humans can review it. That is worse than no pipeline at all.

Audit trails

Every skill invocation emits structured telemetry: caller, skill name, version, ticket key, duration, outcome, and artifacts produced. Events are stored append-only.

When a PR causes a regression in production, the audit trail tells us which skill chain produced it, which review caught (or missed) the relevant concern, and which artifact contains the reasoning at the moment the wrong decision was made. This makes skills improvable. Without the audit trail, a regression is anonymous and the relevant skill stays broken forever.

The simplest implementation: a hook fires at skill entry and exit, appends a JSON line to a log, and the log is rotated weekly into a database table for analysis.

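# Pseudocode for the entry hook. SkillTelemetry and the skill_name,
# skill_version, parent_skill, and current_ticket_key values are assumed
# to be supplied by the hook runner.
require "securerandom" # SecureRandom.uuid
require "time"         # Time#iso8601

SkillTelemetry.emit(
  event: "skill_entry",
  skill: skill_name,
  version: skill_version,
  caller: parent_skill || "user",   # top-level invocations come from the user
  ticket: current_ticket_key,
  invocation_id: SecureRandom.uuid, # joins this entry to its exit event
  started_at: Time.now.iso8601
)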

The exit hook closes the loop with outcome, duration, and artifact paths. Aggregated views answer questions like “which skill produced the most rolled-back PRs?” and “which skill chain has the highest fix-rate-on-second-attempt?”
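A runnable sketch of the entry/exit pair and one aggregated view follows. It assumes a JSON-lines file as the append-only log and uses "failed exits per skill" as a simpler stand-in for the rolled-back-PR question, since rollback data would come from elsewhere. The function names are hypothetical.

```ruby
require "json"
require "securerandom"
require "tempfile"
require "time"

# Append one event to the log as a single JSON line.
def emit(log_path, event)
  File.open(log_path, "a") { |f| f.puts(JSON.generate(event)) }
end

# Wrap a skill invocation with entry and exit events that share an invocation_id.
def with_telemetry(log_path, skill:, ticket:)
  id = SecureRandom.uuid
  started = Time.now
  emit(log_path, event: "skill_entry", skill: skill, ticket: ticket,
                 invocation_id: id, started_at: started.iso8601)
  outcome = "failure"
  result = yield
  outcome = "success"
  result
ensure
  # Runs on success and on exceptions, so every entry gets a matching exit.
  emit(log_path, event: "skill_exit", skill: skill, ticket: ticket,
                 invocation_id: id, outcome: outcome,
                 duration_s: (Time.now - started).round(3))
end

# Aggregated view: count failed exits per skill.
def failures_by_skill(log_path)
  File.readlines(log_path, chomp: true)
      .map { |line| JSON.parse(line) }
      .select { |e| e["event"] == "skill_exit" && e["outcome"] == "failure" }
      .map { |e| e["skill"] }
      .tally
end

log = Tempfile.new(["skills", ".jsonl"])
with_telemetry(log.path, skill: "ticket-open", ticket: "TICKET-1234") { :ok }
begin
  with_telemetry(log.path, skill: "pr-review", ticket: "TICKET-1234") { raise "boom" }
rescue RuntimeError
  # The exit event has still been written by the ensure clause.
end
failures = failures_by_skill(log.path)
```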

Multi-agent parallelism

The other operational pattern that earned its keep: running several skill chains in parallel against different tickets.

The shape is a supervisor process (we call it “daddy” because the original prototype was lighthearted and the name stuck) that pulls tickets from a queue, spawns a worker per ticket, watches each worker’s progress through skill hooks, and arbitrates when workers need shared resources (the same database, the same fixture state, the same git branch).
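The supervisor shape can be sketched with standard-library threads. This is a simplification under stated assumptions: threads stand in for worker processes, a fixed pool replaces one worker per ticket, a thread-safe queue stands in for skill hooks, and the resource names and states are illustrative.

```ruby
# One mutex per shared resource; workers must hold it to touch the resource.
SHARED = { database: Mutex.new, fixtures: Mutex.new }.freeze

def run_supervisor(tickets, worker_count: 4)
  queue = Queue.new
  tickets.each { |t| queue << t }
  queue.close                      # pop returns nil once the queue is drained
  progress = Queue.new             # thread-safe channel standing in for hooks

  workers = Array.new(worker_count) do |slot|
    Thread.new do
      while (ticket = queue.pop)
        progress << [slot, ticket, :started]
        # Arbitration: only one worker touches the shared database at a time.
        SHARED[:database].synchronize { progress << [slot, ticket, :migrated] }
        progress << [slot, ticket, :ready_for_human]
      end
    end
  end
  workers.each(&:join)

  events = []
  events << progress.pop until progress.empty?
  events
end

events = run_supervisor(%w[TICKET-1 TICKET-2 TICKET-3 TICKET-4 TICKET-5])
ready = events.select { |_, _, state| state == :ready_for_human }.map { |_, t, _| t }.sort
```

The supervisor only ever reads `progress`; it never inspects a worker's terminal.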

Each worker runs in its own git working copy. Multiple working copies are cheaper than they used to be: git worktree plus a per-worker Docker Compose project gives each worker an isolated codebase view. Tmux is the visibility layer, not the control layer. Workers communicate state through hooks, not through terminal scraping. This last point is important. Driving Claude Code by sending keystrokes to a terminal works for demos and fails overnight; driving it through subagent lifecycle hooks works overnight.
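As a sketch of the isolation step, the function below builds (but does not run) the two commands that give a worker its own checkout and its own Compose project. The `git worktree add` and `docker compose -p` invocations are standard; the root path, branch naming scheme, and project naming scheme are assumptions.

```ruby
# Build the commands that isolate one worker. Nothing is executed here.
# The root path and naming schemes are illustrative assumptions.
def worker_commands(ticket_key, base_branch: "dev", root: "/tmp/workers")
  slug = ticket_key.downcase                 # Compose project names must be lowercase
  worktree = File.join(root, slug)
  [
    # New branch in a separate working directory, based on the target branch.
    ["git", "worktree", "add", "-b", "work/#{slug}", worktree, base_branch],
    # Separate project name keeps containers and networks apart per worker.
    ["docker", "compose", "-p", "worker-#{slug}", "--project-directory", worktree, "up", "-d"]
  ]
end

commands = worker_commands("TICKET-1234")
commands.each { |argv| puts argv.join(" ") }
```

Returning argument arrays rather than strings means a caller can pass each one to `system(*argv)` without shell quoting concerns.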

The result is a team that can run four tickets in parallel without four engineers. The supervisor reports back to the human when each ticket is at a meaningful decision point. The human stays in the loop on judgment; the workers handle the mechanical parts.

What survived and what we threw out

Threw out:

A general-purpose “do anything” skill. The model used it confidently and incorrectly. Replaced with narrow, intent-named skills.
A skill that auto-merged PRs once CI went green. Even with the hard gate, the team wanted a human to make the merge decision. Removed and never missed.
A “review the last commit” inline review pattern. Theater. Replaced with the subagent gate.
Twenty-plus convention bundles. Most never got loaded. Trimmed to the five in the catalog above, the ones that earn their keep.

Survived:

The ticket pipeline as written above.
The pr-review subagent gate as a hard, non-negotiable step.
The per-skill audit trail.
The supervisor-and-workers parallelism pattern.
The convention bundles for the surfaces the team interacts with daily.
The “test plan in the PR description, generated by the skill, edited by the human” pattern.

Why this is a real productivity gain and not vibes

The honest answer to “did this make the team faster” is two layers deep.

Layer one. Routine tickets that used to take a senior engineer four hours of focused time now take the same engineer twenty minutes of supervision. The pipeline does the rote work (Jira retrieval, branch setup, plan documentation, draft PR, Jira transitions, test instructions). The engineer does the judgment work (is the plan right? is the diff correct? does the test actually test what we care about?).

Layer two. The hard gate raises the floor of quality. PRs that would previously have gotten a human “LGTM with comments” now arrive with the comments already addressed, because the subagent gate caught them. Reviewer time is spent on the architectural and product-judgment questions, not on “you forgot to preload location.”

Neither layer is magic. Both compound. Six months in, the senior engineers’ calendars had opened up in a way you could feel from a standup (“nobody escalated to me today” became normal); the junior engineers had a structured pipeline that taught them what good looks like by example; and the rate of pull requests opening with no obvious correctness or coverage gap had risen visibly. We did not chase a single dramatic productivity number, because the honest answer is that the wins compound across many ticket types unevenly. The qualitative shift, after six months, was not subtle.

The takeaway

If you are setting up an AI workflow for an engineering team, the single most important decision is whether you build a hard automated review gate before code reaches the human reviewer. With the gate, every other piece compounds. Without the gate, you have built a system that produces faster mediocre code.

The skill catalog is the second-most-important piece. Narrow, intent-named skills that own one job each. Plain text, versioned in git, distributed by the same mechanism your team already uses.

The audit trail is third. Without it, skills do not improve, because failures are anonymous.

The orchestration is fourth. The supervisor-and-workers pattern is genuinely valuable but only after the first three are in place.

This was one of several internal-AI engagements. If your team has Cursor adoption and not much else to show for it, the skills-and-gates pattern is what gets you from “we use AI” to “AI is part of how we ship.”