Methodology rung1.v1

How we score an MCP server

A score you cannot check is a score you cannot trust. So here is the whole method, in plain language. Every number on MCP Verdict comes from these rules, applied to evaluations we run by installing the server and exercising it. The score comes from behavior, not from a manifest. Nothing else feeds it.

The grade

Each entry gets a composite from 0 to 100 and a letter grade derived from it. The letter and the number always travel together.

Grade	Composite	Meaning
A	90 to 100	Strong. Our default kind of pick.
B	80 to 89	Good, with a caveat worth reading.
C	70 to 79	Usable, with real rough edges.
D	55 to 69	Weak. Use with caution, if at all.
F	0 to 54	Do not rely on it.
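The mapping from composite to letter can be sketched as a simple threshold lookup. This is a minimal illustration, not the production code. The function name is hypothetical, and it assumes a fractional composite such as 89.5 belongs to the lower band (B), which the grade bands above do not state explicitly.

```python
# Lower bound of each letter band, taken from the grade bands above.
GRADE_FLOORS = [(90, "A"), (80, "B"), (70, "C"), (55, "D")]


def letter_grade(composite: float) -> str:
    """Map a 0-100 composite to its letter grade.

    Assumption: each band's lower bound is a threshold, so a
    fractional composite like 89.5 stays in the lower band (B).
    """
    if not 0 <= composite <= 100:
        raise ValueError("composite must be between 0 and 100")
    for floor, letter in GRADE_FLOORS:
        if composite >= floor:
            return letter
    return "F"
```

For instance, `letter_grade(98.1)` falls in the A band, and anything below 55 falls through to F.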

The composite

The composite is a weighted blend of four sub-scores. Functionality and reliability dominate, because a registry first has to find what actually works. Security carries the next-largest weight. Latency matters least, since a slower correct tool can still be the right call.

Sub-score	Weight
Functional	35%
Reliability	30%
Latency	15%
Security	20%
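As a sketch, the blend is a plain weighted sum of the four sub-scores. The function name and the rounding to one decimal place are assumptions made for illustration. The weights are the ones in the table above.

```python
# Weights from the table above; they sum to 1.0.
WEIGHTS = {
    "functional": 0.35,
    "reliability": 0.30,
    "latency": 0.15,
    "security": 0.20,
}


def composite_score(sub_scores: dict) -> float:
    """Blend four 0-100 sub-scores into a 0-100 composite.

    Assumption: the result is rounded to one decimal place.
    """
    missing = WEIGHTS.keys() - sub_scores.keys()
    if missing:
        raise KeyError(f"missing sub-scores: {sorted(missing)}")
    total = sum(WEIGHTS[name] * sub_scores[name] for name in WEIGHTS)
    return round(total, 1)
```

With sub-scores of 100, 95, 100, and 98 (functional, reliability, latency, security), the sum is 35 + 28.5 + 15 + 19.6, or 98.1.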

The four sub-scores

Functional

Does the tool do what its entry claims? A pass scores 100, a partial scores 60, a fail scores 0. We average across every evaluation.
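A minimal sketch of that average, assuming each evaluation records its outcome as one of the strings "pass", "partial", or "fail" (the labels and the function name are hypothetical):

```python
# Points per outcome, as stated above.
FUNCTIONAL_POINTS = {"pass": 100, "partial": 60, "fail": 0}


def functional_score(outcomes: list) -> float:
    """Average the per-evaluation functional points."""
    if not outcomes:
        raise ValueError("at least one evaluation is required")
    return sum(FUNCTIONAL_POINTS[o] for o in outcomes) / len(outcomes)
```

Two passes, one partial, and one fail would average to (100 + 100 + 60 + 0) / 4 = 65.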

Reliability

Repeated success over many attempts, recorded as a 0 to 100 score per run and averaged. A run with no reliability measure defaults to 50, so an incomplete test never quietly looks perfect.
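The default can be sketched as follows, assuming a missing measure is represented as `None` (that representation and the function name are assumptions for illustration):

```python
from typing import Optional

MISSING_DEFAULT = 50.0  # stated default for a run with no reliability measure


def reliability_score(runs: list) -> float:
    """Average per-run reliability, treating None as the 50-point default."""
    if not runs:
        raise ValueError("at least one run is required")
    filled = [MISSING_DEFAULT if r is None else float(r) for r in runs]
    return sum(filled) / len(filled)
```

Runs of 100, no measure, and 90 would average to (100 + 50 + 90) / 3 = 80, so the incomplete run pulls the score down rather than being skipped.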

Latency

Measured response time, scored on a curve. At or under 250ms scores 100. From 250ms to 1000ms it falls from 100 to 70. From 1000ms to 3000ms it falls from 70 to 35. Past 3000ms it keeps falling toward 0. A missing measure defaults to 50.
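The curve can be sketched as a piecewise function. Two parts of this sketch are assumptions, since the text gives only the breakpoints: the score is interpolated linearly inside each segment, and past 3000ms it keeps the slope of the previous segment until it reaches 0 (at 5000ms under that assumption). The function name is hypothetical.

```python
from typing import Optional


def latency_score(ms: Optional[float]) -> float:
    """Score a response time in milliseconds on the 0-100 curve."""
    if ms is None:
        return 50.0  # stated default for a missing measure
    if ms < 0:
        raise ValueError("latency cannot be negative")
    if ms <= 250:
        return 100.0
    if ms <= 1000:
        # 100 at 250ms down to 70 at 1000ms (linear: assumption)
        return 100.0 - 30.0 * (ms - 250) / 750
    if ms <= 3000:
        # 70 at 1000ms down to 35 at 3000ms (linear: assumption)
        return 70.0 - 35.0 * (ms - 1000) / 2000
    # Tail shape is an assumption: same slope as before, floored at 0.
    return max(0.0, 35.0 - 35.0 * (ms - 3000) / 2000)
```

Under these assumptions a 625ms response scores 85 and a 2000ms response scores 52.5.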

Security

Each evaluation starts at 100 and loses points for every flag, by severity. The result is clamped to a 0 to 100 range and averaged.

info	minus 2
low	minus 8
medium	minus 25
high	minus 60

One rule overrides the math. An unresolved high-severity flag, meaning a tool can leak credentials or data, run arbitrary code, or reach outside its declared scope, caps the composite at 40 no matter what the weighted score would have been, and the entry shows a warning. Resolved flags stop penalizing. Medium flags lower the security sub-score but do not cap the composite.
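The penalty math and the override can be sketched together. The `Flag` type and the function names are hypothetical, and the sketch assumes each flag carries a severity and a resolved marker. Resolved flags are skipped, scores are clamped per evaluation, and the cap is applied to the composite after weighting.

```python
from dataclasses import dataclass

# Points lost per unresolved flag, by severity.
PENALTY = {"info": 2, "low": 8, "medium": 25, "high": 60}
HIGH_SEVERITY_CAP = 40.0


@dataclass(frozen=True)
class Flag:
    severity: str
    resolved: bool = False


def evaluation_security(flags: list) -> float:
    """Start at 100, subtract unresolved penalties, clamp to 0-100."""
    lost = sum(PENALTY[f.severity] for f in flags if not f.resolved)
    return float(max(0, min(100, 100 - lost)))


def security_score(evaluations: list) -> float:
    """Average the clamped per-evaluation scores."""
    if not evaluations:
        raise ValueError("at least one evaluation is required")
    return sum(evaluation_security(e) for e in evaluations) / len(evaluations)


def apply_high_severity_cap(composite: float, evaluations: list) -> float:
    """Cap the composite at 40 if any high-severity flag is unresolved."""
    unresolved_high = any(
        f.severity == "high" and not f.resolved
        for flags in evaluations
        for f in flags
    )
    return min(composite, HIGH_SEVERITY_CAP) if unresolved_high else composite
```

An evaluation with one low and one info flag scores 90. One with a high and two medium flags would go negative and is clamped to 0. Once the high flag is marked resolved, the cap no longer applies.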

Automated scanners are one input here, not the whole of it. We draw on the open-source tools in this space (mcp-scan from Invariant Labs, Cisco's MCP Scanner) and add a human review on top, because static analysis alone is uneven. The number gets the machine pass. The verdict gets the human.

Confidence

A score is only as trustworthy as the testing behind it. Confidence runs from 0 to 1 and rises with the number of evaluations: it starts at 0.50 for a single run and climbs to a maximum of 0.95. A great-looking score with one test behind it therefore shows a high composite and a low confidence at once, on purpose. At Rung 3, confidence will also account for how many different camps of builders agree.
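Only the two endpoints are stated here, 0.50 for one run and 0.95 at the top, so the shape of the climb in the sketch below is purely an assumption. It uses a linear ramp that reaches the ceiling after a hypothetical ten evaluations. The function name and `FULL_CONFIDENCE_RUNS` are illustrative, not part of the method.

```python
CONFIDENCE_FLOOR = 0.50    # stated value for a single run
CONFIDENCE_CEILING = 0.95  # stated maximum
FULL_CONFIDENCE_RUNS = 10  # hypothetical: not specified by the method


def confidence(evaluation_count: int) -> float:
    """Illustrative confidence ramp between the two stated endpoints.

    Assumption: linear growth per evaluation, capped at the ceiling.
    """
    if evaluation_count < 1:
        raise ValueError("confidence needs at least one evaluation")
    steps = min(evaluation_count, FULL_CONFIDENCE_RUNS) - 1
    span = CONFIDENCE_CEILING - CONFIDENCE_FLOOR
    return CONFIDENCE_FLOOR + span * steps / (FULL_CONFIDENCE_RUNS - 1)
```

Whatever the real curve is, the property that matters is the same: a single run can never report more than 0.50, however good its composite looks.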

What we refuse to use

The method is as much about the inputs we reject as the ones we keep. The score does not use any of these.

Vendor or model-provider identity. A tool's maker never moves its score.
Marketplace popularity, stars, or download counts. Attention is not quality.
Sponsorship or payment. Placement is not for sale at any rung.
Raw community votes. When the community arrives, bridging consensus replaces vote-counting, so a fan club cannot buy a grade.

The score, by API

A score is meant to be queried, not just read. The stable, machine-readable view of any entry's score lives at a single endpoint. This is the seam that, at Rung 4, lets package managers, frameworks, and IDEs ask MCP Verdict directly.

GET /api/scores/{id}

{
  "entry_id": "ddfe0583-e4ef-5e8c-bd05-23b3a2e5242f",
  "composite": 98.1,
  "sub_scores": {
    "functional": 100,
    "reliability": 95,
    "latency": 100,
    "security": 98
  },
  "confidence": 0.5,
  "methodology_version": "rung1.v1",
  "last_computed_at": "2026-06-07T00:00:00Z"
}
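A client might consume this response roughly as follows. This is a sketch using only the Python standard library. The base URL is a placeholder, and the `Score` type and function names are hypothetical. The last helper recomputes the composite from the returned sub-scores using the published weights, which is one way to check a score rather than take it on trust. It ignores the high-severity cap, so a capped entry would legitimately differ.

```python
import json
from dataclasses import dataclass
from urllib.request import urlopen

BASE_URL = "https://mcp-verdict.example"  # placeholder, not the real host

WEIGHTS = {"functional": 0.35, "reliability": 0.30,
           "latency": 0.15, "security": 0.20}


@dataclass(frozen=True)
class Score:
    entry_id: str
    composite: float
    sub_scores: dict
    confidence: float
    methodology_version: str
    last_computed_at: str


def parse_score(body: str) -> Score:
    """Parse the JSON body shown above into a Score."""
    data = json.loads(body)
    return Score(**{name: data[name] for name in Score.__dataclass_fields__})


def fetch_score(entry_id: str, base_url: str = BASE_URL,
                timeout: float = 10.0) -> Score:
    """GET /api/scores/{id} and parse the response."""
    with urlopen(f"{base_url}/api/scores/{entry_id}", timeout=timeout) as resp:
        return parse_score(resp.read().decode("utf-8"))


def recompute_composite(score: Score) -> float:
    """Weighted blend of the returned sub-scores (cap rule not applied)."""
    total = sum(WEIGHTS[k] * score.sub_scores[k] for k in WEIGHTS)
    return round(total, 1)
```

For the sample response above, `recompute_composite` gives 98.1, matching the reported composite.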

Before a failing grade goes public, the maintainer gets the full evaluation and a window to fix it or correct us. Read the disclosure policy.

Found a problem with the method, or a tool we should test? Submit it for review.