Neutrality is the product

Everyone else scores the packaging. We run the server.

MCP Verdict is a registry with one job: tell you which MCP servers actually work and are safe to point an agent at. We do not host a directory, we do not sell a model, and no vendor pays to move a score. That independence is not a feature we could trade for a partnership. It is the whole product.

A score built from a project's own metadata is a score its author controls.

Several products now put a number on an MCP server. Read how they get the number and the pattern is the same: GitHub stars, manifest completeness, declared permissions, dependency hygiene, maintenance recency. All of it is packaging, and most of it is filled in by the person being rated. There is already a public guide titled "how to hit 100" on one of these scores. A metric you can study for and grind toward is measuring completeness, not quality.

MCP Verdict scores a different thing. We install the server, run every tool it advertises, hit it repeatedly to see if it holds, time it, and check what it can actually reach on the host. Then a person writes a verdict that takes a side. The number comes from behavior. The verdict comes from judgment. Neither comes from the server's npm page.
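To make "run every tool, hit it repeatedly, time it" concrete, here is a minimal sketch of what such a loop could look like. It is an illustration under stated assumptions, not the published method: the names `ToolResult`, `exercise_tool`, and `exercise_server` are hypothetical, and each tool is represented by a plain Python callable standing in for a real call to an installed server.

```python
import time
from dataclasses import dataclass
from statistics import median
from typing import Callable, Dict, List


@dataclass
class ToolResult:
    """Outcome of repeatedly calling one advertised tool (hypothetical shape)."""
    name: str
    runs: int
    successes: int
    median_seconds: float

    @property
    def success_rate(self) -> float:
        return self.successes / self.runs if self.runs else 0.0


def exercise_tool(name: str, call: Callable[[], object], runs: int = 5) -> ToolResult:
    """Call one tool `runs` times, counting successes and timing every call."""
    successes = 0
    durations: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        try:
            call()
            successes += 1
        except Exception:
            # Any exception counts as a failed run; the loop keeps going
            # so that flakiness shows up as a success rate below 1.0.
            pass
        durations.append(time.perf_counter() - start)
    typical = median(durations) if durations else 0.0
    return ToolResult(name, runs, successes, typical)


def exercise_server(tools: Dict[str, Callable[[], object]], runs: int = 5) -> List[ToolResult]:
    """Exercise every advertised tool, not just the ones that look interesting."""
    return [exercise_tool(name, call, runs) for name, call in tools.items()]


if __name__ == "__main__":
    # Stand-in tools; a real harness would invoke the installed server instead.
    def read_file() -> str:
        return "contents"

    def broken_tool() -> None:
        raise RuntimeError("tool crashed")

    for result in exercise_server({"read_file": read_file, "broken_tool": broken_tool}, runs=3):
        print(result.name, f"{result.successes}/{result.runs}")
```

The sketch covers only the repeat-and-time part. Checking what a server can actually reach on the host needs an isolated environment and is not shown here.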

A referee with a horse in the race is not a referee.

Every marketplace and directory that ranks these tools has a reason to rank them a certain way. A model vendor's store exists to make its platform look good. A directory with a connector gateway has an incentive toward the servers that route through it. None of them can be the neutral judge, not because they are dishonest, but because they cannot prove they are not. The incentive never goes away, so the trust never arrives. MCP Verdict has no such incentive to explain away. That gap is the reason it exists.

Anyone can write the word neutral. We make it checkable.

The method is published and versioned, so you can read exactly how a score is produced and challenge it. Every evaluation behind a score is shown in full. Placement is not for sale at any rung. Two servers in our first category function perfectly and still earn an F, because one hands an agent the whole filesystem and the other lets a shell command leave the sandbox. We published both.
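The two F grades follow a simple rule: a safety finding overrides a perfect functional result. A minimal sketch of that rule, assuming a hypothetical `Evaluation` record and made-up letter thresholds (only the F outcome is taken from the text above; the published method defines the real scale):

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Evaluation:
    """Hypothetical record of one evaluation; field names are assumptions."""
    functional_score: float  # 0.0 to 1.0, share of tool runs that behaved correctly
    safety_findings: List[str] = field(default_factory=list)


def grade(evaluation: Evaluation) -> str:
    """Return a letter grade; any safety finding caps the grade at F."""
    if evaluation.safety_findings:
        # A server that works perfectly but is unsafe still fails.
        return "F"
    # Illustrative thresholds only, not the published scale.
    if evaluation.functional_score >= 0.9:
        return "A"
    if evaluation.functional_score >= 0.75:
        return "B"
    if evaluation.functional_score >= 0.6:
        return "C"
    return "F"


if __name__ == "__main__":
    works_but_unsafe = Evaluation(1.0, ["agent can read the whole filesystem"])
    works_and_safe = Evaluation(1.0)
    print(grade(works_but_unsafe), grade(works_and_safe))
```

Making safety a gate rather than one more weighted term means no amount of functional polish can buy back an unsafe default.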

What we will not do

Rank a server above a rival for any reason other than the score.
Take money to change, raise, or feature a score.
Host a directory or gateway with a stake in what gets listed.
Soften a verdict to keep an author or a partner happy.
Claim a community consensus before there is a community.

The closest thing to what we do is a "confirmed functioning" checkbox.

We are not the only people thinking about whether these servers work. Glama, the largest directory, runs a "confirmed functioning" check alongside its scorecards, and it is the nearest thing in the field to behavioral testing. The difference is depth and independence: a checkbox is not a verdict, and a directory that also runs a connector gateway has a commercial stake in what it lists. We will not out-list a 21,000-server directory, and we will not try. We go deeper on the few servers that matter, and we owe nothing to any of them.

The score is meant to spread.

Over time the protocol opens, so other systems can query an MCP Verdict score directly from the package manager, the framework, or the IDE. We do not try to own the format. We own the thing that compounds: the accumulated behavioral evidence and the network of builders who produce it. Openness is how the score reaches everyone. The data is how it stays trustworthy. At Rung 3, agreement across builders who normally disagree becomes a signal no metadata aggregate can fake.

MCP servers first. On purpose.

Every scorer in this field is MCP-only, and behavioral testing is slow, so we start where the need is sharpest and the test is cleanest: MCP servers, beginning with filesystem access, where "what can it reach" is a question with a real answer. Agents and the rest of the tool layer are where we go next. We would rather test one category honestly than list ten we have not run.

The proof of all this is the method itself.

Read how we score
Browse the registry