Unified research-data acquisition MCP — search & fetch datasets across Zenodo, DataCite, NCBI omics (GEO/SRA/BioProject)
One MCP server to find and fetch research data across archives, omics registries, and literature — behind a single normalized model.
search one query across Zenodo, DataCite (Dryad / Figshare / Dataverse /
OSF / Mendeley), NCBI omics (GEO / SRA / BioProject), DataONE (eco /
environmental), literature (PubMed / OpenAIRE), OmicsDI (proteomics /
metabolomics), and HuggingFace datasets — deduplicated, normalized, and
cross-linked. resolve any hit to its file manifest, citation, trust signals,
and the data it points at. fetch it to disk with checksum verification.
mcp-name: io.github.musharna/data-aggregator-mcp
Most data MCPs wrap a single source. This one unifies them behind five tools
and one DataResource model, so an agent searches once and gets back comparable
records:
- organism="Orobanche aegyptiaca" also matches Phelipanche aegyptiaca (NCBI
  Taxonomy), so a species rename doesn't cost you results.
- resolve: metrics (citations / views / downloads / likes), version status
  (is_latest / superseded_by), and last_updated freshness, surfaced wherever
  the source exposes them.
- resolve(format="croissant") or "ro-crate" hands a dataset to an ML or
  research-packaging pipeline as standard JSON-LD.
- operate reads the schema, previews rows, or runs a read-only SQL SELECT
  against a remote Parquet/CSV/TSV without downloading it (Parquet footer +
  DuckDB httpfs range reads). Optional [operate] extra; base install is
  unchanged.

→ Full rationale and a comparison vs. single-source servers, breadth gateways, and ML-dataset tools: docs/POSITIONING.md.
Run with no install:
uvx data-aggregator-mcp
Register with Claude Code:
claude mcp add data-aggregator -- uvx data-aggregator-mcp
A typical agent flow:
search("drought stress RNA-seq", organism="Sorghum bicolor")
→ [ geo:GSE..., sra:SRX..., zenodo:..., pubmed:... ] # deduped, taxa-normalized
resolve("sra:SRX079566")
→ DataResource{ files: [ENA FASTQ urls…], access: "open", taxa: [...] }
fetch("sra:SRX079566", dest="./data")
→ ["./data/SRX079566_1.fastq.gz", …] # md5-verified
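On the wire, each step of this flow is an MCP `tools/call` request. The sketch below builds the three JSON-RPC messages for the flow above. The argument names (query, organism, id, dest) come from the tool signatures in this README; the helper name `tool_call` and the request ids are illustrative assumptions, and most MCP clients build these messages for you.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC request ids; any unique value works


def tool_call(name: str, **arguments) -> dict:
    """Build an MCP `tools/call` request (hypothetical helper, not part of the package)."""
    return {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


# The search -> resolve -> fetch flow, expressed as three requests.
flow = [
    tool_call("search", query="drought stress RNA-seq", organism="Sorghum bicolor"),
    tool_call("resolve", id="sra:SRX079566"),
    tool_call("fetch", id="sra:SRX079566", dest="./data"),
]

for request in flow:
    print(json.dumps(request))
```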
pip install data-aggregator-mcp
data-aggregator-mcp # or: python -m data_aggregator_mcp
To use the operate tool (query remote tabular files in place), install the
optional extra:
pip install "data-aggregator-mcp[operate]"
Add to a client's MCP config (e.g. Claude Desktop claude_desktop_config.json):
{
"mcpServers": {
"data-aggregator": {
"command": "uvx",
"args": ["data-aggregator-mcp"],
"env": { "NCBI_API_KEY": "your-optional-key" }
}
}
}
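If you would rather script that edit than hand-merge JSON, a minimal sketch follows. It assumes the config file is a JSON object with an optional mcpServers key. The function name `add_server` is hypothetical, and the path is left to the caller because config locations differ by client and platform.

```python
import json
from pathlib import Path
from typing import Optional


def add_server(config_path: Path, api_key: Optional[str] = None) -> dict:
    """Add the data-aggregator entry to an MCP client config, keeping other servers."""
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    entry = {"command": "uvx", "args": ["data-aggregator-mcp"]}
    if api_key:  # NCBI_API_KEY is optional, so only write it when provided
        entry["env"] = {"NCBI_API_KEY": api_key}
    # setdefault preserves any servers that are already registered
    config.setdefault("mcpServers", {})["data-aggregator"] = entry
    config_path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

Calling `add_server(Path("claude_desktop_config.json"))` in the directory that holds the file would write the same entry as the JSON above, minus the env block.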
| Source | Discover | Fetch | Checksum |
|---|---|---|---|
| Zenodo | ✅ | ✅ | md5 |
| DataCite → Figshare | ✅ | ✅ | md5 |
| DataCite → Dataverse | ✅ | ✅ | md5 |
| DataCite → OSF | ✅ | ✅ | md5 |
| DataCite → Dryad | ✅ | manifest only¹ | sha-256 (listed) |
| DataCite → Mendeley & others | ✅ | — | — |
| NCBI SRA | ✅ | ✅ (ENA FASTQ) | md5 |
| NCBI GEO | ✅ | ✅ (suppl/) | none² |
| NCBI BioProject | ✅ | → SRA links | — |
| PubMed / OpenAIRE | ✅ | ✅ (OA full text) | none² |
| HuggingFace datasets | ✅ | ✅ (resolve URL) | none |
| DataONE (eco/env) | ✅ | ✅ (Member Node) | md5 / sha-256 |
| OmicsDI → PRIDE | ✅ | ✅ (HTTPS FTP) | size only |
| OmicsDI → MetaboLights | ✅ | ✅ (HTTPS FTP) | none |
| OmicsDI → other MS repos | ✅ | — | — |
¹ Dryad downloads are token / bot-challenge gated, so fetch fails loud;
resolve still lists the files.
² No upstream checksum — fetch verifies content-type instead (rejects an HTML
page served in place of a binary).
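The two verification modes in the table can be sketched with the standard library alone. This is an illustration, not the package's actual code: `verify_md5` and `looks_like_html` are hypothetical helper names, and the HTML check is a simple prefix heuristic standing in for the content-type verification described in footnote ².

```python
import hashlib
from pathlib import Path


def verify_md5(path: Path, expected: str, chunk_size: int = 1 << 20) -> bool:
    """Stream the file through md5 and compare against the upstream checksum."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        # Read in chunks so large FASTQ files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected.lower()


def looks_like_html(first_bytes: bytes) -> bool:
    """Heuristic: does a supposedly binary payload start like an HTML page?"""
    head = first_bytes[:512].lstrip().lower()
    return head.startswith((b"<!doctype html", b"<html"))
```

A fetch that expects a gzip archive but receives a login or error page would trip the second check, which is the failure the footnote describes.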
search(query?, size?, sources?, organism?, kind?, published_after?, published_before?, rank?, cursor?)

Fan out across all wired sources in parallel and return compact DataResource
records, deduped by DOI. Per-source failures land in errors{} and are never
silently dropped.

- organism: expand the query with NCBI-Taxonomy synonyms; the expansion is
  echoed in taxon_expansion, and results carry normalized taxa[]
  ({taxid, name}) plus a described_in link to plant-genomics-mcp for plant
  taxa.
- sources: restrict the fan-out, e.g. ["omics"].
- size: max results (1–50).
- kind: keep only dataset / sequencing_run / study / publication / software.
- published_after / published_before: filter by publication year.
- rank: relevance (default) or semantic (re-rank the fetched page by
  embedding similarity to the query; needs EMBEDDING_API_BASE, degrades to
  relevance order otherwise).
- cursor: opaque token from a prior result's next_cursor; pages forward
  across every source. In cursor mode the other params are read from the
  token, so query is optional.

resolve(id, cite?, format?)

Full record + files manifest. Routes by id shape: zenodo:7654321, a bare DOI,
datacite:10.5061/dryad.x, an omics id (sra:SRX079566, geo:GSE332789,
bioproject:PRJNA1468572), a literature id (pubmed:34320281, openaire:<id>),
a HuggingFace id (hf:owner/name), a DataONE id (dataone:doi:10.5063/F1HT2M7Q),
or an OmicsDI id (omicsdi:pride:PXD000001). Attaches, where available:

- files[]: ENA FASTQ manifest (SRA), GEO suppl/, or the host repo's
  native manifest (Figshare / Dataverse / OSF / Dryad).
- links[]: paper-to-data links, i.e. pubmed: → sra: / geo: / bioproject:
  (NCBI elink) and openaire: → datacite: (ScholeXplorer Scholix).
- access / license: normalized status
  (open / embargoed / restricted / closed / unknown) and license where
  the source exposes it.
- identifiers: normalized {pmid, pmcid, doi}, plus an open-access
  full-text FileEntry (EuropePMC XML, or an Unpaywall PDF fallback) for papers.
- citation: pass cite=<format>, where the format is bibtex, ris, csl-json,
  or any CSL style name (apa, mla, vancouver, …). DOI records use content
  negotiation; others render CSL-JSON from metadata. Off by default; failures
  degrade quietly.
- metrics (citations / views / downloads / likes),
  is_latest / superseded_by (derived from version links), and last_updated
  freshness, where the source provides them.
- format: pass format="croissant" (file-level Croissant JSON-LD) or
  "ro-crate" (minimal RO-Crate 1.1) to attach a standard manifest under the
  matching field, for ML or research-packaging pipelines.

fetch(id, dest?, files?, max_bytes?, force?, extract?)

Download files to disk and return their paths. Streams under a max_bytes guard
(force to override) with md5 verification wherever a checksum exists.

- files: restrict to a subset of the resolved manifest.
- extract: unpack downloaded zip / tar archives in place, guarded against
  path traversal and runaway extracted size. Off by default.
- Files with no upstream checksum (GEO suppl/, literature full text) get a
  content-type sniff that fails loud if a declared binary is actually an
  HTML page.
- Sources that cannot be fetched raise FetchNotSupportedError.

list_sources()

Wired sources with their capabilities: layer, kinds, supported filters,
fetchability, operable flag, id examples, auth, and rate limits.
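
To make the search contract described earlier concrete (parallel fan-out, DOI dedup, per-source failures reported in errors{}), here is a minimal sketch. It is not the package's implementation: `fan_out`, the plain-dict records, and the callable-per-source shape are all assumptions chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Record = dict  # stand-in for a compact DataResource record
Source = Callable[[str], list[Record]]


def fan_out(query: str, sources: dict[str, Source]) -> dict:
    """Query every source in parallel, dedupe by DOI, and collect failures."""
    results: list[Record] = []
    errors: dict[str, str] = {}
    seen_dois: set[str] = set()

    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in sources.items()}

    # Leaving the `with` block waits for every call; iterate in a stable order.
    for name, future in futures.items():
        try:
            records = future.result()
        except Exception as exc:  # one broken source must not sink the search
            errors[name] = f"{type(exc).__name__}: {exc}"
            continue
        for record in records:
            doi = (record.get("doi") or "").lower()
            if doi and doi in seen_dois:
                continue  # same DOI already returned by another source
            if doi:
                seen_dois.add(doi)
            results.append(record)

    return {"results": results, "errors": errors}
```

Records without a DOI are kept as-is in this sketch, since there is nothing to dedupe them on. Comparing DOIs case-insensitively is a choice made here because DOIs are case-insensitive identifiers.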

operate(op, id, file?, query?, n?, columns?)

Inspect or query a remote tabular file (Parquet / CSV / TSV) without
downloading it. Addresses a file by catalog id + file name (defaults to the
first tabular file on the resolved record). Ops:

- schema: column names + types (reads the Parquet footer / sniffs the CSV
  header; no full load).
- preview: a small sample of rows.
- head: the first n rows (default 20), optionally restricted to columns.
- sql: a read-only SELECT (the file is the view data), e.g.
  SELECT col, count(*) FROM data GROUP BY 1.

Backed by the Parquet footer reader + DuckDB httpfs range reads. sql runs in
a locked-down DuckDB (read-only, local filesystem disabled, single-SELECT
validation, row / wall-clock caps). Requires the optional [operate] extra
(pip install data-aggregator-mcp[operate]); without it, operate returns a
clear install-the-extra message and the other four tools are unaffected.
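
For intuition on why a schema read needs no full download: an unencrypted Parquet file ends with its metadata block, a 4-byte little-endian metadata length, and the magic bytes PAR1. A client can fetch the last 8 bytes, compute where the metadata lives, and request only that range. The sketch below shows that arithmetic. The function names are hypothetical, and decoding the Thrift-encoded metadata is left to a Parquet library.

```python
import struct

MAGIC = b"PAR1"


def footer_byte_range(tail: bytes, file_size: int) -> tuple[int, int]:
    """Return the inclusive (start, end) byte offsets of the Parquet file metadata.

    `tail` is the last 8 bytes of the file: a 4-byte little-endian metadata
    length followed by the magic bytes.
    """
    if len(tail) != 8 or tail[4:] != MAGIC:
        raise ValueError("not a Parquet file (missing PAR1 trailer)")
    (metadata_len,) = struct.unpack("<I", tail[:4])
    start = file_size - 8 - metadata_len
    end = file_size - 8 - 1  # last metadata byte sits just before the trailer
    if start < len(MAGIC):
        raise ValueError("metadata length exceeds file size")
    return start, end


def range_header(start: int, end: int) -> dict[str, str]:
    """HTTP Range header for an inclusive byte span."""
    return {"Range": f"bytes={start}-{end}"}
```

Two small range requests (the 8-byte trailer, then the metadata span) are enough to learn column names and types, regardless of how large the data pages are.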
Any HuggingFace dataset with a datasets-server converted view is operable
(schema / preview / head / sql): resolve surfaces the auto-converted
Parquet files (source="hf-datasets-server") even for datasets stored as
JSON/JSONL/arrow, so pass file=<config>/<split>/...parquet to pick a split when
there are several.
Three workflow prompts surface in clients (e.g. /mcp__data_aggregator__* in
Claude Code):

- find_data: find datasets for a topic, optionally scoped to an organism.
- data_behind_paper: find the datasets / accessions behind a paper.
- search_resolve_fetch: walk the end-to-end search → resolve → fetch flow.

Both optional, set via environment variables:

- NCBI_API_KEY: raises the NCBI E-utilities rate limit (3 → 10 req/s) used by
  the omics, literature, and taxonomy lookups.
- UNPAYWALL_EMAIL: enables the Unpaywall fallback leg of literature full-text
  retrieval (the EuropePMC leg works without it).

uv venv && uv pip install -e ".[dev]"
uv run pytest -q
uv run ruff check src tests
DATA_AGGREGATOR_MCP_LIVE=1 uv run pytest -k live -q # real-API probes
The README demo (examples/assets/demo.svg) is recorded network-free from
examples/_demo_stdio.py — see the header of that file to re-record.
MIT — see LICENSE.