From SCF source release to blockchain-verified data. Fully automated. Zero human intervention. Every byte hashed, every step chained, every output anchored.
A persistent daemon polls the SCF GitHub repository every 60 seconds. It calls the GitHub API and compares the returned commit SHA against the last known SHA stored in .last_scf_commit.
If the SHA differs, a new SCF release has been pushed. The watcher immediately logs the detection, sends an email notification to all stakeholders, and hands off to the pipeline with the new commit SHA.
Detection latency: under 60 seconds from the moment SCF pushes.
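The comparison step can be sketched as follows. This is a minimal illustration, not the watcher's actual code: the function name `detect_new_commit` and the injected `fetch_latest_sha` callable are hypothetical, and the sketch assumes the state file is updated as soon as a new SHA is seen.

```python
from pathlib import Path
from typing import Callable, Optional

STATE_FILE = Path(".last_scf_commit")  # file name taken from the text above


def detect_new_commit(fetch_latest_sha: Callable[[], str],
                      state_file: Path = STATE_FILE) -> Optional[str]:
    """Return the new commit SHA if it differs from the stored one, else None.

    `fetch_latest_sha` is a hypothetical callable that asks the GitHub API
    for the head commit SHA; it is injected so the logic runs offline.
    """
    latest = fetch_latest_sha().strip()
    previous = state_file.read_text().strip() if state_file.exists() else ""
    if latest == previous:
        return None  # nothing new since the last poll
    # Assumption: the new SHA is persisted immediately on detection.
    state_file.write_text(latest + "\n")
    return latest
```

A daemon would call this in a loop with a 60-second sleep between polls and hand any non-None result to the pipeline.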
Before touching anything, the system protects the previous run's output. The current SCF Compliance Atlas and all verification artifacts are zipped into a timestamped archive. If the new pipeline run fails, the last known-good output remains intact.
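A timestamped backup of this kind might look like the sketch below. The function name, the `atlas_backup_` prefix, and the directory layout are assumptions for illustration; only the idea of zipping the previous output under a timestamp comes from the description above.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path


def archive_previous_output(output_dir: Path, archive_root: Path) -> Path:
    """Zip the contents of output_dir into a timestamped archive.

    archive_root should live outside output_dir so the archive does not
    end up inside the directory it is backing up.
    """
    archive_root.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    base_name = archive_root / f"atlas_backup_{stamp}"  # hypothetical naming
    # make_archive appends ".zip" and returns the path of the created file.
    created = shutil.make_archive(str(base_name), "zip", root_dir=str(output_dir))
    return Path(created)
```

The source directory is only read, never modified, so the last known-good output stays in place if a later phase fails.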
The pipeline clones the entire securecontrolsframework/securecontrolsframework repository. It walks every file — the Excel workbook, all 185 STRM PDFs, every config file — and computes SHA-256 for each one.
It then writes source_manifest.json with the commit SHA, download timestamp, and per-file hashes. An individual .provenance artifact is created for each file.
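As a rough sketch, the manifest could be assembled like this. The JSON field names (`commit_sha`, `downloaded_at`, `file_count`, `files`) and the choice to skip the `.git` directory are assumptions; the actual layout of source_manifest.json is not shown in this document.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large PDFs are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_source_manifest(repo_dir: Path, commit_sha: str) -> Dict[str, object]:
    """Walk repo_dir and record a SHA-256 hash for every file found."""
    files: Dict[str, str] = {}
    for path in sorted(repo_dir.rglob("*")):
        relative = path.relative_to(repo_dir)
        # Assumption: Git's own metadata is not part of the hashed source.
        if path.is_file() and ".git" not in relative.parts:
            files[relative.as_posix()] = sha256_file(path)
    return {
        "commit_sha": commit_sha,
        "downloaded_at": datetime.now(timezone.utc).isoformat(),
        "file_count": len(files),
        "files": files,
    }
```

Serializing the result with `json.dumps(manifest, indent=2, sort_keys=True)` would give a stable file to hash as the first link in the chain.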
Smart skip: If the source manifest shows we already have this exact commit's data, the download is skipped. The pipeline still re-processes everything from the cached source.
194 files hashed. The source manifest is the first link in the hash chain.
The SCF Excel workbook contains 10 sheets. Each is exported as a standalone CSV:
Each raw CSV is SHA-256 hashed immediately after creation. The combined hash of all 10 chains to Block 0's source hash.
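The document does not specify how the per-file hashes are combined, so the following is one plausible construction rather than the pipeline's actual method: hash the sorted `name:hash` pairs, then bind the result to the previous block's hash. Both function names are hypothetical.

```python
import hashlib
from typing import Dict


def combine_hashes(file_hashes: Dict[str, str]) -> str:
    """Fold per-file hashes into one digest, independent of dict order."""
    digest = hashlib.sha256()
    for name in sorted(file_hashes):  # sorting makes the result deterministic
        digest.update(f"{name}:{file_hashes[name]}\n".encode("utf-8"))
    return digest.hexdigest()


def link_to_previous(previous_hash: str, combined: str) -> str:
    """Bind this phase's combined hash to the previous block's hash."""
    return hashlib.sha256(f"{previous_hash}:{combined}".encode("utf-8")).hexdigest()
```

Because the previous hash is an input, changing anything in the source manifest changes every hash derived after it.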
Note: any of these numbers may change after an SCF update. Once the data has been processed, an email is sent listing the differences.
The longest phase. For each of the 185 STRM PDFs:
STRM Rationale → strm_rationale
scf_index, scf_description, fde, fde_description, relationship, strm_rationale, strm_strength
Self-healing: If extraction quality is poor, the extractor falls back to pre-extracted benchmark data. If even that fails, the PDF is flagged for manual review with notes explaining what went wrong.
Zero tolerance for bad data. The pipeline would rather use benchmark data than risk corrupt output.
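The fallback order described above (live extraction, then benchmark data, then manual review) can be sketched as below. Every name here is hypothetical, including the `is_acceptable` quality check, whose real criteria are not given in this document.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Rows = List[Dict[str, str]]


@dataclass
class ExtractionResult:
    source: str                      # "pdf", "benchmark", or "manual_review"
    rows: Rows = field(default_factory=list)
    notes: List[str] = field(default_factory=list)


def extract_with_fallback(pdf_name: str,
                          extract_pdf: Callable[[str], Rows],
                          load_benchmark: Callable[[str], Rows],
                          is_acceptable: Callable[[Rows], bool]) -> ExtractionResult:
    """Try the PDF first, then benchmark data, then flag for manual review."""
    notes: List[str] = []
    for label, loader in (("pdf", extract_pdf), ("benchmark", load_benchmark)):
        try:
            rows = loader(pdf_name)
        except Exception as exc:  # record the failure and try the next source
            notes.append(f"{label}: raised {exc!r}")
            continue
        if is_acceptable(rows):
            return ExtractionResult(label, rows, notes)
        notes.append(f"{label}: failed the quality check")
    # Neither source was usable: no rows are emitted, only the notes.
    return ExtractionResult("manual_review", [], notes)
```

The notes accumulate across attempts, so a flagged PDF carries the reason each source was rejected.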
Every CSV (both the 10 core tables and 185 STRM extractions) goes through deterministic normalization: headers to snake_case, encoding to UTF-8, type coercion, whitespace trimmed, empty strings replaced with proper nulls. Output to csv_cleaned/.
Deterministic: given the same input, this always produces byte-identical output. Critical for hash reproducibility.
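A simplified version of the normalization rules might look like this. It covers snake_case headers, whitespace trimming, and empty-string-to-null conversion; type coercion and encoding detection are left out, and the function names and sample values are illustrative assumptions.

```python
import csv
import io
import re
from typing import Dict, List, Optional


def to_snake_case(header: str) -> str:
    """Convert a header such as 'STRM Rationale' to 'strm_rationale'."""
    text = re.sub(r"[^0-9A-Za-z]+", "_", header.strip())
    text = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", text)  # split camelCase
    return text.strip("_").lower()


def normalize_csv(text: str) -> List[Dict[str, Optional[str]]]:
    """Return rows with snake_case keys, trimmed cells, and None for blanks."""
    reader = csv.reader(io.StringIO(text))
    try:
        header = [to_snake_case(name) for name in next(reader)]
    except StopIteration:
        return []  # empty input yields no rows
    rows: List[Dict[str, Optional[str]]] = []
    for raw in reader:
        cells = [cell.strip() for cell in raw]
        rows.append({key: (value if value != "" else None)
                     for key, value in zip(header, cells)})
    return rows
```

Nothing in these functions depends on the clock, locale, or dictionary ordering, which is what keeps the output identical across runs.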
4 SCF Junction Tables in scf_relationships/:
scf_to_domain.csv: maps each control to its security domain(s)
scf_to_threat.csv: maps controls to threat categories
scf_to_risk.csv: maps controls to risk categories
scf_to_authoritative_source.csv: maps controls to regulatory frameworks
249 Framework Relationship CSVs in framework_relationships/, one per regulatory framework (NIST 800-53, ISO 27001, PCI DSS, GDPR, etc.). Hundreds of thousands of mapping rows total. All outputs hashed immediately.
Four analytical summaries generated:
The self-auditing phase. The pipeline independently computes what the correct numbers should be, then checks every one.
17 Validation Checks:
If any check fails, the pipeline refuses to produce output. No exceptions.
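The all-or-nothing gate can be illustrated with a small sketch. The exception class, function name, and check names are hypothetical; the real 17 checks are not enumerated here, so the example only compares independently computed totals against observed ones.

```python
from typing import Dict, List


class ValidationFailure(Exception):
    """Raised when at least one expected total does not match."""


def run_checks(expected: Dict[str, int], actual: Dict[str, int]) -> None:
    """Compare every expected total with the observed one; raise on any miss."""
    failures: List[str] = []
    for name in sorted(expected):
        observed = actual.get(name)  # a missing value counts as a failure
        if observed != expected[name]:
            failures.append(f"{name}: expected {expected[name]}, got {observed}")
    if failures:
        # All mismatches are reported together, then output is refused.
        raise ValidationFailure("; ".join(failures))
```

The caller would build the final deliverable only after `run_checks` returns without raising.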
A 27-leaf Merkle tree is built from the expected totals. Each leaf is a named data point (e.g., "framework_count: 249"). Leaves are SHA-256 hashed, then paired and re-hashed up the tree. The root hash represents the entire verified state of all pipeline output.
Per-leaf portable proofs are written — each contains the leaf data plus the sibling hashes needed to recompute the root. Anyone can independently verify that a specific number was part of the verified dataset.
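The tree and its proofs can be sketched as follows. With 27 leaves some levels have an odd number of nodes, and the document does not say how those are handled; this sketch assumes the last node is duplicated. The proof format (sibling hash plus side) is likewise an illustrative choice.

```python
import hashlib
from typing import List, Tuple

Proof = List[Tuple[str, str]]  # (sibling hash in hex, "left" or "right")


def _sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_levels(leaves: List[str]) -> List[List[bytes]]:
    """Return every tree level, leaves first, root last."""
    if not leaves:
        raise ValueError("at least one leaf is required")
    level = [_sha256(leaf.encode("utf-8")) for leaf in leaves]
    levels: List[List[bytes]] = []
    while True:
        if len(level) > 1 and len(level) % 2 == 1:
            level = level + [level[-1]]  # assumption: duplicate the odd node
        levels.append(level)
        if len(level) == 1:
            return levels
        level = [_sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]


def make_proof(levels: List[List[bytes]], index: int) -> Proof:
    """Collect the sibling hashes needed to recompute the root for one leaf."""
    proof: Proof = []
    for level in levels[:-1]:
        sibling = index ^ 1  # the partner node in the same pair
        side = "left" if sibling < index else "right"
        proof.append((level[sibling].hex(), side))
        index //= 2
    return proof


def verify_proof(leaf: str, proof: Proof, root_hex: str) -> bool:
    """Recompute the root from a leaf and its proof, then compare."""
    node = _sha256(leaf.encode("utf-8"))
    for sibling_hex, side in proof:
        sibling = bytes.fromhex(sibling_hex)
        node = _sha256(sibling + node) if side == "left" else _sha256(node + sibling)
    return node.hex() == root_hex
```

A verifier needs only the leaf text, its proof, and the published root; the other leaves never have to be disclosed.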
The pipeline builds the final Excel deliverable:
_Counts, _Schema, _Relationships, _Provenance, _Data_Model
Result: a single 92+ MB Excel file with 270+ sheets, every one traceable through the hash chain to the original SCF source.
Creates a dated evidence zip containing all verification artifacts. Packages everything into the "It Is Finished" folder: the Compliance Atlas Excel, the evidence zip, and provenance sidecar JSON. Archives previous versions. Sends completion notification emails to all stakeholders.
Two independent Merkle roots are anchored on-chain via the SecureFrameworkVerificationToken contract:
The anchoring transaction embeds the root hash in the transaction data. Once confirmed on Base, this hash is permanent and publicly verifiable. 19 category wallets hold SFVT token balances equal to each verified count.
| Phase | Duration | What's Happening |
|---|---|---|
| Detection | < 60s | Watcher detects new commit |
| Archive | ~5s | Previous output backed up |
| Source Download | 2-5 min | Full repo + 185 PDFs downloaded |
| Raw Split | ~10s | Excel split into 10 CSVs |
| STRM Extraction | 10-12 min | 185 PDFs scanned, tables extracted |
| Data Cleaning | ~30s | All CSVs normalized |
| Relationship Mapping | ~1 min | 249 framework CSVs + 4 junctions built |
| STRM Summaries | ~20s | 4 analytical summary sheets |
| Validation | ~30s | 17 checks, 7-block chain, Merkle tree |
| Atlas Build | 2-3 min | 270+ sheet Excel assembled |
| Archive + Notify | ~30s | Zip, email, done |
| Total | ~15-20 min | Fully automated, zero human intervention |
Every block includes the previous block's hash. Change one byte anywhere and the chain breaks.
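A minimal sketch of such a chain, and of how a break is detected, is shown below. The block fields, the all-zero genesis value, and the `|`-separated hashing format are assumptions for illustration, not the pipeline's actual block layout.

```python
import hashlib
from typing import Dict, List

GENESIS = "0" * 64  # assumption: placeholder "previous hash" for the first block


def block_hash(index: int, prev_hash: str, payload_hash: str) -> str:
    """Hash a block's index, its predecessor's hash, and its payload hash."""
    return hashlib.sha256(f"{index}|{prev_hash}|{payload_hash}".encode("utf-8")).hexdigest()


def build_chain(payload_hashes: List[str]) -> List[Dict[str, object]]:
    """Create one block per payload, each pointing at the block before it."""
    chain: List[Dict[str, object]] = []
    prev = GENESIS
    for index, payload in enumerate(payload_hashes):
        current = block_hash(index, prev, payload)
        chain.append({"index": index, "prev_hash": prev,
                      "payload_hash": payload, "hash": current})
        prev = current
    return chain


def first_broken_block(chain: List[Dict[str, object]]) -> int:
    """Return the index of the first invalid block, or -1 if the chain holds."""
    prev = GENESIS
    for position, block in enumerate(chain):
        recomputed = block_hash(position, prev, str(block["payload_hash"]))
        if block["prev_hash"] != prev or block["hash"] != recomputed:
            return position
        prev = recomputed
    return -1
```

Editing the payload of any block invalidates that block's stored hash, and recomputing it would in turn invalidate every later block's `prev_hash`.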
verify_chain.py computes expected totals from scratch. It doesn't trust what any other script says.
The pipeline refuses output if any of the 17 checks fails. No partial output, no "close enough."
Same input always produces same output. Reproducible across runs.
The root hash is anchored on Base L2. Immutable, public, permanent.
Two independent systems (pipeline + audit track) must agree or the pipeline refuses output.
STRM extractor falls back to benchmarks rather than producing bad data.
Every file has a SHA-256 hash recorded at the moment of creation.