
The Pipeline

From SCF source release to blockchain-verified data. Fully automated. Zero human intervention. Every byte hashed, every step chained, every output anchored.

  • Detection: <60s
  • Files Hashed: 194
  • PDFs Extracted: 185
  • Validation Checks: 17
  • Full Pipeline: ~15 min

The Full Lifecycle

0

Continuous Watch scf_watcher.py

A persistent daemon polls the SCF GitHub repository every 60 seconds. On each poll it queries the GitHub API and compares the returned commit SHA against the last known SHA stored in .last_scf_commit.

If the SHA differs, a new SCF release has been pushed. The watcher immediately logs the detection, sends an email notification to all stakeholders, and hands off to the pipeline with the new commit SHA.

Detection latency: under 60 seconds from the moment SCF pushes.
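
The comparison step can be sketched roughly as follows. This is a minimal illustration, not the actual scf_watcher.py: the function names are hypothetical, the endpoint is GitHub's public "list commits" REST route for the repository named on this page, and persisting the new SHA at detection time (rather than after a successful run) is an assumption.

```python
import json
import urllib.request
from pathlib import Path

# Assumed endpoint: GitHub's "list commits" REST route for the SCF repository.
COMMITS_URL = (
    "https://api.github.com/repos/"
    "securecontrolsframework/securecontrolsframework/commits?per_page=1"
)


def fetch_latest_sha(url: str = COMMITS_URL) -> str:
    """Return the SHA of the newest commit on the default branch."""
    request = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        commits = json.load(response)
    return commits[0]["sha"]


def check_for_release(latest_sha: str, state_file: Path) -> bool:
    """Compare latest_sha with the stored SHA; store it and return True if it changed."""
    last_sha = state_file.read_text().strip() if state_file.exists() else ""
    if latest_sha == last_sha:
        return False
    state_file.write_text(latest_sha + "\n")
    return True
```

A daemon would call check_for_release(fetch_latest_sha(), Path(".last_scf_commit")) in a loop with a 60-second sleep and start the pipeline whenever it returns True.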

1

Archive Protection archive_manager.py

Before touching anything, the system protects the previous run's output. The current SCF Compliance Atlas and all verification artifacts are zipped into a timestamped archive. If the new pipeline run fails, the last known-good output remains intact.
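
A timestamped backup of this kind might look like the sketch below. The function name, the atlas_backup_ prefix, and the directory layout are assumptions for illustration; the real archive_manager.py may differ.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def archive_previous_run(output_dir: Path, archive_root: Path) -> Optional[Path]:
    """Zip output_dir into archive_root under a UTC-timestamped name.

    archive_root must live outside output_dir, otherwise the zip would
    try to include itself.
    """
    if not output_dir.is_dir() or not any(output_dir.iterdir()):
        return None  # nothing to protect, e.g. on a first run
    archive_root.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    base_name = archive_root / f"atlas_backup_{stamp}"
    # make_archive appends ".zip" and returns the final path as a string.
    return Path(shutil.make_archive(str(base_name), "zip", root_dir=str(output_dir)))
```

Because the archive is written before any new work starts, a failed run leaves both the previous output and its zip untouched.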

2

Source Acquisition Block 0

The pipeline clones the entire securecontrolsframework/securecontrolsframework repository. It walks every file — the Excel workbook, all 185 STRM PDFs, every config file — and computes SHA-256 for each one.

The pipeline then writes source_manifest.json with the commit SHA, download timestamp, and per-file hashes. Individual .provenance artifacts are created for each file.

Smart skip: If the source manifest shows we already have this exact commit's data, the download is skipped. The pipeline still re-processes everything from the cached source.

194 files hashed. The source manifest is the first link in the hash chain.
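
As a rough sketch of the hashing walk and the smart-skip test, assuming a manifest layout with commit_sha, downloaded_at, file_count, and files keys (these field names and both function names are illustrative, not taken from the real source_manifest.json):

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large PDFs never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_source_manifest(repo_dir: Path, commit_sha: str) -> dict:
    """Hash every file under repo_dir, skipping Git's own .git directory."""
    files = {
        path.relative_to(repo_dir).as_posix(): sha256_file(path)
        for path in sorted(repo_dir.rglob("*"))
        if path.is_file() and ".git" not in path.relative_to(repo_dir).parts
    }
    return {
        "commit_sha": commit_sha,
        "downloaded_at": datetime.now(timezone.utc).isoformat(),
        "file_count": len(files),
        "files": files,
    }


def already_acquired(manifest_path: Path, commit_sha: str) -> bool:
    """Smart skip: True when the stored manifest is for this exact commit."""
    if not manifest_path.exists():
        return False
    manifest = json.loads(manifest_path.read_text())
    return manifest.get("commit_sha") == commit_sha
```

Sorting the paths keeps the manifest ordering stable between runs, which matters if the manifest itself is later hashed.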

3

Raw Data Split Block 1

The SCF Excel workbook contains 10 sheets. Each is exported as a standalone CSV:

  • SCF — 1,468 controls (the backbone)
  • SCF_Domains_Principles — 33 security domains
  • Threat_Catalog — 41 threat categories
  • Risk_Catalog — 39 risk categories
  • Assessment_Objectives — assessment criteria
  • Evidence_Request_List — evidence mapping
  • Authoritative_Sources — 249 regulatory frameworks
  • Compensating_Controls — compensating control guidance
  • Data_Privacy_Mgmt_Principles — privacy principles
  • STRM_Mappings — the master Set Theory Relationship Mapping

Each raw CSV is SHA-256 hashed immediately after creation. The combined hash of all 10 chains to Block 0's source hash.

Note: any of these counts may change when SCF publishes an update. After the new data is processed, an email is sent listing the specific differences.

4

STRM PDF Extraction Block 5 · ~10-12 min

The longest phase. For each of the 185 STRM PDFs:

  • Opens the PDF with pdfplumber and scans every page for tables
  • Uses adaptive header detection to find the FDE column, and handles garbled OCR text (for example, the header RaStTioRnMale is recognized as strm_rationale)
  • Keeps the best column map across all pages (doesn't let degraded headers overwrite good ones)
  • Extracts every row: scf_index, scf_description, fde, fde_description, relationship, strm_rationale, strm_strength
  • Normalizes all relationship values to 5 canonical types: Equal, Subset Of, Superset Of, Intersects With, No Relationship
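
Two of the steps above lend themselves to a small sketch: recognizing an interleaved header and normalizing relationship values. The subsequence test shown here is one plausible heuristic, not necessarily the one the extractor uses, and every function name is hypothetical.

```python
import re

# The five canonical relationship types, keyed by a letters-only form.
CANONICAL_RELATIONSHIPS = {
    "equal": "Equal",
    "subsetof": "Subset Of",
    "supersetof": "Superset Of",
    "intersectswith": "Intersects With",
    "norelationship": "No Relationship",
}


def _squash(text: str) -> str:
    """Lowercase the text and drop everything except ASCII letters."""
    return re.sub(r"[^a-z]", "", text.lower())


def normalize_relationship(raw: str) -> str:
    """Map a raw cell value onto one of the five canonical types."""
    key = _squash(raw)
    if key not in CANONICAL_RELATIONSHIPS:
        raise ValueError(f"unrecognized relationship: {raw!r}")
    return CANONICAL_RELATIONSHIPS[key]


def is_subsequence(needle: str, haystack: str) -> bool:
    """True if needle's characters appear in haystack in the same order."""
    remaining = iter(haystack)
    return all(char in remaining for char in needle)


def looks_like_strm_rationale(header: str) -> bool:
    """Assumed heuristic: both words survive as ordered subsequences."""
    squashed = _squash(header)
    return is_subsequence("strm", squashed) and is_subsequence("rationale", squashed)
```

Under this heuristic the garbled header RaStTioRnMale matches because the letters of "STRM" and "Rationale" are interleaved but still in order, while headers such as "Relationship" or "STRM Strength" do not match.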

Self-healing: If extraction quality is poor, the extractor falls back to pre-extracted benchmark data. If even that fails, the PDF is flagged for manual review with notes explaining what went wrong.
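
The fallback order might be expressed like this. The quality test used here (a minimum row count) is an assumption standing in for whatever quality measure the extractor really applies, and the callables are hypothetical placeholders for the PDF extractor and benchmark loader.

```python
from typing import Callable, List, Tuple

Rows = List[dict]


def extract_with_fallback(
    pdf_name: str,
    extract: Callable[[str], Rows],
    load_benchmark: Callable[[str], Rows],
    min_rows: int = 1,
) -> Tuple[Rows, str, str]:
    """Return (rows, source, note); source is 'extracted', 'benchmark' or 'manual_review'."""
    try:
        rows = extract(pdf_name)
        if len(rows) >= min_rows:
            return rows, "extracted", ""
        note = f"only {len(rows)} rows extracted"
    except Exception as error:  # any extractor failure triggers the fallback
        note = f"extraction raised {error!r}"
    benchmark = load_benchmark(pdf_name)
    if benchmark:
        return benchmark, "benchmark", note
    # Neither path produced data: flag for a human, keeping the reason.
    return [], "manual_review", note + "; no benchmark available"
```

The note travels with the result in every branch, so a flagged PDF always carries an explanation of what went wrong.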

Zero tolerance for bad data. The pipeline would rather use benchmark data than risk corrupt output.

5

Data Cleaning Block 2

Every CSV (both the 10 core tables and 185 STRM extractions) goes through deterministic normalization: headers to snake_case, encoding to UTF-8, type coercion, whitespace trimmed, empty strings replaced with proper nulls. Output to csv_cleaned/.

Deterministic: given the same input, this always produces byte-identical output. Critical for hash reproducibility.
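
A minimal sketch of the header and cell normalization, using only the standard library. Type coercion is omitted, the function names are hypothetical, and representing nulls as Python None is an assumption about what "proper nulls" means here.

```python
import csv
import io
import re
from typing import List, Optional


def to_snake_case(header: str) -> str:
    """Collapse any run of non-alphanumeric characters into one underscore."""
    words = re.sub(r"[^0-9A-Za-z]+", " ", header).split()
    return "_".join(word.lower() for word in words)


def clean_rows(raw_text: str) -> List[dict]:
    """Parse CSV text into dicts with snake_case keys, trimmed values, and None for blanks."""
    rows = list(csv.reader(io.StringIO(raw_text, newline="")))
    if not rows:
        return []
    headers = [to_snake_case(name) for name in rows[0]]
    cleaned = []
    for row in rows[1:]:
        values: List[Optional[str]] = [cell.strip() or None for cell in row]
        cleaned.append(dict(zip(headers, values)))
    return cleaned
```

Nothing in these functions depends on time, locale, or iteration order of an unordered collection, which is what keeps the output reproducible for identical input.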

6

Relationship Mapping Block 3 + Block 4

4 SCF Junction Tables in scf_relationships/:

  • scf_to_domain.csv — maps each control to its security domain(s)
  • scf_to_threat.csv — maps controls to threat categories
  • scf_to_risk.csv — maps controls to risk categories
  • scf_to_authoritative_source.csv — maps controls to regulatory frameworks

249 Framework Relationship CSVs in framework_relationships/ — one per regulatory framework (NIST 800-53, ISO 27001, PCI DSS, GDPR, etc.). Hundreds of thousands of mapping rows total. All outputs hashed immediately.

7

STRM Summary Sheets

Four analytical summaries generated:

  • Framework_STRM_Summary — per-framework FDE count, relationship type distribution, row count
  • FDE_to_Control_Mappings — how many FDEs map to each SCF control and via which frameworks
  • Framework_to_Domain — which SCF domains each framework touches
  • Framework_Overlap — pairwise overlap between frameworks (shared controls, threshold ≥ 5)
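
The pairwise overlap summary could be computed along these lines, assuming each framework has already been reduced to the set of SCF control IDs it maps to. The function name and input shape are illustrative.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def framework_overlap(
    controls_by_framework: Dict[str, Set[str]], threshold: int = 5
) -> List[Tuple[str, str, int]]:
    """Return (framework_a, framework_b, shared_count) for pairs at or above threshold."""
    results = []
    # Sorting the names makes the pair order, and so the output, stable.
    for left, right in combinations(sorted(controls_by_framework), 2):
        shared = controls_by_framework[left] & controls_by_framework[right]
        if len(shared) >= threshold:
            results.append((left, right, len(shared)))
    return results
```

With 249 frameworks this examines 30,876 pairs, which is small enough for a plain set intersection per pair.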

8

Validation & Hash Chain Block 6

The self-auditing phase. The pipeline independently computes what the correct numbers should be, then checks every one.

7-Block Hash Chain
Block 0: Source Acquisition → SHA-256 of downloaded source files
↓ previous_hash
Block 1: Raw Data Split → SHA-256 of raw CSVs
↓ previous_hash
Block 2: Data Cleaning → SHA-256 of cleaned CSVs
↓ previous_hash
Block 3: Relationship Extract → SHA-256 of junction tables + STRM
↓ previous_hash
Block 4: Framework Mapping → SHA-256 of 249 framework CSVs
↓ previous_hash
Block 5: STRM Extraction → SHA-256 of 185 extracted CSVs
↓ previous_hash
Block 6: Final Validation → SHA-256 of expected totals + all previous hashes
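
The previous_hash linkage shown above can be illustrated with a short sketch. The block fields, the all-zero genesis value, and the JSON serialization are assumptions; only the idea that each block commits to its predecessor's hash comes from this page.

```python
import hashlib
import json
from typing import List, Tuple

GENESIS_HASH = "0" * 64  # assumed placeholder for Block 0's previous_hash
_HASHED_FIELDS = ("index", "name", "payload_hash", "previous_hash")


def _hash_block(block: dict) -> str:
    """SHA-256 over a canonical JSON form of the block's hashed fields."""
    body = {field: block[field] for field in _HASHED_FIELDS}
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def build_chain(stages: List[Tuple[str, str]]) -> List[dict]:
    """Turn (name, payload_hash) pairs into blocks linked by previous_hash."""
    chain, previous = [], GENESIS_HASH
    for index, (name, payload_hash) in enumerate(stages):
        block = {
            "index": index,
            "name": name,
            "payload_hash": payload_hash,
            "previous_hash": previous,
        }
        block["block_hash"] = _hash_block(block)
        chain.append(block)
        previous = block["block_hash"]
    return chain


def verify_chain(chain: List[dict]) -> bool:
    """Recompute every block hash and check each link to its predecessor."""
    previous = GENESIS_HASH
    for block in chain:
        if block["previous_hash"] != previous or block["block_hash"] != _hash_block(block):
            return False
        previous = block["block_hash"]
    return True
```

Altering the payload hash of any block changes its recomputed block hash, so verification fails at that block.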

17 Validation Checks:

1 | expected_totals.json integrity | SHA-256 match
2 | Source mirror file count | 194
3 | STRM PDFs on disk | 185
4 | Raw CSV count | 10
5 | Clean CSV count | 10
6 | Framework relationship count | 249
7 | SCF junction table count | 4
8 | Extracted CSV count | 185
9 | Total mapping rows | 468,814
10 | Provenance artifact count | 194
11 | Relationship type distribution | all 6 types
12 | Rationale type distribution | all match
13 | Strength value distribution | all match
14 | Per-PDF row counts | all 185
15 | Core table record counts | all 10
16 | Total unique FDEs | 34,696
17 | Framework relationship total rows | exact match

If any check fails, the pipeline refuses to produce output. No exceptions.
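
The all-or-nothing gate can be sketched as a comparison that raises on the first run with any mismatch. The key names below are hypothetical; the counts are the ones listed on this page.

```python
class PipelineValidationError(Exception):
    """Raised when any computed total differs from its expected value."""


def run_checks(expected: dict, actual: dict) -> None:
    """Compare every expected total with the computed one; raise if any differ."""
    failures = [
        f"{name}: expected {want!r}, got {actual.get(name)!r}"
        for name, want in expected.items()
        if actual.get(name) != want
    ]
    if failures:
        # Report every mismatch at once so a failed run is fully diagnosable.
        raise PipelineValidationError("; ".join(failures))


# Example totals taken from the check list; key names are illustrative.
EXPECTED_TOTALS = {
    "source_file_count": 194,
    "strm_pdf_count": 185,
    "framework_relationship_count": 249,
    "total_mapping_rows": 468814,
}
```

Calling run_checks before any deliverable is written means a single mismatch stops the run with no partial output.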

9

Merkle Tree & Proofs

A 27-leaf Merkle tree is built from the expected totals. Each leaf is a named data point (e.g., "framework_count: 249"). Leaves are SHA-256 hashed, then paired and re-hashed up the tree. The root hash represents the entire verified state of all pipeline output.

Per-leaf portable proofs are written — each contains the leaf data plus the sibling hashes needed to recompute the root. Anyone can independently verify that a specific number was part of the verified dataset.
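
A minimal sketch of the tree and its per-leaf proofs follows. The leaf encoding and the handling of odd-sized levels (duplicating the last node) are assumptions, since this page does not state how the 27 leaves are paired; the function names are hypothetical.

```python
import hashlib
from typing import List, Tuple


def _sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_levels(leaves: List[str]) -> List[List[bytes]]:
    """Return all tree levels, leaf hashes first and the root level last."""
    if not leaves:
        raise ValueError("at least one leaf is required")
    levels = [[_sha256(leaf.encode("utf-8")) for leaf in leaves]]
    while len(levels[-1]) > 1:
        current = levels[-1]
        if len(current) % 2 == 1:
            current.append(current[-1])  # assumed rule: duplicate the last node
        levels.append(
            [_sha256(current[i] + current[i + 1]) for i in range(0, len(current), 2)]
        )
    return levels


def make_proof(levels: List[List[bytes]], index: int) -> List[Tuple[str, str]]:
    """Collect (sibling_hex, side) pairs from the leaf level up to the root."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1  # the other node in this pair
        side = "left" if sibling < index else "right"
        proof.append((level[sibling].hex(), side))
        index //= 2
    return proof


def verify_proof(leaf: str, proof: List[Tuple[str, str]], root_hex: str) -> bool:
    """Recompute the root from one leaf and its sibling hashes."""
    node = _sha256(leaf.encode("utf-8"))
    for sibling_hex, side in proof:
        sibling = bytes.fromhex(sibling_hex)
        node = _sha256(sibling + node) if side == "left" else _sha256(node + sibling)
    return node.hex() == root_hex
```

A verifier needs only the leaf text, its proof, and the published root; no other leaf has to be disclosed.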

10

SCF Compliance Atlas Final Deliverable

The pipeline builds the final Excel deliverable:

  • All 10 cleaned core tables as sheets
  • All 249 framework relationship CSVs as sheets
  • 4 STRM summary sheets
  • Analytical sheets: Framework Health Dashboard, Domain Risk Heatmap, Risk/Threat/Evidence Exploded views, ERL Framework Coverage, Full Control Profile
  • Metadata sheets: _Counts, _Schema, _Relationships, _Provenance, _Data_Model

Result: a single 92+ MB Excel file with 270+ sheets, every one traceable through the hash chain to the original SCF source.

11

Evidence Archive & Notification

The pipeline creates a dated evidence zip containing all verification artifacts, then packages everything into the "It Is Finished" folder: the Compliance Atlas Excel, the evidence zip, and the provenance sidecar JSON. Previous versions are archived, and completion notification emails go to all stakeholders.

12

Blockchain Anchoring SFVT on Base L2

Two independent Merkle roots are anchored on-chain via the SecureFrameworkVerificationToken contract:

  • Main Verification Track — root hash of the 27-leaf Merkle tree covering all pipeline outputs
  • Authoritative Source Audit Track — independent root hash reading only original STRM PDFs

The anchoring transaction embeds the root hash in the transaction data. Once confirmed on Base, this hash is permanent and publicly verifiable. Each of the 19 category wallets holds an SFVT token balance equal to its verified count.

Run Timeline

Phase | Duration | What's Happening
Detection | < 60s | Watcher detects new commit
Archive | ~5s | Previous output backed up
Source Download | 2-5 min | Full repo + 185 PDFs downloaded
Raw Split | ~10s | Excel split into 10 CSVs
STRM Extraction | 10-12 min | 185 PDFs scanned, tables extracted
Data Cleaning | ~30s | All CSVs normalized
Relationship Mapping | ~1 min | 249 framework CSVs + 4 junctions built
STRM Summaries | ~20s | 4 analytical summary sheets
Validation | ~30s | 17 checks, 7-block chain, Merkle tree
Atlas Build | 2-3 min | 270+ sheet Excel assembled
Archive + Notify | ~30s | Zip, email, done
Total | ~15-20 min | Fully automated, zero human intervention

What Makes This Trustworthy

Hash Chain

Every block includes the previous block's hash. Change one byte anywhere and the chain breaks.

Independent Verification

verify_chain.py computes expected totals from scratch. It doesn't trust what any other script says.

Zero Tolerance

The pipeline refuses output if any of the 17 checks fails. No partial output, no "close enough."

Deterministic

Same input always produces same output. Reproducible across runs.

Blockchain Anchoring

The root hash is anchored on Base L2. Immutable, public, permanent.

Dual-Track

Two independent systems (pipeline + audit track) must agree or the pipeline refuses output.

Self-Healing Extraction

STRM extractor falls back to benchmarks rather than producing bad data.

Provenance at Every Step

Every file has a SHA-256 hash recorded at the moment of creation.