SFVT — The Pipeline

0

Continuous Watch scf_watcher.py

A persistent daemon polls the SCF GitHub repository every 60 seconds. It hits the GitHub API, compares the returned commit SHA against the last known SHA stored in .last_scf_commit.

If the SHA differs, a new SCF release has been pushed. The watcher immediately logs the detection, sends an email notification to all stakeholders, and hands off to the pipeline with the new commit SHA.

Detection latency: bounded by the polling interval, so a push is normally detected within about 60 seconds.
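The polling step can be sketched as follows. This is a minimal illustration, not the actual scf_watcher.py: the function names, the injected `fetch` callable, and the use of GitHub's "list commits" REST endpoint are assumptions; only the repository name and the .last_scf_commit state file come from the description above.

```python
import json
import time
import urllib.request
from pathlib import Path
from typing import Callable, Optional

# Assumed endpoint: GitHub's "list commits" REST API for the SCF repository.
COMMITS_URL = (
    "https://api.github.com/repos/"
    "securecontrolsframework/securecontrolsframework/commits?per_page=1"
)


def fetch_latest_sha(url: str = COMMITS_URL) -> str:
    """Return the SHA of the newest commit on the default branch."""
    request = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)[0]["sha"]


def check_for_new_commit(
    state_file: Path, fetch: Callable[[], str] = fetch_latest_sha
) -> Optional[str]:
    """Return the new SHA if it differs from the stored one, else None."""
    latest = fetch()
    known = state_file.read_text().strip() if state_file.exists() else ""
    if latest == known:
        return None
    state_file.write_text(latest)
    return latest


def watch(state_file: Path, on_new_commit: Callable[[str], None],
          interval_seconds: int = 60) -> None:
    """Poll forever; hand each newly detected SHA to the pipeline callback."""
    while True:
        new_sha = check_for_new_commit(state_file)
        if new_sha is not None:
            on_new_commit(new_sha)
        time.sleep(interval_seconds)
```

Passing the fetcher in as a parameter keeps the comparison logic testable without network access.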

1

Archive Protection archive_manager.py

Before touching anything, the system protects the previous run's output. The current SCF Compliance Atlas and all verification artifacts are zipped into a timestamped archive. If the new pipeline run fails, the last known-good output remains intact.
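A timestamped backup of this kind might look like the sketch below. The function name, archive file name pattern, and directory layout are hypothetical; the real archive_manager.py may organize things differently.

```python
import zipfile
from datetime import datetime, timezone
from pathlib import Path


def archive_previous_output(output_dir: Path, archive_dir: Path) -> Path:
    """Zip every file under output_dir into a timestamped archive.

    archive_dir must live outside output_dir so the archive never
    tries to include itself.
    """
    archive_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    archive_path = archive_dir / f"scf_atlas_{stamp}.zip"  # hypothetical name
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(output_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to output_dir so the zip is relocatable.
                zf.write(path, path.relative_to(output_dir).as_posix())
    return archive_path
```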

2

Source Acquisition Block 0

The pipeline clones the entire securecontrolsframework/securecontrolsframework repository. It walks every file — the Excel workbook, all 185 STRM PDFs, every config file — and computes SHA-256 for each one.

The pipeline then writes source_manifest.json with the commit SHA, download timestamp, and per-file hashes, and creates an individual .provenance artifact for each file.

Smart skip: If the source manifest shows we already have this exact commit's data, the download is skipped. The pipeline still re-processes everything from the cached source.

194 files hashed. The source manifest is the first link in the hash chain.
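As a rough sketch of the manifest and the smart-skip check, assuming a local clone directory: the JSON key names and helper names below are illustrative, since the actual schema of source_manifest.json is not shown here.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large PDFs never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_source_manifest(repo_dir: Path, commit_sha: str) -> dict:
    """Walk the clone (skipping .git) and record one hash per file."""
    files = {
        path.relative_to(repo_dir).as_posix(): sha256_file(path)
        for path in sorted(repo_dir.rglob("*"))
        if path.is_file() and ".git" not in path.relative_to(repo_dir).parts
    }
    return {
        "commit_sha": commit_sha,  # assumed key names throughout
        "downloaded_at": datetime.now(timezone.utc).isoformat(),
        "file_count": len(files),
        "files": files,
    }


def already_downloaded(manifest_path: Path, commit_sha: str) -> bool:
    """Smart skip: True when the stored manifest is for this exact commit."""
    if not manifest_path.exists():
        return False
    stored = json.loads(manifest_path.read_text())
    return stored.get("commit_sha") == commit_sha
```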

3

Raw Data Split Block 1

The SCF Excel workbook contains 10 sheets. Each is exported as a standalone CSV:

SCF — 1,468 controls (the backbone)
SCF_Domains_Principles — 33 security domains
Threat_Catalog — 41 threat categories
Risk_Catalog — 39 risk categories
Assessment_Objectives — assessment criteria
Evidence_Request_List — evidence mapping
Authoritative_Sources — 249 regulatory frameworks
Compensating_Controls — compensating control guidance
Data_Privacy_Mgmt_Principles — privacy principles
STRM_Mappings — the master Set Theory Relationship Mapping

Each raw CSV is SHA-256 hashed immediately after creation. The combined hash of all 10 chains to Block 0's source hash.

Note: any of these counts may change after an SCF update. Once the new data is processed, an email is sent listing the differences.

4

STRM PDF Extraction Block 5 · ~10-12 min

The longest phase. For each of the 185 STRM PDFs:

Opens the PDF with pdfplumber and scans every page for tables
Uses adaptive header detection to find the FDE column — handles garbled OCR text like RaStTioRnMale → strm_rationale
Keeps the best column map across all pages (doesn't let degraded headers overwrite good ones)
Extracts every row: scf_index, scf_description, fde, fde_description, relationship, strm_rationale, strm_strength
Normalizes all relationship values to 5 canonical types: Equal, Subset Of, Superset Of, Intersects With, No Relationship
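The header matching and relationship normalization in the list above could be approached along these lines. This is only one plausible technique, not necessarily the one the extractor uses: it matches a garbled header by comparing letter counts, which works for interleaved text such as RaStTioRnMale. The clean header strings in `KNOWN_HEADERS` and all function names are assumptions.

```python
from collections import Counter
from typing import Dict, Mapping, Optional

# Hypothetical clean header text for two of the output columns.
KNOWN_HEADERS: Mapping[str, str] = {
    "strm_rationale": "STRM Rationale",
    "strm_strength": "STRM Strength",
}

# The five canonical relationship types, keyed by their letters only.
CANONICAL: Mapping[str, str] = {
    "equal": "Equal",
    "subsetof": "Subset Of",
    "supersetof": "Superset Of",
    "intersectswith": "Intersects With",
    "norelationship": "No Relationship",
}


def letter_signature(text: str) -> Counter:
    """Count letters, ignoring case, spacing, and ordering."""
    return Counter(ch for ch in text.lower() if ch.isalpha())


def match_header(cell: str) -> Optional[str]:
    """Map a possibly interleaved header cell to an output column name."""
    signature = letter_signature(cell)
    for column, clean_text in KNOWN_HEADERS.items():
        if signature == letter_signature(clean_text):
            return column
    return None


def keep_best_map(best: Dict[str, int], candidate: Dict[str, int]) -> Dict[str, int]:
    """Keep whichever column map resolved more columns; ties keep the earlier one."""
    return candidate if len(candidate) > len(best) else best


def normalize_relationship(raw: str) -> Optional[str]:
    """Return the canonical relationship type, or None if unrecognized."""
    key = "".join(ch for ch in raw.lower() if ch.isalpha())
    return CANONICAL.get(key)
```

Returning None for unrecognized values lets the caller decide whether a row counts against extraction quality.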

Self-healing: If extraction quality is poor, the extractor falls back to pre-extracted benchmark data. If even that fails, the PDF is flagged for manual review with notes explaining what went wrong.

Zero tolerance for bad data. The pipeline would rather use benchmark data than risk corrupt output.
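The fallback order described above (live extraction, then benchmark data, then manual review) can be expressed as a small control-flow sketch. The callables, the `ExtractionResult` type, and the quality check are placeholders for whatever the real extractor uses.

```python
from typing import Callable, List, NamedTuple

Rows = List[dict]


class ExtractionResult(NamedTuple):
    rows: Rows
    source: str  # "pdf", "benchmark", or "manual_review"
    notes: str   # explains any fallback that happened


def extract_with_fallback(
    pdf_name: str,
    extract: Callable[[str], Rows],
    load_benchmark: Callable[[str], Rows],
    is_acceptable: Callable[[Rows], bool],
) -> ExtractionResult:
    """Try live extraction, then benchmark data, then flag for review."""
    notes: List[str] = []
    try:
        rows = extract(pdf_name)
        if is_acceptable(rows):
            return ExtractionResult(rows, "pdf", "")
        notes.append(f"extraction returned {len(rows)} rows but failed the quality check")
    except Exception as exc:  # any extractor error triggers the fallback
        notes.append(f"extraction raised {exc!r}")
    try:
        rows = load_benchmark(pdf_name)
        if is_acceptable(rows):
            return ExtractionResult(rows, "benchmark", "; ".join(notes))
        notes.append("benchmark data failed the quality check")
    except Exception as exc:
        notes.append(f"benchmark load raised {exc!r}")
    return ExtractionResult([], "manual_review", "; ".join(notes))
```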

5

Data Cleaning Block 2

Every CSV (both the 10 core tables and the 185 STRM extractions) goes through deterministic normalization: headers are converted to snake_case, text is re-encoded as UTF-8, column types are coerced, whitespace is trimmed, and empty strings are replaced with proper nulls. Output is written to csv_cleaned/.

Deterministic: given the same input, this always produces byte-identical output. Critical for hash reproducibility.
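A minimal sketch of such a normalization pass, using only the standard csv module, is shown below. It covers header renaming, BOM and encoding handling, trimming, and null handling; type coercion is omitted, and the snake_case rule is an assumption about how headers are converted.

```python
import csv
import io
import re


def to_snake_case(header: str) -> str:
    """Lowercase the alphanumeric words of a header and join them with '_'."""
    return "_".join(word.lower() for word in re.findall(r"[A-Za-z0-9]+", header))


def clean_csv_bytes(raw: bytes) -> bytes:
    """Normalize one CSV; the same input bytes always yield the same output bytes."""
    text = raw.decode("utf-8-sig")  # also strips a leading BOM if present
    rows = list(csv.reader(io.StringIO(text, newline="")))
    if not rows:
        return b""
    out = io.StringIO(newline="")
    # A fixed line terminator avoids platform-dependent output.
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([to_snake_case(name) for name in rows[0]])
    for row in rows[1:]:
        # Whitespace-only cells become None, which csv writes as an empty field.
        writer.writerow([cell.strip() or None for cell in row])
    return out.getvalue().encode("utf-8")
```

Because nothing in the function depends on the clock, locale, or dictionary ordering, hashing its output is reproducible across runs.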

6

Relationship Mapping Block 3 + Block 4

4 SCF Junction Tables in scf_relationships/:

scf_to_domain.csv — maps each control to its security domain(s)
scf_to_threat.csv — maps controls to threat categories
scf_to_risk.csv — maps controls to risk categories
scf_to_authoritative_source.csv — maps controls to regulatory frameworks

249 Framework Relationship CSVs in framework_relationships/ — one per regulatory framework (NIST 800-53, ISO 27001, PCI DSS, GDPR, etc.). Hundreds of thousands of mapping rows total. All outputs hashed immediately.
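One way to derive a junction table is to split a multi-valued cell into one row per pair. The sketch below assumes the cleaned SCF table stores related codes in a single delimited cell; the column names, the sample codes, and the newline delimiter are hypothetical.

```python
from typing import Iterable, List, Mapping, Tuple


def explode_to_junction(
    rows: Iterable[Mapping[str, str]],
    key_column: str,
    multi_column: str,
    delimiter: str = "\n",
) -> List[Tuple[str, str]]:
    """Return sorted, de-duplicated (key, value) pairs for a junction table."""
    pairs = set()
    for row in rows:
        for value in (row.get(multi_column) or "").split(delimiter):
            value = value.strip()
            if value:
                pairs.add((row[key_column], value))
    # Sorting keeps the output order stable, which keeps the file hash stable.
    return sorted(pairs)
```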

7

STRM Summary Sheets

Four analytical summaries are generated:

Framework_STRM_Summary — per-framework FDE count, relationship type distribution, row count
FDE_to_Control_Mappings — how many FDEs map to each SCF control and via which frameworks
Framework_to_Domain — which SCF domains each framework touches
Framework_Overlap — pairwise overlap between frameworks (shared controls, threshold ≥ 5)
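The pairwise overlap summary can be illustrated with set intersections. This sketch assumes each framework has already been reduced to the set of SCF control IDs it maps to; the function name and the sample IDs are illustrative.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def framework_overlap(
    controls_by_framework: Dict[str, Set[str]], threshold: int = 5
) -> List[Tuple[str, str, int]]:
    """List framework pairs that share at least `threshold` SCF controls."""
    results = []
    for left, right in combinations(sorted(controls_by_framework), 2):
        shared = controls_by_framework[left] & controls_by_framework[right]
        if len(shared) >= threshold:
            results.append((left, right, len(shared)))
    return results
```

With 249 frameworks this evaluates 249 × 248 / 2 = 30,876 pairs, which is cheap for in-memory sets.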

8

Validation & Hash Chain Block 6

The self-auditing phase. The pipeline independently computes what the correct numbers should be, then checks every one.

7-Block Hash Chain

Block 0: Source Acquisition → SHA-256 of downloaded source files

↓ previous_hash

Block 1: Raw Data Split → SHA-256 of raw CSVs

↓ previous_hash

Block 2: Data Cleaning → SHA-256 of cleaned CSVs

↓ previous_hash

Block 3: Relationship Extract → SHA-256 of junction tables + STRM

↓ previous_hash

Block 4: Framework Mapping → SHA-256 of 249 framework CSVs

↓ previous_hash

Block 5: STRM Extraction → SHA-256 of 185 extracted CSVs

↓ previous_hash

Block 6: Final Validation → SHA-256 of expected totals + all previous hashes
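The linking shown above can be sketched as follows. The exact serialization the pipeline hashes is not specified, so the canonical JSON encoding, the all-zero previous_hash for Block 0, and the field names are assumptions.

```python
import hashlib
import json
from typing import Dict, List, Tuple

GENESIS = "0" * 64  # assumed previous_hash for Block 0


def seal(index: int, name: str, content_hash: str, previous_hash: str) -> Dict[str, object]:
    """Build one block whose hash covers its content and its predecessor."""
    body = {
        "index": index,
        "name": name,
        "content_hash": content_hash,
        "previous_hash": previous_hash,
    }
    encoded = json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return {**body, "block_hash": hashlib.sha256(encoded).hexdigest()}


def build_chain(stages: List[Tuple[str, str]]) -> List[Dict[str, object]]:
    """Chain (name, content_hash) stages in order, Block 0 first."""
    chain = []
    previous = GENESIS
    for index, (name, content_hash) in enumerate(stages):
        block = seal(index, name, content_hash, previous)
        chain.append(block)
        previous = block["block_hash"]
    return chain


def verify_chain(chain: List[Dict[str, object]]) -> bool:
    """Recompute every block hash and check each previous_hash link."""
    previous = GENESIS
    for block in chain:
        resealed = seal(block["index"], block["name"],
                        block["content_hash"], block["previous_hash"])
        if block["previous_hash"] != previous:
            return False
        if resealed["block_hash"] != block["block_hash"]:
            return False
        previous = block["block_hash"]
    return True
```

Changing any stage's content hash invalidates that block and every later link, which is what makes the chain tamper-evident.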

17 Validation Checks:

#	Check	Expected
1	expected_totals.json integrity	SHA-256 match
2	Source mirror file count	194
3	STRM PDFs on disk	185
4	Raw CSV count	10
5	Clean CSV count	10
6	Framework relationship count	249
7	SCF junction table count	4
8	Extracted CSV count	185
9	Total mapping rows	468,814
10	Provenance artifact count	194
11	Relationship type distribution	all 6 types
12	Rationale type distribution	all match
13	Strength value distribution	all match
14	Per-PDF row counts	all 185
15	Core table record counts	all 10
16	Total unique FDEs	34,696
17	Framework relationship total rows	exact match

If any check fails, the pipeline refuses to produce output. No exceptions.
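The fail-closed behavior can be sketched as a comparison of expected and observed totals that raises before any output is written. The exception class, function names, and dictionary keys are hypothetical; the counts in the usage example are taken from the checks listed in this section.

```python
from typing import Dict, List


class ValidationFailed(Exception):
    """Raised when any observed total differs from its expected value."""


def find_mismatches(expected: Dict[str, object], actual: Dict[str, object]) -> List[str]:
    """Describe every expected total that is missing or different."""
    failures = []
    for name, want in expected.items():
        got = actual.get(name)
        if got != want:
            failures.append(f"{name}: expected {want!r}, got {got!r}")
    return failures


def validate_or_abort(expected: Dict[str, object], actual: Dict[str, object]) -> None:
    """Return silently on success; raise so the caller never builds output."""
    failures = find_mismatches(expected, actual)
    if failures:
        raise ValidationFailed("; ".join(failures))
```

For example, `validate_or_abort({"framework_count": 249, "strm_pdf_count": 185}, observed)` raises if `observed` reports 248 frameworks.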

9

Merkle Tree & Proofs

A 27-leaf Merkle tree is built from the expected totals. Each leaf is a named data point (e.g., "framework_count: 249"). Leaves are SHA-256 hashed, then paired and re-hashed up the tree. The root hash represents the entire verified state of all pipeline output.

Per-leaf portable proofs are written — each contains the leaf data plus the sibling hashes needed to recompute the root. Anyone can independently verify that a specific number was part of the verified dataset.
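A compact sketch of the tree and its proofs follows. Details the description leaves open are filled with assumptions: hashes are hex strings concatenated before re-hashing, and an odd node at the end of a level is paired with itself. The real pipeline may use a different pairing rule, in which case roots would not match this code.

```python
import hashlib
from typing import List, Tuple

Proof = List[Tuple[str, bool]]  # (sibling hash, sibling is on the left)


def _h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def _parent(left: str, right: str) -> str:
    return _h((left + right).encode("ascii"))


def build_levels(leaves: List[str]) -> List[List[str]]:
    """Return every tree level, leaf hashes first, root level last.

    Requires at least one leaf.
    """
    level = [_h(leaf.encode("utf-8")) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        padded = level + [level[-1]] if len(level) % 2 else level
        level = [_parent(padded[i], padded[i + 1]) for i in range(0, len(padded), 2)]
        levels.append(level)
    return levels


def make_proof(levels: List[List[str]], index: int) -> Proof:
    """Collect the sibling hashes needed to rebuild the root from one leaf."""
    proof = []
    for level in levels[:-1]:
        sibling_index = index ^ 1
        # An unpaired last node is its own sibling, matching build_levels.
        sibling = level[sibling_index] if sibling_index < len(level) else level[index]
        proof.append((sibling, sibling_index < index))
        index //= 2
    return proof


def verify_proof(leaf: str, proof: Proof, root: str) -> bool:
    """Recompute the root from a leaf and its proof."""
    node = _h(leaf.encode("utf-8"))
    for sibling, sibling_is_left in proof:
        node = _parent(sibling, node) if sibling_is_left else _parent(node, sibling)
    return node == root
```

For 27 leaves the levels have 27, 14, 7, 4, 2, and 1 nodes, so each proof carries five sibling hashes.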

10

SCF Compliance Atlas Final Deliverable

The pipeline builds the final Excel deliverable:

All 10 cleaned core tables as sheets
All 249 framework relationship CSVs as sheets
4 STRM summary sheets
Analytical sheets: Framework Health Dashboard, Domain Risk Heatmap, Risk/Threat/Evidence Exploded views, ERL Framework Coverage, Full Control Profile
Metadata sheets: _Counts, _Schema, _Relationships, _Provenance, _Data_Model

Result: a single 92+ MB Excel file with 270+ sheets, every one traceable through the hash chain to the original SCF source.

11

Evidence Archive & Notification

This step creates a dated evidence zip containing all verification artifacts, then packages everything into the "It Is Finished" folder: the Compliance Atlas Excel, the evidence zip, and the provenance sidecar JSON. It archives previous versions and sends completion notification emails to all stakeholders.

12

Blockchain Anchoring SFVT on Base L2

Two independent Merkle roots are anchored on-chain via the SecureFrameworkVerificationToken contract:

Main Verification Track — root hash of the 27-leaf Merkle tree covering all pipeline outputs
Authoritative Source Audit Track — independent root hash reading only original STRM PDFs

The anchoring transaction embeds the root hash in the transaction data. Once confirmed on Base, this hash is permanent and publicly verifiable. 19 category wallets hold SFVT token balances equal to each verified count.

Run Timeline

Phase	Duration	What's Happening
Detection	< 60s	Watcher detects new commit
Archive	~5s	Previous output backed up
Source Download	2-5 min	Full repo + 185 PDFs downloaded
Raw Split	~10s	Excel split into 10 CSVs
STRM Extraction	10-12 min	185 PDFs scanned, tables extracted
Data Cleaning	~30s	All CSVs normalized
Relationship Mapping	~1 min	249 framework CSVs + 4 junctions built
STRM Summaries	~20s	4 analytical summary sheets
Validation	~30s	17 checks, 7-block chain, Merkle tree
Atlas Build	2-3 min	270+ sheet Excel assembled
Archive + Notify	~30s	Zip, email, done
Total	~15-20 min	Fully automated, zero human intervention

What Makes This Trustworthy

Hash Chain
Independent Verification
Zero Tolerance
Deterministic
Blockchain Anchoring
Dual-Track
Self-Healing Extraction
Provenance at Every Step