Benchmark Methodology
Overview
This benchmark compares two PII (Personally Identifiable Information) detection tools: Priivacy (Rust-based) and Presidio (Microsoft's Python-based toolkit).
We measure both accuracy (precision, recall, F1 score) and speed (records processed per second) using a standardized dataset.
Benchmark Approach
Containerization
Each tool runs in an isolated Docker container to ensure a fair comparison (a sketch of the container invocation follows this list):
- Priivacy: Pre-built Rust binary in minimal Python container
- Presidio: Python container with presidio-analyzer and spacy models
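The exact container protocol is not spelled out here, but as a hedged sketch (assuming each container reads a JSON record on stdin and writes a JSON reply on stdout), the orchestrator might invoke a tool roughly as follows. The image name and the request/response fields are illustrative assumptions, not the benchmark's actual contract.

```python
# Hypothetical sketch: pipe a single record to a tool's container over stdin.
# Image names and the JSON request/response shape are assumptions.
import json
import subprocess

def run_in_container(image: str, text: str) -> dict:
    """Send one record to a tool container on stdin and parse its JSON reply."""
    request = json.dumps({"text": text})
    completed = subprocess.run(
        ["docker", "run", "--rm", "-i", image],  # -i keeps stdin open for the record
        input=request.encode("utf-8"),
        stdout=subprocess.PIPE,
        check=True,
    )
    return json.loads(completed.stdout)

# Example call (assumed image tag):
# reply = run_in_container("pii-benchmark/priivacy:latest", "Call John at 555-0100")
```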
Pure Timing Measurement
We measure only the PII detection processing time, excluding:
- Benchmark orchestration overhead
- File I/O operations
- Docker container startup time
- Result serialization
Each tool's container runner measures processing time with a high-precision timer
(Python's time.perf_counter()) wrapped immediately around the detection call.
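As a minimal illustration of this approach, a Presidio runner might place only the analyze() call inside the timed region. The surrounding code is a sketch, not the benchmark's actual runner.

```python
# Minimal timing sketch: only the detection call sits inside the timed region.
# Engine construction, I/O, and serialization are deliberately excluded.
import time
from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()  # built once, outside the timed region

def timed_detect(text: str):
    start = time.perf_counter()                           # high-precision timer
    results = analyzer.analyze(text=text, language="en")  # the only timed work
    elapsed = time.perf_counter() - start
    return results, elapsed
```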
Entity Type Filtering
We evaluate only the entity types that are present in the dataset's ground truth. This ensures a fair comparison: tools are not penalized for detecting additional entity types that the test data does not annotate.
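As a rough sketch of this filtering step (field and function names are assumptions), predictions whose entity type never appears in the dataset's ground truth are simply dropped before scoring:

```python
# Illustrative filtering: keep only predictions whose entity type occurs
# somewhere in the dataset's ground-truth annotations.
def filter_to_ground_truth_types(predictions, dataset_ground_truth):
    allowed = {gt["entity_type"] for gt in dataset_ground_truth}
    return [p for p in predictions if p["entity_type"] in allowed]
```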
Metrics Explained
Precision
Precision = True Positives / (True Positives + False Positives)
Measures what fraction of the detected entities were correct. High precision means few false alarms.
Recall
Recall = True Positives / (True Positives + False Negatives)
Measures what fraction of the actual (ground-truth) entities were found. High recall means few missed detections.
F1 Score
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Harmonic mean of precision and recall. Provides a single score balancing both metrics. Range: 0.0 (worst) to 1.0 (perfect).
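All three metrics follow directly from true-positive, false-positive, and false-negative counts. The helper below is an illustrative sketch (not the benchmark's code) with guards for empty denominators.

```python
# Worked sketch of precision, recall, and F1 from raw counts.
def compute_metrics(tp: int, fp: int, fn: int) -> dict:
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Example: 8 correct detections, 2 false alarms, 2 misses
# compute_metrics(8, 2, 2) -> precision 0.8, recall 0.8, F1 0.8
```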
Entity Matching
A predicted entity matches ground truth if and only if:
- Start position matches exactly
- End position matches exactly
- Entity type matches exactly
This is a strict matching criterion: partial overlaps are counted as misses (a false positive for the prediction and a false negative for the unmatched ground-truth entity).
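A hedged sketch of this strict rule (field names are assumed): a prediction is a true positive only when its (start, end, type) triple appears in the ground truth.

```python
# Strict exact-span matching: a prediction counts only if start, end, and
# entity type all match a ground-truth annotation exactly.
def count_matches(predictions, ground_truth):
    gt_spans = {(g["start"], g["end"], g["entity_type"]) for g in ground_truth}
    tp = sum(1 for p in predictions
             if (p["start"], p["end"], p["entity_type"]) in gt_spans)
    fp = len(predictions) - tp
    fn = len(ground_truth) - tp
    return tp, fp, fn
```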
Execution Pipeline
1. Load the dataset (JSONL format with ground truth annotations)
2. Send each record to the tool's Docker container via stdin
3. Tool processes the text and returns predictions with timing
4. Compare predictions to ground truth (exact span matching)
5. Calculate per-entity-type metrics (precision, recall, F1)
6. Calculate overall aggregate metrics
7. Generate cross-tool comparison data (a simplified orchestration sketch follows this list)
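Tying the steps above together, a simplified orchestration loop might look like the sketch below. It reuses the illustrative helpers from earlier sections (run_in_container, filter_to_ground_truth_types, count_matches, compute_metrics), assumes JSONL records with "text" and "entities" fields and a tool reply with "entities" and "seconds" fields, and reports only aggregate metrics rather than the per-entity-type breakdown.

```python
# Simplified benchmark loop: aggregate metrics only; field names and helpers
# are the illustrative ones sketched earlier, not the benchmark's actual code.
import json

def run_benchmark(dataset_path: str, image: str) -> dict:
    tp = fp = fn = 0
    n_records = 0
    total_seconds = 0.0
    with open(dataset_path, encoding="utf-8") as f:
        for line in f:                          # deterministic processing order
            record = json.loads(line)           # JSONL: one annotated record per line
            reply = run_in_container(image, record["text"])
            total_seconds += reply["seconds"]   # tool-reported pure detection time
            preds = filter_to_ground_truth_types(reply["entities"], record["entities"])
            r_tp, r_fp, r_fn = count_matches(preds, record["entities"])
            tp, fp, fn = tp + r_tp, fp + r_fp, fn + r_fn
            n_records += 1
    summary = compute_metrics(tp, fp, fn)
    summary["records_per_second"] = n_records / total_seconds if total_seconds else 0.0
    return summary
```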
Reproducibility
All benchmark runs use:
- Fixed dataset (no randomization)
- Deterministic processing order
- Versioned tool containers
- Timestamped results for historical tracking
Results should be reproducible within ±5% for speed metrics (due to system load variations). Accuracy metrics (precision, recall, F1) are deterministic.