Benchmark Methodology

Overview

This benchmark compares two PII (Personally Identifiable Information) detection tools: Priivacy (Rust-based) and Presidio (Microsoft's Python-based analyzer).

We measure both accuracy (precision, recall, F1 score) and speed (records processed per second) using a standardized dataset.

Benchmark Approach

Containerization

Each tool runs in an isolated Docker container to ensure a fair comparison (a minimal invocation sketch follows the list):

  • Priivacy: Pre-built Rust binary invoked from a minimal Python runner container
  • Presidio: Python container with presidio-analyzer and spaCy models
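
The sketch below illustrates only the isolation boundary: one record in over stdin, one JSON reply out. The image tags (priivacy-bench, presidio-bench) and the one-container-per-record pattern are assumptions for illustration; the real harness may use different names or keep a long-lived container per tool.

    import json
    import subprocess

    # Hypothetical image tags; the real containers may be named differently.
    IMAGES = {
        "priivacy": "priivacy-bench:latest",
        "presidio": "presidio-bench:latest",
    }

    def run_in_container(tool: str, record: dict) -> dict:
        """Send one record to the tool's container over stdin, read JSON from stdout."""
        proc = subprocess.run(
            ["docker", "run", "--rm", "-i", IMAGES[tool]],
            input=json.dumps(record).encode(),
            capture_output=True,
            check=True,
        )
        return json.loads(proc.stdout)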

Pure Timing Measurement

We measure only the PII detection processing time, excluding:

  • Benchmark orchestration overhead
  • File I/O operations
  • Docker container startup time
  • Result serialization

Each tool's container runner measures processing time using high-precision timers (Python's time.perf_counter()) around the actual detection call.
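
A minimal sketch of such a container runner is shown below. The detect_pii stub and the output field names (predictions, processing_seconds) are assumptions for illustration; each real runner wraps its own tool's detection call (for example, Presidio's AnalyzerEngine.analyze).

    import json
    import sys
    import time

    def detect_pii(text: str) -> list:
        # Stand-in for the tool's actual detection call.
        raise NotImplementedError

    def main() -> None:
        record = json.loads(sys.stdin.read())      # I/O happens outside the timed region
        start = time.perf_counter()                # timer starts right before detection
        predictions = detect_pii(record["text"])
        elapsed = time.perf_counter() - start      # timer stops right after detection
        # Result serialization is also excluded from the measurement.
        json.dump({"predictions": predictions, "processing_seconds": elapsed}, sys.stdout)

    if __name__ == "__main__":
        main()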

Entity Type Filtering

We evaluate only the entity types that are present in the dataset ground truth. This ensures a fair comparison: tools are not penalized for detecting additional entity types that are not annotated in the test data.
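
As a sketch, this filtering step can be as simple as the function below; the entity_type field name is an assumption about the record schema.

    def filter_to_ground_truth_types(predictions, ground_truth):
        """Keep only predicted entity types that appear in the ground truth."""
        allowed = {gt["entity_type"] for gt in ground_truth}
        return [p for p in predictions if p["entity_type"] in allowed]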

Metrics Explained

Precision

Precision = True Positives / (True Positives + False Positives)

Measures what fraction of detected entities were correct. High precision means few false alarms.

Recall

Recall = True Positives / (True Positives + False Negatives)

Measures what fraction of the actual entities were found. High recall means few missed detections.

F1 Score

F1 = 2 × (Precision × Recall) / (Precision + Recall)

Harmonic mean of precision and recall. Provides a single score balancing both metrics. Range: 0.0 (worst) to 1.0 (perfect).
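
The three formulas above translate directly into a small helper; the zero-division guards are a common convention for when a tool produces no predictions or an entity type has no ground-truth instances.

    def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
        """Compute precision, recall, and F1, returning 0.0 when a denominator is zero."""
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        return precision, recall, f1

    # Example: 80 correct detections, 10 false alarms, 20 misses
    print(precision_recall_f1(80, 10, 20))   # -> (0.888..., 0.8, 0.842...)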

Entity Matching

A predicted entity matches ground truth if and only if:

  • Start position matches exactly
  • End position matches exactly
  • Entity type matches exactly

This is a strict matching criterion: a partially overlapping prediction counts as both a false positive and a missed ground-truth entity (false negative).
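
With exact matching, counting reduces to set operations on (start, end, type) triples, as in the sketch below; the field names are assumptions about the record schema.

    def match_exact(predicted, ground_truth):
        """Strict matching: a prediction counts only if start, end, and type all agree."""
        pred = {(p["start"], p["end"], p["entity_type"]) for p in predicted}
        gold = {(g["start"], g["end"], g["entity_type"]) for g in ground_truth}
        tp = len(pred & gold)   # exact matches
        fp = len(pred - gold)   # predictions with no exact counterpart
        fn = len(gold - pred)   # ground-truth entities that were missed
        return tp, fp, fn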

Execution Pipeline

  1. Load dataset (JSONL format with ground truth annotations)
  2. Send each record to tool's Docker container via stdin
  3. Tool processes text and returns predictions with timing
  4. Compare predictions to ground truth (exact span matching)
  5. Calculate per-entity-type metrics (precision, recall, F1)
  6. Calculate overall aggregate metrics
  7. Generate cross-tool comparison data
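
Putting steps 1-5 together, the evaluation loop might look like the sketch below. run_tool stands in for the container call shown earlier, and the "entities" field name is an assumption about the dataset schema; the overall aggregate (step 6) sums the same counters across all entity types.

    import json
    from collections import Counter

    def evaluate(dataset_path: str, run_tool) -> dict:
        """Per-entity-type precision/recall/F1 over a JSONL dataset."""
        counts = Counter()                                   # keys: (entity_type, "tp"/"fp"/"fn")
        with open(dataset_path) as f:
            for line in f:                                   # step 1: load JSONL records
                record = json.loads(line)
                preds = run_tool(record["text"])             # steps 2-3: detect via container
                gold = {(g["start"], g["end"], g["entity_type"]) for g in record["entities"]}
                pred = {(p["start"], p["end"], p["entity_type"]) for p in preds}
                for _, _, t in pred & gold:                  # step 4: exact span matching
                    counts[(t, "tp")] += 1
                for _, _, t in pred - gold:
                    counts[(t, "fp")] += 1
                for _, _, t in gold - pred:
                    counts[(t, "fn")] += 1
        metrics = {}
        for etype in {t for t, _ in counts}:                 # step 5: per-type metrics
            tp, fp, fn = (counts[(etype, k)] for k in ("tp", "fp", "fn"))
            p = tp / (tp + fp) if tp + fp else 0.0
            r = tp / (tp + fn) if tp + fn else 0.0
            metrics[etype] = {"precision": p, "recall": r,
                              "f1": 2 * p * r / (p + r) if p + r else 0.0}
        return metrics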

Reproducibility

All benchmark runs use:

  • Fixed dataset (no randomization)
  • Deterministic processing order
  • Versioned tool containers
  • Timestamped results for historical tracking

Speed metrics should be reproducible to within ±5% (due to system load variations); accuracy metrics (precision, recall, F1) are deterministic.