Dataset Information

Dataset: open-pii-masking-500k-ai4privacy

Total Records: 500,000

Language Coverage: Multilingual

Entity Types: 10-20

Source

This dataset is designed for training and evaluating PII detection and masking systems. It contains real-world text samples with annotated ground truth for various entity types.

Format: JSONL (JSON Lines) with one record per line

Record Structure

Each record contains:

id: Unique record identifier
text: The text content to analyze
language: Language code (e.g., "en", "de", "fr")
entities: Array of ground truth annotations with:
- entity_type (e.g., "EMAIL_ADDRESS", "PERSON", "SSN")
- start and end positions (character offsets)
- text (the actual entity text)
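
As a minimal sketch of how a record with this structure might be read and sanity-checked, the Python below parses JSONL text and confirms that each entity's offsets reproduce its text. The sample record is invented for illustration, and the helper names (parse_jsonl, check_offsets) are hypothetical. It also assumes that end is an exclusive offset, which the description above does not state.

```python
import json

# Invented sample line that follows the record structure described above.
SAMPLE_JSONL = (
    '{"id": "rec-1", "text": "Contact Ana at ana@example.com", "language": "en", '
    '"entities": ['
    '{"entity_type": "PERSON", "start": 8, "end": 11, "text": "Ana"}, '
    '{"entity_type": "EMAIL_ADDRESS", "start": 15, "end": 30, "text": "ana@example.com"}]}\n'
)


def parse_jsonl(raw):
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in raw.splitlines() if line.strip()]


def check_offsets(record):
    """Return the entities whose [start, end) slice does not match their text.

    Assumption: `end` is exclusive. If the dataset uses inclusive ends,
    the slice would need to be text[start:end + 1] instead.
    """
    text = record["text"]
    return [
        entity
        for entity in record["entities"]
        if text[entity["start"]:entity["end"]] != entity["text"]
    ]


records = parse_jsonl(SAMPLE_JSONL)
mismatches = check_offsets(records[0])
print(len(records), "record(s),", len(mismatches), "offset mismatch(es)")
```

Running a check like this before scoring can surface offset conventions (inclusive versus exclusive ends, byte versus character offsets) that would otherwise distort span-level metrics.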

Entity Types

The dataset includes various PII entity types such as:

EMAIL_ADDRESS: Email addresses
PHONE_NUMBER: Phone numbers (various formats)
PERSON: Person names
SSN: Social Security Numbers (US)
CREDIT_CARD: Credit card numbers
IP_ADDRESS: IPv4/IPv6 addresses
URL: Web URLs
LOCATION: Physical locations
And more...

The exact entity types present are determined by the dataset ground truth annotations. The benchmark only evaluates entity types that exist in the ground truth.
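
One way to restrict evaluation to the types present in the ground truth is sketched below. The benchmark's actual filtering code is not shown here, so the function names (ground_truth_types, filter_predictions) and the sample data are assumptions made for illustration.

```python
def ground_truth_types(records):
    """Collect every entity_type that appears in the ground-truth annotations."""
    return {entity["entity_type"] for record in records for entity in record["entities"]}


def filter_predictions(predictions, allowed_types):
    """Keep only predictions whose entity_type occurs in the ground truth."""
    return [p for p in predictions if p["entity_type"] in allowed_types]


# Invented ground-truth records and tool predictions.
gt_records = [
    {"entities": [{"entity_type": "PERSON", "start": 0, "end": 3, "text": "Ana"}]},
    {"entities": [{"entity_type": "EMAIL_ADDRESS", "start": 4, "end": 19, "text": "ana@example.com"}]},
]
predictions = [
    {"entity_type": "PERSON", "start": 0, "end": 3},
    {"entity_type": "DATE_TIME", "start": 10, "end": 14},  # type absent from ground truth
]

allowed = ground_truth_types(gt_records)
kept = filter_predictions(predictions, allowed)
print(sorted(allowed), len(kept))
```

Under this approach, a prediction of a type the dataset never annotates (DATE_TIME in the sample) is dropped rather than counted as a false positive.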

Dataset Characteristics

Multilingual: Includes text in multiple languages
Real-world scenarios: Diverse text contexts (emails, documents, forms, etc.)
Multiple entity types per record: Records often contain several different PII types
Varying difficulty: Mix of obvious and challenging PII patterns
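
The first and third characteristics can be checked directly from the records. The sketch below counts records per language and records that carry more than one distinct entity type. The summarize helper and the three sample records are hypothetical and only mirror the field names listed under Record Structure.

```python
from collections import Counter


def summarize(records):
    """Return (records per language, number of records with 2+ distinct entity types)."""
    languages = Counter(record["language"] for record in records)
    multi_type = sum(
        1
        for record in records
        if len({entity["entity_type"] for entity in record["entities"]}) > 1
    )
    return languages, multi_type


# Invented records; only the fields needed for the summary are included.
sample = [
    {"language": "en", "entities": [{"entity_type": "PERSON"}, {"entity_type": "EMAIL_ADDRESS"}]},
    {"language": "de", "entities": [{"entity_type": "PHONE_NUMBER"}]},
    {"language": "en", "entities": []},
]

languages, multi_type = summarize(sample)
print(dict(languages), multi_type)
```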

Quality Assurance

Ground-truth annotations are curated to support accurate evaluation. The dataset is split into training and test sets, but this benchmark uses it for testing only, since neither tool was trained on this data.