Dataset Information

Dataset: open-pii-masking-500k-ai4privacy

  • Total Records: 500,000
  • Language Coverage: Multilingual
  • Entity Types: 10-20

Source

This dataset is designed for training and evaluating PII detection and masking systems. It contains real-world text samples with annotated ground truth for various entity types.

Format: JSONL (JSON Lines) with one record per line

Record Structure

Each record contains:

  • id: Unique record identifier
  • text: The text content to analyze
  • language: Language code (e.g., "en", "de", "fr")
  • entities: Array of ground truth annotations with:
    • entity_type (e.g., "EMAIL_ADDRESS", "PERSON", "SSN")
    • start and end positions (character offsets)
    • text (the actual entity text)
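
As a minimal sketch of reading this format, the snippet below parses JSONL lines and checks that each annotation's offsets reproduce its text field. The sample line is made up for illustration and is not taken from the dataset. The helper names parse_records and misaligned_entities are hypothetical, and the check assumes that end is an exclusive offset, which the description above does not state.

```python
import json

# Hypothetical sample line, invented for illustration (not from the dataset).
SAMPLE_JSONL = (
    '{"id": "rec-1", "text": "Contact Jane Doe at jane@example.com", '
    '"language": "en", "entities": ['
    '{"entity_type": "PERSON", "start": 8, "end": 16, "text": "Jane Doe"}, '
    '{"entity_type": "EMAIL_ADDRESS", "start": 20, "end": 36, '
    '"text": "jane@example.com"}]}\n'
)


def parse_records(lines):
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def misaligned_entities(record):
    """Return annotations whose offsets do not reproduce their text field.

    Assumption: `end` is exclusive, so text[start:end] is the entity.
    """
    text = record["text"]
    return [
        entity
        for entity in record["entities"]
        if text[entity["start"]:entity["end"]] != entity["text"]
    ]


records = list(parse_records(SAMPLE_JSONL.splitlines()))
for record in records:
    bad = misaligned_entities(record)
    print(record["id"], record["language"], "misaligned:", len(bad))
```

If the real data uses inclusive end offsets, the slice would need to be text[start:end + 1] instead; running the check on a few records is a quick way to find out which convention applies.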

Entity Types

The dataset includes various PII entity types such as:

  • EMAIL_ADDRESS: Email addresses
  • PHONE_NUMBER: Phone numbers (various formats)
  • PERSON: Person names
  • SSN: Social Security Numbers (US)
  • CREDIT_CARD: Credit card numbers
  • IP_ADDRESS: IPv4/IPv6 addresses
  • URL: Web URLs
  • LOCATION: Physical locations
  • And more...

The exact set of entity types is determined by the dataset's ground truth annotations, and the benchmark evaluates only the types that appear there.
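
One possible reading of this rule is sketched below: collect the entity types that occur anywhere in the ground truth, then drop predictions of any other type before scoring. The function names, the toy records, and the prediction format are assumptions made for illustration. Whether the benchmark filters against the whole dataset or record by record is not stated here, so the dataset-wide set is also an assumption.

```python
def ground_truth_types(records):
    """Collect every entity_type that appears in the ground truth."""
    return {
        entity["entity_type"]
        for record in records
        for entity in record["entities"]
    }


def filter_predictions(predictions, allowed_types):
    """Keep only predictions whose type exists in the ground truth."""
    return [p for p in predictions if p["entity_type"] in allowed_types]


# Toy data, invented for illustration (not from the dataset).
records = [
    {"id": "rec-1", "entities": [
        {"entity_type": "PERSON", "start": 0, "end": 4, "text": "Jane"}]},
    {"id": "rec-2", "entities": [
        {"entity_type": "EMAIL_ADDRESS", "start": 0, "end": 7, "text": "a@b.com"}]},
]
predictions = [
    {"entity_type": "PERSON", "start": 0, "end": 4},
    {"entity_type": "DATE_TIME", "start": 10, "end": 20},  # type absent from ground truth
]

allowed = ground_truth_types(records)
scored = filter_predictions(predictions, allowed)
print(sorted(allowed))
print(scored)
```

Filtering this way keeps a tool from being penalized with false positives for entity types the dataset never annotates.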

Dataset Characteristics

  • Multilingual: Includes text in multiple languages
  • Real-world scenarios: Diverse text contexts (emails, documents, forms, etc.)
  • Multiple entity types per record: Records often contain several different PII types
  • Varying difficulty: Mix of obvious and challenging PII patterns

Quality Assurance

Ground truth annotations are curated to support accurate evaluation. The dataset is split into training and test sets, but this benchmark uses it for testing only; neither tool was trained on this data.