Dataset Information
Dataset: open-pii-masking-500k-ai4privacy
Source
This dataset is designed for training and evaluating PII detection and masking systems. It contains real-world text samples with annotated ground truth for various entity types.
Format: JSONL (JSON Lines) with one record per line
Record Structure
Each record contains:
- id: Unique record identifier
- text: The text content to analyze
- language: Language code (e.g., "en", "de", "fr")
- entities: Array of ground truth annotations, each with:
  - entity_type: Entity label (e.g., "EMAIL_ADDRESS", "PERSON", "SSN")
  - start and end: Positions of the entity (character offsets)
  - text: The actual entity text
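As a rough illustration of the record structure above, the following Python sketch parses JSONL lines and checks that each entity's offsets reproduce its text field. The helper names (iter_records, misaligned_entities) and the sample record are hypothetical, and the check assumes that end is an exclusive offset, which the description above does not state.

```python
import json
from typing import Dict, Iterable, Iterator, List


def iter_records(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield one parsed record per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def misaligned_entities(record: Dict) -> List[Dict]:
    """Return entities whose offsets do not reproduce their 'text' field.

    Assumption: 'end' is exclusive, so text[start:end] is the entity span.
    """
    text = record["text"]
    return [
        entity
        for entity in record["entities"]
        if text[entity["start"]:entity["end"]] != entity["text"]
    ]


# Hypothetical record that follows the field list above (not taken from the dataset).
sample_line = json.dumps({
    "id": "example-0001",
    "text": "Contact Jane Doe at jane@example.com.",
    "language": "en",
    "entities": [
        {"entity_type": "PERSON", "start": 8, "end": 16, "text": "Jane Doe"},
        {"entity_type": "EMAIL_ADDRESS", "start": 20, "end": 36,
         "text": "jane@example.com"},
    ],
})

# The empty string stands in for a blank line, which iter_records skips.
records = list(iter_records([sample_line, ""]))
for record in records:
    print(record["id"], "misaligned entities:", len(misaligned_entities(record)))
```

If the dataset turns out to use inclusive end offsets, the slice would need to be text[start:end + 1] instead. Running a check like this before scoring helps catch offset mismatches early.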
Entity Types
The dataset includes various PII entity types such as:
- EMAIL_ADDRESS: Email addresses
- PHONE_NUMBER: Phone numbers (various formats)
- PERSON: Person names
- SSN: Social Security Numbers (US)
- CREDIT_CARD: Credit card numbers
- IP_ADDRESS: IPv4/IPv6 addresses
- URL: Web URLs
- LOCATION: Physical locations
- And more...
The exact entity types present are determined by the dataset ground truth annotations. The benchmark only evaluates entity types that exist in the ground truth.
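One way to read the rule above: collect the set of entity types that appear in the ground truth, then ignore any prediction whose type falls outside that set. The Python sketch below illustrates this under assumptions. The function names, the prediction format (dicts with an entity_type key), and the DATE_TIME label are hypothetical and not taken from the benchmark's code.

```python
from typing import Dict, Iterable, List, Set


def ground_truth_types(records: Iterable[Dict]) -> Set[str]:
    """Collect every entity_type that occurs in the ground truth annotations."""
    return {
        entity["entity_type"]
        for record in records
        for entity in record["entities"]
    }


def filter_predictions(predictions: Iterable[Dict], allowed: Set[str]) -> List[Dict]:
    """Keep only predictions whose type exists in the ground truth."""
    return [p for p in predictions if p["entity_type"] in allowed]


# Hypothetical ground truth records (only the fields needed here).
records = [
    {"entities": [{"entity_type": "EMAIL_ADDRESS", "start": 0, "end": 5, "text": "a@b.c"}]},
    {"entities": [{"entity_type": "PERSON", "start": 0, "end": 4, "text": "Jane"}]},
]

# Hypothetical tool output; DATE_TIME has no ground truth, so it is not scored.
predictions = [
    {"entity_type": "EMAIL_ADDRESS", "start": 0, "end": 5},
    {"entity_type": "DATE_TIME", "start": 10, "end": 20},
]

allowed = ground_truth_types(records)
kept = filter_predictions(predictions, allowed)
print(sorted(allowed), len(kept))
```

Filtering this way keeps a tool from being penalized with false positives for entity types the dataset never annotates.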
Dataset Characteristics
- Multilingual: Includes text in multiple languages
- Real-world scenarios: Diverse text contexts (emails, documents, forms, etc.)
- Multiple entity types per record: Records often contain several different PII types
- Varying difficulty: Mix of obvious and challenging PII patterns
Quality Assurance
Ground truth annotations are curated to support accurate evaluation. The dataset is split into training and test sets, but this benchmark uses it for evaluation only: neither tool was trained on this specific data.