Data Anonymizer - User Guide
This guide provides comprehensive instructions on how to install, configure, and run the Interactive Birdie Internal Anonymizer script. This tool leverages Large Language Models (LLMs) to identify and mask sensitive information in your datasets.
Download Birdie Internal Anonymizer script
Prerequisites
Before you begin, ensure you have the following:
Python installed (version 3.8 or higher recommended).
API credentials for your chosen LLM provider (OpenAI, Anthropic, Google Gemini, or Azure).
Installation and Setup
To ensure a clean environment and avoid dependency conflicts, follow these steps to set up the project.
1. Project Initialization
Extract the project files to a directory of your choice on your local machine.
2. Create a Virtual Environment (Recommended)
Open your terminal or command prompt and run the following:
Create the environment:
python -m venv .venv
Activate the environment:
Windows:
.venv\Scripts\activateLinux/MacOS:
source .venv/bin/activate
3. Install Dependencies
Install the required Python libraries using the provided requirements file:
pip install -r requirements.txt
4. Configure API Keys
You must set your API key as an environment variable so the script can communicate with the LLM.
OpenAI:
export OPENAI_API_KEY='your-key-here'Anthropic:
export ANTHROPIC_API_KEY='your-key-here'Google Gemini:
export GOOGLE_API_KEY='your-key-here'
(Note: On Windows, use set instead of export.)
Running the Anonymizer
The tool provides an interactive, step-by-step CLI (Command Line Interface) to guide you through the process.
Choosing the Right Script
Depending on your file format, run one of the two commands below:
For JSON Files:
python run_anonymizer.py(Note: This converts JSON to CSV before processing)
For CSV Files:
python run_anonymizer_csv.py
The 9-Step Interactive Process
Once the script is running, follow these prompts:
Input File Selection: Provide the path to your source file.
Column Analysis: The tool scans your data structure.
Column Selection: Choose specifically which columns contain PII (Personally Identifiable Information).
Output File Selection: Define where the cleaned file should be saved.
LLM Provider Selection: Choose from OpenAI, Azure, Anthropic, Google, or a Local LLM.
Model Selection: Enter the specific model name (e.g.,
gpt-4oorclaude-3-5-sonnet).Output Language: Select the language for the anonymized text (Default: English).
Processing Data: The tool sends data to the LLM for masking.
Sample Results: Review a preview of the anonymized data before finishing.
Supported LLM Providers
Provider
Default Model
Best For
OpenAI
gpt-4o-mini
General purpose & speed
Anthropic
claude-3-5-sonnet-20241022
High-accuracy reasoning
Google Gemini
gemini-1.5-pro
Large context windows
Local LLM
User Defined
Privacy-sensitive, offline workflows
Frequently Asked Questions
Can I use a local model for privacy?
Yes. If you select Local LLM (Option 5), you can connect to any OpenAI-compatible API (like Ollama or LocalAI) to keep data processing on your own infrastructure.
What happens to my JSON structure?
The tool currently flattens JSON data into a CSV format during the run_anonymizer.py workflow to ensure consistent processing across the LLM providers.
Last updated