Data Anonymizer - User Guide

This guide provides comprehensive instructions on how to install, configure, and run the Interactive Birdie Internal Anonymizer script. This tool leverages Large Language Models (LLMs) to identify and mask sensitive information in your datasets.


Download Birdie Internal Anonymizer script

Download script


Prerequisites

Before you begin, ensure you have the following:

  • Python installed (version 3.8 or higher recommended).

  • API credentials for your chosen LLM provider (OpenAI, Anthropic, Google Gemini, or Azure).


Installation and Setup

To ensure a clean environment and avoid dependency conflicts, follow these steps to set up the project.

1. Project Initialization

Extract the project files to a directory of your choice on your local machine.

Open your terminal or command prompt and run the following:

Create the environment:

python -m venv .venv

Activate the environment:

  • Windows: .venv\Scripts\activate

  • Linux/MacOS: source .venv/bin/activate

3. Install Dependencies

Install the required Python libraries using the provided requirements file:

pip install -r requirements.txt

4. Configure API Keys

You must set your API key as an environment variable so the script can communicate with the LLM.

  • OpenAI: export OPENAI_API_KEY='your-key-here'

  • Anthropic: export ANTHROPIC_API_KEY='your-key-here'

  • Google Gemini: export GOOGLE_API_KEY='your-key-here'

(Note: On Windows, use set instead of export.)


Running the Anonymizer

The tool provides an interactive, step-by-step CLI (Command Line Interface) to guide you through the process.

Choosing the Right Script

Depending on your file format, run one of the two commands below:

  • For JSON Files: python run_anonymizer.py

    (Note: This converts JSON to CSV before processing)

  • For CSV Files: python run_anonymizer_csv.py

The 9-Step Interactive Process

Once the script is running, follow these prompts:

  1. Input File Selection: Provide the path to your source file.

  2. Column Analysis: The tool scans your data structure.

  3. Column Selection: Choose specifically which columns contain PII (Personally Identifiable Information).

  4. Output File Selection: Define where the cleaned file should be saved.

  5. LLM Provider Selection: Choose from OpenAI, Azure, Anthropic, Google, or a Local LLM.

  6. Model Selection: Enter the specific model name (e.g., gpt-4o or claude-3-5-sonnet).

  7. Output Language: Select the language for the anonymized text (Default: English).

  8. Processing Data: The tool sends data to the LLM for masking.

  9. Sample Results: Review a preview of the anonymized data before finishing.


Supported LLM Providers

Provider

Default Model

Best For

OpenAI

gpt-4o-mini

General purpose & speed

Anthropic

claude-3-5-sonnet-20241022

High-accuracy reasoning

Google Gemini

gemini-1.5-pro

Large context windows

Local LLM

User Defined

Privacy-sensitive, offline workflows


Frequently Asked Questions

Can I use a local model for privacy?

Yes. If you select Local LLM (Option 5), you can connect to any OpenAI-compatible API (like Ollama or LocalAI) to keep data processing on your own infrastructure.

What happens to my JSON structure?

The tool currently flattens JSON data into a CSV format during the run_anonymizer.py workflow to ensure consistent processing across the LLM providers.

Last updated