# Data Anonymization Process — Speech-to-Text Pipeline

This page describes the anonymization process that Birdie applies to text produced from customer-service audio (speech-to-text) before that text is stored or used for analysis. It is intended as evidence, for a data-protection review, that the data Birdie retains and processes has had Personally Identifiable Information (PII) **irreversibly removed**, in line with major data-protection regulations (including the GDPR and Brazil's LGPD) and Birdie's SOC 2 Type II posture.

{% hint style="info" %}
**Scope note.** This page describes the anonymization step performed inside Birdie's processing pipeline. The redaction is applied to the transcribed text. Wherever the term "PII" is used, it refers to personal and sensitive data that may appear in the transcript of a customer conversation.
{% endhint %}

## 1. Data Processing Flow

The end-to-end flow has three sequential stages. Anonymization sits **between** transcription and any persistence or downstream use, so that no downstream system ever receives un-redacted transcribed text.

`Source audio → Transcription (speech-to-text) → Anonymization (PII redaction) → Storage / Downstream use`

| Stage                        | Input               | Output                                               |
| ---------------------------- | ------------------- | ---------------------------------------------------- |
| **1 — Transcription**        | Raw audio file      | Raw transcript (may contain PII)                     |
| **2 — Anonymization**        | Raw transcript      | Redacted transcript (PII replaced with `[REDACTED]`) |
| **3 — Storage / downstream** | Redacted transcript | Redacted transcript only                             |

* **Stage 1 — Transcription:** The audio is transcribed to text by Birdie's self-hosted speech-to-text engines. Output: raw transcript, which may contain PII.
* **Stage 2 — Anonymization:** The raw transcript is passed through the two-stage redaction pipeline described in [section 3](#3.-how-the-anonymization-works). Every detected PII span is replaced with the fixed token `[REDACTED]`. Output: redacted transcript.
* **Stage 3 — Storage / downstream:** Only the redacted transcript is persisted and used for downstream analysis. Output: redacted transcript.

{% hint style="info" %}
**Key property:** the anonymization step is a transformation of the text. The original (un-redacted) value of each PII span is **not stored, mapped, or otherwise preserved** anywhere in the pipeline.
{% endhint %}
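
The flow and the key property above can be illustrated with a minimal Python sketch. It is not Birdie's implementation: `transcribe`, `anonymize`, `process`, and `Store` are hypothetical names, and the PII spans are passed in by hand so the example stays self-contained. The point it shows is that storage is only ever called with the redacted text, and that no mapping from `[REDACTED]` back to the original value is produced.

```python
from dataclasses import dataclass, field

REDACTED = "[REDACTED]"


@dataclass
class Store:
    """Hypothetical stand-in for persistent storage."""
    records: list[str] = field(default_factory=list)

    def save(self, redacted_transcript: str) -> None:
        self.records.append(redacted_transcript)


def transcribe(audio: bytes) -> str:
    # Hypothetical stub: a real speech-to-text engine would run here.
    return audio.decode("utf-8")


def anonymize(raw_transcript: str, pii_spans: list[str]) -> str:
    # Hypothetical stub: detected spans are supplied directly.
    # Each span is overwritten; nothing about it is kept or mapped.
    redacted = raw_transcript
    for span in pii_spans:
        redacted = redacted.replace(span, REDACTED)
    return redacted


def process(audio: bytes, pii_spans: list[str], store: Store) -> str:
    raw = transcribe(audio)               # stage 1: raw transcript, may contain PII
    redacted = anonymize(raw, pii_spans)  # stage 2: PII overwritten with the token
    store.save(redacted)                  # stage 3: only redacted text is persisted
    return redacted


store = Store()
result = process(b"my name is Ana Souza", ["Ana Souza"], store)
print(result)
```

In this sketch the raw transcript is a local variable that is never handed to `Store.save`, which mirrors the ordering described in the table.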

{% hint style="info" %}
**Data residency.** Birdie's primary infrastructure runs in the United States (GCP region `us-central1`, Iowa). For clients with regional data-residency requirements, transcription and anonymization can run in a local region before final persistence. The Brazil/LGPD routing is documented in [LGPD & Data Privacy (Brazil)](/admin-and-settings/security/lgpd-data-privacy.md).
{% endhint %}

## 2. Categories of PII Identified and Removed

Birdie identifies and removes the following categories of personal and sensitive information. Detection uses two complementary mechanisms: a context-aware language model and a set of deterministic pattern rules (see [section 3](#3.-how-the-anonymization-works)).

**Personal identifiers**

* People's full names or first names.
* Phone numbers.
* Email addresses.
* Physical addresses, locations, and cities.

**Government & identity documents**

* National identification numbers, tax IDs, driver's licenses, and passport numbers — for example a US Social Security Number, a Brazilian CPF, or a Mexican CURP. These are detected by the language model from context, regardless of country.

**Financial / sensitive data**

* Credit card numbers.
* Bank account information (bank, branch/agency, and account numbers).
* Passwords.
* Dates (e.g. dates that could relate to an individual).

{% hint style="info" %}
**Note on company names.** The name of the company being analyzed is explicitly **excluded** from redaction (it is not personal data), so the redacted transcript remains useful for analysis without exposing individuals.
{% endhint %}

## 3. How the Anonymization Works

Anonymization is implemented as a two-stage pipeline, the same hybrid system used across Birdie's ingestion, which combines an AI-based entity recognition model with deterministic pattern detection. The stages run sequentially: the output of the first stage is the input to the second, like a Unix pipe, so the second stage can catch structured values the first may have missed. Each stage replaces any PII it detects with the fixed literal token `[REDACTED]`.

`raw transcript → Stage 1 (Semantic / LLM) → Stage 2 (Pattern / regex) → redacted transcript`
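
The sequential, pipe-like composition can be sketched in Python as follows. This is an illustration under stated assumptions, not the production code: `semantic_stage` is a hypothetical stand-in for the language-model call (here it only knows a single name), `pattern_stage` uses one simplified email rule, and `run_pipeline` is an assumed helper name.

```python
import re
from functools import reduce
from typing import Callable

REDACTED = "[REDACTED]"
Stage = Callable[[str], str]


def semantic_stage(text: str) -> str:
    # Hypothetical stand-in for the language model: it "recognizes" one name.
    return text.replace("Maria", REDACTED)


def pattern_stage(text: str) -> str:
    # Simplified email rule, used only to illustrate the second stage.
    return re.sub(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", REDACTED, text)


def run_pipeline(text: str, stages: list[Stage]) -> str:
    # Feed each stage's output into the next stage, like a Unix pipe.
    return reduce(lambda acc, stage: stage(acc), stages, text)


redacted = run_pipeline(
    "Maria wrote from maria@example.com",
    [semantic_stage, pattern_stage],
)
print(redacted)
```

Here the stand-in first stage misses the email address, and the second stage catches it because it runs on the first stage's output.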

### Stage 1 — Semantic redaction (language model)

A language model reads the full transcript and rewrites it, replacing any personal or sensitive information with the token `[REDACTED]`. Because it operates on meaning and context (not just fixed patterns), it can catch PII that has no fixed format — for example a spoken-out name, an address, or a document number from a country without a specific rule.

The model is instructed to:

* Redact all categories listed in [section 2](#2.-categories-of-pii-identified-and-removed).
* Preserve the original language of the text (it must not translate).
* Preserve the surrounding (non-PII) text as faithfully as possible, so the transcript stays analyzable.
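
As an illustration only, the instructions above could be assembled into a prompt along these lines. The actual prompt wording and the model call are not documented on this page, so `build_redaction_prompt`, `PII_CATEGORIES`, and the example company name are hypothetical; the category list paraphrases section 2, and the company-name exclusion comes from the note in that section.

```python
REDACTED = "[REDACTED]"

# Categories paraphrased from section 2 of this page.
PII_CATEGORIES = [
    "names of people (full or first names)",
    "phone numbers",
    "email addresses",
    "physical addresses, locations, and cities",
    "government and identity document numbers",
    "credit card numbers",
    "bank account information",
    "passwords",
    "dates",
]


def build_redaction_prompt(transcript: str, company_name: str) -> str:
    """Assemble hypothetical instructions for the redaction model."""
    categories = "\n".join(f"- {category}" for category in PII_CATEGORIES)
    return (
        "Rewrite the transcript, replacing every instance of the following "
        f"with the literal token {REDACTED}:\n{categories}\n\n"
        "Rules:\n"
        "- Do not translate; keep the original language of the text.\n"
        "- Keep all non-PII text as close to the original as possible.\n"
        f"- Do not redact the company name '{company_name}'.\n\n"
        f"Transcript:\n{transcript}"
    )


prompt = build_redaction_prompt("Oi, meu nome é Ana.", "Acme")
print(prompt)
```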

### Stage 2 — Pattern-based redaction (regular expressions)

A deterministic set of regular expressions runs over the (already LLM-redacted) text and replaces any remaining structured PII with `[REDACTED]`. This is a safety net for well-formatted values that must always be caught regardless of context. The patterns cover, among others:

| Type                 | Example                                      |
| -------------------- | -------------------------------------------- |
| Email                | `someone@example.com`                        |
| Phone                | `(11) 91234-5678`, `1234-5678`               |
| Credit card          | `1234 5678 9012 3456`, `1234-5678-9012-3456` |
| Dates                | `DD/MM/YYYY`, `DD-MM-YYYY`                   |
| Government / tax IDs | country-specific identifier formats          |
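
A minimal Python sketch of this kind of pattern pass is given below. The regular expressions are assumptions written to match the example formats in the table; they are not Birdie's actual rules, and the country-specific government and tax ID patterns are omitted because their formats are not listed here. `PATTERNS` and `redact_patterns` are hypothetical names.

```python
import re

REDACTED = "[REDACTED]"

# Order matters: card numbers and dates run before the phone rule so that the
# phone pattern never consumes part of a longer digit group.
PATTERNS = [
    ("email", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("credit_card", re.compile(r"\b\d{4}[ -]\d{4}[ -]\d{4}[ -]\d{4}\b")),
    ("date", re.compile(r"\b\d{2}[/-]\d{2}[/-]\d{4}\b")),
    ("phone", re.compile(r"(?:\(\d{2}\)\s?)?(?<!\d)\d{4,5}-\d{4}(?!\d)")),
]


def redact_patterns(text: str) -> str:
    """Replace every match of every pattern with the fixed token."""
    for _name, pattern in PATTERNS:
        text = pattern.sub(REDACTED, text)
    return text


sample = (
    "Email someone@example.com, call (11) 91234-5678 or 1234-5678, "
    "card 1234 5678 9012 3456, date 03/04/1990."
)
cleaned = redact_patterns(sample)
print(cleaned)
```

Applying the rules in a fixed order is a design choice of this sketch: a hyphenated card number such as `1234-5678-9012-3456` would otherwise be partially matched by the shorter phone rule.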

## 4. Why the Result Is Irreversible Anonymization

Modern data-protection regulations (including the GDPR and Brazil's LGPD) distinguish **anonymized data** — data that, by reasonable and available technical means, can no longer be associated with an individual — from **pseudonymized data**, where the link to the individual is merely replaced by a reversible reference and can be restored using additional information. Only anonymized data is treated as falling outside the scope of those regimes.

Birdie's process produces **anonymized** data, not pseudonymized data, for the following reasons:

1. **Destructive replacement, not tokenization.** Each PII span is overwritten with a single, non-unique, fixed label — `[REDACTED]`. The same label is used for every occurrence and every type of PII. It carries no information about the original value (no length, no format, no per-value identifier).
2. **No reversal mechanism exists.** The pipeline does not maintain any mapping table, token vault, key, dictionary, or reference linking a `[REDACTED]` token back to the original value. There is no decryption step and no "additional information" that could restore the original data — because no such information is produced or stored.
3. **One-way transformation.** Every value is overwritten with the same label, so a given redacted output (`...my name is [REDACTED]...`) could have come from any number of original inputs. The original value therefore cannot be recovered from the redacted text by any technical means available to Birdie or a third party.
4. **Only redacted text is persisted downstream.** Storage and downstream analysis operate exclusively on the redacted transcript. The original PII values do not persist in the analytical data set.
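
The many-to-one property behind these points can be demonstrated with a short Python sketch. It is illustrative only: `redact_spans` and `span_of` are hypothetical helpers, and the spans are located by hand rather than detected. Two different inputs produce byte-identical output, and the function returns nothing except the redacted text.

```python
REDACTED = "[REDACTED]"


def span_of(text: str, value: str) -> tuple[int, int]:
    """Locate a value by hand; stands in for a real detector."""
    start = text.index(value)
    return (start, start + len(value))


def redact_spans(text: str, spans: list[tuple[int, int]]) -> str:
    """Overwrite each (start, end) span with the fixed label; return text only."""
    parts = []
    cursor = 0
    for start, end in sorted(spans):
        parts.append(text[cursor:start])
        parts.append(REDACTED)  # same label regardless of length or type
        cursor = end
    parts.append(text[cursor:])
    return "".join(parts)


first = "my name is Ana"
second = "my name is Bartholomew"

out_first = redact_spans(first, [span_of(first, "Ana")])
out_second = redact_spans(second, [span_of(second, "Bartholomew")])
print(out_first == out_second)
```

Because the label does not vary with the length or type of the value it replaces, the output alone gives no way to tell which input produced it.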

***

**See also:**

* [LGPD & Data Privacy (Brazil)](/admin-and-settings/security/lgpd-data-privacy.md) — Art. 12 analysis, Brazilian identifiers (CPF/RG/CNH), and Brazil data residency.
* [PII & PHI anonymization](/admin-and-settings/security/pii-and-phi-anonymization.md) — the two anonymization paths (handled by Birdie vs. handled by your company) and the full list of PII/PHI types.
* [Data Anonymizer - User Guide](/admin-and-settings/security/data-anonymizer-user-guide.md) — run anonymization locally before data reaches Birdie.

