PII Identification with Azure Cognitive Services in C#

Automatically flag and redact sensitive content with artificial intelligence

PII identification is a part of Azure Cognitive Services that lets us automatically flag personally identifiable information (PII) and other potentially sensitive content in strings. In this article we’ll discuss how to get started with PII identification, how to detect PII in strings using C#, and how to redact those strings to reduce the risks associated with storing PII.

PII is a serious concern in many industries, particularly healthcare and education, where how we store and display PII matters a great deal. However, identifying PII in scanned documents or user-entered text can be a challenge. Azure’s PII identification service lets us flag these strings for follow-up and redaction prior to storage or display. What’s more, it does this in a compliant and stateless way that does not retain PII on its servers when used synchronously, as we’ll see here.

Let’s check out how this works.

In order to get the most out of this article you should be familiar with:

  • The basics of Azure Cognitive Services
  • The basics of C# programming
  • How to add a package reference using NuGet

Adding a Reference to the Azure Cognitive Services Language SDK

In order to use PII identification, we should add a reference to the Azure SDK. To get this we’ll install the latest version of the Azure.AI.TextAnalytics package in Visual Studio using NuGet package manager.

Adding a NuGet Reference to Azure.AI.TextAnalytics

Caution: do not use Microsoft.Azure.CognitiveServices.Language.TextAnalytics. This is the old version of this library and it has some known bugs.

See Microsoft’s documentation on NuGet package manager for additional instructions on adding a package reference.

Note: Adding the package can also be done in the .NET CLI with dotnet add package Azure.AI.TextAnalytics.

Creating a TextAnalyticsClient Instance

We need some using statements at the beginning of our C# file to work with the TextAnalytics namespace:

using Azure;
using Azure.AI.TextAnalytics;

After that, we’ll store our key and endpoint. These can be found on the Keys and Endpoints blade of your cognitive services instance in the Azure portal.

Cognitive Services Keys and Endpoints Blade

If you are using a single Azure Cognitive Services instance, you should use one of that service’s keys and its endpoint. If you wanted an isolated service and created a stand-alone Language Service, you would use that service’s key and endpoint instead.

// These values should come from a config file and should NOT be stored in source control
string key = "YourKeyGoesHere";
Uri endpoint = new Uri("https://YourCogServicesUrl.cognitiveservices.azure.com/");

Important Security Note: In a real application you should not hard-code your cognitive services key in your source code or check it into source control. Instead, you should get this key from a non-versioned configuration file via IConfiguration or similar mechanisms. Checking in keys can lead to people discovering sensitive credentials via your current source or your source control history and potentially using them to perform their own analysis at your expense.
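As one way to follow the advice above without a configuration file, here is a minimal sketch that reads both values from environment variables. The variable names and the ServiceSettings helper are assumptions made for illustration; they are not part of the Azure SDK, and IConfiguration or a secrets manager would work just as well.

```csharp
#nullable enable
using System;

// Read the settings from the process environment and report what was found.
try
{
    var (key, endpoint) = ServiceSettings.Load(Environment.GetEnvironmentVariable);
    // Never print the key itself; its length is enough to confirm it was loaded.
    Console.WriteLine($"Loaded a {key.Length}-character key for {endpoint.Host}");
}
catch (InvalidOperationException ex)
{
    Console.Error.WriteLine(ex.Message);
}

// Hypothetical helper; not part of the Azure SDK.
public static class ServiceSettings
{
    // Assumed variable names; pick whatever fits your deployment.
    public const string KeyVariable = "COGNITIVE_SERVICES_KEY";
    public const string EndpointVariable = "COGNITIVE_SERVICES_ENDPOINT";

    // The lookup is injected so the logic can be exercised without touching the real environment.
    public static (string Key, Uri Endpoint) Load(Func<string, string?> lookup)
    {
        string key = Require(lookup, KeyVariable);
        string endpointText = Require(lookup, EndpointVariable);

        if (!Uri.TryCreate(endpointText, UriKind.Absolute, out Uri? endpoint))
        {
            throw new InvalidOperationException($"{EndpointVariable} is not a valid absolute URI.");
        }

        return (key, endpoint);
    }

    // Fails fast with a clear message when a setting is missing or blank.
    private static string Require(Func<string, string?> lookup, string name)
    {
        string? value = lookup(name);
        if (string.IsNullOrWhiteSpace(value))
        {
            throw new InvalidOperationException($"Missing required setting: {name}");
        }

        return value;
    }
}
```

The resulting key and endpoint can then be passed to AzureKeyCredential and TextAnalyticsClient in place of the hard-coded strings.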

Next, we’ll set up a TextAnalyticsClient. This object will handle all communications with Azure Cognitive Services later on.

// Create the TextAnalyticsClient and set its endpoint
AzureKeyCredential credentials = new AzureKeyCredential(key);
TextAnalyticsClient textClient = new TextAnalyticsClient(endpoint, credentials);

With the textClient created and configured, we’re ready to start identifying PII in text.

Flagging PII with Azure Cognitive Services

Now, let’s ask the user for some potentially-sensitive text:

Console.WriteLine("Please put in some sensitive information about yourself (possibly made up)");
string text = Console.ReadLine();

In a real application this string might instead come from something the user typed into an input box or could represent the contents of an E-Mail or scanned document.

Let’s now pass it on to Azure’s PII identifier to see what it finds:

// Detect PII
Response<PiiEntityCollection> piiResponse = textClient.RecognizePiiEntities(text);

This will make a REST POST request to https://YourCogServicesUrl.cognitiveservices.azure.com/text/analytics/v3.1/entities/recognition/pii?showStats=false&stringIndexType=Utf16CodeUnit with a JSON body that carries our text in a property similar to this:

"text":"My name is Matt Eland. My social security number is 555-12-3456 and I live at 42 Wallaby Way, Sydney, Australia. You can E-Mail me at matt@mattondatascience.com.",

Yikes! That’s a lot of PII in one message. Thankfully I’m not posting this on the Internet for others to see. Wait…

Anyway, let’s take a look at the relevant parts of the response:

"redactedText":"My name is **********. My social security number is *********** and I live at *********************************. You can E-Mail me at **************************.",
"entities":[
    {"text":"Matt Eland","category":"Person","offset":11,"length":10,"confidenceScore":1.0},
    {"text":"555-12-3456","category":"USSocialSecurityNumber","offset":52,"length":11,"confidenceScore":0.85},
    {"text":"42 Wallaby Way, Sydney, Australia","category":"Address","offset":78,"length":33,"confidenceScore":0.9},
    {"text":"matt@mattondatascience.com","category":"Email","offset":134,"length":26,"confidenceScore":0.8}
]

Well, right away we can see that the PII identifier was able to flag our name, SSN, address, and E-Mail address as PII. Additionally, the cognitive services API gives us back a category and confidence score for each PII entity it detected.
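Each entity in the response also carries an offset and a length. Because the request used stringIndexType=Utf16CodeUnit, those numbers should line up with ordinary .NET string indexing. The sketch below checks that for a few of the entities from the response above; the PiiOffsets helper is a hypothetical name used only for illustration.

```csharp
using System;

// The original text that was sent to the service.
string original = "My name is Matt Eland. My social security number is 555-12-3456 and I live at 42 Wallaby Way, Sydney, Australia. You can E-Mail me at matt@mattondatascience.com.";

// Text, offset, and length values copied from the response above.
var entities = new (string Text, int Offset, int Length)[]
{
    ("Matt Eland", 11, 10),
    ("42 Wallaby Way, Sydney, Australia", 78, 33),
    ("matt@mattondatascience.com", 134, 26),
};

foreach (var entity in entities)
{
    // Cut the same span out of the original string and compare it to the reported text.
    string slice = PiiOffsets.Slice(original, entity.Offset, entity.Length);
    Console.WriteLine($"offset {entity.Offset}, length {entity.Length}: \"{slice}\" (matches: {slice == entity.Text})");
}

// Hypothetical helper; not part of the Azure SDK.
public static class PiiOffsets
{
    // .NET strings are indexed in UTF-16 code units, the same unit the request asked for.
    public static string Slice(string source, int offset, int length) => source.Substring(offset, length);
}
```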

What makes all of this even better is that Azure returns the redactedText I’d want to display to the user or store in a database, so I don’t even need to do any sort of string replacement!

Security Note: a natural concern at this point would be that Azure has potentially stored the PII it just processed. Microsoft explicitly states that no data is retained when using the synchronous method we’re using above. When the async version is used, Microsoft does store information for up to 24 hours to facilitate the asynchronous request.

Displaying PII Results

Now that we’ve established how to request PII identification, let’s see how we can display it to the user:

// Get the entities out of the response
PiiEntityCollection piiEntities = piiResponse.Value;

// Display the redacted text to the user
Console.WriteLine($"Redacted Text: {piiEntities.RedactedText}");

// Display all PII Entities
Console.WriteLine("PII Entities:");
foreach (PiiEntity entity in piiEntities)
{
    // Determine what category to display. Some categories have sub-categories, but most do not
    string category = entity.Category.ToString();
    if (!string.IsNullOrWhiteSpace(entity.SubCategory))
    {
        category += $"/{entity.SubCategory}";
    }

    // Display the PII entity
    Console.WriteLine($"\t{entity.Text} (Category: {category}) with {entity.ConfidenceScore:P} confidence");
}

This will output something like the following:

Redacted Text: My name is **********. My social security number is *********** and I live at *********************************. You can E-Mail me at **************************.
PII Entities:
        Matt Eland (Category: Person) with 100.00% confidence
        555-12-3456 (Category: USSocialSecurityNumber) with 85.00% confidence
        42 Wallaby Way, Sydney, Australia (Category: Address) with 90.00% confidence
        matt@mattondatascience.com (Category: Email) with 80.00% confidence

Most of the time you’ll just want the RedactedText, but it could be helpful to know what PII entities are present to display or log warnings.
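If you do log warnings, it is probably worth logging only the categories that were found and not the detected values, so the log does not become a second store of PII. Here is a rough sketch of that idea. PiiFinding and PiiWarnings are hypothetical stand-ins (in a real application you would fill them from each PiiEntity’s category and confidence score), and the 0.85 threshold is an arbitrary choice.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sample findings using the categories and scores from the output above.
var findings = new List<PiiFinding>
{
    new PiiFinding("Person", 1.0),
    new PiiFinding("USSocialSecurityNumber", 0.85),
    new PiiFinding("Address", 0.9),
    new PiiFinding("Email", 0.8),
};

foreach (string warning in PiiWarnings.Build(findings, minimumConfidence: 0.85))
{
    Console.WriteLine(warning);
}

// Hypothetical stand-in for the category and score of a detected entity.
public sealed record PiiFinding(string Category, double Confidence);

public static class PiiWarnings
{
    // Returns one message per category, skipping findings below the threshold.
    // The messages never include the detected value itself.
    public static IReadOnlyList<string> Build(IEnumerable<PiiFinding> findings, double minimumConfidence)
    {
        return findings
            .Where(f => f.Confidence >= minimumConfidence)
            .GroupBy(f => f.Category)
            .OrderBy(g => g.Key, StringComparer.Ordinal)
            .Select(g => $"Warning: detected {g.Count()} {g.Key} value(s)")
            .ToList();
    }
}
```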

Ethical Considerations of PII Identification

I like this offering from Microsoft, and yet I have some concerns.

As a user of this service, most of the time my desire is going to be one of these two things:

  1. Determine if a string contains PII
  2. Sanitize or remove PII from a string by producing a redacted string

Azure Cognitive Services meets both of these needs beautifully.

However, this service doesn’t stop there. It provides a way of identifying specific entities of PII from strings. A bad actor could conceivably use this service to scan documents and E-Mails in bulk, looking for social security numbers and associated PII to potentially use for fraudulent purposes. Admittedly they could have done that without Azure, but I don’t like how easy it is for anyone to determine what types of PII are present in a string and get the exact values associated with those.

That being said, if you wanted to selectively show sensitive pieces of information to authorized personnel, knowing which pieces of information are sensitive and where they are in a string is a fantastic thing to have.
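To make that idea concrete, here is a minimal sketch of selective redaction: it masks every detected span except those in categories the current viewer is allowed to see. PiiSpan and SelectiveRedactor are hypothetical types for illustration, not part of the Azure SDK, and the SSN offset of 52 was worked out by hand from the sample sentence.

```csharp
using System;
using System.Collections.Generic;

string original = "My name is Matt Eland. My social security number is 555-12-3456 and I live at 42 Wallaby Way, Sydney, Australia. You can E-Mail me at matt@mattondatascience.com.";

// Category, offset, and length for each entity in the sample sentence.
var spans = new List<PiiSpan>
{
    new PiiSpan("Person", 11, 10),
    new PiiSpan("USSocialSecurityNumber", 52, 11), // offset computed by hand
    new PiiSpan("Address", 78, 33),
    new PiiSpan("Email", 134, 26),
};

// Assume this viewer is only authorized to see E-Mail addresses.
var visible = new HashSet<string> { "Email" };
Console.WriteLine(SelectiveRedactor.Redact(original, spans, visible));

// Hypothetical stand-in for the position data of a detected entity.
public sealed record PiiSpan(string Category, int Offset, int Length);

public static class SelectiveRedactor
{
    // Replaces each span with the mask character unless its category is visible.
    public static string Redact(string text, IEnumerable<PiiSpan> spans, ISet<string> visibleCategories, char mask = '*')
    {
        char[] buffer = text.ToCharArray();
        foreach (PiiSpan span in spans)
        {
            if (visibleCategories.Contains(span.Category))
            {
                continue;
            }

            // Clamp to the buffer so a bad offset cannot throw.
            int start = Math.Max(span.Offset, 0);
            int end = Math.Min(span.Offset + span.Length, buffer.Length);
            for (int i = start; i < end; i++)
            {
                buffer[i] = mask;
            }
        }

        return new string(buffer);
    }
}
```

Working on a character buffer keeps the string length unchanged, so the offsets of later spans stay valid while earlier ones are masked.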

For the vast majority who use this service, though, it will be a tool for good in ensuring that sensitive information is flagged and potentially redacted.