Privacy is all about data management. You need to know which type of information you are dealing with and what really happens to it on the backend. Data catalogs can help you see and visualize what happens to your data. For details, please read the story I published on the subject. Today I want to discuss what can be done to identify personally identifiable information (PII) in the context of APIs.
APIs transfer information back and forth, so you want to know whether part of that information should be considered private. To help with that task, I developed an API Private Information Analyzer that you can use to analyze your complete API payload. The tool tells you exactly where it identified PII and explains how it classified each item as such.
The first question I had to answer before designing the tool was which technique to use. Should I use a text classifier, or is named entity recognition (NER) what we need for the task? I saw plenty of attempts to implement such classification with NLU methods such as LSTM or BERT. Well, with a short experiment that compared NLU technologies (LSTM and BERT-based models, using libraries like huggingface/transformers) against traditional text manipulation methods, I was able to confirm the obvious: private information is mostly classified by pattern and less by context. Therefore, solutions based on simple regular expressions delivered much better results than any NLU classification model I tried, and they did it, obviously, much faster.
This simple experiment is a typical example of how a simple, non-fancy solution sometimes outperforms cutting-edge technology. Even though the transformer has revolutionized the NLU space in the last few years, and we can now do things that we could never have done before, we should still be careful and use NLU methods only when traditional text manipulation methods fail to deliver. For PII identification, the only task where NLU methods outperformed good old regex was named entity recognition (such as identifying people’s names). It is no surprise that professional tools like BigID use regular expressions as the first method for classifying private information; clustering and other fancy AI come next.
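To make the pattern-over-context point concrete, here is a minimal sketch of regex-based PII detection. The pattern set and the `find_pii` helper are illustrative assumptions, not the rules used in the tool; a real rule set would be larger and stricter.

```python
import re

# Hypothetical, deliberately small pattern set for illustration only.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def find_pii(text):
    """Return (label, matched_text) pairs for every pattern hit in text."""
    hits = []
    for label, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group()))
    return hits


sample = "Contact jane.doe@example.com from 10.0.0.12, SSN 123-45-6789"
for label, value in find_pii(sample):
    print(f"{label}: {value}")
```

Each hit carries the label of the pattern that produced it, which is what lets a tool explain why a value was flagged.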
In the tool, I chose to use the Stanford Named Entity Recognizer. It can be downloaded here, or you can just unzip the stanford-ner-2018-10-16.zip available here. This zip contains both the NER jar file (stanford-ner.jar) and the model file (english.all.3class.distsim.crf.ser.gz). No other installation is required to use the NER engine. It’s not a new model; it is based on Conditional Random Field (CRF) sequence models. But for the task of identifying people’s names in the context of privacy, it behaves just fine, and it is much lighter than modern BERT-based implementations. If you want to dig deeper into the named entity recognition domain, I recommend the following post as a starting point.
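As a rough illustration of how NER output can be turned into name findings, the sketch below assumes the tagger has already produced `(token, label)` pairs, with the 3-class model's labels `PERSON`, `ORGANIZATION`, `LOCATION`, and `O` for everything else. The `extract_person_names` helper and the sample pairs are hypothetical; how you obtain the pairs (for example, through a Python wrapper around the jar) is left out here.

```python
def extract_person_names(tagged_tokens):
    """Join runs of consecutive PERSON-tagged tokens into full names."""
    names, current = [], []
    for token, label in tagged_tokens:
        if label == "PERSON":
            current.append(token)
        elif current:
            # A non-PERSON token closes the name collected so far.
            names.append(" ".join(current))
            current = []
    if current:
        names.append(" ".join(current))
    return names


# Hand-written sample in the assumed (token, label) format.
tagged = [
    ("Order", "O"), ("placed", "O"), ("by", "O"),
    ("John", "PERSON"), ("Smith", "PERSON"),
    ("in", "O"), ("Boston", "LOCATION"),
]
print(extract_person_names(tagged))
```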
The tool takes a collection of input entries, each containing all the data elements of an API call (i.e., the URI, headers, and payload of both the request and the response). It analyzes the data and searches for possible private information (PII) by using regular expressions, a blacklist of suspicious property names, and a named entity recognition (NER) engine. The result is a CSV or JSON file with a per-API-call analysis that highlights all the possibly private fields and their values. You can use the tool from the command line or call it from your Python code.
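The flow described above can be sketched roughly as follows. This is not the tool's actual code: the entry layout, the `SUSPICIOUS_KEYS` blacklist, the value patterns, and the `scan` function are all assumptions made for illustration, and the NER step is omitted to keep the sketch self-contained.

```python
import json
import re

# Hypothetical blacklist of property names and value patterns.
SUSPICIOUS_KEYS = {"password", "ssn", "email", "phone", "address", "birthdate"}
VALUE_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "us_ssn": re.compile(r"\d{3}-\d{2}-\d{4}"),
}


def scan(node, path=""):
    """Yield one finding per suspicious key or value in a nested JSON structure."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else key
            if key.lower() in SUSPICIOUS_KEYS:
                yield {"path": child, "reason": "suspicious property name", "value": value}
            yield from scan(value, child)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from scan(item, f"{path}[{index}]")
    elif isinstance(node, str):
        for label, pattern in VALUE_PATTERNS.items():
            if pattern.search(node):
                yield {"path": path, "reason": f"matches {label} pattern", "value": node}


# One assumed input entry: URI plus request and response data.
api_call = {
    "uri": "/users/42",
    "request": {"headers": {"Accept": "application/json"}},
    "response": {"payload": {"user": {
        "name": "Jane",
        "email": "jane@example.com",
        "notes": ["SSN on file: 123-45-6789"],
    }}},
}

findings = list(scan(api_call))
report = json.dumps(findings, indent=2)
print(report)
```

A field can be reported twice, once for its name and once for its value; keeping both reasons makes it easier to see why something was classified as private.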
Not all text manipulations require the NLU heavy guns. Often enough, simplicity wins. If you need to quickly analyze your API for privacy issues, check out the API Private Information Analyzer. It will tell you whether your API involves data that should be considered private, in which case you must make sure you have the relevant tools and capabilities to manage it.