Hash Match: Working with anonymised data
As part of Normally's continued R&D explorations around data, we have been collecting data on the 'pulse' of the studio for the last 18 months. Our internally developed Slack bot 'Norman' asks us each day how work has been, and our responses are collected anonymously in a Google Sheet.
This was to ensure that the personal data we collected couldn't be stumbled upon and tracked back to any individual.
Our Slack bot Norman is built using Slack's Bolt for JavaScript framework and hosted on Glitch. We also use Google's Sheets API to store the data collected by Norman in a Google Sheet, as this makes it easy to analyse and work with.
One question came up as soon as we started trying to work with the data collected:
How do we provide analysis and trends calculated from an individual's data, when the identity of the individual is no longer known as a result of the anonymisation process?
As a heads-up, we chose the method below knowing that it wasn't foolproof from a privacy perspective, but it felt appropriate for a small internal project that only stores the response to this one question. It also gave us the flexibility to work with the stored data in the future, as we are able to connect it back to individuals via our data steward.
This is our attempt at just that:
STEP 1: The individual's unique Slack ID is pulled from the message payload returned when they answer the daily questions. We use an MD5 algorithm (plus a salt) to hash the Slack ID before storing it in Google Sheets. While there are more secure hashing methods available, MD5 is sufficient to prevent anyone viewing the sheet from identifying an individual in our team.
STEP 2: A pivot table was created that calculates each hash value's total score for that week, and how that compares to the previous week's score.
STEP 3: When an individual requests their weekly trend, or when Norman delivers the past week's trend every Monday, we run a script that converts everyone's Slack ID to a hash value.
STEP 4: This hash value is then compared with the hash values in the pivot table until there is a match (Hash Match).
STEP 5: We then take the delta between last week's and this week's score for the matching hash value from the pivot table, and return it as a Slack message to that Slack ID.
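The steps above can be sketched in a few lines of Node.js. This is a minimal illustration rather than Norman's actual code: the function names (hashSlackId, weeklyTotals, weeklyDelta), the row shape and the week labels are all hypothetical, the salt is assumed to be simply prepended to the Slack ID before hashing, and an in-memory map stands in for the pivot table in Google Sheets.

```javascript
const crypto = require('crypto');

// STEP 1 (assumption): the salt is prepended to the Slack ID, then MD5-hashed.
function hashSlackId(slackId, salt) {
  return crypto.createHash('md5').update(salt + slackId).digest('hex');
}

// STEP 2 stand-in for the pivot table: total score per hash per week.
// Each row is assumed to look like { hash, week, score }.
function weeklyTotals(rows) {
  const totals = new Map(); // hash -> Map(week -> total score)
  for (const { hash, week, score } of rows) {
    if (!totals.has(hash)) totals.set(hash, new Map());
    const byWeek = totals.get(hash);
    byWeek.set(week, (byWeek.get(week) || 0) + score);
  }
  return totals;
}

// STEPS 3-5: re-hash a Slack ID, look for a matching hash in the totals,
// and return this week's total minus last week's (null if there is no match
// or if either week has no data).
function weeklyDelta(slackId, salt, totals, thisWeek, lastWeek) {
  const byWeek = totals.get(hashSlackId(slackId, salt));
  if (!byWeek || !byWeek.has(thisWeek) || !byWeek.has(lastWeek)) return null;
  return byWeek.get(thisWeek) - byWeek.get(lastWeek);
}

// Example with made-up data: one person, two responses per week.
const salt = 'example-salt';
const me = hashSlackId('U123', salt);
const totals = weeklyTotals([
  { hash: me, week: 'w1', score: 3 },
  { hash: me, week: 'w1', score: 2 },
  { hash: me, week: 'w2', score: 4 },
  { hash: me, week: 'w2', score: 5 },
]);
console.log(weeklyDelta('U123', salt, totals, 'w2', 'w1')); // 4
```

For the Monday delivery, the same lookup would simply be repeated for each Slack ID in the team. Note that only someone holding the salt can reproduce a hash from a Slack ID, which is why the salt is kept out of the sheet and the code.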
Now there are a couple of additional measures and protocols we have in place in order to maintain safe data handling practices, which are important to be aware of in this context.
Glitch's .env file - Glitch uses a hidden .env file that can be used to store secret credentials, which can then be referenced in the code. This is handy as multiple people can work in the code, and it enables the code to be uploaded to GitHub without exposing any private IDs or API credentials.
Salt - The salt is a secret passphrase that is added to the MD5 hashing function and is an additional measure used to protect the original data values. This is stored in our .env file.
Appointed data steward - We appointed a trusted data steward for this project who had sole access to the data and full hashing functions. We found appointing this role useful in the development of these types of projects as it helps provide flexibility and a means to interrogate the data further, but without providing open access to the data to the wider company.
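As a rough sketch of how the .env file and the salt fit together: values stored in .env are read through process.env in Node.js. The variable name HASH_SALT and the helper loadSalt are hypothetical, chosen only for illustration. Failing loudly when the salt is missing is a design choice we would suggest, since hashing without it would silently produce values that don't match those already stored.

```javascript
// Assumes the .env file contains a line such as: HASH_SALT=some-secret-passphrase
// (HASH_SALT is a hypothetical name, not necessarily the one Norman uses.)
function loadSalt(env = process.env) {
  const salt = env.HASH_SALT;
  if (typeof salt !== 'string' || salt.length === 0) {
    // Refuse to continue: unsalted hashes would not match the stored ones.
    throw new Error('HASH_SALT is not set; refusing to hash without a salt');
  }
  return salt;
}

// Typical use at start-up:
//   const salt = loadSalt();
```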