When developing data-driven software, there’s a constant tension between anonymity and usefulness.
On one hand, some level of anonymity is required when using any data set, to protect sensitive customer and commercial information. On the other, the more you obfuscate data, the less useful it becomes for information discovery.
When demonstrating a proof-of-concept data-driven application, it is typical to use generated, fake data. This has the obvious advantage of completely protecting customer anonymity, satisfying the commercial folks. It also placates project managers, by replacing the time required to obfuscate the data (an unknown quantity) with the time required to generate fake data (a predictable quantity). But as part of the software development team, you should avoid letting this happen.
When replacing the real with a fake, it's important to remember that the best case for this generated, fake data is that it looks and smells entirely like real data – and that this best case is unachievable. To generate completely accurate fake data, you would need a complete understanding of the domain being analysed, which is obviously an impossible demand. So generally the benchmark is set at generating “believable data.”
Believable to whom?
Well, to anyone you want to sell it to, I imagine. But this is a dangerous demand. You can never know that your data is not believable until it's too late – in the same way that you can never know whether your network security is good enough until it's too late – and even then, only if someone raises an alarm.
Your team probably has good domain knowledge, and you put that knowledge into your generated data, creating what you believe to be realistic scenarios, and believable correlations. You show it to the executives in your own company, who are unlikely to have as strong domain knowledge as your team.
Their reaction is enthusiastic – they are surprised by correlations in your data and this helps generate a positive impression of your product.
So you go and show it to potential customers, who gather executives and experts for your presentation, often putting hundreds of years of domain expertise in a single room. Of course, many of the correlations they expect to see in your fake data don’t exist, because you didn’t think to create them, generating immediate suspicion of your product. If you’re lucky, they’ll understand the issue is due to your generated data, or at least question why they’re not seeing what they expect. If you’re unlucky, they will simply use it to dismiss your product, using the missing information to reinforce the reasons why they don’t need software to support their jobs, or to argue that such tools should really be developed in-house. In any of these cases, you’ve just devalued your potential new product.
But why couldn’t this happen with obfuscated data?
In real data, patterns such as locations, times or names in the data could be used by competitors (to whom you’re going to show this proof-of-concept) to reasonably guess to which of your customers this data belongs (argues the commercial team, who don’t want to risk your customer’s data). So your obfuscations must ensure that these elements of the data are hidden sufficiently to ensure that this doesn’t happen – knowing, of course, that these domain experts might spot some pattern you missed and use it to infer the underlying customer data in any case.
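To make this concrete, here is a minimal sketch of the kind of obfuscation described above – pseudonymising names, shifting times, and jittering locations. All field names, the salt, and the jitter radius are hypothetical choices for illustration, not a prescription; a real obfuscation pipeline would be designed with the domain experts mentioned below.

```python
import hashlib
import random
from datetime import datetime, timedelta

# Hypothetical obfuscation sketch. The salt and offsets are illustrative;
# in practice they would be secrets, rotated per data release.
SALT = "rotate-this-per-export"
TIME_SHIFT = timedelta(days=random.randint(30, 365))  # one consistent offset


def pseudonymise_name(name: str) -> str:
    # A consistent pseudonym: the same customer always maps to the same
    # token, so joins and frequency patterns survive, but the name is hidden.
    digest = hashlib.sha256((SALT + name).encode()).hexdigest()[:8]
    return f"CUST-{digest}"


def shift_timestamp(ts: datetime) -> datetime:
    # One global offset preserves intervals and weekly seasonality
    # while hiding the actual dates.
    return ts + TIME_SHIFT


def jitter_location(lat: float, lon: float, radius_deg: float = 0.05) -> tuple:
    # Random jitter coarsens locations enough to prevent exact matching,
    # at the cost of destroying fine-grained spatial correlations.
    return (lat + random.uniform(-radius_deg, radius_deg),
            lon + random.uniform(-radius_deg, radius_deg))
```

Note the trade-off each function embodies: the consistent pseudonym and global time shift preserve correlations, while the location jitter deliberately destroys some – exactly the subtractive choice discussed next.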
During the obfuscation process it’s important to remember that while many patterns will be diligently preserved, many will be necessarily destroyed, reducing the selection of available correlations in the data. But won’t this obfuscation process – of unknown duration – reduce the quality of the data to such a point that it would be less convincing to clients than generated data (argues the product manager who wants this software to be finished ‘yesterday’)? That’s unlikely.
Data generation is an additive process, but obfuscation is a subtractive one.
When the elements to be subtracted are sensitive commercial data, it’s likely that more time and due care will be allocated to creating the data set – which is the key component of any data-driven software – than if the data were generated, because commercially sensitive data would otherwise be at risk. And because the people doing the subtraction are data-driven domain experts themselves, they will preserve, as best they can, the correlations they would otherwise have placed in their generated data.
Meanwhile, because the process is subtractive, rather than additive, unseen correlations in the original real dataset may live on, providing the ability to surprise both the development team and potential customers alike. And finally, using obfuscated data, you can easily explain in advance that some correlations were removed during the obfuscation process. This means that if your potential client doesn’t see a correlation they’re expecting, they’re more likely to blame it on the necessary obfuscation process, or simply question it, than silently blame it on your incompetence.
The data must be surprising
Remember that it’s the undiscovered surprises in the data that push the development of data-driven software in the first place. It’s important for development teams to remember that understanding what would otherwise be surprises in the data is what makes someone a domain expert. The only way to gain this expertise is to work on real data, to the maximum extent possible.
So argue as hard as you can to never generate fake data… and if you do have to fake it, work with real data until the last possible moment.