How expensive is Manual Data Manipulation?
I’ve been reminded what a tedious and potentially dangerous task it is to manipulate data manually. There are times when manually manipulating and editing data is useful and necessary, but how does that balance against the cost of a system to take care of the data?
Working with customers for yieldHUB, I see many different formats and types of data. We deal with companies from early-stage start-ups to mature multinationals. Interestingly, we always find some manual data manipulation.
It is worth exploring what is meant by manual data manipulation. We can define three types of data manipulation: manual, semi-automated and fully automated. Here “data manipulation” means the moving around and preparing of data before any analysis takes place.
So if you collect data from your bench test equipment, put it on your local machine, re-arrange the data in Excel and then push it to your analysis tool, that is manual data manipulation.
If you collect data, put it on a machine and run some scripts on it, perhaps in Perl or R, then we could call that semi-automated.
If your data is automatically copied from testers, moved across the network to a server and prepared for you, so that you only need to choose the analysis you want (or receive automatic reports), that is fully automated.
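To make the semi-automated level concrete, here is a minimal sketch of the kind of script an engineer might run on a collected file. It is written in Python rather than Perl or R purely for illustration, and the function name `prepare_datalog`, the column names and the sample values are all hypothetical. It assumes a well-formed CSV with a header row and a numeric column to sort on.

```python
import csv
import io


def prepare_datalog(raw_text, sort_column):
    """Tidy CSV text and return its rows sorted by one numeric column.

    Hypothetical helper: assumes every row has the same fields as the header.
    """
    reader = csv.DictReader(io.StringIO(raw_text))
    # Strip stray whitespace of the kind that creeps in during hand editing.
    rows = [{key.strip(): value.strip() for key, value in row.items()}
            for row in reader]
    # Sort numerically, not alphabetically, on the chosen column.
    rows.sort(key=lambda row: float(row[sort_column]))
    return rows


# Made-up sample data standing in for a bench test datalog.
sample = "part_id, voltage\nA3, 1.21\nA1, 1.18\nA2, 1.25\n"

for row in prepare_datalog(sample, "voltage"):
    print(row["part_id"], row["voltage"])
```

Once a script like this is known to be correct, the per-file work shrinks to moving the file into place, running the script and checking the output.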
As we move through these levels a number of things change. With very manual manipulation, it takes a long time to do anything with a data file or datalog: setting up columns, checking alignment, sorting, plotting a chart, adding the labels… More automation means less time per file: move it to a folder, run the script, check the output. And with a fully automated system you just view the results. So we could plot, as shown in Figure 1 below, that time per file goes down as automation increases.
Figure 1 – Time per file
The other thing to consider is the chance of an error in the data. After 20 years working in the semiconductor industry I still need to do some manual manipulation of data from time to time, so I can say first-hand that there is often some error there, and a lot of time is spent error checking and then debugging. Macros or scripts take a lot more time to write, but once they are known to be correct nothing should change, so the chance of error per file goes down significantly. With full automation, as long as the data format does not change, there is far less chance of error. You are now using scripts that have processed not only all of your data but many other customers’ data too. Many eagle-eyed engineers around the world are checking the results, even after professional software teams have tested the code thoroughly. So we could also plot this, as in Figure 2.
Figure 2 – Chance of error
An interesting thing to consider next is what time spent and chance of error show when combined. The more manual the data manipulation, the more time is spent and the greater the chance of error, and the key mitigation for error is yet more time spent. So it would seem reasonable to multiply these factors together. That gives a curve that falls away like a decaying exponential, as seen in Figure 3. The cost per file of manual manipulation of data is huge when time per file and chance of error are considered together. While automated systems are expensive, it would seem that the return-on-investment point could come at a far smaller number of data files than we might intuitively imagine.
Figure 3 – Combined, Time per file x Chance of error
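The multiplication behind Figure 3 can be sketched in a few lines of Python. The numbers below are illustrative placeholders chosen only to show the shape of the calculation. They are not measurements from any customer or from the figures above, and the name `combined_score` is an assumption of this sketch.

```python
# Illustrative placeholder values only; not measured data.
LEVELS = {
    "manual": {"minutes_per_file": 60.0, "error_chance": 0.20},
    "semi-automated": {"minutes_per_file": 10.0, "error_chance": 0.05},
    "fully automated": {"minutes_per_file": 1.0, "error_chance": 0.01},
}


def combined_score(minutes_per_file, error_chance):
    """Relative cost per file: time per file multiplied by chance of error."""
    return minutes_per_file * error_chance


scores = {name: combined_score(**values) for name, values in LEVELS.items()}

for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Because both factors fall as automation increases, their product falls much faster than either one alone, which is why the combined curve drops so steeply at the manual end.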
In conclusion, there are choices when it comes to manipulating data. You can do it manually, with a huge cost per file. You can add automation with scripts written by engineers, but make sure you get documentation, source code, backups and support prepared in case those engineers leave the company. The third option is to buy a system that can take care of all of this, along with a strong support package. When all things are considered, it will soon return the investment.