Figuring out binary datalog formats without a spec
Working in the realm of semiconductor data with a wide range of customers, we are often handed interesting technical challenges. The most complex one so far this year is probably a request (OK, a requirement!) to interpret binary test datalog files so that they can be analysed from our yieldHUB database system. The company provided us with little information on the actual binary format, so we had to interpret the binary files based on their equivalent ASCII files and our knowledge of how datalog data is usually stored.
We needed to parse the binary files directly because generating the ASCII version is a manual operation in a desktop application. Converting the customer's existing archive of binary files into a format compatible with entry into our database could have taken thousands of man-hours.
After several days of combing through the binary files, searching for patterns and for the values we knew from the ASCII version, we were able to write our own converter. We wrote it as a command-line program so that we could integrate it into our automated datalog processing and let our customer analyse their historical data using the tools in yieldHUB. The journey was extremely challenging, but when we finally discovered all the needed information, it felt all the more satisfying.
In this project, the key lessons we learned are the following:
— The “endianness” of the binary files needs to be understood first. We got stuck in a confusing cycle of trial and error because some files were little endian and others were big endian. We did not expect a single system to produce both, but it happened in this case.
— The script or software used to read and interpret the binary files needs a way to specify the endianness of the binary data. We initially used an older version of the scripting language to decode the numbers, and we never found the correct values until we realized we needed to use the newer version.
— An 8-byte floating-point value such as 0.12 may not be exactly 0.12 in its binary representation; it could be stored as 0.11999999999…. Because we were aware of this from the start, we did not look for floats or doubles at the beginning, but rather for ASCII text and integers.
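The first and last lessons above can be sketched in a few lines. This is an illustration in Python rather than in the scripting language we actually used: `struct` format strings make the byte order explicit, and `decimal` reveals the exact value a double really stores.

```python
import struct
from decimal import Decimal

# Pack 0.12 as an 8-byte little-endian double ("<d"); ">d" would be big endian.
raw = struct.pack("<d", 0.12)

right = struct.unpack("<d", raw)[0]   # correct byte order
wrong = struct.unpack(">d", raw)[0]   # wrong byte order: a wildly different number

print(right == 0.12)   # True
print(wrong == 0.12)   # False

# The stored value is not exactly 0.12, which is why searching the raw
# bytes for an exact float can fail where ASCII text and integers match:
print(Decimal(0.12))   # 0.1199999999... (not exactly 0.12)
```

The same caveat applies in any language: compare floats recovered from binary data with a tolerance, not with equality against the ASCII value.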
— A good visualisation tool is key to finding the patterns in binary data, especially for the part of the data that represents multiple records. The tool we used allowed us to see the obvious patterns very quickly, and it also helped us validate byte positions and block sizes for the multi-record data.
Does your company have such an archive of data that you can’t interpret or analyse any more?
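For readers tackling a similar archive: even without a dedicated tool, a minimal hex-and-ASCII dump goes a long way towards making repeating record structures visible. This is a generic sketch in Python, not the visualisation tool we used:

```python
def hexdump(data: bytes, width: int = 16) -> str:
    """Render bytes as offset, hex and ASCII columns, one row per `width` bytes."""
    rows = []
    for off in range(0, len(data), width):
        chunk = data[off:off + width]
        hexpart = " ".join(f"{b:02x}" for b in chunk)
        asciipart = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        rows.append(f"{off:08x}  {hexpart:<{width * 3}} {asciipart}")
    return "\n".join(rows)

# Repeating fixed-size records show up as vertically aligned columns
# when the dump width matches (or divides) the record size:
print(hexdump(b"REC1\x00\x10REC2\x00\x20REC3\x00\x30", width=6))
```

Trying different `width` values until the columns line up is a quick way to guess the record size before working out individual field offsets.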