Why saving data in binary format saves space

Recently we had a short group discussion about saving and storing data. For some experiments this can become a real bottleneck, since they may generate billions of data points per second. Saving the data in a proper format will not only keep you from running out of space on the hard drive; it will also make opening and handling files much faster, not to mention the time you’ll save when sending them from one computer to another. I think it is worth explaining what it means to save in plain “ascii” and why the binary format that so many fear is a fast and elegant solution.

As you probably know by now, everything that happens inside your computer is based on 0s and 1s, the famous bits. Converting a binary number to a decimal number is easy. But what happens if we want to convert letters to 0s and 1s? The easy answer is to establish a convention and assign different numbers to different characters. In this way the number 1000001 stands for a capital A and the number 100000 for a space. Since we also want to deal with digits, we have to include them in our conversion table, so the number 110000 stands for the character 0, etc. You may see the entire table here.
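To make the convention concrete, here is a minimal sketch in Python. The helper names `to_bits` and `from_bits` are mine, chosen only for illustration; `ord` and `chr` are the built-in functions that look a character up in the table and back.

```python
def to_bits(char):
    """Return the ASCII code of a single character as a binary string."""
    return format(ord(char), "b")


def from_bits(bits):
    """Return the character whose ASCII code is the given binary string."""
    return chr(int(bits, 2))


# The three characters used as examples in the text.
for char in ["A", " ", "0"]:
    print(repr(char), ord(char), to_bits(char))
```

Calling `from_bits("1000001")` gives back the capital A, which is all the conversion table does.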

The ASCII table is the simplest one. Strictly speaking it defines 128 characters (7 bits), but each character is normally stored in a full byte of 8 bits, and the extended 8-bit tables use all 256 values (2^8=256) to cover the alphabet, the digits and special characters. If we want to represent the number ‘123’ with this convention we need three times 8 bits, one byte for each of the digits. Of course you may want to store a larger number of digits; for instance 3.14159 takes up 7 times 8 bits (the dot has to be counted). This is already proving to be an inefficient approach, since you rely on converting each digit to a particular position in the ASCII table and only then to a combination of bits. Fortunately there are cleverer ways of dealing with just numbers.
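The count of bits can be checked with a few lines of Python. This is only a sketch; `ascii_bits` is a hypothetical helper name, and it assumes one 8-bit byte per character as described above.

```python
def ascii_bits(text):
    """Bits needed to store `text` with one 8-bit byte per character."""
    return len(text.encode("ascii")) * 8


print(ascii_bits("123"))      # three characters
print(ascii_bits("3.14159"))  # seven characters, the dot included
```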

Everyone who is familiar with compiled languages is more than aware of the plethora of variable types. People used to python or matlab, however, may be unaware of the big differences between doing a=3.1, a=’3’ or a=3. In the previous example I showed that representing the number 123 as text needs 24 bits (8 bits per digit). If we do the math, the number of different combinations we can have with 24 bits is more than 16 million (2^24). Despite this huge capacity, we can’t represent anything larger than the number 999 with those 24 bits. The solution is to ditch the ASCII table and build a new one just for the numbers we are interested in.
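A short Python sketch of the difference, using the standard `struct` module to write 123 as an unsigned 16-bit integer. The choice of 16 bits and of little-endian order (`"<H"`) is my assumption for the example, not something the discussion above prescribes.

```python
import struct

# Three ways of writing "almost the same thing", three different types.
examples = [3.1, "3", 3]
type_names = [type(value).__name__ for value in examples]
print(type_names)

# The number 123 stored as text: one byte per digit.
as_text = "123".encode("ascii")

# The same number stored as an unsigned 16-bit integer: two bytes.
as_uint16 = struct.pack("<H", 123)

print(len(as_text) * 8, "bits as text")
print(len(as_uint16) * 8, "bits as a 16-bit integer")

# How many different values 24 bits could hold if used as a plain binary number.
combinations_24_bits = 2 ** 24
print(combinations_24_bits)
```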

For example, if we know beforehand that the numbers we are dealing with are positive integers, always lower than 1000, we can represent them with just 10 bits (2^10=1024). The only thing we do is change the base from decimal to binary. Even if this is too simplistic, it is the foundation for understanding how more complex scenarios work. Imagine you don’t want to deal with just integers; you would also like to have half-integers (100.5, 34.5, 567.0, etc.). That is only twice as many numbers, so you can easily handle them with 11 bits (2^11=2048). If you want to go from the real number to the binary representation, you have to define the rules for doing so. You could say, for instance, that the first 1000 numbers are going to be represented by the first 1000 numbers in your binary format, and the half-integers by the next 1000. Or you can just put them in order. It doesn’t really matter what you do, as long as you are always consistent.
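Here is a sketch of the "just put them in order" rule in Python: every value between 0 and 999.5 is multiplied by two, which gives a code between 0 and 1999 that fits in 11 bits. The function names are hypothetical, and any other rule would do as long as encoder and decoder agree.

```python
def encode_half_integer(value):
    """Map 0, 0.5, 1, ..., 999.5 to the codes 0, 1, 2, ..., 1999."""
    code = int(round(value * 2))
    if code != value * 2 or not 0 <= code < 2000:
        raise ValueError("value must be a multiple of 0.5 between 0 and 999.5")
    return code


def decode_half_integer(code):
    """Inverse of encode_half_integer."""
    return code / 2


for value in [100.5, 34.5, 567.0]:
    code = encode_half_integer(value)
    # Print the code as an 11-bit binary number.
    print(value, code, format(code, "011b"))
```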

Fortunately you normally do not have to handle this low level of detail; someone else has already thought about the best ways of making the conversions. In most computer programs there are two main classes of numbers, integers and floating point (Matlab calls the 64-bit variety double). Integers are easy to understand, and normally you can choose 8, 16, 32 or 64 bits for representing them. Try to think of the largest integer you can have with 64 bits! Floating-point numbers are used for values with a fractional part, and they also come in different flavours. The simplest way to picture how a fixed number of bits limits the resolution is to spread the available values evenly over a range (real floating-point formats space them unevenly, but the idea of a finite set of values is the same). For instance, imagine that you work with 16 bits, meaning that you have 2^16=65536 different values. If the range of the numbers you’ll get is between -10 and 10, the smallest difference you can represent between two numbers is 20/65536, which is roughly 0.000305. Just to compare, if you wanted to write the number 0.000610 using the ascii convention, you would need 64 bits. With the latter convention, you need just 16 bits.
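The 16-bit example can be sketched in Python as follows. Note that this evenly spaced scheme is what is usually called fixed point; the names `quantize` and `dequantize` and the rounding rule are assumptions made for the illustration.

```python
BITS = 16
LOW, HIGH = -10.0, 10.0
LEVELS = 2 ** BITS
STEP = (HIGH - LOW) / LEVELS  # smallest difference between two stored values


def quantize(value):
    """Convert a value in [LOW, HIGH] to an integer code of BITS bits."""
    code = round((value - LOW) / STEP)
    return min(max(code, 0), LEVELS - 1)  # keep the code inside the 16-bit range


def dequantize(code):
    """Convert an integer code back to the value it stands for."""
    return LOW + code * STEP


# The largest integers that fit in 64 bits, unsigned and signed.
largest_unsigned_64 = 2 ** 64 - 1
largest_signed_64 = 2 ** 63 - 1

print(STEP)
print(quantize(0.000610), dequantize(quantize(0.000610)))
print(largest_unsigned_64)
```

The value read back is not exactly 0.000610, but it is within half a step of it, which is the price of squeezing it into 16 bits.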

I think at this point you have already started to see the difference between types of variables. For instance, if you again have integer values always lower than 1000 and you choose to use 32 bits per number, spread evenly over that range as in the previous example, you can represent differences as small as 1000/4294967296 ~2.33E-7. But you have just integers, so you will never have a difference between numbers smaller than 1. Choosing a 10-bit representation works just as well and saves you 22 bits for every number you want to store. Let’s go to a more realistic scenario. Imagine you have a device that measures a particular variable in the range +/-10 with a precision of 4 decimals, every microsecond during one minute. In total you will have 6E7 numbers to save, each one containing 7 characters (the sign, the digit, the dot and 4 decimals). In ASCII format this takes 7 × 6E7 × (8 bits) = 336E7 bits. Since the device shows the data with 4 decimals, we need 200000 points between -10 and 10 to correctly represent the information. With 18 bits (2^18=262144) we are set. This means that the entire timetrace would take 6E7 × (18 bits) = 108E7 bits.
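The arithmetic of this example, written out in Python so the numbers can be changed easily. The variable names are mine; the figures are the ones from the paragraph above.

```python
# One sample per microsecond during one minute.
samples = 60 * 1_000_000

# ASCII: sign, digit, dot and 4 decimals, one byte (8 bits) per character.
chars_per_sample = 7
ascii_total_bits = samples * chars_per_sample * 8

# Binary: one code for every step of 0.0001 between -10 and 10.
steps = 20 * 10 ** 4
bits_per_sample = steps.bit_length()  # bits needed to count up to `steps`
binary_total_bits = samples * bits_per_sample

print(ascii_total_bits, "bits in ASCII")
print(bits_per_sample, "bits per sample in binary")
print(binary_total_bits, "bits in binary")
print(ascii_total_bits / binary_total_bits, "times smaller")
```

In bytes this is 420 MB against 135 MB for a single minute of data.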

All the discussion so far was to prove that choosing a proper way of representing your data may save space. So why do we still rely on ASCII format?

Sadly there is no universal convention on how to save data. Imagine you buy an oscilloscope and the manufacturer knows that the maximum voltage you are able to read is 5 volts. If you have 16 bits available, you can just define your space as going from -5 to +5 and you have 2^16 different values. The same manufacturer may provide you with a 12-bit multimeter that can go from 0 to 200V. And the files both instruments save have the same extension, .ins. Unless you know the rules that were used for generating the files, you will never be able to open them in another program. ASCII, on the other hand, is universal: it is used to represent this webpage for instance, Windows knows about it, Linux knows, etc.
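A small Python sketch shows why the rules matter: the same two bytes give completely different readings depending on which convention you assume. Both instruments and their scalings are the hypothetical ones from the paragraph above, and the byte order is one more assumption you would need to know.

```python
import struct

# Two bytes as they might appear in a file: the code 3072, little-endian.
raw = struct.pack("<H", 3072)

code = struct.unpack("<H", raw)[0]

# Convention of the 16-bit oscilloscope: codes spread over -5 V to +5 V.
scope_volts = -5 + code * 10 / 2 ** 16

# Convention of the 12-bit multimeter: codes spread over 0 V to 200 V.
meter_volts = code * 200 / 2 ** 12

# Reading the same bytes with the wrong byte order gives yet another number.
wrong_order_code = struct.unpack(">H", raw)[0]

print(scope_volts, meter_volts, wrong_order_code)
```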

Some binary formats are open, meaning that you can see the specifications and write a program to read the data (probably someone else already has). Sadly, and incomprehensibly, some manufacturers choose proprietary file formats that can only be opened with their own software. Even programs written in LabView, if not well documented, generate files that are hard to decode with Matlab or Python. The first step is then to check whether it is possible to open the binary files from your device in your analysis environment (I’m assuming the manufacturer did its best to store the data with the smallest footprint). If that is possible you have nothing else to worry about. If it is not possible, you can export the data from the device in ASCII format and open it in your analysis environment (in this case even Origin will do a decent job). Be sure to select the proper variable types (for instance in Matlab you can check whether you have unsigned integers of 16 or 32 bits, single or double precision, etc.). And then just save the data again; in matlab you achieve that with the command save. In python, if you are using numpy, you can use the array method dump (which pickles the array) or the function numpy.save.
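As a rough sketch of that last step, the following Python snippet writes the same readings once as text and once as 32-bit binary numbers and compares the file sizes. To keep it runnable anywhere I use only the standard library (`array`, `tempfile`) instead of numpy; the file names and the made-up readings are assumptions for the example.

```python
import array
import os
import tempfile

# Made-up readings with 4 decimals, standing in for data exported as ASCII.
readings = [round(-10 + i * 0.0001, 4) for i in range(1000)]

with tempfile.TemporaryDirectory() as folder:
    text_path = os.path.join(folder, "readings.txt")
    binary_path = os.path.join(folder, "readings.bin")

    # Plain text: one number per line.
    with open(text_path, "w", encoding="ascii") as handle:
        for value in readings:
            handle.write(f"{value:.4f}\n")

    # Binary: every number stored as a 32-bit float.
    with open(binary_path, "wb") as handle:
        array.array("f", readings).tofile(handle)

    text_size = os.path.getsize(text_path)
    binary_size = os.path.getsize(binary_path)

    # Read the binary file back to check that nothing relevant was lost.
    restored = array.array("f")
    with open(binary_path, "rb") as handle:
        restored.fromfile(handle, len(readings))

largest_error = max(abs(a - b) for a, b in zip(readings, restored))
print(text_size, "bytes as text,", binary_size, "bytes as binary")
print("largest difference after reading back:", largest_error)
```

The binary file is about half the size here, and the difference after reading back is far below the fourth decimal. With numpy, `numpy.save` and `numpy.load` play the same role as `tofile` and `fromfile` and also store the type and shape of the array in the file.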

If you are interested in solutions for even larger amounts of data, you may check the HDF5 file format. It is efficient not only for saving data but also when reading it back, since it allows you to fetch just the portion that is relevant.

  • José María Miotto

    nice discussion, but I would suggest using json instead of pickle, especially for python users, for many reasons. First, it is faster (https://kovshenin.com/2010/pickle-vs-json-which-is-faster/). Second, it is readable, which can be relevant if you look at the data from time to time. Third, c cannot read types at runtime, so c needs an ad hoc implementation, which is a hassle considering that a json library for reading/writing files is already implemented (in any language, probably). Fourth, pickle is a security risk, since it allows arbitrary code execution.

    • aqui_c

      Is JSON able to store binary data? I’ve used JSON only interfacing with web API’s and it was always ascii based. May it be similar to the XML format of LabView? (It stores the header info in plain ascii but data in binary?)
      Thanks for the suggestion.

      • José María Miotto

        well, not natively, but you can encode the data with base64, and then save it as you would do with normal text. At least in its python read/write implementation, json always gets text as utf8, so, as said, you have to encode the binary data as text first.
