Using spreadsheet files for data storage and exchange can lead to the corruption or even loss of information, sometimes without anyone noticing it immediately. I have come across this problem too often, so I could not help but write a rather long post about the topic.
Sharing datasets with other researchers or collaboration partners is a vital part of the knowledge exchange in a community. This may happen on a large scale, in the form of supplementary material accompanying publications, or on a small scale within research groups. In this post I do not want to focus on what is actually shared, but on how. A crucial but commonly underestimated element when sharing datasets with others is the data format used.
A good data format should promote the exchange of data, not hinder it by adding further levels of complexity. In other words: it should be as simple as possible without losing information when packing the data.
A concomitant factor is machine-readability, which is needed for any further data processing. And perhaps the most prominent factor is the long-term accessibility of the datasets: a factor that should be considered even if you believe (often incorrectly) that the data will no longer be required some years after its generation.
A common way to represent result datasets is in tabular form, e.g. rows represent observations and columns represent individual data types. The need for a convenient way to access such tabular data led to the use of spreadsheet applications. These applications can be versatile tools if used in the right way. Besides allowing direct access to individual cells, spreadsheet applications provide fundamental arithmetic operations and commonly used functions to process and aggregate multiple cells within tables. Perhaps their most prominent feature is the ability to visualize selected parts of a dataset as charts more or less instantly. But as convenient as the spreadsheet applications themselves may be for many tasks, their data/file formats are unsuitable for data exchange within a scientific community and especially unsuitable for long-term storage.
Just because spreadsheet software is widespread does not mean that it is generally applicable.
Although most computer systems come with pre-installed spreadsheet software, taking this situation for granted, and especially taking one's own installation as a reference, is a fallacy. When exchanging datasets with other researchers or cooperation partners, their respective requirements, and thereby the software they are using, may differ completely. In the case of spreadsheet applications, the "lowest common denominator" has to be taken into consideration so as not to alienate part of the audience.
Microsoft, for example, has done a good job of supporting older Office versions by allowing newer releases of Excel to import and export older sheet file formats. Excel, as part of the Office 365 suite, claims compatibility back to the Office 97 file format. But in my experience this is a lossy process: features like macros or layouts might get lost when converting between releases.
The same limitations occur, in particular, when sharing spreadsheets between different products, e.g. between Excel and LibreOffice. The open-source community driving the LibreOffice releases has reached a high level of compatibility with MS Excel, but there is still a large number of issues tagged with "Excel" in their bug-tracking system.
In the latest version of the Excel file format, the underlying structure is XML (Standard ECMA-376, Office Open XML File Formats). Although this basically counts as a plain-text format, the requirements of an Excel spreadsheet bloat the content to a level that is no longer parseable by common means (see the MSDN Library). The tremendous effort the LibreOffice programmers invest in implementing compatibility, and their list of open issues, can be seen as evidence.
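To get a feeling for how much machinery sits behind a single sheet, one can open an .xlsx file as the ZIP container it is and list its XML parts. The sketch below uses only Python's standard library; since no real workbook is at hand here, it builds a stand-in archive with a few typical member names (these names mirror the usual OOXML layout, but the archive itself is fabricated for illustration).

```python
import io
import zipfile

# An .xlsx file is a ZIP archive of XML parts (ECMA-376).
# For illustration we build a stand-in archive with typical member
# names; with a real workbook you would simply pass its path to
# zipfile.ZipFile("workbook.xlsx") instead.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("[Content_Types].xml", "<Types/>")
    zf.writestr("xl/workbook.xml", "<workbook/>")
    zf.writestr("xl/worksheets/sheet1.xml", "<worksheet/>")

with zipfile.ZipFile(buf) as zf:
    members = zf.namelist()

for name in members:
    print(name)
```

Even this skeletal listing hints at the nesting involved; a real workbook adds styles, shared strings, relationships, and more, all of which a third-party parser has to understand.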
Keep it simple, stupid (KISS)… keep it CSV formatted
Although you could argue that spreadsheet software like Excel is able to import and export CSV and TSV files, there are limitations in these steps, and information might be lost; I have experienced this myself. So: do not rely on the power (and infallibility) of these file-format converters.
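One cheap safeguard is to fingerprint a dataset before and after a spreadsheet round trip: if the digests differ, the converter changed something. A minimal sketch in Python (the sample contents are invented, but stripping leading zeros from an ID column is a typical silent change):

```python
import hashlib

def digest(data: bytes) -> str:
    """Fingerprint file contents so changes are easy to detect."""
    return hashlib.sha256(data).hexdigest()

original = b"id,value\n001,3.14\n"
# After a spreadsheet round trip the "001" may come back as "1":
roundtrip = b"id,value\n1,3.14\n"

unchanged = digest(original) == digest(roundtrip)
print(unchanged)  # False -> the converter altered the data
```

In practice you would read both files with `pathlib.Path(...).read_bytes()` and diff them properly when the digests disagree; the hash only tells you that something changed, not what.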
A simple yet versatile format to store your data in is Comma-Separated Values (CSV). Similar to the layout of a spreadsheet, numerical data (and text) is stored in rows and columns. Its versatility, one of the format's greatest advantages, may also be its greatest disadvantage. An RFC exists (RFC 4180) which recommends a standardized form, but beyond that there is no fixed format definition, e.g. of delimiters. Without taking care of the delimiters and the contents, collisions may occur which result in corrupted datasets. This may especially happen with textual content that itself contains the delimiter as a regular character. If necessary, all major spreadsheet applications support the import and export of CSV-formatted data, although I do not recommend them for larger datasets due to the previously stated limitations.
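The delimiter-collision problem is solved by quoting, and a decent CSV library handles this for you. Python's standard `csv` module, for example, quotes any field that contains the delimiter or a quote character; the field values below are invented for illustration:

```python
import csv
import io

# Text fields may contain the delimiter itself ("Smith, John") or
# embedded quotes; proper quoting keeps the table from corrupting.
rows = [
    ["name", "score"],
    ["Smith, John", "42"],   # comma inside a field
    ['He said "hi"', "7"],   # embedded quote characters
]

buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL)
writer.writerows(rows)

# Reading it back recovers the fields intact despite the commas.
parsed = list(csv.reader(io.StringIO(buf.getvalue())))
print(parsed[1])  # ['Smith, John', '42']
```

Hand-rolled `line.split(",")` parsing, by contrast, is exactly where the corrupted datasets mentioned above come from.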
Perhaps another argument for CSV is its easy processing in simple scripting environments (shell scripts), but also in Perl or Python, the two major scripting languages. And for most other higher-level languages I am fairly sure libraries for import and export exist and are easily applicable.
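As a sketch of how little code such processing takes, here is a Python snippet that sums one column of a small in-memory CSV (the column names and values are made up for the example):

```python
import csv
import io

# A small in-memory CSV standing in for a results file on disk;
# with a real file you would use open("results.csv", newline="").
data = "sample,measurement\nA,1.5\nB,2.5\nA,3.0\n"

# DictReader maps each row to {column name: value}, so the code
# stays readable even if the column order changes.
total = 0.0
for row in csv.DictReader(io.StringIO(data)):
    total += float(row["measurement"])

print(total)  # 7.0
```

Exactly the same few lines work from a cron job, a cluster node, or a colleague's machine, with no spreadsheet application installed at all.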
There is a rich variety of really simple file formats that can be used for data storage and exchange. But maybe I am a bit biased by the fact that I am focusing on the file format here instead of the user perspective. The formats are simple, but not necessarily simple to use for a regular user. Spreadsheet applications, on the other hand, are versatile and mostly easy-to-handle tools, especially for visually processing small tabular datasets. But still: for mid- and long-term storage and exchange of datasets they are unsuitable.