Best way to represent null and empty data in an excel file
We have an application that allows users to design their own forms and export the data to excel. One problem is the representation of null or empty data. There are two scenarios that we need to support:
- The data was not present at all (question was skipped)
- The data was empty (question was left blank)
What is the best default representation of this in Excel/csv? Currently we are leaving the cell empty for scenario 1 are struggling to find a good representation for scenario 2. The requirements for this are:
- Should not be tied to a particular language (so "no data" is not good)
- Must play nice with stats programs (not sure what they don't like, but "---" is bad)
- Should be differentiable from each other
- Must not rely on special formatting (e.g. they may get a csv and should still be able to differentiate)
Can you clarify the difference between a question being skipped and the question being left blank? If a user skips a question, have they not left it blank?
Yeah, so in our forms, questions can be made (ir)relevant by other questions, and it's important to be able to differentiate in the export from "this question wasn't asked" versus "this question wasn't answered".
Historically empty values in tables were represented as a dash, so it might be some single dash character. Typography prefers En or Em dash, but for usability it would be easiest to input a good old hyphen, which I personally use to denote empty values for parsing tasks via Excel.
As a data engineer, I would suggest you can use "NA" or "NAN" for empty data cell because of the following 2 reasons which I could think of:
- It is a standard terminology for null data in data industry
- and, of course it will make life easy for data engineers
+1, when cell is left blank some user could perceive that something didn't work and the value is ambiguous. Populating it with value clarifies that data wasn't provided.
NA could be understood as "Not Applicable" (question was skipped -- scenario 1), or "No Answer" (question was not answered -- scenario 2). Ambiguous again.
@PashaS "NA" in data science field means "No Answer" and "NAN" means "Not a Number" and we use these two strings to represent missing or no data.
@AnkitSharma: and it could also be spelled as "Not Available". But linguistics aside -- in this case you have 2 different reasons for missing data. And I don't see any logical way to assign "N/A" to either the first or the second scenario.
However this makes the cells non numeric so difficult to sum/average, import into another app etc
Actually, statistical/data mining software packages have in-built functionality to take care of NA/NAN if we intend to perform any numeric analysis. Its not the case of excel.