How does a user identify a unique record in a dataset? Remember the concatenation you need to perform when your lookup value depends on more than one field? In the RDBMS world, these concatenated fields are called “candidate keys”, or in another context “natural keys”, because they are part of the original dataset and are independent variables best not created by formulas. While natural keys are human-readable and intuitive to understand, they come with two disadvantages. First, if the business logic changes, the number of keys or the meaning of a key may also change. Second, as the number of keys and/or the distinct values of those keys grows, spreadsheet performance degrades. Therefore, tracking the natural keys with a designated unique identifier (something like CAT-PRODUCT-1) that increments by 1 whenever a truly new record appears is a future-proof trade-off: it preserves the evolving business logic while achieving the elegance of a single identifier (which then becomes the primary key), without resorting to an integer-only surrogate key.
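To make the idea concrete, here is a minimal Python sketch of assigning an incrementing identifier to each truly-new record, keyed on its concatenated natural key. The prefix "CAT-PRODUCT" and the field names are illustrative assumptions, not part of any real schema.

```python
def assign_ids(records, key_fields, prefix="CAT-PRODUCT"):
    """Map each unique natural key to a stable, incrementing identifier."""
    seen = {}      # natural key -> assigned identifier
    counter = 0
    labeled = []
    for rec in records:
        # Concatenate the natural-key fields, as you would in a lookup column
        natural_key = "|".join(str(rec[f]) for f in key_fields)
        if natural_key not in seen:
            counter += 1                     # only truly new records increment
            seen[natural_key] = f"{prefix}-{counter}"
        labeled.append({**rec, "id": seen[natural_key]})
    return labeled

rows = [
    {"category": "Toys", "product": "Yo-yo"},
    {"category": "Toys", "product": "Kite"},
    {"category": "Toys", "product": "Yo-yo"},   # duplicate natural key
]
labeled = assign_ids(rows, ["category", "product"])
# labeled[0]["id"] == "CAT-PRODUCT-1"; the duplicate row reuses that same id
```

The counter only advances for an unseen natural key, so the identifier stays stable even as records repeat.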
Ever noticed a formula throw a “#VALUE!” error when you attempt to divide a number by text, and wish you’d known about this silly problem sooner? Many formulas are type-specific, so understanding the data type of your values really helps with debugging errors and cleaning data, such as 2 instead of “2” or 07/20/2020 22:35 instead of 1595284509. Common data types include string, integer, float, datetime, and boolean. You can also view data types as the computer’s accounting system, because the resources within a computer are finite. An integer typically takes up less memory than a float, and a boolean takes up the least space with only two possible values, 0 and 1. Converting a field’s data type is like telling the computer how much space you are willing to allocate to it, so performance can be tuned when an object takes up more space than it should.
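The same type pitfalls exist outside the spreadsheet. A small Python sketch of the two examples above: a numeric-looking string that must be converted before arithmetic, and a Unix timestamp turned into a human-readable datetime.

```python
from datetime import datetime, timezone

raw = "2"                     # looks numeric, but it's a string
value = int(raw)              # explicit conversion before arithmetic
half = value / 2              # now division works instead of erroring

ts = 1595284509               # seconds since the Unix epoch
dt = datetime.fromtimestamp(ts, tz=timezone.utc)
# Same instant as the integer above, but in a friendlier representation
```

Trying `raw / 2` directly raises a `TypeError`, Python's equivalent of the spreadsheet's “#VALUE!”.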
Or column vs. row in spreadsheet lingo. You know what I’m talking about if you speak either of these two languages, and there is a difference. Many users decide this arbitrarily depending on what report they are developing at the time. While a Finance report needs “Jan”, “Feb”, and “Mar” to each be their own column, the raw data that feeds into the report probably doesn’t; instead, you would have an attribute called “Month” with the values “Jan”, “Feb”, and “Mar”. Distinguishing the two at different stages of your data crunching determines how streamlined your reporting pipeline can be, and it is one of the foundation stones of the “tidy data” concept in the next pointer.
Many users spend a lot of time and effort developing and maintaining a colorful Excel report, and it usually looks incredibly wide, especially in Finance. These reports are perfectly fine as long as they are not “intermediate” data that still needs to be manipulated to produce another report. Managers tend to love this view too, because it looks like a pivot table. For analysts, on the other hand, it is crucial to create a fork in the process: one branch leads to a “wide table” report, the other to a “long table”. A normalized long table is what tidy data (a concept popularized by the R package tidyr) tends to look like, and this format is best optimized for analysis and visualization.
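The wide-to-long transformation can be sketched in a few lines of plain Python (pandas' `melt` or tidyr's `pivot_longer` do the same thing at scale). The account names and amounts are made-up illustrations; the point is that the monthly columns the Finance report wants become rows of a tidy "Month"/"Amount" table.

```python
# A wide Finance-style table: one column per month
wide = [
    {"Account": "Travel", "Jan": 100, "Feb": 120, "Mar": 90},
    {"Account": "Office", "Jan": 50,  "Feb": 55,  "Mar": 60},
]

def melt(rows, id_field, value_fields, var_name="Month", value_name="Amount"):
    """Unpivot: turn each value column into its own (attribute, value) row."""
    long_rows = []
    for row in rows:
        for field in value_fields:
            long_rows.append({id_field: row[id_field],
                              var_name: field,
                              value_name: row[field]})
    return long_rows

long_table = melt(wide, "Account", ["Jan", "Feb", "Mar"])
# 2 wide rows x 3 month columns -> 6 tidy rows
```

The long table is what feeds analysis and charts; the wide view is then just one pivot away.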
Some combinations of values simply don’t make sense, but how do you prevent them from happening in the first place? In Excel, the all-powerful INDIRECT function does the heavy lifting for you. As long as your formula is correct, only the values that make sense will be available for selection, depending on the other value. But even INDIRECT needs a “map” to know which pairs are valid. It is tempting to start a brand-new tab and type out the possible combinations, but hold on: a less redundant approach is to keep these mappings in one shared directory. That way everybody references the same thing, and you avoid the manual work of creating a redundant tab that might look slightly different from the original copy.
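The logic behind a dependent dropdown is just a shared mapping from one field's value to the other field's allowed values. A minimal sketch, with made-up categories standing in for whatever your shared directory actually holds:

```python
# One shared "map" of valid pairs, referenced by everyone
VALID_PAIRS = {
    "Fruit":     ["Apple", "Banana"],
    "Vegetable": ["Carrot", "Spinach"],
}

def allowed_values(category):
    """The options that make sense given the other field's value."""
    return VALID_PAIRS.get(category, [])

def is_valid(category, value):
    """Reject combinations that simply don't make sense."""
    return value in allowed_values(category)
```

Because every consumer calls into the same `VALID_PAIRS`, updating the mapping in one place updates every dropdown that depends on it.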
Data validation is often under-appreciated until people realize they “should’ve (or could’ve)” put a simple constraint on a field, so that at least an invalid value glares at you when it appears. To complement this, leverage the data dictionary to suggest possible values instead of waiting for an invalid one, especially for text fields with categories. Additionally, conditional formatting can be a great tool to validate data that does not yet have any “bounds”. Especially when you are still unfamiliar with the collected values, conditional formatting immediately highlights duplicates or outliers in popping colors.
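What conditional formatting does visually can be sketched as two small checks: flag duplicates, and flag simple outliers. The two-standard-deviation threshold below is one common rule of thumb, not a universal definition of "outlier".

```python
from collections import Counter
from statistics import mean, stdev

def find_duplicates(values):
    """Values that appear more than once, like a duplicate-highlight rule."""
    counts = Counter(values)
    return {v for v, n in counts.items() if n > 1}

def find_outliers(values, z=2.0):
    """Values more than z standard deviations from the mean."""
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) > z * sigma]
```

Run against freshly collected values, these checks surface the same suspicious cells a conditional-formatting rule would color in.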
Perhaps this has more to do with one’s organizational skills than with the technical competence to manipulate spreadsheets, but even in software engineering, the location of a file has important implications for the functionality of a modularized application. Usually, the more frequently used objects sit closer to the root folder (the very first level), while the more special-purpose objects get their own dedicated subfolders. Organizing files this way, rather than by personal preference or department-specific logic, not only eases file navigation for everyone but also guides future developers who design a software solution on top of the existing system.
Many seasoned analytics professionals find themselves to be at least half-programmers as they progress further in their careers, because they have become fluent in writing formulas and even in the scripting language supporting their spreadsheet software (VBA for Excel, Google Apps Script for Google Sheets). While this is a great place to start automating manual work, take the previous pointers into holistic consideration, because writing eight VLOOKUPs across eight spreadsheets is just as manual and unscalable. Typically, when the spreadsheet slows down or the script becomes difficult to debug, that is the signal to break down or simplify the task it is trying to perform.
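One way to collapse those eight VLOOKUPs is a single scripted join: load each sheet's mapping once, then answer every lookup from the consolidated structure. The sheet names, SKUs, and values below are illustrative assumptions.

```python
# Each "sheet" reduced to a key -> value mapping, loaded once
sheets = {
    "prices": {"SKU-1": 9.99, "SKU-2": 4.50},
    "stock":  {"SKU-1": 12,   "SKU-2": 0},
}

def lookup(sku):
    """One consolidated lookup instead of one VLOOKUP formula per sheet."""
    return {name: table.get(sku) for name, table in sheets.items()}

row = lookup("SKU-1")
# row == {"prices": 9.99, "stock": 12}
```

Adding a ninth source is one new entry in `sheets`, not another column of formulas to copy down.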
Never trust your user. This is preached by many, and I’ll add one more: never trust yourself either. If you are the one entering data, or creating the space for others to enter data, it is best that data producers do not see the underlying table, especially if it also contains data that they do not directly input. Data producers should have their eyes only on the fields that require their input, and these fields should have guardrails such as disallowed values and/or required fields. Forms come in handy here, and to further lower the likelihood of bad data, utilize drop-downs (remember to leverage the data mapping in point #5 if only certain values are allowed!) and autocomplete.
If I’m ever forced to trust one more than the other, I’d choose machine-generated values over user input. This deserves its own little section, because leveraging the machine to generate, transform, or deduce values for you is in itself a milestone toward process standardization and automation. Remember those awfully similar emails, each with the same attachment for an invoice or a table of details you should care about? Don’t open them one by one and transcribe them into a spreadsheet manually: tap into the API and automate the data extraction. If there is a datetime field that should populate upon user input, fill it out for them once a formula detects the input, instead of asking them to type it. Wherever there is an alternative to user input, always consider automating data gathering.
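The auto-filled timestamp pattern can be sketched in a few lines: the moment user input is detected, the machine stamps the time instead of asking the user to. The field names here are hypothetical.

```python
from datetime import datetime, timezone

def on_user_input(record, field, value):
    """Record the user's value, then machine-generate the timestamp."""
    record[field] = value
    # setdefault keeps the original timestamp if the user edits later
    record.setdefault("entered_at", datetime.now(timezone.utc).isoformat())
    return record

rec = on_user_input({}, "invoice_total", 250.0)
# rec now holds the user's value plus a machine-generated "entered_at"
```

The user typed one number; the machine supplied the metadata, consistently formatted, every time.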