In a previous blog, Storytelling with Data, my colleague offered some great tips on keeping a presentation focused on the data. Unfortunately, the data doesn’t always tell a coherent story. Maybe the principal developer is gone, the documentation is stale, or the subject matter experts each have their own swirling interpretation of the content. What now? Analytical opportunity is knocking and it’s time for data discovery! Any data discovery initiative comes with challenges, however. To help you overcome them and sharpen your analytical skills, here are several tips for working through the data discovery process.
- Find your friends – As you start, take stock of all the interested parties, including knowledgeable users and developers. Collect the most recent documentation and models, even if they’re stale. Recognize that, no matter how obvious or important a conclusion seems to you, you’ll likely need an independent partner to corroborate your findings and their merit.
- Divide and conquer – To avoid getting overwhelmed, restrict your focus to just a handful of related tables at a time. Start with a set of tables whose content is well understood, so you have a sound calibration point; the most desirable early candidates carry only mild confusion and need modest cleanup. As your confidence in the process and query tools grows, gradually expand the focus area.
- Consider your schema as layers of columns, where each layer plays a role – It can help to look at the columns in a set of tables by the role they play. A simple initial approach is to categorize columns as either structural or descriptive.
- Structural columns are the primary and foreign key columns, along with those under unique constraints; they play the role of identifying our entities and holding them all together. If you like metaphors, structural columns are a lot like our skeletal system, where the tables act as bones, and the foreign keys and join conditions act as the joints that hold them all together.
- Descriptive columns are the remaining columns, which give meaningful attribution and qualities to our entities or tables. Continuing the human body metaphor, descriptive columns act as the muscle that drapes over the skeletal system and provides the look and feel that give an entity its everyday business meaning.
- Wiggle the join[t]s – Practice inner and left joins between each new set of tables. If you’re unsure of the best join clause, look through the stored procedures and views and use the best-groomed ones as a guide. With each successive join, build a mental model of the cardinality between entries in each table, and relate what you find back to its business meaning.
- Get comfy with recurring table patterns – It helps to see not individual tables, but rather small clusters of related tables. These little clusters usually follow a fairly limited set of patterns based on their relationships, such as: parent-child tables, bridge or associative tables with two or so parents, and hub-and-spoke or star tables. Once you’ve worked with a certain pattern, it becomes much easier to work with a new set of tables that follows the same pattern.
- Orphaned children and childless parents – Many schemas don’t impose referential integrity constraints, so the data load process may be the only thing keeping the data quality intact. If that’s the case, gauge how well those load-imposed constraints hold by querying parent-child tables for orphaned children and childless parents.
- Aim for the ambiguous – Once you’re confident in the discovery process, go after something of value: target a set of tables where there’s ambiguity or a misconception that the team really needs to clear up. Often the ambiguity stems from tables that are a poor representation of business reality; or a table might be structured properly, but its data quality renders it less useful. Focus on learnings that are valuable to your business sponsors, and make this exercise a worthy endeavor.
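The structural-versus-descriptive split can even be automated as a first pass. Below is a minimal sketch using Python’s built-in sqlite3 module against a toy schema (the customer and orders tables and the classify_columns helper are hypothetical); it treats primary and foreign key columns as structural and everything else as descriptive, ignoring unique constraints for brevity.

```python
import sqlite3

# Toy schema; the table and column names are purely illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name TEXT,
    region TEXT
);
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer(customer_id),
    order_date TEXT,
    amount REAL
);
""")

def classify_columns(conn, table):
    """Split a table's columns into structural (keys) and descriptive (the rest)."""
    # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
    info = list(conn.execute(f"PRAGMA table_info({table})"))
    pk_cols = {row[1] for row in info if row[5]}
    # PRAGMA foreign_key_list rows: (id, seq, table, from, to, ...)
    fk_cols = {row[3] for row in conn.execute(f"PRAGMA foreign_key_list({table})")}
    structural = [row[1] for row in info if row[1] in pk_cols | fk_cols]
    descriptive = [row[1] for row in info if row[1] not in pk_cols | fk_cols]
    return structural, descriptive

print(classify_columns(conn, "orders"))
# → (['order_id', 'customer_id'], ['order_date', 'amount'])
```

Running the same classification over every table in the focus area gives a quick first map of the skeleton versus the muscle.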
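“Wiggling the joints” often comes down to comparing inner-join and left-join row counts: the gap between them reveals unmatched rows and hints at cardinality. Here is a minimal sketch, again using Python’s sqlite3 with a toy parent/child pair (all names and rows are illustrative):

```python
import sqlite3

# Toy parent/child tables; names and rows are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE parent (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE child  (id INTEGER PRIMARY KEY, parent_id INTEGER, note TEXT);
INSERT INTO parent VALUES (1, 'a'), (2, 'b'), (3, 'c');
INSERT INTO child  VALUES (10, 1, 'x'), (11, 1, 'y'), (12, 99, 'stray');
""")

# Inner join returns only matched pairs; left join preserves every parent row.
inner = conn.execute(
    "SELECT COUNT(*) FROM parent p JOIN child c ON c.parent_id = p.id"
).fetchone()[0]
left = conn.execute(
    "SELECT COUNT(*) FROM parent p LEFT JOIN child c ON c.parent_id = p.id"
).fetchone()[0]

print(inner, left)  # → 2 4
# left - inner = 2: two parents have no children at all, while parent 1
# matching two children tells us the relationship is one-to-many.
```

Posing those counts back as business questions (“why do two parents have no children?”) is where the mental model starts to form.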
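The orphaned-children and childless-parents check translates directly into a pair of LEFT JOIN … IS NULL queries. A minimal sketch with Python’s sqlite3 and a hypothetical parent/child pair:

```python
import sqlite3

# Toy parent/child tables with no enforced foreign key; rows are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE parent (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE child  (id INTEGER PRIMARY KEY, parent_id INTEGER);
INSERT INTO parent VALUES (1, 'a'), (2, 'b'), (3, 'c');
INSERT INTO child  VALUES (10, 1), (11, 1), (12, 99);
""")

# Orphaned children: child rows whose parent_id matches no parent.
orphans = conn.execute("""
    SELECT c.id FROM child c
    LEFT JOIN parent p ON p.id = c.parent_id
    WHERE p.id IS NULL ORDER BY c.id
""").fetchall()

# Childless parents: parent rows with no matching child.
childless = conn.execute("""
    SELECT p.id FROM parent p
    LEFT JOIN child c ON c.parent_id = p.id
    WHERE c.id IS NULL ORDER BY p.id
""").fetchall()

print(orphans, childless)  # → [(12,)] [(2,), (3,)]
```

A handful of orphans may be load-process noise; a large proportion suggests the implied constraint isn’t really being enforced.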
Data discovery exercises can provide a valuable independent interpretation of the value and quality of a data source. That interpretation lets project teams and business sponsors choose the data source best aligned with the organization’s key business drivers while introducing the least overall project risk.