When doing data analysis, indicators tend to take the spotlight, while datasets usually take a back seat as an attribution footnote or a metadata popup.
However, we often forget how intertwined dataset sources are and how this affects data analysis. For instance, we can never assume that indicators from different datasets are independent of each other: if one dataset was used as a source for the other, two indicators may actually be the same indicator, or one may influence the other as a weighted component of an index.
In this blog post, we check whether this applies to TCdata360 by taking a deeper look at its "dataset genealogy" and answering questions such as: Is it safe to do cross-dataset analysis using TCdata360 datasets? Are there interesting patterns in the relationships between TCdata360 datasets?
Quick introduction to network graphs
We call a dataset that serves as a data source for another dataset a "source", and a dataset that pulls indicator data from another a "target". Collectively, sources and targets are called "nodes".
To see the relationships between TCdata360 datasets, we mapped them in a directed network graph in which each dataset is a node. By directed, we mean that each source node is connected to its target nodes by an arrow, since direction is what distinguishes a source from a target. For the purposes of this blog, we restricted the network graph to datasets within TCdata360 only; all data sources and targets external to TCdata360 are excluded from our analysis.
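To make the source/target idea concrete, here is a minimal sketch of a directed graph in plain Python. It is only an illustration, not the code behind our actual graph: the dataset names, the `build_graph` and `sources_of` helpers, and the `internal` set are all hypothetical placeholders. Each edge is a (source, target) pair, and any edge touching a dataset outside the allowed set is dropped, mirroring the restriction to datasets within TCdata360.

```python
# Hypothetical (source, target) pairs. These names are placeholders,
# not actual TCdata360 datasets.
edges = [
    ("Dataset A", "Dataset B"),
    ("Dataset A", "Dataset C"),
    ("Dataset B", "Dataset C"),
    ("External X", "Dataset C"),  # external source, will be excluded
]

# Hypothetical set of datasets considered "inside" the platform.
internal = {"Dataset A", "Dataset B", "Dataset C"}


def build_graph(edges, allowed):
    """Map each allowed node to the set of targets it feeds.

    Edges with an endpoint outside `allowed` are skipped.
    """
    graph = {node: set() for node in allowed}
    for source, target in edges:
        if source in allowed and target in allowed:
            graph[source].add(target)
    return graph


def sources_of(graph, node):
    """Return every node that has an arrow pointing to `node`."""
    return {s for s, targets in graph.items() if node in targets}


graph = build_graph(edges, internal)
print(sorted(graph["Dataset A"]))              # targets fed by Dataset A
print(sorted(sources_of(graph, "Dataset C")))  # sources feeding Dataset C
```

Storing the arrows as source-to-target sets keeps the direction explicit: looking up a key gives a dataset's targets, while `sources_of` walks the arrows backwards to find where a dataset gets its data.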
Here's what the network graph looks like.
Each dataset is represented by a circle (aka "node") and is grouped and color-coded by data owner or institution. The direction between nodes is clearer in the interactive version, where a small arrow on each connecting line points from the source to the target.