Chapter 4. Linking Datasets
Your dataset is not an island. Somewhere, related datasets exist, even in places where you might not expect them. For instance, if your dataset has a Country of Origin column, then it is related to a geographical database that lists the total area per country. An Author column in a book dataset relates to a list of authors with biographical data. All datasets have such connections, yet you might not know about them, and neither does the computer that holds your dataset. For example, the record for The Picture of Dorian Gray might list "Wilde, O." as its author, whereas a biographical dataset might only have an entry for "Oscar Wilde". Even though both values refer to the same person, the strings differ, so it is difficult to connect the datasets. Furthermore, it would be impractical to link every dataset directly to every other one, because there are far too many of them.
Instead, the approach is to find unique identifiers for cell values, and in particular...