I'm part of a team asked to perform predictive analysis on a huge relational database. The data is a mess: documentation ranges from mediocre to incorrect to absent, and information is scattered all over the tables.
For example, if I want to match addresses with telephone numbers, I can query three or four different tables, each one containing information the others lack, and some of it may be information I shouldn't use at all.
To get data, the people I'm working with rely heavily on folklore: they know that to obtain phone numbers from addresses, you have to query this and that in a particular way, because John told them so a few years ago. And John knew it because Sam told him. And so on. This folklore is essentially never challenged, and it is often wrong.
Retrieving information is a pain: we spend most of our time just extracting data from the database, before we even try to do anything clever with it.
I'd like to establish a standard that we can use in all our projects, and I'd like it to improve as we gather the folklore. I don't want to create a "How to do it" super-document, which would probably spawn a million local variants. So basically, I think I want to encapsulate domain knowledge in "something."
I thought we could create tables that aggregate the scattered information in one place, then document and query those new tables from now on instead of relying on folklore. So no more three locations for telephone numbers and addresses: one TelephoneToAddress table.
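To make the idea concrete, here is a minimal sketch of what I have in mind, using SQLite through Python. The table and column names are purely illustrative, not from the real database; the point is that consumers query one consolidated object instead of knowing which of three legacy tables to look in. (A VIEW rather than a copied table is one option, since it stays in sync with the sources automatically.)

```python
import sqlite3

# Hypothetical schema: three legacy tables each hold a partial
# address-to-phone mapping (names are illustrative only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers  (address TEXT, phone TEXT);
    CREATE TABLE billing    (addr TEXT, tel TEXT);
    CREATE TABLE legacy_crm (street_address TEXT, phone_number TEXT);

    INSERT INTO customers  VALUES ('1 Main St', '555-0100');
    INSERT INTO billing    VALUES ('2 Oak Ave', '555-0101');
    INSERT INTO legacy_crm VALUES ('3 Elm Rd',  '555-0102');
""")

# One consolidated, documented access point: everyone queries this
# instead of relying on folklore about which table to use.
conn.execute("""
    CREATE VIEW telephone_to_address AS
        SELECT address,        phone        AS phone FROM customers
        UNION
        SELECT addr,           tel          FROM billing
        UNION
        SELECT street_address, phone_number FROM legacy_crm
""")

rows = conn.execute(
    "SELECT address, phone FROM telephone_to_address ORDER BY address"
).fetchall()
print(rows)
# → [('1 Main St', '555-0100'), ('2 Oak Ave', '555-0101'), ('3 Elm Rd', '555-0102')]
```

Whether the consolidated object should be a view (always current, but every query pays the UNION cost) or a materialized table (fast to query, but needing a refresh process) is exactly part of what I'm asking about.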
Does it make any sense? In the context of data exploitation, is it even a good idea?