November 28, 2019
What is DataStore Community?
DataStore Community is a light-weight database based on the DataStore schema and populated with data from credit bureau reports.
It is open-source and contains several components:
- DataStore model / schema and SQL code to create a sample database in SQL Server (any edition – Express, Standard or Enterprise)
- XSLT code to re-engineer data from credit report XML files into DataStore
- SSIS package(s) to parse transformed XML files into SQL Server
- SQL code to detect and fix common data quality issues
- Documentation and source code on our GitHub page
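To make the XML-to-tables step in the list above more concrete, here is a minimal sketch in Python. It is not the project's XSLT or SSIS code: the element names (`CreditReport`, `Subject`, `Contract`, `Amount`) and the output column names are hypothetical, chosen only to illustrate how one nested credit report can be flattened into rows for separate, related tables.

```python
import xml.etree.ElementTree as ET

# Hypothetical credit report layout; real bureau XML formats will differ.
SAMPLE_REPORT = """
<CreditReport>
  <Subject id="S1"><Name>Jane Doe</Name></Subject>
  <Contracts>
    <Contract id="C1"><Amount>1000.00</Amount></Contract>
    <Contract id="C2"><Amount>250.50</Amount></Contract>
  </Contracts>
</CreditReport>
"""


def flatten_report(xml_text):
    """Split one nested report into a subject row and a list of contract rows."""
    root = ET.fromstring(xml_text.strip())
    subject = root.find("Subject")
    subject_row = {
        "subject_id": subject.get("id"),
        "name": subject.findtext("Name"),
    }
    # Each contract row carries the subject key, mirroring a foreign key.
    contract_rows = [
        {
            "contract_id": contract.get("id"),
            "subject_id": subject_row["subject_id"],
            "amount": float(contract.findtext("Amount")),
        }
        for contract in root.iter("Contract")
    ]
    return subject_row, contract_rows


if __name__ == "__main__":
    subject_row, contract_rows = flatten_report(SAMPLE_REPORT)
    print(subject_row)
    for row in contract_rows:
        print(row)
```

In the actual DataStore Community components this work is split between the XSLT transformation and the SSIS package(s); the sketch only shows the shape of the result.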
Note: more features will be added in the future, through community contribution and from DataFactory Enterprise.
Please see DataFactory overview and feature comparison for more details.
Why use DataStore Community?
It is free
Great value for zero money – the DataStore gets better with every credit report parsed and stored in it, eventually becoming a mini-copy of the credit histories database, limited only by the number of people and legal entities whose credit reports are loaded into it.
The bigger the data, the better the predictive models
Loading customer-centric data from ERP, CRM and other data sources into the datastore will make your datasets richer. Whatever the credit bureau’s models can do, your models can do better with the additional data.
Thus a valuable digital asset is created for free from credit report costs incurred over the years.
A 3NF normalized datastore that is easily extensible
Thanks to the 3NF database model normalization, additional data sources (e.g. customer card transactions, marketing campaign responses) are added simply as new tables alongside the tables storing data from credit reports.
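As a rough illustration of this extensibility, the sketch below uses Python's built-in `sqlite3` module as a stand-in for SQL Server. The table and column names (`subject`, `credit_contract`, `card_transaction`) are hypothetical and do not come from the DataStore schema; the point is only that a new data source becomes a new table keyed to an existing one, with no change to the tables already holding credit report data.

```python
import sqlite3


def build_demo_db():
    """Create a tiny normalized schema standing in for credit report tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
    conn.executescript("""
        CREATE TABLE subject (
            subject_id INTEGER PRIMARY KEY,
            full_name  TEXT NOT NULL
        );
        CREATE TABLE credit_contract (
            contract_id INTEGER PRIMARY KEY,
            subject_id  INTEGER NOT NULL REFERENCES subject(subject_id),
            amount      REAL NOT NULL
        );
    """)
    return conn


def add_card_transactions(conn):
    """Add a new data source as its own table; existing tables are untouched."""
    conn.executescript("""
        CREATE TABLE card_transaction (
            transaction_id INTEGER PRIMARY KEY,
            subject_id     INTEGER NOT NULL REFERENCES subject(subject_id),
            amount         REAL NOT NULL
        );
    """)


if __name__ == "__main__":
    conn = build_demo_db()
    add_card_transactions(conn)
    conn.execute("INSERT INTO subject VALUES (1, 'Test Subject')")
    conn.execute("INSERT INTO card_transaction VALUES (1, 1, 25.0)")
    print(conn.execute("SELECT * FROM card_transaction").fetchall())
```

The foreign key on `subject_id` is what ties the new table to the existing data, so rows for unknown subjects are rejected rather than silently stored.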
A 3NF normalized datastore is a good starting point for building a designated DS/ML data silo, and can be structured to function as the single source of truth in data integration projects.
Integration into existing datawarehousing infrastructure
Like other mature database management systems, SQL Server has well-designed tools for importing all sorts of data from various sources – relational and NoSQL databases, CSV, TXT, XML files. There are also multiple options for exporting data from SQL Server to other destinations.
The data warehousing ecosystem of your company probably has a number of data silos, some of which are used for BI, reporting and data mining purposes. Depending on the specific settings and business requirements, DataStore can be deployed in two major ways, with a range of combinations in between:
- As yet another datamart in the existing datawarehousing infrastructure by importing pre-processed and ready-to-use ‘final’ data into a master DWH;
- As a stand-alone data silo for DS/ML projects, eventually transforming it into a master DWH;
- Various combinations of the two options above.
How we deliver the DataStore Community
It is available for review and free download from our GitHub page.
Useful links
Datawarehouses, data lakes and datamarts are a big topic revolving around two major datawarehouse philosophies, Inmon and Kimball. Our design of DataStore is in line with Bill Inmon’s paradigm.
Please see below a few links on DWH design.
While the source data in DataStore is structured into many tables linked with each other through foreign keys to ensure data integrity and minimize data redundancy, denormalized datamarts can be created as and when required. In practice ‘denormalized’ usually means that tables with ‘final’ data such as variables for predictive analytics and modeling will be created for use in BI, reporting, DS/ML and other projects.
Strict 3NF data atomicity requirements do not apply to denormalized datamarts.
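A denormalized datamart of this kind can be sketched as follows. This is a minimal illustration using Python's built-in `sqlite3` as a stand-in for SQL Server; the tables `subject` and `credit_contract`, the datamart name `dm_subject_features`, and the two derived variables are all hypothetical and not taken from the DataStore schema.

```python
import sqlite3


def build_feature_datamart(conn):
    """Materialize one wide row per subject from the normalized source tables."""
    conn.executescript("""
        DROP TABLE IF EXISTS dm_subject_features;
        CREATE TABLE dm_subject_features AS
        SELECT s.subject_id,
               s.full_name,
               COUNT(c.contract_id)       AS contract_count,
               COALESCE(SUM(c.amount), 0) AS total_amount
        FROM subject s
        LEFT JOIN credit_contract c ON c.subject_id = s.subject_id
        GROUP BY s.subject_id, s.full_name;
    """)


def demo():
    """Build a tiny normalized source, derive the datamart, return its rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE subject (
            subject_id INTEGER PRIMARY KEY,
            full_name  TEXT NOT NULL
        );
        CREATE TABLE credit_contract (
            contract_id INTEGER PRIMARY KEY,
            subject_id  INTEGER NOT NULL REFERENCES subject(subject_id),
            amount      REAL NOT NULL
        );
        INSERT INTO subject VALUES (1, 'Subject A'), (2, 'Subject B');
        INSERT INTO credit_contract VALUES (10, 1, 1000.0), (11, 1, 500.0);
    """)
    build_feature_datamart(conn)
    return conn.execute(
        "SELECT subject_id, contract_count, total_amount "
        "FROM dm_subject_features ORDER BY subject_id"
    ).fetchall()


if __name__ == "__main__":
    for row in demo():
        print(row)
```

The datamart repeats the subject name and stores pre-aggregated values, which would violate 3NF in the source tables but is acceptable here because it is a derived table that can be rebuilt from the normalized data at any time.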