Traditional DIY approach to data silo and data flow management

November 28, 2019


What is DataFactory Enterprise?

DataFactory Enterprise is an on-premises data factory that covers the entire data life cycle from raw data processing to model deployment in SQL Server and its SSIS, SSAS and ML Server components.

Feature engineering

On top of the same DataStore model / schema as in DataStore Community comes SQL code for data wrangling and generating 300+ variables from XML credit reports. The code is re-usable on other data sources after slight modification.

Data pipelines

The two options for building data pipelines are SQL Server Integration Services, SSIS and Azure Data Factory, ADF. Both ADF and SSIS have the capability to read data from a huge variety of data sources and write data to different destinations.

SSIS is the main on-premises option for deploying data pipelines in DataFactory Enterprise. In a hybrid approach SSIS and ADF are combined for extra flexibility or functionality.

The standard data pipeline covers XML credit report import, transformation, variables calculation and web service call so that every retrieval of a credit bureau report updates data in DataStore and returns response from predictive models. The pipeline is modifiable to embody other data sources like CRM or ERP.

Model deployment

There is a variety of product options to prepare, build, deploy and manage ML models using a number of cloud-based and on-premises options.

On-premises option #1: SQL Server Machine Learning Service

SQL Server Machine Learning Service allows data wrangling, visualization and feature engineering to be performed directly in-database using R / Python. The ML code is executed in-database without moving data outside SQL Server or over the network. Once the model development is complete, it is deployed in SQL Server using T-SQL stored procedures and called by applications to generate predictions.

On-premises option #2: Microsoft Machine Learning Server

Microsoft Machine Learning Server is stand-alone enterprise software for data science that comes with base and high-performance R and Python libraries. When data is stored in SQL Server and Apache Spark (on HDFS), those R and Python libraries can compute and analyze data locally, returning just the results, without the need to pull data across the network. R / Python models are published as web services.

Cloud-based option #1: Azure Machine Learning Studio

Microsoft Azure Machine Learning Studio is a collaborative, drag-and-drop tool to build, test, and deploy predictive analytics solutions. Azure Machine Learning Studio publishes models as web services consumed by custom apps, BI tools and even Excel.

Cloud-based option #2: Azure Machine Learning Service

Azure Machine Learning Service is a fully managed cloud service used to train, manage and deploy Python models as web services.

Alternatives:

Of course there is an option of developing the models in R / Python and deploying them in Apache Spark, H2O, SAS, SPSS.


Why use DataFactory Enterprise?

Traditional approach

Because the DataStore model / schema is open-source and 3NF normalized, it renders itself easily to extension almost infinitely over other data sources that your company may deem necessary in DS/ML projects. New data is added simply as new tables in the database.

This DIY in-house approach might be the right choice if your team of SQL Server developers and DBAs is ready to upgrade and customize our standard delivery to your company’s specific business requirements.

Easy native data pipelines

Native SSIS and optional cloud ADF data pipelines eliminate the need to send data across various external applications —> hence more efficient and less costly data engineering.

A wealth of extra features

The following components are available in SQL Server for creating additional functionality: Analysis Services; Reporting Services; Data Quality Services; Master Data Services.

SQL Server carries a built-in graph database that handles complex entity relationships such as social media data with a mix of dynamic many-to-many relationships.

Columnar indexes allow for storing and querying large data warehousing tables at up to 10x data compression rates.

SQL Server is big data ready

The SQL Server 2019 edition comes with built-in big data capabilities of Apache Spark and Hadoop Distributed File System. See links below for more info.

SQL Server is ubiquitous

Should we say SQL Server is known for its ease of use, vast worldwide user community, excellent documentation by Microsoft, and no shortage of developers and DBAs?


How we deliver the DataFactory Enterprise

The DataFactory Enterprise is installed on-premises on your servers as DataStore + documentation + data pipelines + sample trained ML models + SQL code for feature engineering.