November 28, 2019
What is DataFactory Enterprise?
DataFactory Enterprise
is an on-premises data factory that covers
the entire data life cycle from raw data processing to model deployment
in SQL Server and its SSIS, SSAS and ML Server components.
Feature engineering
On top of the same DataStore
model / schema as in
DataStore Community
comes SQL code for data wrangling and generating 300+ variables from
XML credit reports.
The code is re-usable on other data sources after slight modification.
Data pipelines
The two options for building data pipelines are SQL Server Integration Services, SSIS and Azure Data Factory, ADF. Both ADF and SSIS have the capability to read data from a huge variety of data sources and write data to different destinations.
SSIS is the main on-premises option for deploying data pipelines in DataFactory Enterprise
.
In a hybrid approach SSIS and ADF are combined for extra flexibility or functionality.
The standard data pipeline covers XML credit report import, transformation,
variables calculation and web service call so that every retrieval of a credit bureau
report updates data in DataStore
and returns response from predictive models.
The pipeline is modifiable to embody other data sources like CRM or ERP.
Model deployment
There is a variety of product options to prepare, build, deploy and manage ML models using a number of cloud-based and on-premises options.
On-premises option #1: SQL Server Machine Learning Service
SQL Server Machine Learning Service allows data wrangling, visualization and feature engineering to be performed directly in-database using R / Python. The ML code is executed in-database without moving data outside SQL Server or over the network. Once the model development is complete, it is deployed in SQL Server using T-SQL stored procedures and called by applications to generate predictions.
On-premises option #2: Microsoft Machine Learning Server
Microsoft Machine Learning Server is stand-alone enterprise software for data science that comes with base and high-performance R and Python libraries. When data is stored in SQL Server and Apache Spark (on HDFS), those R and Python libraries can compute and analyze data locally, returning just the results, without the need to pull data across the network. R / Python models are published as web services.
Cloud-based option #1: Azure Machine Learning Studio
Microsoft Azure Machine Learning Studio is a collaborative, drag-and-drop tool to build, test, and deploy predictive analytics solutions. Azure Machine Learning Studio publishes models as web services consumed by custom apps, BI tools and even Excel.
Cloud-based option #2: Azure Machine Learning Service
Azure Machine Learning Service is a fully managed cloud service used to train, manage and deploy Python models as web services.
Alternatives:
Of course there is an option of developing the models in R / Python and deploying them in Apache Spark, H2O, SAS, SPSS.
Why use DataFactory Enterprise?
Traditional approach
Because the DataStore
model / schema is open-source and
3NF normalized,
it renders itself easily to extension almost infinitely over other
data sources that your company may deem necessary in DS/ML projects.
New data is added simply as new tables in the database.
This DIY in-house approach might be the right choice if your team of SQL Server developers and DBAs is ready to upgrade and customize our standard delivery to your company’s specific business requirements.
Easy native data pipelines
Native SSIS and optional cloud ADF data pipelines eliminate the need to send data across various external applications —> hence more efficient and less costly data engineering.
A wealth of extra features
The following components are available in SQL Server for creating additional functionality: Analysis Services; Reporting Services; Data Quality Services; Master Data Services.
SQL Server carries a built-in graph database that handles complex entity relationships such as social media data with a mix of dynamic many-to-many relationships.
Columnar indexes allow for storing and querying large data warehousing tables at up to 10x data compression rates.
SQL Server is big data ready
The SQL Server 2019 edition comes with built-in big data capabilities of Apache Spark and Hadoop Distributed File System. See links below for more info.
SQL Server is ubiquitous
Should we say SQL Server is known for its ease of use, vast worldwide user community, excellent documentation by Microsoft, and no shortage of developers and DBAs?
How we deliver the DataFactory Enterprise
The DataFactory Enterprise
is installed on-premises on your servers as
DataStore
+ documentation + data pipelines + sample trained ML models +
SQL code for feature engineering.