Data Mesh - a paradigm shift | Are data warehouses / data lakes dying, or do they need a fresh perspective?
All visuals are created by the author (azazrasool).

In my quest to find (and define) the right data architecture for enterprises, I recently came across the concept of Data Mesh, introduced by Zhamak Dehghani. As she says (rightly?) in her article, Data Mesh is a paradigm shift: a revolutionary concept based on Domain-Driven Design and domain-oriented product thinking.

In this article, I am attempting to visualize how some use cases (which are currently implemented on data warehouse and data lake) would look like in a new world of datamesh. Open for your comments, suggestions and thoughts.

BTW, if you don't like reading, I have also embedded some visual stories in the article. Have fun!

Characteristics of a modern data architecture:

  • Customer-centric and domain-driven
  • Adaptable and scalable
  • Automated and plug-and-play
  • Requires minimal (or no) data movement
  • Enables faster time-to-market: use cases can be implemented with agility and don't take years to realize value
  • Elastic: able to adjust and provision resources on demand
  • Governed and secured: clear ownership, with restrictions applied per organization policies and domain regulations

In most enterprises, the data architecture looks like this: source systems generate transactional data and maintain it for operational purposes. This data is ingested (ETL) into data warehouses and (ELT) into data lakes. Data engineering and analytics teams aggregate the data in the data warehouse / data lake for reporting and dashboards. Raw data lands in the data lake, where data engineers prepare usable datasets for data science / ML models. To make all this work, data pipelines are built, maintained, and refreshed whenever a source system changes (for example, fields added or updated due to changing business requirements).


Now let's look at some challenges with this architecture:

  • The data engineering team lacks the domain context of the data. Some organizations compensate by embedding Business Analysts and Data Stewards in the team, but context gets lost in transition and translation, because it still lives in a tool, a document, or a person rather than with the data itself.
  • Data pipelines take time to build and are hard to maintain.
  • Lineage is often lost, or hard to track, inside data lakes: because of a lack of tools, the nature of the data (semi-structured or unstructured), or competing priorities.
  • Metadata dictionaries and business glossaries exist but are not well glued to the dataset at consumption time. They require frequent updates whenever data definitions and context change, and it takes development effort to propagate those changes from source systems to the data warehouse / data lake.
  • Data lineage is maintained only from the point data enters the data warehouse / data lake to the consumption layer, not back to the source.
  • Data governance processes are defined and enforced through tools like Informatica Axon or ASG. However, governance still sits apart from the data. Is that an issue?
  • Metadata dictionaries and business glossaries are created as part of a governance project, then live in documents or DG tools rather than alongside the data.
  • Time-to-market for delivering outcomes to consumers is very slow, ranging from months to quarters.

Let's try to visualize a scenario in the world of Data Mesh:

In the Data & Analytics team of a bank, data engineers prepare "Customer Journey" data for the data science team. Data scientists use this data to test hypotheses and build new machine learning models. In the current architecture, data pipelines run from source systems to the data lake to a sandbox environment to prepare the required datasets. The "Customer Journey" dataset comprises transactions, interactions, and behavioral attributes. Some examples of such attributes:

  • Current account transactions of all customers for the last 1 / 3 / 6 / 12 months
  • Credit card transactions of all customers for the last 1 / 3 / 6 / 12 months
  • Loan product transactions (auto, personal, home) of all customers for the last 1 / 3 / 6 / 12 months
  • Customer demographics (without PII)
  • Customer-care interaction attributes such as complaints and enquiries through email, chat, and phone; IVR data; etc.
  • Website and mobile logs to understand a customer's browsing patterns and interests (using clickstream analytics)

In a Data Mesh architecture, these datasets are not pre-processed and stored in a data warehouse / data lake. Instead, they are exposed as Data Products (APIs) from the source side. These datasets could even be broken down into smaller modules based on the domain that owns the source data, and those modular Data Products could then be combined to produce the wider datasets consumers need.


For example, we can have one Data Product (DP) per domain as follows:

  • DP1 - Current account transaction attributes of all customers for the last 1 / 3 / 6 / 12 months
  • DP2 - Credit card transaction attributes of all customers for the last 1 / 3 / 6 / 12 months
  • DP3 - Auto loan transaction attributes of all customers for the last 1 / 3 / 6 / 12 months
  • DP4 - Personal loan transaction attributes of all customers for the last 1 / 3 / 6 / 12 months
  • DP5 - Home loan transaction attributes of all customers for the last 1 / 3 / 6 / 12 months
  • DP6 - Customer demographics of all customers
  • DP7 - Customer-care interaction attributes such as complaints and enquiries through email, chat, and phone; IVR data; etc.
  • DP8 - Website and mobile logs capturing a customer's browsing patterns and interests (using clickstream analytics)
  • DP9 - Inferred insights based on the above datasets
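To make "exposed as APIs from the source side" concrete, here is a hypothetical sketch of how a consumer might address one of these products. The domain, product names, URL scheme, and `DataProductAddress` class are all illustrative assumptions, not a real mesh API:

```python
# Hypothetical sketch: each domain publishes its Data Product behind a
# stable, versioned, addressable endpoint. All names here are invented.
from dataclasses import dataclass


@dataclass(frozen=True)
class DataProductAddress:
    domain: str   # owning domain, e.g. "cards"
    product: str  # product name, e.g. "credit-card-transactions"
    version: str  # consumers pin a version to avoid silent breakage

    def url(self) -> str:
        # One possible addressing convention; a real mesh standardizes this.
        return f"https://data.bank.example/{self.domain}/{self.product}/{self.version}"


# DP2 from the list above, expressed as an address instead of a pipeline.
dp2 = DataProductAddress("cards", "credit-card-transactions", "v1")
print(dp2.url())  # https://data.bank.example/cards/credit-card-transactions/v1
```

The key design point is that the consumer depends on a stable address owned by the domain, not on a copy of the data moved through a central pipeline.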

A bit about the Data Product: it's an architecture quantum that encapsulates everything about the data, i.e., business context, metadata definitions, security, quality information, lineage, etc. The benefit of a Data Product over a plain dataset is that the consuming application gets everything it needs to work with the data, and that information is never stale: it is fresh every time the data is accessed.
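A minimal sketch of that "architecture quantum" idea, assuming a hypothetical `DataProduct` structure (the field names and checks are my own illustration, not a standard):

```python
# Hypothetical sketch: a Data Product carries its business context,
# schema, quality expectations, and lineage along with the data itself.
from dataclasses import dataclass, field


@dataclass
class ColumnSpec:
    name: str
    dtype: str
    description: str    # business context lives next to the data
    pii: bool = False


@dataclass
class DataProduct:
    name: str
    owner_domain: str
    columns: list
    quality_checks: dict = field(default_factory=dict)  # e.g. {"null_rate_max": 0.01}
    upstream: list = field(default_factory=list)        # lineage: source products

    def describe(self) -> dict:
        """Self-describing metadata, generated fresh on every access."""
        return {
            "name": self.name,
            "owner": self.owner_domain,
            "schema": {c.name: c.dtype for c in self.columns},
            "pii_columns": [c.name for c in self.columns if c.pii],
            "lineage": self.upstream,
        }


demographics_dp = DataProduct(
    name="customer-demographics",
    owner_domain="customer",
    columns=[
        ColumnSpec("customer_id", "string", "Stable customer key"),
        ColumnSpec("age_band", "string", "5-year age bucket"),
        ColumnSpec("national_id", "string", "Government ID", pii=True),
    ],
)
print(demographics_dp.describe()["pii_columns"])  # ['national_id']
```

Because `describe()` is computed from the product itself rather than from a separate glossary document, a consumer can never read metadata that has drifted from the data.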

Let's see how this architecture would help different teams:

Data science "Team-A", working on a propensity model for a cross-sell/up-sell use case, needs a 360-degree view of the customer during feature engineering. This team would consume data from all the Data Products, DP1 to DP9, with all the required context.
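A toy sketch of that feature-engineering step, using pandas as a stand-in for whatever query layer the mesh actually exposes. The tiny DataFrames below are invented sample data standing in for DP1 (current-account transactions) and DP6 (demographics):

```python
# Hypothetical sketch: Team-A builds a customer-level feature table by
# aggregating one transactional product and joining a demographics product.
import pandas as pd

# Invented stand-ins for DP1 and DP6.
dp1 = pd.DataFrame({"customer_id": [1, 2, 2],
                    "txn_amount": [120.0, 45.0, 80.0]})
dp6 = pd.DataFrame({"customer_id": [1, 2],
                    "age_band": ["25-29", "40-44"]})

# Aggregate the transactional product to one row per customer.
txn_features = (dp1.groupby("customer_id")["txn_amount"]
                   .agg(total="sum", txn_count="count")
                   .reset_index())

# Join on the shared customer key to get a 360-style feature table.
features = txn_features.merge(dp6, on="customer_id", how="left")
print(features)
```

The same pattern extends to DP2 through DP9: each product contributes a block of features keyed on `customer_id`, and the joins replace bespoke source-to-sandbox pipelines.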

Data science "Team-B", working on a real-time customer intelligence dashboard, provides real-time insights to the customer-care team based on a customer's demographics (DP6), interactions (DP7), browsing patterns (DP8), and inferred insights (DP9). This team would consume data only from those specific Data Products.

BI "Team-C", working on a CXO dashboard that needs specific information from any of the above Data Products whenever the dashboard is accessed, would get it directly from those specific Products.

Data lake "Team-D" can build a logical model (possibly without persisting data) over all these Data Products, and that logical model can itself be published as another Data Product, DP10, for other analytics use cases.
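One way to picture such a logical, non-persisting product is as a thin composition layer that resolves its upstream products at read time. The `VirtualDataProduct` class and the toy fetchers are my own illustration, not a prescribed implementation:

```python
# Hypothetical sketch: DP10 stores no data of its own, only references to
# upstream products, and fetches them lazily each time it is read, so
# results are always current.
class VirtualDataProduct:
    def __init__(self, name, sources):
        self.name = name
        self._sources = sources  # dict of name -> zero-arg fetch function

    def read(self):
        # Pull from the owning domains at read time; nothing is persisted here.
        return {name: fetch() for name, fetch in self._sources.items()}


# Toy fetchers standing in for the DP6 and DP7 endpoints.
dp10 = VirtualDataProduct("customer-360-logical", {
    "demographics": lambda: [{"customer_id": 1, "age_band": "25-29"}],
    "interactions": lambda: [{"customer_id": 1, "channel": "chat"}],
})

snapshot = dp10.read()
print(sorted(snapshot))  # ['demographics', 'interactions']
```

The trade-off is latency versus freshness: every read pays the cost of the upstream calls, but no copy of the data can go stale inside DP10.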

Data governance "Team-E" would not exist as a separate team at all; rather, roles like Data Stewards become part of the domain, and governance is done through code. See Data Governance as Code, and read more about DataGovOps.
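To illustrate "governance through code", here is a minimal sketch of a policy expressed as an executable check that a domain could run in its CI pipeline. The policy, field names, and function are hypothetical examples, not part of any DataGovOps tool:

```python
# Hypothetical sketch of governance-as-code: a policy is an executable
# check run automatically before a Data Product's schema is published,
# instead of a rule documented in a separate governance tool.
def check_no_pii_exposed(schema: dict, pii_fields: set) -> list:
    """Return a list of policy violations for a product's published schema."""
    return [f"PII column exposed: {col}"
            for col in schema if col in pii_fields]


# Invented organization-wide policy: these fields must never be published.
PII_FIELDS = {"national_id", "phone_number", "email"}

published = {"customer_id": "string", "age_band": "string", "email": "string"}
violations = check_no_pii_exposed(published, PII_FIELDS)
print(violations)  # ['PII column exposed: email']
```

In a CI setup, a non-empty violation list would fail the build, so a product that breaks policy simply cannot ship.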

To summarize, when we ensure that data follows the key DATSIS principles (Discoverable, Addressable, Trustworthy, Self-describing, Interoperable, and Secure), we gain the flexibility to move and use data without creating and maintaining pipelines into a centralized system; instead, data is referenced directly from the source through Data Products and Data Workflows.
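The "Discoverable" and "Addressable" properties can be sketched as a minimal catalog where domains register products and consumers look them up by name. The `MeshCatalog` class and the registered names are illustrative assumptions only:

```python
# Hypothetical sketch of discoverability and addressability: consumers
# resolve products by stable name instead of wiring point-to-point pipelines.
class MeshCatalog:
    def __init__(self):
        self._registry = {}

    def register(self, name: str, address: str) -> None:
        # Each domain publishes its product under a stable name.
        self._registry[name] = address

    def resolve(self, name: str) -> str:
        # Consumers discover a product and receive its current address.
        if name not in self._registry:
            raise KeyError(f"unknown data product: {name}")
        return self._registry[name]


catalog = MeshCatalog()
catalog.register("cards/credit-card-transactions",
                 "https://data.bank.example/cards/cc-txn/v1")
print(catalog.resolve("cards/credit-card-transactions"))
```

Because consumers depend only on the catalog name, a domain can move or re-version the underlying product without breaking anyone downstream.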

DISCLAIMER: The Data Mesh concept is still at a nascent stage; the community is growing, and few references are available as of now. It requires a cultural and mindset shift more than a technology shift. If an organization is still function-led rather than domain-led, a Data Mesh initiative might fail, not because the concept is wrong, but because the organization is not yet ready for adoption.

Azaz Rasool,

AI & Data Strategist,

Al-Rajhi Bank, Saudi Arabia

Additional references:

  • Data Mesh Principles and Logical Architecture, Zhamak Dehghani (martinfowler.com)
  • Data Mesh: Paradigm Shift in Data Platform Architecture (YouTube)
  • Data Mesh: Paradigm Shift in Data Platform Architecture, Arif Wider, DDD Europe 2020 (YouTube)
  • Data Governance as Code: How DataGovOps Balances Agility and…, DataKitchen (Medium)
  • Data Mesh Applied: Moving step-by-step from mono…, Sven Balnojan (Towards Data Science)
  • Data Mesh Pain Points: Why to think twice before implementing…, Andriy Zabavskyy (Towards Data Science)

Simon Katende

Global Markets Professional | Data & AI Specialist | Passed CFA Level 3 | Hybrid Athlete | Speaker

👏 Thank you for sharing this insightful article on Data Mesh and how it's challenging the traditional data warehousing and lake approaches. It's fascinating to see how this paradigm shift is bringing about a more decentralized and scalable way of managing data in organizations.

🚀 As someone who's been working with data for several years, I can definitely see the potential benefits of adopting a Data Mesh architecture. The emphasis on domain-driven design and cross-functional teams resonates with me, as I've seen firsthand the challenges that arise when data is siloed or managed by a centralized team.

💻 Overall, a great read that highlights the importance of staying ahead of the curve when it comes to data management. I look forward to seeing how organizations continue to evolve their approach to data architecture in the coming years. 🤓 #DataMesh #DataArchitecture #DataManagement #DevOps #Agile #DigitalTransformation

Saurabh Banerjee

Sr. Director - Data Engineering at Publicis.Sapient

Hi Azaz, Very well articulated! How would this impact enterprise data warehouses? Who would own and maintain Facts and dimension tables which aggregate data from multiple domains?

Markus Wissing

Practice Leader Enterprise Architecture at TecAlliance

Hi Azaz, I like the idea of making the business owner of transactional business applications responsible for "delivering" access to the corresponding data and forming a so-called "data product" (a term I borrowed from the Data Mesh community). To my understanding, this means at least two things for these business owners:

A) They need to invest in additional IT capabilities that formerly resided in the data lake/DWH team. As a business owner of a transactional business solution (e.g., an ERP-based solution), you cannot simply build this data product and the corresponding access layer directly on the IT stack of your transactional solution. You often need to hold the data on a secondary data layer decoupled from the transactional system. This requires additional funding for the team, with no direct business value to the transactional process. This leads me to the second aspect:

B) Incentivization and focus: The business owners of the transactional systems are fully focused on running the transactional process as smoothly as possible. That always has priority 1; anything else is priority 2 for them. They often have no direct incentive to deliver a data product with the right data quality to downstream teams, because for them the "data product" is just a by-product and not a first-class citizen. In some industries, legal regulations are forcing teams to rethink; in these cases, the incentive to manage data with the right quality right away in the transactional system and make it easily accessible is simply a risk-mitigation act.

I have hopes that the Data Mesh approach manages to prioritize these management aspects and drive a change in business operations. On the other hand, I fear that IT vendors in the data management sector will jump on the term Data Mesh to sell the next IT silver bullet, which puts the term at risk of being seen as just another IT framework. Best, Markus

Kausik Ghosh

Story Teller | Product Owner - Risk, Compliance & Financial Crime

Very nicely compiled 👍. Thanks, Azaz Rasool
