OEM Material Insights: ESG Analytics Platform Built on Microsoft Fabric
Business Context
Modern OEM manufacturers in automotive, aerospace, and electronics face mounting pressure to make procurement decisions that balance cost, quality, and sustainability. EU regulations (CSRD, Supply Chain Due Diligence) demand transparency on supplier environmental and governance performance, while investors and customers increasingly scrutinize supply chain responsibility. Traditional ERP systems track cost and quality but lack integration with sustainability metrics—creating blind spots where purchasing decisions may inadvertently concentrate risk in countries with weak environmental regulations or unstable governance.
This project addresses that gap by building an end-to-end ESG analytics platform on Microsoft Fabric that integrates procurement data with public sustainability datasets. The platform enables manufacturers to assess supplier sustainability (Yale Environmental Performance Index across 180+ countries), evaluate governance risks (World Bank indicators covering 200+ jurisdictions), analyze supply concentration for 80+ critical materials, and track procurement patterns correlated with ESG performance.
Target Use Cases: Battery material sourcing decisions for EV manufacturers (lithium, cobalt, nickel), critical material supply security for aerospace/defense (rare earth elements, titanium), multi-tier supplier ESG monitoring for electronics, and regulatory compliance reporting for any materials-intensive industry navigating ESG requirements.
In Short: Built an ESG analytics platform that helps manufacturers comply with EU regulations and reduce supply chain risks by integrating procurement data with environmental and governance metrics. The system combines ERP data with public sustainability datasets to provide real-time insights for sourcing decisions.
Business Value
This platform enables OEM manufacturers to:
- Navigate ESG Compliance: Meet EU regulatory reporting obligations (CSRD, Supply Chain Due Diligence) with auditable sustainability metrics
- Build Supply Chain Resilience: Identify geographic concentration risks for critical materials and diversify supplier base
- Make Informed Sourcing Decisions: Quantify environmental and governance trade-offs across supplier options
- Manage Risk Proactively: Monitor governance indicators (political stability, corruption) to anticipate supply disruption risks
The integration of dimensional modeling with a comprehensive data quality framework enables scalable, maintainable analytics rather than fragile one-off reports—critical for production enterprise environments.
For technical readers: The following sections detail the data engineering architecture and implementation specifics.
Technical Approach
Data Integration & Architecture
Integrated four disparate data sources into Microsoft Fabric using a medallion architecture (bronze → silver → gold), ensuring data quality, auditability, and performance optimization:
- Azure SQL Database: ERP procurement transactions and supplier master data (Data Factory pipeline with incremental load parameters)
- Yale Environmental Performance Index: 180+ countries, 40+ environmental indicators across climate, biodiversity, and environmental health
- World Bank Governance Indicators: 200+ countries across six governance dimensions (accountability, stability, effectiveness, regulatory quality, rule of law, corruption)
- EU Critical Raw Materials Supply Data: 80+ materials with supply concentration by country (automated HTTP ingestion from GitHub)
Bronze Layer: Parallel ingestion of raw data into Delta tables, preserving an audit trail
Silver Layer: PySpark-based standardization (column naming, type casting, unpivoting wide-format ESG data)
Gold Layer: Star schema dimensional model optimized for the DirectLake semantic model
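The silver-layer unpivot of wide-format ESG data can be sketched with a pandas analogue (the production transform uses PySpark; the columns and values here are illustrative, not the actual EPI schema):

```python
import pandas as pd

# Hypothetical wide-format EPI extract: one column per indicator
wide = pd.DataFrame({
    "country": ["Sweden", "Chile"],
    "EPI_climate": [72.1, 55.3],
    "EPI_biodiversity": [68.4, 61.0],
})

# Unpivot to one row per (country, indicator, score) — the long shape
# that downstream fact tables expect
long = wide.melt(id_vars="country", var_name="indicator", value_name="score")
```

The same reshaping is done in PySpark with `stack`/`melt`-style expressions; the long format keeps one indicator per row so new EPI indicators never require schema changes.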
Dimensional Modeling
Designed star schema optimized for DirectLake semantic model:
3 Fact Tables:
- fact_procurement: Procurement transactions with spend metrics
- fact_supply_share: Material supply concentration by country/stage
- fact_epi_score: Environmental performance scores by country/indicator
5 Dimension Tables:
- gold_dim_country: 180-200+ countries with ISO codes and ESG metadata (role-playing dimension for HQ vs. production country)
- gold_dim_material: 80+ materials with commodity group hierarchy
- gold_dim_date: Calendar dimension with temporal attributes
- gold_dim_indicator: EPI and WGI indicator metadata
- gold_dim_stage: Extraction vs. Processing for supply chain analysis
A Microsoft Fabric Warehouse mirrors the dimensional model using Microsoft.Build.Sql, providing optimized storage for the DirectLake semantic model with near-real-time refresh and minimal latency.
Data Quality Framework
Implemented a comprehensive data quality framework addressing the universal challenge of inconsistent entity naming across disparate data sources:
Deterministic Surrogate Keys: Used the xxhash64 algorithm to generate stable, reproducible surrogate keys, ensuring dimensional integrity across pipeline reruns without an external key management system.
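A minimal plain-Python sketch of the deterministic-key idea — the platform uses Spark's built-in `xxhash64`, so `hashlib` stands in here purely to show the rerun-stable property:

```python
import hashlib

def surrogate_key(*parts: str) -> int:
    """Deterministic 64-bit key from natural-key parts.
    Stand-in for Spark's xxhash64: same idea (pure function of the
    inputs), different hash function."""
    digest = hashlib.sha256("||".join(parts).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)

# Same inputs always yield the same key — across reruns and machines —
# so fact and dimension tables rebuilt independently still join correctly
key_a = surrogate_key("SE", "lithium")
key_b = surrogate_key("SE", "lithium")
```

Because the key is a pure function of the natural key, no lookup table or sequence generator has to survive between pipeline runs.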
Alias Resolution with Confidence Scoring: Built mapping tables with 100+ country aliases and material name variations (addressing “USA” vs. “United States”, “Türkiye” vs. “Turkey”), each with a confidence score (0-1 scale): high confidence ≥0.90 for direct matches, medium ≥0.70 for alternate spellings, low ≥0.50 for fuzzy matches, and unmapped <0.50 flagged for review.
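The tiering rule can be sketched as a small lookup; the alias entries below are hypothetical examples for illustration, not rows from the actual mapping table:

```python
# Hypothetical alias entries: raw name -> (canonical name, confidence)
ALIASES = {
    "USA": ("United States", 0.95),
    "Turkiye": ("Türkiye", 0.80),
    "Korea Rep.": ("South Korea", 0.60),
}

def quality_tier(confidence: float) -> str:
    """Tier thresholds matching the rules described above."""
    if confidence >= 0.90:
        return "high"
    if confidence >= 0.70:
        return "medium"
    if confidence >= 0.50:
        return "low"
    return "unmapped"

def resolve(raw_name: str) -> tuple[str, str]:
    """Map a raw source name to its canonical form plus quality tier;
    unknown names pass through and are flagged for review."""
    canonical, conf = ALIASES.get(raw_name, (raw_name, 0.0))
    return canonical, quality_tier(conf)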
Audit Trail & Quality Categorization:
- Quality tier assignment based on mapping confidence
- Placeholder dimensions preserve unmapped values (no data loss)
- Audit tables (gold_unmapped_procurement_audit, gold_unmapped_supply_audit) track unresolved items
- Coverage matrix documents which entities appear in which data sources
This framework enables transparent data quality monitoring and ensures business users understand confidence levels when interpreting analytics—critical for production analytics systems.
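A coverage matrix of this kind can be derived with a pandas crosstab; the entity sightings below are invented for illustration:

```python
import pandas as pd

# Hypothetical record of which source files each entity appeared in
sightings = pd.DataFrame({
    "country": ["Sweden", "Sweden", "Chile", "Türkiye"],
    "source":  ["EPI", "WGI", "EPI", "ERP"],
})

# One row per entity, one boolean column per source: a quick view of
# which countries are missing from which datasets
coverage = pd.crosstab(sightings["country"], sightings["source"]).astype(bool)
```

Gaps in the matrix (a country present in procurement but absent from EPI, say) point analysts at scores that will be null rather than low.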
Pipeline Orchestration & Semantic Model
The Orchestration Pipeline coordinates the end-to-end data flow:
- Bronze ingestion (parallel loading of 4 data sources)
- Silver cleaning (PySpark transformations with error handling)
- Gold modeling (dimensional table generation with quality framework)
- Warehouse sync (Gold layer → SQL warehouse for DirectLake)
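The four stages above can be sketched in plain Python (the actual orchestration is a Fabric Data Factory pipeline; the stage callables here are placeholders):

```python
from concurrent.futures import ThreadPoolExecutor

def run_pipeline(sources, ingest, clean, model, sync):
    """Illustrative orchestration: parallel bronze ingestion, then
    sequential silver -> gold -> warehouse stages."""
    with ThreadPoolExecutor() as pool:
        bronze = list(pool.map(ingest, sources))  # parallel bronze loads
    silver = [clean(t) for t in bronze]           # per-source standardization
    gold = model(silver)                          # dimensional modeling
    return sync(gold)                             # warehouse sync for DirectLake
```

The key design point is that only the bronze stage fans out in parallel; silver through warehouse sync run sequentially because each consumes the previous layer's output.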
DirectLake Semantic Model: Star schema (5 dimensions, 3 facts, 8 active relationships) with role-playing dimension for gold_dim_country. DirectLake mode queries Delta tables directly without import, enabling near-real-time refresh with minimal latency.
Testing & Code Quality
Implemented a unit testing framework (uncommon in BI projects), demonstrating software engineering rigor:
- Test Suite: /tests/test_key_generation.py validates surrogate key consistency and uniqueness; /tests/test_data_quality.py performs schema validation, null checks, and duplicate detection
- Modular Code: Reusable transformation functions in /src/transformations/ (key_generation.py, data_quality.py) enabling testability and maintainability
- Configuration: pytest.ini with test markers (unit, integration, slow, smoke) and requirements-test.txt for dependencies
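Illustratively, key-generation tests take roughly this shape (the helper and test names below are assumptions, not the repository's actual code):

```python
# Sketch of the kind of checks in tests/test_key_generation.py
def make_key(country: str, material: str) -> int:
    # Placeholder for the real xxhash64-based key generator
    return hash((country, material))

def test_key_is_deterministic():
    # Rerunning the generator on the same natural key must not change the key
    assert make_key("SE", "lithium") == make_key("SE", "lithium")

def test_keys_are_unique_per_natural_key():
    # Distinct natural keys must map to distinct surrogate keys
    keys = {make_key(c, m) for c in ("SE", "CL") for m in ("lithium", "cobalt")}
    assert len(keys) == 4
```

Under pytest these run automatically by file and function naming convention; markers from pytest.ini let CI select subsets (e.g. `pytest -m unit`).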
Tech Stack: Python 3.12+, Java 11+, PySpark 3.4+, pytest
Implementation Status
Completed ✅
Core Data Platform:
- Medallion architecture (bronze → silver → gold) with Delta tables
- Data quality framework with confidence scoring and audit trails
- Dimensional model (3 facts, 5 dimensions) with DirectLake semantic model
- Unit testing infrastructure with pytest
Data Engineering:
- Orchestration pipeline with parallel ingestion (4 data sources)
- Deterministic surrogate key generation (xxhash64)
- Alias resolution with 100+ country/material mappings
- Role-playing dimension for supplier geography
In Development 🚧
Power BI Dashboard - Coming Soon
Interactive ESG-integrated procurement intelligence dashboard providing:
- Supplier Sustainability Assessment: Environmental performance evaluation using Yale EPI metrics across climate, biodiversity, and environmental health categories
- Governance Risk Analysis: Political and regulatory risk assessment using World Bank indicators (rule of law, corruption, political stability)
- Supply Concentration Risk: Geographic concentration analysis for 80+ critical materials identifying single-source dependencies
- Procurement Pattern Analysis: Spending correlation with ESG metrics revealing sustainable sourcing opportunities
- Data Quality Transparency: Mapping confidence scores and unmapped value visibility
Platform Enhancements:
- Incremental load implementation (parameters configured, logic in development)
- Row-level security for multi-tenant access control
- DAX measures documentation
- Automated external data refresh (EPI/WGI updates)
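Since the incremental load logic is still in development, one common shape it might take is a high-watermark filter — a generic sketch under that assumption, not the platform's implementation:

```python
from datetime import datetime

def incremental_filter(last_watermark: datetime, rows: list[dict]):
    """Generic high-watermark pattern: pull only rows modified after
    the previous run's watermark, and advance the watermark to the
    newest row seen (hypothetical 'modified_at' field)."""
    new_rows = [r for r in rows if r["modified_at"] > last_watermark]
    next_watermark = max(
        (r["modified_at"] for r in new_rows), default=last_watermark
    )
    return new_rows, next_watermark
```

In a Data Factory pipeline the watermark would live in a control table and be passed in as the pipeline parameter the source queries filter on.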
Skills Demonstrated
- ✅ Modern Data Platform Architecture: End-to-end Microsoft Fabric implementation (Lakehouse, Warehouse, Data Factory, PySpark, DirectLake) demonstrating cloud-native analytics expertise on an integrated platform adopted across the Fortune 500
- ✅ Production-Grade Data Engineering: Medallion architecture (bronze-silver-gold) with deterministic surrogate keys (xxhash64), a comprehensive data quality framework with confidence scoring, and reusable transformation functions; demonstrates maintainable, scalable systems beyond proof-of-concept prototypes
- ✅ Analytics Engineering: Star schema dimensional modeling with role-playing dimensions, DirectLake optimization for near-real-time analytics, and appropriate fact/dimension granularity for enterprise BI architecture
- ✅ Data Quality Governance: Alias resolution with confidence scoring and audit trail tables addressing the universal challenge of inconsistent entity naming across disparate data sources; directly transferable to any multi-source integration project
- ✅ Software Engineering Rigor: Unit testing with pytest (uncommon in BI projects), modular code in /src/transformations/, and professional development practices demonstrating commitment to code quality and maintainability
- ✅ Domain Expertise: ESG analytics integration (Yale Environmental Performance Index, World Bank Governance Indicators, EU supply chain data) translated into procurement decision support; bridges technical implementation with business context for cross-functional collaboration
Relevant For: Data Engineering, Analytics Engineering, and BI Platform roles at manufacturing companies (automotive/EV, aerospace, electronics) and ESG-focused organizations requiring supply chain due diligence capabilities. Directly applicable to organizations like Volvo, Northvolt, Saab, and SSAB navigating ESG compliance and supply chain resilience.
Project Resources
- GitHub Repository: EU-Critical-Raw-Materials-CRM-Reports-Datasets - Cleaned EU supply share data
- Yale Environmental Performance Index - Environmental performance data source
- World Bank Worldwide Governance Indicators - Governance data source
- Technologies: Microsoft Fabric, PySpark 3.4+, Power BI DirectLake, Azure SQL Database, pytest