Data Products
A data product is a logical data unit on the data platform that is actively managed by a team (owner).
A data product exposes its data through output ports, ensuring downstream users or systems can easily consume it. An output port is specified by a data contract, which defines the schema, semantics, and quality for using the data, along with the terms of use.
Data Product Types
Entropy Data supports the following data product types:
- Source-aligned Data Product: Data Products that are closely aligned to the entities or events generated in corresponding operational systems without significant transformations.
- Aggregated Data Product: Data Products that are built on top of one or more source-aligned data products and provide aggregated or transformed data for multiple use cases.
- Consumer-aligned Data Product: Data Products that are designed to meet the specific needs of a particular consumer or group of consumers, often involving significant transformations or aggregations.
In addition, you can add the following systems, which also provide or consume data but are usually not considered data products:
- Data Consumer: A system that consumes data from data products, but does not provide data itself. Examples: reports, dashboards, notebooks, BI tools
- Application: A system or software that implements business processes and generates or consumes data. Applications typically have an API. Examples: operational systems, microservices, databases, CRM, MDM, source systems, external data providers
With these, you can model the data landscape of your organization and understand how data flows between systems, in both the operational and the analytical realm.
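As a rough illustration of such a landscape model, the sketch below represents the five types as an enum and data flows as simple (source, target) pairs. The class names, the example systems (crm, customers, customer_360, churn_dashboard), and the downstream helper are all hypothetical and not part of Entropy Data.

```python
from dataclasses import dataclass
from enum import Enum


class NodeKind(Enum):
    # The three data product types
    SOURCE_ALIGNED = "source-aligned"
    AGGREGATED = "aggregated"
    CONSUMER_ALIGNED = "consumer-aligned"
    # Systems that are usually not considered data products
    DATA_CONSUMER = "data consumer"
    APPLICATION = "application"


DATA_PRODUCT_KINDS = {
    NodeKind.SOURCE_ALIGNED,
    NodeKind.AGGREGATED,
    NodeKind.CONSUMER_ALIGNED,
}


@dataclass(frozen=True)
class Node:
    name: str
    kind: NodeKind

    @property
    def is_data_product(self) -> bool:
        return self.kind in DATA_PRODUCT_KINDS


def downstream(flows, start):
    """Return the names of all nodes reachable from `start`, in discovery order."""
    seen = []
    stack = [start]
    while stack:
        current = stack.pop()
        for source, target in flows:
            if source == current and target not in seen:
                seen.append(target)
                stack.append(target)
    return seen


# Hypothetical example landscape
nodes = {
    "crm": Node("crm", NodeKind.APPLICATION),
    "customers": Node("customers", NodeKind.SOURCE_ALIGNED),
    "customer_360": Node("customer_360", NodeKind.AGGREGATED),
    "churn_dashboard": Node("churn_dashboard", NodeKind.DATA_CONSUMER),
}
flows = [
    ("crm", "customers"),
    ("customers", "customer_360"),
    ("customer_360", "churn_dashboard"),
]
```

Calling downstream(flows, "crm") walks the flows from the operational system to the dashboard, which is the kind of question such a model is meant to answer.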
Data Product Status
The status of a data product is informational only. It does not affect the UI or any workflows.
Semantically, the statuses are:
- proposed: This data product is planned and still under discussion
- in development: This data product is not yet available in production
- active: This data product is active and can be used
- deprecated: This data product is still active but should no longer be used. Existing consumers will need to migrate to another data product.
- retired: This data product is no longer active and should not have any consumers
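Because the platform does not enforce these semantics, a team that wants to act on them has to do so in its own tooling. The following is a minimal sketch of how that could look. The enum and both helper functions are hypothetical and reflect one possible reading of the list above, not behavior built into Entropy Data.

```python
from enum import Enum


class DataProductStatus(Enum):
    PROPOSED = "proposed"
    IN_DEVELOPMENT = "in development"
    ACTIVE = "active"
    DEPRECATED = "deprecated"
    RETIRED = "retired"


def is_available(status: DataProductStatus) -> bool:
    """Data can still be read: the product is active or deprecated."""
    return status in {DataProductStatus.ACTIVE, DataProductStatus.DEPRECATED}


def accepts_new_consumers(status: DataProductStatus) -> bool:
    """Only active products should be picked up by new consumers."""
    return status is DataProductStatus.ACTIVE
```

For example, a deprecated data product is still available to existing consumers, but a check such as accepts_new_consumers could flag new access requests for review.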
Note: We are currently discussing whether the status should control the visibility of data products in the Marketplace, or whether a dedicated flag should be used instead.
Output Ports
A data product can have zero, one, or multiple output ports. An output port is the technical endpoint to a specific dataset.
An output port is usually the combination of:
- data model (e.g., one or multiple tables, PII, non-PII)
- version (v1, v2)
- server technology (Databricks, Snowflake, S3, Kafka, etc.)
- environment (prod, dev, test)
The output port has a server, to which a data consumer can connect to access the data (e.g., the hostname, database, and schema name in Snowflake).
An output port can be specified by a data contract, which defines the schema, semantics, and quality for using the data, along with the terms of use.
Data consumers (users, teams, and other data products) request access to a specific output port.
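To make the "combination" idea concrete, here is a small sketch that models an output port as a record of the four dimensions plus its server details. The field names and the Snowflake values are illustrative assumptions and do not mirror the actual Entropy Data schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class OutputPort:
    data_model: str    # e.g., "orders" or "orders_pii"
    version: str       # e.g., "v1"
    server_type: str   # e.g., "snowflake", "databricks", "s3", "kafka"
    environment: str   # e.g., "prod", "dev", "test"
    # Connection details a consumer needs; keys depend on the server technology
    server: dict = field(default_factory=dict, compare=False)


# Hypothetical example: the same data model and version exposed in two environments
orders_prod = OutputPort(
    data_model="orders",
    version="v1",
    server_type="snowflake",
    environment="prod",
    server={"host": "example.snowflakecomputing.com", "database": "SALES", "schema": "ORDERS_V1"},
)
orders_dev = OutputPort(
    data_model="orders",
    version="v1",
    server_type="snowflake",
    environment="dev",
    server={"host": "example.snowflakecomputing.com", "database": "SALES_DEV", "schema": "ORDERS_V1"},
)
```

In this sketch, changing any one of the four dimensions yields a different output port, so orders_prod and orders_dev are distinct ports that consumers would request access to separately.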
Input Ports
Input ports represent upstream data products or applications that provide the source data for the data product.
To add an input port, add or request access for the consuming data product. The input port will be created automatically.
Assets (optional)
Internal components, such as data pipelines, raw and intermediary tables, ingestion methods, test code, and infrastructure details that are not relevant for data consumers are usually not part of the data product in Entropy Data. However, these assets can be assigned to data products or output ports for documentation, navigation, search, and lineage purposes.
Costs (optional)
Infrastructure costs and other expenses related to the data product can be assigned to the data product to track the costs for building and running the data product. The cost information can be used for data product controlling to evaluate the business value of the data product.
Data Product Specification and Open Data Product Standard
A data product can be edited in the YAML editor or through the API as JSON for automated provisioning. It can follow the Data Product Specification or the Open Data Product Standard.
Migrating from Data Product Specification to Open Data Product Standard
If you are using the Data Product Specification, you can easily migrate to the Open Data Product Standard by using the specificationType request parameter in your GET API calls. PUT will automatically detect the specificationType and migrate the data product accordingly.
Furthermore, you can use the YAML editor to migrate from the Data Product Specification to the Open Data Product Standard. Just choose "Migrate to ODPS" in the YAML editor of an existing data product.
For new Data Products, you can use the YAML editor to create a new data product with the Open Data Product Standard.
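For the API route, a GET call with the specificationType request parameter might be constructed as in the sketch below. Only the parameter name comes from this page. The host, the /api/dataproducts/{id} path, and the parameter value "odps" are placeholder assumptions, so check the API documentation for the actual endpoint and accepted values.

```python
from urllib.parse import quote, urlencode


def build_get_url(base_url: str, data_product_id: str, specification_type: str) -> str:
    """Build the URL for fetching a data product in a given specification type.

    The path below is an assumed placeholder, not a confirmed endpoint.
    """
    path = f"/api/dataproducts/{quote(data_product_id, safe='')}"
    query = urlencode({"specificationType": specification_type})
    return f"{base_url.rstrip('/')}{path}?{query}"


# Hypothetical host, ID, and parameter value
url = build_get_url("https://entropy-data.example.com", "orders", "odps")
```

The document returned by such a GET call could then be sent back with PUT, which, as described above, detects the specification type and migrates the data product.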
Configuring the Open Data Product Standard
If you are using the self-hosted version of Entropy Data, you can configure which specification type(s) you want to use. The environment variable is called APPLICATION_DATAPRODUCT_SPECIFICATIONTYPES. Find more information about the environment variables in Environment Variables.
You can find more details about mapping between the formats on the Open Data Product Standard Details page.