Ingest Assets

Build your own integration to sync assets from your data platform to Entropy Data when prebuilt connectors don't fit your needs.

Overview

Asset ingestion synchronizes metadata about your physical data sources (tables, views, schemas, topics, etc.) from your data platform to Entropy Data. This enables you to:

  • Automatically generate data contracts from existing table structures
  • Import assets as data products with minimal manual effort
  • Link technical resources to business concepts (data products and output ports)
  • Track data lineage across your data platform

You can implement asset ingestion using:

  • SDK: Java library for building integrations (recommended for long-running applications)
  • REST API: Direct API access for any programming language (see the sketch below)
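
As a rough sketch of the REST approach, you send each asset's JSON representation to the ingestion endpoint (the exact path and authentication are described in the API reference); the payload follows the asset data model described below:

{
  "id": "sales-customers",
  "info": {
    "source": "snowflake",
    "type": "snowflake_table",
    "name": "CUSTOMERS"
  }
}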

Understanding the Asset Data Model

An asset in Entropy Data consists of four main parts:

1. Info Object

The info object contains metadata about the asset (an example follows the list):

  • source: Identifies the data source (e.g., snowflake, unity, purview, postgres). This refers to the data source itself, not the data catalog: do not use values like openmetadata or collibra here, as those are catalogs, not data sources.
  • type: Describes the asset type, prefixed by the source (e.g., snowflake_table, unity_schema, kafka_topic)
  • name: Name of the asset. Use only the last segment, not the fully qualified name (e.g., for a table, CUSTOMERS rather than SALES_DB.PUBLIC.CUSTOMERS)
  • qualifiedName (deprecated): Unique identifier, often used to extract connection information. Deprecated in favor of storing identifiers in the custom map. Still supported for backwards compatibility.
  • metadataSource (optional): Identifies the metadata catalog that provided this asset (e.g., openmetadata, collibra, purview). Use this to distinguish between the data source and the catalog.
  • metadataSourceId (optional): The unique identifier of the asset in the metadata source.
  • metadataSourceUrl (optional): A URL pointing to the asset in the metadata source.
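
For example, the info object of a Snowflake table discovered through OpenMetadata might look like this (the identifier and URL values are illustrative):

{
  "source": "snowflake",
  "type": "snowflake_table",
  "name": "CUSTOMERS",
  "metadataSource": "openmetadata",
  "metadataSourceId": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
  "metadataSourceUrl": "https://openmetadata.example.com/table/sales_db.public.customers"
}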

2. Custom Properties

The custom map stores additional metadata and connection details as key-value pairs. Provide server details here so they can be picked up when an asset is converted to a data product. For Snowflake, for example, include account, database, and schema. Use the environment property to indicate which environment the asset belongs to (e.g., dev, staging, prod). The mapping table below describes the fields each platform typically needs.
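
For a Snowflake table in production, the custom map could look like this:

{
  "custom": {
    "account": "mycompany",
    "database": "SALES_DB",
    "schema": "PUBLIC",
    "environment": "prod"
  }
}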

3. Columns

For table-like assets, the columns array defines the schema structure. Use the native types of the data source's type system (e.g., for Snowflake, the SQL types Snowflake supports). This enables automatic generation of data contracts: each type becomes the physicalType in the data contract, and the logicalType is derived automatically.
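
For illustration, a column ingested with type VARCHAR(255) would surface in a generated contract property roughly like this (assuming an ODCS-style contract; the exact layout depends on your contract format):

{
  "name": "EMAIL",
  "physicalType": "VARCHAR(255)",
  "logicalType": "string"
}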

Each column supports the following fields, combined in the example after the list:

  • name (required): The column name
  • type (required): The physical data type from the data source (e.g., VARCHAR(255), BIGINT, STRING)
  • description: Column description
  • required: Whether the column is required
  • unique: Whether values must be unique
  • primaryKey: Whether this is a primary key column
  • tags: Array of tags associated with the column
  • links: Links associated with the column
  • custom: Additional custom properties as key-value map
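
A column that uses several of the optional fields (the tag and custom property values are illustrative):

{
  "name": "EMAIL",
  "type": "VARCHAR(255)",
  "description": "Customer email address",
  "required": true,
  "unique": true,
  "tags": ["pii"],
  "custom": {
    "maskingPolicy": "email_mask"
  }
}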

Nested Structures

Columns support nested structures for complex data types:

  • properties: For record/object types, an array of child columns defining the fields of the record
  • items: For array types, a single column definition describing the schema of the array elements
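
The following example combines a scalar primary key, a nested record, and an array of records:
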
"columns": [
  {
    "name": "orderId",
    "type": "string",
    "required": true,
    "primaryKey": true
  },
  {
    "name": "shippingAddress",
    "type": "record",
    "description": "The shipping address",
    "properties": [
      { "name": "street", "type": "string" },
      { "name": "city", "type": "string" },
      { "name": "zip", "type": "string" }
    ]
  },
  {
    "name": "items",
    "type": "array",
    "description": "The order line items",
    "items": {
      "name": "items",
      "type": "record",
      "properties": [
        { "name": "productId", "type": "string" },
        { "name": "quantity", "type": "int" }
      ]
    }
  }
]

4. Relationships

Assets can have hierarchical relationships using the relationships array. Common patterns include:

  • Two-tier: Schema → Table
  • Three-tier: Database → Schema → Table

Declare the relationship on the child asset by pointing to its parent with relationshipType: "parent", as the following examples show.

Example: Databricks Catalog (Top-Level Parent)

{
  "id": "prod-catalog",
  "info": {
    "source": "databricks",
    "type": "databricks_catalog",
    "name": "production",
    "qualifiedName": "production"
  },
  "custom": {
    "catalog": "production",
    "environment": "prod"
  }
}

Example: Databricks Schema (Child of Catalog)

{
  "id": "prod-sales-schema",
  "info": {
    "source": "databricks",
    "type": "databricks_schema",
    "name": "sales",
    "qualifiedName": "production.sales"
  },
  "custom": {
    "catalog": "production",
    "schema": "sales",
    "environment": "prod"
  },
  "relationships": [
    {
      "assetId": "prod-catalog",
      "relationshipType": "parent"
    }
  ]
}

Example: Databricks Table (Child of Schema)

{
  "id": "prod-sales-customers",
  "info": {
    "source": "databricks",
    "type": "databricks_table",
    "name": "customers",
    "qualifiedName": "production.sales.customers"
  },
  "custom": {
    "catalog": "production",
    "schema": "sales",
    "environment": "prod"
  },
  "relationships": [
    {
      "assetId": "prod-sales-schema",
      "relationshipType": "parent"
    }
  ]
}

Mapping Your Data Source to Assets

Choose source and type values that represent your data platform. Use server types from the Open Data Contract Standard (ODCS) when available.

| Platform | source | type examples | Common custom fields | Notes |
| --- | --- | --- | --- | --- |
| API | api | api_endpoint | location | |
| AWS Athena | athena | athena_database, athena_table | schema, catalog, stagingDir, regionName | |
| AWS Glue | glue | glue_database, glue_table | account, database, location, format | Also used for Glue Catalog |
| AWS Kinesis | kinesis | kinesis_stream | stream, region, format | |
| AWS Redshift | redshift | redshift_database, redshift_schema, redshift_table | host, database, schema, region, account | |
| AWS S3 | s3 | s3_bucket, s3_folder, s3_object | location, format, delimiter, endpointUrl | |
| Azure Storage | azure | azure_container, azure_blob | location, format, delimiter | |
| Azure Synapse | synapse | synapse_database, synapse_schema, synapse_table | host, port, database | |
| ClickHouse | clickhouse | clickhouse_database, clickhouse_table | host, port, database | |
| Custom Platform | custom | custom_<type> | Define your own properties | |
| Databricks | databricks | databricks_catalog, databricks_schema, databricks_table, databricks_view | host, catalog, schema | Modern Databricks with Unity Catalog |
| Dremio | dremio | dremio_source, dremio_schema, dremio_table | host, port, schema | |
| DuckDB | duckdb | duckdb_schema, duckdb_table | database, schema | |
| Google BigQuery | bigquery | bigquery_project, bigquery_dataset, bigquery_table, bigquery_view | project, dataset | |
| Google Cloud SQL | cloudsql | cloudsql_database, cloudsql_table | host, port, database, schema | |
| Google Pub/Sub | pubsub | pubsub_topic, pubsub_subscription | project | |
| IBM DB2 | db2 | db2_database, db2_schema, db2_table | host, port, database, schema | |
| Kafka | kafka | kafka_cluster, kafka_topic | host, format | |
| Microsoft Purview | purview | purview_database, purview_schema, purview_table | Varies by source | Prefer underlying data source type (e.g., sqlserver) when possible |
| MySQL | mysql | mysql_database, mysql_schema, mysql_table | host, port, database | |
| Oracle | oracle | oracle_database, oracle_schema, oracle_table | host, port, serviceName | |
| PostgreSQL | postgresql | postgresql_database, postgresql_schema, postgresql_table, postgresql_view | host, port, database, schema | |
| Presto | presto | presto_catalog, presto_schema, presto_table | host, catalog, schema | |
| SFTP | sftp | sftp_folder, sftp_file | location, format, delimiter | |
| Snowflake | snowflake | snowflake_database, snowflake_schema, snowflake_table, snowflake_view | account, database, schema, warehouse | |
| SQL Server | sqlserver | sqlserver_database, sqlserver_schema, sqlserver_table | host, port, database, schema | |
| Trino | trino | trino_catalog, trino_schema, trino_table | host, port, catalog, schema | |
| Vertica | vertica | vertica_database, vertica_schema, vertica_table | host, port, database, schema | |

Naming Conventions

For source:

  • Use the server type from ODCS when available
  • Use lowercase, no spaces or special characters
  • Prefer data source name over catalog name (e.g., sqlserver not purview; illustrated below)

For type:

  • Follow the pattern: <source>_<asset_type>
  • Common asset types: database, catalog, schema, table, view, topic, bucket, folder, file
  • Examples: snowflake_table, kafka_topic, s3_bucket
  • Use the same terminology as your data platform (e.g., Databricks uses catalog, Kafka uses topic)
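
For instance, a SQL Server table catalogued in Microsoft Purview should still be identified by its underlying data source, with the catalog recorded in metadataSource:

{
  "info": {
    "source": "sqlserver",
    "type": "sqlserver_table",
    "name": "customers",
    "metadataSource": "purview"
  }
}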

Complete Examples

Example 1: Snowflake Table Asset

{
  "id": "sales-customers",
  "info": {
    "source": "snowflake",
    "type": "snowflake_table",
    "name": "CUSTOMERS",
    "qualifiedName": "SALES_DB.PUBLIC.CUSTOMERS",
    "description": "Customer master data"
  },
  "custom": {
    "account": "mycompany",
    "database": "SALES_DB",
    "schema": "PUBLIC",
    "environment": "prod"
  },
  "columns": [
    {
      "name": "CUSTOMER_ID",
      "type": "NUMBER(38,0)",
      "description": "Unique customer identifier",
      "required": true
    },
    {
      "name": "EMAIL",
      "type": "VARCHAR(255)",
      "description": "Customer email address",
      "required": true
    },
    {
      "name": "FIRST_NAME",
      "type": "VARCHAR(100)",
      "description": "Customer first name",
      "required": false
    },
    {
      "name": "LAST_NAME",
      "type": "VARCHAR(100)",
      "description": "Customer last name",
      "required": false
    },
    {
      "name": "CREATED_AT",
      "type": "TIMESTAMP_NTZ(9)",
      "description": "Account creation timestamp",
      "required": true
    }
  ]
}

Example 2: Databricks Unity Table Asset

{
  "id": "prod-sales-customers",
  "info": {
    "source": "unity",
    "type": "unity_table",
    "name": "customers",
    "qualifiedName": "production.sales.customers",
    "description": "Customer dimension table"
  },
  "custom": {
    "host": "adb-1234567890.5.azuredatabricks.net",
    "path": "/mnt/production/sales/customers",
    "format": "delta",
    "environment": "prod"
  },
  "columns": [
    {
      "name": "customer_id",
      "type": "BIGINT",
      "description": "Unique customer identifier",
      "required": true
    },
    {
      "name": "email",
      "type": "STRING",
      "description": "Customer email address",
      "required": true
    },
    {
      "name": "signup_date",
      "type": "DATE",
      "description": "Date customer signed up",
      "required": true
    }
  ],
  "relationships": [
    {
      "assetId": "prod-sales-schema",
      "relationshipType": "parent"
    }
  ]
}

Example 3: Kafka Topic Asset

{
  "id": "orders-topic",
  "info": {
    "source": "kafka",
    "type": "kafka_topic",
    "name": "orders",
    "qualifiedName": "prod-cluster.orders",
    "description": "Order events stream"
  },
  "custom": {
    "bootstrap_servers": "kafka-1.example.com:9092,kafka-2.example.com:9092",
    "cluster": "prod-cluster",
    "partitions": "12",
    "replication_factor": "3",
    "environment": "prod"
  },
  "relationships": [
    {
      "assetId": "prod-kafka-cluster",
      "relationshipType": "parent"
    }
  ]
}

Example 4: Custom Data Platform

{
  "id": "customer-dataset",
  "info": {
    "source": "custom_platform",
    "type": "custom_dataset",
    "name": "customers",
    "qualifiedName": "custom://prod/sales/customers",
    "description": "Customer dataset"
  },
  "custom": {
    "location": "s3://my-bucket/datasets/customers",
    "format": "parquet",
    "environment": "prod"
  },
  "columns": [
    {
      "name": "id",
      "type": "STRING",
      "description": "Customer ID",
      "required": true
    },
    {
      "name": "name",
      "type": "STRING",
      "description": "Customer name",
      "required": true
    }
  ]
}

Next Steps

Once you've ingested your assets, you can:

  • Generate data contracts from the imported table structures
  • Import assets as data products with minimal manual effort
  • Link assets to business concepts (data products and output ports)

For questions or support, refer to the SDK documentation or API reference.