Export to Datahub or Amundsen

Overview

Metadata stored in the Tokern Catalog, especially PII and column-level lineage, can be exported to Datahub or Amundsen.

Datahub

dbcat provides a Source plugin. The source plugin has to be configured in an ingestion recipe.

CatalogSource accepts the following configuration:

path: Path to SQLite database
user: user name of role in Postgres Catalog
password: password of role in Postgres Catalog
host: host name of Postgres Catalog
db: database name in Postgres Catalog
port: Port number of Postgres Catalog
secret: Secret Key to encrypt passwords and tokens in the Catalog
source_names: List of sources to export
include_schema_regex: List of regular expressions that specify which schemata to include. Refer to include_exclude_lists
exclude_schema_regex: List of regular expressions that specify which schemata to exclude. Refer to include_exclude_lists
include_table_regex: List of regular expressions that specify which tables to include. Refer to include_exclude_lists
exclude_table_regex: List of regular expressions that specify which tables to exclude. Refer to include_exclude_lists
include_source_name: True/False. Specifies whether table names include the source, in the format source.schema.table. Useful when there are multiple databases.
env: Environment expected by Datahub. Default is PROD.
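
The include/exclude lists and include_source_name can be illustrated with a small standalone sketch. This is not dbcat's implementation: the helper names is_selected and qualified_name are hypothetical, and the matching rules (full-string match, an empty include list selects everything, exclusion wins over inclusion) are assumptions made for illustration.

```python
import re
from typing import List, Optional


def is_selected(
    name: str,
    include: Optional[List[str]] = None,
    exclude: Optional[List[str]] = None,
) -> bool:
    """Return True if name passes the include list and is not excluded."""
    # Assumption: a missing or empty include list means "include everything".
    if include and not any(re.fullmatch(p, name) for p in include):
        return False
    # Assumption: an exclude match overrides an include match.
    if exclude and any(re.fullmatch(p, name) for p in exclude):
        return False
    return True


def qualified_name(source: str, schema: str, table: str, include_source_name: bool) -> str:
    """Build source.schema.table, or schema.table when the source is omitted."""
    parts = [source, schema, table] if include_source_name else [schema, table]
    return ".".join(parts)


# Hypothetical catalog contents: (source, schema, table)
tables = [
    ("redshift_prod", "events", "page_views"),
    ("redshift_prod", "events", "tmp_load"),
    ("redshift_prod", "staging", "orders"),
]

selected = [
    qualified_name(src, schema, table, include_source_name=True)
    for src, schema, table in tables
    if is_selected(schema, include=["events"]) and is_selected(table, exclude=["tmp_.*"])
]
print(selected)
```

Here only tables in the events schema whose names do not start with tmp_ are kept, and each name carries its source prefix because include_source_name is True.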

Installation

# Install required libraries in a virtualenv
pip install "dbcat[datahub]"

# Create an ingestion recipe (see below)

# Run recipe
datahub ingest -c contrib/datahub/export.yml

Example Recipes

Basic Recipe

The following recipe sets up CatalogSource with its default configuration and writes to the console sink:

source:
  type: dbcat.datahub.CatalogSource
sink:
  type: "console"

Postgres Catalog, specific source and include schema

source:
  type: dbcat.datahub.CatalogSource
  config:
    user: tokern
    password: passw0rd
    host: postgres
    database: tdb
    secret: my_secret_password
    source_names:
       - redshift_prod
       - bq_analysis
    include_schema_regex:
       - events
sink:
  type: "console"

To configure sinks, refer to the Datahub metadata ingestion documentation.

Amundsen

dbcat provides a CatalogExtractor to extract metadata information. The Extractor can be used in an Amundsen metadata ingestion pipeline.

CatalogExtractor accepts the following configuration:

catalog_config: Dictionary with connection parameters, as described in catalog configuration
source_names: List of sources to export
include_schema_regex: List of regular expressions that specify which schemata to include. Refer to include_exclude_lists
exclude_schema_regex: List of regular expressions that specify which schemata to exclude. Refer to include_exclude_lists
include_table_regex: List of regular expressions that specify which tables to include. Refer to include_exclude_lists
exclude_table_regex: List of regular expressions that specify which tables to exclude. Refer to include_exclude_lists
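
As a rough sketch, the options above could be assembled as plain Python dictionaries before being handed to an Amundsen job. The key names inside catalog_config are an assumption: they follow the Postgres recipe shown for Datahub and may not match what CatalogExtractor expects. How the dictionary is wired into the job configuration is not shown here.

```python
from typing import Any, Dict

# Assumption: connection keys follow the Postgres recipe in the Datahub section.
catalog_config: Dict[str, Any] = {
    "user": "tokern",
    "password": "passw0rd",
    "host": "postgres",
    "port": 5432,
    "database": "tdb",
    "secret": "my_secret_password",
}

# Options listed for CatalogExtractor, gathered in one place.
extractor_options: Dict[str, Any] = {
    "catalog_config": catalog_config,
    "source_names": ["redshift_prod", "bq_analysis"],
    "include_schema_regex": ["events"],
}

print(sorted(extractor_options))
```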

Check out an example loader in the GitHub project.
