Export to Datahub or Amundsen
Overview
Metadata stored in Tokern Catalog, especially PII and column-level lineage, can be exported to Datahub or Amundsen.
Datahub
dbcat provides a Source plugin. The source plugin has to be configured in an ingestion recipe.
CatalogSource accepts the following configuration:
path: Path to SQLite database
user: User name of role in Postgres Catalog
password: Password of role in Postgres Catalog
host: Host name of Postgres Catalog
db: Database name in Postgres Catalog
port: Port number of Postgres Catalog
secret: Secret key to encrypt passwords and tokens in the Catalog
source_names: List of sources to export
include_schema_regex: List of regular expressions that specify which schemata to include. Refer to include_exclude_lists
exclude_schema_regex: List of regular expressions that specify which schemata to exclude. Refer to include_exclude_lists
include_table_regex: List of regular expressions that specify which tables to include. Refer to include_exclude_lists
exclude_table_regex: List of regular expressions that specify which tables to exclude. Refer to include_exclude_lists
include_source_name: True/False. Specifies whether table names include the source, in the format source.schema.table. Useful when there are multiple databases
env: Environment name expected by Datahub. Default is PROD
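The include and exclude options listed above can be pictured with a small filter. This is only a sketch of one plausible reading, not dbcat's actual implementation: it assumes a name is kept when it matches at least one include pattern (or no include patterns are given) and matches no exclude pattern, and it assumes patterns are applied with `re.search`. The helper names `is_selected` and `qualified_name` are hypothetical, as is the `schema.table` form used when `include_source_name` is false.

```python
import re
from typing import Optional, Sequence


def is_selected(
    name: str,
    include: Optional[Sequence[str]] = None,
    exclude: Optional[Sequence[str]] = None,
) -> bool:
    """Return True if a schema or table name passes the filters.

    Assumed semantics (not taken from dbcat): include patterns are
    checked first, then exclude patterns remove matches.
    """
    if include and not any(re.search(pattern, name) for pattern in include):
        return False
    if exclude and any(re.search(pattern, name) for pattern in exclude):
        return False
    return True


def qualified_name(
    source: str, schema: str, table: str, include_source_name: bool
) -> str:
    """Build source.schema.table, or schema.table when the source is omitted."""
    parts = [source, schema, table] if include_source_name else [schema, table]
    return ".".join(parts)
```

Under these assumptions, `is_selected("events", include=["^events$"])` keeps the schema, while `is_selected("events_tmp", exclude=["_tmp$"])` drops it.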
Installation
# Install required libraries in a virtualenv
pip install dbcat[datahub]
# Create an ingestion recipe (see below)
# Run recipe
datahub ingest -c contrib/datahub/export.yml
Example Recipes
Basic Recipe
The following recipe sets up CatalogSource with its default configuration and sends the output to the console sink:
source:
  type: dbcat.datahub.CatalogSource
sink:
  type: "console"
Postgres Catalog, specific source and include schema
source:
  type: dbcat.datahub.CatalogSource
  config:
    user: tokern
    password: passw0rd
    host: postgres
    database: tdb
    secret: my_secret_password
    source_names:
      - redshift_prod
      - bq_analysis
    include_schema:
      - events
sink:
  type: "console"
To configure sinks, refer to the Datahub metadata ingestion documentation.
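The Postgres recipe above keeps the password and secret in the file itself. One way to avoid that is to generate the recipe from environment variables, as in the rough sketch below. The variable names (`CATALOG_USER`, `CATALOG_PASSWORD`, `CATALOG_HOST`, `CATALOG_DB`, `CATALOG_SECRET`) and the helpers `build_recipe` and `write_recipe` are hypothetical. The configuration keys are copied from the example recipe, not checked against dbcat. The file is written as JSON on the assumption that the YAML loader reading the recipe accepts it, which YAML parsers generally do.

```python
import json
import os
from typing import Any, Dict, List


def build_recipe(source_names: List[str], include_schema: List[str]) -> Dict[str, Any]:
    """Build a recipe dictionary shaped like the Postgres example above."""
    config = {
        # Hypothetical environment variable names; choose your own.
        "user": os.environ["CATALOG_USER"],
        "password": os.environ["CATALOG_PASSWORD"],
        "host": os.environ.get("CATALOG_HOST", "postgres"),
        "database": os.environ.get("CATALOG_DB", "tdb"),
        "secret": os.environ["CATALOG_SECRET"],
        "source_names": list(source_names),
        "include_schema": list(include_schema),
    }
    return {
        "source": {"type": "dbcat.datahub.CatalogSource", "config": config},
        "sink": {"type": "console"},
    }


def write_recipe(path: str, recipe: Dict[str, Any]) -> None:
    """Write the recipe as JSON text (assumed to be readable as YAML)."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(recipe, handle, indent=2)
```

For example, `write_recipe("export.yml", build_recipe(["redshift_prod"], ["events"]))` would produce a file that can then be passed to `datahub ingest -c export.yml`.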
Amundsen
dbcat provides a CatalogExtractor to extract metadata information. The Extractor can be used in an Amundsen
metadata ingestion pipeline.
CatalogExtractor accepts the following configuration:
catalog_config: Dictionary with connection parameters, as described in catalog configuration
source_names: List of sources to export
include_schema_regex: List of regular expressions that specify which schemata to include. Refer to include_exclude_lists
exclude_schema_regex: List of regular expressions that specify which schemata to exclude. Refer to include_exclude_lists
include_table_regex: List of regular expressions that specify which tables to include. Refer to include_exclude_lists
exclude_table_regex: List of regular expressions that specify which tables to exclude. Refer to include_exclude_lists
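To make the shape of these options concrete, the sketch below assembles them into a plain dictionary and rejects misspelled option names. It does not call dbcat or Amundsen: the helper `extractor_config` is hypothetical, and how the resulting dictionary is handed to CatalogExtractor inside an Amundsen job (for example, under which configuration scope) is not covered here.

```python
from typing import Any, Dict, List

# Option names copied from the list above; everything else is illustrative.
LIST_OPTIONS = (
    "source_names",
    "include_schema_regex",
    "exclude_schema_regex",
    "include_table_regex",
    "exclude_table_regex",
)


def extractor_config(
    catalog_config: Dict[str, Any], **options: List[str]
) -> Dict[str, Any]:
    """Combine connection parameters and list options into one dictionary."""
    unknown = set(options) - set(LIST_OPTIONS)
    if unknown:
        raise ValueError(f"unknown options: {sorted(unknown)}")
    config: Dict[str, Any] = {"catalog_config": dict(catalog_config)}
    config.update({name: list(values) for name, values in options.items()})
    return config
```

A call such as `extractor_config({"path": "/tmp/catalog.db"}, source_names=["bq_analysis"])` returns a dictionary with `catalog_config` and `source_names` keys, while a typo like `source_name=[...]` raises `ValueError`.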
Check out an example loader in the GitHub project.