Export to Datahub or Amundsen
Overview
Metadata stored in Tokern Catalog, especially PII and column-level lineage, can be exported to Datahub or Amundsen.
Datahub
dbcat provides a Source plugin for Datahub, CatalogSource. The plugin has to be configured in an ingestion recipe.
CatalogSource accepts the following configuration:
path
: Path to SQLite database
user
: User name of role in Postgres Catalog
password
: Password of role in Postgres Catalog
host
: Host name of Postgres Catalog
db
: Database name of Postgres Catalog
port
: Port number of Postgres Catalog
secret
: Secret key to encrypt passwords and tokens in the Catalog
source_names
: List of sources to export
include_schema_regex
: List of regular expressions that specify which schemata to include. Refer to include_exclude_lists
exclude_schema_regex
: List of regular expressions that specify which schemata to exclude. Refer to include_exclude_lists
include_table_regex
: List of regular expressions that specify which tables to include. Refer to include_exclude_lists
exclude_table_regex
: List of regular expressions that specify which tables to exclude. Refer to include_exclude_lists
include_source_name
: True/False. Specifies whether table names should include the source name, in the format source.schema.table. Useful when there are multiple databases
env
: Environment variable expected by Datahub. Default is PROD
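The include and exclude regex options can be pictured with a small filter. This is a minimal sketch, not dbcat's implementation: the helper name `is_included`, the use of full-string matching, and the rule that exclusion wins over inclusion are all assumptions made for illustration.

```python
import re
from typing import Optional, Sequence


def is_included(name: str,
                include: Optional[Sequence[str]] = None,
                exclude: Optional[Sequence[str]] = None) -> bool:
    """Return True if `name` passes the include list and misses the exclude list."""
    # Assumption: an empty or missing include list admits every name.
    if include and not any(re.fullmatch(p, name) for p in include):
        return False
    # Assumption: exclusion wins when a name matches both lists.
    if exclude and any(re.fullmatch(p, name) for p in exclude):
        return False
    return True


# Hypothetical schema names, filtered the way include_schema_regex and
# exclude_schema_regex might be combined.
schemata = ["events", "events_staging", "public", "tmp_scratch"]
kept = [s for s in schemata
        if is_included(s, include=[r"events.*", r"public"], exclude=[r".*_staging"])]
print(kept)
```

The same helper applies to table names with include_table_regex and exclude_table_regex. Check the include_exclude_lists reference for the matching rules dbcat actually uses.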
Installation
# Install required libraries in a virtualenv
pip install dbcat[datahub]
# Create an ingestion recipe (see below)
# Run recipe
datahub ingest -c contrib/datahub/export.yml
Example Recipes
Basic Recipe
The following recipe sets up CatalogSource with its default configuration and sends the output to the console sink:
source:
  type: dbcat.datahub.CatalogSource
sink:
  type: "console"
Postgres Catalog, specific source and include schema
source:
  type: dbcat.datahub.CatalogSource
  config:
    user: tokern
    password: passw0rd
    host: postgres
    database: tdb
    secret: my_secret_password
    source_names:
      - redshift_prod
      - bq_analysis
    include_schema:
      - events
sink:
  type: "console"
To configure sinks, refer to the Datahub metadata ingestion documentation.
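A recipe can also be assembled and run from Python, which keeps the password and secret out of the recipe file. The sketch below mirrors the Postgres recipe above key for key. The environment variable names `TOKERN_CATALOG_PASSWORD` and `TOKERN_CATALOG_SECRET` are hypothetical, and `run_recipe` assumes Datahub's `Pipeline` class from `datahub.ingestion.run.pipeline` is installed via `dbcat[datahub]`.

```python
import os
from typing import Any, Dict, Mapping


def build_recipe(environ: Mapping[str, str] = os.environ) -> Dict[str, Any]:
    """Build the Postgres recipe as a dict, reading secrets from the environment."""
    return {
        "source": {
            "type": "dbcat.datahub.CatalogSource",
            "config": {
                "user": "tokern",
                # Hypothetical variable names; pick whatever suits your setup.
                "password": environ["TOKERN_CATALOG_PASSWORD"],
                "host": "postgres",
                "database": "tdb",
                "secret": environ["TOKERN_CATALOG_SECRET"],
                "source_names": ["redshift_prod", "bq_analysis"],
                # Key copied as-is from the example recipe.
                "include_schema": ["events"],
            },
        },
        "sink": {"type": "console"},
    }


def run_recipe(recipe: Dict[str, Any]) -> None:
    """Run the recipe in-process instead of calling `datahub ingest -c`."""
    # Imported lazily so build_recipe() works without Datahub installed.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(recipe)
    pipeline.run()
    pipeline.raise_from_status()
```

With both variables exported, `run_recipe(build_recipe())` should behave like running the equivalent YAML recipe from the command line.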
Amundsen
dbcat provides a CatalogExtractor to extract metadata information. The Extractor can be used in an Amundsen metadata ingestion pipeline.
CatalogExtractor accepts the following configuration:
catalog_config
: Accepts a dictionary with connection parameters, as described in catalog configuration
source_names
: List of sources to export
include_schema_regex
: List of regular expressions that specify which schemata to include. Refer to include_exclude_lists
exclude_schema_regex
: List of regular expressions that specify which schemata to exclude. Refer to include_exclude_lists
include_table_regex
: List of regular expressions that specify which tables to include. Refer to include_exclude_lists
exclude_table_regex
: List of regular expressions that specify which tables to exclude. Refer to include_exclude_lists
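The options above can be collected into a single dictionary before they are handed to the extractor. This is only a sketch of that assembly step: the helper `extractor_options` and the SQLite path are hypothetical, and how the dictionary is passed to CatalogExtractor inside an Amundsen pipeline is not shown here.

```python
from typing import Any, Dict, List, Optional

# Option names copied from the list above.
REGEX_OPTIONS = (
    "include_schema_regex", "exclude_schema_regex",
    "include_table_regex", "exclude_table_regex",
)


def extractor_options(catalog_config: Dict[str, Any],
                      source_names: Optional[List[str]] = None,
                      **regex_lists: List[str]) -> Dict[str, Any]:
    """Collect CatalogExtractor options into one dict, rejecting unknown names."""
    unknown = set(regex_lists) - set(REGEX_OPTIONS)
    if unknown:
        raise ValueError(f"unknown options: {sorted(unknown)}")
    options: Dict[str, Any] = {"catalog_config": catalog_config}
    if source_names:
        options["source_names"] = source_names
    options.update(regex_lists)
    return options


options = extractor_options(
    {"path": "/tmp/catalog.db"},  # hypothetical SQLite catalog path
    source_names=["redshift_prod"],
    include_schema_regex=["events"],
)
print(options)
```

Rejecting unknown option names up front catches a misspelled key before the pipeline runs.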
Check out an example loader in the GitHub project.