Unity Catalog Connector

This connector extracts technical metadata from Unity Catalog using the Unity Catalog API.

Setup

Create an access token in the Databricks workspace under User Settings > Developer > Access tokens.

To extract data lineage from Unity Catalog, you'll need to enable the system.access schema and grant the required permissions to the user. Please also read and understand the feature's limitations.

Make sure to grant the user the BROWSE (or SELECT) privilege on all tables in order to retrieve the complete lineage graph. See this section for more details.

Config File

Create a YAML config file based on the following template.

Required Configurations

hostname: <cluster_or_warehouse_hostname>
http_path: <http_path>
token: <access_token>

See this page for details on how to set the values for hostname and http_path.
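To keep the access token out of version control, the required settings can be rendered from environment variables before each run. The following is a minimal sketch, not part of metaphor-connectors: the function name `render_config` and the environment variable names `UC_HOSTNAME`, `UC_HTTP_PATH`, and `UC_TOKEN` are hypothetical, and only the three keys from the template above are emitted.

```python
import json

# Maps each required config key to a HYPOTHETICAL environment variable name.
# These names are illustrative; the connector itself does not read them.
REQUIRED = {
    "hostname": "UC_HOSTNAME",
    "http_path": "UC_HTTP_PATH",
    "token": "UC_TOKEN",
}


def render_config(env):
    """Return YAML text containing the three required settings.

    `env` is any mapping of variable names to values, e.g. os.environ.
    Raises ValueError if a required variable is missing or empty.
    """
    missing = [var for var in REQUIRED.values() if not env.get(var)]
    if missing:
        raise ValueError("missing environment variables: " + ", ".join(missing))
    # json.dumps yields a double-quoted string, which is also a valid YAML
    # scalar, so special characters in the values are quoted safely.
    return "".join(
        f"{key}: {json.dumps(env[var])}\n" for key, var in REQUIRED.items()
    )
```

Writing the result of `render_config(os.environ)` to a file gives a config that can be passed to the connector as described under Testing.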

Optional Configurations

Output Destination

See Output Config for more information.

Filtering

See Filter Configurations for more information on the optional filter config.

Source URL

By default, each table is associated with a Unity Catalog URL derived from the hostname config.

You can override this by specifying your own URL built from the catalog, schema, and table names:

source_url: https://example.com/view/{catalog}/{schema}/{table}
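To illustrate how the placeholders in such a template resolve, here is a small sketch that uses Python's `str.format`. The helper `build_source_url` is hypothetical, and the connector's own substitution logic may differ (for example, in whether names are URL-encoded).

```python
def build_source_url(template, catalog, schema, table):
    """Fill the {catalog}, {schema}, and {table} placeholders in a template.

    Illustration only: names are inserted verbatim, without URL-encoding.
    """
    return template.format(catalog=catalog, schema=schema, table=table)


url = build_source_url(
    "https://example.com/view/{catalog}/{schema}/{table}",
    catalog="main",
    schema="sales",
    table="orders",
)
print(url)  # https://example.com/view/main/sales/orders
```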

Query Logs

By default, the Unity Catalog connector fetches one full day of query logs (yesterday's) and analyzes them for additional metadata, such as dataset usage and lineage information. To backfill log data, set lookback_days to the desired number of days. To turn off query log fetching, set lookback_days to 0.

query_log:
  # (Optional) Number of days of query logs to fetch. Defaults to 1. If 0, no query logs will be fetched.
  lookback_days: <days>
    
  # (Optional) A list of users whose queries will be excluded from the log fetching.
  excluded_usernames:
    - <user_name1>
    - <user_name2>

  # (Optional) Limit the number of results returned in one page of query log history. The default is 100.
  max_results: <count>
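One way to picture the effect of lookback_days is as a time window that ends at the start of the current day. The sketch below rests on that assumption: the function `query_log_window` is hypothetical, and the use of UTC and midnight boundaries is a guess at the behavior, not the connector's documented implementation.

```python
from datetime import datetime, timedelta, timezone


def query_log_window(lookback_days, now=None):
    """Return (start, end) of the query log window, or None if disabled.

    ASSUMPTION: the window ends at midnight UTC today, so lookback_days=1
    covers all of yesterday. The connector's real boundaries may differ.
    """
    if lookback_days <= 0:
        return None  # lookback_days: 0 turns off query log fetching
    now = now or datetime.now(timezone.utc)
    end = now.replace(hour=0, minute=0, second=0, microsecond=0)
    start = end - timedelta(days=lookback_days)
    return start, end
```

Under this reading, a value of 7 backfills the seven full days before today, and the default of 1 covers yesterday only.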

Process Query Config

See Process Query for more information on the optional process_query_config config.

Warehouse ID

Note: We encourage using a cluster; SQL warehouse support in this connector will be deprecated.

To run the queries using a specific warehouse, simply add its ID in the configuration file:

warehouse_id: <warehouse_id>

If neither a warehouse ID nor a cluster path is provided, the connector automatically uses the first discovered warehouse.

Testing

Follow the Installation instructions to install metaphor-connectors in your environment (or virtualenv). Make sure to include either the all or the unity_catalog extra.

Run the following command to test the connector locally:

metaphor unity_catalog <config_file>

Manually verify the output after the command finishes.