This connector extracts technical metadata from a BigQuery project using the Python Client for Google BigQuery.
We recommend creating a dedicated GCP service account with limited permissions for the connector:
- Go to Roles in Google Cloud Console. Make sure the appropriate project is selected from the project dropdown.
- Click **Create Role** and use the following settings to create a new custom role for running metadata jobs and fetching logs:
  - Enter a role title, e.g. `Metaphor Job`.
  - Enter a description, e.g. `Metadata collection job role for Metaphor app`.
  - Choose role stage `General Availability`.
  - Click **ADD PERMISSIONS** and add `bigquery.config.get`, `bigquery.jobs.create`, and `bigquery.jobs.get`.
  - Click **CREATE** to create the custom role.
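If you prefer to script this step, the same custom role can be created with the gcloud CLI. This is only a sketch, assuming gcloud is installed and authorized against the target project; the role ID `MetaphorJob` is an arbitrary example:

```bash
# Create a custom role with the three permissions listed above
# (role ID "MetaphorJob" is an example, not something the connector requires)
gcloud iam roles create MetaphorJob \
  --project=<bigquery_project_id> \
  --title="Metaphor Job" \
  --description="Metadata collection job role for Metaphor app" \
  --permissions=bigquery.config.get,bigquery.jobs.create,bigquery.jobs.get \
  --stage=GA
```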
- Go to Service Accounts in Google Cloud Console. Make sure the appropriate project is selected from the project dropdown.
- Click **Create Service Account** and use the following settings to create a new account:
  - Enter a service account name, e.g. `metaphor-bigquery`.
  - Enter a description, e.g. `Metadata collection for Metaphor app`.
  - Click **CREATE AND CONTINUE**.
  - Select `BigQuery Metadata Viewer`, `Private Logs Viewer`, and `Metaphor Job` as the roles, then click **CONTINUE**.
  - Click **DONE** to complete the process; there's no need to grant user access to the service account.
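The equivalent CLI steps, again as an illustrative sketch (the account name, project ID, and custom role ID are placeholders matching the examples above):

```bash
# Create the service account
gcloud iam service-accounts create metaphor-bigquery \
  --project=<bigquery_project_id> \
  --display-name="metaphor-bigquery" \
  --description="Metadata collection for Metaphor app"

# Grant the two predefined roles plus the custom role created earlier
SA="metaphor-bigquery@<bigquery_project_id>.iam.gserviceaccount.com"
for ROLE in roles/bigquery.metadataViewer \
            roles/logging.privateLogViewer \
            projects/<bigquery_project_id>/roles/MetaphorJob; do
  gcloud projects add-iam-policy-binding <bigquery_project_id> \
    --member="serviceAccount:${SA}" --role="${ROLE}"
done
```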
Alternatively, if you prefer to avoid predefined roles and use only a custom role, here is the complete list of permissions needed (see the gcloud sketch after the list):
- bigquery.config.get
- bigquery.datasets.get
- bigquery.datasets.getIamPolicy
- bigquery.jobs.create
- bigquery.jobs.get
- bigquery.models.getMetadata
- bigquery.models.list
- bigquery.routines.get
- bigquery.routines.list
- bigquery.tables.get
- bigquery.tables.getIamPolicy
- bigquery.tables.list
- logging.logEntries.list
- logging.logs.list
- logging.privateLogEntries.list
- resourcemanager.projects.get
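As a sketch, the full list can be baked into a single custom role with gcloud (the role ID `MetaphorConnector` is an example name, not something the connector requires):

```bash
# Create one custom role carrying every permission the connector needs
gcloud iam roles create MetaphorConnector \
  --project=<bigquery_project_id> \
  --title="Metaphor Connector" \
  --stage=GA \
  --permissions=bigquery.config.get,bigquery.datasets.get,\
bigquery.datasets.getIamPolicy,bigquery.jobs.create,bigquery.jobs.get,\
bigquery.models.getMetadata,bigquery.models.list,bigquery.routines.get,\
bigquery.routines.list,bigquery.tables.get,bigquery.tables.getIamPolicy,\
bigquery.tables.list,logging.logEntries.list,logging.logs.list,\
logging.privateLogEntries.list,resourcemanager.projects.get
```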
Once the service account is created, you need to create a service account key for authentication:
- Click the newly created account from Service Accounts, then choose the **KEYS** tab.
- Click **ADD KEY** > **Create new key**.
- Select `JSON` key type and click **CREATE**.
- Save the generated key file to a secure location.
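Or, as a CLI sketch (the output file name is arbitrary):

```bash
# Generate a JSON key for the service account
gcloud iam service-accounts keys create metaphor-bigquery-key.json \
  --iam-account="metaphor-bigquery@<bigquery_project_id>.iam.gserviceaccount.com"
```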
Your key file may look like:

```json
{
  "type": "service_account",
  "project_id": "<project_id>",
  "private_key_id": "<key_id>",
  "private_key": "-----BEGIN PRIVATE KEY-----\n<key>\n-----END PRIVATE KEY-----",
  "client_email": "<email>@<project_id>.iam.gserviceaccount.com",
  "client_id": "<client_id>",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/<client_cert_url>"
}
```
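To sanity-check that the key authenticates before wiring it into the connector, one option (a sketch, assuming the gcloud and bq CLIs are installed and the roles above are in place) is:

```bash
# Authenticate as the service account and attempt a metadata read
gcloud auth activate-service-account --key-file=metaphor-bigquery-key.json
bq ls --project_id=<bigquery_project_id>
```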
Create a YAML config file based on the following template.

```yaml
project_ids:
  - <bigquery_project_id>
```
To connect to BigQuery, either the keyfile path or the credentials from the JSON keyfile must be set in the config as follows:

```yaml
key_path: <path_to_JSON_key_file>
```

or

```yaml
credentials:
  project_id: <project_id>
  private_key_id: <private_key_id>
  private_key: <private_key_value>
  client_email: <client_email>
  client_id: <client_id>
```

Note: `private_key` is a multi-line string and must be specified using YAML's literal style, i.e. using a pipe (`|`). You can use this command to extract and format the value from the JSON key:

```bash
cat <JSON_key_file> | jq -r '.private_key' | sed 's/^/    /'
```

Copy & paste the value into the config file as such:

```yaml
credentials:
  private_key: |
    <paste here>
```
See Output Config for more information.
See Filter Config for more information on the optional `filter` config.
NOTE: While the filter config uses a `database -> schema -> table/view` hierarchy, BigQuery uses a `project -> dataset -> table/view` hierarchy. Please use the project ID at the top level of the filter config's includes/excludes.
See Tag Matcher Config for more information on the optional `tag_matcher` config.
By default, all BigQuery API calls are made using the project ID specified in the credentials. However, sometimes you may wish to run BigQuery jobs using a different project ID, e.g. when querying the DDLs for tables & views. You can do so by specifying the `job_project_id` config as follows:

```yaml
job_project_id: <project_id>
```
The max number of concurrent requests to the Google Cloud API can be configured as follows:

```yaml
max_concurrency: <max_number_of_queries> # Default to 5
```
By default, the BigQuery connector fetches a full day's query logs (AuditMetadata) from yesterday and analyzes them for additional metadata, such as dataset usage and lineage information. To backfill log data, set `lookback_days` to the desired value. To turn off query log fetching, set `lookback_days` to 0.
```yaml
query_log:
  # (Optional) Number of days of query logs to fetch. Default to 1. If 0, no query logs will be fetched.
  lookback_days: <days>

  # (Optional) A list of users whose queries will be excluded from the log fetching.
  excluded_usernames:
    - <user_name1>
    - <user_name2>

  # (Optional) Exclude queries issued by service accounts. Default to false.
  exclude_service_accounts: <boolean>

  # (Optional) The number of query logs to fetch from BigQuery in one batch. Max 1000, default to 1000.
  fetch_size: <number_of_logs>

  # (Optional) Fetch the full query SQL from the job API if it's truncated in the audit metadata log. Default to true.
  fetch_job_query_if_truncated: <boolean>

  # Maximum allowed requests per minute to the log entries API. Default to 60.
  max_requests_per_minute: <requests_per_minute>
```
See Process Query for more information on the optional `process_query_config` config.
Make sure to use the BigQuery project ID when setting the `database` field in the filter configuration. For example:

```yaml
project_ids:
  - <bigquery_project_id>

filter:
  includes:
    <bigquery_project_id>:
      schema1:
      schema2:
  excludes:
    <bigquery_project_id>:
      schema1:
        - table1
```
Follow the Installation instructions to install `metaphor-connectors` in your environment (or virtualenv). Make sure to include either the `all` or `bigquery` extra.
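For example, with pip (a typical invocation; adjust to your environment):

```bash
pip install "metaphor-connectors[bigquery]"   # or "metaphor-connectors[all]"
```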
Run the following command to test the connector locally:

```bash
metaphor bigquery <config_file>
```
Manually verify the output after the run finishes.