Exports Amazon Neptune property graph data to CSV or JSON, or RDF graph data to Turtle.
- Neptune-Export service
- Best practices
- Exporting to the Bulk Loader CSV Format
- Exporting the Results of User-Supplied Queries
- Exporting an RDF Graph
- Exporting to an Amazon Kinesis Data Stream
- Building neptune-export
- Security
- Deploying neptune-export as an AWS Lambda Function
You can now deploy neptune-export as a service inside your Neptune VPC. Use these CloudFormation templates to install the Neptune-Export service.
neptune-export can also be run as an ad hoc command. See the prerequisites.
neptune-export cannot guarantee the consistency of exported data if you export from a Neptune cluster whose data is changing while the export is taking place. Therefore, we recommend exporting from a clone of your cluster. This ensures the export takes place against a static version of the data at the point in time the database was cloned. Further, exporting from a clone ensures the export doesn’t impact the query performance of the original cluster.
neptune-export makes it easy to export from a clone. Simply supply the `--clone-cluster` option with the command. You can also use the `--clone-cluster-replica-count` option to specify the number of read replicas to be added to the cloned cluster, and the `--clone-cluster-instance-type` parameter to tell neptune-export which instance type (e.g. `db.r5.2xlarge`) to use for each instance in the cloned cluster. By default, neptune-export uses the same instance type as the primary instance in the original cluster.
If you clone your cluster using the `--clone-cluster` option, neptune-export ignores any `--concurrency` option supplied in the parameters, and instead works out a concurrency setting based on the number of instances in the cloned cluster and their instance types.
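For example, a minimal sketch of an export run against a temporary clone (the endpoint is a placeholder, and any output-location options are omitted):
java -jar neptune-export.jar export-pg \
  -e <neptune_endpoint> \
  --clone-cluster \
  --clone-cluster-replica-count 2 \
  --clone-cluster-instance-type db.r5.2xlarge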
If you use the cluster cloning features of neptune-export, you must ensure the AWS Identity and Access Management identity with which the process runs can perform the following actions:
- rds:AddTagsToResource
- rds:DescribeDBClusters
- rds:DescribeDBInstances
- rds:ListTagsForResource
- rds:DescribeDBClusterParameters
- rds:DescribeDBParameters
- rds:ModifyDBParameterGroup
- rds:ModifyDBClusterParameterGroup
- rds:RestoreDBClusterToPointInTime
- rds:DeleteDBInstance
- rds:DeleteDBClusterParameterGroup
- rds:DeleteDBParameterGroup
- rds:DeleteDBCluster
- rds:CreateDBInstance
- rds:CreateDBClusterParameterGroup
- rds:CreateDBParameterGroup
Use the `export-pg-from-config` command in preference to `export-pg` when exporting property graphs from Neptune. The `export-pg` command makes two passes over your data: the first to generate metadata, the second to create the data files. This first pass takes place on a single thread, and for very large datasets it can take many hours, often much longer than the export itself.
The preferred approach is to generate the metadata once using `create-pg-config`, store the config file in S3, and then refer to it from `export-pg-from-config` using the `--config-file` option.
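For example, a minimal sketch of that workflow. The `-d` output-directory option, the generated config file name, and direct S3 URI support for `--config-file` are assumptions, so adjust them for your version of the tool:
# -d, the config.json file name, and S3 support for --config-file are assumptions
java -jar neptune-export.jar create-pg-config -e <neptune_endpoint> -d /tmp/neptune-config
aws s3 cp /tmp/neptune-config/config.json s3://<your-bucket>/neptune/config.json
java -jar neptune-export.jar export-pg-from-config \
  -e <neptune_endpoint> \
  --config-file s3://<your-bucket>/neptune/config.json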
When performing a parallel export (`--concurrency` is larger than one), neptune-export must first query your database to determine the number of nodes and edges to be exported. These numbers are then used to calculate ranges for each query in a set of parallel queries. Counting the nodes and edges in a large dataset can take many minutes.
neptune-export now includes `--approx-node-count` and `--approx-edge-count` options that allow you to supply estimates for the number of nodes and edges you expect to export. By specifying approximate counts you can reduce the export time, because neptune-export no longer has to query the database to count the nodes and edges.
The numbers you supply need only be approximate; being within ten percent or so of the real counts is fine. One way of calculating these numbers is to use the counts from a previous export, adjusted for the approximate number of additions and deletions that have taken place in the interim.
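For example, a sketch that supplies estimated counts carried over from a previous export (the numbers are illustrative):
java -jar neptune-export.jar export-pg \
  -e <neptune_endpoint> \
  --approx-node-count 300000000 \
  --approx-edge-count 900000000 \
  --concurrency 8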
When exporting to the CSV format used by the Amazon Neptune bulk loader, neptune-export generates CSV files based on a schema derived from scanning your graph. This schema is persisted in a JSON file. There are three ways in which you can use the tool to generate bulk load files:
- `export-pg` – This command makes two passes over your data: the first to generate the schema, the second to create the data files. By scanning all nodes and edges in the first pass, the tool captures the superset of properties for each label, identifies the datatype for each property, and identifies any properties for which at least one vertex or edge has multiple values. If exporting to CSV, these latter properties are exported to CSV as array types. If exporting to JSON, these property values are exported as array nodes.
- `create-pg-config` – This command makes a single pass over your data to generate the schema config file.
- `export-pg-from-config` – This command makes a single pass over your data to create the CSV or JSON files. It uses a preexisting schema config file.
`export-pg` and `create-pg-config` both generate schema JSON files describing the properties associated with each node and edge label. By default, these commands scan the entire database. For large datasets, this can take a long time.
Both commands also allow you to sample a range of nodes and edges in order to create this schema. If you are confident that sampling your data will yield the same schema as scanning the entire dataset, specify the `--sample` option with these commands. If, however, you have reason to believe that the same property on different nodes or edges could yield different datatypes or different cardinalities, or that nodes or edges with the same labels could contain different sets of properties, you should consider retaining the default behaviour of a full scan.
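For example, a minimal sketch that generates the schema config from a sample rather than a full scan:
java -jar neptune-export.jar create-pg-config \
  -e <neptune_endpoint> \
  --sample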
Once you have generated a schema file, either with `export-pg` or `create-pg-config`, you can reuse it for subsequent exports with `export-pg-from-config`. You can also modify the file to restrict the labels and properties that will be exported.
All three commands allow you to supply vertex and edge label filters.
- If you supply label filters to the `export-pg` command, the schema file and the exported data files will contain data only for the labels specified in the filters.
- If you supply label filters to the `create-pg-config` command, the schema file will contain data only for the labels specified in the filters.
- If you supply label filters to the `export-pg-from-config` command, the exported data files will contain data for the intersection of labels in the config file and the labels specified in the command filters.
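For example, a sketch that restricts an export to a subset of labels. The `--node-label` and `--edge-label` option names are assumptions, so verify them against your version's help output:
# --node-label and --edge-label are assumed option names; check --help for your version
java -jar neptune-export.jar export-pg \
  -e <neptune_endpoint> \
  --node-label Person --node-label Company \
  --edge-label WORKS_FOR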
For some offline use cases you may want to export only the structural data in the graph: that is, just the labels and IDs of vertices and edges. `export-pg` allows you to specify a `--tokens-only` option with the value `nodes`, `edges` or `both`. A tokens-only export does not generate a schema, nor does it export any property data: for vertices it simply exports `~id` and `~label`; for edges, it exports `~id`, `~from`, `~to` and `~label`. You can still use label filters to determine exactly which vertices and edges will be exported.
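For example, a minimal sketch of a tokens-only export of both vertices and edges:
java -jar neptune-export.jar export-pg \
  -e <neptune_endpoint> \
  --tokens-only both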
The `export-pg` and `export-pg-from-config` commands support parallel export. You can supply a concurrency level, which determines the number of client threads used to perform the parallel export, and, optionally, a range or batch size, which determines how many nodes or edges will be queried by each thread at a time. If you specify a concurrency level but don't supply a range, the tool calculates a range such that each thread queries approximately (1 / concurrency level) of the total number of nodes or edges.
If using parallel export, we recommend setting the concurrency level to the number of vCPUs on your Neptune instance.
You can load balance requests across multiple instances in your cluster (or even multiple clusters) by supplying multiple `--endpoint` options.
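For example, a sketch of a parallel export load balanced across two read replicas (the endpoints are placeholders; match the concurrency level to your instances' vCPUs):
java -jar neptune-export.jar export-pg \
  -e <replica_endpoint_1> \
  -e <replica_endpoint_2> \
  --concurrency 8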
neptune-export uses long-running queries to generate the schema and the data files. You may need to increase the `neptune_query_timeout` DB parameter in order to run the tool against large datasets.
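For example, one way to raise the timeout with the AWS CLI, assuming your instances use a custom DB parameter group named my-neptune-params (a placeholder); the value is in milliseconds, so adjust it and the apply method to suit your setup:
# my-neptune-params is a placeholder; neptune_query_timeout is in milliseconds
aws neptune modify-db-parameter-group \
  --db-parameter-group-name my-neptune-params \
  --parameters "ParameterName=neptune_query_timeout,ParameterValue=7200000,ApplyMethod=immediate"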
For large datasets, we recommend running this tool against a standalone database instance that has been restored from a snapshot of your database.
The latest version of neptune-export uses the GraphBinary serialization format introduced in Gremlin 3.4.x. Previous versions of neptune-export used Gryo. To revert to using Gryo, supply `--serializer GRYO_V3D0`.
neptune-export attempts to use the JVM system default text encoding for all output files. If needed, this can be configured manually by setting the `file.encoding` system property:
java -Dfile.encoding=UTF8 -jar neptune-export.jar ...
neptune-export's `export-pg-from-queries` command allows you to supply groups of Gremlin queries and export the results to CSV or JSON.
Every user-supplied query should return a resultset whose every result comprises a Map. Typically, these are queries that return a `valueMap()` or a projection created using `project().by().by()...`.
Queries are grouped into named groups. All the queries in a named group should return the same columns. Named groups allow you to 'shard' large queries and execute them in parallel (using the `--concurrency` option). Note that query sharding is not done automatically, so if you supply just one query, you will get no benefit from increasing the concurrency level past one. The resulting CSV or JSON files will be written to a directory named after the group.
If there is a possibility that individual rows in a query's resultset will contain different keys, use the `--two-pass-analysis` flag to force neptune-export to determine the superset of keys or column headers for the query.
You can supply multiple named groups using multiple `--queries` options. Each group comprises a name, an equals sign, and then a semicolon-delimited list of Gremlin queries. Surround the list of queries in double quotes. For example:
-q person="g.V().hasLabel('Person').range(0,100000).valueMap();g.V().hasLabel('Person').range(100000,-1).valueMap()"
Sharding queries for concurrent execution can create a large number of queries, especially with a high concurrency level. To avoid supplying all of these queries as command-line arguments, you can instead provide them in a JSON file with the `--queriesFile` option. The JSON file should be formatted like this:
[
{
"name": "NamedQueryGroup",
"queries": ["list", "of", "sharded", "queries", "in", "group"]
},
...
]
This file can be given as a local path, an HTTPS URL, or an S3 URI.
The `--split-queries` option can be used to automatically shard queries. When invoked, the tool calculates ranges in the same manner as the `export-pg` command's parallel export, and then splits each query into `--concurrency` number of shards.
The sharded queries use `range()` steps injected at the beginning of the query to divide the ranges. For example, `g.V().hasLabel("person")` may be sharded as:
g.V().range(0, 250).hasLabel("person")
g.V().range(250, 500).hasLabel("person")
g.V().range(500, 750).hasLabel("person")
g.V().range(750, -1).hasLabel("person")
This `range()`-based sharding may not be uniformly balanced, and may produce different results with certain queries. Any Gremlin steps that operate on the entire input stream at once (such as `order()`, `dedup()`, and `group()`) should be used with caution, as this sharding inevitably alters their inputs.
For any queries that are incompatible with `range()`-based sharding, or in situations where more precise balancing is required, we recommend avoiding `--split-queries` and instead providing a `--queriesFile` with pre-sharded queries.
If using parallel export, we recommend setting the concurrency level to the number of vCPUs on your Neptune instance. When neptune-export executes named groups of queries in parallel, it simply flattens all the queries into a queue, and spins up a pool of worker threads according to the concurrency level you have specified using `--concurrency`. Worker threads continue to take queries from the queue until the queue is exhausted.
Queries whose results contain very large rows can sometimes trigger a `CorruptedFrameException`. If this happens, you can either adjust the batch size (`--batch-size`) to reduce the number of results returned to the client in a batch (the default is 64), or increase the frame size (`--max-content-length`, default value 65536).
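For example, a sketch that reduces the batch size and doubles the frame size for a query group that returns large rows:
java -jar neptune-export.jar export-pg-from-queries \
  -e <neptune_endpoint> \
  -q person="g.V().hasLabel('Person').valueMap()" \
  --batch-size 16 \
  --max-content-length 131072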
At present neptune-export supports exporting an RDF dataset to Turtle, NQuads, and NTriples with a single-threaded long-running query.
The default scope for `export-rdf` is to export the entire dataset (the union of all named graphs). Use the `--named-graph <NamedGraphURI>` argument to limit the scope to a single named graph. This argument can only be used with the default `graph` scope (`--rdf-export-scope graph`).
To export the results of a SPARQL query, use the `--rdf-export-scope query` and `--sparql <SPARQL Query>` arguments.
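For example, sketches of the default, named-graph, and query scopes (the endpoint, graph URI, and SPARQL query are placeholders):
# Export the entire dataset (default scope)
java -jar neptune-export.jar export-rdf -e <neptune_endpoint>
# Limit the export to a single named graph
java -jar neptune-export.jar export-rdf -e <neptune_endpoint> \
  --rdf-export-scope graph --named-graph http://example.com/graph1
# Export the results of a SPARQL query
java -jar neptune-export.jar export-rdf -e <neptune_endpoint> \
  --rdf-export-scope query --sparql "CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o } LIMIT 100"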
By default, neptune-export connects to your database using SSL. If your target does not support SSL connections, use the `--disable-ssl` flag.
(SSL used to be an opt-in feature for neptune-export, with a `--use-ssl` option for turning SSL on. This behaviour has now changed: SSL is on by default, but can be turned off using `--disable-ssl`. The `--use-ssl` option no longer has any effect.)
If you are using a load balancer or a proxy server (such as HAProxy), you must use SSL termination and have your own SSL certificate on the proxy server.
neptune-export supports exporting from databases that have IAM database authentication enabled. Supply the `--use-iam-auth` option with each command. Remember to set the SERVICE_REGION environment variable: for example, `export SERVICE_REGION=us-east-1`.
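For example, a minimal sketch of an export against an IAM-auth-enabled cluster:
export SERVICE_REGION=us-east-1
java -jar neptune-export.jar export-pg \
  -e <neptune_endpoint> \
  --use-iam-auth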
neptune-export also supports connecting through a load balancer to a Neptune database with IAM DB authentication enabled. However, this feature is currently supported only for property graphs, with support for RDF graphs coming soon.
If you are connecting through a load balancer and have IAM DB authentication enabled, you must also supply either an `--nlb-endpoint` option (if using a network load balancer) or an `--alb-endpoint` option (if using an application load balancer), and an `--lb-port`.
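For example, a sketch of an IAM-auth export through a network load balancer (the DNS name and port are placeholders):
export SERVICE_REGION=us-east-1
java -jar neptune-export.jar export-pg \
  -e <neptune_endpoint> \
  --use-iam-auth \
  --nlb-endpoint <network_load_balancer_dns_name> \
  --lb-port 80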
For details on using a load balancer with a database with IAM DB authentication enabled, see Connecting to Amazon Neptune from Clients Outside the Neptune VPC.
When exporting to an Amazon Kinesis Data Stream, records are aggregated by default: that is, multiple exported records are packed into a single Kinesis Data Streams record. In your stream client you will need to deaggregate the records. If you are using a Python client, you can use the record deaggregation module from the Kinesis Aggregation/Deaggregation Modules for Python.
You can turn off stream record aggregation when you export to a Kinesis Data Stream by using the `--disable-stream-aggregation` option.
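For example, a rough sketch of an export that writes unaggregated records to a Kinesis Data Stream. Only `--disable-stream-aggregation` is documented above; the `--output`, `--stream-name`, and `--region` option names are assumptions, so check your version's help output for the exact Kinesis-related options:
# --output, --stream-name and --region are assumed option names;
# only --disable-stream-aggregation is documented above
java -jar neptune-export.jar export-pg \
  -e <neptune_endpoint> \
  --output stream \
  --stream-name <kinesis_stream_name> \
  --region us-east-1 \
  --disable-stream-aggregation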
To build the jar, run:
mvn clean install
The neptune-export jar can be deployed as an AWS Lambda function. To access Neptune, you will either have to configure the function to access resources inside your VPC, or expose the Neptune endpoints via a load balancer.
Be mindful of the AWS Lambda limits, particularly with regard to function timeouts (max 15 minutes) and /tmp directory storage (512 MB). Large exports can easily exceed these limits.
When deployed as a Lambda function, neptune-export will automatically copy the export files to an S3 bucket of your choosing. Optionally, it can also write a completion file to a separate S3 location (useful for triggering additional Lambda functions). You must configure your function with an IAM role that has write access to these S3 locations.
The Lambda function expects a number of parameters, which you can supply either as environment variables or via a JSON input parameter. Fields in the JSON input parameter override any environment variables you have set up.
| Environment Variable | JSON Field | Description | Required |
|---|---|---|---|
| `COMMAND` | `command` | neptune-export command and command-line options: e.g. `export-pg -e <neptune_endpoint>` | Mandatory |
| `OUTPUT_S3_PATH` | `outputS3Path` | S3 location to which exported files will be written | Mandatory |
| `CONFIG_FILE_S3_PATH` | `configFileS3Path` | S3 location of a JSON config file to be used when exporting a property graph from a config file | Optional |
| `COMPLETION_FILE_S3_PATH` | `completionFileS3Path` | S3 location to which a completion file should be written once all export files have been copied to S3 | Optional |
| `SSE_KMS_KEY_ID` | `sseKmsKeyId` | ID of the customer managed AWS KMS symmetric encryption key to be used for server-side encryption when exporting to S3 | Optional |
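For example, a sketch of invoking the function with a JSON input parameter (the function name and S3 paths are placeholders):
# --cli-binary-format raw-in-base64-out is needed for a JSON payload with AWS CLI v2
aws lambda invoke \
  --function-name <your-neptune-export-function> \
  --cli-binary-format raw-in-base64-out \
  --payload '{"command": "export-pg -e <neptune_endpoint>", "outputS3Path": "s3://<your-bucket>/neptune-export", "completionFileS3Path": "s3://<your-bucket>/neptune-export/completed"}' \
  response.json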