Tap and Target performance benchmarks #906

aaronsteers · 2022-08-11T18:20:58Z

aaronsteers
Aug 11, 2022

I'm opening this discussion to share general performance benchmarks. We can use these as case studies and reference points as we work on SDK-level performance improvements.

aaronsteers · 2022-08-11T18:22:22Z

aaronsteers
Aug 11, 2022
Author

This benchmark data was provided by @BuzzCutNorman using tap-mssql and target-postgres (both non-SDK).

Test: Full Stack Overflow
Data size: 50 GB Stack Overflow database
Start: 8/10 7:36 PM
End: 8/11 6:35 AM
Duration: 11 hours

Table 	    Start DTTM	 End DTTM	     Columns	Rows	    Duration/Minutes
Badges	    08-10 18:36	 08-10 19:07	 4	        8042005	    31
Comments	08-10 19:07	 08-10 21:55	 6	        24534730	168
LinkTypes	08-10 21:55	 08-10 21:55	 2	        2	        0
PostLinks	08-10 21:55	 08-10 22:01	 5	        1421208	    6
PostTypes	08-10 22:01	 08-10 22:01	 2	        8	        0
Posts	    08-10 22:01	 08-11 00:20	 20	        17142169	139
Users	    08-11 00:20	 08-11 00:34	 14	        2465713	    14
UsersTest	08-11 00:34	 08-11 00:34	 14	        5000	    0
VoteTypes	08-11 00:34	 08-11 00:34	 1	        15	        0
Votes	    08-11 00:34	 08-11 05:35	 6	        52928720	301

1 reply

BuzzCutNorman Aug 26, 2022

Just putting the above information into an easier to read format.

Table	Start DTTM	End DTTM	Columns	Rows	Duration/Minutes
Badges	08-10 18:36	08-10 19:07	4	8042005	31
Comments	08-10 19:07	08-10 21:55	6	24534730	168
LinkTypes	08-10 21:55	08-10 21:55	2	2	0
PostLinks	08-10 21:55	08-10 22:01	5	1421208	6
PostTypes	08-10 22:01	08-10 22:01	2	8	0
Posts	08-10 22:01	08-11 00:20	20	17142169	139
Users	08-11 00:20	08-11 00:34	14	2465713	14
UsersTest	08-11 00:34	08-11 00:34	14	5000	0
VoteTypes	08-11 00:34	08-11 00:34	1	15	0
Votes	08-11 00:34	08-11 05:35	6	52928720	301
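For comparing tables of very different sizes, it can help to turn the rows and minutes above into a rough rows-per-second figure. This is only a sketch of that arithmetic: the numbers are copied from the table, the durations are rounded to the minute (so the rates are approximate), and tables that finished in under a minute are left out because their rate cannot be computed from a duration of 0.

```python
# (table, rows, duration in minutes) copied from the table above.
# Tables with a rounded duration of 0 minutes are omitted.
RUNS = [
    ("Badges", 8042005, 31),
    ("Comments", 24534730, 168),
    ("PostLinks", 1421208, 6),
    ("Posts", 17142169, 139),
    ("Users", 2465713, 14),
    ("Votes", 52928720, 301),
]


def rows_per_second(rows: int, minutes: int) -> float:
    """Average throughput for one table, given its rounded duration."""
    return rows / (minutes * 60)


def throughput_table(runs):
    """Map each table name to its throughput, rounded to whole rows/s."""
    return {
        name: round(rows_per_second(rows, minutes))
        for name, rows, minutes in runs
    }


if __name__ == "__main__":
    for name, rate in throughput_table(RUNS).items():
        print(f"{name:<10} {rate:>6} rows/s")
```

With these inputs the tables land between roughly 2,000 and 4,300 rows per second, with the wide Posts table (20 columns) at the low end.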

BuzzCutNorman · 2022-08-26T17:56:39Z

BuzzCutNorman
Aug 26, 2022

Summary

I have been trying out some SQLAlchemy options to see how they affect extract and load times, and thought I would share what I have found so far. The biggest boost came from having executemany enabled on the target. I think SQLAlchemy uses executemany_mode=values_only for Postgres by default, so I had to force executemany off to measure how much it and values_plus_batch help. The benefit was roughly a 10-minute reduction in extract and load time for the Users table (from about 26 minutes to about 15).

Next up is trying some of the SQLAlchemy options against the larger tables in the StackOverflow2013 database.

I will call out one item not mentioned below that I didn't notice until late in my testing: the SDK turns on streaming results by default.
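For readers who want to try the same switches, here is a minimal sketch of how these options are typically passed to SQLAlchemy's create_engine(). It assumes SQLAlchemy 1.4, where the psycopg2 dialect accepts executemany_mode and the pyodbc dialect accepts fast_executemany; the helper names (target_engine_kwargs, tap_engine_kwargs, build_engine) are made up for illustration and are not part of the SDK, and the accepted mode values should be checked against the SQLAlchemy version in use.

```python
from typing import Optional

# Values the SQLAlchemy 1.4 psycopg2 dialect documents for executemany_mode
# (assumption: 1.4; other versions may accept a different set).
# None means "use plain cursor.executemany() without the psycopg2 helpers".
PSYCOPG2_MODES = ("values_only", "values_plus_batch", "batch", None)


def target_engine_kwargs(executemany_mode: Optional[str] = "values_only") -> dict:
    """Keyword arguments for a postgresql+psycopg2 engine (hypothetical helper)."""
    if executemany_mode not in PSYCOPG2_MODES:
        raise ValueError(f"unsupported executemany_mode: {executemany_mode!r}")
    return {"executemany_mode": executemany_mode}


def tap_engine_kwargs(fast_executemany: bool = False) -> dict:
    """Keyword arguments for an mssql+pyodbc engine (hypothetical helper)."""
    return {"fast_executemany": fast_executemany}


def build_engine(url: str, **kwargs):
    """Create the engine; SQLAlchemy is imported lazily so the helpers
    above can be inspected without it being installed."""
    from sqlalchemy import create_engine

    return create_engine(url, **kwargs)
```

A target engine for the values_plus_batch runs would then look something like build_engine("postgresql+psycopg2://user:password@host/dbname", **target_engine_kwargs("values_plus_batch")), where the URL is a placeholder. Streaming results, mentioned above, is a separate per-connection setting in SQLAlchemy: connection.execution_options(stream_results=True).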

Setup

Source:

VM 4 x vCPU, 16 GB memory
Windows 2019 standard
MS SQL Server 2019
50 GB StackOverflow2013

Target:

VM 4 x vCPU, 16 GB memory
Windows 2019 standard
Postgres 14
Note: The Users table was dropped from the database in between runs

Meltano:

VM 2 x vCPU, 8 GB memory
Windows 2022 standard core
Python 3.9
Meltano version: 2.4.0
Tap: SDK(0.8.0) mssql+pyodbc
Target: SDK(0.8.0) postgresql+psycopg2
Command: meltano run

Results

Database	Table	Columns	Rows	Duration/Minutes	Notes
StackOverflow2013	Users	4	2465713	26	No tap or target Execute Helpers
StackOverflow2013	Users	4	2465713	25	Tap fast_executemany=True
StackOverflow2013	Users	4	2465713	26	Tap fast_executemany=True, result.fetchmany(5000)
StackOverflow2013	Users	4	2465713	15	Target executemany_mode=values_only
StackOverflow2013	Users	4	2465713	14	Target executemany_mode=values_plus_batch
StackOverflow2013	Users	4	2465713	15	Target executemany_mode=values_plus_batch, Dialect Specific insert()
StackOverflow2013	Users	4	2465713	15	Tap fast_executemany=True, result.fetchmany(5000); Target executemany_mode=values_only
StackOverflow2013	Users	4	2465713	16	Tap fast_executemany=True, result.fetchmany(10000); Target executemany_mode=values_only, Dialect Specific insert()
StackOverflow2013	Users	4	2465713	17	Tap fast_executemany=True, result.fetchmany(10000); Target executemany_mode=values_plus_batch, Dialect Specific insert(), Max Sink Size 1000
StackOverflow2013	Users	4	2465713	15	Tap fast_executemany=True, result.fetchone(); Target executemany_mode=values_only, Dialect Specific insert()
StackOverflow2013	Users	4	2465713	14	Tap fast_executemany=True, result.fetchone(); Target executemany_mode=values_plus_batch, Dialect Specific insert()
StackOverflow2013	Users	4	2465713	15	Tap fast_executemany=True, result.fetchone(); Target executemany_mode=values_plus_batch, Dialect Specific insert()

Legend:
fast_executemany
executemany_mode
result.fetchmany
result.fetchone
Dialect Specific insert()
Max Sink Size 1000
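The result.fetchmany(N) entries in the notes refer to pulling rows from the cursor in fixed-size chunks instead of one row at a time with result.fetchone(). The sketch below shows that pattern using the standard library's sqlite3 cursor as a stand-in, since it exposes the same fetchmany() call as other DB-API cursors. It is not the tap's actual code, and the table and helper names are made up.

```python
import sqlite3
from typing import Iterator


def iter_rows(cursor: sqlite3.Cursor, batch_size: int = 5000) -> Iterator[tuple]:
    """Yield rows one by one while fetching them from the cursor in batches."""
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:  # an empty list means the result set is exhausted
            break
        yield from batch


def demo(row_count: int = 12_000, batch_size: int = 5000) -> int:
    """Load some rows into an in-memory table and count them via iter_rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO users (id, name) VALUES (?, ?)",
        ((i, f"user-{i}") for i in range(row_count)),
    )
    cursor = conn.execute("SELECT id, name FROM users")
    seen = sum(1 for _ in iter_rows(cursor, batch_size))
    conn.close()
    return seen


if __name__ == "__main__":
    print(demo())
```

The caller still sees a plain stream of rows; only the number of round trips to the driver changes with batch_size.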

Other Links:
Fetching Slowness - Core
Large Result Set Examples

6 replies

aaronsteers Aug 26, 2022
Author

@BuzzCutNorman - Do you have any quick steps to load the source DB?

Going a little meta here, but a series of load/prep steps like this would be great...

# Seed the raw data
meltano run tap-stackoverflow-sampledata target-mssql-sampledata

# Now start profiling...
time meltano run tap-mssql-sampledata target-postgres-perftest

Not sure if this is super hard or super easy but wanted to ask directly, as this would be nice to have as a set of reproducible steps.
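As a portable alternative to the shell's time (the benchmarks in this thread ran on Windows), a small wrapper can run the command and report wall-clock duration. This is a sketch only: time_command is a made-up helper, and the meltano plugin names in the usage note are the hypothetical ones from the commands above.

```python
import subprocess
import sys
import time
from typing import Sequence, Tuple


def time_command(args: Sequence[str]) -> Tuple[int, float]:
    """Run a command and return (exit code, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    completed = subprocess.run(args, check=False)
    elapsed = time.perf_counter() - start
    return completed.returncode, elapsed


if __name__ == "__main__":
    # With no arguments, time a no-op Python process as a smoke test.
    command = sys.argv[1:] or [sys.executable, "-c", "pass"]
    code, seconds = time_command(command)
    print(f"exit={code} elapsed={seconds:.1f}s command={' '.join(command)}")
```

Saved as, say, timeit_run.py, it could be invoked as python timeit_run.py meltano run tap-mssql-sampledata target-postgres-perftest.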

BuzzCutNorman Aug 26, 2022

Honestly, I have never documented them. The point-you-in-the-right-direction version, which only works for MS SQL, is this: I manually download a zipped copy of the database that has been prepared for MS SQL, extract it, move the files to the correct directories, attach the database, and then bring the database online. Here is a link to the zip file: How to Download the Stack Overflow Database.

I will see if I can recreate what I download using Meltano. They use an application called soddi to generate the database from XML files that are downloaded from this site: https://archive.org/details/stackexchange. Do you know if tap-stackexchange would be able to grab this same data?

aaronsteers Aug 27, 2022
Author

I did some digging. In this particular network, it looks like those samples are more on the Windows OS side. From a fresh Google search though, I stumbled on the Kaggle dataset and this Google BigQuery dataset.

I wonder if tap-bigquery could be a generic onboarding point for loading the dataset directly into a target of the users' choice. If memory serves, the creation of a creds file is a tiny bit tricky, but otherwise BigQuery is fairly accessible to a broad audience. I'm also open to other methods - or even alternative data sources of a similar volume.

@cjohnhanson - I remember you had a hackday project with tap-socrata - do you know of any large datasets that would offer a similar value for benchmarking across tap/target combos?

BuzzCutNorman Aug 31, 2022

@aaronsteers the use of the Kaggle dataset would be nice, since others could use tap-bigquery to direct the data to any Meltano target. My only thought so far was that the original files are in XML format, so maybe create a tap for them using tap-csv as a guide.

BuzzCutNorman Sep 21, 2022

@aaronsteers after thinking some more, I decided to work on a tap to extract the StackOverflow data from the original XML files. I thought this was the best path for two reasons:

- If you have a tap that deals with the original XML files, you can use Meltano to start and maintain a generic onboarding point like you mentioned earlier.
- I am guessing a tap accessing local files has the potential to generate more throughput to a target. In my mind, the greater the tap throughput, the better the test for your target. Plus it is a quick process: install the tap, download a set of files to a directory, give the tap the directory path, meltano run, and you are off and testing.

I am doing some full load tests to PostgreSQL using my first working version.
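To give a feel for what such a tap has to do, here is a rough sketch of streaming records out of a dump file with the standard library's iterparse. It assumes the dump files hold one row element per record with the column values stored as XML attributes, which is how I understand the Stack Exchange dumps to be laid out; the sample data, function names, and stream name are invented for illustration and this is not the tap's actual code.

```python
import io
import json
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterator

# Tiny made-up sample in the assumed layout: <table><row attr="..." /></table>
SAMPLE = (
    b'<users>'
    b'<row Id="1" DisplayName="Ada" />'
    b'<row Id="2" DisplayName="Lin" />'
    b'</users>'
)


def iter_records(source: BinaryIO, row_tag: str = "row") -> Iterator[dict]:
    """Stream one dict per row element without loading the whole file."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == row_tag:
            yield dict(elem.attrib)  # copy the attributes before clearing
            elem.clear()  # drop the element's contents to limit memory use


def to_singer_record(stream: str, record: dict) -> str:
    """Serialize a record as a Singer RECORD message (one JSON line)."""
    return json.dumps({"type": "RECORD", "stream": stream, "record": record})


if __name__ == "__main__":
    for rec in iter_records(io.BytesIO(SAMPLE)):
        print(to_singer_record("users", rec))
```

Every attribute value arrives as a string, so a real tap would still need to cast values to the types declared in each stream's schema.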

edgarrmondragon · 2022-09-15T21:02:47Z

edgarrmondragon
Sep 15, 2022
Maintainer

Would getting good metrics logging in the SDK help with this?

I imagine it would at least help to compare record-by-record vs batch by looking at a record count timeseries in e.g. Prometheus. For example, backpressure would become apparent.

1 reply

kgpayne Sep 21, 2022

This would be so cool in graphic form in the CLI UI 😍

Run Time: 120s
Records Extracted by Tap
###########################  1012345
Records Persisted by Target
###############  812345
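A mock-up like this is cheap to prototype. The sketch below renders the two counters as text bars scaled to the larger value and adds a lag line, which is one simple way backpressure could show up. The function names and the lag line are my own additions for illustration, not an existing Meltano or SDK feature.

```python
def render_bar(label: str, count: int, max_count: int, width: int = 30) -> str:
    """Return a label line followed by a bar scaled to max_count."""
    filled = 0 if max_count <= 0 else round(width * count / max_count)
    return f"{label}\n{'#' * filled}  {count}"


def render_progress(run_seconds: int, extracted: int, persisted: int,
                    width: int = 30) -> str:
    """Render run time, both record counters, and the tap-to-target lag."""
    top = max(extracted, persisted)
    lines = [
        f"Run Time: {run_seconds}s",
        render_bar("Records Extracted by Tap", extracted, top, width),
        render_bar("Records Persisted by Target", persisted, top, width),
        # Records the tap has emitted that the target has not yet persisted.
        f"Lag: {extracted - persisted} records",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_progress(120, 1012345, 812345))
```

A lag value that keeps growing over successive refreshes would suggest the target is the bottleneck.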

aaronsteers · 2022-10-24T22:09:33Z

aaronsteers
Oct 24, 2022
Author

Based on the new (still in progress) tap-snowflake implementation:

Batch disabled:
- 1.5M records (tpch_sf1.orders): 9m58s (598 seconds), or approx. 2.5K records per second
Batch enabled:
- 1.5M records (tpch_sf1.orders): 18.5 seconds, or approx. 81K records per second
- 15M records (tpch_sf10.orders): 2.5 min (150 seconds), or approx. 100K records per second
- 150M records (tpch_sf100.orders): 11 min 13s (674 seconds), or approx. 223K records per second

Environment and project details

Fork from the MDS-in-a-box repo, added tap-snowflake from the WIP PR.
GitHub Codespaces using default container size.
Snowflake environment on smallest warehouse size (x-small).
Dataset is orders from the TPC-H (tpch_sf##) databases, provided by Snowflake.
- tpch_sf1.orders: 1.5M rows (exactly), compressed size approx 40 MB
- tpch_sf10.orders: 15.0M rows (exactly), compressed size approx 425 MB
- tpch_sf100.orders: 150.0M rows (exactly), compressed size approx 4.3 GB
- tpch_sf1000.orders: 1.5B rows (exactly), compressed size approx 49 GB
Meltano project config here, including readme: feat: add tap-snowflake batch test process aaronsteers/meltano-demo-in-a-box#1
In all tests, data is finally stored as JSONL locally on disk - although the JSON structure and file compression techniques are different between the two approaches.
For the batch tests, the file creation process was performed natively by Snowflake, and the tap downloaded them as compressed files after Snowflake finished creating them.
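For anyone unfamiliar with the batch approach, the sketch below shows the general shape of it: records are written to a gzip-compressed JSONL file, and a single BATCH message pointing at that file is emitted instead of one RECORD message per row. The message layout follows my understanding of the SDK's BATCH message spec (format jsonl, compression gzip, a manifest of file URIs). The file is written locally in Python here purely for illustration, whereas in the tests above Snowflake produced the files, and all function names and sample records are made up.

```python
import gzip
import json
import tempfile
from pathlib import Path
from typing import Iterable, List, Tuple


def write_batch_file(records: Iterable[dict], directory: Path, name: str) -> Path:
    """Write records as gzip-compressed JSONL, one JSON object per line."""
    path = directory / f"{name}.jsonl.gz"
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")
    return path


def read_batch_file(path: Path) -> List[dict]:
    """Read the records back, as a target would when loading the batch."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle]


def batch_message(stream: str, files: Iterable[Path]) -> dict:
    """Build a BATCH message that references the files instead of inlining rows."""
    return {
        "type": "BATCH",
        "stream": stream,
        "encoding": {"format": "jsonl", "compression": "gzip"},
        "manifest": [path.resolve().as_uri() for path in files],
    }


def demo() -> Tuple[dict, List[dict]]:
    """Round-trip two made-up records through a temporary batch file."""
    records = [{"o_orderkey": 1}, {"o_orderkey": 2}]
    with tempfile.TemporaryDirectory() as tmp:
        path = write_batch_file(records, Path(tmp), "orders-0001")
        return batch_message("orders", [path]), read_batch_file(path)


if __name__ == "__main__":
    message, rows = demo()
    print(json.dumps(message))
    print(rows)
```

Because the tap only passes file references along, the per-record cost in the tap process largely disappears, which is consistent with the throughput difference reported above.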

Findings and Caveats

Findings

For a moderately sized dataset of 1.5M records, BATCH data extraction was 32x faster than native record-by-record processing.
The performance benefit is more significant for larger datasets than for small or medium-sized ones. For a similar execution duration (10 min vs 11 min), the batch method processed almost 90x more records per minute than the record-by-record process.
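The ratios in these findings follow directly from the record counts and durations listed earlier in this post. As a quick check of the arithmetic, using the seconds figures exactly as quoted (the dictionary keys are just labels I made up):

```python
# (records, seconds) as quoted in the benchmark results above.
RUNS = {
    "record_by_record_1.5M": (1_500_000, 598),
    "batch_1.5M": (1_500_000, 18.5),
    "batch_15M": (15_000_000, 150),
    "batch_150M": (150_000_000, 674),
}


def records_per_second(records: float, seconds: float) -> float:
    """Average extraction throughput for one run."""
    return records / seconds


def speedup(baseline: str, candidate: str) -> float:
    """How many times higher the candidate's throughput is than the baseline's."""
    return records_per_second(*RUNS[candidate]) / records_per_second(*RUNS[baseline])


if __name__ == "__main__":
    for name, (records, seconds) in RUNS.items():
        print(f"{name:<24} {records_per_second(records, seconds):>10,.0f} records/s")
    print(f"1.5M batch vs record-by-record: {speedup('record_by_record_1.5M', 'batch_1.5M'):.1f}x")
    print(f"150M batch vs record-by-record: {speedup('record_by_record_1.5M', 'batch_150M'):.1f}x")
```

The first ratio compares the same 1.5M-record dataset under both methods; the second compares throughput across two runs of similar duration but very different sizes.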

Caveats

When using the BATCH message type, records are not touched at all by the tap. No data validation, cleansing, or transformation of any type is performed.
As noted below, this does not include the process of loading data to the target system.
The dataset was chosen as a prime candidate precisely because it was high volume and did not have any parent-child relationship with other streams. Processing of parent-child streams is significantly slower, due to a lower ratio of records per batch.
When the number of records per stream is very small (in the hundreds or thousands), batch performance can be much worse than record-by-record processing, due to file processing overhead.

Next steps

These performance metrics are only for extraction to local disk. As a next step, we would perform a similar test to analyze load performance back into Snowflake, with and without batching enabled.

0 replies
